11.2 — Interpolation, Curve fitting, Numerical Integration

Motivation

  • We often want to analyze data points in the form shown below, where we have some dependent variables y_i corresponding to some independent variables x_i.
    x_0    x_1    x_2    ...    x_n
    y_0    y_1    y_2    ...    y_n
  • The goal is to find a function f(x) that either
    • passes through all the points exactly (interpolation), or
    • approximates the data in the “best” possible way without matching it exactly (curve fitting).

Image from Kiusalaas, chap. 3.

How are interpolation and curve fitting useful?

  • If we can find a function f(x) that interpolates or approximates the data well,
    • We can try to predict new data points (x_k, y_k).
    • We can better understand the underlying process that might have generated the data.

Polynomial interpolation

  • The simplest form of interpolation function is a polynomial function.
  • It is always possible to construct a polynomial P(x) of degree (at most) n that passes through n+1 data points with distinct x-values.

Lagrange’s method

  • Suppose we have a dataset of n+1 points:

    (x_0, y_0), (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)
  • We construct an interpolation polynomial of degree n, called the Lagrange polynomial, using the formula:

    P(x) = c_0(x)y_0 + c_1(x)y_1 + \cdots + c_n(x)y_n

    where c_i(x) are called the cardinal functions.

The cardinal functions are given by:

\begin{aligned} c_i\left( x \right) &= \frac{(x-x_0)}{(x_i-x_0)} \cdots \frac{(x-x_{i-1})}{(x_i-x_{i-1})} \cdot \frac{(x-x_{i+1})}{(x_i-x_{i+1})} \cdots \frac{(x-x_{n})}{(x_i-x_{n})}\\[1em] &= \prod_{\substack{j = 0 \\[.1em] j \neq i}}^{n} \frac{(x-x_j)}{(x_i-x_j)} \end{aligned}
  • The numerator contains all terms except (x - x_i)
  • The denominator contains all terms except (x_i - x_i)

Claim: P(x) passes through each data point

If P(x) passes through all points (x_i, y_i), then we must have:

P(x_i) = y_i

Let’s see if this is true for the Lagrange polynomial we just found.

In the above equation for c_i(x), setting x = x_i and x = x_j, i \neq j, we have

c_i\left( x_i \right) = \prod_{\substack{j = 0 \\[.1em] j \neq i}}^{n} \frac{(x_i-x_j)}{(x_i-x_j)} = 1

and

c_i\left( x_j \right) = \prod_{\substack{k = 0 \\[.1em] k \neq i}}^{n} \frac{(x_j-x_k)}{(x_i-x_k)} = 0\ ,\ j \neq i

since the factor with k = j is (x_j - x_j) = 0.

Now, using the equation of P(x) for x = x_i:

\begin{aligned} P(x_i) &= c_0(x_i)y_0 + \cdots + c_i(x_i)y_i + \cdots + c_n(x_i)y_n \\ &= 0 + \cdots + 1 \cdot y_i + \cdots + 0 \\ &= y_i \end{aligned}

Lagrange polynomial — Example

Consider the following dataset. We have 4 data points, so the degree of the Lagrange polynomial P(x) will be n = 3.

x_i      0     10     20     30
y_i   -250      0     50   -100

We first write 4 cardinal functions corresponding to these 4 data points using the x-values.

\begin{aligned} c_0(x) \ &= \ \frac{(x-10)}{(0-10)} \cdot \frac{(x-20)}{(0-20)} \cdot \frac{(x-30)}{(0-30)} \ = \ \frac{-1}{6000}x^3 + \frac{1}{100}x^2 - \frac{11}{60} x + 1\\ c_1(x) \ &= \ \frac{(x-0)}{(10-0)} \cdot \frac{(x-20)}{(10-20)} \cdot \frac{(x-30)}{(10-30)} \ = \ \frac{1}{2000}x^3 - \frac{1}{40}x^2 + \frac{3}{10} x\\ c_2(x) \ &= \ \frac{(x-0)}{(20-0)} \cdot \frac{(x-10)}{(20-10)} \cdot \frac{(x-30)}{(20-30)} \ = \ \frac{-1}{2000}x^3 + \frac{1}{50}x^2 - \frac{3}{20} x\\ c_3(x) \ &= \ \frac{(x-0)}{(30-0)} \cdot \frac{(x-10)}{(30-10)} \cdot \frac{(x-20)}{(30-20)} \ = \ \frac{1}{6000}x^3 - \frac{1}{200}x^2 + \frac{1}{30} x \end{aligned}

We then use these four cardinal functions along with the y-values y_i in the equation of P(x):

\begin{aligned} P(x) &= c_0(x)y_0 + c_1(x)y_1 + c_2(x)y_2 + c_3(x)y_3 \\ &= -x^2 + 35x - 250 \end{aligned}
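As a quick check, here is a minimal Python sketch (separate from the lagrange_polynomial.py file on Ed) that evaluates P(x) directly from the formula above:

def lagrange(x, xs, ys):
    """Evaluate the Lagrange polynomial through the points (xs[i], ys[i]) at x."""
    total = 0.0
    for i in range(len(xs)):
        # Cardinal function c_i(x): product of (x - x_j) / (x_i - x_j) over j != i.
        c = 1.0
        for j in range(len(xs)):
            if j != i:
                c *= (x - xs[j]) / (xs[i] - xs[j])
        total += c * ys[i]
    return total

xs = [0, 10, 20, 30]
ys = [-250, 0, 50, -100]
print(lagrange(15, xs, ys))  # 50.0, matching -x^2 + 35x - 250 at x = 15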

Download file lagrange_polynomial.py from Ed to plot these points and the interpolation function to check if it works.

Curve fitting

  • If data is obtained experimentally, there could be random noise present due to measurement errors.
    • Interpolation would not be a good choice, as it would fit all data points exactly, noise included.
  • Instead, we use curve fitting to find a smooth curve that follows the “average” of the data points and is less affected by noise or “outlier” points.

  • Linear regression is a simpler case of curve fitting where a straight line is found that fits a set of data points.
  • Unlike interpolation, the line need not pass through the points.
  • The linear regression line is given by f(x) = ax + b, where a is the slope of the line and b is the y-intercept.
  • The main task will be to find aa and bb such that the line “fits” the data points in the “best” possible way.

How to determine the “best” fit?

  • To measure how well a line fits the given data points, we compute the error for each data point.

  • The error is defined as the squared difference between the actual y-value y_i and the y-value given by the regression line, f(x_i).

The total error E for all data points is given by

\begin{aligned} E(a, b) &= \sum_{i=0}^{n} \left ( y_i - f\left ( x_i \right ) \right ) ^ 2 \\ &= \sum_{i=0}^{n} \left ( y_i - \left ( ax_i + b \right ) \right ) ^ 2 \end{aligned}

One way to define “best” fit:

  • Given all the data points (x_i, y_i), we want to find a line (i.e. find a and b) such that the error E(a, b) is minimized.

To find a and b that minimize E(a, b), we calculate the partial derivatives of E with respect to a and b and solve for the values of a and b at which both derivatives are zero:

\frac{\partial E}{\partial a} = 0, \qquad \frac{\partial E}{\partial b} = 0

Calculating the partial derivatives

\frac{\partial E}{\partial a} = \sum_{i=0}^{n} 2 (y_i - (ax_i + b)) (-x_i)
\frac{\partial E}{\partial b} = \sum_{i=0}^{n} 2 (y_i - (ax_i + b)) (-1)

Setting the above derivatives to 0 and solving for a and b,

a = \frac{\sum (x_i - \bar{x}) (y_i - \bar{y}) } {\sum (x_i - \bar{x})^2 }
b = \bar{y} - a\bar{x}

where \bar{x} and \bar{y} are the averages of the x-values and y-values of the data points.

Finally, using these a and b, we get the desired best-fit line f(x) = ax + b, where a is the slope of the line and b is the y-intercept.
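As a minimal sketch (separate from the linear_regression.py file on Ed), the two formulas translate directly into Python:

def linear_fit(xs, ys):
    """Least-squares fit of f(x) = a*x + b to the points (xs[i], ys[i])."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # a = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
    a = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    b = y_bar - a * x_bar
    return a, b

# Hypothetical noisy points near y = 2x + 1:
a, b = linear_fit([0, 1, 2, 3, 4], [1.1, 2.9, 5.2, 6.8, 9.1])
print(a, b)  # approximately 1.99 and 1.04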

Download linear_regression.py from Ed, which plots the data and the regression line.

Interpolation examples using Numpy/SciPy

  • numpy_linear_interpolation.py: Piecewise linear interpolation.
  • cubic_spline_interpolation.py: Cubic spline interpolation — piecewise polynomial interpolation where each piece is a cubic polynomial (a + bx + cx^2 + dx^3, d \neq 0)
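The Ed files aren't reproduced here, but a minimal sketch of the two library calls, reusing the Lagrange example data, might look like this:

import numpy as np
from scipy.interpolate import CubicSpline

xs = np.array([0.0, 10.0, 20.0, 30.0])
ys = np.array([-250.0, 0.0, 50.0, -100.0])

# Piecewise linear interpolation: np.interp evaluates the broken-line
# interpolant through (xs, ys) at the requested point.
print(np.interp(15.0, xs, ys))  # 25.0, halfway between (10, 0) and (20, 50)

# Cubic spline: one cubic polynomial per segment, joined smoothly at the knots.
spline = CubicSpline(xs, ys)
print(spline(15.0))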

Definite Integral

  • The definite integral of a function of a single variable, f(x), between two limits a and b can be viewed as the area S under the curve.

  • Numerical integration algorithms try to estimate this area.

Numerical Integration

Our approach will be as follows:

  • Divide the region between a and b into N segments
  • We then estimate the area under the curve in each segment
  • Finally, we sum these areas

We will consider three algorithms for estimating this area in each segment:

  • The Midpoint method
  • The Trapezoidal method
  • Simpson’s method

Example

We will use the following function in the interval [0, 5] to compare the three algorithms:

f(x) = \frac{1}{1+x^2}

The Midpoint Method

  • We first divide the region from a to b into N equal rectangular segments

  • The width of each segment is \Delta x = \frac{b-a}{N}

  • The endpoints of the segments are

    \small x_0 = a < x_1 < \cdots < x_{N-1} < x_{N} = b

    where

    \small x_i = a + i \cdot \Delta x, \quad i \in [0, N]

  • We estimate the area under the curve in each rectangular segment using the value of f(x) at the midpoint of each segment.

    \small area_k = \Delta x \cdot f\left(x_k + \frac{\Delta x}{2}\right), \quad k \in [0, N)
  • To compute an approximation to the integral, we just have to sum these areas.

The approximation improves as the number of segments is increased.

Implementation and Error analysis of Midpoint method

def midpoint_integrate(f, a, b, N):
    """Approximate the integral of f over [a, b] using the midpoint rule."""
    area = 0.0
    dx = (b - a) / N  # width of each segment
    x = a             # left endpoint of the current segment

    for k in range(N):
        # Rectangle area: segment width times f at the segment midpoint.
        area += dx * f(x + dx / 2)
        x += dx

    return area

We know the integral of the example function is given by:

\small \int_{0}^{5} \frac{1}{1+x^2}\, dx = \arctan(5) = 1.37340076695

So we can compute the error in the result produced by our implementation, midpoint_integrate.

Error of the Midpoint method for different values of N:

Output
    N       estimated_integral                error
-----     --------------------     ----------------
   10         1.3735434283e+00     1.4266136666e-04
   20         1.3734392602e+00     3.8493263529e-05
   40         1.3734103959e+00     9.6289197895e-06
   80         1.3734031745e+00     2.4075766312e-06
  160         1.3734013689e+00     6.0191232953e-07
  320         1.3734009174e+00     1.5047571300e-07
  640         1.3734008046e+00     3.7615273785e-08
 1280         1.3734007764e+00     9.4000809359e-09
 2560         1.3734007693e+00     2.3462840559e-09
 5120         1.3734007675e+00     5.8283222693e-10
10240         1.3734007671e+00     1.4196355203e-10

Trapezoidal Method

To approximate the area of each segment, we use the area of a trapezoid rather than a rectangle.

The area of each trapezoidal segment is given by

\small area_k = \frac{\Delta x}{2} \left( f(x_k) + f(x_k + \Delta x) \right), \quad k \in [0, N)

Implementation and error analysis of the Trapezoidal method

Relative to midpoint_integrate, only the per-segment area update changes:

area += dx * (f(x) + f(x + dx)) / 2
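A complete runnable version, assuming the same loop structure as midpoint_integrate above (the Ed implementation may differ in details):

def trapezoid_integrate(f, a, b, N):
    """Approximate the integral of f over [a, b] using the trapezoidal rule."""
    area = 0.0
    dx = (b - a) / N
    x = a
    for k in range(N):
        # Trapezoid area: segment width times the average of f at the two endpoints.
        area += dx * (f(x) + f(x + dx)) / 2
        x += dx
    return area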
Output
    N       estimated_integral                error
-----     --------------------     ----------------
   10         1.3731040812e+00     2.9668571989e-04
   20         1.3733237548e+00     7.7012176612e-05
   40         1.3733815075e+00     1.9259456542e-05
   80         1.3733959517e+00     4.8152683754e-06
  160         1.3733995631e+00     1.2038458712e-06
  320         1.3734004660e+00     3.0096677106e-07
  640         1.3734006917e+00     7.5245528253e-08
 1280         1.3734007481e+00     1.8815126790e-08
 2560         1.3734007622e+00     4.7075219278e-09
 5120         1.3734007658e+00     1.1806169375e-09
10240         1.3734007667e+00     2.9889868358e-10

Simpson’s method

Given any three points with distinct x-values, there is a unique polynomial of degree at most 2 (a parabola), called the interpolating polynomial, that passes through them.

  • Simpson’s method fits a parabola P(x) through three points on the curve: the values of the function at the two endpoints of the segment and at its midpoint:

    \small f(x_k),\ f\left(x_k + \frac{\Delta x}{2}\right),\ f(x_k + \Delta x)
  • The area under the parabola in the interval [x_k, x_k + \Delta x] is given by:

    \small \frac{\Delta x}{6} \left[f(x_k) + 4f\left(x_k + \frac{\Delta x}{2}\right) + f(x_k + \Delta x)\right]
  • Adding the areas of all N segments gives an approximation to the integral.

  • Simpson’s method generally finds a better approximation to the area under the curve in each segment than the trapezoidal method, which uses a line instead of a parabola.

  • For more details on where the formula above comes from, see: https://en.wikipedia.org/wiki/Simpson%27s_rule
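As a quick sanity check on the area formula above: Simpson's rule is exact for polynomials up to degree 3, which we can verify on a single hypothetical segment, f(x) = x^3 on [0, 2]:

# Simpson's area formula on one segment [x_k, x_k + dx] = [0, 2] of f(x) = x**3.
f = lambda x: x ** 3
x_k, dx = 0.0, 2.0
area = dx / 6 * (f(x_k) + 4 * f(x_k + dx / 2) + f(x_k + dx))
print(area)  # 4.0, equal to the exact integral 2**4 / 4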

Implementation and error analysis of Simpson’s method

Again, relative to midpoint_integrate, only the per-segment area update changes:

area += dx * (f(x) + 4 * f(x + dx / 2) + f(x + dx)) / 6
Output
    N       estimated_integral                error
-----     --------------------     ----------------
   10         1.3733969793e+00     3.7876621872e-06
   20         1.3734007584e+00     8.5498519375e-09
   40         1.3734007664e+00     5.3898707719e-10
   80         1.3734007669e+00     3.8371528177e-11
  160         1.3734007669e+00     7.0714545330e-12
  320         1.3734007669e+00     5.1121329392e-12
  640         1.3734007669e+00     4.9893422727e-12
 1280         1.3734007669e+00     4.9857895590e-12
 2560         1.3734007669e+00     4.9855675144e-12
 5120         1.3734007669e+00     4.9795723100e-12
10240         1.3734007669e+00     4.9829029791e-12

Check the implementations and examples in numerical_integration.py and integration_examples.py.

Accuracy of the three methods

  • For all three methods we saw, as the number of subintervals N approaches infinity, the approximation improves and approaches the actual integral.

  • However, the speed of improvement differs:

    • Midpoint method — error \propto \frac{1}{N^2}
    • Trapezoidal method — error \propto \frac{1}{N^2}
    • Simpson’s method — error \propto \frac{1}{N^4}
  • Simpson’s method approaches the integral faster than the other methods; the sketch below checks these rates empirically.
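Since error \propto \frac{1}{N^p} implies that doubling N divides the error by 2^p, the quantity \log_2(e_N / e_{2N}) estimates the order p. A minimal sketch using the per-segment formulas above:

import math

def integrate(f, a, b, N, segment_area):
    """Sum the per-segment areas over N equal segments of [a, b]."""
    dx = (b - a) / N
    return sum(segment_area(f, a + k * dx, dx) for k in range(N))

midpoint = lambda f, x, dx: dx * f(x + dx / 2)
trapezoid = lambda f, x, dx: dx * (f(x) + f(x + dx)) / 2
simpson = lambda f, x, dx: dx * (f(x) + 4 * f(x + dx / 2) + f(x + dx)) / 6

f = lambda x: 1 / (1 + x * x)
exact = math.atan(5)

for name, rule in [("midpoint", midpoint), ("trapezoid", trapezoid), ("simpson", simpson)]:
    e1 = abs(integrate(f, 0, 5, 40, rule) - exact)
    e2 = abs(integrate(f, 0, 5, 80, rule) - exact)
    # Observed order: about 2, 2, and 4 (N kept small so Simpson's error
    # stays above the floating-point roundoff floor seen in the table above).
    print(name, math.log2(e1 / e2))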

Integration using scipy.integrate

Check the scipy_integrate.py file on Ed.
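The file isn't reproduced here, but a minimal example using scipy.integrate.quad (adaptive quadrature) on the same integrand might look like:

from scipy.integrate import quad

f = lambda x: 1 / (1 + x * x)
value, abs_err = quad(f, 0, 5)  # returns the estimate and an absolute-error bound
print(value)  # ~1.37340076695, i.e. arctan(5)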