9. Understanding Experimental Data (cont.) by MIT OpenCourseWare


Summary by www.lecturesummary.com: 9. Understanding Experimental Data (cont.) by MIT OpenCourseWare


    • 0:00 - 0:45 - Course Introduction and Overview

      • Reading assignment (Chapter 18) for this and subsequent lectures
      • No lecture on Wednesday (Thanksgiving break)

      Living in a Data-Intensive World

      • Increasing amounts of time are spent dealing with data, often by writing code (or having code written) to handle it
      • Focus on understanding software for data manipulation, writing such code, and interpreting software output regarding data
      • Beginning with experimental data – "statistics meets experimental science"

      0:45 - 2:28 - Collecting and Processing Experimental Data

      • Process is to perform an experiment (physics, biology, chemistry, sociology, anthropology) to collect data
      • Data types are measurements (lab) or answers (questionnaire)
      • Post-data collection: apply a model or theory to pose questions regarding the data
      • Goal: apply data and model to forecast future expectations or outcomes
      • Construct a computation to provide answers, executing a computational experiment to supplement the physical/social one
      • Example: spring modeling

      2:28 - 4:28 - Linear Springs and Hooke's Law

      • Attention given to linear springs (such as those in laboratory settings)
      • Defining property: the force needed to compress or stretch the spring varies linearly with the distance compressed or stretched
      • Characterized by a spring constant (K) that specifies how much force is required
      • Examples of K values: slinky (low K, 1 N/m), motorcycle suspension (high K, 35,000 N/m)
      • A newton is defined as the force needed to accelerate a 1 kg mass at 1 m/s²
      • Hooke's Law of Elasticity (Robert Hooke, 1676): force is linearly related to distance (F = -K*d)
      • Negative sign shows that force is opposite direction of displacement (restoring force)
      • Hooke's Law applies to a wide range of springs but is not without limits
      • Breaks down beyond the elastic limit (when the spring is stretched or compressed too far)
      • Does not work with all springs (e.g., rubber bands, recurve bows)

      4:28 - 5:48 - Using Hooke's Law (Sample Calculation)

      • Sample: determining rider mass to compress a 35,000 N/m spring by 1 cm
      • Convert distance to meters (1 cm = 0.01 m)
      • Force = K * distance = 35,000 N/m * 0.01 m = 350 Newtons
      • Applying F = ma, where the acceleration is that of gravity (about 9.81 m/s²)
      • Mass = Force / gravity = 350 N / 9.81 m/s² ≈ 35.68 kg
      • Equivalent to about 79 lbs
      • Shows how Hooke's Law can be used once K has been determined (a short code sketch follows below)
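
      A minimal sketch of this calculation in Python, using the numbers quoted above (g taken as 9.81 m/s², consistent with the ≈ 35.68 kg result):

      ```python
      # Rider-mass calculation from the lecture example.
      K = 35000.0        # spring constant, N/m
      d = 0.01           # compression, meters (1 cm)
      g = 9.81           # gravitational acceleration, m/s^2

      force = K * d      # Hooke's law magnitude: F = K * d  ->  350 N
      mass = force / g   # F = m * g  ->  m = F / g
      print(round(mass, 2), 'kg')   # ~35.68 kg, roughly 79 lbs
      ```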

      5:48 - 7:00 - Experimental Determination of Spring Constant

      • Importance of knowing the spring constant (e.g., atomic force microscopes, deformation of DNA)
      • Routine physics lab experiment: hang spring, attach mass, take displacement measurement
      • Solve K using F = K*d, which is rearranged as K = Force / distance
      • Force = mass * gravity (mass * 9.8 m/s²)
      • Ideally, a single measurement would suffice
      • In the real world, materials are not perfect and measurements are noisy, so multiple trials with varying masses are required (see the sketch below)
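
      A sketch of the idealized single-trial estimate; the mass and displacement values here are made-up examples, and real data would use many noisy trials:

      ```python
      # Hypothetical single measurement: hang a known mass, read off the displacement.
      mass = 0.5                 # kg (made-up example value)
      displacement = 0.23        # meters (made-up example value)
      g = 9.8                    # m/s^2

      force = mass * g           # force exerted by the hanging mass
      K = force / displacement   # rearranged Hooke's law: K = F / d
      print(round(K, 2), 'N/m')  # ~21.3 N/m for these example numbers
      ```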

      7:00 - 8:48 - Dealing with Experimental Data (Plotting)

      • Experimental data come from more than one trial (mass vs. displacement)
      • Ideally, the data will be linear
      • Plotting the data: independent variable (masses) on the x-axis, dependent variable (displacement) on the y-axis
      • Code walkthrough: reading the data from a file and converting it to PyLab arrays with the `array` function
      • Benefit of arrays: allows direct mathematical operations on array elements (such as scaling) without explicit loops
      • Plotting the gathered data (a sketch of these steps follows below)
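
      A hedged sketch of the reading and plotting steps; the file format (one whitespace-separated distance/mass pair per line after a header), the scaling of masses to forces, and the function names are assumptions, not the lecture's exact code:

      ```python
      import pylab

      def get_data(file_name):
          """Read whitespace-separated (distance, mass) pairs, one per line, after a header.
          The file layout here is an assumption about the lecture's data file."""
          distances, masses = [], []
          with open(file_name) as f:
              f.readline()                      # skip the header line
              for line in f:
                  d, m = line.split()
                  distances.append(float(d))
                  masses.append(float(m))
          return masses, distances

      def plot_data(file_name):
          masses, distances = get_data(file_name)
          forces = pylab.array(masses) * 9.8    # arrays allow element-wise math without loops
          distances = pylab.array(distances)
          pylab.plot(forces, distances, 'bo', label='Measured displacements')
          pylab.xlabel('|Force| (Newtons)')
          pylab.ylabel('Distance (meters)')
          pylab.legend(loc='best')
          pylab.show()
      ```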

        Fitting a Curve to Data and Measuring Goodness of Fit

        • The data do not lie exactly on a straight line, which suggests measurement noise
        • Objective: fit a line (or curve) to the data that captures the underlying relationship in spite of the noise
        • Must relate the independent (x) variable to the dependent (y) variable
        • Requires an objective function that quantifies how far a candidate line is from the data points
        • Goal: identify the line/curve with the smallest objective function value (the best fit)
        • Quantifying the distance from points to the line: the vertical displacement (difference in y-values) is used
        • Reason for using vertical displacement: we are estimating the dependent (y) value from the independent (x) value, so the uncertainty is in the y-direction

        The Objective Function: Least Squares

        • Objective function as the sum of squared differences between the observed (measured) and the predicted (from the fitted curve) y-values
        • Difference = observed_y - predicted_y
        • Squaring the difference:
          • Eliminates the sign (direction of displacement is unimportant)
          • Gives a property helpful for obtaining the best fit (explained later - results in a surface with only one minimum)
        • This is referred to as least squares
        • This sum has the form of a variance multiplied by the number of observations (equivalently, the number of observations times the average squared error)
        • Minimizing this expression therefore minimizes the variance between the estimated and measured values (a small sketch of the objective follows below)
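
        A small sketch of this objective function; the function name here is illustrative rather than from the lecture:

        ```python
        def sum_of_squared_errors(observed, predicted):
            """Least-squares objective: sum of squared vertical differences between
            observed y-values and the y-values predicted by a candidate curve."""
            total = 0.0
            for obs, pred in zip(observed, predicted):
                total += (obs - pred) ** 2   # squaring removes the sign of the difference
            return total
        ```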

        Finding the Best Fit: Polynomials and Linear Regression

        • To minimize the objective function, we must determine the parameters of the curve (e.g., the intercept and slope of a line)
        • Assume the predicted curve is modeled as a polynomial in the independent variable (x)
        • A line is a degree 1 polynomial (y = ax + b)
        • A parabola is a 2nd-degree polynomial (y = ax² + bx + c)
        • Linear regression is the method for determining the polynomial coefficients that minimize the sum of squared differences
        • Visualization: parameters (A, B for a line) create a multi-dimensional space, the objective function creates a surface over this space, the best fit is the lowest point on this surface
        • Using sum of squares ensures the surface has only one minimum
        • Linear regression finds this minimum by essentially walking downhill along the gradient of the surface ("linearly regressing" to the lowest point); a toy sketch of this downhill walk follows below
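
        A toy illustration of the "walking downhill" picture, fitting a line by gradient descent on the sum-of-squares surface; this is only a sketch of the idea (the step size and iteration count are arbitrary), and `polyfit` solves the same problem directly:

        ```python
        def fit_line_by_descent(x_vals, y_vals, steps=100000, rate=0.0001):
            """Minimize sum((y - (a*x + b))**2) over (a, b) by walking down the gradient."""
            a, b = 0.0, 0.0
            n = len(x_vals)
            for _ in range(steps):
                # Partial derivatives of the sum of squared errors with respect to a and b.
                grad_a = sum(-2 * x * (y - (a * x + b)) for x, y in zip(x_vals, y_vals))
                grad_b = sum(-2 * (y - (a * x + b)) for x, y in zip(x_vals, y_vals))
                a -= rate * grad_a / n   # step downhill along the gradient
                b -= rate * grad_b / n
            return a, b

        # Example: points near y = 2x + 1 recover a ≈ 2, b ≈ 1.
        print(fit_line_by_descent([0, 1, 2, 3], [1.1, 2.9, 5.2, 6.9]))
        ```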

        Applying `polyfit` in PyLab

        • The PyLab library offers the `polyfit` function for linear regression
        • `polyfit(x_values, y_values, degree_n)` computes the coefficients of the best least-squares polynomial fit of degree n
        • Returns the coefficients (e.g., (a, b) for degree 1, (a, b, c) for degree 2); usage is sketched below
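
        A minimal example of calling `polyfit` (the data values here are made up):

        ```python
        import pylab

        x_vals = pylab.array([1.0, 2.0, 3.0, 4.0])
        y_vals = pylab.array([2.1, 3.9, 6.2, 7.8])   # made-up data, roughly y = 2x

        a, b = pylab.polyfit(x_vals, y_vals, 1)      # degree-1 least-squares fit
        print('slope =', a, 'intercept =', b)        # slope close to 2, intercept close to 0
        ```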

        Fitting a Line (Degree 1) to Spring Data

        • The `fit_data` code illustrates how `polyfit(x_vals, y_vals, 1)` is used (a sketch follows this list)
        • Retrieves the coefficients (a, b) of the best-fit line
        • Applies the coefficients to compute the predicted (estimated) y-values for the given x-values
        • Plots the raw data and the best-fit line
        • Spring constant calculation: with displacement on the y-axis and force (mass × g) on the x-axis, the slope a of the fitted line is 1/K, so K is recovered from the reciprocal of the slope (K ≈ 1/a)
        • Running the code shows an example fit to the spring data
        • Calculated K ≈ 21.5 (a slope of roughly 0.046, since K ≈ 1/a)
        • The visual result shows a pretty good fit to the majority of the data, though some deviation ("funky") at higher values
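
        A self-contained sketch in the spirit of `fit_data` (not the lecture's exact code); the conversion of masses to forces with g = 9.8 and the plot labels are assumptions:

        ```python
        import pylab

        def fit_spring_data(masses, distances):
            """Fit a line to (mass, displacement) measurements and estimate the
            spring constant from the slope of the fitted line."""
            forces = pylab.array(masses) * 9.8            # convert masses (kg) to forces (N)
            distances = pylab.array(distances)
            a, b = pylab.polyfit(forces, distances, 1)    # best-fit line: distance = a*force + b
            predicted = a * forces + b
            pylab.plot(forces, distances, 'bo', label='Measured displacements')
            pylab.plot(forces, predicted, 'r',
                       label='Linear fit, k = ' + str(round(1 / a, 1)))
            pylab.xlabel('|Force| (Newtons)')
            pylab.ylabel('Distance (meters)')
            pylab.legend(loc='best')
            pylab.show()
        ```
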
    • Using `polyval` and Fitting Higher Order Polynomials (Example 2)

      ```python
      polyval(model_coefficients_tuple, x_values)
      ```

      The `polyval` function in PyLab:

      • Accepts coefficients (result of `polyfit`) and x-values, giving predicted y-values.
      • Benefit: can be used for any polynomial order, making code reusable for other models (a short sketch follows below).
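
      A short sketch showing how the same `polyval` call serves both a linear and a quadratic model (the data values are made up):

      ```python
      import pylab

      x_vals = pylab.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
      y_vals = pylab.array([0.1, 1.2, 3.9, 9.1, 15.8, 25.2])   # made-up, roughly quadratic

      pylab.plot(x_vals, y_vals, 'bo', label='Data')
      for degree in (1, 2):
          model = pylab.polyfit(x_vals, y_vals, degree)   # coefficients for any degree
          estimates = pylab.polyval(model, x_vals)        # the same call works for both models
          pylab.plot(x_vals, estimates, label='Degree ' + str(degree) + ' fit')
      pylab.legend(loc='best')
      pylab.show()
      ```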

      In this section:

      • Second example data set introduced and plotted.
      • Fitting a line (degree 1) shows a "lousy fit" when visually inspected.
      • Trying a higher order model: fitting a parabola (degree 2).
      • Fitting a quadratic is also a type of linear regression (higher dimensional).
      • Visual result shows the quadratic fit is clearly better than the linear fit for the second data set.

      Evaluating Goodness of Fit (Relative vs. Absolute)

      Question: How can we objectively determine which fit is better (beyond eyeballing)?

      • Comparing fits: relative (which one is better than another?) and absolute (how close to optimal?).
      • A measure for relative fit: Average Mean Squared Error (MSE).
      • MSE = sum of squared differences / number of samples.
      • The function `get_average_error` computes this (a sketch follows this list).
      • Comparing MSE for quadratic and linear fits: MSE quadratic is six times smaller than MSE linear.
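
      A sketch of such a function; the name `get_average_error` follows this summary, and the exact signature is an assumption:

      ```python
      def get_average_error(data, predicted):
          """Average mean squared error: sum of squared differences divided by
          the number of samples."""
          error = 0.0
          for d, p in zip(data, predicted):
              error += (d - p) ** 2
          return error / len(data)
      ```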

      MSE is suitable for relative comparison but has the following limitations:

      • Not absolute: does not indicate whether an MSE value is "good" or "bad".
      • Not scale independent: varies with data values.

      Absolute Goodness of Fit: Coefficient of Determination (R²)

      The coefficient of determination (R²) is a standardized, scale-independent measure.

      • R² = 1 - (Model's sum of squared errors / Total sum of squares of the data).
      • Numerator: calculates error of the fit (sum of (observed - predicted)²).
      • Denominator: measures overall variability of the data (sum of (observed - mean of observed)²).
      • R² measures the proportion of variability in the data explained by the model.
      • For linear regression, R² will always be between 0 and 1.
      • R² = 1: Model accounts for all variability (ideal fit).
      • R² = 0: Model accounts for no variability (fit no better than simply using the mean of the data).
      • R² ≈ 0.5: Model explains about half the variability.
      • Objective: find a fit with an R² value as close to 1 as possible (a sketch of the R² computation follows below).
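
      A sketch of an R² computation matching the formula above (not necessarily the lecture's exact code):

      ```python
      import pylab

      def r_squared(observed, predicted):
          """Coefficient of determination: 1 - (model error / total variability)."""
          observed = pylab.array(observed, dtype=float)
          predicted = pylab.array(predicted, dtype=float)
          model_error = ((observed - predicted) ** 2).sum()              # sum of (observed - predicted)^2
          total_variability = ((observed - observed.mean()) ** 2).sum()  # sum of (observed - mean)^2
          return 1 - model_error / total_variability
      ```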

      Testing Fits with R² and Multiple Degrees

      The code functions `gen_fits` and `test_fits` test several polynomial degrees and report R² (sketched after this list).

      • `gen_fits` uses `polyfit` to generate models for a list of degrees.
      • `test_fits` uses `polyval` for prediction and calculates R² for each model.
      • Running with second data set and degrees 1 and 2:
        • R² for linear fit is horrible (< 0.005).
        • R² for quadratic fit is good (~0.84).
        • R² confirms quadratic is much better.
      • Running with higher degrees (2, 4, 8, 16):
        • R² values rise with degree (degree 2 ≈ 0.84; degrees 4 and 8 slightly higher).
        • Degree 16 gives a very high R² (≈ 0.97), explaining nearly 97% of the variability.
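
      A hedged sketch of what `gen_fits` and `test_fits` might look like, reusing the `r_squared` sketch above; the function bodies are assumptions based on this summary, not the lecture's exact code:

      ```python
      import pylab

      def gen_fits(x_vals, y_vals, degrees):
          """Fit one least-squares polynomial model per requested degree."""
          models = []
          for d in degrees:
              models.append(pylab.polyfit(x_vals, y_vals, d))
          return models

      def test_fits(models, degrees, x_vals, y_vals):
          """Plot each model's predictions and label it with its R^2 value."""
          pylab.plot(x_vals, y_vals, 'o', label='Data')
          for model, d in zip(models, degrees):
              est_y_vals = pylab.polyval(model, x_vals)
              rsq = r_squared(y_vals, est_y_vals)        # r_squared as sketched earlier
              pylab.plot(x_vals, est_y_vals,
                         label='Degree ' + str(d) + ', R^2 = ' + str(round(rsq, 4)))
          pylab.legend(loc='best')
          pylab.show()
      ```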