10. Understanding Experimental Data (cont.) by MIT OpenCourseWare


Summary by www.lecturesummary.com


  • Understanding Experimental Data

    • Recap: The lecture is about how to understand experimental data, which can come from physical, biological, or social experiments.
    • Goal: The aim is to fit models to data.
    • Objectives of a model: A good model should explain underlying phenomena, provide insight into the mechanism, and allow for predictions in new settings. Predicting behavior in new settings is a key objective. If data were perfect, this would be easy, but experimental uncertainty is always present.
  • Measuring and Finding the Best Fit

    • Accounting for uncertainty: Experimental uncertainty needs to be accounted for when fitting a model.
    • Measuring fit: Given a set of observed values and a model that predicts values, goodness of fit can be measured as the sum of the squares of the differences between observed and predicted values. This sum is the quantity the fitting process minimizes. Other measures, such as the sum of absolute differences, could be used, but the square is convenient because it yields a smooth objective with a single minimum.
    • Finding the best fit: The goal is to find the best curve or model for predicting values.
    • Focus on mathematical expressions: Polynomials are used as mathematical expressions for models.
    • Linear Regression: Finding the coefficients of a polynomial that minimize the sum of squared differences is an example of linear regression.
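
The fit measure described above can be sketched as follows (the observations and candidate lines are illustrative values, not the lecture's data):

```python
import numpy as np

def sum_of_squared_errors(observed, predicted):
    """Goodness-of-fit measure that linear regression minimizes."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sum((observed - predicted) ** 2))

# Compare two candidate lines against the same observations.
xs = np.array([1.0, 2.0, 3.0])
obs = np.array([2.1, 3.9, 6.2])
print(sum_of_squared_errors(obs, 2 * xs))      # predictions from y = 2x
print(sum_of_squared_errors(obs, 2 * xs + 1))  # predictions from y = 2x + 1
```

Squaring penalizes large deviations more heavily than an absolute-value measure would, which is part of why it shapes the solution space so conveniently.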
  • Linear Regression Process

    • Example (Spring): For a linear spring, a degree one polynomial (y = ax + b) can be fit, where a and b are free parameters.
    • Visualization: Every possible line (defined by its a and b values) corresponds to a point in a 2D space. An objective-function surface can be imagined over this space, where the height represents the sum-of-squares value. Using the sum of squares yields a bowl-shaped (convex) surface with a single minimum. Linear regression amounts to starting somewhere on this surface and "walking downhill" to the single bottom point, which gives the best a and b values (the best line).
    • Generalization: This concept can be generalized to arbitrary dimensions for higher-order polynomials.
    • Tools: PyLab's polyfit function (a thin wrapper around numpy.polyfit) solves this linear regression problem. It takes x values, y values, and a degree, and returns the coefficients of the best-fitting polynomial of that degree. polyval applies a model (its coefficients) to a set of x values to predict the corresponding y values.
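
A minimal sketch of the polyfit/polyval workflow using numpy directly (the synthetic line, noise level, and seed are arbitrary choices, not the lecture's):

```python
import numpy as np

# Noisy samples from a known line y = 3x + 1 (toy data, not the lecture's).
rng = np.random.default_rng(0)
xs = np.linspace(0, 10, 50)
ys = 3 * xs + 1 + rng.normal(0, 0.5, size=xs.shape)

# polyfit returns coefficients highest degree first: [a, b] for y = ax + b.
a, b = np.polyfit(xs, ys, 1)

# polyval applies the coefficients to x values to get predicted y values.
predictions = np.polyval([a, b], xs)

print(round(a, 2), round(b, 2))  # recovered slope and intercept, near 3 and 1
```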
  • Initial Model Fitting Example

    • Data set: An example data set is used.
    • Fitting a line: Fitting a degree one polynomial (line) to the data results in a "pretty ugly" fit when visually inspected. Although it's the best-fitting line, it doesn't account for the data's shape well.
    • Fitting a quadratic: Fitting a second-order model (quadratic, y = ax^2 + bx + c) looks "a lot better" visually, appearing to follow the data reasonably well.
    • Higher order models: The question arises: how do we know which order model (4th, 8th, 64th) is best?
  • Using Coefficient of Determination (R-squared)

    • Measuring fit when no theory: If there is no guiding theory (like Hooke's Law for linearity), the coefficient of determination (R-squared) is the best way to measure how well a model fits the data.
    • Properties of R-squared: It is scale-independent and ranges from 0 to 1.
    • Calculation: R-squared equals one minus the ratio of the sum of squared differences between observed and predicted values (the residual error, in the numerator) to the sum of squared differences between the observed values and their mean (the total variation, in the denominator).
    • Interpretation: An R-squared value close to one indicates a great fit, meaning the model accounts for most of the variation. A value closer to zero indicates a poor fit.
    • Example results: Fitting models of order 2, 4, 8, and 16 shows R-squared values increasing with model complexity. The order 16 fit has an R-squared of 0.997, indicating it accounts for all but 0.3% of the variation in the data. Visually, it follows most data points very closely.
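
The calculation above can be sketched as follows (the noisy parabola, noise level, and seed here are illustrative, not the lecture's data set):

```python
import numpy as np

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - residual error / total variation."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_err = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - ss_err / ss_tot

# Noisy samples from y = 3x^2; R-squared climbs as the degree rises.
rng = np.random.default_rng(1)
xs = np.linspace(-5, 5, 40)
ys = 3 * xs ** 2 + rng.normal(0, 5, size=xs.shape)

for degree in (1, 2, 4, 8):
    coeffs = np.polyfit(xs, ys, degree)
    print(degree, round(r_squared(ys, np.polyval(coeffs, xs)), 4))
```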
  • The Overfitting Puzzle

      • Question: If the order 16 fit is the best fit based on R-squared, should it be used? Just because you can do something doesn't mean you should.
      • Model objectives revisited: Remember the two reasons for building a model: explain phenomena and make predictions.
      • Explanation issue: A 16th-order model doesn't offer clear insight into the physical process, unlike a linear model for a spring.
      • Prediction issue: The ability to predict future behavior is crucial. A good model both explains and predicts.
    • Data Generation Process

      • Source of data: The example data was generated from a physical phenomenon that follows a parabolic arc (degree 2 polynomial). Examples include comets or the path of a thrown object under uniform gravity.
      • Method: The data was generated using the equation y = ax^2 + bx + c. Specifically, y = 3x^2 (with b and c being zero) was used, and significant Gaussian noise was added to the y values.
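
A sketch of how such a data set might be generated (the x range, noise standard deviation, and seed are illustrative guesses, not the lecture's exact values):

```python
import numpy as np

def gen_noisy_parabola(a=3.0, b=0.0, c=0.0, noise_sd=35.0, seed=0):
    """Sample y = ax^2 + bx + c, then add Gaussian noise to each y.
    The x range, noise_sd, and seed are illustrative, not the lecture's."""
    rng = np.random.default_rng(seed)
    xs = np.arange(-10.0, 11.0, 1.0)
    ys = a * xs ** 2 + b * xs + c + rng.normal(0, noise_sd, size=xs.shape)
    return xs, ys

xs, ys = gen_noisy_parabola()  # y = 3x^2 plus significant noise
print(len(xs), "points")
```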
    • Testing on Training Data

      • Setup: Two different data sets were generated using the same degree 2 process with different random noise. Models of degrees 2, 4, 8, and 16 were fit to the first data set (gen_fits). These models were then tested on the same first data set (test_fits), yielding high R-squared values for higher order models, especially degree 16 (0.997).
      • Puzzle persists: The best-fitting model on the training data is still order 16, even though the data came from an order two polynomial.
    • Validation: Testing on New Data

      • Training error vs. Testing error: What was measured previously is training error (how well the model performs on data from which it was learned).
      • Generalization: To ensure the model captures the underlying process, it needs to perform well on other data generated from the same process.
      • Validation/Cross-validation: A crucial tool for this is validation or cross-validation.
      • Strategy: Generate models from one data set (training set) and test them on a different data set (test set).
      • Example code: Models built from the first data set are applied to the second data set, and vice versa.
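
The train-on-one-set, test-on-the-other strategy can be sketched like this (synthetic parabola data standing in for the lecture's gen_fits/test_fits code, with arbitrary noise level and seeds):

```python
import numpy as np

def gen_parabola(seed):
    """Noisy samples from y = 3x^2 (stand-in for the lecture's generator)."""
    rng = np.random.default_rng(seed)
    xs = np.arange(-10.0, 11.0, 1.0)
    return xs, 3 * xs ** 2 + rng.normal(0, 35, size=xs.shape)

def r_squared(observed, predicted):
    err = np.sum((observed - predicted) ** 2)
    tot = np.sum((observed - np.mean(observed)) ** 2)
    return 1 - err / tot

xs, ys1 = gen_parabola(seed=0)  # training set
_, ys2 = gen_parabola(seed=1)   # test set: same process, fresh noise

results = {}
for degree in (2, 4, 8):
    model = np.polyfit(xs, ys1, degree)               # fit on data set 1
    train_r2 = r_squared(ys1, np.polyval(model, xs))  # test on data set 1
    test_r2 = r_squared(ys2, np.polyval(model, xs))   # test on data set 2
    results[degree] = (train_r2, test_r2)
    print(degree, round(train_r2, 3), round(test_r2, 3))
```

Training R-squared never decreases as the degree rises, but the test R-squared is what reveals which model generalizes.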
    • Validation Results and Overfitting

      • Results (Model 1 on Data Set 2): When models trained on data set 1 are tested on data set 2, the R-squared for degree 16 drops significantly.
      • Results (Model 2 on Data Set 1): Testing models trained on data set 2 on data set 1 shows a similar result.
      • Conclusion: These results show that to predict other behavior, an order two or maybe order four polynomial is better than an order 16 polynomial.
      • Overfitting Explained: This phenomenon is called overfitting. It occurs when a model has too many degrees of freedom.
    • Why Overfitting Happens with Noise

      • Perfect Data Case: With perfect data, adding higher-order terms to a model wouldn't cause problems.
      • Noisy Data Case: With noisy data, even a little noise can cause problems.
      • Example (Noisy Linear Data): Fitting a quadratic to slightly noisy linear data results in small non-zero coefficients for higher-order terms.
      • Take-Home Message: Picking an overly complex model risks overfitting to training data noise.
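
A small illustration of this effect: fitting a quadratic to nearly linear data gives the x^2 term a small but non-zero coefficient that exists only to chase the noise (data and noise level are illustrative):

```python
import numpy as np

# Nearly linear data: y = 2x + 1 with a little Gaussian noise.
rng = np.random.default_rng(2)
xs = np.linspace(0, 10, 30)
ys = 2 * xs + 1 + rng.normal(0, 0.3, size=xs.shape)

# Forcing a quadratic fit produces a tiny spurious x^2 coefficient.
a2, a1, a0 = np.polyfit(xs, ys, 2)
print(round(a2, 4), round(a1, 3), round(a0, 3))
```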
    • Finding the Right Model Complexity

      • Trade-off: There is a trade-off between model complexity and fit.
      • Goal: The goal is to find the simplest possible model that still explains the data well.
      • Method: Start with a low-order model, assess its performance, increase the model order, and repeat.
      • Cross-Validation Techniques

        • Hooke's Law Revisited: In the Hooke's Law example, even though the quadratic fit is tighter, the physical theory says the relationship should be linear until the elastic limit is reached. This suggests fitting separate linear models to different segments of the data (before and after the elastic limit).
        • Guidance without Theory: When no theory guides model choice, cross-validation can be used to determine the appropriate model complexity.
        • Leave-One-Out Cross Validation: For small data sets. For each data point, remove it, train the model on the remaining data, and test the model's prediction against the removed point. Average the results over all points.
        • K-Fold Cross Validation: For larger data sets. Divide the data into K equal chunks. Leave out one chunk, train on the remaining K-1 chunks, and test on the left-out chunk. Repeat for each chunk being the test set and average results.
        • Repeated Random Sampling: For larger data sets. Run K trials. In each trial, randomly select N elements for the test set and use the remainder for the training set. Build the model on the training set and apply it to the test set.
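
Leave-one-out cross validation, for instance, can be sketched as follows (synthetic data; loo_r_squared is a hypothetical helper, not code from the lecture):

```python
import numpy as np

def loo_r_squared(xs, ys, degree):
    """Leave-one-out: predict each point from a model trained on the rest."""
    preds = np.empty_like(ys)
    for i in range(len(xs)):
        keep = np.arange(len(xs)) != i            # drop point i
        model = np.polyfit(xs[keep], ys[keep], degree)
        preds[i] = np.polyval(model, xs[i])       # predict the held-out point
    err = np.sum((ys - preds) ** 2)
    tot = np.sum((ys - ys.mean()) ** 2)
    return 1 - err / tot

# Noisy parabola: degree 2 should win the leave-one-out comparison.
rng = np.random.default_rng(3)
xs = np.linspace(-5, 5, 25)
ys = 3 * xs ** 2 + rng.normal(0, 4, size=xs.shape)

for degree in (1, 2, 8):
    print(degree, round(loo_r_squared(xs, ys, degree), 3))
```

K-fold works the same way with chunks of points held out instead of single points, which makes it far cheaper on large data sets.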
      • Temperature Data Example (Using Random Sampling)

        • Data Set: Mean daily high temperature in the US from 1961 to 2015.
        • Task: Model how the mean daily high temperature has varied. Compute the mean high temperature for each year.
        • Method: Use cross-validation (random sampling). Run 10 trials. Try fitting linear, quadratic, cubic, and quartic models. In each trial, train on one half of the data and test on the other half. Record the R-squared value for the test set.
        • Code Details: Reads temperature data (high temp and year). Computes mean high temperature per year using a dictionary. Plots yearly means over time, showing an increase. Random sampling code splits the data set into train (indices not in random sample) and test (indices in random sample) sets. Loop runs through trials, gets random splits, fits models using polyfit on training data, predicts test data using polyval, computes R-squared on test data, and stores results.
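
A sketch of that pipeline (the records here are synthesized with an assumed warming trend and noise level because the lecture's data file is not reproduced; one random-sampling trial is shown):

```python
import numpy as np

# Hypothetical records of (year, daily high temp). The lecture's real data
# (US daily highs, 1961-2015) is not reproduced; the 0.1 deg/year trend and
# noise level are assumptions for illustration only.
rng = np.random.default_rng(4)
records = [(year, 60 + 0.1 * (year - 1961) + rng.normal(0, 8))
           for year in range(1961, 2016) for _ in range(100)]

# Mean high temperature per year, accumulated via a dictionary.
by_year = {}
for year, temp in records:
    by_year.setdefault(year, []).append(temp)
years = np.array(sorted(by_year))
means = np.array([np.mean(by_year[y]) for y in years])

# One random-sampling trial: half the years train, half test.
idx = rng.permutation(len(years))
train, test = idx[: len(idx) // 2], idx[len(idx) // 2 :]
model = np.polyfit(years[train], means[train], 1)  # linear fit on train half
preds = np.polyval(model, years[test])
err = np.sum((means[test] - preds) ** 2)
tot = np.sum((means[test] - means[test].mean()) ** 2)
test_r2 = 1 - err / tot
print("test R-squared:", round(test_r2, 3))
```

Running this trial loop multiple times and averaging the R-squared values gives the statistics reported in the results table.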
      • Temperature Data Results

        • Results Table: Presents average R-squared values and their standard deviations across the 10 trials for linear, quadratic, cubic, and quartic fits.
        • Conclusion: The linear fit is likely the winner. It has the highest average R-squared (0.86) and the smallest standard deviation across trials (0.025). It is the simplest model.
        • Importance of Multiple Trials: Running multiple trials is important because even with random sampling, a single trial might yield a misleadingly low R-squared value, potentially leading to an incorrect conclusion about the best model. Running multiple trials provides statistics on the variability of the fit.
      • Lecture Summary

        • Linear Regression: Used to fit curves to data in various dimensions, mapping independent to dependent values for prediction.
        • R-squared: A way to measure fit, but validation is needed to see how well the model predicts new data.
        • Model Selection: The goal is to select the simplest model that effectively accounts for the data and predicts new data well. Model complexity can be guided by theory (like Hooke's Law) or by cross-validation techniques (Leave-One-Out, K-Fold, Repeated Random Sampling).