14. Classification and Statistical Sins by MIT OpenCourseWare


Summary by www.lecturesummary.com: 15. Statistical Sins and Wrap Up by MIT OpenCourseWare


0:00:00 - Introduction and Announcements

  • Course evaluation announcements (due Friday noon).
  • Code provided for the final exam available later today.
  • Advice to review the code beforehand and come to office hours if unsure.

0:00:44 - Review of Previous Lecture on KNN vs Logistic Regression

  • Comparison of Titanic data results for KNN (k=3) and logistic regression (P=0.5).
  • Logistic regression was slightly better, but not by a statistically significant margin.
  • Importance of not only predicting but also learning by looking at the model itself.

0:01:16 - Interpreting Logistic Regression Weights

  • Examining weights to see how variables affect survival.
  • Example weights: First class cabin (+1.6 strong positive effect), second class (+0.46 positive effect), third class (negative effect).
  • Age had a weak negative effect (older less likely to survive).
  • Male gender had a large negative effect (more likely to die).
  • Warning: Can only interpret relative weights, not absolute weights.

0:02:00 - Warning Points on Interpreting Weights

  • Be very cautious about interpreting weights one at a time, because features may be correlated.
  • Reference to L1 and L2 regularization in logistic regression.
  • L1 pushes weights to zero, which is helpful for high-dimensional data so that it doesn't overfit.
  • If variables are correlated, L1 may push one to zero, and the variable will appear insignificant.
  • L2 distributes weight over correlated variables, so they all appear less significant.
  • This has greater importance when there are hundreds or thousands of variables.
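The L1/L2 behavior described above can be sketched with scikit-learn on synthetic data. Everything here (the data, the `C` values, the solvers) is a hypothetical illustration, not the lecture's actual Titanic model:

```python
# Sketch: how L1 vs L2 regularization treats two perfectly correlated
# features. Synthetic data; column x1 is an exact copy of column x0.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
X = np.column_stack([x0, x0])                       # two identical (correlated) columns
y = (x0 + rng.normal(scale=0.5, size=500) > 0).astype(int)

l1 = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
l2 = LogisticRegression(penalty="l2", solver="lbfgs", C=1.0).fit(X, y)

print("L1 weights:", l1.coef_[0])   # tends to concentrate weight on one copy
print("L2 weights:", l2.coef_[0])   # splits the weight roughly evenly
```

With L2, the symmetric penalty makes the two identical columns share the weight, so each looks individually "less significant"; with L1, one copy tends to absorb the weight while the other is pushed toward zero.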

0:03:00 - Example: Correlated Cabin Class Variables

  • Cabin classes (C1, C2, C3) are correlated since an individual is usually in only one class (C1 + C2 + C3 = 1).
  • That indicates values are not independent.
  • Question: Is first class safe or second/third class dangerous? No easy answer because of correlation.
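The C1 + C2 + C3 = 1 dependence can be seen directly: once an intercept (a column of ones) is in the model, the three indicator columns together are redundant. A small sketch with made-up passengers, not the actual Titanic data:

```python
# Sketch: the cabin-class indicators are linearly dependent once an
# intercept term is included, because C1 + C2 + C3 = 1 for every row.
import numpy as np

onehot = np.array([
    [1, 0, 0],   # first-class passenger
    [0, 1, 0],   # second class
    [0, 0, 1],   # third class
    [0, 1, 0],
    [1, 0, 0],
])
assert (onehot.sum(axis=1) == 1).all()          # each passenger is in exactly one class

ones = np.ones((len(onehot), 1))                # intercept column
full = np.hstack([ones, onehot])                # 4 columns, but rank only 3
reduced = np.hstack([ones, onehot[:, 1:]])      # drop C1: 3 columns, full rank

print(np.linalg.matrix_rank(full), np.linalg.matrix_rank(reduced))
```

Because `full` is rank-deficient, many different weight vectors fit the data equally well, which is why dropping C1 can change the weights without changing accuracy.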

0:04:00 - Experiment: Removing a Correlated Feature (C1)

  • Experiment setup description: removal of the C1 binary feature from the data.
  • Code change illustrated to remove C1.

0:05:20 - Experiment Results: Different Weights, Same Performance

  • Accuracy did not reduce significantly after removing C1.
  • Weights greatly altered: Large negative weights are now on C2 and C3, instead of the large positive weight on C1 in the original model.
  • The overall point: when you have correlated features, be extremely cautious about over-interpreting the weights.
  • Usually safe to trust the sign (positive or negative).

0:06:10 - Altering the Probability Cutoff (P)

  • Talking about probability cutoff P in logistic regression (default 0.5).
  • Attempting extreme values of P: 0.1 and 0.9.
  • Altering P will alter the decision boundary for survival prediction.

0:06:45 - Effects of Altering P on Metrics

  • Altering P will change sensitivity, specificity, and positive predictive value.
  • The choice of P reflects a judgment about which kind of mistake matters more (e.g., missing actual survivors vs. falsely predicting survival).
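A minimal sketch of that trade-off, using invented predicted probabilities and labels (1 = survived) rather than the lecture's actual model output:

```python
# Sketch: how the probability cutoff P trades sensitivity against
# specificity. Hypothetical scores and labels for illustration only.
probs  = [0.05, 0.15, 0.35, 0.45, 0.55, 0.65, 0.80, 0.92, 0.30, 0.70]
labels = [0,    0,    0,    1,    0,    1,    1,    1,    0,    1]

def metrics(cutoff):
    """Return (sensitivity, specificity) when predicting 1 iff prob >= cutoff."""
    tp = sum(p >= cutoff and y == 1 for p, y in zip(probs, labels))
    fn = sum(p < cutoff  and y == 1 for p, y in zip(probs, labels))
    tn = sum(p < cutoff  and y == 0 for p, y in zip(probs, labels))
    fp = sum(p >= cutoff and y == 0 for p, y in zip(probs, labels))
    return tp / (tp + fn), tn / (tn + fp)

for cutoff in (0.1, 0.5, 0.9):
    sens, spec = metrics(cutoff)
    print(f"P={cutoff}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Raising P makes positive predictions more conservative: sensitivity falls while specificity rises, which is exactly the judgment call described above.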

0:07:00 - Results of Changing P Example (P=0.9)

  • Accuracy was greater with P=0.9 in this particular run.
  • Key observation: Significant difference in sensitivity.
  • With P=0.9, if the model predicts a person survived, they likely did (suggesting a high positive predictive value, though that metric is not shown directly in these results).

Introduction to Receiver Operating Characteristic (ROC) Curve

  • A measure to assess classifiers independent of any particular cutoff P.
  • Objective: examine performance across all possible cutoffs.
  • How it works: build one model, vary P, apply the model with each value of P to the same test set, and record the results.
  • Plot: sensitivity (Y-axis) against 1 minus specificity (X-axis).

Area Under the Curve (AUC)

  • AUC summarizes the ROC curve in a single number.
  • Computed with `sklearn.metrics.auc`.
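A minimal sketch of building the curve and computing the area with scikit-learn. The labels and scores are invented for illustration, not the lecture's model:

```python
# Sketch: ROC curve and AUC via sklearn. roc_curve sweeps every useful
# cutoff and returns one operating point per threshold.
from sklearn.metrics import roc_curve, auc

y_true  = [0, 0, 0, 0, 1, 1, 1, 1]                   # true classes
y_score = [0.1, 0.3, 0.6, 0.2, 0.4, 0.7, 0.8, 0.9]   # model's predicted P(positive)

fpr, tpr, thresholds = roc_curve(y_true, y_score)    # fpr = 1 - specificity
area = auc(fpr, tpr)                                 # trapezoidal area under the curve
print("AUC:", area)
```

Here one positive (score 0.4) is outscored by one negative (0.6), so the model is good but not perfect, and the AUC lands just below 1.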

Interpreting the ROC Curve Plot

    • The graph illustrates the balance between sensitivity and 1-specificity.
    • Bottom-left corner (0,0): Low sensitivity, high specificity (predicting no one positive, few false positives).
    • Top-right corner (1,1): High sensitivity, low specificity (predicting everyone positive, many false positives).
    • Would normally want to be somewhere in between (e.g., a "knee" in the curve).
    • Green line indicates a random classifier (AUC = 0.5).
    • The region between the random line and the curve represents how much better the model performs compared to random.

AUC Significance Question

  • Question posed: At what level does AUC become statistically significant?

Answer on AUC Significance

  • Effectively unanswerable in general.
  • Significance depends on the sample size and on the baseline being compared against (e.g., better than random).
  • With an enormous sample size, a slight gain over 0.5 can be statistically significant but uninteresting.
  • The real question is whether the findings are useful, not merely whether they are statistically significant.

Question on Plotting 1-Specificity

  • Question asked: Why plot 1-specificity rather than specificity?

Answer on Plotting 1-Specificity

  • The reason is largely convenience: it makes the curve a concave arc running from (0,0) to (1,1), so the area under it (AUC) is easy to compute and compare.
  • Algebraically, 1 minus specificity is the false positive rate, so either quantity can be derived from the other.
  • It's a clever trick that facilitates visualization and computation.
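In confusion-matrix terms (TP, FP, TN, FN for true/false positives/negatives), the identity behind that answer is:

```latex
\text{sensitivity} = \frac{TP}{TP + FN},
\qquad
\text{specificity} = \frac{TN}{TN + FP},
\qquad
1 - \text{specificity} = \frac{FP}{FP + TN} = \text{false positive rate}
```

So plotting sensitivity against 1 minus specificity is simply plotting the true positive rate against the false positive rate.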

Transition to Statistical Sins

  • Shifting the subject of discussion to abuses of statistics.

    Quote: "There are three kinds of lies: Lies, damned lies, and statistics".

    Quote that goes to Mark Twain or Benjamin Israeli.

How to Lie with Statistics

  • Darrell Huff quote: "If you can't prove what you want to prove, demonstrate something else and pretend they're the same thing."

Statistical Sin: Statistics vs Data Visualization

  • Introducing Anscombe's Quartet.

    • Four sets of X-Y pairs with the same mean, variance, and linear regression equation.
    • However: When plotted, the data distributions are completely different.
    • Moral: Statistics about data is not the same thing as the data itself.
    • Easy to forget: That statistics don't tell the whole story.
    • Recommendation: Always plot your data first.
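The point is easy to verify with the classic published quartet values: all four sets report the same summary statistics, yet their scatter plots look nothing alike.

```python
# Sketch: Anscombe's Quartet — identical summary statistics,
# completely different data when plotted.
from statistics import mean

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in quartet.items():
    print(f"set {name}: mean(x)={mean(x):.2f}  mean(y)={mean(y):.2f}")
# Every set reports mean(x) = 9.00 and mean(y) = 7.50, yet set II is a
# curve, set III has one outlier on a line, and set IV is a vertical
# stack plus one point — which is why you should plot the data first.
```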

Statistical Sin: Lying with Pictures (Manipulating Axes)

    • Pictures can be great but can also be used to deceive.
    • Example: grade chart by gender.
    • Original plot: a truncated Y-axis (3.9 to 4.05) makes a tiny difference look huge.
    • Honest plot: a 0-to-5 Y-axis reveals the difference is slight.
    • Moral: always check axis labels and scales.

0:18:45 - Statistical Sin: Lying with Pictures (Non-comparable Data)

  • Example: Fox News chart comparing welfare recipients vs individuals with full-time jobs.
  • No Y-axis label, suggesting the baseline is not zero.
  • More significantly, the definitions are non-comparable.
  • "People on welfare" counts every member of a household if anyone in it receives welfare.
  • "People with a full-time job" counts only individuals who themselves hold a job.
  • This comparison creates a very misleading impression.

0:20:15 - Moral: Are the Things You're Comparing Actually Comparable?

  • This is a standard statistical sin.

0:20:20 - Statistical Sin: GIGO (Garbage In, Garbage Out)

  • Meaning: if you put rubbish data in, you get rubbish results out.
  • Charles Babbage anecdote: he was asked whether his computational engine would produce the right answers if it were fed the wrong figures.

0:21:25 - Example: 1840s US Census on Slavery and Insanity

  • John Calhoun used census figures to assert that slavery benefited slaves.
  • He was challenged by John Quincy Adams.
  • Calhoun later acknowledged census mistakes but asserted they would average out.
  • Rebuttal: the errors were systematic (biased), not unbiased and independent, so they would not average out.
  • The data was inherently flawed.

0:23:05 - Moral of GIGO: Analysis of Bad Data

  • Analysis of bad data is worse than no analysis at all.
  • People often perform proper statistical analysis on improper data and arrive at incorrect conclusions.
  • The first question to ask: is the data worth analyzing at all?

0:23:55 - Statistical Sin: Survivor Bias

  • Photo of a World War II fighter aircraft.
  • Analysts examined damage on aircraft that returned in order to decide where to place armor.
  • Flaw: they should have examined the planes that were shot down (the non-survivors).
  • The sample (planes that flew back) is not representative of all planes (including the ones downed).

0:25:10 - Survivor Bias in Sampling

  • A problem whenever sampling is used to make inferences about a population.
  • Statistical methods are based on random sampling.
  • Convenience sampling is typically not random.
  • Examples: course feedback (students who dropped out are not sampled), grades (failing students drop out).
0:25:55 - Statistical Sin: Non-response Bias

  • Another category of non-representative sampling, which occurs in opinion polls and surveys.
  • People who respond to surveys are not representative of the entire population.
0:26:30 - Problem with Non-Random/Non-Independent Samples

  • Simple statistics (mean, standard deviation) can still be calculated.
  • But conclusions based on methods such as the Empirical Rule, the Central Limit Theorem, or the standard error are invalid, because the assumption of random, independent samples is violated.
  • Example: political polls that use landlines leave out a big chunk of the population (younger individuals).

0:27:35 - Moral of Sampling Issues

  • Always know how the data was gathered and what the analysis assumes.
  • Be very cautious with conclusions when those assumptions are not met.

0:27:50 - Conclusion and What's Coming Up Next

  • The next lecture will complete statistical sins and deliver a course wrap-up.