14. Classification and Statistical Sins by MIT OpenCourseWare


Summary by www.lecturesummary.com: 15. Statistical Sins and Wrap Up by MIT OpenCourseWare


0:00:00 - Introduction and Announcements

  • Course evaluation announcements (due Friday noon).
  • Code provided for the final exam available later today.
  • Advice to review the code beforehand and come to office hours if unsure.

0:00:44 - Review of Previous Lecture on KNN vs Logistic Regression

  • Comparison of Titanic data results for KNN (k=3) and logistic regression (P=0.5).
  • Logistic regression was slightly better, but not by a statistically significant margin.
  • Importance of not only predicting but also learning by looking at the model itself.

0:01:16 - Interpreting Logistic Regression Weights

  • Examining weights to see how variables affect survival.
  • Example weights: First class cabin (+1.6 strong positive effect), second class (+0.46 positive effect), third class (negative effect).
  • Age had a weak negative effect (older less likely to survive).
  • Male gender had a large negative effect (more likely to die).
  • Warning: Can only interpret relative weights, not absolute weights.

0:02:00 - Warning Points on Interpreting Weights

  • Be very cautious about interpreting weights one at a time, because features may be correlated.
  • Reference to L1 and L2 regularization in logistic regression.
  • L1 pushes weights to zero, which is helpful for high-dimensional data so that it doesn't overfit.
  • If variables are correlated, L1 may push one to zero, and the variable will appear insignificant.
  • L2 distributes weight over correlated variables, so they all appear less significant.
  • This has greater importance when there are hundreds or thousands of variables.
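The L1/L2 behavior described above can be sketched with scikit-learn on synthetic data. Everything here (the data, the `C` values, the solvers) is a hypothetical illustration, not the lecture's actual Titanic model:

```python
# Sketch: how L1 vs L2 regularization treats two perfectly correlated
# features. Synthetic data; column x1 is an exact copy of column x0.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
X = np.column_stack([x0, x0])                       # two identical (correlated) columns
y = (x0 + rng.normal(scale=0.5, size=500) > 0).astype(int)

l1 = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
l2 = LogisticRegression(penalty="l2", solver="lbfgs", C=1.0).fit(X, y)

print("L1 weights:", l1.coef_[0])   # tends to concentrate weight on one copy
print("L2 weights:", l2.coef_[0])   # splits the weight roughly evenly
```

With L2, the symmetric penalty makes the two identical columns share the weight, so each looks individually "less significant"; with L1, one copy tends to absorb the weight while the other is pushed toward zero.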

0:03:00 - Example: Correlated Cabin Class Variables

  • Cabin classes (C1, C2, C3) are correlated since an individual is usually in only one class (C1 + C2 + C3 = 1).
  • That indicates values are not independent.
  • Question: Is first class safe or second/third class dangerous? No easy answer because of correlation.
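The C1 + C2 + C3 = 1 dependence can be seen directly: once an intercept (a column of ones) is in the model, the three indicator columns together are redundant. A small sketch with made-up passengers, not the actual Titanic data:

```python
# Sketch: the cabin-class indicators are linearly dependent once an
# intercept term is included, because C1 + C2 + C3 = 1 for every row.
import numpy as np

onehot = np.array([
    [1, 0, 0],   # first-class passenger
    [0, 1, 0],   # second class
    [0, 0, 1],   # third class
    [0, 1, 0],
    [1, 0, 0],
])
assert (onehot.sum(axis=1) == 1).all()          # each passenger is in exactly one class

ones = np.ones((len(onehot), 1))                # intercept column
full = np.hstack([ones, onehot])                # 4 columns, but rank only 3
reduced = np.hstack([ones, onehot[:, 1:]])      # drop C1: 3 columns, full rank

print(np.linalg.matrix_rank(full), np.linalg.matrix_rank(reduced))
```

Because `full` is rank-deficient, many different weight vectors fit the data equally well, which is why dropping C1 can change the weights without changing accuracy.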

0:04:00 - Experiment: Removing a Correlated Feature (C1)

  • Experiment setup description: removal of the C1 binary feature from the data.
  • Code change illustrated to remove C1.

0:05:20 - Experiment Results: Different Weights, Same Performance

  • Accuracy did not reduce significantly after removing C1.
  • Weights greatly altered: Large negative weights are now on C2 and C3, instead of the large positive weight on C1 in the original model.
  • The overall point: when you have correlated features, be extremely cautious about over-interpreting the weights.
  • Usually safe to trust the sign (positive or negative).

0:06:10 - Altering the Probability Cutoff (P)

  • Talking about probability cutoff P in logistic regression (default 0.5).
  • Attempting extreme values of P: 0.1 and 0.9.
  • Altering P will alter the decision boundary for survival prediction.

0:06:45 - Effects of Altering P on Metrics

  • Altering P will change sensitivity, specificity, and positive predictive value.
  • The choice of P reflects a judgment about which kind of mistake matters more (e.g., missing actual survivors vs. falsely predicting survival).
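A minimal sketch of that trade-off, using invented predicted probabilities and labels (1 = survived) rather than the lecture's actual model output:

```python
# Sketch: how the probability cutoff P trades sensitivity against
# specificity. Hypothetical scores and labels for illustration only.
probs  = [0.05, 0.15, 0.35, 0.45, 0.55, 0.65, 0.80, 0.92, 0.30, 0.70]
labels = [0,    0,    0,    1,    0,    1,    1,    1,    0,    1]

def metrics(cutoff):
    """Return (sensitivity, specificity) when predicting 1 iff prob >= cutoff."""
    tp = sum(p >= cutoff and y == 1 for p, y in zip(probs, labels))
    fn = sum(p < cutoff  and y == 1 for p, y in zip(probs, labels))
    tn = sum(p < cutoff  and y == 0 for p, y in zip(probs, labels))
    fp = sum(p >= cutoff and y == 0 for p, y in zip(probs, labels))
    return tp / (tp + fn), tn / (tn + fp)

for cutoff in (0.1, 0.5, 0.9):
    sens, spec = metrics(cutoff)
    print(f"P={cutoff}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Raising P makes positive predictions more conservative: sensitivity falls while specificity rises, which is exactly the judgment call described above.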

0:07:00 - Results of Changing P Example (P=0.9)

  • Accuracy was greater with P=0.9 in this particular run.
  • Key observation: Significant difference in sensitivity.
  • With P=0.9, if the model predicts a person survived, they likely did (suggesting a high positive predictive value, though that metric is not shown directly in these results).

Introduction to Receiver Operating Characteristic (ROC) Curve

  • A measure to assess classifiers independent of any particular cutoff P.
  • Objective: examine performance across all possible cutoffs.
  • How it works: build one model, vary P, apply the model with each value of P to the same test set, and record the results.
  • Plot: sensitivity (Y-axis) against 1 minus specificity (X-axis).

Area Under the Curve (AUC)

  • AUC summarizes the ROC curve in a single number.
  • Computed with `sklearn.metrics.auc`.
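A minimal sketch of building the curve and computing the area with scikit-learn. The labels and scores are invented for illustration, not the lecture's model:

```python
# Sketch: ROC curve and AUC via sklearn. roc_curve sweeps every useful
# cutoff and returns one operating point per threshold.
from sklearn.metrics import roc_curve, auc

y_true  = [0, 0, 0, 0, 1, 1, 1, 1]                   # true classes
y_score = [0.1, 0.3, 0.6, 0.2, 0.4, 0.7, 0.8, 0.9]   # model's predicted P(positive)

fpr, tpr, thresholds = roc_curve(y_true, y_score)    # fpr = 1 - specificity
area = auc(fpr, tpr)                                 # trapezoidal area under the curve
print("AUC:", area)
```

Here one positive (score 0.4) is outscored by one negative (0.6), so the model is good but not perfect, and the AUC lands just below 1.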

Interpreting the ROC Curve Plot

    • The graph illustrates the balance between sensitivity and 1-specificity.
    • Bottom-left corner (0,0): Low sensitivity, high specificity (predicting no one positive, few false positives).
    • Top-right corner (1,1): High sensitivity, low specificity (predicting everyone positive, many false positives).
    • Would normally want to be somewhere in between (e.g., a "knee" in the curve).
    • Green line indicates a random classifier (AUC = 0.5).
    • The region between the random line and the curve represents how much better the model performs compared to random.

AUC Significance Question

  • Question posed: At what level does AUC become statistically significant?

Answer on AUC Significance

  • Effectively unanswerable in general.
  • Significance depends on the sample size and on the baseline being compared against (e.g., better than random).
  • With an enormous sample size, a slight gain over 0.5 can be statistically significant but uninteresting.
  • The real question is whether the findings are useful, not merely whether they are statistically significant.

Question on Plotting 1-Specificity

  • Question asked: Why plot 1-specificity rather than specificity?

Answer on Plotting 1-Specificity

  • The reason is largely convenience: it makes the curve a concave arc running from (0,0) to (1,1), so the area under it (AUC) is easy to compute and compare.
  • Algebraically, 1 minus specificity is the false positive rate, so either quantity can be derived from the other.
  • It's a clever trick that facilitates visualization and computation.
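In confusion-matrix terms (TP, FP, TN, FN for true/false positives/negatives), the identity behind that answer is:

```latex
\text{sensitivity} = \frac{TP}{TP + FN},
\qquad
\text{specificity} = \frac{TN}{TN + FP},
\qquad
1 - \text{specificity} = \frac{FP}{FP + TN} = \text{false positive rate}
```

So plotting sensitivity against 1 minus specificity is simply plotting the true positive rate against the false positive rate.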

Transition to Statistical Sins

  • Shifting the subject of discussion to abuses of statistics.

    Quote: "There are three kinds of lies: Lies, damned lies, and statistics".

    Quote that goes to Mark Twain or Benjamin Israeli.

How to Lie with Statistics

  • Darrell Huff quote: "If you can't prove what you want to prove, demonstrate something else and pretend they're the same thing."

Statistical Sin: Statistics vs Data Visualization

  • Introducing Anscombe's Quartet.

    • Four sets of X-Y pairs with the same mean, variance, and linear regression equation.
    • However: When plotted, the data distributions are completely different.
    • Moral: Statistics about data is not the same thing as the data itself.
    • Easy to forget: That statistics don't tell the whole story.
    • Recommendation: Always plot your data first.
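The point is easy to verify with the classic published quartet values: all four sets report the same summary statistics, yet their scatter plots look nothing alike.

```python
# Sketch: Anscombe's Quartet — identical summary statistics,
# completely different data when plotted.
from statistics import mean

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in quartet.items():
    print(f"set {name}: mean(x)={mean(x):.2f}  mean(y)={mean(y):.2f}")
# Every set reports mean(x) = 9.00 and mean(y) = 7.50, yet set II is a
# curve, set III has one outlier on a line, and set IV is a vertical
# stack plus one point — which is why you should plot the data first.
```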

Statistical Sin: Lying with Pictures (Manipulating Axes)

    • Pictures can be great but can also be used to deceive.
    • Example: grade chart by gender.
    • Original plot: a truncated Y-axis (3.9 to 4.05) makes a tiny difference look huge.
    • Honest plot: a 0-to-5 Y-axis reveals the difference is slight.
    • Moral: always check axis labels and scales.

0:18:45 - Statistical Sin: Lying with Pictures (Non-comparable Data)

  • Example: Fox News chart comparing welfare recipients vs individuals with full-time jobs.
  • No Y-axis label, suggesting the baseline is not zero.
  • More significantly, the definitions are non-comparable.
  • "People on welfare" counts every member of a household if anyone in it receives welfare.
  • "People with a full-time job" counts only individuals who themselves hold a job.
  • This comparison creates a very misleading impression.

0:20:15 - Moral: Are the Things You're Comparing Actually Comparable?

  • This is a standard statistical sin.

0:20:20 - Statistical Sin: GIGO (Garbage In, Garbage Out)

  • Meaning: if you put rubbish data in, you get rubbish results out.
  • Charles Babbage anecdote: he was asked whether his computational engine would produce the right answers if it were fed the wrong figures.

0:21:25 - Example: 1840s US Census on Slavery and Insanity

  • John Calhoun used census figures to assert that slavery benefited slaves.
  • He was challenged by John Quincy Adams.
  • Calhoun later acknowledged census mistakes but asserted they would average out.
  • Rebuttal: the errors were systematic (biased), not unbiased and independent, so they would not average out.
  • The data was inherently flawed.

0:23:05 - Moral of GIGO: Analysis of Bad Data

  • Analysis of bad data is worse than no analysis at all.
  • People often perform proper statistical analysis on improper data and arrive at incorrect conclusions.
  • The first question to ask: is the data worth analyzing at all?

0:23:55 - Statistical Sin: Survivor Bias

  • Photo of a World War II fighter aircraft.
  • Analysts examined damage on aircraft that returned in order to decide where to place armor.
  • Flaw: they should have examined the planes that were shot down (the non-survivors).
  • The sample (planes that flew back) is not representative of all planes (including the ones downed).

0:25:10 - Survivor Bias in Sampling

  • A problem whenever sampling is used to make inferences about a population.
  • Statistical methods are based on random sampling.
  • Convenience sampling is typically not random.
  • Examples: course feedback (students who dropped out are not sampled), grades (failing students drop out).
0:25:55 - Statistical Sin: Non-response Bias

  • Another category of non-representative sampling, which occurs in opinion polls and surveys.
  • People who respond to surveys are not representative of the entire population.
0:26:30 - Problem with Non-Random/Non-Independent Samples

  • Simple statistics (mean, standard deviation) can still be calculated.
  • But conclusions based on methods such as the Empirical Rule, the Central Limit Theorem, or the standard error are invalid, because the assumption of random, independent samples is violated.
  • Example: political polls that use landlines leave out a big chunk of the population (younger individuals).

0:27:35 - Moral of Sampling Issues

  • Always know how the data was gathered and what the analysis assumes.
  • Be very cautious with conclusions when those assumptions are not met.

0:27:50 - Conclusion and What's Coming Up Next

  • The next lecture will complete statistical sins and deliver a course wrap-up.