Student Grade Prediction Example
GPA and previous programming experience are germane features; astrological sign and eye color are not.
Feature Selection
Adding too many features can lead to overfitting, where the algorithm discovers spurious correlations (e.g., between GPA and birth month).
- Objective: Maximize the signal-to-noise ratio by retaining informative features and eliminating extraneous ones.
Reptile Labeling Example
Initial characteristics include:
- Egg-laying
- Scales
- Venomous
- Cold-blooded
- Legs
Starting with cobras yields a model in which all of their characteristics (egg-laying, scales, venomous, cold-blooded, no legs) must hold.
The introduction of a boa constrictor (not egg-laying, not venomous) forces a revision of the model.
Adding a chicken (not a reptile) provides a negative example that the model correctly rejects, helping to validate it.
Adding an alligator (a reptile, but one with legs) violates the model's no-legs requirement.
The refined model: scales, cold-blooded, and zero or four legs.
Adding a dart frog (an amphibian, not a reptile) provides a negative example that the refined model correctly rejects.
Adding a python (reptile) and a salmon (fish) exposes the limits of the remaining characteristics (scales, cold-blooded, legs): both animals have scales, are cold-blooded, and lack legs, so the features cannot distinguish them.
Selecting scales and cold-blooded as the features gives a model that misclassifies the salmon as a reptile.
This demonstrates the necessity of design decisions and trade-offs, such as accepting false positives (classifying a non-reptile as a reptile, e.g., the salmon) or false negatives (classifying a reptile as not a reptile). There is usually no single best way to partition the data.
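A minimal sketch of that two-feature rule (the function name and animal values are assumptions for illustration), showing the salmon slipping through as a false positive:

```python
# Sketch of the refined two-feature rule: scales + cold-blooded => "reptile".
def looks_like_reptile(has_scales, is_cold_blooded):
    return has_scales and is_cold_blooded

print(looks_like_reptile(True, True))    # python (reptile) -> True, correct
print(looks_like_reptile(True, True))    # salmon (fish)    -> True, false positive
print(looks_like_reptile(False, True))   # dart frog        -> False, correct
```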
Distance Measurement
After selecting the features, it is necessary to determine how to calculate the distance between vectors.
- For grouping or calculating dividing lines.
- Need to determine which features to use, the distance measure, and possibly the relative weight/scale of different dimensions.
- Reptile characteristics can be encoded as a feature vector (binary values plus an integer leg count); see the sketch below.
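A rough sketch of such vectors (the feature order and individual values are assumptions for illustration):

```python
# Assumed feature order: [egg-laying, has scales, venomous, cold-blooded, num_legs]
animals = {
    "cobra":           [1, 1, 1, 1, 0],
    "boa constrictor": [0, 1, 0, 1, 0],
    "alligator":       [1, 1, 0, 1, 4],
    "dart frog":       [1, 0, 1, 1, 4],
}
```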
Distance Metrics
- Minkowski metric: General formula for distance between vectors.
- Manhattan metric (p=1): Sum of the absolute differences of the components; the distance traveled when movement is restricted to a street grid (hence the name).
- Euclidean distance (p=2): Square root of the sum of the squared differences of the components; the default "as the crow flies" distance.
- The choice of metric does make a difference: an example shows a point that is closer to one object under Euclidean distance and to another under Manhattan distance.
- Euclidean distance applied to the reptile example (snakes, dart frog, alligator) reveals a problem: the number-of-legs dimension (an integer, 0 or 4) overwhelms the distance calculation relative to the binary features (0 or 1); see the sketch after this list.
- This emphasizes that the scale of the dimension matters.
- Manhattan distance or transforming the 'legs' feature into a binary feature (has legs/no legs) would correct this situation and bring the alligator nearer to the snakes.
- Feature engineering, such as feature selection and determining the scale/weight of features, has a big effect on outcomes.
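A sketch of these metrics in plain Python (the function names and animal vectors are assumptions), showing how the integer legs dimension dominates the distance until it is rescaled to a 0/1 feature:

```python
# Minkowski distance: (sum over k of |x_k - y_k|^p) ** (1/p).
# p = 1 gives the Manhattan metric, p = 2 gives the Euclidean metric.
def minkowski(v, w, p):
    return sum(abs(a - b) ** p for a, b in zip(v, w)) ** (1 / p)

def manhattan(v, w):
    return minkowski(v, w, 1)

def euclidean(v, w):
    return minkowski(v, w, 2)

# Assumed vectors: [egg-laying, has scales, venomous, cold-blooded, num_legs]
cobra     = [1, 1, 1, 1, 0]
alligator = [1, 1, 0, 1, 4]
print(euclidean(cobra, alligator))       # ~4.12 -- almost entirely the legs difference

# Replacing num_legs with a binary has-legs feature restores a sensible scale.
cobra_b     = [1, 1, 1, 1, 0]
alligator_b = [1, 1, 0, 1, 1]
print(euclidean(cobra_b, alligator_b))   # ~1.41 -- the alligator moves near the snakes
```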
Clustering Method (Unsupervised Learning)
Objective: Cluster unlabeled data.
- Basic algorithm (similar to k-means); a sketch follows these steps:
- Determine the number of clusters (K).
- Choose K examples as starting exemplars (ideally distant from each other).
- Assign each remaining example to the cluster of its nearest exemplar.
- Compute a new exemplar for each resulting cluster. (Clarification: the transcript says median; standard k-means uses the mean.)
- Repeat until the exemplars stop changing significantly.
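A minimal k-means-style sketch of those steps (pure Python; it uses the mean for the new exemplars, as noted above, and the toy data points are made up):

```python
import random

def euclidean(v, w):
    return sum((a - b) ** 2 for a, b in zip(v, w)) ** 0.5

def k_means(points, k, iterations=100, seed=0):
    """points: list of feature vectors (lists of numbers)."""
    random.seed(seed)
    centroids = random.sample(points, k)             # initial exemplars
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                             # assign each point to its nearest exemplar
            nearest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [                            # recompute exemplars as cluster means
            [sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:               # stop once exemplars no longer move
            break
        centroids = new_centroids
    return clusters, centroids

# Toy usage with made-up (height, weight) points and k = 2.
data = [[70, 180], [72, 190], [68, 170], [75, 250], [77, 300], [76, 280]]
clusters, centroids = k_means(data, k=2)
```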
Applying this to the football data produces clusters.
Validation: After forming clusters, test how well they perform, particularly with new data. The running back example illustrates that the two-cluster model has overlap/ambiguity.
Using three clusters for the football data provides a clearer distinction (receivers, tight ends/running backs, linemen).
Determining the number of clusters and preventing overfitting (e.g., 100 clusters for 100 examples) are significant challenges.
Classification Method (Supervised Learning)
Goal: Learn a rule from labeled data that assigns new examples to the appropriate group.
Approach: Find the simplest surface (a line or plane) that separates the groups; more complex surfaces may be needed but risk overfitting.
K Nearest Neighbors (KNN)
For a new example, find the K nearest labeled examples and let them vote on the label.
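A compact KNN sketch (the training data and feature choices are made up, loosely echoing the voter example in the next section):

```python
from collections import Counter

def euclidean(v, w):
    return sum((a - b) ** 2 for a, b in zip(v, w)) ** 0.5

def knn_predict(train, new_point, k=3):
    """train: list of (feature_vector, label); returns the majority label of the k nearest."""
    neighbors = sorted(train, key=lambda ex: euclidean(ex[0], new_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Made-up (age, miles-from-Boston) training points.
train = [([25, 5], "Democrat"), ([30, 8], "Democrat"), ([45, 20], "Democrat"),
         ([55, 40], "Republican"), ([60, 35], "Republican")]
print(knn_predict(train, [50, 30], k=3))   # majority vote among the 3 nearest neighbors
```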
Evaluating Classifiers
Voter Example: Predicting party (Republican/Democrat) from age and distance from Boston. The training data overlap, so no line separates the classes perfectly. Several candidate lines are illustrated.
How to compare which classifier is superior?
Confusion Matrix
Tabulates predicted labels against actual labels; a small sketch follows this list.
- True Positives (TP): Correctly labeled as an instance of the target class.
- True Negatives (TN): Correctly labeled as not an instance of the target class.
- False Positives (FP): Incorrectly labeled as an instance of the target class (Type I error).
- False Negatives (FN): Incorrectly labeled as not an instance of the target class (Type II error).
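A sketch of how these four counts can be tallied from predicted versus actual labels (the labels and data are made up):

```python
def confusion_counts(actual, predicted, positive):
    """Tally TP, FP, TN, FN, treating `positive` as the target class."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    return tp, fp, tn, fn

actual    = ["R", "R", "D", "D", "R", "D"]
predicted = ["R", "D", "D", "R", "R", "D"]
print(confusion_counts(actual, predicted, positive="R"))   # (TP, FP, TN, FN) = (2, 1, 2, 1)
```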
Accuracy
Calculated as (TP + TN) / Total examples. Indicates overall correctness. Simple lines provide ~70% accuracy on training data.
More complex models can reduce training error (higher accuracy, e.g., 83.3%) but risk overfitting.
Validation: Use the model on test data to assess generalization.
Other Measures
- Positive Predictive Value (PPV): TP / (TP + FP). Fraction of positive predictions that are correct.
- Sensitivity (Recall): TP / (TP + FN). Fraction of actual positives that are correctly identified.
- Specificity: TN / (TN + FP). Fraction of actual negatives that are correctly identified. (A sketch computing these measures follows this list.)
- There is usually a trade-off between sensitivity and specificity. The former may increase at the expense of the latter.
- Receiver Operating Characteristic (ROC) curve: Plots this trade-off.
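A sketch of the measures above computed from the four confusion-matrix counts (the example counts are made up):

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def ppv(tp, fp):            # positive predictive value
    return tp / (tp + fp)

def sensitivity(tp, fn):    # recall
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

tp, tn, fp, fn = 40, 30, 10, 20     # assumed example counts
print(accuracy(tp, tn, fp, fn))     # 0.70
print(ppv(tp, fp))                  # 0.80
print(sensitivity(tp, fn))          # ~0.67
print(specificity(tn, fp))          # 0.75
```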
Conclusion
The following lectures will further explore these ideas, such as objective functions and optimization techniques.
Takeaways: how to measure distance, which features to design, and how to handle model constraints (such as the number of clusters or the complexity of the classifier) to avoid overfitting.