11. Introduction to Machine Learning by MIT OpenCourseWare

Summary by www.lecturesummary.com


 Introduction and Context of Lecture

  • Welcome back to the lecture.
  • The last class is two weeks away.
  • The previous lectures covered linear regression, employing experimental data such as spring weights and displacements to infer a model.
  • Linear regression was about the best-fit line to data or the best model to use (linear, quadratic, cubic) through validation.
  • Linear regression is a transition to the subject of machine learning, the final large topic of the class.

 Introduction to Machine Learning

  • Machine learning is a large topic with several university courses concentrating on it.
  • Other applications such as natural language processing, computational biology, computer vision, and robotics depend heavily on machine learning nowadays.
  • The lecture series will give an overview of machine learning, with an emphasis on fundamental concepts.
  • Basic concepts include having examples, representing them with features, and measuring distances between them.
  • The idea of distance can be used to group similar things together.

 Types of Machine Learning 

  • The course will look at two standard approaches to learning:
    • Classification methods (e.g., k-nearest neighbors).
    • Clustering methods.
  • Classification is suitable for labeled data where labels are known for examples.
  • Clustering is suitable when labeled data is unavailable.

 Frequency and Examples of Machine Learning

  • Machine learning has come a long way since the lecturer began working on AI in 1975.
  • **Examples of machine learning applications:**
  • AlphaGo defeated a world champion Go player.
  • Recommendation systems (Netflix, Amazon).
  • Google ads based on interests.
  • Drug discovery.
  • Post Office character recognition.
  • Two Sigma (hedge fund employing AI/ML for stunning returns).
  • Siri.
  • Mobileye (computer vision for autonomous/driver-assisted driving, i.e., automatic braking).
  • Facial recognition (Facebook).
  • IBM Watson for diagnosing cancer.

 Defining Machine Learning 

  • The lecturer provocatively asserts nearly every computer program learns something, though the degree differs.
  • Newton's square root algorithm might technically qualify as learning, but it was carefully programmed.
  • Linear regression felt more like learning: the computer fit a curve to data, effectively learning a model that could make predictions.
  • A desirable characteristic of a machine learning algorithm is a program that can learn from experience and apply that to infer new facts.
  • Arthur Samuel's 1959 definition: machine learning is "the field of study that gives computers the ability to learn without being explicitly programmed".
  • Samuel created an early ML program that could play checkers, defeated national-level players, and learned to improve by watching its games.

**Traditional Programming vs. Machine Learning Approach**

  • **Traditional Programming**: A program is written by a programmer, data is inserted, and the computer gives output.
  • **Machine Learning Method**: The developer provides the computer with input *and* the corresponding output (examples and labels), and the machine learning process generates the program.
  • The resulting learned program can then be applied to make new inferences.
  • Curve fitting (such as linear regression) is a simplified version, learning a model from observations to predict new cases.

**How Computers Learn (Desired Capability)**

  • Humans learn through memorizing facts (declarative knowledge) or by inferring/deducing new information (imperative knowledge).
  • In ML, we desire programs that can deduce useful information from hidden patterns of data, not explicit ones.
  • The algorithm must discover these patterns and produce a program to deduce new data.

The Basic Machine Learning Paradigm

  • Provide the system with training data (observations).
  • Write code (inference engine) that infers something about the process that generated the data.
  • Use the inferred model/program to make predictions about unseen data.
  • The spring example fits this: data (mass/displacement) → inference (spring constant/linear equation) → prediction (new displacements).
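The data → inference → prediction pipeline above can be sketched in a few lines of Python. This is a minimal sketch: the force/displacement numbers are made up for illustration, and `fit_spring_constant` is a hypothetical helper, not code from the lecture.

```python
# Data -> inference -> prediction, using the spring example.
# All numbers below are invented (ideal Hooke's-law spring, F = k * x).

def fit_spring_constant(forces, displacements):
    """Infer the spring constant k in F = k * x by least squares
    through the origin."""
    num = sum(f * x for f, x in zip(forces, displacements))
    den = sum(x * x for x in displacements)
    return num / den

forces = [1.0, 2.0, 3.0, 4.0]         # newtons (training data)
displacements = [0.1, 0.2, 0.3, 0.4]  # meters

k = fit_spring_constant(forces, displacements)  # inference: k = 10.0
predicted_displacement = 2.5 / k                # prediction for a new 2.5 N load
```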

Examples, Features, and Labels

  • A more common ML setting involves giving examples with associated data/features and labels.
  • The goal is to figure out how to infer labels for new, unlabeled examples.
  • Football player example: Examples are players, label is their position (receiver, lineman), data/features are height and weight.
  • Objective: Understand how height and weight forecast position in order to forecast the position of new players (e.g., draft picks).

Supervised vs. Unsupervised Learning (Revisited)

  • Supervised Learning: There is a known label for each training example. The aim is to discover a rule to predict labels for unseen input in light of these labeled examples.
  • Unsupervised Learning: Examples are given, but the labels are unknown. The goal is to find natural ways to group (cluster) those examples together.

Illustrating Concepts with the Football Example

  • Plotting height and weight data for receivers (red) and linemen (blue).
  • Unlabeled Case (Clustering):
    • Decide what makes players similar (a distance measure).
    • Goal: Separate the distribution into natural groups.
    • Using only weight results in a dividing line at ~280 lbs.
    • Using only height is less clean.
    • Using both height and weight gives a clearer separation into two clusters.
    • A dividing line equidistant from the cluster centers can act as a classifier.
  • Typical clustering algorithm (K-means type): Choose initial exemplars, assign other points to closest exemplar, find cluster medians as new exemplars, iterate.

Labeled Case (Classification)

  • Knowing the labels (receivers vs. linemen) allows finding a separating surface (a line in 2D) that divides the space.
  • If data is well separated, this is easy.
  • Overfitting must be prevented from producing overly complex surfaces; some mislabeled examples may have to be tolerated.
  • The best dividing line for the football data falls at approximately 280 lbs.
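The ~280 lb cut-off amounts to a one-line classifier. A minimal sketch, assuming the threshold given in the summary above; the sample weights are invented.

```python
# Dividing-line classifier for the football example. The ~280 lb
# threshold comes from the summary; the example weights are made up.

def classify_by_weight(weight_lbs, threshold=280):
    return "lineman" if weight_lbs > threshold else "receiver"

heavy = classify_by_weight(305)  # 'lineman'
light = classify_by_weight(190)  # 'receiver'
```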

Labeling New Data (Running Backs Example)

  • Adding two new examples (running backs - black dots) to the plot.
  • With the unlabeled clustering model, one new point is clearly near the receiver cluster, but the other is very close to the dividing line, making classification unclear.
  • Using the labeled classification model, both new points lie distinctly below the line, labeling them as receivers rather than linemen.
  • This illustrates how labeled and unlabeled data result in differing means of constructing classifiers/clusters.

Machine Learning Pipeline: Must-Haves

  • All machine learning techniques have five must-haves:
    • 1. Training Data & Evaluation: Choose what data to work with and how to define success.
    • 2. Representation: How to represent each case (e.g., how to pick features).
    • 3. Distance: How to calculate distance between instances/features.
    • 4. Objective Function: How to determine the optimal cluster/model (optimize a function). (To be discussed later)
    • 5. Optimization Method: How to train the model parameters. (To be discussed later)
  • Today's lecture is on Representation (Features) and Distance.

Feature Engineering

  • Features are attributes of examples helpful in making similarity decisions or in discovering cut-off lines.
  • It is important to select appropriate features; simple-to-access features (such as height/weight) may not be the most predictive.
  • Feature engineering is the process of determining what to measure and how to assign weights to their relative significance.
  • Student grade prediction example: GPA and prior programming experience are germane features; astrological sign and eye color are not.

Feature Selection

  • Adding too many features risks overfitting, where the algorithm discovers spurious correlations (e.g., between GPA and birth month).
  • Objective: maximize the signal-to-noise ratio, retaining informative features and eliminating extraneous ones.

Reptile Labeling Example

  • Initial features: egg-laying, scales, venomous, cold-blooded, legs.
  • Starting with a cobra gives a model in which all of these hold (egg-laying, scales, venomous, cold-blooded, no legs).
  • Introducing a boa constrictor (not egg-laying, not venomous) forces the model to be refined.
  • Adding a chicken (not a reptile) is a counterexample that the model correctly rejects, supporting it.
  • Adding an alligator (a reptile, but with legs) violates the model's no-legs condition.
  • The refined model: scales, cold-blooded, and zero or four legs.
  • A dart frog (an amphibian, not a reptile) is a counterexample that the refined model correctly rejects.
  • A python (reptile) and a salmon (fish) expose the limits of the available features (scales, cold-blooded, legs): they cannot properly distinguish the two.
  • Selecting scales and cold-blooded as the features yields a model that misclassifies the salmon as a reptile.
  • This demonstrates the need for design decisions and trade-offs, such as accepting false positives (labeling a non-reptile, e.g., the salmon, as a reptile) or false negatives (labeling a reptile as not a reptile). There is usually no perfect way to partition the data.
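The scales-and-cold-blooded rule, and its salmon false positive, can be sketched directly. The dict representation below is an assumption for illustration, not the lecture's code.

```python
# Rule-based classifier from the reptile example: label an animal a
# reptile if it has scales and is cold-blooded. Records are hypothetical.

def is_reptile(animal):
    return animal["scales"] and animal["cold_blooded"]

boa = {"name": "boa constrictor", "scales": True, "cold_blooded": True}
salmon = {"name": "salmon", "scales": True, "cold_blooded": True}

is_reptile(boa)     # True  (correct)
is_reptile(salmon)  # True  (false positive: the salmon is a fish)
```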

Distance Measurement

  • After selecting features, we must decide how to measure the distance between feature vectors.
  • Distances are used for grouping examples and for finding dividing lines.
  • Decisions: which features to use, which distance measure, and possibly the relative weight/scale of different dimensions.
  • The reptile characteristics form a vector of integer and binary values.

Distance Metrics

  • Minkowski metric: the general formula for distance between vectors, dist(a, b) = (Σ_i |a_i − b_i|^p)^(1/p).
  • Manhattan distance (p = 1): sum of the absolute differences of the components; named for distance traveled on a street grid like Manhattan's.
  • Euclidean distance (p = 2): square root of the sum of squared differences; the default "as the crow flies" distance.
  • The choice of metric matters: an example shows a point closer to one object by Euclidean distance and to another by Manhattan distance.
  • Applying Euclidean distance to the reptile example (snakes, dart frog, alligator) reveals a problem: the number-of-legs dimension (integer 0 or 4) overwhelms the binary features (0 or 1) in the distance calculation.
  • This emphasizes that the scale of each dimension matters.
  • Using Manhattan distance, or converting 'legs' into a binary feature (has legs / no legs), corrects this and brings the alligator nearer to the snakes.
  • Feature engineering, including feature selection and the choice of scale/weights, has a large effect on outcomes.
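The metrics above can be written as a single function. The feature vectors below (egg-laying, scales, venomous, cold-blooded, legs) are illustrative values in the spirit of the lecture's reptile example, not its exact data.

```python
def minkowski_dist(v1, v2, p):
    """Minkowski distance: (sum_i |v1_i - v2_i|**p) ** (1/p).
    p=1 gives the Manhattan metric, p=2 the Euclidean metric."""
    return sum(abs(a - b) ** p for a, b in zip(v1, v2)) ** (1 / p)

# Features: (egg-laying, scales, venomous, cold-blooded, number of legs).
boa       = (0, 1, 0, 1, 0)
alligator = (1, 1, 0, 1, 4)
dart_frog = (1, 0, 1, 0, 4)

# With legs as an integer, that dimension dominates: the alligator
# comes out closer to the dart frog than to the boa.
d_boa  = minkowski_dist(alligator, boa, 2)        # ~4.12
d_frog = minkowski_dist(alligator, dart_frog, 2)  # ~1.73

# Making legs binary (has legs / no legs) restores sensible distances.
boa_b, alligator_b, frog_b = (0, 1, 0, 1, 0), (1, 1, 0, 1, 1), (1, 0, 1, 0, 1)
d_boa_b  = minkowski_dist(alligator_b, boa_b, 2)   # ~1.41: nearer the boa
d_frog_b = minkowski_dist(alligator_b, frog_b, 2)  # ~1.73
```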

Clustering Method (Unsupervised Learning)

Objective: cluster unlabeled data.

  • Basic algorithm (K-means-like):
    1. Decide the number of clusters, K.
    2. Choose K examples as initial exemplars (ideally far apart from each other).
    3. Assign every other example to the cluster of its nearest exemplar.
    4. Take the median element of each resulting cluster as the new exemplar. (Note: the transcript says median; standard K-means uses the mean.)
    5. Repeat until the exemplars no longer change significantly.
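The steps above can be sketched as a minimal K-means. This sketch uses the mean as the new exemplar, as standard K-means does (the transcript mentions the median), and the sample points are invented.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means sketch: pick K exemplars, assign each point to
    the nearest one, recompute centroids, repeat until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # index of the centroid with the smallest squared distance
            nearest = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(vals) / len(c) for vals in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated (made-up) groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, 2)
```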

  • Applying this to the football data produces clusters.
  • Validation: after forming clusters, test how well they perform, particularly on new data. The running-back example shows that the two-cluster model has overlap/ambiguity.
  • Using three clusters for the football data gives a cleaner separation (receivers, tight ends/running backs, linemen).
  • Choosing the number of clusters and preventing overfitting (e.g., 100 clusters for 100 examples) are significant concerns.


Classification Method (Supervised Learning)

Goal: learn a rule that assigns new examples to the correct group, based on labeled data.

  • Method: find the least complex surface (a line in 2D, a plane in 3D) that separates the groups. More complex surfaces may be needed but risk overfitting.

K Nearest Neighbors (KNN)

  • For a new example, find the K nearest labeled examples and let them vote on the label.
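KNN can be sketched in a few lines; `math.dist` computes Euclidean distance. The labeled training points below are invented for illustration, not from the lecture.

```python
import math
from collections import Counter

def knn_classify(new_point, labeled_points, k=3):
    """Label new_point by majority vote of its k nearest labeled examples.
    labeled_points is a list of (features, label) pairs."""
    nearest = sorted(labeled_points, key=lambda pl: math.dist(new_point, pl[0]))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

# Made-up 2-D training data: two clusters with known labels.
training = [
    ((0, 0), "receiver"), ((1, 0), "receiver"), ((0, 1), "receiver"),
    ((10, 10), "lineman"), ((11, 10), "lineman"), ((10, 11), "lineman"),
]
label = knn_classify((1, 1), training)  # 'receiver'
```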

Evaluating Classifiers

  • Voter example: predicting voters (Republican/Democrat) from age and distance from Boston. The training data overlaps, so a perfect separating line is not possible. Several candidate lines are illustrated.

How can we compare which classifier is superior?

Confusion Matrix

  • Tabulates predicted labels against actual labels.
  • True Positives (TP): correctly labeled as the target class.
  • True Negatives (TN): correctly labeled as not the target class.
  • False Positives (FP): incorrectly labeled as the target class (Type I error).
  • False Negatives (FN): incorrectly labeled as not the target class (Type II error).

Accuracy

  • Calculated as (TP + TN) / total examples; indicates overall correctness. Simple lines achieve ~70% accuracy on the training data.
  • More complex models can reduce training error (higher accuracy, e.g., 83.3%) but risk overfitting.
  • Validation: apply the model to test data to assess generalization.

Other Measures

  • Positive Predictive Value (PPV): TP / (TP + FP), the fraction of positive labels that are correct.
  • Sensitivity (recall): TP / (TP + FN), the fraction of actual positives correctly identified.
  • Specificity: TN / (TN + FP), the fraction of actual negatives correctly identified.
  • There is usually a trade-off between sensitivity and specificity: increasing one may decrease the other.
  • The Receiver Operating Characteristic (ROC) curve plots this trade-off.
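These measures follow directly from the confusion-matrix counts. A minimal sketch; the counts below are invented for illustration.

```python
def classifier_metrics(tp, fp, tn, fn):
    """Standard evaluation measures computed from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "ppv": tp / (tp + fp),          # positive predictive value
        "sensitivity": tp / (tp + fn),  # recall
        "specificity": tn / (tn + fp),
    }

# Hypothetical counts for a classifier evaluated on 100 examples.
m = classifier_metrics(tp=30, fp=10, tn=40, fn=20)
# m["accuracy"] == 0.7, m["ppv"] == 0.75,
# m["sensitivity"] == 0.6, m["specificity"] == 0.8
```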

Conclusion

  • The following lectures will explore these ideas further, including objective functions and optimization techniques.
  • Takeaways: how to measure distance between examples, how to design features, and how to manage model constraints (such as the number of clusters or classifier complexity) to avoid overfitting.