13. Classification by MIT OpenCourseWare


Summary by www.lecturesummary.com


Introduction & Supervised Learning

    • Introduction to **supervised learning** and **classification**.
    • The lecture starts with announcements and notes that an additional Python concept, list comprehension, will be introduced later.
    • The main topic is **classification**, a kind of supervised learning.
    • Supervised learning is categorized into two types: **Regression** (outputs a real number) and **Classification** (outputs a **discrete value** or **label**).
    • Classification occurs frequently in machine learning.
    • Examples of classification include forecasting reactions to a drug (sick/not sick) or predicting grades (A, B, C, D).
    • A sample dataset of animals with attributes and a 'reptile' label is presented.

Classification - Nearest Neighbor (NN)

    • The most basic method of classification is **Nearest Neighbor (NN)**.
    • Learning is straightforward: store the training data.
    • To predict a new example's label, find the nearest example in the training data.
    • An example with red/black dots illustrates this method.
    • Applying NN to the animal example correctly labels a zebra and a python, but misclassifies an alligator.
    • The issue with NN is its susceptibility to **noisy data** or outliers.
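The store-then-find-nearest procedure described above can be sketched in a few lines. This is a minimal illustration, not the lecture's actual code; the red/black dataset loosely follows the dot illustration mentioned above.

```python
import math

def nn_predict(train, point):
    """Nearest Neighbor: store the training data, then label a new
    point with the label of its single closest training example."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    _, label = min(train, key=lambda ex: dist(ex[0], point))
    return label

# Tiny red/black dot dataset: (features, label) pairs.
train = [((0.0, 0.0), "red"), ((1.0, 1.0), "red"), ((5.0, 5.0), "black")]
print(nn_predict(train, (0.5, 0.5)))  # -> red
```

A single mislabeled or noisy point near the query can flip this prediction, which is exactly the weakness noted below.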

Classification - K Nearest Neighbors (KNN)

    • To avoid noisy data issues, K Nearest Neighbors (KNN) is typically employed.
    • KNN considers a count (`K`) of closest neighbors, which **vote** on the label.
    • An example with K=3 illustrates how voting can counteract outliers.
    • KNN is generally **far more reliable** than single Nearest Neighbor.
    • KNN can label correctly even in the presence of noisy neighbors.
    • There is a question of whether there is a limit to K.
    • Efficiency matters: larger K is more time-consuming.
    • A **key problem** with larger K is the potential for results to be dominated by the **most frequent class**.
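The voting mechanism described above can be sketched by extending single Nearest Neighbor with a majority vote. This is an illustrative example, not the lecture's code; the deliberately planted outlier shows how K=3 counteracts noise where K=1 is fooled.

```python
import math
from collections import Counter

def knn_predict(train, point, k=3):
    """KNN: majority vote among the k training examples closest to point."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbors = sorted(train, key=lambda ex: dist(ex[0], point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# One noisy "black" point sits among the reds.
train = [((0.0, 0.0), "red"), ((1.0, 0.0), "red"),
         ((0.5, 0.5), "black"),                     # outlier / noise
         ((5.0, 5.0), "black"), ((6.0, 5.0), "black")]

print(knn_predict(train, (0.4, 0.4), k=1))  # -> black (fooled by the outlier)
print(knn_predict(train, (0.4, 0.4), k=3))  # -> red (the vote overrides the noise)
```

Note that as k grows toward the dataset size, the vote degenerates into always predicting the most frequent class, which is the key problem flagged above.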

Classification - Selecting K for KNN (Cross Validation)

    • The technique for selecting the best value of `K` in KNN is explained.
    • It's similar to selecting K for K-means clustering.
    • The standard technique is called **cross validation**.
    • Divide your training data into smaller sets for training and testing.
    • Try various values of `K` on these splits to determine the best performance.
    • This helps choose the parameter value (here, `K`) for the algorithm.
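The split-and-try procedure above can be sketched as follows. This is a simplified illustration of cross validation using repeated 80/20 splits, with a synthetic two-cluster dataset standing in for real training data; it is not the lecture's code.

```python
import math
import random
from collections import Counter

def knn_predict(train, point, k):
    """Majority vote among the k training examples closest to point."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbors = sorted(train, key=lambda ex: dist(ex[0], point))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def choose_best_k(data, candidate_ks, n_splits=5):
    """Pick the k with the best average accuracy over repeated 80/20
    splits of the training data (a simple form of cross validation)."""
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        accs = []
        for _ in range(n_splits):
            shuffled = data[:]
            random.shuffle(shuffled)
            cut = int(0.8 * len(shuffled))
            train, held_out = shuffled[:cut], shuffled[cut:]
            correct = sum(knn_predict(train, x, k) == y for x, y in held_out)
            accs.append(correct / len(held_out))
        avg = sum(accs) / len(accs)
        if avg > best_acc:
            best_k, best_acc = k, avg
    return best_k

# Synthetic two-cluster data, just to exercise the selection loop.
random.seed(0)
data = ([((random.gauss(0, 1), random.gauss(0, 1)), "a") for _ in range(20)]
        + [((random.gauss(5, 1), random.gauss(5, 1)), "b") for _ in range(20)])
best = choose_best_k(data, [1, 3, 5, 7])
```

Crucially, only the training data is split here; the real held-out test set is never used to pick K.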

Classification - Managing Ties in KNN Voting

    • What occurs in KNN if the votes of neighbors tie?
    • You may be "kind of stuck" in this situation.
    • To prevent ties, set `K` so that there will always be a majority.
    • For two classes, any odd value of K avoids ties.

Classification - KNN Advantages and Disadvantages

    • **Advantages of KNN**:
      • Learning is fast, since it only requires memorizing the training instances.
      • No sophisticated mathematical theory or model construction is needed.
      • The approach and its results are **simple to explain**.
    • **Disadvantages of KNN**:
      • It is **memory-hungry**, since all training instances must be stored.
      • It can be **slow**, because classifying each new example requires comparing it to the stored training data.
      • KNN is often inefficient for large prediction jobs.
      • KNN does **not report on the underlying process** that produced the data.

Example Application - The Titanic Disaster

    • The lecture applies classification methods to predicting passenger survival on the Titanic.
    • The data includes cabin class, age, and sex details for 1,046 passengers.

Model Evaluation

The Need for Metrics Beyond Accuracy

    • A key question is how to evaluate a machine learning model. Simply using accuracy is often insufficient, especially when there is a class imbalance (one class is much more common than others).

      For example, on the Titanic, simply predicting that everyone perished would have:

      • 62% accuracy for passengers
      • 76% accuracy for crew

      However, this model has no utility for determining who actually survived. In medical scenarios, predicting that no one has a rare disease can have very high accuracy but zero utility. Therefore, additional metrics need to be used for evaluation.

Typical Measures for Evaluation

      • Sensitivity (also referred to as Recall): the fraction of positive cases the model detects (e.g., correctly identifying those who died).
      • Specificity: the fraction of negative cases the model detects (e.g., correctly identifying those who did not die).
      • Positive Predictive Value (also referred to as Precision): if the model predicts positive, what is the chance the case is really positive?
      • Negative Predictive Value: if the model predicts negative, what is the chance the case is really negative?

      These measures say different things, and the selection of which to prefer depends on the use. For instance, breast cancer screening may focus on sensitivity to not miss cases, whereas risky surgery decisions may focus on specificity to avoid unnecessary procedures.
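The metrics above can all be computed from the four confusion-matrix counts. A minimal sketch, assuming "survived" is treated as the positive class so that a model predicting "perished" for everyone (62 of 100 passengers, matching the figure above) scores 62% accuracy yet has zero sensitivity:

```python
def evaluate(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts
    (true/false positives and negatives)."""
    safe = lambda n, d: n / d if d else 0.0   # guard against 0/0
    return {
        "accuracy":    safe(tp + tn, tp + fp + tn + fn),
        "sensitivity": safe(tp, tp + fn),   # recall: positives detected
        "specificity": safe(tn, tn + fp),   # negatives detected
        "ppv":         safe(tp, tp + fp),   # precision
        "npv":         safe(tn, tn + fn),
    }

# 100 passengers, 38 survivors (positives): predicting "perished" for
# everyone gives respectable accuracy but detects no positives at all.
m = evaluate(tp=0, fp=0, tn=62, fn=38)
print(m["accuracy"], m["sensitivity"])  # 0.62 accuracy, 0.0 sensitivity
```

This is exactly why accuracy alone is insufficient under class imbalance.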

Model Testing

Methods for Evaluating Classifiers

    • How to validate a classifier is extremely crucial. Two approaches are explained:

      • Leave One Out Testing: Used for small datasets. Remove one sample from the dataset, train the model on the remaining n-1 samples, and test it on the single held-out sample. This uses the maximum amount of data for training in every iteration.
      • Repeated Random Subsampling: Applied for large datasets. You randomly divide the data into a training set (say, 80%) and a test set (say, 20%). This division, training, and testing is done multiple times with varying random divisions.

      These test procedures are akin to the procedure followed for parameter selection such as K in KNN. Python function templates for these functions are presented. A method known as lambda abstraction (or currying) is utilized to define functions suitable for the testing framework.
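The two testing procedures can be sketched as generic harnesses that accept any classifier as a function (this is where the lecture's lambda abstraction fits: a specific classifier such as `lambda train, test: knn_accuracy(train, test, k=3)` is wrapped to match the harness's interface; `knn_accuracy` here is hypothetical). The `majority_accuracy` stand-in classifier below is also purely illustrative.

```python
import random
from collections import Counter

def leave_one_out(data, train_and_test):
    """For small datasets: hold out each example in turn, train on the
    remaining n-1, and test on the single held-out example."""
    scores = [train_and_test(data[:i] + data[i + 1:], [data[i]])
              for i in range(len(data))]
    return sum(scores) / len(scores)

def repeated_random_subsampling(data, train_and_test, n_trials=10, frac=0.8):
    """For large datasets: repeatedly split into (say) 80% train / 20% test
    with varying random divisions, and average the scores."""
    scores = []
    for _ in range(n_trials):
        shuffled = data[:]
        random.shuffle(shuffled)
        cut = int(frac * len(shuffled))
        scores.append(train_and_test(shuffled[:cut], shuffled[cut:]))
    return sum(scores) / len(scores)

# Hypothetical stand-in classifier: always predict the training majority label.
def majority_accuracy(train, test):
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return sum(label == majority for _, label in test) / len(test)

data = [((i,), "a") for i in range(8)] + [((i,), "b") for i in range(4)]
loo_score = leave_one_out(data, majority_accuracy)
```

Because `train_and_test` is just a function argument, the same harness works unchanged for KNN, logistic regression, or any other classifier.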

KNN on Titanic Data

Results

    • KNN is executed on the Titanic data using Leave One Out and Repeated Random Subsampling. Both testing methods give very similar results. The KNN model does significantly better than simply always predicting "didn't survive" (better than 62% accuracy).

Classification

Introduction to Logistic Regression

    • Another technique is introduced, perhaps the most widely used in machine learning: Logistic Regression. In contrast to linear regression, which predicts a real number, logistic regression predicts the probability of an event.

      Logistic regression determines weights for every feature (gender, cabin class, age, etc.). A positive weight indicates the feature is positively related to the outcome label that is being predicted (like 'survived'). A negative weight indicates a negative relationship.

      An optimization process is employed to calculate these weights from the training data.

Logistic Regression - Python Implementation & Application

    • The lecture demonstrates the use of Logistic Regression through the sklearn Python library. Important steps involve importing sklearn.linear_model.LogisticRegression.

      The fit function is employed to train the model on feature vectors and their respective labels. This is where weight optimization occurs.

      The coef_ property of the trained model object yields the learned weights per feature.

      The predict_proba function is called to use the model with new data (feature vectors) and obtain the probabilities of various labels.

      A Python idea, list comprehension, is described and employed as a handy method to construct the list of test feature vectors.

      To convert the predicted probabilities from predict_proba to discrete labels, a threshold is applied. By default, the threshold is usually 0.5.

Logistic Regression on Titanic Data

Results & Insights

    • The Titanic data is run through Logistic Regression with both test methods. Logistic Regression is faster than KNN, since making predictions on new data is quick once the weights are learned.

Performance Comparison

    • Comparing performances, Logistic Regression slightly beats KNN in this dataset.

Strengths of Logistic Regression

      • Insight into variables: It provides insight through the exposed learned feature weights (coefficients).
      • Weights for Titanic data: Looking at the weights for the label 'survived' reveals:
      • Class impact: A big positive weight for being in first class, a moderate positive weight for second class, and an implied negative effect for third class. This indicates higher-class passengers were more likely to survive.
      • Age impact: A negative weight for age, implying older travelers were less likely to survive.
      • Gender impact: A highly negative weight for gender (male), indicating males were far less likely to survive.
      • Predictor insight: These weights convey significant insight into which factors were good predictors of survival in the model.

Caution on Weight Interpretation

      • Warning: A caution is issued against taking feature weights too literally.
      • Correlated features: Weights can be affected by the existence of correlated features, making individual effects difficult to interpret. This issue will be discussed further.