8. Sampling and Standard Error by MIT OpenCourseWare

Summary by www.lecturesummary.com: 8. Sampling and Standard Error by MIT OpenCourseWare


    Introduction and Inferential Statistics (0:00)

    • Lecture 8 is well over halfway through the course's lectures.
    • The subject of this lecture is sampling.
    • Inferential statistics is concerned with making inferences about populations based on the examination of one or more random samples from that population.
    • Monte Carlo simulation was employed in the last two lectures to produce random samples and calculate confidence intervals based on the empirical rule.
    • Attention then turns to sampling from something "real", where repeated experimentation is not feasible.
    • Political opinion polls illustrate such real-world sampling: a single poll must stand in for the many simulated trials used earlier.

    Probability Sampling (1:55)

    • In probability sampling, every member of the population has a non-zero chance of ending up in the sample.
    • There are two primary types: Simple Random Sampling and Stratified Sampling.
    • Simple Random Sampling: Every member of the population will have an equal chance of being selected, with no bias.
    • Stratified Sampling: Applied when simple random sampling would be unsuitable, e.g., when subgroups are not evenly represented in the population.
      • The population is divided into subgroups.
      • A simple random sample is drawn from each subgroup, in proportion to the size of the subgroups in the population.
    • Example: In a survey of MIT students, schools such as engineering and architecture should be represented in proportion to their sizes.
    • Political surveys frequently employ stratified sampling (e.g., sampling rural, urban, and minority voters).
    • Stratified sampling has the advantage of making sure small subgroups are included and might even lower the required sample size.
    • Doing it properly is challenging, so the lecture sticks with simple random samples; a sketch of proportional stratified sampling follows below.
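
To make the proportional-allocation idea concrete, here is a minimal Python sketch (not the lecture's code); the student data, the stratified_sample helper, and the subgroup sizes are all hypothetical.

```python
import random

def stratified_sample(population, get_stratum, sample_size):
    """Draw a simple random sample from each subgroup (stratum),
    sized in proportion to the subgroup's share of the population."""
    strata = {}
    for member in population:
        strata.setdefault(get_stratum(member), []).append(member)
    sample = []
    for members in strata.values():
        # Proportional allocation, rounded to a whole number of members
        k = round(sample_size * len(members) / len(population))
        sample.extend(random.sample(members, k))
    return sample

# Hypothetical MIT-style population: students tagged with a school
students = ([('engineering', i) for i in range(8000)]
            + [('architecture', i) for i in range(500)])
sample = stratified_sample(students, lambda s: s[0], 100)
print(len(sample), 'students sampled')  # the small school is still represented
```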

    Temperature Data Example (5:15)

    • A running example uses temperature observations in the United States.
    • The data come from the U.S. National Centers for Environmental Information.
    • It comprises daily high and low temperatures in 21 American cities from 1961 to 2015.
    • The data set consists of approximately 422,000 examples.
    • This section of the course (including this lecture) is on data science and data analysis.
    • The initial analysis looks at the data, not example by example, but through a plot (a histogram) to get a sense of its shape.
    • The code uses numpy.std to compute standard deviations and random.sample to draw simple random samples.
    • random.sample accepts a sequence and sample size, and it returns a list of randomly selected unique elements.
    • This is sampling without replacement, i.e., an element is removed from the population once it has been sampled.
    • Sampling with replacement enables one to draw the same example more than once, which will be covered later during the term.
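
A minimal sketch of the sampling mechanics described above. The lecture's code reads the real temperature file; here a synthetic stand-in population is generated instead so the sketch runs on its own.

```python
import random
import numpy as np

# Stand-in for the ~422,000 daily high temperatures (synthetic data;
# a real run would load the NCEI dataset instead).
population = [random.gauss(16.3, 9.4) for _ in range(422000)]

# random.sample returns unique elements: sampling WITHOUT replacement.
sample = random.sample(population, 100)
print('population std:', np.std(population))
print('sample std:    ', np.std(sample))

# Sampling WITH replacement (the same element can be drawn twice)
# would use random.choices instead; that topic comes later in the term.
resample = random.choices(population, k=100)
```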

    Examining Population Data and a Single Sample (9:05)

    • A histogram of the population's daily high temperatures has a mean of 16.3 degrees Celsius and a standard deviation of around 9.4 degrees.
    • The population is not normally distributed but not too far off, with a cold temperature tail.
    • The histogram of one random sample of size 100 looks rather different from the histogram of the population.
    • The single sample had a mean of 17.07 and a standard deviation of 10.4.
    • The standard deviation and sample mean are "in the same ballpark" as the population ones, and the question is whether this should be expected or a matter of chance.
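
A sketch of the comparison just described, again using a synthetic stand-in population rather than the real temperature file:

```python
import random
import numpy as np
import matplotlib.pyplot as plt

population = [random.gauss(16.3, 9.4) for _ in range(422000)]  # stand-in data

def plot_with_stats(data, title):
    plt.figure()
    plt.hist(data, bins=20)
    plt.title(f'{title}: mean={np.mean(data):.2f}, std={np.std(data):.2f}')
    plt.xlabel('Daily high temperature (C)')
    plt.ylabel('Number of days')

plot_with_stats(population, 'Population')
plot_with_stats(random.sample(population, 100), 'One sample of size 100')
plt.show()
```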

    Multiple Samples Analysis (11:52)

    • To determine whether the closeness is to be expected or a matter of luck, the experiment can be repeated: draw 1000 random samples of size 100 and look at the distribution of their means.
    • The code graphs the results, including a vertical line (pyplot.axvline) at the population mean.
    • The distribution of these 1000 sample means is much closer to a normal distribution than the original temperature distribution.
    • This is as expected because of the Central Limit Theorem, which says that the distribution of the means will be approximately normal regardless of the shape of the population distribution.
    • The sample means' mean is 16.3, which is quite close to the population mean.
    • The standard deviation of the sample means is 0.94.
    • This establishes that the similarity between a sample mean and the population mean is to be expected, not a coincidence.
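
The repeated-sampling experiment can be sketched as follows (synthetic stand-in data; the lecture's actual code differs in its details):

```python
import random
import numpy as np
import matplotlib.pyplot as plt

population = [random.gauss(16.3, 9.4) for _ in range(422000)]  # stand-in data

# Draw 1000 samples of size 100 and record each sample's mean.
sample_means = [np.mean(random.sample(population, 100)) for _ in range(1000)]

plt.hist(sample_means, bins=20)
# Vertical line at the true population mean, as in the lecture's plot.
plt.axvline(np.mean(population), color='r', label='population mean')
plt.xlabel('Sample mean')
plt.ylabel('Number of samples')
plt.legend()
plt.show()

print('mean of sample means:', np.mean(sample_means))  # ~16.3
print('std of sample means: ', np.std(sample_means))   # ~0.94
```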

      Confidence Intervals and Reducing Variability (13:56)

      A 95% confidence interval for the population mean, using the thousand samples, is the mean of the sample means plus or minus 1.96 times the standard deviation of the sample means (16.28 ± 1.96 × 0.94).

      • This provides a confidence interval from 14.5 to 18.1 degrees, which is a "pretty big range".
      • For an even tighter bound, one may want to draw more samples or draw larger samples.
      • Experiment indicates that drawing more samples (e.g., 2000 vs. 1000) hardly alters the standard deviation of the sample means.
      • Experiment shows that taking larger samples (e.g., size 200 instead of 100) dramatically reduces the standard deviation of the sample means (from 0.94 to 0.66).
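
A sketch of the confidence-interval arithmetic and the two experiments (more samples vs. larger samples), under the same synthetic stand-in assumptions:

```python
import random
import numpy as np

population = [random.gauss(16.3, 9.4) for _ in range(422000)]  # stand-in data

def mean_and_spread(sample_size, num_samples):
    means = [np.mean(random.sample(population, sample_size))
             for _ in range(num_samples)]
    return np.mean(means), np.std(means)

m, s = mean_and_spread(100, 1000)
# Empirical rule: ~95% of a normal distribution lies within 1.96 std devs.
print(f'95% CI: {m - 1.96 * s:.2f} to {m + 1.96 * s:.2f}')

# More samples of the same size barely change the spread...
print('std of means, 1000 samples of 100:', mean_and_spread(100, 1000)[1])
print('std of means, 2000 samples of 100:', mean_and_spread(100, 2000)[1])
# ...but larger samples shrink it noticeably (0.94 -> ~0.66 in the lecture).
print('std of means, 1000 samples of 200:', mean_and_spread(200, 1000)[1])
```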

       Visualizing Variability with Error Bars (16:52)

      • Variability of data is usually visualized with error bars.
      • An example from the literature plots pulse rate vs. frequency of exercise, with error bars showing 95% confidence intervals.
      • If the confidence intervals do not overlap, then the means are statistically significantly different (at the 95% level).
      • If confidence intervals do overlap, you cannot conclude that there is no statistically significant difference; more tests might be required.
      • Error bars for the temperature illustration can be graphed with pyplot.errorbar.
      • The plot illustrates the mean against sample size with error bars.
      • As sample size increases, the error bars get smaller, indicating more confidence in the estimate.
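
A sketch of an error-bar plot in the spirit of the lecture's, using pyplot.errorbar on the synthetic stand-in population:

```python
import random
import numpy as np
import matplotlib.pyplot as plt

population = [random.gauss(16.3, 9.4) for _ in range(422000)]  # stand-in data

sizes = range(50, 600, 50)
means, half_widths = [], []
for n in sizes:
    sample_means = [np.mean(random.sample(population, n)) for _ in range(100)]
    means.append(np.mean(sample_means))
    # Error bar = 95% confidence interval half-width
    half_widths.append(1.96 * np.std(sample_means))

plt.errorbar(sizes, means, yerr=half_widths, fmt='o', capsize=4)
plt.axhline(np.mean(population), color='r', linestyle='--',
            label='population mean')
plt.xlabel('Sample size')
plt.ylabel('Mean daily high (C)')
plt.legend()
plt.show()  # the bars shrink as the sample size grows
```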

      The Standard Error (23:38)

      Taking 100 samples of size 600 (60,000 examples in total) when the population holds only about 422,000 examples is wasteful; one might as well examine the entire population.

      • The issue is what can be inferred from a single sample, such as in political surveys.
      • The Central Limit Theorem is useful here, specifically its third component: the variance of the sample means will be close to the population variance divided by the sample size.
      • This gives rise to the idea of the Standard Error (SE), technically the standard error of the mean.
      • The formula for the standard error is SE = sigma / sqrt(n), where sigma is the population standard deviation and n is the sample size.
      • An experiment graphing the standard deviation of 50 sample means against the standard error of the mean for various sample sizes illustrates that they follow each other quite closely.
      • This means that the standard error, as computed from a single sample, can estimate how the standard deviation of sample means would be if several samples were drawn.
      • Difference between Standard Deviation and Standard Error: the standard deviation of the sample means is measured from many actual samples; the standard error is computed from a single sample (via the formula) to approximate what that standard deviation would be.
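
The agreement between the two quantities can be checked with a short sketch (synthetic stand-in population):

```python
import random
import numpy as np

population = [random.gauss(16.3, 9.4) for _ in range(422000)]  # stand-in data
sigma = np.std(population)

for n in (25, 100, 400, 1600):
    # Standard deviation of 50 actual sample means...
    sd_of_means = np.std([np.mean(random.sample(population, n))
                          for _ in range(50)])
    # ...tracks the standard error computed from sigma alone.
    se = sigma / np.sqrt(n)
    print(f'n={n:5d}  sd of sample means={sd_of_means:.3f}  SE={se:.3f}')
```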

      Estimating Standard Error (31:36)

      The catch: The formula for the standard error involves knowing the population standard deviation (sigma). If you have access to sigma, you might as well not sample.

      • Solution: Use the sample's standard deviation as a stand-in for the population standard deviation.
      • A comparison experiment between sample standard deviation and population standard deviation for various sample sizes indicates that sample standard deviation is a surprisingly good estimator for the population standard deviation when the sample size is large enough (e.g., approximately 500 for the temperature data).
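
A sketch of that comparison (synthetic stand-in population; the adequate size of roughly 500 is specific to the temperature data):

```python
import random
import numpy as np

population = [random.gauss(16.3, 9.4) for _ in range(422000)]  # stand-in data
sigma = np.std(population)

# Average absolute error of the sample std as an estimate of sigma.
for n in (25, 100, 500, 2000):
    errs = [abs(np.std(random.sample(population, n)) - sigma)
            for _ in range(50)]
    print(f'n={n:5d}  mean |sample std - sigma| = {np.mean(errs):.3f}')
```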

      Factors Influencing Sample Size Required (34:30)

      • Is the sample standard deviation a good estimate only for the temperature data?
      • Two factors to be considered: the population distribution and the population size.
      • Population Distribution: Experiments with uniform, normal, and exponential distributions show that how closely the sample standard deviation tracks the population standard deviation depends on the shape of the distribution.
      • Skew: The asymmetry of the distribution impacts the sample size needed to obtain a reliable estimate of the population standard deviation from the sample standard deviation.
      • Size of the Population: Surprisingly, the population size does not matter for the sample size required to estimate the mean.
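
A sketch of the distribution-shape experiment; the three generated populations are stand-ins for the lecture's uniform, normal, and exponential examples:

```python
import random
import numpy as np

def std_estimate_error(population, n, trials=50):
    """Average |sample std - population std| as a percentage of the
    population std, for samples of size n."""
    sigma = np.std(population)
    errs = [abs(np.std(random.sample(population, n)) - sigma)
            for _ in range(trials)]
    return 100 * np.mean(errs) / sigma

size = 100000
populations = {
    'uniform':     [random.uniform(0, 1) for _ in range(size)],
    'normal':      [random.gauss(0, 1) for _ in range(size)],
    'exponential': [random.expovariate(0.5) for _ in range(size)],  # skewed
}
for name, pop in populations.items():
    errors = [round(std_estimate_error(pop, n), 1) for n in (50, 200, 800)]
    print(f'{name:12s} % error at n=50, 200, 800: {errors}')
```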

      Procedure for Inference from a Single Sample (39:55)

      To estimate the mean of a population from a single sample:

      1. Get an estimate of the skew in the population to decide on an appropriate sample size. Too small a size leads to inaccurate answers; the aim, for reasons of economy, is the smallest size that gives an accurate answer.
      2. Select a random sample of that size from the population.
      3. Calculate the mean and standard deviation of the sample.
      4. Use the sample standard deviation to approximate the standard error (it is an approximation, not the actual standard error). It is a good approximation if the sample size was selected properly.
      5. Use the estimated standard error to create confidence intervals around the sample mean.
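
The whole procedure fits in a few lines; here is a sketch on the synthetic stand-in population (skew estimation is skipped and a size of 200 is simply assumed adequate):

```python
import random
import numpy as np

population = [random.gauss(16.3, 9.4) for _ in range(422000)]  # stand-in data

sample = random.sample(population, 200)          # steps 1-2: one sample
mean = np.mean(sample)                           # step 3: sample statistics
est_se = np.std(sample) / np.sqrt(len(sample))   # step 4: estimated SE
# Step 5: 95% confidence interval around the sample mean.
print(f'estimated mean: {mean:.2f} +/- {1.96 * est_se:.2f}')
```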

      Trying Out the Procedure and the Role of Independent Samples

      The procedure is valid only if independent random samples are selected; non-independent samples produce incorrect results. For example, sampling consecutive data points from the temperature dataset would give data from only one city.

      • An experiment checks whether a sample size of 200 is "sufficient" for the temperature data (see the sketch at the end of this summary).
      • The experiment takes random samples of size 200, calculates the sample mean and the approximate standard error (using the sample standard deviation), and checks whether the population mean lies outside the 95% confidence interval (more than 1.96 estimated standard errors from the sample mean).
      • The anticipated fraction of trials in which the population mean lies outside the 95% confidence interval is 0.05 (5%). If the outcome is suspiciously good (e.g., 0%), the method is too conservative; if it is far from 5%, the mathematics is probably wrong.
      • The experiment yields a fraction outside the interval of 0.0511, close to the anticipated 5%.
      • This is reassuring: taking a single sample, calculating its mean and standard deviation, approximating the standard error from the sample standard deviation, and using that to build confidence intervals is a sound approach.
      • The concept of the standard error is crucial.
      • The next topic will be fitting appropriate curves to experimental data.
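
Finally, the checking experiment referenced above, sketched on the synthetic stand-in population (the lecture's run reported a fraction of 0.0511):

```python
import random
import numpy as np

population = [random.gauss(16.3, 9.4) for _ in range(422000)]  # stand-in data
pop_mean = np.mean(population)

num_trials, num_outside = 10000, 0
for _ in range(num_trials):
    sample = random.sample(population, 200)
    est_se = np.std(sample) / np.sqrt(200)
    # Does the true mean fall outside the 95% confidence interval?
    if abs(np.mean(sample) - pop_mean) > 1.96 * est_se:
        num_outside += 1
print('fraction outside 95% CI:', num_outside / num_trials)  # expect ~0.05
```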