This establishes that the similarity between a sample mean and the population mean is to be expected, not a coincidence.
Confidence Intervals and Reducing Variability (13:56)
A 95% confidence interval for the population mean, using the thousand samples, is calculated as the sample mean plus or minus 1.96 times the standard deviation of the sample means (16.28 ± 1.96 * 0.94).
- This provides a confidence interval from 14.5 to 18.1 degrees, which is a "pretty big range".
- For an even tighter bound, one may want to draw more samples or draw larger samples.
- An experiment indicates that drawing more samples (e.g., 2000 vs. 1000) hardly alters the standard deviation of the sample means.
- A second experiment shows that taking larger samples (e.g., size 200 instead of 100) dramatically reduces the standard deviation of the sample means (from 0.94 to 0.66).
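The two experiments above can be sketched roughly as follows. This is a minimal illustration, not the lecture's code: the population is a hypothetical normal stand-in (mean 16.3, standard deviation 9.4) rather than the actual temperature data, and the helper names `sample_means` and `confidence_interval_95` are made up for this example.

```python
import random
import statistics


def sample_means(population, sample_size, num_samples, rng):
    """Return the mean of each of num_samples random samples."""
    return [statistics.mean(rng.sample(population, sample_size))
            for _ in range(num_samples)]


def confidence_interval_95(means):
    """Mean of the sample means plus or minus 1.96 standard deviations."""
    center = statistics.mean(means)
    half_width = 1.96 * statistics.pstdev(means)
    return center - half_width, center + half_width


rng = random.Random(0)
# Hypothetical stand-in population, NOT the lecture's temperature data.
population = [rng.gauss(16.3, 9.4) for _ in range(20_000)]

baseline = sample_means(population, 100, 1000, rng)        # 1000 samples of size 100
more_samples = sample_means(population, 100, 2000, rng)    # more samples, same size
larger_samples = sample_means(population, 200, 1000, rng)  # same count, larger samples

sd_baseline = statistics.pstdev(baseline)
sd_more = statistics.pstdev(more_samples)
sd_larger = statistics.pstdev(larger_samples)
low, high = confidence_interval_95(baseline)

print(f"95% CI from 1000 samples of size 100: {low:.2f} to {high:.2f}")
print(f"SD of means, 1000 samples of size 100: {sd_baseline:.2f}")
print(f"SD of means, 2000 samples of size 100: {sd_more:.2f}")
print(f"SD of means, 1000 samples of size 200: {sd_larger:.2f}")
```

With this stand-in population, doubling the number of samples should leave the standard deviation of the means roughly unchanged, while doubling the sample size should shrink it by a factor of about the square root of 2.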
Visualizing Variability with Error Bars (16:52)
- Variability of data is usually visualized with error bars.
- An example from the literature plots pulse rate vs. frequency of exercise, with error bars showing 95% confidence intervals.
- If the confidence intervals do not overlap, then the means are statistically significantly different (at the 95% level).
- If confidence intervals do overlap, you cannot conclude that there is no statistically significant difference; more tests might be required.
- Error bars for the temperature illustration can be plotted with pyplot.errorbar.
- The plot illustrates the mean against sample size with error bars.
- As sample size increases, the error bars get smaller, indicating more confidence in the estimate.
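A plot of this kind might be produced along the following lines. This is a sketch under assumptions: the population is a hypothetical normal stand-in, the sample sizes and the 100 trials per size are arbitrary choices, and `mean_and_error_bar` and `plot_error_bars` are illustrative names. Only `pyplot.errorbar` itself comes from the notes. The matplotlib import sits inside the plotting function so the numbers can be computed without it.

```python
import random
import statistics


def mean_and_error_bar(population, sample_size, num_trials, rng):
    """Return (mean of the sample means, 1.96 * SD of the sample means)."""
    means = [statistics.mean(rng.sample(population, sample_size))
             for _ in range(num_trials)]
    return statistics.mean(means), 1.96 * statistics.pstdev(means)


def plot_error_bars(sizes, centers, half_widths):
    """Plot mean vs. sample size with 95% error bars (requires matplotlib)."""
    import matplotlib.pyplot as plt
    plt.errorbar(sizes, centers, yerr=half_widths, fmt='o', capsize=3)
    plt.xlabel('Sample size')
    plt.ylabel('Mean of sample means')
    plt.title('Mean with 95% confidence intervals')
    plt.show()


rng = random.Random(1)
# Hypothetical stand-in population, NOT the lecture's temperature data.
population = [rng.gauss(16.3, 9.4) for _ in range(20_000)]

sizes = [50, 100, 200, 400]
results = [mean_and_error_bar(population, n, 100, rng) for n in sizes]
centers = [center for center, _ in results]
half_widths = [half for _, half in results]

for n, half in zip(sizes, half_widths):
    print(f"sample size {n}: error bar half-width {half:.2f}")

# Uncomment to draw the figure:
# plot_error_bars(sizes, centers, half_widths)
```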
The Standard Error (23:38)
To take 100 samples of size 600 (60,000 examples total) when the population is only 422,000 examples is wasteful; one might as well examine the entire population.
- The issue is what can be inferred from a single sample, such as in political surveys.
- The Central Limit Theorem is useful here, specifically its third component: the variance of the sample means will be close to the population variance divided by the sample size.
- This gives rise to the idea of the Standard Error (SE), technically the standard error of the mean.
- The equation for standard error is sigma / sqrt(n), where sigma is the population standard deviation and n is the sample size.
- An experiment graphing the standard deviation of 50 sample means against the standard error of the mean for various sample sizes illustrates that they follow each other quite closely.
- This means that the standard error, as computed from a single sample, can estimate what the standard deviation of the sample means would be if many samples were drawn.
- Difference between Standard Deviation and Standard Error: Standard deviation measures variation in a set of sample means; standard error is calculated from one sample to approximate what that standard deviation of means would be.
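The comparison between the standard error and the standard deviation of 50 sample means can be sketched as below. The population is a hypothetical stand-in, the sample sizes are arbitrary, and `standard_error` is an illustrative helper that implements the sigma / sqrt(n) formula from the notes.

```python
import math
import random
import statistics


def standard_error(sigma, n):
    """Standard error of the mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


rng = random.Random(2)
# Hypothetical stand-in population, NOT the lecture's temperature data.
population = [rng.gauss(16.3, 9.4) for _ in range(20_000)]
sigma = statistics.pstdev(population)  # population standard deviation

comparison = []
for n in (25, 100, 400):
    # Standard deviation of 50 sample means, measured directly.
    means = [statistics.mean(rng.sample(population, n)) for _ in range(50)]
    sd_of_means = statistics.pstdev(means)
    # Standard error predicted by the formula.
    se = standard_error(sigma, n)
    comparison.append((n, sd_of_means, se))
    print(f"n={n}: SD of 50 sample means = {sd_of_means:.3f}, SE = {se:.3f}")
```

Because only 50 means are used per sample size, the measured standard deviation is itself noisy, so the two columns should track each other without matching exactly.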
Estimating Standard Error (31:36)
The catch: The formula for the standard error involves knowing the population standard deviation (sigma). If you have access to sigma, you might as well not sample.
- Solution: Make the standard deviation of the sample a stand-in for the population standard deviation.
- A comparison experiment between sample standard deviation and population standard deviation for various sample sizes indicates that sample standard deviation is a surprisingly good estimator for the population standard deviation when the sample size is large enough (e.g., approximately 500 for the temperature data).
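One way to run such a comparison is sketched here. It assumes a hypothetical normal population and uses `statistics.stdev` (the n - 1 version) for the sample; the lecture's own code may compute the sample standard deviation differently. `mean_relative_sd_gap` is an illustrative name.

```python
import random
import statistics


def mean_relative_sd_gap(population, sample_size, num_trials, rng):
    """Average of |sample SD - population SD| / population SD over many samples."""
    pop_sd = statistics.pstdev(population)
    gaps = []
    for _ in range(num_trials):
        sample = rng.sample(population, sample_size)
        gaps.append(abs(statistics.stdev(sample) - pop_sd) / pop_sd)
    return statistics.mean(gaps)


rng = random.Random(3)
# Hypothetical stand-in population, NOT the lecture's temperature data.
population = [rng.gauss(16.3, 9.4) for _ in range(20_000)]

gap_by_size = {}
for n in (25, 100, 500):
    gap_by_size[n] = mean_relative_sd_gap(population, n, 100, rng)
    print(f"sample size {n}: mean relative gap = {gap_by_size[n]:.1%}")
```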
Factors Influencing Sample Size Required (34:30)
- Is the sample standard deviation a good estimate only for the temperature data?
- Two factors to be considered: the population distribution and the population size.
- Population Distribution: Experiments with uniform, normal, and exponential distributions reveal that the difference between the sample and population standard deviations depends on the distribution.
- Skew: The asymmetry of the distribution impacts the sample size needed to obtain a reliable estimate of the population standard deviation from the sample standard deviation.
- Size of the Population: Surprisingly, the population size does not matter for the sample size required to estimate the mean.
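The effect of the population distribution can be illustrated with a sketch like the following. The three populations are generated with the standard library's `random` module using arbitrary parameters, the sample size of 100 is an assumption, and `mean_relative_sd_gap` is an illustrative helper, not something from the lecture.

```python
import random
import statistics


def mean_relative_sd_gap(population, sample_size, num_trials, rng):
    """Average of |sample SD - population SD| / population SD over many samples."""
    pop_sd = statistics.pstdev(population)
    gaps = []
    for _ in range(num_trials):
        sample = rng.sample(population, sample_size)
        gaps.append(abs(statistics.stdev(sample) - pop_sd) / pop_sd)
    return statistics.mean(gaps)


rng = random.Random(4)
size = 20_000  # arbitrary population size for this illustration
populations = {
    "uniform": [rng.uniform(0.0, 1.0) for _ in range(size)],
    "normal": [rng.gauss(0.0, 1.0) for _ in range(size)],
    "exponential": [rng.expovariate(1.0) for _ in range(size)],
}

# Same sample size for every distribution, so only the shape differs.
gaps = {name: mean_relative_sd_gap(pop, 100, 200, rng)
        for name, pop in populations.items()}

for name, gap in gaps.items():
    print(f"{name}: mean relative gap at sample size 100 = {gap:.1%}")
```

The heavily skewed exponential population should show the largest gap at a given sample size, which is consistent with the point about skew above.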
Procedure for Inference from a Single Sample (39:55)
To estimate the mean of a population from a single sample:
- Get an estimate of the skew in the population to decide on an appropriate sample size. Choosing too small a size leads to inaccurate answers; economists want the smallest size that gives an accurate answer.
- Select a random sample of that number from the population.
- Calculate the mean and standard deviation of the sample.
- Apply the sample standard deviation to approximate the standard error (it's an approximation, not the actual standard error). It's a good approximation if the sample size was selected properly.
- Use the estimated standard error to construct confidence intervals around the sample mean.
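The steps above can be sketched as a single function. This is a minimal illustration: the population is a hypothetical stand-in, the sample size of 200 is assumed to have been chosen already, and `estimate_mean_with_ci` is a made-up name.

```python
import math
import random
import statistics


def estimate_mean_with_ci(population, sample_size, rng):
    """Estimate the population mean from ONE random sample.

    Returns (sample mean, estimated standard error, 95% confidence interval).
    """
    sample = rng.sample(population, sample_size)   # one random sample
    sample_mean = statistics.mean(sample)
    sample_sd = statistics.stdev(sample)           # stand-in for sigma
    est_se = sample_sd / math.sqrt(sample_size)    # estimated standard error
    half_width = 1.96 * est_se
    return sample_mean, est_se, (sample_mean - half_width, sample_mean + half_width)


rng = random.Random(5)
# Hypothetical stand-in population, NOT the lecture's temperature data.
population = [rng.gauss(16.3, 9.4) for _ in range(20_000)]

sample_mean, est_se, (low, high) = estimate_mean_with_ci(population, 200, rng)
print(f"sample mean = {sample_mean:.2f}, estimated SE = {est_se:.2f}")
print(f"95% confidence interval: {low:.2f} to {high:.2f}")
```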
Trying Out the Procedure and Role of Independent Samples
The procedure is valid if independent random samples are selected. Non-independent samples produce incorrect results. For example, sampling consecutive data points from the temperatures dataset would give data only from one city.
- An experiment checks whether a sample size of 200 is "sufficient" for the temperature data.
- The experiment takes random samples of size 200, calculates the sample mean and approximate standard error (using sample standard deviation), and verifies whether the population mean lies outside the 95% confidence interval (more than 1.96 standard errors from the sample mean).
- The population mean is expected to lie outside the 95% confidence interval 5% of the time (a fraction of 0.05). If the outcome is excessively good (e.g., 0%), then the method is too conservative; if it is far from 5%, then the mathematics is probably incorrect.
- The experiment yields a fraction outside the confidence interval of 0.0511, close to the anticipated 5%.
- This is reassuring that the approach of taking a single sample, calculating its mean and standard deviation, approximating the standard error from the sample standard deviation, and applying that to do confidence intervals is correct.
- The concept of standard error is crucial.
- The next topic will be fitting curves to experimental data.
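The coverage check described in this section (how often the population mean falls more than 1.96 estimated standard errors from the sample mean) can be sketched as follows. The population is a hypothetical normal stand-in, the number of trials (2000) is an arbitrary choice for this example, and `fraction_outside` is an illustrative name.

```python
import math
import random
import statistics


def fraction_outside(population, sample_size, num_trials, rng):
    """Fraction of trials where the population mean lies outside the 95% CI."""
    pop_mean = statistics.mean(population)
    outside = 0
    for _ in range(num_trials):
        sample = rng.sample(population, sample_size)
        # Estimated standard error from the sample's own standard deviation.
        est_se = statistics.stdev(sample) / math.sqrt(sample_size)
        if abs(statistics.mean(sample) - pop_mean) > 1.96 * est_se:
            outside += 1
    return outside / num_trials


rng = random.Random(6)
# Hypothetical stand-in population, NOT the lecture's temperature data.
population = [rng.gauss(16.3, 9.4) for _ in range(20_000)]

frac = fraction_outside(population, 200, 2000, rng)
print(f"Fraction outside 95% confidence interval: {frac:.4f}")
```

A result near 0.05 suggests the single-sample procedure is behaving as intended for this population; drawing consecutive rather than random elements would break the independence assumption and could push the fraction far from 5%.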