Python Tutorial: Central limit theorem

Want to learn more? Take the full course at https://learn.datacamp.com/courses/practicing-statistics-interview-questions-in-python at your own pace. More than a video, you’ll learn hands-on coding & quickly apply skills to your daily work.

We’ve established a solid base with conditional probabilities. Now, let’s get into the central limit theorem, or CLT: what it is, why it’s important, and how to visualize it in Python.

The central limit theorem says that if we repeatedly draw sufficiently large samples from the same population, the means of those samples will be approximately normally distributed.

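Note that this makes very few assumptions about the underlying distribution of the data: the observations need to be independent, and the population needs a finite variance. With a reasonably large sample size, where 30 or more is a common rule of thumb, the approximation tends to hold well whatever the population looks like, although heavily skewed populations may need larger samples.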
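To make that concrete, here is a minimal sketch that draws sample means from a clearly non-normal population. The exponential population, the seed, the sample sizes, and the helper names `sample_means` and `skewness` are all assumptions chosen for illustration; they do not come from the course.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for repeatability


def sample_means(sample_size, n_samples=10_000):
    """Means of n_samples samples, each of sample_size draws from Exponential(1)."""
    draws = rng.exponential(scale=1.0, size=(n_samples, sample_size))
    return draws.mean(axis=1)


def skewness(values):
    """Sample skewness; close to 0 for a symmetric, normal-looking distribution."""
    centered = values - values.mean()
    return (centered ** 3).mean() / values.std() ** 3


# The population itself is heavily right-skewed, yet the skew of the
# sample means shrinks toward 0 as the sample size grows.
for n in (2, 30, 200):
    means = sample_means(n)
    print(f"n={n:>3}  mean={means.mean():.3f}  skew={skewness(means):.3f}")
```

The skewness should shrink as the sample size goes up, which is the distribution of sample means drifting toward a symmetric bell shape.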

The central limit theorem matters because it tells us the sampling distribution of the mean will be approximately normal, which is what lets us perform hypothesis tests. More concretely, we can assess the likelihood that a given sample mean came from a particular distribution and then, based on this, reject or fail to reject our hypothesis. This underpins much of the A/B testing you see in practice.
For this reason, interviewers love this topic. Be sure to have a well-thought-out answer prepared.
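As a rough illustration of that idea, the sketch below runs a two-sided z-test on the mean of some simulated die rolls, leaning on the normal approximation from the CLT. The function name `z_test_mean`, the seed, and the sample size of 500 are hypothetical choices, not something taken from the course.

```python
import math

import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed, only for repeatability


def z_test_mean(sample, mu0):
    """Two-sided z-test of H0: population mean == mu0 (CLT normal approximation)."""
    n = len(sample)
    standard_error = np.std(sample, ddof=1) / math.sqrt(n)
    z = (np.mean(sample) - mu0) / standard_error
    # Two-sided tail probability under a standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


rolls = rng.integers(1, 7, size=500)  # fair die; the upper bound is exclusive

z, p = z_test_mean(rolls, mu0=3.5)
print(f"H0: mean = 3.5 -> z = {z:.2f}, p = {p:.3f}")

z, p = z_test_mean(rolls, mu0=5.0)
print(f"H0: mean = 5.0 -> z = {z:.2f}, p = {p:.3g}")
```

A small p-value says the observed mean would be very unlikely if the hypothesized mean were true, so we would reject that hypothesis; otherwise we fail to reject it.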

It’s also worth mentioning that this is different from the law of large numbers. The law of large numbers states that as the size of a sample increases, the sample mean gets closer and closer to the population mean.

We see this here with the purple, red, and gold distributions representing small, medium, and large samples, respectively. The law of large numbers is about where the sample mean ends up as the sample grows, whereas the central limit theorem is about the shape of the distribution of sample means. The two are easy to mix up in a high-stress interview setting.
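A quick way to see the law of large numbers on its own is to track the running mean of one long sequence of die rolls. This is only a sketch; the seed and the number of rolls are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=2)  # arbitrary seed, only for repeatability

rolls = rng.integers(1, 7, size=100_000)  # fair die; the upper bound is exclusive

# running_mean[i] is the mean of the first i + 1 rolls
running_mean = np.cumsum(rolls) / np.arange(1, len(rolls) + 1)

for n in (10, 1_000, 100_000):
    print(f"after {n:>7} rolls: running mean = {running_mean[n - 1]:.4f}")
```

There is only one growing sample here and no distribution of sample means, which is exactly what separates this from the central limit theorem.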

We can run a simulation in Python to get the following plot showing rolls of a standard six-sided die. To do this, we’ll use the numpy randint function, where we pass the start, the end (which is exclusive, so 7 for a six-sided die), and the number of values that we want to randomly generate, along with the numpy mean function.

The sample means don’t look like much at first here, but their distribution slowly looks more and more normal around the true mean of 3.5, thanks to the central limit theorem at work. This simple matplotlib histogram shows only rolls 1 through 100, but you can imagine how this would continue if we upped the number of trials.
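The course’s exact code isn’t reproduced here, but a simulation along these lines could look like the sketch below. It uses `np.random.randint` and `np.mean` as described; the seed, `n_samples`, and `sample_size` are assumed values, and a text histogram stands in for the matplotlib plot so the snippet has no plotting dependency.

```python
import numpy as np

np.random.seed(3)  # arbitrary seed so the run is repeatable

n_samples = 1_000  # assumed: how many samples to draw
sample_size = 30   # assumed: die rolls per sample

# randint's upper bound is exclusive, so (1, 7) yields the faces 1 through 6
means = np.array(
    [np.mean(np.random.randint(1, 7, sample_size)) for _ in range(n_samples)]
)

# Bin the sample means and print a rough text histogram
counts, edges = np.histogram(means, bins=10)
for count, left_edge in zip(counts, edges[:-1]):
    print(f"{left_edge:5.2f} | {'#' * (count // 10)}")

print(f"mean of sample means: {means.mean():.3f}")
```

Passing `means` to matplotlib’s `plt.hist` would draw the same picture as a proper figure.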

Before we wrap up, let’s cover list comprehension. List comprehension is a pretty cool Python trick that comes in handy for setting up these numpy simulations and certain coding interview questions.

Here you see a snippet of some code that’s designed to take in our list and square each value. List comprehension tightens this up by allowing you to execute your for loop in only one line, giving us the same answer.
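For reference, the two versions might look something like this. The input list and the variable names are made up for illustration.

```python
values = [1, 2, 3, 4, 5]  # hypothetical input list

# Version 1: an explicit for loop that squares each value
squares_loop = []
for v in values:
    squares_loop.append(v ** 2)

# Version 2: the same logic as a one-line list comprehension
squares_comp = [v ** 2 for v in values]

print(squares_loop)
print(squares_comp)
```

Both lists come out identical; the comprehension just folds the loop and the append into a single expression.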

Wrapping things up, let’s summarize what we learned. We talked about the central limit theorem, what it is, and why it matters; we touched on the law of large numbers; we looked at a simulation of the CLT in Python; and finally, we went over list comprehension.

Remember, interviewers love the central limit theorem, and it’s really fundamental to data science, so it’s worth getting comfortable with the topic.

But enough on the CLT for now. Let’s get to some coding exercises!

Post Author: hatefull