Python Tutorial: Probability distributions

Want to learn more? Take the full course at https://learn.datacamp.com/courses/practicing-statistics-interview-questions-in-python at your own pace. More than a video, you’ll learn hands-on coding & quickly apply skills to your daily work.

Let’s discuss probability distributions. We’ll review what a probability distribution is exactly, why it’s important, and then home in on the four distributions that are most common in interviews.

Probability distributions are fundamental to statistics, similar to the way that data structures are to computer science. Simply put, they describe the likelihood of an outcome.

A distribution can be discrete, like the roll of a die, or continuous, like the amount of rainfall. Either way, the total probability must be 1: the probabilities of a discrete distribution sum to 1, and for a continuous one, like the example shown here, the total area under the curve is 1.
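As a quick sanity check of that "adds up to 1" property, here is a minimal sketch using only the standard library. The fair die and the standard normal curve are illustrative choices, not something fixed by the course, and the integration bounds and step size are arbitrary.

```python
from fractions import Fraction
import math

# Discrete case: a fair six-sided die gives each face probability 1/6.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}
discrete_total = sum(die_pmf.values())


def normal_pdf(x):
    """Density of a standard normal distribution (mean 0, std 1)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


# Continuous case: approximate the area under the curve on [-8, 8]
# with a simple Riemann sum. The tails beyond +/-8 are negligible.
step = 0.001
continuous_total = sum(normal_pdf(-8 + i * step) * step for i in range(16000))

print("die probabilities sum to:", discrete_total)
print("area under normal curve:", round(continuous_total, 6))
```

The discrete total is exact because it uses fractions; the continuous total is only a numerical approximation of the area.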

There are hundreds of distributions out there, but only a handful actually turn up in practice. In this course, we’ll address only those most likely to come up in your next interview.

These include binomial, Bernoulli, normal, and Poisson. We’ll use the rvs method from scipy.stats to simulate all of these distributions before you visualize them using matplotlib. Let’s talk a bit more about each one.
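To make the simulation step concrete, here is a rough sketch of drawing samples from all four distributions with rvs. The sample size, seed, and parameter values (p, n, loc, scale, mu) are assumptions picked for illustration, not values from the course.

```python
from scipy import stats

n_samples = 1000  # arbitrary sample size
seed = 42         # arbitrary seed so the draws are repeatable

# One array of random draws per distribution.
samples = {
    "bernoulli": stats.bernoulli.rvs(p=0.5, size=n_samples, random_state=seed),
    "binomial": stats.binom.rvs(n=2, p=0.5, size=n_samples, random_state=seed),
    "normal": stats.norm.rvs(loc=0, scale=1, size=n_samples, random_state=seed),
    "poisson": stats.poisson.rvs(mu=3, size=n_samples, random_state=seed),
}

for name, draws in samples.items():
    print(name, "first draws:", draws[:5], "sample mean:", round(draws.mean(), 3))
```

Each array could then be passed to a matplotlib histogram (for example plt.hist) to see the shape of the distribution.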

First up is Bernoulli, a discrete distribution that models a single trial with two possible outcomes. Here we see the results of a coin flip, a common Bernoulli example. Both heads and tails have the same probability of 0.5, so the values are roughly even in this sample.

Since there are only two possible outcomes in Bernoulli, the probability of one is always 1 minus the probability of the other.
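A small simulation can illustrate both points: a fair coin lands on each side about half the time, and the share of one outcome is always 1 minus the share of the other. This is only a sketch with the standard library; the seed and the number of flips are arbitrary choices.

```python
import random

rng = random.Random(0)  # arbitrary seed for repeatability
p_heads = 0.5           # fair coin
n_flips = 10_000        # arbitrary number of flips

# Encode heads as 1 and tails as 0, one Bernoulli trial per flip.
flips = [1 if rng.random() < p_heads else 0 for _ in range(n_flips)]

share_heads = sum(flips) / n_flips
share_tails = 1 - share_heads  # only two outcomes, so this is the complement

print("share of heads:", share_heads)
print("share of tails:", share_tails)
```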

Next up is the binomial distribution, which can be thought of as the sum of the outcomes of multiple independent Bernoulli trials, each of which has a defined success and failure.
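One way to convince yourself of the "sum of Bernoulli trials" idea is to add up simulated trials and compare the frequencies with the binomial formula. This is a sketch under assumed values: n, p, the seed, and the number of repetitions are all hypothetical.

```python
import math
import random
from collections import Counter

rng = random.Random(1)   # arbitrary seed
n, p = 10, 0.3           # hypothetical number of trials and success probability
repetitions = 20_000     # arbitrary number of simulated experiments


def bernoulli_trial(prob):
    """Return 1 (success) with probability prob, else 0 (failure)."""
    return 1 if rng.random() < prob else 0


# Each experiment is the sum of n Bernoulli trials.
counts = Counter(
    sum(bernoulli_trial(p) for _ in range(n)) for _ in range(repetitions)
)

max_gap = 0.0
for k in range(n + 1):
    simulated = counts[k] / repetitions
    exact = math.comb(n, k) * p**k * (1 - p) ** (n - k)  # binomial probability
    max_gap = max(max_gap, abs(simulated - exact))
    print(k, round(simulated, 4), round(exact, 4))

print("largest gap between simulation and formula:", round(max_gap, 4))
```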

It’s used to model the number of successful outcomes in a fixed number of trials where there is some consistent probability of success. These quantities are often referred to as k, the number of successes, n, the number of trials, and p, the probability of success. You can pass them to the cdf and pmf functions in scipy.

Here we see results of a sample representing the number of heads in two consecutive coin flips using a fair coin, taking the form of a binomial distribution.
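For the two-flip example, the pmf and cdf calls might look like the sketch below. The choice of k = 1 for the cdf is just for illustration.

```python
from scipy import stats

n, p = 2, 0.5  # two flips of a fair coin

# pmf: probability of exactly k heads, for every possible k.
pmf_by_k = {k: float(stats.binom.pmf(k, n, p)) for k in range(n + 1)}

# cdf: probability of at most k heads (here k = 1).
prob_at_most_one = float(stats.binom.cdf(1, n, p))

print("P(exactly k heads):", pmf_by_k)
print("P(at most 1 head):", prob_at_most_one)
```

Note the argument order: the count k comes first, followed by n and p.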

We talked a little about the normal distribution when we worked through the central limit theorem, but it’s well worth its own slide here.

The normal distribution is a bell-curve shaped continuous probability distribution that is fundamental to many statistics concepts, like sampling and hypothesis testing.

Here we see the normal distribution with numbers overlaid that serve as a reminder of the 68-95-99.7 rule, which says that approximately 68 percent of observations fall within 1 standard deviation of the mean, 95 percent within 2 standard deviations, and 99.7 percent within 3 standard deviations. It’s good to have this memorized.
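If you want to check those percentages rather than just memorize them, here is a minimal sketch. It uses the standard library's math.erf; differencing scipy.stats.norm.cdf at plus and minus k should give the same numbers.

```python
import math


def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


coverage = {k: share_within(k) for k in (1, 2, 3)}

for k, share in coverage.items():
    print(f"within {k} standard deviation(s): {share:.4f}")
```

The result does not depend on the mean or standard deviation, since the rule is stated in units of standard deviations.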

Like the binomial distribution, the Poisson distribution represents a count or the number of times something happened. It’s parameterized not by a probability p and number of trials n, but by an average rate, denoted lambda.

Here, we can see a few Poisson curves given different values of lambda. As the rate of events changes, the distribution changes as well.
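To see how the curve shifts with lambda, the sketch below evaluates the Poisson pmf for a few rates and reports the most likely count for each. The three lambda values and the range of counts are hypothetical, chosen only to show the effect.

```python
import numpy as np
from scipy import stats

counts = np.arange(0, 30)  # event counts to evaluate
rates = [1.5, 4.5, 9.5]    # hypothetical values of lambda

modes = {}
for lam in rates:
    pmf = stats.poisson.pmf(counts, mu=lam)  # scipy calls lambda "mu"
    modes[lam] = int(counts[pmf.argmax()])   # most likely number of events
    print(f"lambda={lam}: most likely count = {modes[lam]}")
```

As lambda grows, the peak of the distribution moves to the right and the curve spreads out.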

Poisson is the way to go for counting events over a time interval given some constant average rate. In this example, you’re given a time interval and a rate. What’s the probability you see at least one shooting star in an hour?
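The transcript doesn't carry over the rate from the slide, so the sketch below assumes a hypothetical one: on average one shooting star every 15 minutes, or 4 per hour. The approach is to compute the probability of seeing none and subtract it from 1.

```python
import math
from scipy import stats

rate_per_hour = 4.0  # hypothetical rate, NOT the value from the slide
hours = 1.0          # length of the observation window
expected_count = rate_per_hour * hours  # lambda for the whole window

# P(at least one) = 1 - P(exactly zero)
p_none = float(stats.poisson.pmf(0, mu=expected_count))
p_at_least_one = 1 - p_none

# The same answer from the closed form 1 - exp(-lambda).
closed_form = 1 - math.exp(-expected_count)

print("P(at least one shooting star):", round(p_at_least_one, 4))
print("closed-form check:", round(closed_form, 4))
```

With a different rate or time window, only rate_per_hour and hours need to change.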

To summarize, we touched on what probability distributions are, went over common distribution types, and then dove into a few notable distributions in more depth.

Now let’s work through some exercises!

Post Author: hatefull