Statistics for Hackers

There's no shortage of absolutely magnificent material out there on the topics of data science and machine learning for an autodidact, such as myself, to learn from. In fact, so many great resources exist that an individual can be forgiven for not knowing where to begin their studies, or for getting distracted once they're off the starting block. I honestly can't count the number of times that I've started working through many of these online courses and tutorials only to have my attention stolen by one of the multitudes of amazing articles on data analysis with python, or some great new MOOC on Deep Learning. But this year is different! This year, for one of my new year's resolutions, I've decided to create a personalized data science curriculum and stick to it. This year, I promise not to just casually sign up for another course, or start reading yet another textbook to be distracted part way through. This year, I'm sticking to the plan.

As part of my personalized program of study, I've chosen to start with Harvard's Data Science course . I'm currently on week 3 and one of the suggested readings for this week is Jake VanderPlas' talk from PyCon 2016 titled "Statistics for Hackers". As I was watching the video and following along with the slides , I wanted to try out some of the examples and create a set of notes that I could refer to later, so I figured why not create a Jupyter notebook. Once I'd finished, I realized I'd created a decently-sized resource that could be of use to others working their way through the talk. The result is the article you're reading right now, the remainder of which contains my notes and code examples for Jake's excellent talk.

So, enjoy the article, I hope you find this resource useful, and if you have any problems or suggestions of any kind, the full notebook can be found on github , so please send me a pull request , or submit an issue , or just message me directly on Twitter .

In[2]: import numpy as np import matplotlib.pyplot as plt import pandas as pd import seaborn as sns # Suppress all warnings just to keep the notebook nice and clean. # This must happen after all imports since numpy actually adds its # RankWarning class back in. import warnings warnings.filterwarnings("ignore") # Setup the look and feel of the notebook sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5}) sns.set_style('whitegrid') sns.set_palette('deep') # Create a couple of colors to use throughout the notebook red = sns.xkcd_rgb['vermillion'] blue = sns.xkcd_rgb['dark sky blue'] from IPython.display import display %matplotlib inline %config InlineBackend.figure_format = 'retina'

The talk starts off with a motivating example that asks the question "If you toss a coin 30 times and see 22 heads, is it a fair coin?"

We all know that a fair coin should come up heads roughly 15 out of 30 tosses, give or take, so it does seem unlikely to see so many heads. However, the skeptic might argue that even a fair coin could show 22 heads in 30 tosses from time-to-time. This could just be a chance event. So, the question would then be "how can you determine if you're tossing a fair coin?"

The Classic Method

The classic method would assume that the skeptic is correct and would then test the hypothesis (i.e., the Null Hypothesis ) that the observation of 22 heads in 30 tosses could happen simply by chance. Let's start by first considering the probability of a single coin flip coming up heads and work our way up to 22 out of 30.

As our equation shows, the probability of a single coin toss turning up heads is exactly 50% since there is an equal chance of either heads or tails turning up. Taking this one step further, to determine the probability of getting 2 heads in a row with 2 coin tosses, we would need to multiply the probability of getting heads by the probability of getting heads again since the two events are independent of one another.

From the equation above, we can see that the probability of getting 2 heads in a row from a total of 2 coin tosses is 25%. Let's now take a look at a slightly different scenario and calculate the probability of getting 2 heads and 1 tails with 3 coin tosses.

The equation above tells us that the probability of getting 2 heads and 1 tails in 3 tosses is 12.5%. This is actually the exact same probability as getting heads in all three tosses, which doesn't sound quite right. The problem is that we've only calculated the probability for a single permutation of 2 heads and 1 tails; specifically for the scenario where we only see tails on the third toss. To get the actual probability of tossing 2 heads and 1 tails we will have to add the probabilities for all of the possible permutations, of which there are exactly three: HHT, HTH, and THH.

Another way we could do this is to calculate the total number of permutations and simply multiply that by the probability of each event happening. To get the total number of permutations we can use the binomial coefficient . Then, we can simply calculate the probability above using the following equation.

While the equation above works in our particular case, where each event has an equal probability of happening, it will run into trouble with events that have an unequal chance of taking place. To deal with those situations, you'll want to extend the last equation to take into account the differing probabilities. The result would be the following equation, where $N$ is number of coin flips, $N_H$ is the number of expected heads, $N_T$ is the number of expected tails, and $P_H$ is the probability of getting heads on each flip.

Now that we understand the classic method, let's use it to test our null hypothesis that we are actually tossing a fair coin, and that this is just a chance occurrence. The following code implements the equations we've just discussed above.

In[2]:

def factorial(n): """Calculates the factorial of `n` """ vals = list(range(1, n + 1)) if len(vals) <= 0: return 1 prod = 1 for val in vals: prod *= val return prod def n_choose_k(n, k): """Calculates the binomial coefficient """ return factorial(n) / (factorial(k) * factorial(n - k)) def binom_prob(n, k, p): """Returns the probability of see `k` heads in `n` coin tosses Arguments: n - number of trials k - number of trials in which an event took place p - probability of an event happening """ return n_choose_k(n, k) * p**k * (1 - p)**(n - k)

Now that we have a meth

Latest Images

Trending Articles

Latest Images