I’ve written previously about the probability of finding a shared birthday in a room full of people. I wanted to run some simulations on this using Python. As an aside, you’ll find some of the techniques below bear a similarity to that old interview question of checking whether a list contains duplicate elements.
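As a quick refresher, here is a minimal sketch of that duplicate check (my own illustration); it uses the same set trick the first simulation below relies on:

def has_duplicates(xs):
    # A set discards repeated values, so any duplicate shrinks its length.
    return len(set(xs)) != len(xs)

has_duplicates([1, 2, 3])   # False
has_duplicates([1, 2, 2])   # True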
The following code generates birthday samples for a given number of people, then works out the proportion of these samples that contain at least one shared birthday:
import numpy

def sharedbday1(people, simulations=50000):
    # Each row is one simulated room: draw a birthday (1-365) for every person.
    samples = numpy.random.random_integers(1, 365, size=(simulations, people))
    # A row has no shared birthday iff converting it to a set loses no elements.
    hasnoshared = map(lambda x: len(x) == len(set(x)), samples)
    return 1 - float(sum(hasnoshared)) / simulations
Timing this in IPython on my laptop, 50k simulations take about 0.5 seconds to run. For 23 people, it returns the expected result of around 0.5: as we saw in a previous post on this blog, it takes 23 people in a room to have at least a 50% chance of a shared birthday.
In [579]: time sharedbday1(23)
CPU times: user 0.56 s, sys: 0.00 s, total: 0.56 s
Wall time: 0.56 s
Out[580]: 0.50678000000000001
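As a sanity check (my own addition, not part of the original timings), the exact probability can be computed directly: the chance of no shared birthday among n people is the product of (365 − i)/365 for i = 0..n−1, and it agrees with the simulation:

def exact_sharedbday(people, days=365):
    # Multiply the per-person probabilities of avoiding all earlier birthdays.
    p_no_share = 1.0
    for i in range(people):
        p_no_share *= float(days - i) / days
    return 1 - p_no_share

exact_sharedbday(23)   # ~0.5073, close to the simulated 0.5068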
Since we love Numpy, there should be a cleverer and faster way of doing this. We can sort each sample, then compare it element-wise against a slice of itself offset by one; if any adjacent elements are equal, we’ve found a sample with a shared birthday. This is achieved very neatly with Numpy arrays, as operations in Numpy are performed element-wise by default:
import numpy

def sharedbday2(people, simulations=50000):
    samples = numpy.random.random_integers(1, 365, size=(simulations, people))
    # Sort each row, then compare it against itself shifted by one: any equal
    # neighbours means that row contains a shared birthday.
    samples.sort()
    hasshared = ((sample[1:] == sample[:-1]).any() for sample in samples)
    return float(sum(hasshared)) / simulations
Well, unfortunately and somewhat surprisingly, this runs twice as slowly as the original code:
In [582]: time sharedbday2(23)
CPU times: user 1.01 s, sys: 0.00 s, total: 1.01 s
Wall time: 1.02 s
Out[583]: 0.50988
I exchanged some emails with Travis Oliphant of Numpy fame, who suggested the following code, which is really neat and worth spending time to understand; I think it shows off the real power of Numpy:
import numpy

def sharedbday3(people, simulations=50000):
    samples = numpy.random.random_integers(1, 365, size=(simulations, people))
    # Sort every row in place, then do the shifted comparison across all rows
    # at once; both operations loop over the rows in C rather than in Python.
    samples.sort(axis=-1)
    hasshared = (samples[:, 1:] == samples[:, :-1]).any(axis=-1)
    return float(sum(hasshared)) / simulations
This is fast and shows a speed-up of at least 5X over my original code, with a wall time of 0.09 s:
In [585]: time sharedbday3(23)
CPU times: user 0.10 s, sys: 0.00 s, total: 0.10 s
Wall time: 0.09 s
Out[586]: 0.50119999999999998
As Travis explains it:
You create a 2-d array up-front with the simulations along one axis (i.e. every row of the 2-d array is a separate simulation). Then you can use the same approach used above, but this time letting NumPy do the loop over the rows (down in C). The sort and any methods on NumPy arrays take an axis keyword which determines the dimension over which the operation works. The operation is then repeated implicitly over the other dimensions. So, in the 3rd version of shared birthday there are no for loops. The penalty paid is memory usage: a 2-d array instead of throw-away 1-d arrays. But, of course, less memory management is also a major reason for the speed gain.
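To make the axis mechanics concrete, here is a tiny worked example (my own illustration, not from the original exchange):

import numpy

a = numpy.array([[3, 1, 3],
                 [2, 4, 5]])
a.sort(axis=-1)        # sorts each row independently: [[1, 3, 3], [2, 4, 5]]
# Compare each sorted row against itself shifted by one; any(axis=-1) then
# collapses each row to a single True/False.
dups = (a[:, 1:] == a[:, :-1]).any(axis=-1)
print(dups)            # [ True False] -- only the first row has a repeat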
For completeness, I made some minor modifications to the original code above to simulate the probability that at least one person in a group shares your birthday (rather than the probability of a shared birthday between any two of them):
import numpy

def sharemybday(people, simulations=50000):
    mybday = 1
    samples = numpy.random.random_integers(1, 365, size=(simulations, people))
    # A row counts as a hit if anyone in that simulated room drew my birthday.
    hasshared = map(lambda x: mybday in x, samples)
    return float(sum(hasshared)) / simulations
In [1509]: sharemybday(253)
Out[1509]: 0.50431999999999999
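This matches the closed form: the chance that none of n people shares my birthday is (364/365)^n, so the probability of at least one match is 1 − (364/365)^n, and 253 is the smallest group for which this exceeds 50%. A quick check (my own, not from the simulations above):

p = 1 - (364.0 / 365.0) ** 253
print(p)   # ~0.5005, in line with the simulated 0.5043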