Today I saw this post on boingboing with a method to generate a random number from 1 to 6 without the need of a die. The basic method is this:
- Pick a random word, using any method you like.
- Sum up the number values of all the letters in the word, with a=1, b=2, etc.
- Calculate
sum%9
ormod(sum,9)
, equivalent to calculating the remainder after dividing the sum by 9. - If the number is 0 or greater than 6, throw it out and go to the next word.
- Otherwise, you have your pseudo-random number from 1 to 6.
I thought it was pretty interesting, but the first thing I thought of was: why divide by 9? You’ll have to throw out about 1/3 of your numbers because the remainder will be 0, 7, or 8. Why not divide by 6 instead? Then you can just use a mapping of 0→1, 1→2, and so on to 5→6 and keep all the numbers without having to throw out anything. But the real challenge is to then take it to the next level and write some code to do it for you.
In the comments very soon somebody posted some results using the words from their Words file (located at /usr/share/dict/words
, it’s a file with a bunch of words used by spell-checkers in the linux OS) with numbers that looked very good for for the mod(9) and mod(6) versions.
I decided that I could do better than that. Why not pull a bunch of words from the internet and use those? There’s no better source of lots of words than free books, and the easiest way to get free books is from Project Gutenberg.
I decided to try writing some code in python, and a quick google search got me some simple code to pull text from any Project Gutenberg book and divide it into a list of individual strings for each word. After that, it’s a fairly straightforward process to iterate through each word, use the ord()
function to get the ASCII value and convert it to an a=1, b=2, etc. number encoding, sum the numbers up to get a total, and then use the mod()
function to get the remainder by dividing it by 6 or 9. For comparison, I also used the random number generator in python to generate an equal number of random numbers between 1 to 6 as well.
I did that for all the words in Crime and Punishment by Fyodor Dostoevsky and got the following results:
method 1 2 3 4 5 6
--------------- ----- ----- ----- ----- ----- -----
mod(N,6) method 24500 43816 23020 29030 23315 15064
mod(N,9) method 15804 20820 14081 14960 19924 12867
rand function 26610 26573 26465 26457 26359 26281
Showing as a plot we get the following:
That’s actually a really bad distribution, there’s a whole lot more 2’s and less 6’s, both for the mod(6)
an mod(9)
calculations, but it’s especially bad for the mod(6)
, whereas mod(9)
has too many 2’s and 5’s.
Why would that be? Someone in the comments had a good suggestion: if I’m just using every word (of three letters or longer, I dropped any one- or two-letter words), then there are probably still a whole lot of the‘s and and‘s in the list that would throw off my distribution.
Let’s do a quick calculation and see. For the we have t=20
, h=8
, and e=5
. That gives a total of 20+8+5=33
. Dividing 33/9 gives a remainder of 6, and dividing 33/6 gives a remainder of 3. For the mod9 method that number will be dropped, since I just kept results of 0 to 5 (mapping to numbers 1 to 6) and dropped the rest. For mod9 that number will become 4, using the mapping 3 → 4.
Similarly for and we have a=1
, n=14
, and d=4
. That gives a total of 19, and dividing 19/9 gives a remainder of 1, and dividing 19/6 gives the same remainder of 1, since 18 is a common multiple forth 6 and 9. In both cases this will become a 2 on a die, since we’re using the mapping 1 → 2. And looking at the plot, in both cases 2 has the highest number and 4 after that. This is the most likely reason.
So I then fixed the script so that it ignored any word repetitions, so each word is represented only once in the entire word list. Re-running it gave me the following results for Crime and Punishment:
method 1 2 3 4 5 6
--------------- ---- ---- ---- ---- ---- ----
mod(N,6) method 1584 1600 1551 1610 1493 1526
mod(N,9) method 1052 1057 1069 1084 1049 1035
rand function 1522 1556 1561 1643 1568 1514
And as a plot:
That looks a whole lot better. There are less overall using the mod9 function because it drops about 1/3 of the numbers because the remainder > 5. Overall though, it certainly passes the ‘looks OK’ test (which is important in science and engineering, don’t underestimate it), but can we more rigorously determine how random it is?
There is a statistical test we can perform called the Pearson’s chi-squared test. Basically how it works is this: if the process is perfectly random, then we can expect to have the exact same number of dice rolls for each number: namely the number of times 1, 2, 3, 4, 5, and 6 should all be the same. That is our expected value. We take the actual number of times each number is rolled, subtract it from the expected value, square that quantity, and then divide by the expected value. Then we do the same for all six numbers and sum the result. This is our chi-squared statistic or \(\chi^2\) , which should be closer to zero as the randomness of our method increases. Results from Crime And Punishment:
Method | \(\mathbf{\chi^2}\) |
mod(6) | 6.66 |
mod(9) | 1.36 |
rand() | 7.81 |
The \(\chi^2\) value from the rand()
function varies since it uses different random numbers each time, usually it varies from 2 to 7. This puts the results from the mod(6)
and the mod(9)
functions on par with
the rand()
function.
However if we do the same analysis on the results when I kept all the repeated words, we get very different \(\chi^2\) values:
Method | \(\mathbf{\chi^2}\) |
mod(6) | 17510 |
mod(9) | 3184 |
Obviously this is much much worse than when we improved the code by removing all repeating words, but quantitatively speaking, how much worse is it. Or in other words, what do those values of 17510 and 5184
actually mean compared to 6.66 and 1.36?
To answer that we have to look at the chi-squared distribution itself. There is more than you could ever want to learn about it on the wikipedia articles for the chi-squared distribution and the Pearson’s chi-squared test, but basically by using that distribution it allows us to determine values that we can compare our \(\chi^2\) values to in order to see how random our numbers are.
First we define the null hypothesis, which is that our pseudo-random numbers generated using the word-to-number-to-divide-by-number-and keep-the-remainder method are random. We calculate a confidence interval (i.e. 90%, 95%, 99%, 99.99%, etc.) using the chi-squared distribution function for a given number of degrees of freedom (5 in this case, 6 possibilities – 1) and compare our \(\chi^2\) values to it. If our \(\chi^2\) is greater than the confidence interval, then we reject the null hypothesis and so we can say that our numbers are not random to the degree of our confidence.
Let’s do an example. For 5 degrees of freedom, the value for a confidence interval of 99% is 15.0863. We saw above the the \(\chi^2\) values for using all the repeating words were 17510 and 5184 for the mod6 and mod9 methods respectively. Since both of these values are greater than 15.0863, then we reject the null hypothesis, and so we can say with a 99% confidence that those numbers are not random.
However if the numbers are less than our critical value, as is the case when we don’t repeat numbers, does the converse hold? Can we then say with a 99% confidence that our numbers are random? Actually, we can’t. A hypothesis can never be proven, it can only be disproven. So in this case we don’t reject the hypothesis. We haven’t proven the numbers are random, we simply say that we can’t prove they aren’t random.
This becomes clearer when we start comparing different confidence values. For the same 5 degrees of freedom, we have the following confidence intervals:
Confidence level | Critical \(\mathbf{\chi^2}\) value |
50% | 4.35146 |
90% | 9.23636 |
99% | 15.0863 |
99.9% | 20.515 |
99.99% | 25.7448 |
To disprove the null hypothesis with a higher confidence level, that is to have a higher degree of confidence that our numbers are not random, requires a higher \(\chi^2\) value. So for example if our \(\chi^2=21\), we are 99.9% confident that the numbers are not random, but we can’t be 99.99% confident. There is still a 0.1% chance that the numbers are in fact random, just that by random chance we happened to get a bunch of the same numbers giving us a high \(\chi^2\) value. That’s where the uncertainty is.
If the converse were true, then if our numbers had a \(\chi^2=21\), then we could say with a 99.99% confidence that our numbers were random, since 21<25.7448. But because 21>20.515 for the 99.9% confidence interval, we know with at 99.9% confidence that the numbers are in fact not random, so also trying to say that we have a 99.99% confidence that they are random contradicts this. Hence we can never truly prove they are random, we can only prove or fail to prove that they are not random to a given degree of confidence.
However going back to our results, the \(\chi^2\) values of 17510 and 3184 are much higher than even a 99.9999999999999999% confidence limit. So we can be pretty much absolutely certain that they are not random. In fact the highest confidence limit I could even calculate was a confidence level of 99.999….(300 9’s)%, which has a value of about 1400. We’re even greater than that!
Anyway, I’ve added the code I wrote to github here. Feel free to play with it if you like.