Molis Hai: Generating Passwords Using Charles Dickens

:: Racket, Security

By: John Clements

TL;DR: Molis Hai

Randomly generated passwords:

  • dMbcGp=A(
  • 9eMRV7N%[
  • R]eJxx68v
  • GVUN#ek5z
  • ms8AG09-h
  • sVh2TT4wx
  • Y]sa7-b(f
  • BOrnNGLqk

More randomly generated passwords:

  • wargestood hury on,
  • wealenerity," stp
  • twould, aftilled himenu
  • Whaideve awasaga
  • andir her hing ples. F
  • spe it humphadeas a
  • to and ling, ace upooke,
  • Mr. Syd, why.’ tred. "D

Yet more randomly generated passwords:

  • brothe aponder and," reasun
  • ther atternal telle is be
  • his me, he foundred, id
  • allant our faces of rai
  • time! What it of vail
  • sourned," reate." Manybody.
  • they would reck," read-doom
  • raise thack ther meant,

Which of these look easiest to remember?

All three of these sets of passwords are randomly generated from a set of size 2^56; they’re all equivalently secure. The second and third sets are generated using Markov models built from the text of Charles Dickens’ A Tale of Two Cities, where transitions are made using Huffman trees.

The secret sauce here is that since traversing a Huffman tree to a shallow, common leaf requires fewer bits than traversing that same tree to reach a deep, rare leaf, we can drive the generating model using a pool of bits, using a varying number of bits depending on the likelihood of the transition taken.
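
To make that concrete, here’s a minimal sketch in Racket of the decoding step. This is not the actual Molis Hai code; the toy tree, the bit pool, and all of the names here are invented for illustration. The point is just that a likely character sits near the root of its Huffman tree and costs few bits, while a rare one sits deeper and costs more.

    #lang racket
    ;; Minimal sketch of driving generation from a pool of bits by walking
    ;; a Huffman tree. A tree is either a leaf holding a character, or an
    ;; internal node with a zero-branch and a one-branch.
    (struct leaf (char) #:transparent)
    (struct node (zero one) #:transparent)

    ;; A toy tree: frequent characters are shallow, rare ones are deep.
    (define example-tree
      (node (leaf #\e)                 ; "e" costs 1 bit  (0)
            (node (leaf #\t)           ; "t" costs 2 bits (10)
                  (node (leaf #\a)     ; "a" costs 3 bits (110)
                        (leaf #\q))))) ; "q" costs 3 bits (111)

    ;; decode-one : tree (listof bit) -> (values char (listof bit))
    ;; Consume one bit per branch until a leaf is reached; return the
    ;; character and whatever bits are left over.
    (define (decode-one tree bits)
      (if (leaf? tree)
          (values (leaf-char tree) bits)
          (decode-one (if (zero? (first bits)) (node-zero tree) (node-one tree))
                      (rest bits))))

    ;; decode-all : tree (listof bit) -> string
    ;; Keep decoding until the pool runs dry (this sketch assumes the pool
    ;; never runs out mid-traversal, which the real system has to handle).
    (define (decode-all tree bits)
      (let loop ([bits bits] [chars '()])
        (if (empty? bits)
            (list->string (reverse chars))
            (let-values ([(c rest-bits) (decode-one tree bits)])
              (loop rest-bits (cons c chars))))))

    ;; Seven random bits decode to four characters here:
    (decode-all example-tree '(0 1 0 1 1 1 0)) ; => "etqe"

In the full system there would be one such tree per Markov state (one for each preceding character sequence), rather than a single global tree, but the bit-consuming traversal works the same way.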

This means that there’s a 1-to-1 mapping between the sequences of bits and the corresponding English-like textual fragments, thus guaranteeing the security of the passwords (or, more precisely, reducing it to the problem of generating a cryptographically secure sequence of bits, which many smart people have thought hard about already).

Another reasonable way to describe this process is that we’re just “decompressing” randomly generated crypto bits using a model trained on Dickens.

The only difference between the second and third pools is that the second one uses a 2nd-order Markov model (meaning that the choice of each letter is driven by the prior two letters) and the third one uses a 3rd-order model, resulting in more Dickensian text, but also in longer text.
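
To illustrate what “n-th order” means, here’s a rough Racket sketch (again, not the actual Molis Hai code; the function name and representation are invented) of gathering the raw counts for such a model: for every n-character window of the corpus, count how often each character follows it. Each of those count tables is what would then get turned into a Huffman tree.

    #lang racket
    ;; build-counts : string natural -> (hash string (hash char natural))
    ;; For each n-character prefix appearing in the corpus, tally how often
    ;; each character follows it.
    (define (build-counts corpus order)
      (define counts (make-hash))
      (for ([i (in-range (- (string-length corpus) order))])
        (define prefix (substring corpus i (+ i order)))
        (define next   (string-ref corpus (+ i order)))
        (define table  (hash-ref! counts prefix make-hash))
        (hash-update! table next add1 0))
      counts)

    ;; With order 2, a prefix like "th" ends up mapping "e" to a large
    ;; count and rarely-following characters to small ones; with order 3
    ;; the prefixes are longer and the predictions sharper (and the text
    ;; more Dickensian).
    (define model
      (build-counts "it was the best of times, it was the worst of times" 2))
    (hash-ref model "th" (hash)) ; characters seen after "th" in this tiny corpus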

Naturally, you can push this further. When you get to a 5th-order model, you get passwords like this:

  • not bitter their eyes, armed; I am natural
  • me. Is that. At fire, and, and—in separable;
  • reason off. The nailed abound tumbril o
  • and many more." “See, that,” return-
  • falls, any papers over these listen
  • do you, yes." "I beg to takes merc
  • paper movement off before," said, Charles," rejoin
  • that. She—had the season flung found." He o

Much more Dickensian, much longer. Same security.

You can try it out yourself; Molis Hai contains a small JS implementation of this, and a canned set of 2nd-order trees.

Please note that there’s nothing secret about the model; we’re assuming that an attacker already knows exactly how you’re generating your passwords. The only thing he or she is missing is the 56 bits you used to generate your password.
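
In Racket, those secret bits might come straight from the operating system’s CSPRNG; here’s one way to do it (a sketch only; random-bits is my own helper here, not part of Molis Hai):

    #lang racket
    ;; The only secret input: n bits from a cryptographically secure source.
    (require racket/random) ; provides crypto-random-bytes

    (define (random-bits n)
      ;; grab enough random bytes, assemble them into one number,
      ;; then keep only the low n bits
      (define num-bytes (ceiling (/ n 8)))
      (define big
        (for/fold ([acc 0]) ([b (in-bytes (crypto-random-bytes num-bytes))])
          (+ (* acc 256) b)))
      (bitwise-and big (sub1 (arithmetic-shift 1 n))))

    (random-bits 56) ; a 56-bit natural number, ready to feed to the model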

For a more carefully written paper that explains this a bit more slowly, see the preprint at arXiv.

Naturally, you can use any corpus you like. I tried generating text using a big slab of my own e-mails, and aside from a serious tendency to follow the letter “J” with the letters “o”, “h”, and “n”, I didn’t notice a huge difference, at least not in the 2nd-order models. Well, actually, here’s an example:

  • 0.91, Also: Zahid We rigor
  • argustorigoring tent r
  • Myrics args foling") (
  • can’s fortalk at html-unds
  • having avaScript" 0.88489232B
  • John? I doe.cal fluore let a
  • botheird, creally, there thic
  • to ind [(solutell wil

It’s probably true that Charles Dickens wasn’t quite so likely to type “avascript” as I am. Or “html”.

To read the Racket code I used to generate the models, see GitHub.

And for Heaven’s sake, let me know about related work that I missed!