Molis Hai: Generating Passwords Using Charles Dickens

:: Racket, Security

By: John Clements

TL;DR: Molis Hai

Randomly generated passwords:

  • dMbcGp=A(
  • 9eMRV7N%[
  • R]eJxx68v
  • GVUN#ek5z
  • ms8AG09-h
  • sVh2TT4wx
  • Y]sa7-b(f
  • BOrnNGLqk

More randomly generated passwords:

  • wargestood hury on,
  • wealenerity," stp
  • twould, aftilled himenu
  • Whaideve awasaga
  • andir her hing ples. F
  • spe it humphadeas a
  • to and ling, ace upooke,
  • Mr. Syd, why.’ tred. "D

Yet more randomly generated passwords:

  • brothe aponder and," reasun
  • ther atternal telle is be
  • his me, he foundred, id
  • allant our faces of rai
  • time! What it of vail
  • sourned," reate." Manybody.
  • they would reck," read-doom
  • raise thack ther meant,

Which of these look easiest to remember?

All three of these sets of passwords are randomly generated from a set of size 2^56; they’re all equivalently secure. The second and third sets are generated using Markov models built from the text of Charles Dickens’ A Tale of Two Cities, where transitions are made using Huffman trees.

The secret sauce here is that since traversing a Huffman tree to a shallow, common leaf requires fewer bits than traversing that same tree to reach a deep, rare leaf, we can drive the generating model using a pool of bits, using a varying number of bits depending on the likelihood of the transition taken.
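
To make that concrete, here’s a minimal sketch in Racket of the decoding step. This is not the actual Molis Hai code; the toy tree, the bit pool, and all of the names here are invented for illustration. The point is just that a likely character sits near the root of its Huffman tree and costs few bits, while a rare one sits deeper and costs more.

    #lang racket
    ;; Minimal sketch of driving generation from a pool of bits by walking
    ;; a Huffman tree. A tree is either a leaf holding a character, or an
    ;; internal node with a zero-branch and a one-branch.
    (struct leaf (char) #:transparent)
    (struct node (zero one) #:transparent)

    ;; A toy tree: frequent characters are shallow, rare ones are deep.
    (define example-tree
      (node (leaf #\e)                 ; "e" costs 1 bit  (0)
            (node (leaf #\t)           ; "t" costs 2 bits (10)
                  (node (leaf #\a)     ; "a" costs 3 bits (110)
                        (leaf #\q))))) ; "q" costs 3 bits (111)

    ;; decode-one : tree (listof bit) -> (values char (listof bit))
    ;; Consume one bit per branch until a leaf is reached; return the
    ;; character and whatever bits are left over.
    (define (decode-one tree bits)
      (if (leaf? tree)
          (values (leaf-char tree) bits)
          (decode-one (if (zero? (first bits)) (node-zero tree) (node-one tree))
                      (rest bits))))

    ;; decode-all : tree (listof bit) -> string
    ;; Keep decoding until the pool runs dry (this sketch assumes the pool
    ;; never runs out mid-traversal, which the real system has to handle).
    (define (decode-all tree bits)
      (let loop ([bits bits] [chars '()])
        (if (empty? bits)
            (list->string (reverse chars))
            (let-values ([(c rest-bits) (decode-one tree bits)])
              (loop rest-bits (cons c chars))))))

    ;; Seven random bits decode to four characters here:
    (decode-all example-tree '(0 1 0 1 1 1 0)) ; => "etqe"

In the full system there would be one such tree per Markov state (one for each preceding character sequence), rather than a single global tree, but the bit-consuming traversal works the same way.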

This means that there’s a 1-to-1 mapping between the sequences of bits and the corresponding English-like textual fragments, thus guaranteeing the security of the passwords (or, more precisely, reducing it to the problem of generating a cryptographically secure sequence of bits, which many smart people have thought hard about already).

Another reasonable way to describe this process is that we’re just “decompressing” randomly generated crypto bits using a model trained on Dickens.

The only difference between the second and third pools is that the second one uses a 2nd-order Markov model (meaning that the choice of each letter is driven by the prior two letters) and the third one uses a 3rd-order model, resulting in more Dickensian text, but also in longer text.
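
To illustrate what “n-th order” means, here’s a rough Racket sketch (again, not the actual Molis Hai code; the function name and representation are invented) of gathering the raw counts for such a model: for every n-character window of the corpus, count how often each character follows it. Each of those count tables is what would then get turned into a Huffman tree.

    #lang racket
    ;; build-counts : string natural -> (hash string (hash char natural))
    ;; For each n-character prefix appearing in the corpus, tally how often
    ;; each character follows it.
    (define (build-counts corpus order)
      (define counts (make-hash))
      (for ([i (in-range (- (string-length corpus) order))])
        (define prefix (substring corpus i (+ i order)))
        (define next   (string-ref corpus (+ i order)))
        (define table  (hash-ref! counts prefix make-hash))
        (hash-update! table next add1 0))
      counts)

    ;; With order 2, a prefix like "th" ends up mapping "e" to a large
    ;; count and rarely-following characters to small ones; with order 3
    ;; the prefixes are longer and the predictions sharper (and the text
    ;; more Dickensian).
    (define model
      (build-counts "it was the best of times, it was the worst of times" 2))
    (hash-ref model "th" (hash)) ; characters seen after "th" in this tiny corpus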

Naturally, you can push this further. When you get to a 5th-order model, you get passwords like this:

  • not bitter their eyes, armed; I am natural
  • me. Is that. At fire, and, and—in separable;
  • reason off. The nailed abound tumbril o
  • and many more." “See, that,” return-
  • falls, any papers over these listen
  • do you, yes." "I beg to takes merc
  • paper movement off before," said, Charles," rejoin
  • that. She—had the season flung found." He o

Much more Dickensian, much longer. Same security.

You can try it out yourself; Molis Hai contains a small JS implementation of this, and a canned set of 2nd-order trees.

Please note that there’s nothing secret about the model; we’re assuming that an attacker already knows exactly how you’re generating your passwords. The only thing he or she is missing is the 56 bits you used to generate your password.
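
In Racket, those secret bits might come straight from the operating system’s CSPRNG; here’s one way to do it (a sketch only; random-bits is my own helper here, not part of Molis Hai):

    #lang racket
    ;; The only secret input: n bits from a cryptographically secure source.
    (require racket/random) ; provides crypto-random-bytes

    (define (random-bits n)
      ;; grab enough random bytes, assemble them into one number,
      ;; then keep only the low n bits
      (define num-bytes (ceiling (/ n 8)))
      (define big
        (for/fold ([acc 0]) ([b (in-bytes (crypto-random-bytes num-bytes))])
          (+ (* acc 256) b)))
      (bitwise-and big (sub1 (arithmetic-shift 1 n))))

    (random-bits 56) ; a 56-bit natural number, ready to feed to the model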

For a more carefully written paper that explains this a bit more slowly, see the preprint at arXiv.

Naturally, you can use any corpus you like. I tried generating text using a big slab of my own e-mails, and aside from a serious tendency to follow the letter “J” with the letters “o”, “h”, and “n”, I didn’t notice a huge difference, at least not in the 2nd-order models. Well, actually, here’s an example:

  • 0.91, Also: Zahid We rigor
  • argustorigoring tent r
  • Myrics args foling") (
  • can’s fortalk at html-unds
  • having avaScript" 0.88489232B
  • John? I doe.cal fluore let a
  • botheird, creally, there thic
  • to ind [(solutell wil

It’s probably true that Charles Dickens wasn’t quite so likely to type “avascript” as I am. Or “html”.

To read the Racket code I used to generate the models, see GitHub.

And for Heaven’s sake, let me know about related work that I missed!