hash table timings

:: Racket

By: John Clements

Are immutable hash tables constant-time insert and lookup? Actually, it looks like the answer is “no, not really”:

This picture shows insertion time per insertion, so we’d expect to see something like a flat line. Of course, there’s a lot of hand-waving involving “as n approaches infinity” and what that actually means on a machine with limited memory, but it seems clear from this picture that if you’re creating a hash table with about ten million elements, you should probably consider using a mutable hash table; the time required to create that table will be about 40 seconds for an immutable hash table vs. about 7 seconds for a mutable one.

This graph strongly suggests that the added time is entirely GC time, assuming that GC is not running in parallel, which I believe it’s not.

Drawing this on a linear-x graph suggests that the times for the immutable hash table insertion are well below linear; I’m thinking they’re somewhere between log n and (log n)^2.

What about lookup? Well, I ran two experiments; one on keys that are in the table, and one on keys that are not.

In this experiment, there are 3 million lookups on each tree size. The numbers for both of them are interesting in that they move around quite a lot; the error bars aren’t that big, and you can see (especially in the failed lookups) that there are some definite patterns. First of all, the immutable tree lookups pretty clearly display an upward trend, suggesting that lookup is actually log(n), albeit with a fairly small constant (about 20% per 10x). The lookups on the mutable hash tables also seem to be growing, though in their case there seems to be a sawtooth pattern, presumably occurring when the size of the table passes a particular threshold.

In the case of lookups, though, unlike insertion, there’s no clear mutable-vs-immutable winner, at least for the table sizes that I used. Lookups are generally around 150 microseconds, compared to about 600 microseconds to insert into a mutable hash table.

Details: The keys for the table creation are integers in the range 30,000,000 ^ 2, generated in a way that guaranteed uniqueness. The values associated with the keys are just the integers in the range 0..table size.

Details: I ran these tests from the command-line, with 10 trials at each size, using a Macbook Pro (mid 2015) with a 2.5 GHz i7. Error bars show two standard deviations, indicating a 95% boundary assuming the distribution is normal. All non-GC times are wall-clock times. All runs used the same random seed.

I deleted one trial from one set of tests, because the machine fell asleep and doubled the wall-clock time.

These are all using untyped Racket, version—2017–04–20. Ooh, that’s a bit out of date.. I honestly don’t know what TR would do to these runtimes.

Finally, in case you’re interested, here’s the source code I used to generate these timings (running them will write files into /tmp/, JFYI):

Okay, enough messing around. Back to grading….