SMHasher is a popular test suite created by Austin Appleby for testing hash functions for weaknesses and speed. There is also a more up-to-date fork.
SMHasher is widely used in the design and vetting of non-cryptographic hash functions. And non-cryptographic hash functions often advertise that they pass the entire SMHasher test suite as a signal of hash quality.
This is a problem for hashes with large output sizes, and in this post we're going to explore why that is.
In the process of exploring that, we're also going to:
- Create a 128-bit hash function that cleanly passes SMHasher, while nevertheless having issues that are obvious from analysis.
- Identify a common issue with most currently published large-output non-cryptographic hashes.
Please note that this entire post is about hashes intended for use in non-adversarial conditions. Hashes that need to withstand attacks have additional considerations, and I'm not qualified to discuss that.1
Goodhart's Law
There is a fundamental problem with designing hashes against SMHasher, which basically comes down to Goodhart's law:
When a measure becomes a target, it ceases to be a good measure.
Don't get me wrong, SMHasher is a good test suite, and any large-output hash function worth its salt ought to pass it. But it is infeasible to catch all issues with empirical tests.
Because of this, if your approach to hash design is to keep tweaking your hash until it passes SMHasher, you will merely end up with a hash that is free of the issues that SMHasher can test for.
For hashes with small outputs (e.g. 32 or 64 bits), that's almost certainly fine. The use cases for such hashes (e.g. hash tables) expect and have to handle collisions anyway, so the quality requirements aren't nearly as stringent.
However, 128-bit+ hashes are another story. Large hashes have different use cases, and in many of those use cases (e.g. content-addressed systems, data fingerprinting, etc.) collisions are unacceptable, and can lead to data loss.
Additionally, large hashes are infeasible to thoroughly test for issues empirically due to their size. So if you design a large hash function by just tweaking it until it passes a test suite like SMHasher, it is very likely to still have issues that make collisions more likely than its output size suggests.
To help make my point, we're going to design a 128-bit hash function that has clear issues when properly analyzed, but which nevertheless appears to have best-in-class quality according to SMHasher.
The Mixing Function
To get started, let's choose a mixing function for our hash. Mixing functions are at the heart of many hash function designs, and we'll be putting one at the heart of ours as well.
Mixing functions take a fixed-size chunk of data and mix up its bits. There are some qualities that you generally want out of a mixing function:
- It should be bijective. Or put differently, it should be reversible. Or put even more differently: it shouldn't lose any data.
- It should have good avalanche. Another way of putting this is that every input bit should be well diffused into the output. You can imagine putting a tiny drop of dye into a glass of water, and then mixing it all up to evenly distribute the dye. That's roughly the idea here, but with bits instead of drops of dye.
(There are other criteria that are also useful, but a discussion of them isn't necessary for this post.)
In our case, we're going to use a mixing function inspired by ThreeFish's MIX function. It takes two unsigned 64-bit integers and mixes their bits up across the total 128 bits of both:
rotation_constants = [
12, 39, 21, 13, 32, 11, 24, 53,
17, 27, 57, 13, 50, 8, 52, 8
]
function mix(A, B, rounds):
for i in 0 to rounds:
A += B + 1
B = B.rotate_left(rotation_constant[i]) ^ A
Let's analyze this mixing function.
First, it's definitely bijective. All of the operations are trivially reversible, and we can easily write a function that undoes the mixing:
function unmix(A, B, rounds):
for i in reverse(0 to rounds):
B = (B ^ A).rotate_right(rotation_constant[i])
A -= B + 1
So that's one check mark!
To analyze the avalanche, we can generate an avalanche diagram2, which is essentially a graph of pixels where the vertical axis is the input bit, the horizontal axis is the output bit, and the color of a pixel indicates how much effect that input bit has on that output bit on average.
Here's an avalanche diagram for our mixing function with 1 round:
Black pixels mean that flipping the input bit never affects the output bit, and white means that it always affects it. Grays mean that sometimes there's an effect and sometimes not. Grays represent good diffusion. So 1 round is pretty awful, as it turns out.
Here's 4 and 8 rounds:
As you can see, more rounds improves the avalanche/diffusion. But we're still not quite there. For full avalanche this mix function actually needs 12 rounds, at which point the image becomes an even gray:
So with 12 rounds, this is a pretty good mix function! For our purposes it's also useful that we can tweak its quality to be anything from awful to good—we'll be exploiting that later.
Our First Bad Hash
Using our new mixing function, we're going to first create an awful hash that SMHasher does identify issues with.
We're going to take a block-based approach to hashing our data. The idea is that we split up the input data into equal-sized blocks, and shove them into our hash state one at a time. The last block may not always be large enough, so in that case we pad it out with zeros.
So here's our very first awful hash:
function goodhart_hash_1(input_data):
hash_state = [0, 0] // Two 64-bit integers.
for block in input_data: // 128-bit blocks.
pad_block_if_needed(block)
hash_state[0] ^= block.first_half
hash_state[1] ^= block.second_half
mix(hash_state, 12 rounds)
return hash_state
There are some really bad things about this hash, which we'll get to in a moment. But first, one apparently good property that this does have is its avalanche:
(Note: I've rotated this diagram 90 degrees for easier viewing, so now the input bits are the horizontal axis.)
This graph shows the avalanche of a 512-bit input, and it looks perfect! This means that every input bit affects every output bit with roughly equal probability. So at first glance this seems awesome.
However, putting our hash through SMHasher3 tells a different story (full output):
AppendedZeroesTest | FAIL | |||
Avalanche Tests | Pass | |||
Keyset 'Sparse' Tests | FAIL | |||
Keyset 'Permutation' Tests | FAIL | |||
Keyset 'Window' Tests | Pass | |||
Keyset 'Cyclic' Tests | FAIL | |||
Keyset 'TwoBytes' Tests | FAIL | |||
Keyset 'Text' Tests | Pass | |||
Keyset 'Zeroes' Tests | FAIL | |||
Diff 'Differential' Tests | FAIL | |||
DiffDist 'Differential Distribution' Tests | Pass | |||
MomentChi2 Tests | Pass | |||
Prng Tests | Pass | |||
BIC 'Bit Independence Criteria' Tests | Pass |
AppendedZeroesTest | Pass | |||
Avalanche Tests | Pass | |||
Keyset 'Sparse' Tests | FAIL | |||
Keyset 'Permutation' Tests | FAIL | |||
Keyset 'Window' Tests | Pass | |||
Keyset 'Cyclic' Tests | FAIL | |||
Keyset 'TwoBytes' Tests | FAIL | |||
Keyset 'Text' Tests | Pass | |||
Keyset 'Zeroes' Tests | Pass | |||
Diff 'Differential' Tests | FAIL | |||
DiffDist 'Differential Distribution' Tests | Pass | |||
MomentChi2 Tests | Pass | |||
Prng Tests | Pass | |||
BIC 'Bit Independence Criteria' Tests | Pass |
AppendedZeroesTest | Pass | |||
Avalanche Tests | Pass | |||
Keyset 'Sparse' Tests | Pass | |||
Keyset 'Permutation' Tests | Pass | |||
Keyset 'Window' Tests | Pass | |||
Keyset 'Cyclic' Tests | Pass | |||
Keyset 'TwoBytes' Tests | Pass | |||
Keyset 'Text' Tests | Pass | |||
Keyset 'Zeroes' Tests | Pass | |||
Diff 'Differential' Tests | Pass | |||
DiffDist 'Differential Distribution' Tests | Pass | |||
MomentChi2 Tests | Pass | |||
Prng Tests | Pass | |||
BIC 'Bit Independence Criteria' Tests | Pass |
AppendedZeroesTest | Pass | |||
Avalanche Tests | Pass | |||
Keyset 'Sparse' Tests | FAIL | |||
Keyset 'Permutation' Tests | Pass | |||
Keyset 'Window' Tests | Pass | |||
Keyset 'Cyclic' Tests | Pass | |||
Keyset 'TwoBytes' Tests | Pass | |||
Keyset 'Text' Tests | Pass | |||
Keyset 'Zeroes' Tests | Pass | |||
Diff 'Differential' Tests | Pass | |||
DiffDist 'Differential Distribution' Tests | Pass | |||
MomentChi2 Tests | Pass | |||
Prng Tests | Pass | |||
BIC 'Bit Independence Criteria' Tests | Pass |
AppendedZeroesTest | Pass | |||
Avalanche Tests | Pass | |||
Keyset 'Sparse' Tests | Pass | |||
Keyset 'Permutation' Tests | Pass | |||
Keyset 'Window' Tests | Pass | |||
Keyset 'Cyclic' Tests | Pass | |||
Keyset 'TwoBytes' Tests | Pass | |||
Keyset 'Text' Tests | Pass | |||
Keyset 'Zeroes' Tests | Pass | |||
Diff 'Differential' Tests | Pass | |||
DiffDist 'Differential Distribution' Tests | Pass | |||
MomentChi2 Tests | Pass | |||
Prng Tests | Pass | |||
BIC 'Bit Independence Criteria' Tests | Pass |
AppendedZeroesTest | Pass | |||
Avalanche Tests | Pass | |||
Keyset 'Sparse' Tests | Pass | |||
Keyset 'Permutation' Tests | Pass | |||
Keyset 'Window' Tests | Pass | |||
Keyset 'Cyclic' Tests | Pass | |||
Keyset 'TwoBytes' Tests | Pass | |||
Keyset 'Text' Tests | Pass | |||
Keyset 'Zeroes' Tests | Pass | |||
Diff 'Differential' Tests | Pass | |||
DiffDist 'Differential Distribution' Tests | Pass | |||
MomentChi2 Tests | Pass | |||
Prng Tests | Pass | |||
BIC 'Bit Independence Criteria' Tests | Pass |
Hash | Average Diffusion (random) | Worst Diffusion (random) | Worst Diffusion (structured)6 | Avalanche Diagram |
---|---|---|---|---|
AquaHash | 32 bits | 32 bits | 0 bits | diagram |
CityHash128 / FarmHash128 | 58 bits | 3 bits | 0 bits | diagram |
MeowHash 0.57 | 74 bits | 32 bits | 0 bits | diagram |
MetroHash128 | 29 bits | 3 bits | 0 bits | diagram |
Murmur3 128-bit | 83 bits | 33 bits | 24 bits | diagram |
xxHash3 128-bit | 38 bits | 33 bits | 0 bits | diagram |
A note of caution: this table needs to be interpreted carefully because these hashes have a variety of designs, some of which might make them tolerant of less mixing between input blocks to a greater or lesser extent. Having said that, some of them definitely don't have such a design.
Moreover, as far as I can tell, none of these hashes7 have published analysis showing that their design is, in fact, tolerant of that to an extent that would live up to their 128-bit output size. On the contrary, in many cases the documentation they provide explicitly outlines their reliance on empirical tests (such as SMHasher) for quality verification.
Mind you, I'm not suggesting that these hashes were designed at random without any insight behind them. But that's not the same as being rigorous enough to have confidence in the hash's quality.
In short, without (public) analysis to demonstrate otherwise, it appears that none of these hashes fully live up to their 128-bit size. For many applications that may be fine. But for some it's definitely not.
(Note: if you want to reproduce the results in this table or check my work, the code is available.)
Conclusion
The wrong thing to take away from this post is that SMHasher is somehow a bad test suite. On the contrary, SMHasher is quite a good test suite.
The real take away is that despite SMHasher being a good test suite, it cannot—and will never be able to—reliably catch issues in large-output hashes. And especially given the use cases that large-output hashes often have, that means we shouldn't rely on SMHasher (or any other test suite!) as a stamp of quality for them.
The only way to have real confidence in large-output hashes is through analysis and justification of their design. And if the creator(s) of a large-output hash want other people to use it, it is incumbent upon them to provide that analysis and justification.
If a creator doesn't provide that, don't trust the hash. We should hold our hashes to a higher standard.
Footnotes
-
Moreover, the norms around hashes with security claims are already much more rigorous. So the main thrust of this post, "you should publicly analyze and justify your hash designs", is redundant in that context since it's already expected.
-
Proper avalanche analysis should be done with more than eyeballing a diagram, but diagrams do provide important intuition for what's going on. I don't cover it in this post because it's not important to the point, but I did more than just look at avalanche diagrams to determine the number of rounds needed for full diffusion.
-
Throughout this article we omit SMHasher's seeding tests (e.g. the Perlin Noise test) because none of the hashes we build are seedable. However, note that simply prepending the seed to the input data makes both the good and seemingly-good hashes in this article pass all of the seeding tests as well.
-
Thanks to the birthday paradox, you would only need to compute ~13 billion hashes for a 99% chance of triggering a collision in a 64-bit-equivalent hash, which would handily catch this trick out. You would have to get a little clever in the implementation, since the birthday paradox depends on comparing every hash to every other hash. But that's certainly doable.
-
I say "mixing component", but many of these hashes are designed such that the mixing of the hash state and incorporating an input data block are intertwined, creating a combined "absorber" or "accumulator" component. In such cases, I test the combined component.
-
It's worth noting that a poor worst case for structured inputs may not actually be concerning depending on the rest of the hash design. In particular, if the overall hash design ensures that such patterned states are themselves vanishingly unlikely to occur, then it doesn't really matter. Nevertheless, I've included it in the table for completeness.
Also, it might be surprising to see zero bits of diffusion listed in this column of the table. What does that even mean?
The measure of diffusion used in the table is related to how biased the effects of an input bit are. The goal is for flipping the input bit to have a 50/50 chance of flipping any given output bit. So having no chance of flipping a given output bit and a 100% chance are both given the same score: zero.
If this seems like an odd measure, rest assured that in this table all zero scores are due to an input bit only affecting a single specific output bit.
-
MeowHash is special on this list in a few ways.
First, it's still a work in progress, so these diffusion numbers may change in future iterations. It also means that lack of documentation and published analysis is to some extent expected.
Second, they actually do appear to be applying some rigor in their design process to ensure quality. However, the relevant discussion is currently scattered over many issues on github and is difficult to follow. Difficult to the extent that it's not even clear to me that all of the relevant discussion and reasoning is publicly available at this point.
Third, partly because of the above, I got curious and did a little analysis of its construction, and used that to provide a better (and more flattering) estimate of its diffusion than a simple per-block estimate would have provided. See the source code comments in the supplemental code for details.