SMHasher is a popular test suite created by Austin Appleby for testing hash functions for weaknesses and speed. There is also a more up-to-date fork.
SMHasher is widely used in the design and vetting of non-cryptographic hash functions. And non-cryptographic hash functions often advertise that they pass the entire SMHasher test suite as a signal of hash quality.
This is a problem for hashes with large output sizes, and in this post we're going to explore why that is.
In the process of exploring that, we're also going to:
- Create a 128-bit hash function that cleanly passes SMHasher, while nevertheless having issues that are obvious from analysis.
- Identify a common issue with most currently published large-output non-cryptographic hashes.
Please note that this entire post is about hashes intended for use in non-adversarial conditions. Hashes that need to withstand attacks have additional considerations, and I'm not qualified to discuss that.^{1}
Goodhart's Law
There is a fundamental problem with designing hashes against SMHasher, which basically comes down to Goodhart's law:
When a measure becomes a target, it ceases to be a good measure.
Don't get me wrong, SMHasher is a good test suite, and any large-output hash function worth its salt ought to pass it. But it is infeasible to catch all issues with empirical tests.
Because of this, if your approach to hash design is to keep tweaking your hash until it passes SMHasher, you will merely end up with a hash that is free of the issues that SMHasher can test for.
For hashes with small outputs (e.g. 32 or 64 bits), that's almost certainly fine. The use cases for such hashes (e.g. hash tables) expect and have to handle collisions anyway, so the quality requirements aren't nearly as stringent.
However, 128-bit+ hashes are another story. Large hashes have different use cases, and in many of those use cases (e.g. content-addressed systems, data fingerprinting, etc.) collisions are unacceptable, and can lead to data loss.
Additionally, large hashes are infeasible to thoroughly test for issues empirically due to their size. So if you design a large hash function by just tweaking it until it passes a test suite like SMHasher, it is very likely to still have issues that make collisions more likely than its output size suggests.
To help make my point, we're going to design a 128-bit hash function that has clear issues when properly analyzed, but which nevertheless appears to have best-in-class quality according to SMHasher.
The Mixing Function
To get started, let's choose a mixing function for our hash. Mixing functions are at the heart of many hash function designs, and we'll be putting one at the heart of ours as well.
Mixing functions take a fixed-size chunk of data and mix up its bits. There are some qualities that you generally want out of a mixing function:
- It should be bijective. Or put differently, it should be reversible. Or put even more differently: it shouldn't lose any data.
- It should have good avalanche. Another way of putting this is that every input bit should be well diffused into the output. You can imagine putting a tiny drop of dye into a glass of water, and then mixing it all up to evenly distribute the dye. That's roughly the idea here, but with bits instead of drops of dye.
(There are other criteria that are also useful, but a discussion of them isn't necessary for this post.)
In our case, we're going to use a mixing function inspired by Threefish's MIX function. It takes two unsigned 64-bit integers and mixes their bits up across the total 128 bits of both:
rotation_constants = [
    12, 39, 21, 13, 32, 11, 24, 53,
    17, 27, 57, 13, 50, 8, 52, 8
]

function mix(A, B, rounds):
    for i in 0 to rounds:
        A += B + 1
        B = B.rotate_left(rotation_constants[i]) ^ A
Let's analyze this mixing function.
First, it's definitely bijective. All of the operations are trivially reversible, and we can easily write a function that undoes the mixing:
function unmix(A, B, rounds):
    for i in reverse(0 to rounds):
        B = (B ^ A).rotate_right(rotation_constants[i])
        A -= B + 1
So that's one check mark!
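To make the bijectivity check concrete, here's a runnable Python sketch of the mix/unmix pair. The masking helpers and names are mine (the pseudocode above doesn't specify them), but the round logic is a direct translation:

```python
# Runnable sketch of the mix/unmix pair above. Python integers are
# arbitrary-precision, so 64-bit wrapping is emulated by masking.

MASK64 = (1 << 64) - 1

ROTATIONS = [
    12, 39, 21, 13, 32, 11, 24, 53,
    17, 27, 57, 13, 50, 8, 52, 8,
]

def rotl64(x, n):
    return ((x << n) | (x >> (64 - n))) & MASK64

def rotr64(x, n):
    return ((x >> n) | (x << (64 - n))) & MASK64

def mix(a, b, rounds):
    for i in range(rounds):
        a = (a + b + 1) & MASK64
        b = rotl64(b, ROTATIONS[i]) ^ a
    return a, b

def unmix(a, b, rounds):
    # Undo the rounds in reverse order; every step is exactly invertible.
    for i in reversed(range(rounds)):
        b = rotr64(b ^ a, ROTATIONS[i])
        a = (a - b - 1) & MASK64
    return a, b

# Bijectivity check: unmix exactly reverses mix.
assert unmix(*mix(0x0123456789ABCDEF, 0xFEDCBA9876543210, 12), 12) == \
    (0x0123456789ABCDEF, 0xFEDCBA9876543210)
```

Since each round only uses addition, rotation, and xor, the inverse just applies the inverse of each operation in the opposite order, which is exactly what `unmix` does.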
To analyze the avalanche, we can generate an avalanche diagram^{2}, which is essentially a graph of pixels where the vertical axis is the input bit, the horizontal axis is the output bit, and the color of a pixel indicates how much effect that input bit has on that output bit on average.
Here's an avalanche diagram for our mixing function with 1 round:
Black pixels mean that flipping the input bit never affects the output bit, and white means that it always affects it. Grays mean that sometimes there's an effect and sometimes not. Grays represent good diffusion. So 1 round is pretty awful, as it turns out.
Here's 4 and 8 rounds:
As you can see, more rounds improves the avalanche/diffusion. But we're still not quite there. For full avalanche this mix function actually needs 12 rounds, at which point the image becomes an even gray:
So with 12 rounds, this is a pretty good mix function! For our purposes it's also useful that we can tweak its quality to be anything from awful to good—we'll be exploiting that later.
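If you'd like to reproduce diagrams like these, here's a small Python sketch of the underlying measurement. The sampling scheme and sample count are my own choices, not necessarily what the post's tooling does; it tabulates, for each (input bit, output bit) pair, how often flipping the input bit flips the output bit:

```python
import random

MASK64 = (1 << 64) - 1
ROTATIONS = [12, 39, 21, 13, 32, 11, 24, 53, 17, 27, 57, 13, 50, 8, 52, 8]

def rotl64(x, n):
    return ((x << n) | (x >> (64 - n))) & MASK64

def mix(a, b, rounds):
    for i in range(rounds):
        a = (a + b + 1) & MASK64
        b = rotl64(b, ROTATIONS[i]) ^ a
    return a, b

def as_u128(pair):
    return (pair[0] << 64) | pair[1]

def avalanche_counts(rounds, samples=200, seed=1):
    """counts[i][j]: times flipping input bit i flipped output bit j."""
    rng = random.Random(seed)
    counts = [[0] * 128 for _ in range(128)]
    for _ in range(samples):
        a, b = rng.getrandbits(64), rng.getrandbits(64)
        base = as_u128(mix(a, b, rounds))
        for i in range(128):
            fa = a ^ (1 << i) if i < 64 else a
            fb = b ^ (1 << (i - 64)) if i >= 64 else b
            diff = as_u128(mix(fa, fb, rounds)) ^ base
            for j in range(128):
                counts[i][j] += (diff >> j) & 1
    return counts

# A pixel's brightness corresponds to counts[i][j] / samples:
# 0.0 is black (never flips), 1.0 is white (always flips), 0.5 is ideal gray.
```

With 1 round you'll find many cells stuck at 0 (black pixels); at 12 rounds every cell should hover around half the sample count, matching the even gray diagram.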
Our First Bad Hash
Using our new mixing function, we're going to first create an awful hash that SMHasher does identify issues with.
We're going to take a block-based approach to hashing our data. The idea is that we split up the input data into equal-sized blocks, and shove them into our hash state one at a time. The last block may not always be large enough, so in that case we pad it out with zeros.
So here's our very first awful hash:
function goodhart_hash_1(input_data):
    hash_state = [0, 0] // Two 64-bit integers.
    for block in input_data: // 128-bit blocks.
        pad_block_if_needed(block)
        hash_state[0] ^= block.first_half
        hash_state[1] ^= block.second_half
    mix(hash_state, 12 rounds)
    return hash_state
There are some really bad things about this hash, which we'll get to in a moment. But first, one apparently good property that this does have is its avalanche:
(Note: I've rotated this diagram 90 degrees for easier viewing, so now the input bits are the horizontal axis.)
This graph shows the avalanche of a 512bit input, and it looks perfect! This means that every input bit affects every output bit with roughly equal probability. So at first glance this seems awesome.
However, putting our hash through SMHasher^{3} tells a different story (full output):
AppendedZeroesTest  FAIL 
Avalanche Tests  Pass 
Keyset 'Sparse' Tests  FAIL 
Keyset 'Permutation' Tests  FAIL 
Keyset 'Window' Tests  Pass 
Keyset 'Cyclic' Tests  FAIL 
Keyset 'TwoBytes' Tests  FAIL 
Keyset 'Text' Tests  Pass 
Keyset 'Zeroes' Tests  FAIL 
Diff 'Differential' Tests  FAIL 
DiffDist 'Differential Distribution' Tests  Pass 
MomentChi2 Tests  Pass 
Prng Tests  Pass 
BIC 'Bit Independence Criteria' Tests  Pass 
That's quite a lot of failures!
(Note: if you want to reproduce the SMHasher results for the hashes we create in this post, the code is available.)
Fixing the First Failure
The very first failure, "AppendedZeroesTest", is a pretty obvious one.
The reason our hash fails this is because of our zero-padding scheme. If the last block is too small, we append enough zeros to make it the same size as the other blocks. The consequence is that we can't, for example, distinguish between a final 3-byte block containing "abc" and a final 10-byte block that starts with "abc" followed by seven 0 bytes.
And that's exactly what "AppendedZeroesTest" tests for.
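Here's a hypothetical Python rendering of Goodhart Hash 1 that demonstrates the collision directly. The little-endian byte layout is my assumption (the pseudocode doesn't specify one), but the flaw exists regardless of layout:

```python
MASK64 = (1 << 64) - 1
ROTATIONS = [12, 39, 21, 13, 32, 11, 24, 53, 17, 27, 57, 13, 50, 8, 52, 8]

def rotl64(x, n):
    return ((x << n) | (x >> (64 - n))) & MASK64

def mix(a, b, rounds):
    for i in range(rounds):
        a = (a + b + 1) & MASK64
        b = rotl64(b, ROTATIONS[i]) ^ a
    return a, b

def goodhart_hash_1(data: bytes):
    a = b = 0
    for off in range(0, max(len(data), 1), 16):
        # Zero-pad the final block if it's short.
        block = data[off:off + 16].ljust(16, b"\x00")
        a ^= int.from_bytes(block[:8], "little")
        b ^= int.from_bytes(block[8:], "little")
    return mix(a, b, 12)

# Trailing zero bytes are indistinguishable from padding, so these collide:
assert goodhart_hash_1(b"abc") == goodhart_hash_1(b"abc" + b"\x00" * 7)
```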
There are a couple of approaches to address this. But the simplest—and the one we'll use—is to incorporate the input length into the hash. And that gives us Goodhart Hash 2:
function goodhart_hash_2(input_data):
    hash_state = [0, 0] // Two 64-bit integers.
    for block in input_data: // 128-bit blocks.
        pad_block_if_needed(block)
        hash_state[0] ^= block.first_half
        hash_state[1] ^= block.second_half

    // Incorporate the input length.
    hash_state[0] ^= input_data.length()

    mix(hash_state, 12 rounds)
    return hash_state
By incorporating the input length, we ensure that extending the input data with zeros changes the resulting hash.
As a result, we now have two fewer failures in SMHasher (full output):
AppendedZeroesTest  Pass 
Avalanche Tests  Pass 
Keyset 'Sparse' Tests  FAIL 
Keyset 'Permutation' Tests  FAIL 
Keyset 'Window' Tests  Pass 
Keyset 'Cyclic' Tests  FAIL 
Keyset 'TwoBytes' Tests  FAIL 
Keyset 'Text' Tests  Pass 
Keyset 'Zeroes' Tests  Pass 
Diff 'Differential' Tests  FAIL 
DiffDist 'Differential Distribution' Tests  Pass 
MomentChi2 Tests  Pass 
Prng Tests  Pass 
BIC 'Bit Independence Criteria' Tests  Pass 
That's progress!
Fixing the Remaining Failures
The remaining failures all stem from the same underlying issue: we xor all the input blocks into the state without doing any mixing, and only mix at the end.
This has two major consequences:
- The input blocks can be shuffled however you like, and the resulting hash will be the same.
- Even if shuffling weren't a concern, the interaction between the bits of different blocks is simple rather than complex. For example, you can flip the first bit in one block, and flip the same bit in another block, and they cancel out.
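Both weaknesses are easy to exhibit concretely. Below is a hypothetical Python rendering of Goodhart Hash 2 (the little-endian byte layout is my assumption) with a collision probe for each:

```python
MASK64 = (1 << 64) - 1
ROTATIONS = [12, 39, 21, 13, 32, 11, 24, 53, 17, 27, 57, 13, 50, 8, 52, 8]

def rotl64(x, n):
    return ((x << n) | (x >> (64 - n))) & MASK64

def mix(a, b, rounds):
    for i in range(rounds):
        a = (a + b + 1) & MASK64
        b = rotl64(b, ROTATIONS[i]) ^ a
    return a, b

def goodhart_hash_2(data: bytes):
    a = b = 0
    for off in range(0, max(len(data), 1), 16):
        block = data[off:off + 16].ljust(16, b"\x00")
        a ^= int.from_bytes(block[:8], "little")
        b ^= int.from_bytes(block[8:], "little")
    a ^= len(data)  # Incorporate the input length.
    return mix(a, b, 12)

blk0, blk1 = bytes(range(16)), bytes(range(16, 32))

# 1. Xor is commutative, so swapping whole blocks collides:
assert goodhart_hash_2(blk0 + blk1) == goodhart_hash_2(blk1 + blk0)

# 2. Matching bit flips in two different blocks cancel out:
tweaked = bytearray(blk0 + blk1)
tweaked[0] ^= 1   # flip bit 0 of block 0
tweaked[16] ^= 1  # flip the same bit position in block 1
assert goodhart_hash_2(bytes(tweaked)) == goodhart_hash_2(blk0 + blk1)
```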
Both of these issues can be resolved by mixing the state in between xoring in the input blocks. And that gives us Goodhart Hash 3:
function goodhart_hash_3(input_data):
    hash_state = [0, 0] // Two 64-bit integers.
    for block in input_data: // 128-bit blocks.
        pad_block_if_needed(block)
        hash_state[0] ^= block.first_half
        hash_state[1] ^= block.second_half
        mix(hash_state, 12 rounds) // Mix here too.

    // Incorporate the input length.
    hash_state[0] ^= input_data.length()

    mix(hash_state, 12 rounds)
    return hash_state
This fully resolves both issues:
- Mixing the hash state before xoring the next block breaks the commutativity of the xors, and therefore changing the order of the blocks now produces different hashes.
- Since each input bit is fully diffused into the hash state before incorporating the next block, the relationship between the bits of different blocks is complex: flipping a single bit anywhere in the input can only (with overwhelming probability) be cancelled out by a complex change elsewhere in the input.
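Re-running the earlier collision probes against a hypothetical Python rendering of Goodhart Hash 3 (same assumed little-endian layout) shows both now produce distinct hashes:

```python
MASK64 = (1 << 64) - 1
ROTATIONS = [12, 39, 21, 13, 32, 11, 24, 53, 17, 27, 57, 13, 50, 8, 52, 8]

def rotl64(x, n):
    return ((x << n) | (x >> (64 - n))) & MASK64

def mix(a, b, rounds):
    for i in range(rounds):
        a = (a + b + 1) & MASK64
        b = rotl64(b, ROTATIONS[i]) ^ a
    return a, b

def goodhart_hash_3(data: bytes):
    a = b = 0
    for off in range(0, max(len(data), 1), 16):
        block = data[off:off + 16].ljust(16, b"\x00")
        a ^= int.from_bytes(block[:8], "little")
        b ^= int.from_bytes(block[8:], "little")
        a, b = mix(a, b, 12)  # Mix between blocks too.
    a ^= len(data)
    return mix(a, b, 12)

blk0, blk1 = bytes(range(16)), bytes(range(16, 32))

# Swapped blocks no longer collide:
assert goodhart_hash_3(blk0 + blk1) != goodhart_hash_3(blk1 + blk0)

# Nor do matching bit flips in different blocks:
tweaked = bytearray(blk0 + blk1)
tweaked[0] ^= 1
tweaked[16] ^= 1
assert goodhart_hash_3(bytes(tweaked)) != goodhart_hash_3(blk0 + blk1)
```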
If we run this hash through SMHasher (full output):
AppendedZeroesTest  Pass 
Avalanche Tests  Pass 
Keyset 'Sparse' Tests  Pass 
Keyset 'Permutation' Tests  Pass 
Keyset 'Window' Tests  Pass 
Keyset 'Cyclic' Tests  Pass 
Keyset 'TwoBytes' Tests  Pass 
Keyset 'Text' Tests  Pass 
Keyset 'Zeroes' Tests  Pass 
Diff 'Differential' Tests  Pass 
DiffDist 'Differential Distribution' Tests  Pass 
MomentChi2 Tests  Pass 
Prng Tests  Pass 
BIC 'Bit Independence Criteria' Tests  Pass 
Tada! It passes all the tests!
In fact, although I haven't done the analysis needed to make any claims, this might actually be a high-quality hash. It's also simple and fairly easy to understand.
Of course, I said we were going to make a bad hash that passes SMHasher, so you might feel a little let down right now. Don't worry, we're getting there.
The main problem with this hash is that it's not very fast. Which means we can't put up cool websites and boast about it being the most blazingly fastestest hash there evar wuz! And that's a problem, because people love speed. Gotta look good in those benchmarks.
Making Our Hash Faster and Worser
One way we can make our hash faster is to do fewer mixing rounds. Because really (he says foreshadowing), we surely don't need all of those rounds, and SMHasher will tell us if we have too few, right?
Let's give it a shot. We'll reduce the mix rounds in the inner loop to 4, which should make it over twice as fast! However, we'll still do a 12round mix after incorporating the input length to finalize everything.
Behold Goodhart Hash 4:
function goodhart_hash_4(input_data):
    hash_state = [0, 0] // Two 64-bit integers.
    for block in input_data: // 128-bit blocks.
        pad_block_if_needed(block)
        hash_state[0] ^= block.first_half
        hash_state[1] ^= block.second_half
        mix(hash_state, 4 rounds)

    // Incorporate the input length.
    hash_state[0] ^= input_data.length()

    // Finalize.
    mix(hash_state, 12 rounds)
    return hash_state
So what does SMHasher say? (full output)
AppendedZeroesTest  Pass 
Avalanche Tests  Pass 
Keyset 'Sparse' Tests  FAIL 
Keyset 'Permutation' Tests  Pass 
Keyset 'Window' Tests  Pass 
Keyset 'Cyclic' Tests  Pass 
Keyset 'TwoBytes' Tests  Pass 
Keyset 'Text' Tests  Pass 
Keyset 'Zeroes' Tests  Pass 
Diff 'Differential' Tests  Pass 
DiffDist 'Differential Distribution' Tests  Pass 
MomentChi2 Tests  Pass 
Prng Tests  Pass 
BIC 'Bit Independence Criteria' Tests  Pass 
Damn, so close! We're only failing one test.
No matter! If 4 rounds isn't enough, we'll just bump 'em up to 5! That should do the trick.
Goodhart Hash 5:
function goodhart_hash_5(input_data):
    hash_state = [0, 0] // Two 64-bit integers.
    for block in input_data: // 128-bit blocks.
        pad_block_if_needed(block)
        hash_state[0] ^= block.first_half
        hash_state[1] ^= block.second_half
        mix(hash_state, 5 rounds)

    // Incorporate the input length.
    hash_state[0] ^= input_data.length()

    // Finalize.
    mix(hash_state, 12 rounds)
    return hash_state
SMHasher's report (full output):
AppendedZeroesTest  Pass 
Avalanche Tests  Pass 
Keyset 'Sparse' Tests  Pass 
Keyset 'Permutation' Tests  Pass 
Keyset 'Window' Tests  Pass 
Keyset 'Cyclic' Tests  Pass 
Keyset 'TwoBytes' Tests  Pass 
Keyset 'Text' Tests  Pass 
Keyset 'Zeroes' Tests  Pass 
Diff 'Differential' Tests  Pass 
DiffDist 'Differential Distribution' Tests  Pass 
MomentChi2 Tests  Pass 
Prng Tests  Pass 
BIC 'Bit Independence Criteria' Tests  Pass 
We did it! We improved the performance of our hash by around 2x, while maintaining its quality! We know it's true, because SMHasher said so.
Right?
Goodhart's Law Again
Unfortunately, just because SMHasher gave a passing score doesn't mean our hash is actually as robust as its 128bit output size implies. In reality, "pass" in SMHasher just means "couldn't find an issue within the limitations of our tests".
The principle behind Goodhart's law is that most tests aren't exhaustive, and typically are also stand-ins for the thing you actually want to know. In our case, we want to know whether our hash is as robust as its 128-bit size implies. But that's not feasible to test empirically, so instead SMHasher tests the subset of things that are feasible to test, in the hope that it catches issues.
This has value if the hash is designed independently of the test suite, and the test suite is just used after the fact as a kind of sanity check. But if the hash is designed against the test, all bets are off.
And this is exactly what happened when we went from 4 mixing rounds to 5. All we did was push the issues with our hash barely outside of what SMHasher tests for. But the issue is still there.
Recall that to fix most of the failures of our initial awful hash, we had to add a mixing step between input blocks. On some level, all of those tests were (directly or indirectly) trying to ensure that this mixing was sufficient. At 4 rounds only one of those tests failed, but most of the other tests would have failed as well if given enough compute time and memory; they only passed because running them at that scale isn't practical. Then when we bumped the mix rounds up to 5, the one failing holdout was pushed past its breaking point as well, and it passed too.
Here's the avalanche diagram for 5 mixing rounds:
This is what Goodhart Hash 5 is doing between input blocks. It clearly isn't fully mixed/diffused.
Specifically, the average diffusion (in terms of Shannon entropy) is 106 bits, and the worst-diffused bit only has 45 bits of diffusion. And that's for random inputs. For a structured-input test (an incrementing integer), the average is 66 bits and the worst is 24 bits.
That means we only reliably have 24 bits of complexity between blocks, which in turn means that in some cases a single bit flip can be cancelled out by an average change of only about 12 bits elsewhere in the input.
To be clear, that's not quite as bad as it sounds. Only specific input bits have this property, and the worst case is only triggered by certain input data. And even then the cancelling only happens at such low complexity in adjacent blocks, not in any block within the whole input. So it's not as if the whole hash is reduced to the equivalent of a 24-bit hash.
Nevertheless, our hash has a higher probability of collisions than its 128-bit size suggests. This is a problem if we advertise it as a high-quality hash, because people will then reasonably assume that the probability of collisions matches that 128-bit size, and may subsequently use our hash where that's important.
A Final Awful Hash
To really drive the point home, let's make one last hash.
Unlike our previous hashes, which followed a fairly realistic progression, we're going to intentionally build this hash to be far, far weaker than its output size would imply. All while still passing SMHasher with flying colors.
Here's Goodhart Hash 6:
function goodhart_hash_6(input_data):
    hash_state = [0, 0] // Two 64-bit integers.
    for block in input_data: // 128-bit blocks.
        pad_block_if_needed(block)
        hash_state[0] ^= block.first_half
        hash_state[1] ^= block.second_half
        mix(hash_state, 5 rounds)

    // Incorporate the input length.
    hash_state[0] ^= input_data.length()

    // Finalize.
    mix(hash_state, 12 rounds)

    // Be evil.
    hash_state[1] = 0
    mix(hash_state, 12 rounds)

    return hash_state
The only difference between this and Goodhart Hash 5 is that at the very end we zero out the last 64 bits of the hash, and then mix everything up again.
The reason this is evil is that it strictly reduces our hash to the equivalent of a mere 64-bit hash, all while still having an output that looks like a 128-bit hash.
Here's what SMHasher says (full output):
AppendedZeroesTest  Pass 
Avalanche Tests  Pass 
Keyset 'Sparse' Tests  Pass 
Keyset 'Permutation' Tests  Pass 
Keyset 'Window' Tests  Pass 
Keyset 'Cyclic' Tests  Pass 
Keyset 'TwoBytes' Tests  Pass 
Keyset 'Text' Tests  Pass 
Keyset 'Zeroes' Tests  Pass 
Diff 'Differential' Tests  Pass 
DiffDist 'Differential Distribution' Tests  Pass 
MomentChi2 Tests  Pass 
Prng Tests  Pass 
BIC 'Bit Independence Criteria' Tests  Pass 
Still passes.
This specific evil trick could feasibly be teased out empirically, albeit with a very memory-hungry test.^{4} But the point is that these kinds of issues are in general infeasible to test for empirically with large hashes.
Sussing out issues with large hashes requires analysis, along with testing their individual components.
Which Large Hashes Are Good?
Before we jump in to look at some other hashes, I first want to re-emphasize what I said at the beginning of this post: none of this is relevant to small-output hashes. Many of the hashes below also have small-output variants, and nothing in this post should be taken to suggest anything about those variants.
I also don't mean to pick on these hashes in particular. Rather, I'm using them to make a broader point that should, I think, make you skeptical of all large-output hashes that don't provide a public analysis and justification of their design.
With that said, let's jump in.
We're going to focus here on one particular weakness that many large non-cryptographic hashes have in common. In fact, it's the same weakness we introduced into Goodhart Hash 4 and 5: not enough mixing between blocks.
What I've done is isolated the mixing component^{5} of each hash, and measured its avalanche/diffusion.
The table below lists the average and worst diffusion for random inputs, as well as the worst diffusion for patterned inputs^{6} (such as an incrementing counter). Worst diffusion means the diffusion of the least-diffused input bit. The table also links to avalanche diagrams for random inputs.
Hash  Average Diffusion (random)  Worst Diffusion (random)  Worst Diffusion (structured)^{6}  Avalanche Diagram 

AquaHash  32 bits  32 bits  0 bits  diagram 
CityHash128 / FarmHash128  58 bits  3 bits  0 bits  diagram 
MeowHash 0.5^{7}  74 bits  32 bits  0 bits  diagram 
MetroHash128  29 bits  3 bits  0 bits  diagram 
Murmur3 128-bit  83 bits  33 bits  24 bits  diagram 
xxHash3 128-bit  38 bits  33 bits  0 bits  diagram 
A note of caution: this table needs to be interpreted carefully, because these hashes have a variety of designs, some of which might make them more or less tolerant of weaker mixing between input blocks. Having said that, some of them definitely don't have such a design.
Moreover, as far as I can tell, none of these hashes^{7} have published analysis showing that their design is, in fact, tolerant enough to live up to their 128-bit output size. On the contrary, in many cases their documentation explicitly outlines their reliance on empirical tests (such as SMHasher) for quality verification.
Mind you, I'm not suggesting that these hashes were designed at random without any insight behind them. But that's not the same as being rigorous enough to have confidence in the hash's quality.
In short, without (public) analysis to demonstrate otherwise, it appears that none of these hashes fully live up to their 128-bit size. For many applications that may be fine. But for some it's definitely not.
(Note: if you want to reproduce the results in this table or check my work, the code is available.)
Conclusion
The wrong thing to take away from this post is that SMHasher is somehow a bad test suite. On the contrary, SMHasher is quite a good test suite.
The real takeaway is that despite SMHasher being a good test suite, it cannot (and will never be able to) reliably catch issues in large-output hashes. And especially given the use cases that large-output hashes often have, that means we shouldn't rely on SMHasher (or any other test suite!) as a stamp of quality for them.
The only way to have real confidence in large-output hashes is through analysis and justification of their design. And if the creator(s) of a large-output hash want other people to use it, it is incumbent upon them to provide that analysis and justification.
If a creator doesn't provide that, don't trust the hash. We should hold our hashes to a higher standard.
Footnotes

Moreover, the norms around hashes with security claims are already much more rigorous. So the main thrust of this post, "you should publicly analyze and justify your hash designs", is redundant in that context since it's already expected.

Proper avalanche analysis should be done with more than eyeballing a diagram, but diagrams do provide important intuition for what's going on. I don't cover it in this post because it's not important to the point, but I did more than just look at avalanche diagrams to determine the number of rounds needed for full diffusion.

Throughout this article we omit SMHasher's seeding tests (e.g. the Perlin Noise test) because none of the hashes we build are seedable. However, note that simply prepending the seed to the input data makes both the good and seemingly-good hashes in this article pass all of the seeding tests as well.

Thanks to the birthday paradox, you would only need to compute ~13 billion hashes for a 99% chance of triggering a collision in a 64bitequivalent hash, which would handily catch this trick out. You would have to get a little clever in the implementation, since the birthday paradox depends on comparing every hash to every other hash. But that's certainly doable.
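The ~13 billion figure follows from the standard birthday-bound approximation p ≈ 1 − exp(−n²/2N), solved for n. A quick sanity check:

```python
import math

def birthday_n(bits, p):
    """Hashes needed for collision probability p among 2**bits outputs,
    via the approximation p ~= 1 - exp(-n**2 / (2 * 2**bits))."""
    return math.sqrt(2.0 * (2.0 ** bits) * math.log(1.0 / (1.0 - p)))

# A 64-bit-equivalent hash reaches a 99% collision chance at ~13 billion
# hashes, while a true 128-bit hash needs ~56 quintillion for the same odds.
print(round(birthday_n(64, 0.99) / 1e9))   # billions
print(round(birthday_n(128, 0.99) / 1e18)) # quintillions
```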

I say "mixing component", but many of these hashes are designed such that the mixing of the hash state and incorporating an input data block are intertwined, creating a combined "absorber" or "accumulator" component. In such cases, I test the combined component.

It's worth noting that a poor worst case for structured inputs may not actually be concerning depending on the rest of the hash design. In particular, if the overall hash design ensures that such patterned states are themselves vanishingly unlikely to occur, then it doesn't really matter. Nevertheless, I've included it in the table for completeness.
Also, it might be surprising to see zero bits of diffusion listed in this column of the table. What does that even mean?
The measure of diffusion used in the table is related to how biased the effects of an input bit are. The goal is for flipping the input bit to have a 50/50 chance of flipping any given output bit. So having no chance of flipping a given output bit and a 100% chance are both given the same score: zero.
If this seems like an odd measure, rest assured that in this table all zero scores are due to an input bit only affecting a single specific output bit.
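As a sketch of how such a measure might be computed (this is my reading of the description above, not necessarily the exact formula in the supplemental code): score each output bit by the binary entropy of its flip probability, and sum over all output bits.

```python
import math

def diffusion_bits(flip_probs):
    """Sum of the binary entropy of each output bit's flip probability.
    Both p = 0 (never flips) and p = 1 (always flips) contribute 0 bits,
    while the ideal p = 0.5 contributes a full bit."""
    total = 0.0
    for p in flip_probs:
        if 0.0 < p < 1.0:
            total -= p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p)
    return total

assert diffusion_bits([0.5] * 128) == 128.0  # ideal: every bit a coin flip
assert diffusion_bits([1.0] * 128) == 0.0    # fully biased scores zero
```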

MeowHash is special on this list in a few ways.
First, it's still a work in progress, so these diffusion numbers may change in future iterations. It also means that the lack of documentation and published analysis is, to some extent, expected.
Second, they actually do appear to be applying some rigor in their design process to ensure quality. However, the relevant discussion is currently scattered across many issues on GitHub and is difficult to follow. Difficult to the extent that it's not even clear to me that all of the relevant discussion and reasoning is publicly available at this point.
Third, partly because of the above, I got curious and did a little analysis of its construction, and used that to provide a better (and more flattering) estimate of its diffusion than a simple per-block estimate would have provided. See the source code comments in the supplemental code for details.