Awesome
FARSH stands for Fast and Reliable (but not Secure) 32-bit Hash. While established new speed records and successfully passed the SMHasher testsuite, it's not as reliable as the competition. Discussion and additional benchmarks.
Features / to-do list
- compute hashes up to 1024 bits long
- hashing with user-supplied key material
- successfully passed the SMHasher testsuite
- even faster and better quality hash mixing
- SSE2/AVX2 manually-optimized main loop
- 16-byte aligned key material and (optionally) input data for maximum speed on older CPUs
- manual unrolling of main loop (since msvc/icl can't do it themselves) or asm code
- try PSLLQ instead of PSHUFD in SSE2 code to improve speed on older CPUs
-
farsh_init/farsh_update/farsh_result
streaming API -
farsh64*/farsh128*
APIs for faster computation of multi-word hashes -
SSE2/AVX2/NEON?
options in the API (+alignment check for SSE2) for selection of the code path instead of compile-time choice
API
uint32_t farsh(void *data, size_t size, uint64_t seed)
returns 32-bit hash of the buffervoid farsh_n(void *data, size_t size, int k, int n, uint64_t seed, void *hash)
computesn
32-bit hashes starting with the hash numberk
, storing results to thehash
buffer. It'sn
times slower than computation of single 32-bit hash. Hash computed by thefarsh
function has number 0. The function aborts ifk+n > 32
.uint32_t farsh_keyed(void *data, size_t size, void *key, uint64_t seed)
computes 32-bit hash usingkey
, that should be 1024-byte long and aligned to 16-byte boundary.void farsh_keyed_n(void *data, size_t size, void *key, int n, uint64_t seed, void *hash)
computesn
32-bit hashes usingkey
, storing results to thehash
buffer.key
should be1024+16*(n-1)
bytes long and aligned to 16-byte boundary.- Hash functions accept 64-bit
seed
that can be used to "personalize" the hash value. Use seed==0 if you don't need that feature. Seeding may have lower quality than in the competition since the seed value mixed with block hashes rather than raw data. - Header file provides symbolic names for the above-mentioned constants:
FARSH_MAX_HASHES == 32, FARSH_BASE_KEY_SIZE == 1024, FARSH_BASE_KEY_ALIGNMENT == 16, FARSH_EXTRA_KEY_SIZE == 16
Internals
The current FARSH version combines two hashing algorithms.
Low-level hashing algorithm splits all input data into 1024-byte blocks and computes hash value for every block. It's the very short cycle borrowed from UHASH that combines 1024 bytes of input data with 1024 bytes of key material. The hash value returned by this cycle is 64-bit long, and UMAC thesis proved that it has 32 bits of entropy. So the low-level algorithm compresses each 1024-byte block of input data into 64-bit value carrying 32 bits of entropy.
High-level hashing algorithm is a stripped-down version of xxHash64. It receives sequence of 64-bit values from the previous level and combines them into final 32-bit hash result. Since the original xxHash64 algorithm successfully passes all SMHasher tests while computing 64-bit hash from raw data, it's no surprise that modified algorithm is able to compute high-quality 32-bit hash from the sequence of numbers each carrying 32 bits of entropy.
The power of the FARSH algorithm comes from its inner cycle, that is very short (read: fast) and allows highly-parallel implementations, so it can fully exploit power of multi-core, SIMD, VLIW and SIMT (GPU) architectures. At the same time, there is math proof that it can deliver 32 bits of entropy so we can use it without any doubts.
Universal hashing
Main loop uses universal hashing formula from UMAC with a precomputed key material of 1024 bytes (plus 512 bytes for longer hashes). FARSH is essentially UHASH with higher-level hashing algorithms replaced with simpler non-cryptographic one. The universal hashing formula used here (and copied intact from UMAC) is as simple as
uint64_t sum = 0; uint32_t *data, *key;
for (i=0; i < elements; i+=2)
sum += uint64_t(data[i] + key[i]) * (data[i+1] + key[i+1]);
The main loop
- Source code
- Asm code (can be found by searching for adcl+mull/pmuludq instructions)
Benchmark
Benchmark measures overall hash speed as well as internal loop speed. The internal loop speed is a hard limit for the speed of any future FARSH version, while the overall speed includes time required for pretty slow high-level hashing. Future versions should replace it with faster algorithm still satisfying the SMHasher requirements, making overall hash speed within 10% of the internal loop speed.
Executables were compiled with GCC 4.9.2. Aligned versions make sure that data being hashed are 64-byte aligned, unaligned versions make sure that data are unaligned. This makes big difference on Core2 and older Intel CPUs.
Intel Haswell i7-4770 3.9 GHz (AVX2), other IvyBridge to Skylake CPUs has pretty close performance/GHz:
Executable | FARSH 0.2 speed | Internal loop speed |
---|---|---|
aligned-farsh-x64-avx2 | 54.536 GB/s = 50.790 GiB/s | 65.645 GB/s = 61.137 GiB/s |
aligned-farsh-x64 | 31.162 GB/s = 29.022 GiB/s | 35.722 GB/s = 33.269 GiB/s |
aligned-farsh-x86-avx2 | 40.279 GB/s = 37.513 GiB/s | 61.682 GB/s = 57.446 GiB/s |
aligned-farsh-x86-sse2 | 25.221 GB/s = 23.489 GiB/s | 33.584 GB/s = 31.277 GiB/s |
aligned-farsh-x86 | 6.255 GB/s = 5.825 GiB/s | 6.336 GB/s = 5.901 GiB/s |
farsh-x64-avx2 | 46.024 GB/s = 42.863 GiB/s | 64.967 GB/s = 60.505 GiB/s |
farsh-x64 | 30.335 GB/s = 28.252 GiB/s | 34.891 GB/s = 32.495 GiB/s |
farsh-x86-avx2 | 35.273 GB/s = 32.851 GiB/s | 57.252 GB/s = 53.320 GiB/s |
farsh-x86-sse2 | 24.502 GB/s = 22.820 GiB/s | 33.325 GB/s = 31.037 GiB/s |
farsh-x86 | 6.283 GB/s = 5.852 GiB/s | 6.763 GB/s = 6.299 GiB/s |
Intel Pentium M processor 1.5 GHz (SSE2):
Executable | FARSH 0.2 speed | Internal loop speed |
---|---|---|
aligned-farsh-x86-sse2 | 2.625 GB/s = 2.444 GiB/s | 2.791 GB/s = 2.5 GiB/s |
aligned-farsh-x86 | 1.664 GB/s = 1.550 GiB/s | 1.946 GB/s = 1.8 GiB/s |
farsh-x86-sse2 | 2.025 GB/s = 1.886 GiB/s | 2.302 GB/s = 2.1 GiB/s |
farsh-x86 | 1.471 GB/s = 1.370 GiB/s | 1.715 GB/s = 1.5 GiB/s |
K10: AMD Athlon II X2 220 Processor 2.8 GHz (SSE3):
Executable | FARSH 0.2 speed | Internal loop speed |
---|---|---|
aligned-farsh-x64 | 11.300 GB/s = 10.524 GiB/s | 14.446 GB/s = 13.454 GiB/s |
aligned-farsh-x86-sse2 | 10.899 GB/s = 10.151 GiB/s | 13.280 GB/s = 12.368 GiB/s |
aligned-farsh-x86 | 3.805 GB/s = 3.544 GiB/s | 5.089 GB/s = 4.740 GiB/s |
farsh-x64 | 12.823 GB/s = 11.943 GiB/s | 14.187 GB/s = 13.212 GiB/s |
farsh-x86-sse2 | 10.933 GB/s = 10.182 GiB/s | 12.389 GB/s = 11.538 GiB/s |
farsh-x86 | 3.786 GB/s = 3.526 GiB/s | 5.825 GB/s = 5.425 GiB/s |
Piledriver: AMD A8-5500 APU 3.7 GHz (AVX):
Executable | FARSH 0.2 speed | Internal loop speed |
---|---|---|
aligned-farsh-x64 | 17.130 GB/s = 15.953 GiB/s | 21.394 GB/s = 19.924 GiB/s |
aligned-farsh-x86-sse2 | 13.790 GB/s = 12.843 GiB/s | 20.830 GB/s = 19.400 GiB/s |
aligned-farsh-x86 | 3.872 GB/s = 3.606 GiB/s | 5.457 GB/s = 5.082 GiB/s |
farsh-x64 | 15.313 GB/s = 14.262 GiB/s | 19.659 GB/s = 18.309 GiB/s |
farsh-x86-sse2 | 13.812 GB/s = 12.863 GiB/s | 18.977 GB/s = 17.674 GiB/s |
farsh-x86 | 3.959 GB/s = 3.687 GiB/s | 5.056 GB/s = 4.709 GiB/s |
More results and benchmarking executables are available in those forum posts.
Competition
Fast non-cryptographic hashes:
- MumHash (2016)
- HighwayHash (2016)
- CLHash, even faster with Broadwell (2015)
- MetroHash (2015)
- Go language 32-bit and 64-bit hashes (2014)
- xxHash (2012) and xxHash64 (2014)
- SpookyHash: a 128-bit noncryptographic hash (2012)
- The CityHash family of hash functions (2011)
- MurmurHash3 (2011)
- Hasshe2 by Cessu (2008)
Further reading:
- More info about the SMHasher testsuite
- A lot of hashes tested by SMHasher (see doc subdir)
- Interesting historical overview
- SuperFastHash
- Bob Jenkins 1997 Dr Dobbs article and its extended version
MAC/PRF, i.e. cryprographically secure keyed hashes:
- UMAC and VMAC
- The Poly1305-AES message-authentication code
- SipHash
- Cryptoanalysis of CityHash64, MurmurHash and xxHash