Home

Awesome

Requires Java 11 or higher.

hash4j is a Java library by Dynatrace that includes various non-cryptographic hash algorithms and data structures that are based on high-quality hash functions.

First steps

To add a dependency on hash4j using Maven, use the following:

<dependency>
  <groupId>com.dynatrace.hash4j</groupId>
  <artifactId>hash4j</artifactId>
  <version>0.18.0</version>
</dependency>

To add a dependency using Gradle:

implementation 'com.dynatrace.hash4j:hash4j:0.18.0'

Hash algorithms

hash4j currently implements the following hash algorithms:

All hash functions are thoroughly tested against the native reference implementations as well as against other libraries such as Guava Hashing, Zero-Allocation Hashing, Apache Commons Codec, or crypto (see CrossCheckTest.java).

Usage

The interface allows direct hashing of Java objects in a streaming fashion without first mapping them to byte arrays. This minimizes memory allocations and keeps the memory footprint of the hash algorithm constant regardless of the object size.

class TestClass { 
    int a = 42;
    long b = 1234567890L;
    String c = "Hello world!";
}

TestClass obj = new TestClass(); // create an instance of some test class
    
Hasher64 hasher = Hashing.komihash5_0(); // create a hasher instance

// variant 1: hash object by passing data into a hash stream
long hash1 = hasher.hashStream().putInt(obj.a).putLong(obj.b).putString(obj.c).getAsLong(); // gives 0x89a90f343c3d4862L

// variant 2: hash object by defining a funnel
HashFunnel<TestClass> funnel = (o, sink) -> sink.putInt(o.a).putLong(o.b).putString(o.c);
long hash2 = hasher.hashToLong(obj, funnel); // gives 0x90553fd9c675dfb2L

More examples can be found in HashingDemo.java.

Similarity hashing

Similarity hashing algorithms compute hash signatures of sets that allow estimating set similarity without access to the original sets. The following algorithms are currently available:

Usage

ToLongFunction<String> stringHashFunc = s -> Hashing.komihash5_0().hashCharsToLong(s);

Set<String> setA = IntStream.range(0, 90000).mapToObj(Integer::toString).collect(toSet());
Set<String> setB = IntStream.range(10000, 100000).mapToObj(Integer::toString).collect(toSet());
// intersection size = 80000, union size = 100000
// => exact Jaccard similarity of sets A and B is J = 80000 / 100000 = 0.8

int numberOfComponents = 1024;
int bitsPerComponent = 1;
// => each signature will take 1 * 1024 bits = 128 bytes

SimilarityHashPolicy policy =
    SimilarityHashing.superMinHash(numberOfComponents, bitsPerComponent);
SimilarityHasher simHasher = policy.createHasher();

byte[] signatureA = simHasher.compute(ElementHashProvider.ofCollection(setA, stringHashFunc));
byte[] signatureB = simHasher.compute(ElementHashProvider.ofCollection(setB, stringHashFunc));

double fractionOfEqualComponents = policy.getFractionOfEqualComponents(signatureA, signatureB);

// this formula estimates the Jaccard similarity from the fraction of equal components
double estimatedJaccardSimilarity =
    (fractionOfEqualComponents - Math.pow(2., -bitsPerComponent))
        / (1. - Math.pow(2., -bitsPerComponent)); // gives a value close to 0.8

See also SimilarityHashingDemo.java.

Approximate distinct counting

Counting the number of distinct elements exactly requires space that grows linearly with the count. However, there are algorithms that need much less space by counting only approximately. The space-efficiency of these algorithms can be compared by means of the storage factor, defined as the squared relative standard error of the estimator multiplied by the state size in bits:

$\text{storage factor} := (\text{relative standard error})^2 \times (\text{state size})$.
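As a quick sanity check, the storage factor can be computed from concrete numbers. The sketch below plugs in the values from the UltraLogLog usage example in this document (precision 12, about 4 kB of state, roughly 1.2% relative standard error); the exact error constant depends on the estimator, so treat the result as an order-of-magnitude illustration:

```java
public class StorageFactorExample {
    public static void main(String[] args) {
        // values taken from the UltraLogLog usage example:
        // a sketch with precision 12 occupies roughly 4 kB
        // and has a relative standard error of about 1.2%
        double stateSizeInBits = 4096 * 8; // 4 kB = 32768 bits
        double relativeStandardError = 0.012;

        double storageFactor =
            relativeStandardError * relativeStandardError * stateSizeInBits;
        System.out.println(storageFactor); // roughly 4.7
    }
}
```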

This library implements two algorithms for approximate distinct counting: HyperLogLog and UltraLogLog.

Both algorithms share the following properties:

Usage

Hasher64 hasher = Hashing.komihash5_0(); // create a hasher instance

UltraLogLog sketch = UltraLogLog.create(12); // corresponds to a standard error of 1.2% and requires 4kB

sketch.add(hasher.hashCharsToLong("foo"));
sketch.add(hasher.hashCharsToLong("bar"));
sketch.add(hasher.hashCharsToLong("foo"));

double distinctCountEstimate = sketch.getDistinctCountEstimate(); // gives a value close to 2

See also UltraLogLogDemo.java and HyperLogLogDemo.java.

Compatibility

HyperLogLog and UltraLogLog sketches can be reduced to corresponding sketches with a smaller precision parameter p using sketch.downsize(p). UltraLogLog sketches can also be transformed into HyperLogLog sketches with the same precision parameter using HyperLogLog hyperLogLog = HyperLogLog.create(ultraLogLog); as demonstrated in ConversionDemo.java. HyperLogLog can be made compatible with implementations in other libraries that also use a single 64-bit hash value as input. The implementations usually differ only in which bits of the hash value are used for the register index and which bits determine the number of leading (or trailing) zeros. Therefore, compatibility can be achieved by permuting the bits of the hash value accordingly.
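The downsizing and conversion operations described above can be sketched as follows. The method names are those given in the text; the package names and precision values are assumptions for illustration:

```java
import com.dynatrace.hash4j.distinctcount.HyperLogLog;
import com.dynatrace.hash4j.distinctcount.UltraLogLog;
import com.dynatrace.hash4j.hashing.Hasher64;
import com.dynatrace.hash4j.hashing.Hashing;

public class SketchConversionExample {
    public static void main(String[] args) {
        Hasher64 hasher = Hashing.komihash5_0();

        UltraLogLog ultraLogLog = UltraLogLog.create(12);
        ultraLogLog.add(hasher.hashCharsToLong("foo"));
        ultraLogLog.add(hasher.hashCharsToLong("bar"));

        // reduce the precision parameter from 12 to 10
        UltraLogLog downsized = ultraLogLog.downsize(10);

        // convert the UltraLogLog sketch into a HyperLogLog sketch
        // with the same precision parameter
        HyperLogLog hyperLogLog = HyperLogLog.create(downsized);

        System.out.println(hyperLogLog.getDistinctCountEstimate()); // close to 2
    }
}
```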

File hashing

This library contains an implementation of Imohash, which allows fast hashing of files. It is based on the idea of hashing only the beginning, a middle part, and the end of large files, which is usually sufficient to distinguish them. Unlike cryptographic hash algorithms, this method is not suitable for verifying the integrity of files. However, it can be useful for file indexes, for example, to find identical files.

Usage

// create some file in the given path
File file = path.resolve("test.txt").toFile();
try (FileWriter fileWriter = new FileWriter(file, StandardCharsets.UTF_8)) {
    fileWriter.write("this is the file content");
}

// use ImoHash to hash that file
HashValue128 hash = FileHashing.imohash1_0_2().hashFileTo128Bits(file);
// returns 0xd317f2dad6ea7ae56ff7fdb517e33918

See also FileHashingDemo.java.

Consistent hashing

This library contains various algorithms for the distributed agreement on the assignment of hash values to a given number of buckets. In the naive approach, hash values are assigned to buckets with the modulo operation according to bucketIdx = abs(hash) % numBuckets. If the number of buckets changes, the bucket index changes for most hash values. With a consistent hash algorithm, the above expression can be replaced by bucketIdx = consistentBucketHasher.getBucket(hash, numBuckets) to minimize the number of reassignments while still ensuring a fair distribution across all buckets.
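To illustrate why the naive approach is problematic, the following self-contained sketch (not part of hash4j; it uses Math.floorMod instead of abs to keep the index non-negative, since abs(Long.MIN_VALUE) is negative) measures how many hash values change buckets when the bucket count grows from 10 to 11:

```java
import java.util.concurrent.ThreadLocalRandom;

public class NaiveModuloExample {
    public static void main(String[] args) {
        int numHashes = 100000;
        long[] hashValues = ThreadLocalRandom.current().longs(numHashes).toArray();

        // naive assignment: bucketIdx = hash mod numBuckets
        int reassigned = 0;
        for (long hash : hashValues) {
            long bucketWith10 = Math.floorMod(hash, 10);
            long bucketWith11 = Math.floorMod(hash, 11);
            if (bucketWith10 != bucketWith11) reassigned++;
        }

        // with the naive approach, roughly 10/11 (about 91%) of all hash values
        // are reassigned when growing from 10 to 11 buckets; a consistent hash
        // algorithm reassigns only about 1/11 (about 9%) of them
        System.out.println((double) reassigned / numHashes);
    }
}
```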

The following consistent hashing algorithms are available:

Usage

// create a consistent bucket hasher
ConsistentBucketHasher consistentBucketHasher =
    ConsistentHashing.jumpBackHash(PseudoRandomGeneratorProvider.splitMix64_V1());

long[] hashValues = {9184114998275508886L, 7090183756869893925L, -8795772374088297157L};

// determine assignment of hash values to 2 buckets
Map<Integer, List<Long>> assignment2Buckets =
    LongStream.of(hashValues)
        .boxed()
        .collect(groupingBy(hash -> consistentBucketHasher.getBucket(hash, 2)));
// gives {0=[7090183756869893925, -8795772374088297157], 1=[9184114998275508886]}

// determine assignment of hash values to 3 buckets
Map<Integer, List<Long>> assignment3Buckets =
    LongStream.of(hashValues)
        .boxed()
        .collect(groupingBy(hash -> consistentBucketHasher.getBucket(hash, 3)));
// gives {0=[7090183756869893925], 1=[9184114998275508886], 2=[-8795772374088297157]}
// hash value -8795772374088297157 got reassigned from bucket 0 to bucket 2
// probability of reassignment is equal to 1/3

See also ConsistentHashingDemo.java.

Benchmark results

Benchmark results for different revisions can be found here.

Contribution FAQ

Coding style

To ensure that your contribution adheres to our coding style, run the spotlessApply Gradle task.

Python

This project contains Python code. We recommend using a Python virtual environment in a .venv directory. If you are new to virtual environments, follow the steps outlined in the official Python documentation for creation and activation. To install the required dependencies including black, execute pip install -r requirements.txt.

Reference implementations

Reference implementations of hash algorithms are included as git submodules within the reference-implementations directory and can be fetched using git submodule update --init --recursive.