Detecting dataset changes

A question that we may ask before running expensive computations on a dataset is: has my dataset changed since the last time it was processed? If the dataset and the computation have not changed, then the output will be the same and the computation can be skipped.

Comparing a dataset with its previous version requires storing two copies of the data, which can be costly for larger datasets. Instead, we can store a hash of the data and compare hashes.
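As a sketch of the idea, the check could look like the following (the hash store and function names are illustrative, not part of any library):

    # Illustrative sketch: skip the computation when the stored hash matches
    def process_if_changed(dataset_path, hash_store, compute_fn, hash_fn):
        current_hash = hash_fn(dataset_path)
        if hash_store.get(dataset_path) == current_hash:
            return None  # dataset unchanged since the last run: skip the computation
        result = compute_fn(dataset_path)
        hash_store[dataset_path] = current_hash  # remember the hash for the next run
        return result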

Hashing datasets

Computing the hash value for a dataset, such as a file, scales linearly with the file size. Table 1 shows results for mock datasets of various sizes:

Records     File size  Time
1           402B       0.189ms
10          3KB        0.292ms*
100         33KB       0.194ms
1,000       329KB      0.400ms
10,000      3.3MB      3.173ms
100,000     32.8MB     24.368ms
1,000,000   327.6MB    0.223s
10,000,000  3.28GB     2.223s

Table 1: SHA256 hash of the mock data file. The hash is updated in blocks of 8KB. The data is generated with Faker. The reported time is the mean of 5 trials of 10 repetitions each. *Slightly elevated due to the 8KB block size used for reads.

When working with larger datasets, the time needed to compute the hash increases from negligible (<< 1s) to noticeable (a couple of seconds). Note that the reported time is 1/10th of the total time of running the hashing 10 times, which allows for more accurate measurement. For a compute load that takes an hour, this overhead may be acceptable if a cache hit means that the job does not need to run at all. In practice, however, we can still do better.
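For reference, a minimal sketch of the block-wise SHA256 hashing used in the benchmark above (using hashlib and 8KB reads) could look like this:

    import hashlib

    def hash_file(path, block_size=8192):
        """Compute the SHA256 digest of a file, reading it in 8KB blocks."""
        sha = hashlib.sha256()
        with open(path, "rb") as f:
            while block := f.read(block_size):
                sha.update(block)
        return sha.hexdigest()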

Approximate Hashers

Many datasets have metadata associated with them that can be used as an alternative to hashing the data itself. This metadata remains the same if the data has not changed, and almost certainly changes when the data is modified. Examples are the size, the modification date, or the ETag. The user is responsible for selecting, from the available metadata, the subset that is most collision-resistant given the data generation process. Retrieving the metadata for a dataset takes constant time, independent of the dataset size.
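For example, for a local file the size and modification time are available from a single stat call, no matter how large the file is (the file name below is just an example):

    import os

    # Constant-time metadata lookup: no file contents are read
    stat = os.stat("data/transactions.csv")
    metadata = {"size": stat.st_size, "modification_time": stat.st_mtime}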

Records              Full file hashing  Fast approximate hashing  Speedup ratio
1 (402B)             0.189ms            0.017ms                   11x
10 (3KB)             0.292ms            0.017ms                   17x
100 (33KB)           0.194ms            0.016ms                   12x
1,000 (329KB)        0.400ms            0.017ms                   23x
10,000 (3.3MB)       3.173ms            0.018ms                   180x
100,000 (32.8MB)     24.368ms           0.019ms                   1,313x
1,000,000 (327.6MB)  0.223s             0.019ms                   11,924x
10,000,000 (3.28GB)  2.223s             0.019ms                   114,626x

Table 2: Comparison of SHA256 hashing of the full file and fast approximate hashing for various numbers of records. The reported time is the mean of 5 trials. Each trial consists of 10 repetitions, for which the average time is listed. The benchmark script can be found in timing.py.

In the above table, we can see that the larger the dataset gets, the higher the speedup ratio becomes. To create an ApproximateHasher in pycodehash, we only need to implement the logic that obtains this metadata as a dictionary in the collect_metadata method; the class does the rest. The hash of the metadata is invariant to the ordering of the keys.
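As an illustration, a custom hasher for a hypothetical dataset type could look like the sketch below. The import path and the exact signature of collect_metadata are assumptions, and the RemoteFileHash class is made up for the example:

    from urllib.request import Request, urlopen

    from pycodehash.datasets import ApproximateHasher  # import path is an assumption

    class RemoteFileHash(ApproximateHasher):
        """Hypothetical hasher for a file served over HTTP, based on response headers."""

        def collect_metadata(self, url):
            # A HEAD request returns only headers, independent of the file size
            response = urlopen(Request(url, method="HEAD"))
            return {
                "etag": response.headers.get("ETag"),
                "last_modified": response.headers.get("Last-Modified"),
                "content_length": response.headers.get("Content-Length"),
            }

Because collect_metadata returns a plain dictionary, the resulting hash does not depend on the order in which the keys are added.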

Supported dataset types

The following approximate hashers are implemented at this time:

Dataset Type  Implemented Metadata
Local Files   File Size, Modification Date
Files on S3   ETag
Hive Tables   Size, Creation Date, Is Partitioned
Python Files  Magic, Size, Timestamp, Hash, Bit Field

Feel free to open a Pull Request if you would like to contribute additional dataset types or metadata.

Incremental loads

Many datasets consist of collections of objects, or subsets of data:

  • Hive table partitions, e.g. data grouped per day or month
  • Directories of images

Hashing these datasets at the object level allows recomputing only the parts of the data that have changed. For these cases, pycodehash has the PartitionedApproximateHasher base class. It requires an ApproximateHasher, and concrete implementations provide the collect_partitions method. Currently, there is a LocalDirectoryHash implementation that recursively collects the hashes for each file in a directory. Another implementation of the PartitionedApproximateHasher is LocalFilesHash, which operates on a list of files.

The PartitionedApproximateHasher is itself an ApproximateHasher, which means that the compute_hash method is supported. This hash is invariant to the ordering of the partitions.
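A usage sketch, assuming the classes are importable from pycodehash.datasets and accept a directory path and a list of file paths respectively (both assumptions), could look like:

    from pycodehash.datasets import LocalDirectoryHash, LocalFilesHash  # import path is an assumption

    # Hash a directory by hashing each contained file (partition) individually
    directory_hasher = LocalDirectoryHash()
    print(directory_hasher.compute_hash("data/partitions/"))

    # Hash an explicit list of files
    files_hasher = LocalFilesHash()
    print(files_hasher.compute_hash(["data/2024-01.parquet", "data/2024-02.parquet"]))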