IEP: | 0008 |
---|---|
Title: | ISCC-UNIT Data-Code |
Author: | Titusz Pan tp@iscc.foundation |
Comments: | #13 |
Status: | DRAFT |
Type: | Core |
License: | CC-BY-4.0 |
Created: | {{ git_creation_date_localized }} |
Updated: | {{ git_revision_date_localized }} |
!!! note
This document is a **DRAFT** contributed as input to
[ISO TC 46/SC 9/WG 18](https://www.iso.org/committee/48836.html). The final version is
developed at the International Organization for Standardization as
[ISO/DIS 24138](https://www.iso.org/standard/77899.html)
- The Data-Code shall be a similarity hash for any kind of data regardless of its media type.
- The Data-Code shall cluster digital assets that have near-identical data.
- Small differences (as a proportion of the whole) in referent data shall yield identical Data-Codes.
- More significant differences in referent data shall produce similar Data-Codes that can be compared against each other to estimate the data-similarity of the referents.
- The Data-Code shall be resistant to data shifting and reordering sequences of data within referent data.
!!! note "NOTE"
Changes of the Data-Code do not reflect semantic or syntactic changes of the content.
The Data-Code shall have the data format illustrated in Figure 10:
![Figure 10 - Data format of the Data-Code](images/iscc-iep-0008-f10-data-code.png) Figure 10 - Data format of the Data-Code!!! example "EXAMPLE 1: 64-bit Data-Code in its canonical form:"
ISCC:GAAWAIBQLNWP7X32
!!! example "EXAMPLE 2: 256-bit Data-Code in its canonical form:"
ISCC:GADWAIBQLNWP7X32J3INMAMDUJ4QMN67BBQKVTVZIWHXQ7QJIKHYTBY
The input for calculating the Data-Code shall be the bytes of a file, without reference to their meaning or structure.
Data-Code processing shall generate the following output elements:
- iscc: the Data-Code in its canonical form (required).
An ISCC processor shall calculate the Data-Code as follows:
- Split the data into variable sized chunks with an average chunk size of 1024 bytes using the content defined chunking (CDC) algorithm.
- Calculate the 32-bit integer hash of each chunk using the XXH32 algorithm.
- Apply the minhash algorithm to the array of 32-bit integers to calculate the ISCC-BODY of the Data-Code with appropriate length.
!!! note "NOTE"
For further technical details see source-code in modules
[code_data.py](https://github.com/iscc/iscc-core/blob/main/iscc_core/code_data.py)
and [minhash.py](https://github.com/iscc/iscc-core/blob/main/iscc_core/minhash.py) of the
[reference implementation](https://github.com/iscc/iscc-core).
An implementation of the Data-Code algorithm shall be regarded as conforming to the standard if it creates the same Data-Code as the reference implementation for the same data input.
!!! note "NOTE"
The ISCC reference implementation uses the open source XXHASH library [^1] for XXH32 chunk
hashing and appropriate use of this software will generate the same codes as the
reference implementation.