Add support for HDF file format #109
Comments
Hello @ll4strw. You can already create standard conformant ISCCs for HDF files (actually for any file type) using the ISCC SUM SubType. Here is an example script to generate ISCCs for any file type:

"""Create an ISCC-CODE for any file"""
from os.path import basename
import iscc_sdk as idk
import iscc_core as ic


def code_iscc_sum(fp, ld_type="Dataset"):
    # type: (str, str) -> idk.IsccMeta
    """
    Generate a minimal ISCC-CODE (SubType SUM)

    The ISCC-CODE SUM is a combination of the Data-Code and Instance-Code UNITS.
    As such it can handle any file irrespective of the file format.

    :param str fp: Filepath used for ISCC-CODE creation.
    :param str ld_type: JSON-LD schema.org type of the identified file
    :return: ISCC metadata including ISCC-CODE
    :rtype: IsccMeta
    """
    # Prepare basic metadata
    with open(fp, "rb") as infile:
        data = infile.read(4096)
    meta = {
        "@type": ld_type,
        "filename": basename(fp),
        "mediatype": idk.mediatype_guess(data, file_name=basename(fp)),
    }
    # Generate Data-Code and Instance-Code
    data = idk.code_data(fp)
    instance = idk.code_instance(fp)
    iscc_code = ic.gen_iscc_code_v0([data.iscc, instance.iscc])
    # Collect metadata from UNIT processors
    meta.update(instance.dict())
    meta.update(data.dict())
    meta.update(iscc_code)
    return idk.IsccMeta.construct(**meta)


if __name__ == '__main__':
    fp = "/path/to/test.h5"
    iscc_meta = code_iscc_sum(fp)
    print(iscc_meta.json(indent=2))

The output then looks like this:
{
  "@context": "http://purl.org/iscc/context",
  "@type": "Dataset",
  "$schema": "http://purl.org/iscc/schema",
  "iscc": "ISCC:KUABEKSKSEJGHCQXVBIYEODRUPP5S",
  "filename": "test.h5",
  "filesize": 15072,
  "mediatype": "application/x-hdf",
  "datahash": "1e20a851823871a3dfd92c49834eb03ceba5b182f0e5095e6bf532f8774ff240172f"
}

This is the structure of the ISCC SUM: (see INSPECT tab on https://huggingface.co/spaces/iscc/iscc-playground)

The Data-Code component of the ISCC SUM would allow matching HDF files that have minimal changes in the raw bitstream. How well that works in practice will depend on how deterministic HDF file encoding is. If you want support for higher-level ISCC-UNITs, then it gets more complicated. In the end it is use-case dependent. You would need to think about what HDF file content similarity means and what similarity matching should accomplish. For some guidance see: https://eval.iscc.codes/similarity/
Happy to discuss any ideas around metadata/content extraction from HDF files.
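To make the bitstream-matching remark a bit more concrete: similarity-preserving codes are typically compared by counting the bits in which they differ. The snippet below is only a minimal sketch of that comparison on two equal-length hex digests. The function names and the sample digests are made up for illustration, and this is not the iscc-core API.

```python
def hamming_distance(hex_a: str, hex_b: str) -> int:
    """Count the bits that differ between two equal-length hex digests."""
    if len(hex_a) != len(hex_b):
        raise ValueError("digests must have the same length")
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


def similarity(hex_a: str, hex_b: str) -> float:
    """Return the share of matching bits, from 0.0 to 1.0."""
    nbits = len(hex_a) * 4  # each hex character encodes 4 bits
    return 1.0 - hamming_distance(hex_a, hex_b) / nbits


if __name__ == "__main__":
    # Hypothetical 64-bit digests, not derived from real files
    a = "a851823871a3dfd9"
    b = "a851823871a3dfd8"
    print(hamming_distance(a, b), similarity(a, b))
```

Whether two HDF files that are "almost the same" end up with a small bit distance depends on how stable the HDF writer's byte layout is, which is the open question raised above.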
Hi @titusz, thanks for your prompt reply. For my HDF files I was using the NONE (0110) subtype with any available metadata I had, without extracting it from the file itself. Indeed, as you said, it would be very interesting if metadata extraction and content comparison occurred at the HDF level. While metadata extraction could be trivial with Python, measuring HDF similarity might require some thinking. Luckily https://docs.h5py.org/en/latest/index.html can be of help.
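As a rough illustration of the metadata-extraction half of this comment, attributes can be gathered by walking the group hierarchy. This is a sketch under assumptions: `collect_attrs` and `read_hdf_attrs` are hypothetical helper names, the traversal relies only on `.attrs` and `.items()` as exposed by h5py groups, and it does not handle dangling links or convert NumPy values for JSON output.

```python
def collect_attrs(node, path="/"):
    """Recursively gather attributes from a group-like object.

    Works with anything exposing `.attrs` (a mapping) and, for groups,
    `.items()` yielding (name, child) pairs, as h5py groups do.
    """
    found = {}
    attrs = dict(node.attrs)
    if attrs:
        found[path] = attrs
    # Datasets have no items(); only groups are traversed further
    if hasattr(node, "items"):
        for name, child in node.items():
            child_path = path.rstrip("/") + "/" + name
            found.update(collect_attrs(child, child_path))
    return found


def read_hdf_attrs(fp):
    """Open an HDF5 file read-only and collect all attributes (needs h5py)."""
    import h5py  # imported lazily so collect_attrs has no hard dependency

    with h5py.File(fp, "r") as f:
        return collect_attrs(f)
```

The result maps each object path to its attribute dictionary, which could then be mapped onto whatever schema a given research field uses.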
Yes, if you have external metadata then the ISCC NONE SubType is a perfectly valid type for internal use-cases. The problem with using custom/external metadata is interoperability. If other parties only have the HDF file, they likely cannot reproduce the Meta-Code unit. As far as I see HDF supports internal metadata ( I guess the best way forward would be to first create some kind of plugin system with hooks for handling specific file types. We could then create separate Python packages that could register themselves. There are already some file types supported by the
Speaking about metadata, how can the attributes in an
Well, the metadata situation is tricky. The SDK tries to support metadata extraction, mapping, and embedding as well as possible for well-known file types. See: https://github.com/iscc/iscc-sdk/blob/main/iscc_sdk/metadata.py. The extraction, embedding, and mapping logic is in the individual modules per modality. For example: https://github.com/iscc/iscc-sdk/blob/main/iscc_sdk/image.py#L255 We distinguish between different kinds of metadata. Metadata that is used for calculation of the Meta-Code is called Seed-Metadata. Those are only 3 fields: For industry-specific metadata you would usually define or pick an existing schema, serialize it into a Data-URL and put it into the
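For the "serialize it into a Data-URL" step, a minimal stdlib-only sketch might look like this. The helper names are hypothetical, and the sorted-key compact JSON used here merely stands in for whatever canonicalization the ISCC tooling actually applies, which matters if the result feeds into a reproducible code.

```python
import base64
import json


def to_data_url(metadata: dict, mediatype: str = "application/json") -> str:
    """Serialize a dict to a base64 Data-URL with deterministic key order."""
    payload = json.dumps(
        metadata, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    encoded = base64.b64encode(payload.encode("utf-8")).decode("ascii")
    return f"data:{mediatype};base64,{encoded}"


def from_data_url(url: str) -> dict:
    """Decode a Data-URL produced by to_data_url back into a dict."""
    header, encoded = url.split(",", 1)
    if not header.endswith(";base64"):
        raise ValueError("expected a base64 Data-URL")
    return json.loads(base64.b64decode(encoded).decode("utf-8"))
```

Deterministic serialization is the important design point: two parties holding the same metadata should arrive at byte-identical Data-URLs.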
Indeed, metadata definitions are research-field dependent. In a data management system that handles both data and metadata such as iRODS, ISCC codes could nonetheless be of great use. I created a POC at https://github.com/ll4strw/python-irodsclient-iscc/tree/main if you are interested.
Good morning, would it be possible to add support for the following data format, please?
https://en.wikipedia.org/wiki/Hierarchical_Data_Format
Thanks