Add support for HDF file format #109

ll4strw · 2024-05-17T08:16:43Z

Good morning, would it be possible to add support for the following data format, please?

https://en.wikipedia.org/wiki/Hierarchical_Data_Format

Thanks

titusz · 2024-05-17T10:21:02Z

Hello @ll4strw. You can already create standard-conformant ISCCs for HDF files (in fact, for any file type) using the ISCC SUM SubType.

Here is an example script to generate ISCCs for any file type:

"""Create an ISCC-CODE for any file"""
from os.path import basename
import iscc_sdk as idk
import iscc_core as ic


def code_iscc_sum(fp, ld_type="Dataset"):
    # type: (str, str) -> idk.IsccMeta
    """
    Generate a minimal ISCC-CODE (SubType SUM)

    The ISCC-CODE SUM is a combination of the Data-Code and Instance-Code UNITS.
    As such it can handle any file irrespective of the file format.

    :param str fp: Filepath used for ISCC-CODE creation.
    :param str ld_type: JSON-LD schema.org type of the identified file
    :return: ISCC metadata including ISCC-CODE
    :rtype: IsccMeta
    """

    # Prepare basic metadata: read the first 4 KiB so the media type
    # can be guessed from the file header
    with open(fp, "rb") as infile:
        header = infile.read(4096)
    meta = {
        "@type": ld_type,
        "filename": basename(fp),
        "mediatype": idk.mediatype_guess(header, file_name=basename(fp)),
    }

    # Generate Data-Code and Instance-Code
    data = idk.code_data(fp)
    instance = idk.code_instance(fp)
    iscc_code = ic.gen_iscc_code_v0([data.iscc, instance.iscc])

    # Collect metadata from UNIT processors
    meta.update(instance.dict())
    meta.update(data.dict())
    meta.update(iscc_code)

    return idk.IsccMeta.construct(**meta)



if __name__ == '__main__':
    fp = "/path/to/test.h5"
    iscc_meta = code_iscc_sum(fp)
    print(iscc_meta.json(indent=2))

The output then looks like this:

{
  "@context": "http://purl.org/iscc/context",
  "@type": "Dataset",
  "$schema": "http://purl.org/iscc/schema",
  "iscc": "ISCC:KUABEKSKSEJGHCQXVBIYEODRUPP5S",
  "filename": "test.h5",
  "filesize": 15072,
  "mediatype": "application/x-hdf",
  "datahash": "1e20a851823871a3dfd92c49834eb03ceba5b182f0e5095e6bf532f8774ff240172f"
}

This is the structure of the ISCC SUM:

(see INSPECT tab on https://huggingface.co/spaces/iscc/iscc-playground)

The Data-Code component of the ISCC SUM would make it possible to match HDF files that differ only slightly in the raw bitstream. How well that works in practice will depend on how deterministic HDF file encoding is.
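As a rough illustration of what "matching" means here: similarity hashes are usually compared by counting differing bits (Hamming distance), where a small distance suggests near-identical data. The sketch below assumes the two digests have already been decoded to raw bytes of equal length. `hamming_distance` and the sample digests are made up for illustration and are not part of iscc-core or iscc-sdk.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bits that differ between two equal-length digests."""
    if len(a) != len(b):
        raise ValueError("digests must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Made-up 64-bit digests that differ only in the last bit
digest_a = bytes.fromhex("00ff00ff00ff00ff")
digest_b = bytes.fromhex("00ff00ff00ff00fe")

distance = hamming_distance(digest_a, digest_b)
print(distance)
```

What distance still counts as a match is a use-case decision and would have to be evaluated on real HDF files.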

If you want support for higher-level ISCC-UNITs, then it gets more complicated. In the end it is use-case dependent. You would need to think about what content similarity means for HDF files and what similarity matching should accomplish. For some guidance see: https://eval.iscc.codes/similarity/

Meta-Code: Would require some standardized way of extracting metadata from HDF files
Content-Code: Would require "content"-extraction from HDF files. But what is the content modality? TEXT, IMAGE, AUDIO, VIDEO ...
The same goes for a possible Semantic-Code.

Happy to discuss any ideas around metadata/content extraction from HDF files.

ll4strw · 2024-05-17T10:59:17Z

Hi @titusz, thanks for your prompt reply. For my HDF files I was using the NONE (0110) subtype with whatever metadata I had available, without extracting it from the file itself. Indeed, as you said, it would be very interesting if metadata extraction and content comparison occurred at the HDF level. While metadata extraction could be trivial with Python, measuring HDF similarity might require some thinking. Luckily https://docs.h5py.org/en/latest/index.html can be of help.

titusz · 2024-05-17T12:08:24Z

Yes, if you have external metadata then the ISCC NONE SubType is a perfectly valid type for internal use-cases. The problem with using custom/external metadata is interoperability. If other parties only have the HDF file, they likely cannot reproduce the Meta-Code unit. As far as I can see, HDF supports internal metadata (Attributes) attached to Group and Dataset objects. It shouldn't be too hard to create some deterministic metadata extraction from the objects and their attributes.
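A minimal sketch of what such a deterministic extraction could look like, assuming h5py is installed. The function names (`to_plain`, `canonical_attrs`, `collect_hdf_attrs`) are hypothetical and not part of iscc-sdk, and sorted-key JSON is just one possible way to get a stable serialization.

```python
import json


def to_plain(value):
    """Convert attribute values (NumPy scalars/arrays, bytes) to JSON-friendly types."""
    if hasattr(value, "tolist"):  # NumPy arrays and scalars
        value = value.tolist()
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    if isinstance(value, (list, tuple)):
        return [to_plain(v) for v in value]
    return value


def canonical_attrs(attrs_by_path):
    """Serialize {object path: {attribute: value}} with a stable key order."""
    plain = {
        path: {key: to_plain(val) for key, val in attrs.items()}
        for path, attrs in attrs_by_path.items()
    }
    # default=str is a fallback for values JSON cannot represent natively
    return json.dumps(
        plain, sort_keys=True, separators=(",", ":"), ensure_ascii=False, default=str
    )


def collect_hdf_attrs(fp):
    """Collect attributes of the root group and of every Group/Dataset below it."""
    import h5py  # third-party; imported lazily so the helpers above work without it

    collected = {}

    def visit(name, obj):
        collected["/" + name] = dict(obj.attrs)
        # returning None tells h5py to continue the traversal

    with h5py.File(fp, "r") as f:
        collected["/"] = dict(f.attrs)
        f.visititems(visit)
    return collected
```

With this, `canonical_attrs(collect_hdf_attrs("/path/to/test.h5"))` would yield the same string regardless of the order in which objects and attributes are stored in the file. Which attributes should actually feed a Meta-Code is a separate question.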

I guess the best way forward would be to first create some kind of plugin system with hooks for handling specific file types. We could then create separate Python packages that register themselves. There are already some file types supported by the iscc-sdk which I would love to move into separate packages. Otherwise this project will soon suffer from dependency hell :)
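To make the idea concrete, here is a rough sketch of a registry with hooks keyed by media type. Nothing like this exists in iscc-sdk at this point; `HANDLERS`, `register`, `extract_metadata` and the placeholder HDF handler are all hypothetical names.

```python
from typing import Callable, Dict

# Maps a media type to a function that extracts metadata from a file path
HANDLERS: Dict[str, Callable[[str], dict]] = {}


def register(mediatype):
    """Decorator that registers a metadata handler for one media type."""

    def decorator(func):
        HANDLERS[mediatype] = func
        return func

    return decorator


@register("application/x-hdf")
def extract_hdf_metadata(fp):
    # Placeholder: a real plugin would read HDF attributes here
    return {"name": fp.rsplit("/", 1)[-1]}


def extract_metadata(fp, mediatype):
    """Dispatch to the registered handler, or return no metadata."""
    handler = HANDLERS.get(mediatype)
    return handler(fp) if handler else {}


print(extract_metadata("/path/to/test.h5", "application/x-hdf"))
```

Separately installed packages could populate such a registry through Python entry points (`importlib.metadata.entry_points`), so the core package would not need to depend on h5py or other format-specific libraries.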

ll4strw · 2024-05-17T13:13:03Z

Speaking of metadata, how can the attributes of an iscc-schema IsccMeta object be filled automatically? In your example above, you create a dictionary that contains all ISCC unit codes plus the overall ISCC code to construct an IsccMeta object. Most of the attributes in the schema will have a None value, though. Do I understand correctly that the idea is to add ISCC metadata to the original digital object file so that a consistent ISCC Meta-Code can be produced? Thanks

titusz · 2024-05-17T15:05:24Z

Well, the metadata situation is tricky. The SDK tries to support metadata extraction, mapping, and embedding as well as possible for well-known file types. See: https://github.com/iscc/iscc-sdk/blob/main/iscc_sdk/metadata.py. The extraction, embedding, and mapping logic is in the individual modules per modality. For example: https://github.com/iscc/iscc-sdk/blob/main/iscc_sdk/image.py#L255

We distinguish between different kinds of metadata. Metadata that is used for calculation of the Meta-Code is called Seed-Metadata. Those are only 3 fields: name, description, meta. There are other fields that are embeddable/extractable but they are purely informational and not processed algorithmically.

For industry-specific metadata you would usually define or pick an existing schema, serialize it into a Data-URL, and put it into the meta field. The long version: https://ieps.iscc.codes/iep-0002/ and also https://ieps.iscc.codes/iep-0012/
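A small sketch of the "serialize it into a Data-URL" step, using only the standard library. The metadata keys are invented for illustration, and sorted-key compact JSON is used here only as a stand-in for a stable serialization. The IEPs linked above define the actual requirements.

```python
import base64
import json


def metadata_to_data_url(metadata, mediatype="application/json"):
    """Serialize a metadata dict to a base64-encoded Data-URL."""
    payload = json.dumps(
        metadata, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mediatype};base64,{encoded}"


def data_url_to_metadata(url):
    """Decode a Data-URL created by metadata_to_data_url back into a dict."""
    _header, encoded = url.split(",", 1)
    return json.loads(base64.b64decode(encoded))


# Hypothetical research metadata for an HDF file
custom = {"instrument": "example-detector", "temperature_K": 293}
meta_field = metadata_to_data_url(custom)
print(meta_field)
```

The resulting string is what would go into the `meta` seed field. Because the keys are sorted before encoding, the same metadata always produces the same Data-URL, which matters if other parties should be able to reproduce the Meta-Code.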

ll4strw · 2024-06-03T14:14:05Z

Indeed, metadata definitions are research-field dependent. In a data management system that handles both data and metadata such as iRODS, ISCC codes could nonetheless be of great use. I created a POC at https://github.com/ll4strw/python-irodsclient-iscc/tree/main if you are interested.

titusz added the enhancement New feature or request label May 17, 2024
