Add support for HDF file format #109

Open
ll4strw opened this issue May 17, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@ll4strw

ll4strw commented May 17, 2024

Good morning, would it be possible to add support for the following data format, please?

https://en.wikipedia.org/wiki/Hierarchical_Data_Format

Thanks

@titusz
Member

titusz commented May 17, 2024

Hello @ll4strw. You can already create standard-conformant ISCCs for HDF files (in fact, for any file type) using the ISCC SUM SubType.

Here is an example script to generate ISCCs for any file type:

"""Create an ISCC-CODE for any file"""
from os.path import basename
import iscc_sdk as idk
import iscc_core as ic


def code_iscc_sum(fp, ld_type="Dataset"):
    # type: (str, str) -> idk.IsccMeta
    """
    Generate a minimal ISCC-CODE (SubType SUM)

    The ISCC-CODE SUM is a combination of the Data-Code and Instance-Code UNITS.
    As such it can handle any file irrespective of the file format.

    :param str fp: Filepath used for ISCC-CODE creation.
    :param str ld_type: JSON-LD schema.org type of the identified file
    :return: ISCC metadata including ISCC-CODE
    :rtype: IsccMeta
    """

    # Prepare basic metadata (the first 4 KiB are enough to guess the mediatype)
    with open(fp, "rb") as infile:
        header = infile.read(4096)
    meta = {
        "@type": ld_type,
        "filename": basename(fp),
        "mediatype": idk.mediatype_guess(header, file_name=basename(fp)),
    }

    # Generate Data-Code and Instance-Code
    data = idk.code_data(fp)
    instance = idk.code_instance(fp)
    iscc_code = ic.gen_iscc_code_v0([data.iscc, instance.iscc])

    # Collect metadata from UNIT processors
    meta.update(instance.dict())
    meta.update(data.dict())
    meta.update(iscc_code)

    return idk.IsccMeta.construct(**meta)



if __name__ == '__main__':
    fp = "/path/to/test.h5"
    iscc_meta = code_iscc_sum(fp)
    print(iscc_meta.json(indent=2))

The output then looks like this:

{
  "@context": "http://purl.org/iscc/context",
  "@type": "Dataset",
  "$schema": "http://purl.org/iscc/schema",
  "iscc": "ISCC:KUABEKSKSEJGHCQXVBIYEODRUPP5S",
  "filename": "test.h5",
  "filesize": 15072,
  "mediatype": "application/x-hdf",
  "datahash": "1e20a851823871a3dfd92c49834eb03ceba5b182f0e5095e6bf532f8774ff240172f"
}

This is the structure of the ISCC SUM:

[Image: structure of the ISCC SUM]

(see INSPECT tab on https://huggingface.co/spaces/iscc/iscc-playground)

The Data-Code component of the ISCC SUM would allow matching HDF files that differ only minimally in the raw bitstream. How well that works in practice will depend on how deterministic HDF file encoding is.
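To illustrate what "matching" means here: similarity-preserving codes are typically compared by counting differing bits, so near-identical inputs yield a small distance. The following is only a sketch under that assumption. `hamming_distance` is an illustrative helper, not an iscc-sdk function, and the two digests are made-up values standing in for decoded Data-Code bodies.

```python
def hamming_distance(a, b):
    """Count the differing bits between two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("digests must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Hypothetical 64-bit digests, NOT taken from real files.
original = bytes.fromhex("0f3a9c4471e2b6d8")
modified = bytes.fromhex("0f3a9c4471e2b6da")

# A small distance suggests the underlying bitstreams are similar.
print(hamming_distance(original, modified))
```

If HDF writers reorder or repack data on every save, even logically identical files could end up far apart by this measure.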

If you want support for higher-level ISCC-UNITs, then it gets more complicated. In the end it is use-case dependent. You would need to think about what HDF file content similarity means and what similarity matching should accomplish. For some guidance see: https://eval.iscc.codes/similarity/

  • Meta-Code: Would require some standardized way of extracting metadata from HDF files
  • Content-Code: Would require "content"-extraction from HDF files. But what is the content modality? TEXT, IMAGE, AUDIO, VIDEO ...
  • Same goes for an eventual Semantic-Code

Happy to discuss any ideas around metadata/content extraction from HDF files.

@titusz titusz added the enhancement New feature or request label May 17, 2024
@ll4strw
Author

ll4strw commented May 17, 2024

Hi @titusz, thanks for your prompt reply. For my HDF files I was using the NONE (0110) subtype with whatever metadata I had available, without extracting it from the file itself. Indeed, as you said, it would be very interesting if metadata extraction and content comparison occurred at the HDF level. While metadata extraction could be trivial with Python, measuring HDF similarity might require some thinking. Luckily https://docs.h5py.org/en/latest/index.html can be of help.

@titusz
Member

titusz commented May 17, 2024

Yes, if you have external metadata then the ISCC NONE SubType is a perfectly valid type for internal use-cases. The problem with using custom/external metadata is interoperability. If other parties only have the HDF file, they likely cannot reproduce the Meta-Code unit. As far as I can see, HDF supports internal metadata (Attributes) attached to Group and Dataset objects. It shouldn't be too hard to create some deterministic metadata extraction from the objects and their attributes.
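A rough sketch of what such a deterministic extraction could look like, assuming h5py is used to read the file. Both function names are made up for illustration and are not part of iscc-sdk, and the sorted, compact JSON is only a stand-in for a proper canonicalization scheme.

```python
import json


def read_hdf_attributes(fp):
    """Collect the attributes of all groups and datasets, keyed by object path.

    Assumes h5py is installed; it is imported lazily so that the
    serialization helper below also works without it.
    """
    import h5py

    def plain(value):
        # h5py returns numpy scalars/arrays and sometimes bytes;
        # convert them to plain Python values.
        if isinstance(value, bytes):
            return value.decode("utf-8", errors="replace")
        if hasattr(value, "tolist"):
            return plain(value.tolist())
        if isinstance(value, (list, tuple)):
            return [plain(v) for v in value]
        return value

    attrs = {}
    with h5py.File(fp, "r") as f:
        # visititems() does not visit the root group, so read it explicitly.
        attrs["/"] = {k: plain(v) for k, v in f.attrs.items()}

        def visit(name, obj):
            attrs["/" + name] = {k: plain(v) for k, v in obj.attrs.items()}

        f.visititems(visit)
    return attrs


def canonical_attributes(attrs):
    """Serialize an attribute mapping to bytes, independent of insertion order."""
    return json.dumps(
        attrs,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
        default=str,  # fallback for values JSON cannot represent directly
    ).encode("utf-8")
```

Sorting keys removes any dependence on the order in which objects or attributes are visited, which is the property a reproducible Meta-Code input would need.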

I guess the best way forward would be to first create some kind of plugin system with hooks for handling specific file types. We could then create separate Python packages that register themselves. There are already some file types supported by the iscc-sdk which I would love to move into separate packages. Otherwise this project will soon suffer from dependency hell :)
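One possible shape for such hooks, purely as a sketch: a registry keyed by mediatype that handlers join via a decorator. None of these names (`register`, `extract_metadata`, `hdf_metadata`) exist in iscc-sdk; they are hypothetical. Separately installed packages could plausibly hook in through Python entry points instead of an in-process decorator.

```python
# Hypothetical registry mapping a mediatype to a metadata handler.
_HANDLERS = {}


def register(*mediatypes):
    """Decorator that registers a handler for one or more mediatypes."""

    def decorator(func):
        for mediatype in mediatypes:
            _HANDLERS[mediatype] = func
        return func

    return decorator


def extract_metadata(fp, mediatype):
    """Dispatch to the registered handler, or return empty metadata."""
    handler = _HANDLERS.get(mediatype)
    return handler(fp) if handler else {}


@register("application/x-hdf")
def hdf_metadata(fp):
    # Placeholder: a real plugin would open the file and read its attributes.
    return {"name": fp}
```

With this layout the core package never imports h5py itself; only the plugin that handles HDF files would depend on it.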

@ll4strw
Author

ll4strw commented May 17, 2024

Speaking of metadata, how can the attributes of an iscc-schema IsccMeta object be filled automatically? In your example above, you create a dictionary that contains all ISCC unit codes plus the overall ISCC code and use it to construct an IsccMeta object. Most of the attributes in the schema will have a None value though. Do I understand correctly that the idea is to add ISCC metadata to the original digital object file so that a consistent ISCC Meta-Code can be produced? Thanks

@titusz
Member

titusz commented May 17, 2024

Well, the metadata situation is tricky. The SDK tries to support metadata extraction, mapping and embedding as well as possible for well-known file types. See: https://github.com/iscc/iscc-sdk/blob/main/iscc_sdk/metadata.py. The extraction, embedding, and mapping logic is in the individual modules per modality. For example: https://github.com/iscc/iscc-sdk/blob/main/iscc_sdk/image.py#L255

We distinguish between different kinds of metadata. Metadata that is used for calculation of the Meta-Code is called Seed-Metadata. Those are only 3 fields: name, description, meta. There are other fields that are embeddable/extractable but they are purely informational and not processed algorithmically.

For industry-specific metadata you would usually define or pick an existing schema, serialize it into a Data-URL and put it into the meta field. The long version: https://ieps.iscc.codes/iep-0002/ and also here https://ieps.iscc.codes/iep-0012/
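As a minimal sketch of the serialization step, assuming JSON metadata and a base64-encoded Data-URL: the helper names are illustrative only, and the sorted compact JSON merely approximates a real canonicalization such as the one the IEPs specify.

```python
import base64
import json


def meta_to_data_url(meta, mediatype="application/json"):
    """Serialize a metadata dict into a base64 Data-URL string."""
    payload = json.dumps(
        meta, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mediatype};base64,{encoded}"


def data_url_to_meta(url):
    """Decode a base64 JSON Data-URL back into a dict."""
    _header, encoded = url.split(",", 1)
    return json.loads(base64.b64decode(encoded))


# Hypothetical domain-specific metadata for an HDF dataset.
seed_meta = {"instrument": "spectrometer-1", "units": "K"}
print(meta_to_data_url(seed_meta))
```

The resulting string would go into the meta field, next to name and description, so that anyone holding the same metadata can reproduce the same input.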

@ll4strw
Author

ll4strw commented Jun 3, 2024

Indeed, metadata definitions are research-field dependent. In a data management system that handles both data and metadata such as iRODS, ISCC codes could nonetheless be of great use. I created a POC at https://github.com/ll4strw/python-irodsclient-iscc/tree/main if you are interested.
