Create dataset loader for Indo_MultiModal_LAION #308

Open
SamuelCahyawijaya opened this issue Oct 2, 2022 · 1 comment

@SamuelCahyawijaya
Member

NusaCatalogue: https://indonlp.github.io/nusa-catalogue/card.html?id_mm_laion

| Dataset | id_mm_laion |
|---|---|
| Description | Indo_MultiModal_LAION is a translated subset of the LAION-400M dataset with 70M image-text pairs, intended for vision-language pre-training in Indonesian. LAION-400M is a dataset of 400M English (image, text) pairs, filtered with OpenAI's CLIP by computing the cosine similarity between the text and image embeddings and dropping pairs with a similarity below 0.3. The 0.3 threshold was determined through human evaluation and appeared to be a good heuristic for estimating whether image and text content match semantically. The image-text pairs were extracted from the Common Crawl web data dump and come from random web pages crawled between 2014 and 2021. More info on LAION-400M: https://laion.ai/blog/laion-400-open-dataset/. |
| License | From LAION-400M: We distribute the metadata dataset (the parquet files) under the most open Creative Common CC-BY 4.0 license, which poses no particular restriction. The images are under their copyright. |
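
For anyone picking this up, the filtering rule in the description (drop pairs whose CLIP text/image cosine similarity is below 0.3) can be illustrated with a minimal sketch. This is not LAION's actual pipeline: the two-dimensional embeddings are toy values rather than real CLIP outputs, and the names `cosine_similarity`, `filter_pairs`, `image_emb`, and `text_emb` are hypothetical, chosen only for this example.

```python
import math

# Threshold quoted in the dataset description above.
SIMILARITY_THRESHOLD = 0.3


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pairs(pairs, threshold=SIMILARITY_THRESHOLD):
    """Keep pairs whose image/text similarity is at or above the threshold."""
    return [
        p for p in pairs
        if cosine_similarity(p["image_emb"], p["text_emb"]) >= threshold
    ]


# Toy embeddings standing in for CLIP outputs (assumption: real ones are much larger).
toy_pairs = [
    {"caption": "seekor kucing", "image_emb": [1.0, 0.0], "text_emb": [0.8, 0.6]},
    {"caption": "iklan acak", "image_emb": [1.0, 0.0], "text_emb": [0.0, 1.0]},
]

kept = filter_pairs(toy_pairs)
print([p["caption"] for p in kept])
```

The first toy pair has a similarity of about 0.8 and is kept; the second has orthogonal embeddings (similarity 0.0) and is dropped.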
@acul3
Contributor

acul3 commented Oct 4, 2022

#self-assign

3 participants