OntoLex-lemon dictionaries provided by the Applied Computational Linguistics lab at Goethe Universität Frankfurt am Main, Germany, and by the research group Linked Open Dictionaries (LiODi, 2015-2020, funded by BMBF)

max-ionov/acoli-dicts

 
 

ACoLi Dicts

OntoLex-lemon dictionaries provided by the Applied Computational Linguistics (ACoLi) lab at Goethe Universität Frankfurt am Main, Germany, and the associated research group Linked Open Dictionaries (LiODi, 2015-2020).

The project "Linked Open Dictionaries" (LiODi, 2015-2020) is a BMBF-funded research group run by the Applied Computational Linguistics (ACoLi) lab in collaboration with the Institute for Empirical Linguistics at Goethe University Frankfurt. It aims to create Linked Open Data representations of dictionaries and to develop an infrastructure and methodologies for their practical application in language contact studies, mostly in Eurasia and the Caucasus area. See http://acoli.informatik.uni-frankfurt.de/liodi for details. As a technical basis, we employ OntoLex-lemon (https://www.w3.org/2016/05/ontolex/) for data modelling, OLiA (http://purl.org/olia) for representing grammatical information, lexvo (http://lexvo.org) for ISO 639 language identifiers and glottolog (http://glottolog.org) for identifiers of non-ISO-639 language varieties.
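As a rough illustration of how these vocabularies fit together, the following sketch serializes a single lexical entry as Turtle using only string formatting. The `ontolex:` namespace and the lexvo ISO 639-3 URI pattern are the published ones; the entry URI, the helper name `lexical_entry_turtle`, and the use of `dct:language` to attach the language are assumptions made for this example and do not necessarily match how the dictionaries in this repository are modelled.

```python
ONTOLEX = "http://www.w3.org/ns/lemon/ontolex#"
DCT = "http://purl.org/dc/terms/"
LEXVO_ISO639_3 = "http://lexvo.org/id/iso639-3/"


def lexical_entry_turtle(entry_uri: str, written_rep: str, iso639_3: str) -> str:
    """Return a minimal Turtle description of one ontolex:LexicalEntry.

    Hypothetical helper: the choice of dct:language is an assumption,
    not the documented layout of the ACoLi data.
    """
    # Escape backslashes and double quotes so the literal stays valid Turtle.
    literal = written_rep.replace("\\", "\\\\").replace('"', '\\"')
    return (
        f"@prefix ontolex: <{ONTOLEX}> .\n"
        f"@prefix dct: <{DCT}> .\n"
        "\n"
        f"<{entry_uri}> a ontolex:LexicalEntry ;\n"
        f"    dct:language <{LEXVO_ISO639_3}{iso639_3}> ;\n"
        f'    ontolex:canonicalForm [ ontolex:writtenRep "{literal}" ] .\n'
    )


if __name__ == "__main__":
    # The entry URI below is a placeholder, not a real identifier.
    print(lexical_entry_turtle("http://example.org/entry/Haus", "Haus", "deu"))
```

Language varieties without an ISO 639 code would need a glottolog identifier instead of the lexvo URI; that case is not covered by this sketch.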

At the moment, we provide OntoLex-lemon and TIAD-TSV editions of open source dictionaries for more than 400 language varieties and more than 2500 language pairs (stable and experimental), with more than 3000 lexical data sets in total; see the statistics below. Note that these counts exclude most smaller data sets (those with fewer than 10,000 translation pairs). Additional data has been converted but is still awaiting copyright clearance.
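The 10,000-pair threshold can be checked for an individual file with a few lines of Python. This is a minimal sketch that assumes one translation pair per non-empty line of a gzip-compressed TSV file and treats lines starting with `#` as comments. Both are assumptions about the file layout, and the names `count_translation_pairs` and `is_counted` are hypothetical.

```python
import gzip

MIN_PAIRS = 10_000  # threshold quoted in the paragraph above


def count_translation_pairs(path: str) -> int:
    """Count data rows in a gzip-compressed TSV file."""
    count = 0
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            # Skip blank lines and '#' comment lines (assumed convention).
            if line.strip() and not line.startswith("#"):
                count += 1
    return count


def is_counted(path: str, min_pairs: int = MIN_PAIRS) -> bool:
    """Return True if the file reaches the size threshold."""
    return count_translation_pairs(path) >= min_pairs
```

For example, `is_counted("some-dictionary.tsv.gz")` would report whether that file (a placeholder name) would enter the statistics under these assumptions.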

dictionary graph

Overview

| | languages | lexical data sets | license | OntoLex/RDF data | TIAD/TSV data | comments |
|---|---|---|---|---|---|---|
| Apertium | 46 | 55 | GPL | apertium/apertium-rdf-2019-02-03 (`*.rdf.zip`) | apertium/apertium-rdf-2019-02-03 (`trans*tsv.gz`) | modeling based on http://linguistic.linkeddata.es/apertium/, designed for machine translation |
| FreeDict | 45 | 145 | GPL | freedict/freedict-rdf-2019-02-05 (`*/*.ttl.gz`) | freedict/freedict-rdf-2019-02-05 (`*/*.tsv.gz`) | plain word lists, user-generated content |
| DBnary | 119* | 275* | CC-BY-SA 3.0 | external | dbnary/dbnary-tiad-2019-02-16 | * counted only language pairs with 10,000+ entries, user-generated content |
| PanLex | 194* | 1651** | CC0 | panlex/panlex-20191001-csv-rdf | panlex/biling-tsv | * only language pairs with 10,000 entries; ** TIAD-TSV files |
| MUSE | 45 | 107 | CC-BY-NC 4.0 | muse/muse-rdf-2020-06-12 | muse-tsv-2020-06-12 | machine-generated, high-precision wordlist |
| Wikidata | * | * | CC0 | https://www.wikidata.org (external) | wikidata/wikidata-tsv-2020-06-24 | * >400k translation pairs, >90k language pairs, but very sparse |
| OMW | 34 | 40* | open source | external | omw/tsv | * conservative estimate, restricted to combinations of OMW files with identical licenses |
| IDS | 234* | 792*,** | CC-BY 4.0 | ids/ontolex | ids/tsv | * counted only language pairs with >10k translations, ** TIAD-TSV files |
| XDXF | 51 | 107 | GPL | experimental/xdxf/xdxf-rdf-2019-02-22 (`*/*.ttl.gz`) | experimental/xdxf/xdxf-rdf-2019-02-22 (`*/*.tsv.gz`) | experimental |
| free-dict.de | 2 | 1 | "free" | experimental/free-dict.de/free-dict-de-2020-01-02 (`*.ttl.gz`) | experimental/free-dict.de/free-dict-de-2020-01-02 (`*.tsv.gz`) | experimental (partial) |
| StarDict | 32 | 130 | "open"/"free" | experimental/stardict/stardict-2020-01-04 (`*/*.ttl.gz`) | experimental/stardict/stardict-2020-01-04 (`*/*.tsv.gz`) | experimental (partial) |
| total | 430 | 3143 | | | | |

subdirectories

  • stable/ data releases
  • experimental/ work in progress, contains converters/build scripts and resulting data by several individual contributors, including student projects

acknowledgements, licensing and references

The ACoLi Dictionary Graph has been created and continues to be developed at the Applied Computational Linguistics Lab at Goethe Universität Frankfurt, Germany since 2014 in the context of numerous research projects, including

To refer to the dataset as a whole in scientific publications, please cite Chiarcos et al. (2020):

@inproceedings{chiarcos2020acoli,
  title={The ACoLi Dictionary Graph},
  author={Chiarcos, Christian and F{\"a}th, Christian and Ionov, Maxim},
  booktitle={Proceedings of The 12th Language Resources and Evaluation Conference},
  pages={3281--3290},
  year={2020}
}

All datasets are published under open or non-commercial licenses. Our RDF and TIAD-TSV editions are released under the same license as the underlying source data. For detailed acknowledgements and licensing of individual datasets, see the respective subdirectories.
