Faster language detection #2

Open
TobiasJu opened this issue Sep 11, 2019 · 5 comments
Hi there,
currently spaCy-langdetect takes quite a while, because it does tokenization, sentence splitting and so on in the background.
I just want the language of the doc. Can I somehow improve the speed of spacy-langdetect?
Regards!
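For context, the usual spacy-langdetect setup looks roughly like this (a sketch, assuming spaCy v2 and the en_core_web_sm model); every nlp() call runs the full pipeline (tagger, parser, NER) before the detector ever sees the text, which is where most of the time goes:

import spacy
from spacy_langdetect import LanguageDetector

# Loading a full model means each nlp() call runs the tagger, parser and NER
# in addition to the language detector appended at the end of the pipeline.
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(LanguageDetector(), name="language_detector", last=True)

doc = nlp("This is a short English sample.")
print(doc._.language)  # e.g. {'language': 'en', 'score': 0.99...}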

TobiasJu changed the title from "Fast" to "Faster language detection" on Sep 11, 2019
TobiasJu (Author) commented Sep 11, 2019

Just to give you a reference: as a test I detected the language of about 4000 docs, averaging 100 words each:

language_guess: 26s
cld2: 18s
language_id: 39s
spaCy-langdetect: 3334s

which is roughly 55 minutes. That makes this package unusable for my use case, which is a pity, because spaCy is an awesome lib!

TobiasJu (Author) commented Sep 13, 2019

So today I dug into your lib and found the detector_factory module with its detect and detect_langs functions, which can be imported with:

 from langdetect import detector_factory
 from langdetect import detect_langs

and then called directly:

detected = detector_factory.detect_langs("Das ist ein Test-Text für die Spracherkennung.")
print(detected)

[de:0.9999983500527911]

The new time is 86s.
I hope this helps future users of this lib.

PS: There is no need for spaCy here at all. This works because in your spacy_langdetect.py you do "from langdetect import detect_langs", which can be imported directly as shown above. That raises the question: why bother importing spaCy and running all the unnecessary pipeline steps for a simple language detection like this?
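For anyone who only needs the document-level language, a minimal pipeline-free sketch along those lines (DetectorFactory.seed is langdetect's documented switch for deterministic results; the texts list is just placeholder data):

from langdetect import DetectorFactory, detect_langs

DetectorFactory.seed = 0  # make langdetect deterministic across runs

texts = [
    "Das ist ein Test-Text für die Spracherkennung.",
    "This is a short English sample for language detection.",
]

for text in texts:
    candidates = detect_langs(text)  # e.g. [de:0.9999...]
    best = candidates[0]             # most probable language first
    print(best.lang, best.prob)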

JonanOribe commented

detector_factory.detect_langs

Thanks for the approach, it really improves the performance.

MichaelJanz commented

Thanks for that great answer!
Since the model is specified when creating the pipeline, I am wondering which model is used by default?

lsmith77 commented

We are currently using fastText, which performs quite well. However, since we already use spaCy models (one for English and one for German) in other parts of the app, I figured it would be interesting to use the spaCy models for language detection as well.

But I am also a bit confused about how this works, since it seems to use only one language model at a time. And it now looks as if this solution is just an integration of langdetect into spaCy rather than a spaCy-based language detection.

We used langdetect in the past and found it not accurate enough compared to fastText.
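For comparison, fastText's off-the-shelf language identification looks roughly like this (a sketch; lid.176.ftz is fastText's published language-identification model, and the local path is an assumption):

import fasttext

# Pre-trained language-identification model from fasttext.cc;
# download lid.176.ftz separately, the local path here is an assumption.
model = fasttext.load_model("lid.176.ftz")

labels, probs = model.predict("Das ist ein Test-Text für die Spracherkennung.")
print(labels[0].replace("__label__", ""), probs[0])  # e.g. de 0.99...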
