Faster language detection #2

Open
TobiasJu opened this issue Sep 11, 2019 · 5 comments
Hi there,
currently spaCy-langdetect takes quite a while, because it does tokenization, sentence splitting and so on in the background.
I just want the language of the doc. Can I somehow improve the speed of spacy-langdetect?
Regards!
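For context, the usual spacy-langdetect setup looks roughly like this (a sketch, assuming spaCy v2 and the en_core_web_sm model); every nlp() call runs the full pipeline (tagger, parser, NER) before the detector ever sees the text, which is where most of the time goes:

import spacy
from spacy_langdetect import LanguageDetector

# Loading a full model means each nlp() call runs the tagger, parser and NER
# in addition to the language detector appended at the end of the pipeline.
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(LanguageDetector(), name="language_detector", last=True)

doc = nlp("This is a short English sample.")
print(doc._.language)  # e.g. {'language': 'en', 'score': 0.99...}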

TobiasJu changed the title from "Fast" to "Faster language detection" on Sep 11, 2019
TobiasJu (Author) commented Sep 11, 2019

Just to give you a reference: as a test I detected the language of about 4000 docs, averaging 100 words each:

language_guess: 26s
cld2: 18s
language_id: 39s
spaCy-langdetect: 3334s

which is roughly 55 minutes. That makes this package unusable for my use case, which is a pity, because spaCy is an awesome lib!

TobiasJu (Author) commented Sep 13, 2019

So today I dug into your lib and found the detector_factory module with its detect and detect_langs functions, which can be imported with:

 from langdetect import detector_factory
 from langdetect import detect_langs

and then called directly:

detected = detector_factory.detect_langs("Das ist ein Test-Text für die Spracherkennung.")
print(detected)

[de:0.9999983500527911]

The new time is 86s.
I hope this helps future users of this lib.

PS: There is no need for spaCy here at all. This works because in your spacy_langdetect.py you do "from langdetect import detect_langs", which can be imported directly as shown above. That raises the question: why bother importing spaCy and running all the unnecessary pipeline steps for a simple language detection like this?
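For anyone who only needs the document-level language, a minimal pipeline-free sketch along those lines (DetectorFactory.seed is langdetect's documented switch for deterministic results; the texts list is just placeholder data):

from langdetect import DetectorFactory, detect_langs

DetectorFactory.seed = 0  # make langdetect deterministic across runs

texts = [
    "Das ist ein Test-Text für die Spracherkennung.",
    "This is a short English sample for language detection.",
]

for text in texts:
    candidates = detect_langs(text)  # e.g. [de:0.9999...]
    best = candidates[0]             # most probable language first
    print(best.lang, best.prob)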

JonanOribe commented

detector_factory.detect_langs

Thanks for the approach, it really improves the performance.

MichaelJanz commented

Thanks for that great answer!
Since the model is specified when creating the pipeline, I am wondering which model is used by default?

lsmith77 commented

We are currently using fastText, which performs quite well. However, since we already use spaCy models (one for English and one for German) in other parts of the app, I figured it would be interesting to use the spaCy models for language detection as well.

But I am also a bit confused about how this works, since it seems to use only one language model at a time. And it now looks as if this solution is just an integration of langdetect into spaCy rather than a spaCy-based language detection.

We used langdetect in the past and found it not accurate enough compared to fastText.
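For comparison, fastText's off-the-shelf language identification looks roughly like this (a sketch; lid.176.ftz is fastText's published language-identification model, and the local path is an assumption):

import fasttext

# Pre-trained language-identification model from fasttext.cc;
# download lid.176.ftz separately, the local path here is an assumption.
model = fasttext.load_model("lid.176.ftz")

labels, probs = model.predict("Das ist ein Test-Text für die Spracherkennung.")
print(labels[0].replace("__label__", ""), probs[0])  # e.g. de 0.99...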
