Multiple languages? #64

Open
yves-chevallier opened this issue Aug 10, 2020 · 5 comments


@yves-chevallier

I have a document written in French and English. Is it possible to have something like:

spelling_lang=['en_US', 'fr_CH']
@dhellmann
Member

That isn't supported today, but should be possible to implement.

The SpellingChecker would need to support loading several dictionaries and only reporting an error if the token cannot be found in any of them. It would also have to track suggestions across all dictionaries and include them all.

The configuration options would need to support specifying multiple languages, as you suggest.

And it might be useful to have a directive to control the dictionaries in use for individual files, although that isn't strictly necessary.

Are you interested in contributing those changes?
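
The checking logic described above could be sketched roughly as follows. This is only an illustration, not the sphinxcontrib-spelling implementation: `MultiDictChecker` and `WordSet` are hypothetical names, and the only assumption about a dictionary is that it offers `check(word)` and `suggest(word)` methods, the same shape as a pyenchant `Dict`.

```python
from typing import Iterable, List, Protocol


class Dictionary(Protocol):
    """Minimal interface assumed here (pyenchant's Dict has these two methods)."""

    def check(self, word: str) -> bool: ...
    def suggest(self, word: str) -> List[str]: ...


class WordSet:
    """Hypothetical stand-in dictionary backed by a plain set, for illustration."""

    def __init__(self, words: Iterable[str]) -> None:
        self._words = set(words)

    def check(self, word: str) -> bool:
        return word in self._words

    def suggest(self, word: str) -> List[str]:
        # Naive suggestion: known words sharing the first letter.
        return sorted(w for w in self._words if w[:1] == word[:1])


class MultiDictChecker:
    """Hypothetical checker: a word is valid if any dictionary accepts it."""

    def __init__(self, dictionaries: Iterable[Dictionary]) -> None:
        self.dictionaries = list(dictionaries)

    def check(self, word: str) -> bool:
        return any(d.check(word) for d in self.dictionaries)

    def suggest(self, word: str) -> List[str]:
        # Merge suggestions from every dictionary, keeping first-seen order.
        merged: List[str] = []
        for d in self.dictionaries:
            for s in d.suggest(word):
                if s not in merged:
                    merged.append(s)
        return merged


english = WordSet(["hello", "house"])
french = WordSet(["bonjour", "maison", "heure"])
checker = MultiDictChecker([english, french])
```

With real dictionaries, the list passed to the checker would presumably be built from the configured languages, for example one dictionary per entry of a `spelling_lang` list.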

@yves-chevallier
Author

I don't know how much I am interested. I am writing fairly long documentation (in French with some English) using Sphinx, and it is very important for me to have a CI job that runs a rough spell check. However, I didn't find any good package to do this, and I am not really convinced by enchant, which doesn't have a good tokenizer...

For example, terms such as Backus-Naur should be written with a hyphen and supported in the dictionary as is. Currently I have two words in my dictionary, Backus and Naur, because the tokenizer doesn't understand compound words. Also, some words cannot be written with a capital letter, such as C keywords (while, for, return). sphinxcontrib.spelling should therefore support the text in code-block directives, and it should support the language keywords by default. Another very annoying and important issue with the spelling is the way user dictionaries work. I would much prefer support for regex patterns, such as [Ee]at(s?|en)|ate for the verb eat, or [Mm]ange(s|ons|z|nt|ai[st])... for manger in French.
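
To make the regex idea concrete, here is a minimal sketch of a pattern-based word list. It is not an existing sphinxcontrib.spelling feature: `compile_patterns` and `is_accepted` are hypothetical helpers, and the one-pattern-per-line format is an assumption. The French pattern is used only as far as it is spelled out above.

```python
import re
from typing import Iterable, List, Pattern


def compile_patterns(lines: Iterable[str]) -> List[Pattern[str]]:
    """Compile one regex per non-empty line (assumed file format)."""
    return [re.compile(line.strip()) for line in lines if line.strip()]


def is_accepted(word: str, patterns: Iterable[Pattern[str]]) -> bool:
    """Return True if the whole word matches at least one pattern."""
    return any(p.fullmatch(word) for p in patterns)


patterns = compile_patterns([
    r"[Ee]at(s?|en)|ate",            # English verb "eat"
    r"[Mm]ange(s|ons|z|nt|ai[st])",  # French "manger", partial list of endings
])

for word in ["Eats", "eaten", "ate", "eating", "mangeons"]:
    print(word, is_accepted(word, patterns))
```

Using `fullmatch` rather than `search` matters here: it keeps a pattern for "eat" from silently accepting longer words such as "eating" or "create".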

It seems sphinxcontrib.spelling is the best candidate for now, but not a good one for French :(

@dhellmann
Member

Yes, I suppose the quality of support for French terms depends on the underlying library for tokenizing and the dictionary for various conjugated forms of words.

It would probably be possible to support a tokenizer that recognizes technical terms like Backus-Naur, but I haven't looked into that because I haven't needed it myself, yet.
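
As a rough idea of what such a tokenizer might look like, the sketch below keeps hyphenated compounds together and falls back to checking the parts when the compound itself is not listed. The regex, `tokenize`, and `check_token` are assumptions made for illustration, not the tokenizer shipped with enchant or sphinxcontrib-spelling.

```python
import re
from typing import List, Set

# Runs of letters (accented ones included) joined by single internal hyphens.
WORD_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def tokenize(text: str) -> List[str]:
    """Split text into words, keeping compounds like 'Backus-Naur' whole."""
    return WORD_RE.findall(text)


def check_token(token: str, known: Set[str]) -> bool:
    """Accept a compound listed as is; otherwise require every part to be known."""
    if token in known:
        return True
    return all(part in known for part in token.split("-"))


known = {"The", "Backus-Naur", "form", "is", "well", "known"}
tokens = tokenize("The Backus-Naur form is well-known.")
print(tokens)
print([t for t in tokens if not check_token(t, known)])
```

With this approach the word list only needs the single entry Backus-Naur, and a stray "Backus" on its own would still be reported.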

Language-specific terms within code-blocks are interesting. Perhaps the tokenizer for the syntax highlighter could be reused for that.

@dhellmann
Member

I should also say that most of the code base for sphinxcontrib-spelling doesn't care about which underlying spelling checker is used, so if there is a different library that works better for other languages we could make that pluggable (either based on the language or based on a new configuration option) and hide the differences in the SpellingChecker class.
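
One way the pluggable part could look is a small registry that maps a language code to a backend factory. This is a sketch under stated assumptions: `BACKENDS`, `register_backend`, `get_backend`, and `SetBackend` are hypothetical names, and a real backend would wrap an actual spelling library instead of a set.

```python
from typing import Callable, Dict, Set


class SetBackend:
    """Hypothetical backend backed by a set; a real one would wrap a library."""

    def __init__(self, words: Set[str]) -> None:
        self._words = words

    def check(self, word: str) -> bool:
        return word in self._words


# Registry mapping a language code to a factory that builds a backend.
BACKENDS: Dict[str, Callable[[], SetBackend]] = {}


def register_backend(lang: str, factory: Callable[[], SetBackend]) -> None:
    BACKENDS[lang] = factory


def get_backend(lang: str, default: str = "en_US") -> SetBackend:
    """Return the backend registered for lang, else the default language's."""
    factory = BACKENDS.get(lang) or BACKENDS[default]
    return factory()


register_backend("en_US", lambda: SetBackend({"house"}))
register_backend("fr_CH", lambda: SetBackend({"maison"}))

print(get_backend("fr_CH").check("maison"))
```

The caller only ever sees `check`, so the differences between libraries stay hidden behind whatever class plays the role of SpellingChecker.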

@bmrec

bmrec commented Aug 6, 2021

I vote for this feature. For now I use a workaround: a merged dictionary (en+ru).
