
Some problematic postag predictions of spacy - when there is punctuation #12478


Hey dicleozturk,

The Tokenizer component in spaCy is rule-based, so adding more training data would not change the tokenization. We wrote about how the Tokenizer works here: https://spacy.io/usage/linguistic-features#how-tokenizer-works. You could potentially adjust the tokenizer patterns (e.g. the suffix_search rules) to keep the noun(s) "optional plural" pattern together without splitting.
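As a minimal sketch of keeping such tokens together: one way (an assumption on my part, not something tested against your data) is to set the tokenizer's token_match, which takes priority over the prefix/suffix rules, so that a hypothetical "word(s)" pattern is never split. The regex and example sentence below are illustrative only:

```python
import re
import spacy

nlp = spacy.blank("en")  # tokenizer only; no trained pipeline needed

# Hypothetical pattern: treat any "word(s)" string as a single token.
paren_plural = re.compile(r"^\w+\(s\)$")
prev_match = nlp.tokenizer.token_match  # may be None for English

def keep_paren_plural(text):
    # Fall back to the previous token_match (if any) for other strings
    return paren_plural.match(text) or (prev_match(text) if prev_match else None)

nlp.tokenizer.token_match = keep_paren_plural

doc = nlp("Please review the submission(s) before the deadline")
print([t.text for t in doc])  # "submission(s)" should survive as one token
```

Note that token_match only applies to whole whitespace-separated substrings, so "submission(s)," with trailing punctuation would need additional handling.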
Another option is to add post-processing using retokenizer.merge: https://spacy.io/api/doc#retokenize.
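A rough sketch of that second option, merging any "word(s)" sequence back into one token after tokenization (the regex and sentence are illustrative assumptions):

```python
import re
import spacy

nlp = spacy.blank("en")
doc = nlp("Please review the submission(s) before the deadline")

# Find "word(s)" sequences in the raw text and merge the matching spans.
pattern = re.compile(r"\w+\(s\)")
with doc.retokenize() as retokenizer:
    for match in pattern.finditer(doc.text):
        span = doc.char_span(match.start(), match.end())
        if span is not None and len(span) > 1:
            retokenizer.merge(span)

print([t.text for t in doc])
```

Working from character offsets with doc.char_span keeps this robust to however the default rules happened to split the string; if the span boundaries don't align with token boundaries, char_span returns None and the span is skipped.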

One thing to consider is what the correct POS tag of a noun such as "submission(s)" should be. It's neither singular nor plural, or is it both? Each token can have only a single POS tag, so one needs to decide for their…

Answer selected by adrianeboyd
Labels
feat / tagger Feature: Part-of-speech tagger