Some problematic postag predictions of spacy - when there is punctuation #12478
-
Hello, as you can see, "submission(s)" is segmented into two parts, "submission(s" and ")". When there is punctuation inside a word, not for stylistic purposes but to indicate a new connotation, spaCy stumbles on those cases. Any ideas on how to resolve this problem? Adding some rules, or feeding the training data with such words?
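The reported split can be reproduced without a trained model, since tokenization in spaCy is rule-based; a minimal sketch using a blank English pipeline (which shares the same tokenizer rules as `en_core_web_lg`):

```python
import spacy

# blank English pipeline: same rule-based tokenizer, no model download needed
nlp = spacy.blank("en")
print([t.text for t in nlp("return your submission(s) to X by mail")])
# the trailing ")" is stripped as a suffix, leaving "submission(s" and ")"
```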
Replies: 1 comment
-
Hey dicleozturk,

The Tokenizer component in spaCy is rule-based, so adding more training data would not change the tokenization. We wrote about how the Tokenizer works here: https://spacy.io/usage/linguistic-features#how-tokenizer-works. You could potentially add a pattern to the suffix_search to keep the noun(s) "optional plural" pattern together without splitting. Another option is to add post-processing using retokenizer.merge: https://spacy.io/api/doc#retokenize.

One thing to consider is what the correct POS tag of a noun such as "submission(s)" should be. Is it singular, plural, or both? Each token can have only a single POS tag, so one needs to decide for their…

The last thing to consider is that when modifying the token_match:

```python
import re
import spacy

nlp = spacy.load("en_core_web_lg")
default = spacy.tokenizer._get_regex_pattern(nlp.Defaults.token_match)
# extend token_match with a pattern for a word followed by a literal "(s)",
# e.g. "submission(s)" (pattern reconstructed; the escapes were garbled)
updated = rf"({default}|\w+\(s\))"
nlp.tokenizer.token_match = re.compile(updated).match
doc = nlp("return your submission(s) to X by mail")
print([(token, token.pos_) for token in doc])
```

This prints:

```
[(return, 'VERB'),
 (your, 'PRON'),
 (submission(s), 'PROPN'),
 (to, 'ADP'),
 (X, 'NOUN'),
 (by, 'ADP'),
 (mail, 'NOUN')]
```

As you can see, "submission(s)" is now kept together as a single token.
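The retokenizer.merge post-processing option mentioned above can be sketched as follows. `merge_optional_plurals` is a hypothetical helper name, and a blank pipeline is used so no trained model is needed; the merging itself works the same on a full pipeline's `Doc`:

```python
import spacy

def merge_optional_plurals(doc):
    """Merge a token ending in "(s" with an immediately following ")"."""
    spans = []
    for token in doc[:-1]:
        nxt = doc[token.i + 1]
        if token.text.endswith("(s") and nxt.text == ")":
            spans.append(doc[token.i : nxt.i + 1])
    # collect spans first, then merge them in a single retokenize block
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)
    return doc

nlp = spacy.blank("en")  # blank pipeline: same rule-based tokenizer
doc = merge_optional_plurals(nlp("return your submission(s) to X by mail"))
print([t.text for t in doc])
```

Note that merging happens after tagging in a full pipeline, so the merged token keeps the attributes of the span root unless you pass `attrs` to `merge`.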