Some problematic postag predictions of spacy - when there is punctuation #12478
-
Hello, as you can see, "submission(s)" is segmented into two parts, "submission(s" and ")". When there is punctuation inside a word, not for stylistic purposes but to indicate a new connotation, spaCy stumbles on those cases. Any ideas on how to resolve this problem? Adding some rules, or feeding the training data with such words?
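The reported split can be reproduced without a trained model, since tokenization in spaCy is rule-based; a minimal sketch using a blank English pipeline (which shares the same tokenizer rules as `en_core_web_lg`):

```python
import spacy

# blank English pipeline: same rule-based tokenizer, no model download needed
nlp = spacy.blank("en")
print([t.text for t in nlp("return your submission(s) to X by mail")])
# the trailing ")" is stripped as a suffix, leaving "submission(s" and ")"
```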
Replies: 1 comment
-
Hey dicleozturk,

The Tokenizer component in spaCy is rule-based, so adding more training data would not change the tokenization. We wrote about how the Tokenizer works here: https://spacy.io/usage/linguistic-features#how-tokenizer-works. You could potentially add a pattern to the suffix_search to keep the noun(s) "optional plural" pattern together without splitting. Another option is to add post-processing using retokenizer.merge: https://spacy.io/api/doc#retokenize.

One thing to consider is what the correct POS tag of a noun such as "submission(s)" should be. Is it singular, plural, or both? Each token can have only a single POS tag, so one needs to decide for their…

The last thing to consider is that when modifying the token_match:

```python
import re
import spacy

nlp = spacy.load("en_core_web_lg")
default = spacy.tokenizer._get_regex_pattern(nlp.Defaults.token_match)
# extend token_match with a pattern for a word followed by a literal "(s)",
# e.g. "submission(s)" (pattern reconstructed; the escapes were garbled)
updated = rf"({default}|\w+\(s\))"
nlp.tokenizer.token_match = re.compile(updated).match
doc = nlp("return your submission(s) to X by mail")
print([(token, token.pos_) for token in doc])
```

This prints:

```
[(return, 'VERB'),
 (your, 'PRON'),
 (submission(s), 'PROPN'),
 (to, 'ADP'),
 (X, 'NOUN'),
 (by, 'ADP'),
 (mail, 'NOUN')]
```

As you can see, "submission(s)" is now kept together as a single token.
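The retokenizer.merge post-processing option mentioned above can be sketched as follows. `merge_optional_plurals` is a hypothetical helper name, and a blank pipeline is used so no trained model is needed; the merging itself works the same on a full pipeline's `Doc`:

```python
import spacy

def merge_optional_plurals(doc):
    """Merge a token ending in "(s" with an immediately following ")"."""
    spans = []
    for token in doc[:-1]:
        nxt = doc[token.i + 1]
        if token.text.endswith("(s") and nxt.text == ")":
            spans.append(doc[token.i : nxt.i + 1])
    # collect spans first, then merge them in a single retokenize block
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)
    return doc

nlp = spacy.blank("en")  # blank pipeline: same rule-based tokenizer
doc = merge_optional_plurals(nlp("return your submission(s) to X by mail"))
print([t.text for t in doc])
```

Note that merging happens after tagging in a full pipeline, so the merged token keeps the attributes of the span root unless you pass `attrs` to `merge`.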