divergence in token POS and TAG #11730

asquare · 2022-11-01T22:16:39Z


I'm seeing a divergence between a token's POS and TAG, and I'm wondering which one is authoritative. For example, in a certain sentence starting with "get" -

{'text': 'get', 'lemma': 'get', 'pos': 'VERB', 'dep': 'ROOT', 'tag': 'VB'}

Here POS and TAG are in agreement, but with a slightly different sentence also starting with "get" -

{'text': 'get', 'lemma': 'get', 'pos': 'AUX', 'dep': 'aux', 'tag': 'VB'}

Here POS and TAG have diverged. Is it safer to rely on TAG rather than POS? I've been experimenting with different model sizes; this behavior was observed with en_core_web_lg. Thanks

spaCy version: 3.4.2
Platform: macOS-13.0-arm64-arm-64bit
Python version: 3.10.6
Pipelines: en_core_web_md (3.4.1), en_core_web_sm (3.4.1), en_core_web_trf (3.4.1), en_core_web_lg (3.4.1)

Answered by polm

polm · 2022-11-02T03:33:46Z


Quick markdown note: JSON doesn't allow single quotes, so if you mark your blocks as JSON they show up highlighted completely in red as invalid. This looks like Python repr output, so I changed the blocks to Python. It would also be fine to not specify a language.

Anyway, about your question, .pos_ and .tag_ are related but different things - it's not a question of one being "better" or "authoritative". The values in .pos_ are Universal Dependencies (UPOS) tags, which are coarse-grained and designed to be transferable between languages. The values in .tag_ are language-specific tags, which are more fine-grained and typically unique to a given language.

Which one you should rely on depends on what you're using them for.

2 replies

adrianeboyd Nov 2, 2022

For the trained English pipelines, POS is derived from tag and parse annotation using hand-written rules in the attribute ruler. The rules use the word form, TAG, and DEP (if DEP is available) to convert the fine-grained PTB tags to UPOS tags. You may see incorrect POS if the tag or parse is incorrect, or if it's a case that the conversion can't cover perfectly because the tag sets don't line up 1-to-1.

In the case above you get AUX instead of VERB because of the parse.
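
To make the idea concrete, here is a minimal pure-Python sketch of a rule that looks at both TAG and DEP to pick a coarse POS. The function name `coarse_pos` and the rules inside it are hypothetical and heavily simplified; they are not the rules shipped in spaCy's attribute ruler, only an illustration of why the same VB tag can end up as either VERB or AUX depending on the parse.

```python
from typing import Optional

# Penn Treebank verb tags (fine-grained, English-specific).
PTB_VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

# Dependency labels treated as "auxiliary" in this sketch (an assumption).
AUX_DEPS = {"aux", "auxpass"}


def coarse_pos(tag: str, dep: Optional[str] = None) -> str:
    """Hypothetical, simplified TAG + DEP -> UPOS rule (not spaCy's real rules)."""
    if tag == "MD":
        # Modals are AUX regardless of the parse.
        return "AUX"
    if tag in PTB_VERB_TAGS:
        # Same fine-grained tag, different coarse POS depending on DEP.
        return "AUX" if dep in AUX_DEPS else "VERB"
    # Anything this sketch does not cover.
    return "X"


# The two tokens from the question: same TAG, different DEP.
print(coarse_pos("VB", "ROOT"))
print(coarse_pos("VB", "aux"))
```

Under these assumed rules, a change in the parse (ROOT vs. aux) is enough to flip the POS while the TAG stays VB, which matches the pattern in the two outputs from the original question.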

asquare Nov 2, 2022
Author

Thanks for your quick answers Paul and Adriane. I'm looking for stability... I don't want to have to handle the AUX special case on potentially small changes to a sentence. So I've switched to TAG instead of POS. 🙏
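
For anyone landing here later, a check based on TAG might look roughly like the sketch below. It works on plain dicts shaped like the output in the original question so it runs without a pipeline; the helper name `is_verb_form` is made up for illustration. With real spaCy tokens the equivalent check would be `token.tag_ in PTB_VERB_TAGS`.

```python
# Penn Treebank verb tags used by the English pipelines' fine-grained tagger.
PTB_VERB_TAGS = frozenset({"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"})


def is_verb_form(token: dict) -> bool:
    """Return True if the token's fine-grained TAG is a PTB verb tag."""
    return token["tag"] in PTB_VERB_TAGS


# The two tokens from the question, copied as-is.
tokens = [
    {'text': 'get', 'lemma': 'get', 'pos': 'VERB', 'dep': 'ROOT', 'tag': 'VB'},
    {'text': 'get', 'lemma': 'get', 'pos': 'AUX', 'dep': 'aux', 'tag': 'VB'},
]

results = [is_verb_form(t) for t in tokens]
print(results)
```

One caveat with this approach: auxiliaries such as "have" or "be" also carry VB* tags, so a TAG-only check treats them as verb forms too. Whether that is what you want depends on the use case.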
