Possible ORG misidentification #13438

grabastart · 2024-04-13T13:25:24Z

How to reproduce the behaviour

import spacy
print(spacy.__version__)

model = 'en_core_web_sm'
nlp = spacy.load(model)

query = "Identify top 4 open source small language model that can run on a personal computer."

# Use spaCy for Named Entity Recognition
doc = nlp(query)
for ent in doc.ents:
    # display only the PERSON or ORG entities found
    if ent.label_ in ['PERSON', 'ORG']:
        print(ent.label_, ent.text)

# query = "Identify top 4 open source small language model that can run on a personal computer." FOUND ORG
# query = "Identify top 4 open source small language model" FOUND ORG
## comment: they should not be identified as "ORG"

# query = "Identify small language model" # NOT found
## comment: expected behavior 

# query = "Identify top 4 small language model" # FOUND ORG Identify
# query = "identify top 4 small language model" # FOUND ORG Identify
# query = "list top 4 small language models" # FOUND ORG List
# query = "Can you list the four best small language models?" FOUND ORG
## comment: they should not be identified as "ORG"

# query = "identify top 4 software" NOT FOUND
# query = "What is open source?" NOT FOUND
## comment: expected behavior but the English expressions are not idiomatic

# query = "Identify top 4 open source tool" FOUND ORG
## comment: they should not be identified as "ORG"

# query = "What is open source?" NOT FOUND

# query = "Identify top 4 open source" FOUND ORG
## comment: it should not be identified as "ORG"

# query = "identify top 4 software tool" # NOT FOUND
# query = "Identify top 4 software tool" # FOUND ORG
## comment: the second query should not be identified as "ORG"

# query = "Who is Matthew Honnibal?" found PERSON
# query = "Who is Andrew Ng?" found PERSON 
## comment: expected behavior

# query = "Identify top 4 open source small language model that can run on a personal computer." #  ORG Identify
# query = "identify top 4 open source small language model that can run on a personal computer." # NOT FOUND
## comment: neither query should be identified as "ORG"; only the lowercase variant behaves as expected
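
The commented-out variants above were tried one at a time. A minimal sketch of checking them in one pass is shown here; the helper name `person_org_entities` and the `QUERIES` list are illustrative and not part of the original report, and the output will depend on the installed model version.

```python
import spacy


def person_org_entities(nlp, queries):
    """Map each query to the (label, text) pairs labelled PERSON or ORG."""
    results = {}
    for query in queries:
        doc = nlp(query)
        results[query] = [
            (ent.label_, ent.text)
            for ent in doc.ents
            if ent.label_ in ("PERSON", "ORG")
        ]
    return results


# A few of the variants from the report (illustrative selection).
QUERIES = [
    "Identify top 4 open source small language model",
    "Identify small language model",
    "identify top 4 software tool",
    "Identify top 4 software tool",
    "Who is Matthew Honnibal?",
]


def main(model="en_core_web_sm"):
    # Skip quietly if the model package is not installed.
    if not spacy.util.is_package(model):
        print(f"{model} is not installed")
        return
    nlp = spacy.load(model)
    for query, ents in person_org_entities(nlp, QUERIES).items():
        print(repr(query), ents if ents else "NOT FOUND")


if __name__ == "__main__":
    main()
```

Passing a different model name to `main` (for example a larger English pipeline, if installed) makes it easy to compare how each pipeline labels the same queries.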

Your Environment

Operating System:
Windows 10
Python Version Used:
3.10.7
spaCy Version Used:
3.7.2
Environment Information:
3.10.7 (tags/v3.10.7:6cc6b13, Sep 5 2022, 14:08:36) [MSC v.1933 64 bit (AMD64)]

Thanks.

Answered by danieldk

danieldk · 2024-04-15T08:34:23Z

The pretrained spaCy pipelines use probabilistic models trained on example sets. As a result, a model will make errors because:

Training sets are not complete enough to cover all entities or contexts that would help predict an entity (or the absence of an entity).
The model capacity may be too limited to capture all entities or contexts that would help predict an entity (or the absence of an entity).
Training dynamics.

Accuracy will never be 100% on unseen data. That said, there are several ways in which you can improve prediction:

Use a larger model than en_core_web_sm, for instance the lg or trf models. These models are larger and generally have better prediction accuracy.
Train a model on your own, larger, or domain-specific data set.
Use EntityRuler to cover cases that are predicted incorrectly.
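
As a rough illustration of the rule-based route: EntityRuler adds entity spans from patterns, so for a spurious ORG such as "Identify" a small custom component that filters `doc.ents` is one possible alternative. The sketch below is that swapped-in technique, not EntityRuler itself. The component name `drop_false_orgs`, the word list `FALSE_ORG_WORDS`, and the helper `apply_filter` are hypothetical, and the misprediction is simulated on a blank pipeline so that no trained model is required.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Span

# Hypothetical list: single words that should never be an ORG on their own.
FALSE_ORG_WORDS = {"identify", "list"}


@Language.component("drop_false_orgs")
def drop_false_orgs(doc):
    """Remove ORG entities whose text is a known false positive."""
    doc.ents = [
        ent
        for ent in doc.ents
        if not (ent.label_ == "ORG" and ent.text.lower() in FALSE_ORG_WORDS)
    ]
    return doc


def apply_filter(text, predicted):
    """Simulate NER output as (start, end, label) token spans, then filter it."""
    nlp = spacy.blank("en")
    doc = nlp.make_doc(text)
    doc.ents = [Span(doc, start, end, label=label) for start, end, label in predicted]
    doc = drop_false_orgs(doc)
    return [(ent.label_, ent.text) for ent in doc.ents]


if __name__ == "__main__":
    # Simulated misprediction: "Identify" tagged as ORG.
    print(apply_filter("Identify top 4 small language model", [(0, 1, "ORG")]))
    # An ORG that is not on the word list is left untouched.
    print(apply_filter("Explosion develops spaCy", [(0, 1, "ORG")]))
```

In a trained pipeline the component would be registered the same way and added with something like `nlp.add_pipe("drop_false_orgs", after="ner")`. A word list like this is blunt and only covers the cases already observed, so it complements a larger or retrained model rather than replacing one.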

grabastart Apr 15, 2024
Author

@danieldk thanks for the informative response.
I'll try a larger model like "lg" in the next few days.
For training a model with a domain-specific data set, could you point me to the documentation?
