
Fix use_gold_ents behaviour for EntityLinker #13400

Merged: 11 commits merged into explosion:master from fix/el on Apr 16, 2024

Conversation

svlandeg (Member)

Description

The use_gold_ents flag was introduced to allow the entity_linker to train on gold entities, even if there's no (annotating) NER component in the pipeline.
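For reference, a minimal sketch of enabling the flag on a blank pipeline (KB and training setup omitted):

    import spacy

    nlp = spacy.blank("en")
    # Train/evaluate the entity linker on gold entities, even without an
    # annotating NER component in the pipeline:
    nlp.add_pipe("entity_linker", config={"use_gold_ents": True})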

I think this behaviour was buggy for a few reasons:

  1. In initialize(), NER "predictions" from eg.reference were added to eg.predicted for the first 10 examples, and never cleaned up/restored afterwards (see the sketch after this list).
  2. In update(), this transfer happened on all examples, but here the ents were "restored" before calling the loss function. In theory, this should have prevented the EL from learning anything at all, except that in the corresponding unit test this bug was masked by bug 1, which left a few spurious annotations on the first 10 documents.
  3. Because of how a spaCy pipeline works internally, the scoring could never work out of the box, because Language.evaluate() calls pipe() on the predicted docs, which won't have entities if there is no (annotating) NER in the pipeline.
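To make the transfer in 1. and 2. concrete, here's a minimal sketch of the pattern (simplified, not the literal spaCy internals):

    def transfer_gold_ents(examples):
        # Project the reference (gold) entities onto each predicted doc.
        # The saved ents are needed to restore the original state
        # afterwards; in bug 1, initialize() never restored them.
        saved = [list(eg.predicted.ents) for eg in examples]
        for eg in examples:
            ents, _ = eg.get_aligned_ents_and_ner()
            eg.predicted.ents = ents
        return saved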

To test some of this behaviour, I used different configs with the EL Emerson example, cf. explosion/projects#207. On master, the "EL only" config would produce all-zero lines:

E    #       LOSS ENTIT...  NEL_MICRO_F  NEL_MICRO_R  NEL_MICRO_P  SCORE
---  ------  -------------  -----------  -----------  -----------  ------
  0       0           0.00         0.00         0.00         0.00    0.00
 33     200           0.00         0.00         0.00         0.00    0.00
 73     400           0.00         0.00         0.00         0.00    0.00
123     600           0.00         0.00         0.00         0.00    0.00

Then it would produce actual loss scores after fixing 1 and 2:

E    #       LOSS ENTIT...  NEL_MICRO_F  NEL_MICRO_R  NEL_MICRO_P  SCORE
---  ------  -------------  -----------  -----------  -----------  ------
  0       0           3.30         0.00         0.00         0.00    0.00
 33     200          55.77         0.00         0.00         0.00    0.00
 73     400           3.93         0.00         0.00         0.00    0.00
123     600           1.99         0.00         0.00         0.00    0.00

And finally, after fixing 3, it would give actual scores:

E    #       LOSS ENTIT...  NEL_MICRO_F  NEL_MICRO_R  NEL_MICRO_P  SCORE
---  ------  -------------  -----------  -----------  -----------  ------
  0       0           3.30        33.33        33.33        33.33    0.33
 33     200          57.55        83.33        83.33        83.33    0.83
 74     400           4.25        83.33        83.33        83.33    0.83
124     600           1.95        83.33        83.33        83.33    0.83

Types of change

bug fixes & enhancement

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

svlandeg added the labels bug (Bugs and behaviour differing from documentation) and feat / nel (Feature: Named Entity linking) on Mar 27, 2024
Comment on lines 244 to 258
def _score_augmented(examples, **kwargs):
    # Because of how spaCy works, we can't just score immediately, because Language.evaluate
    # calls pipe() on the predicted docs, which won't have entities if there is no NER in the pipeline.
    if not self.use_gold_ents:
        return scorer(examples, **kwargs)
    else:
        examples = self._augment_examples(examples)
        docs = self.pipe((eg.predicted for eg in examples))
        for eg, doc in zip(examples, docs):
            eg.predicted = doc
        return scorer(examples, **kwargs)

self.scorer = _score_augmented
svlandeg (Member, Author):

This whole bit is admittedly pretty hacky, but considering bug 3 as explained in the PR description, I don't see a better option short of changing the entire mechanism of how evaluation/scoring of a pipeline works...

Contributor:

Agreed, this is not really satisfying. The workaround makes sense in this context though.

new_examples = []
for eg in examples:
    ents, _ = eg.get_aligned_ents_and_ner()
    new_eg = eg.copy()
svlandeg (Member, Author):

Making a copy here feels safest? I'm not 100% sure about all the possible interactions with all other components in the pipeline, before or after, annotated or not, and frozen or not...

Contributor:

Hm, do we manipulate examples in other components? I'm also unsure about this. Either way 👍 for copying it.
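For context, a hypothetical completion of the copying loop quoted above (everything after new_eg = eg.copy() is an assumption, not shown in the excerpt, and the aligned ents are re-derived from the copy so the spans belong to the right doc):

    new_examples = []
    for eg in examples:
        new_eg = eg.copy()  # copy, so the caller's examples stay untouched
        # project the gold (reference) entities onto the copy's predicted doc
        ents, _ = new_eg.get_aligned_ents_and_ner()
        new_eg.predicted.ents = ents
        new_examples.append(new_eg)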

rmitsch (Contributor) left a comment:

Great spot 👀 Can you elaborate on the cleaning up/restoration from reason 1.? Not sure what you mean by that.

svlandeg and others added 3 commits on April 2, 2024 at 09:29
svlandeg (Member, Author) commented on Apr 2, 2024:

Can you elaborate on the cleaning up/restoration from reason 1.?

When the pipeline gets initialized, each individual component's initialize() method is called on a set of Example objects. The entity linker was changing these examples by setting gold entities on the first 10 examples (to allow dimension inference with Thinc) and not cleaning up afterwards, leaving the examples in an inconsistent/wrong state for the next component or other processing.
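A minimal sketch of the leak-free pattern the fix implies (hypothetical and simplified, not the literal diff):

    def initialize(self, get_examples, *, nlp=None, kb_loader=None):
        examples = list(get_examples())
        if self.use_gold_ents:
            # augment copies of the first examples for Thinc dimension
            # inference, so the shared Example objects stay untouched
            # for the next component
            examples = self._augment_examples(examples[:10])
        # ... run the model on `examples` to infer dimensions ...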

svlandeg merged commit 2e23346 into explosion:master on Apr 16, 2024 (9 checks passed).
svlandeg deleted the fix/el branch on April 16, 2024 at 10:00.