spancat - only one label being learned #8967
-
Hi, I want to use spancat to detect certain references in text and then also detect the parts of those references, i.e. page numbers, author names, editions, etc. I know I have very few samples, but I think it should still at least learn something about the other labels, right? Does anyone have any idea what might be happening? Here's my config:
Replies: 3 comments 5 replies
-
To be clear, your data is something like: [...]
And if you're getting flat zero scores, that suggests some kind of configuration issue rather than just not having much data. Can you share the output of [...]? Also, can you show how you're building your training data? You might want to print out some of your training docs (with their spans) to check that they're annotated the way you expect. It sounds like the inner labels may be getting dropped somehow.
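For anyone wanting to run the check suggested above, here is a minimal sketch of printing training docs with their spans and counting labels. It assumes the training data is a `.spacy` file and that the spans live under the key `"sc"` (spancat's default `spans_key`); the helper names `load_docs`, `label_counts`, and `print_spans` are made up for this example and are not part of spaCy.

```python
from collections import Counter

# Assumption: spans are stored under spancat's default key.
# Change this if your config sets a different spans_key.
SPANS_KEY = "sc"


def load_docs(path, lang="en"):
    """Load Doc objects from a .spacy file (spaCy is imported lazily)."""
    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank(lang)
    return list(DocBin().from_disk(path).get_docs(nlp.vocab))


def label_counts(docs, spans_key=SPANS_KEY):
    """Count how often each span label occurs under doc.spans[spans_key]."""
    counts = Counter()
    for doc in docs:
        if spans_key in doc.spans:
            for span in doc.spans[spans_key]:
                counts[span.label_] += 1
    return counts


def print_spans(docs, spans_key=SPANS_KEY, limit=5):
    """Print token offsets, label, and text for the spans of the first few docs."""
    for doc in docs[:limit]:
        spans = doc.spans[spans_key] if spans_key in doc.spans else []
        for span in spans:
            print(span.start, span.end, span.label_, repr(span.text))
```

Calling `print_spans(load_docs(path))` on your own training file, followed by `label_counts(...)`, should show quickly whether the inner labels are present at all. If only the outer label appears in the counts, the problem is in how the data is built rather than in training.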
-
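Ah, I think this may just be a known bug in the scorer. If you run the model and inspect docs, do you see that some of the nested spans are labeled as expected?
The scorer has the value for allow_overlap set incorrectly. You can edit this line in spacy/pipeline/spancat.py to fix it:
- kwargs.setdefault("multi_label", True)
+ kwargs.setdefault("allow_overlap", True)
This is fixed in develop for the upcoming v3.2, but we need to backport it to the next v3.1.x release, too.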
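To answer the question about nested spans without relying on the scorer, one option is to list the predicted spans that sit inside another predicted span. The following is only a sketch: `nested_pairs` is a hypothetical helper, not a spaCy function, and it assumes predictions are stored under the default key `"sc"`.

```python
def nested_pairs(doc, spans_key="sc"):
    """Return (outer, inner) pairs where inner lies strictly inside outer.

    Spans with identical boundaries are not counted as nested.
    """
    spans = list(doc.spans[spans_key]) if spans_key in doc.spans else []
    pairs = []
    for outer in spans:
        for inner in spans:
            same_bounds = (inner.start, inner.end) == (outer.start, outer.end)
            contained = outer.start <= inner.start and inner.end <= outer.end
            if contained and not same_bounds:
                pairs.append((outer, inner))
    return pairs


def describe_nested(doc, spans_key="sc"):
    """Print each nested pair as 'OUTER_LABEL > INNER_LABEL: inner text'."""
    for outer, inner in nested_pairs(doc, spans_key):
        print(f"{outer.label_} > {inner.label_}: {inner.text!r}")
```

Run your trained pipeline on a sample text and pass the resulting doc to `describe_nested`. If inner labels show up here while the reported scores are still zero, that points to the scorer setting rather than to the model.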
-
In case someone comes across this thread in the future, there was a bug in spancat related to training on overlapping spans, which should be fixed in spaCy v3.1.2+ (see #9007).