spancat - only one label being learned #8967

lnatspacy · 2021-08-15T08:30:02Z

Hi,

I want to use spancat to detect certain references in text and also to detect the parts of each reference, i.e. page numbers, author names, editions, etc.
So far I've annotated only 130 or so paragraphs, each containing 1-5 examples of such references.
Just to see how far this would get me, I trained a model, and I noticed that only 1 of the labels has an F-score above 0.00 (that label has a decent one, around 60, though).
It seems that only the label that covers the entire reference is being recognized, and none of the labels that span only certain tokens within the reference.

I know I have very few samples but I think it should still at least learn something about the other ones, right?

Does anyone have any idea what might be happening?

Here's my config:

[paths]
train = data/train.spacy
dev = data/dev.spacy
vectors = "de_core_news_lg"
init_tok2vec = null
raw_text = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "de"
pipeline = ["spancat"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.spancat]
factory = "spancat"
max_positive = null
spans_key = "sc"
threshold = 0.5

[components.spancat.model]
@architectures = "spacy.SpanCategorizer.v1"

[components.spancat.model.reducer]
@layers = "spacy.mean_max_reducer.v1"
hidden_size = 128

[components.spancat.model.scorer]
@layers = "spacy.LinearLogistic.v1"
nO = null
nI = null

[components.spancat.model.tok2vec]
@architectures = "spacy.Tok2Vec.v1"

[components.spancat.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v1"
width = 96
rows = [5000,2000,1000,1000]
attrs = ["ORTH","PREFIX","SUFFIX","SHAPE"]
include_static_vectors = false

[components.spancat.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = 96
window_size = 1
maxout_pieces = 3
depth = 4

[components.spancat.suggester]
@misc = "spacy.ngram_range_suggester.v1"
min_size = 1
max_size = 35

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.pretrain]
@readers = "spacy.JsonlCorpus.v1"
path = ${paths.raw_text}
min_length = 5
max_length = 500
limit = 0

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
spans_sc_f = 1.0
spans_sc_p = 0.0
spans_sc_r = 0.0

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = {"@callbacks":"customize_nlp_tokenizer_spancat"}
after_init = null

[initialize.components]

[initialize.components.spancat]

[initialize.components.spancat.labels]
@readers = "spacy.read_labels.v1"
path = "data\\labels\\spancat.json"

[initialize.tokenizer]
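
For a sense of scale: with `min_size = 1` and `max_size = 35`, the n-gram suggester in the config above proposes every contiguous run of 1 to 35 tokens as a candidate span. The sketch below only counts those candidates. The function name and the document lengths are illustrative assumptions and are not part of spaCy.

```python
def ngram_candidate_count(n_tokens: int, min_size: int, max_size: int) -> int:
    """Count contiguous token spans whose length is in [min_size, max_size]."""
    total = 0
    for size in range(min_size, max_size + 1):
        # A doc with n tokens contains n - size + 1 spans of a given size
        # (and none at all once size exceeds n).
        total += max(0, n_tokens - size + 1)
    return total


if __name__ == "__main__":
    # Illustrative document lengths, not measured from the dataset.
    for n in (20, 57, 100):
        print(n, ngram_candidate_count(n, min_size=1, max_size=35))
```

With only 1-5 annotated references per paragraph, nearly all of these candidates are negative examples, which may make the short inner labels harder to pick up from a small dataset.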

polm · 2021-08-15T12:29:06Z

To be clear your data is something like:

This is based on Smith 1997 (p93), who says that something something

And Smith 1997 (p93) would be the whole reference, and you might have Author, Year, Page inside that, or something similar?

If you're getting flat zero scores then that suggests some kind of configuration issue rather than just not having much data. Can you share the output of spacy debug data? It doesn't provide a lot of details for spancat but it's a start.

Also, can you show how you're building your training data? You might want to print out some of your training docs (with their spans) to show they're being annotated like you expect - it sounds like maybe the inner labels are getting dropped somehow.

1 reply

lnatspacy Aug 15, 2021
Author

That's a fair example, although we're talking about longer references from an academic context.
Here's the output; as you say, it's not much:

============================ Data file validation ===========================
✔ Corpus is loadable
✔ Pipeline can be initialized with data

=============================== Training stats ==============================
Language: de
Training pipeline: spancat
128 training docs
31 evaluation docs
✔ No overlap between training and evaluation data
⚠ Low number of examples to train a new pipeline (128)

============================== Vocab & Vectors ==============================
ℹ 7296 total word(s) in the data (2461 unique)
ℹ 500000 vectors (500000 unique keys, 300 dimensions)
⚠ 648 words in training data without vectors (9%)

================================== Summary ==================================
✔ 3 checks passed
⚠ 2 warnings

I'm using Prodigy to create the training data so that's probably not it. I looked into the jsonl after exporting the data with db-out and it looks good. I'm not comfortable sharing samples since I'm working with copyrighted text, but the inner labels are definitely present.

/edit: I tried to use Prodigy for training and I get the exact same result (kinda expected, since it uses spacy train internally), so it's probably not a config thing either?
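
One way to double-check an export like this without sharing any text is to count labels and nested spans directly in the JSONL. This is a minimal sketch that assumes each record has a `spans` list whose entries carry `start`, `end` and `label`. The function name, the file name in the comment and the labels in the sample are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable, Tuple


def span_stats(lines: Iterable[str]) -> Tuple[Counter, int]:
    """Return per-label span counts and the number of spans covered by another span."""
    label_counts: Counter = Counter()
    nested = 0
    for line in lines:
        if not line.strip():
            continue
        spans = json.loads(line).get("spans", [])
        for span in spans:
            label_counts[span["label"]] += 1
            # Covered by a different span of the same record
            # (identical boundaries also count as covered).
            if any(
                other is not span
                and other["start"] <= span["start"]
                and span["end"] <= other["end"]
                for other in spans
            ):
                nested += 1
    return label_counts, nested


if __name__ == "__main__":
    # Made-up record; for a real export, pass an open file instead,
    # e.g. open("export.jsonl", encoding="utf8").
    sample = [
        '{"text": "Smith 1997 (p93)", "spans": ['
        '{"start": 0, "end": 16, "label": "REFERENCE"}, '
        '{"start": 0, "end": 5, "label": "AUTHOR"}, '
        '{"start": 6, "end": 10, "label": "YEAR"}]}'
    ]
    print(span_stats(sample))
```

If the nested count is zero, or an inner label is missing from the counts, the inner annotations were lost before training rather than during it.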

adrianeboyd · 2021-08-16T06:13:55Z

Ah, I think this may just be a known bug in the scorer. If you run the model and inspect docs, do you see that some of the nested spans are labeled as expected?

The scorer has the value for allow_overlap set incorrectly. You can edit this line in spacy/pipeline/spancat.py to fix it:

-        kwargs.setdefault("multi_label", True)
+        kwargs.setdefault("allow_overlap", True)

This is fixed in develop for the upcoming v3.2, but we need to backport it to the next v3.1.x release, too.
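
To illustrate why a scorer that disallows overlap could show exactly this symptom, here is a toy example. It is a simplified illustration and not spaCy's actual scoring code: the span tuples, the function names and the labels are all assumptions made for the sketch. Predictions that share a token with an earlier prediction are dropped before comparison, so nested spans can never match.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int, str]  # (start_token, end_token_exclusive, label)


def drop_overlaps(spans: List[Span]) -> List[Span]:
    """Keep a span only if none of its tokens belong to an earlier kept span."""
    seen: Set[int] = set()
    kept: List[Span] = []
    for start, end, label in spans:
        tokens = set(range(start, end))
        if tokens & seen:
            continue  # overlaps something already kept
        kept.append((start, end, label))
        seen |= tokens
    return kept


def recall_per_label(gold: List[Span], pred: List[Span], allow_overlap: bool) -> Dict[str, float]:
    """Fraction of gold spans per label that appear exactly in the predictions."""
    if not allow_overlap:
        pred = drop_overlaps(pred)
    pred_set = set(pred)
    result: Dict[str, float] = {}
    for label in sorted({span[2] for span in gold}):
        gold_for_label = [span for span in gold if span[2] == label]
        hits = sum(1 for span in gold_for_label if span in pred_set)
        result[label] = hits / len(gold_for_label)
    return result


if __name__ == "__main__":
    # Predictions identical to gold: one outer span with two nested parts.
    gold = [(0, 4, "REFERENCE"), (0, 1, "AUTHOR"), (1, 2, "YEAR")]
    print(recall_per_label(gold, list(gold), allow_overlap=False))
    print(recall_per_label(gold, list(gold), allow_overlap=True))
```

Even with perfect predictions, the inner labels score zero in this toy setup once overlapping spans are filtered out, while the outer label is unaffected.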

4 replies

lnatspacy Aug 16, 2021
Author

> do you see that some of the nested spans are labeled as expected?

Yes, though incredibly rarely: I spotted them in about 1 out of thousands of paragraphs.

Unfortunately, the fix didn't help. I'm still getting all zeros for all but one label. I'm positive I applied it to the correct module.

adrianeboyd Aug 16, 2021

If you try a small overfitting test case with just a few examples where you train and evaluate on the exact same data?
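
A sketch of how such an overfitting run could be set up, by pointing both corpus paths at the same file through config overrides on the `spacy train` command line. The helper name, `config.cfg` and the output directory are hypothetical. A config with a custom callback, like the one above, would presumably also need `--code` pointing at the file that registers it.

```python
import sys
from typing import List


def overfit_command(config: str, data: str, output: str) -> List[str]:
    """Build a `spacy train` call that uses one corpus for both train and dev."""
    return [
        sys.executable, "-m", "spacy", "train", config,
        "--output", output,
        # Config overrides: both corpora deliberately point at the same file.
        "--paths.train", data,
        "--paths.dev", data,
    ]


if __name__ == "__main__":
    cmd = overfit_command("config.cfg", "data/train.spacy", "output_overfit")
    # Printed only; pass `cmd` to subprocess.run(cmd, check=True) to actually train.
    print(" ".join(cmd))
```

If the scores for the inner labels rise above zero in this setting, the pipeline is able to learn them, and the problem is more likely data size or annotation variety than configuration.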

lnatspacy Aug 16, 2021
Author

> If you try a small overfitting test case with just a few examples where you train and evaluate on the exact same data?

Yes, if I just copy my train.spacy to dev.spacy, some of the other labels are moving past 0.
So I guess the conclusion is that I'll need an absolute truckload of samples(?).

Thank you for bearing with me :)

svlandeg Aug 17, 2021
Maintainer

Yeah, if this behaviour doesn't show up when training and testing on the exact same data, then nothing is technically wrong. There's probably just too little data, or the annotations are varied enough that it's difficult for the model to generalize. I think this might happen more often with spancat than with NER, as named entities often have a more "crisp" definition and could be easier to learn in general.

adrianeboyd · 2021-08-20T14:34:02Z

In case someone comes across this thread in the future: there was a bug in spancat related to training on overlapping spans, which should be fixed in spaCy v3.1.2+ (see #9007).

0 replies
