SpanCategorizer: SPAN context vs content #8947
-
When training SpanCategorizer, is there any way of controlling training to emphasize span context over span content? A disclaimer first: I admit very limited knowledge of how spaCy's neural networks work under the hood. From my observation of spaCy's NER (and now SpanCat) behavior, training seems to take into account BOTH the surrounding 'window' (the context) AND the labelled NER/span 'value' (the content). For illustration, suppose the model is trained on a few sentences in which the labelled PLACE entities are 'London' and 'Paris'.
Then the PLACE prediction for the sentence 'It is not too cold in London in winter' may return 'London' because training has 'taught' the model that London (the content) is a PLACE, regardless of the context. In my case, I need to de-emphasize the entity/span content: depending on the context, 'London' above may be a PLACE or carry a totally different entity/span label. The knowledge that London frequently happened to be a PLACE is counter-productive.
-
There is no flag or value you can modify to change this behavior; the model decides how to represent context and token information based on the training data.
What you can do is augment your training data by replacing your labeled entities with other words. If you use common nouns then the model may be able to learn to rely on context more, though I think the effect might not be very significant.
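To make that concrete, here is a minimal sketch of the substitution idea in plain Python. The (start, end, label) tuple format, the augment_example helper, and the replacement list are illustrative assumptions, not spaCy's training-data format; in practice you would apply something like this before building your Example objects, or wrap it in a custom augmenter callback in the training config.

```python
import random

# Hypothetical pool of common-noun replacements; adjust per label.
REPLACEMENTS = ["city", "place", "area", "region", "town"]

def augment_example(text, spans):
    """Replace each labelled span with a random common noun.

    spans: list of (start, end, label) character offsets into text,
    assumed non-overlapping.
    """
    new_text = text
    new_spans = []
    offset = 0  # cumulative length change from earlier replacements
    for start, end, label in sorted(spans, key=lambda s: s[0]):
        replacement = random.choice(REPLACEMENTS)
        start, end = start + offset, end + offset
        new_text = new_text[:start] + replacement + new_text[end:]
        new_spans.append((start, start + len(replacement), label))
        offset += len(replacement) - (end - start)
    return new_text, new_spans

print(augment_example("It is not too cold in London in winter",
                      [(22, 28, "PLACE")]))
# e.g. ('It is not too cold in town in winter', [(22, 26, 'PLACE')])
```

Keeping the original examples alongside the augmented copies usually works better than replacing them outright.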
If you're feeling very adventurous you could try detecting entities, masking them by replacing them with a filler token like XXX, and seeing how that's labelled, to avoid any influence from token identity. But I'm not sure I'd recommend that.
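For reference, a hedged sketch of that probe. The pipeline name en_core_web_sm and the use of doc.ents for detection are my assumptions for illustration; with a spancat pipeline you would read the predicted spans from doc.spans instead.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed installed model

text = "It is not too cold in London in winter"
doc = nlp(text)

# Replace each detected entity with the filler token XXX, working right
# to left so the character offsets of earlier entities stay valid.
masked = text
for ent in sorted(doc.ents, key=lambda e: e.start_char, reverse=True):
    masked = masked[:ent.start_char] + "XXX" + masked[ent.end_char:]

print("original:", [(e.text, e.label_) for e in doc.ents])
print("masked  :", [(e.text, e.label_) for e in nlp(masked).ents])
```

If the masked run still predicts PLACE for XXX, the model is leaning on context; if the prediction disappears, it was leaning on token identity.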
Are you doing something where you label roles and not just entities? Like if you have "Alice sold Bob a car" and "Bob bought a car from Alice" and you're labelling not PERSON but BUYER and SELLER? That's called Semantic Role Labelling, and it's very similar to NER but harder. You might want to look at the research for that.