spacy spancat pipeline performance improvement above textcat #11646
-
How much has the SpanCategorizer improved your models? I am curious. I have been using textcat to categorize text with a recall of about 85%, and I wonder how much of a difference applying a span categorizer could make. I am trying to predict whether a question in a questionnaire will elicit confidential (personally identifiable) information, such as a name, telephone number, address, or social security number. Some questions can be very long, and then the textcat gets confused. I expect that being able to catch key terms should improve the prediction, but I wonder how much improvement others have seen in their models. Many thanks for your answers!
-
Let me link #11663, since it's related. As mentioned there, using NER/spancat annotations as input to textcat is possible, but not very likely to help. #10470 covers this approach and links to some other Discussions on the issue. That said, if you do try it we'd love to hear about how it goes!
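To make the "catch key terms" idea from the question concrete, here is a minimal rule-based sketch using spaCy's `Matcher`. It is not the trained spancat component; the cue patterns and the `pii_cues` helper are hypothetical stand-ins for the spans a trained span categorizer would predict, just to show how span hits could be surfaced from a long question before (or alongside) a textcat decision.

```python
import spacy
from spacy.matcher import Matcher

# Blank English pipeline: we only need the tokenizer for this sketch.
nlp = spacy.blank("en")

matcher = Matcher(nlp.vocab)
# Hypothetical PII cue patterns; a trained spancat would learn these
# instead of relying on a hand-written list.
matcher.add("PII_CUE", [
    [{"LOWER": "name"}],
    [{"LOWER": "telephone"}],
    [{"LOWER": "address"}],
    [{"LOWER": {"IN": ["ssn", "social"]}}],
])

def pii_cues(text: str) -> list[str]:
    """Return the matched cue spans found in the question text."""
    doc = nlp(text)
    return [doc[start:end].text for _, start, end in matcher(doc)]

question = "Please enter your full name, telephone number, and home address."
print(pii_cues(question))  # the PII cue terms found in the question
```

A long question that buries one of these cues in unrelated text would still trigger a match, which is the intuition behind expecting span-level signals to help where a document-level textcat gets confused.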