NLP entity recognition, un-annotated tokens and annotation text length #11839

AlexanderBruland · 2022-11-21T14:52:34Z

Hello! I've recently been involved in an NLP project using spaCy, where we try to extract data from contract documents. In relation to that I have a few questions:

1. The documents we are extracting information from are typically 10 to 20 pages long. Right now we annotate full documents, but would we see better performance if we cut the documents down to the paragraphs that contain information and annotated those instead? Does the document length matter?
2. Let's say I annotate a document and skip some of the tokens (the documents have some passages where I believe marking entities could distract the model, because of both OCR-damaged structure and bad input). Does the spaCy model treat the places I don't annotate as places where no entity should be recognized?

We are getting quite good results on standardized documents and more mixed results on documents where the authors have taken more "freedom", which is expected. But as a first entry to machine learning, this is a really fun project and NLP has become much more powerful than I expected! :)

One more question: We are using pytesseract OCR right now, as it's a free option and can be hosted without violating GDPR rules. What are your thoughts on other OCR software? I kind of wish the OCR was more stable. Do you think dropping 500 USD on Omniscan is worth it?

Thank you! Best regards Alexander

Answered by ljvmiranda921

ljvmiranda921 · 2022-11-22T06:29:43Z

Hi @AlexanderBruland,

1. Cutting it down into paragraphs should be better. In my experience, large Doc objects come at a cost in memory, so it may be a good idea to start with paragraphs or sentences.
2. Can you expound a bit on this? Tokens without annotation will be learned as tokens that should not be annotated, so this will definitely affect your model. If you have missed some tokens, they may be treated as false negatives later on. If you are unsure of a particular paragraph (maybe because after processing it looks corrupted and weird), then it might be wise to not include it in your corpus.
3. Tesseract OCR should be a good baseline. I haven't personally tested Omniscan so I cannot comment. You can also check out other open-source tools like LayoutParser or pdfplumber.
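
To make the first two points concrete, here is a minimal sketch of one way to split an annotated document into paragraph-level examples and leave out paragraphs that look like OCR noise. It uses only the Python standard library. The function names, the `(start, end, label)` span format, and the `min_alnum_ratio` threshold are all assumptions made for illustration, not part of spaCy, and the noise heuristic would need tuning on real contracts.

```python
import re
from typing import List, Tuple

# (start_char, end_char, label) with offsets into the full document text.
Span = Tuple[int, int, str]


def split_into_paragraphs(text: str) -> List[Tuple[int, str]]:
    """Return (start_offset, paragraph_text) pairs, splitting on blank lines."""
    paragraphs = []
    pos = 0
    for sep in re.finditer(r"\n\s*\n", text):
        if text[pos:sep.start()].strip():
            paragraphs.append((pos, text[pos:sep.start()]))
        pos = sep.end()
    if text[pos:].strip():
        paragraphs.append((pos, text[pos:]))
    return paragraphs


def spans_for_paragraph(start: int, para: str, spans: List[Span]) -> List[Span]:
    """Keep spans that lie fully inside the paragraph, shifted to local offsets.

    Spans that cross a paragraph boundary are dropped.
    """
    end = start + len(para)
    return [(s - start, e - start, label)
            for s, e, label in spans
            if s >= start and e <= end]


def looks_corrupted(para: str, min_alnum_ratio: float = 0.6) -> bool:
    """Hypothetical heuristic: flag paragraphs that are mostly symbols."""
    visible = [c for c in para if not c.isspace()]
    if not visible:
        return True
    alnum = sum(c.isalnum() for c in visible)
    return alnum / len(visible) < min_alnum_ratio


def build_training_examples(text: str, spans: List[Span]) -> List[Tuple[str, List[Span]]]:
    """Produce (paragraph_text, local_spans) pairs, skipping noisy paragraphs."""
    examples = []
    for start, para in split_into_paragraphs(text):
        if looks_corrupted(para):
            continue  # leave unreliable text out instead of teaching "no entity here"
        examples.append((para, spans_for_paragraph(start, para, spans)))
    return examples
```

Each resulting pair could then be turned into a spaCy `Doc`, for example with `Doc.char_span` to build the entity spans. If only part of a paragraph is unreliable, spaCy v3 also has `Doc.set_ents`, whose `default="missing"` option marks the remaining tokens as unannotated rather than as "outside"; whether that helps on these documents is something to test.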

1 reply

AlexanderBruland Nov 22, 2022
Author

Thank you very much! These were exactly the answers I was looking for.
