
NLP entity recognition, un-annotated tokens and annotation text length #11839

Hi @AlexanderBruland ,

  1. Cutting it down into paragraphs should be better. In my experience, there's a tradeoff between the size of the Doc objects you process and memory use, so it may be a good idea to start with paragraphs or sentences.
  2. Can you expound a bit on this? Tokens without annotation will be learned as tokens that should not be annotated, so this will definitely affect your model. If you missed some entities while annotating, the model will learn them as non-entities, which may show up as false negatives later on. If you are unsure of a particular paragraph (maybe because it looks corrupted or weird after processing), it might be wise not to include it in your corpus.
  3. Tesseract OCR should be a good baseline. I haven't personall…
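
To make point 1 concrete, here is a minimal sketch of splitting one long annotated text into paragraphs. It assumes your annotations are stored as `(start, end, label)` character offsets into the full text and that paragraphs are separated by blank lines; the function names `paragraph_spans` and `split_annotated_text` are made up for this example and are not part of spaCy.

```python
import re


def paragraph_spans(text):
    """Return (start, end) character offsets of each blank-line-separated paragraph."""
    spans = []
    start = 0
    # A separator is a newline, optional whitespace, then another newline.
    for sep in re.finditer(r"\n\s*\n", text):
        if sep.start() > start:
            spans.append((start, sep.start()))
        start = sep.end()
    if start < len(text):
        spans.append((start, len(text)))
    return spans


def split_annotated_text(text, annotations):
    """Split text into paragraphs and shift each annotation so its offsets
    are relative to the paragraph that fully contains it.

    Annotations that cross a paragraph boundary are dropped (an assumption
    made here for simplicity; you may prefer to log or fix them instead).
    """
    examples = []
    for p_start, p_end in paragraph_spans(text):
        local = [
            (start - p_start, end - p_start, label)
            for start, end, label in annotations
            if p_start <= start and end <= p_end
        ]
        examples.append((text[p_start:p_end], local))
    return examples


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    text = "Acme Corp hired Jane Doe.\n\nShe moved to Oslo in 2021."
    annotations = [(0, 9, "ORG"), (16, 24, "PERSON"), (40, 44, "GPE")]
    for paragraph, spans in split_annotated_text(text, annotations):
        print(repr(paragraph))
        for start, end, label in spans:
            print("   ", label, repr(paragraph[start:end]))
```

Each `(paragraph, spans)` pair can then be turned into its own, smaller Doc. If you build the entity spans with `doc.char_span(start, end, label=label)`, keep in mind that it returns `None` when the offsets don't line up with token boundaries, so it is worth checking for that before adding the span.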

Answer selected by AlexanderBruland
Labels
usage General spaCy usage
2 participants