Poorer performance for NER trained with generic language model, compared to larger domain-specific model #11725
-
Hi all, I've been training new NER models on top of existing language models to detect parasite species and their hosts in scientific texts. I adapted the script from https://github.com/explosion/projects/blob/v3/pipelines/ner_demo_replace/scripts/create_config.py to replace and retrain the NER component for each language model I chose.

Counterintuitively, I've seen poorer performance from language models tailored to scientific text than from a more generic model like en_core_web_lg. For instance, F1 for parasite species is 84.96% for en_core_web_lg, but only 77.29% and 77.36% for en_core_sci_lg and en_core_sci_scibert from scispaCy, respectively. Does anyone have insight into why this is happening?

For reference, here is a link to download the config files and binary training files: https://minhaskamal.github.io/DownGit/#/home?url=https://github.com/Jason-B-Jiang/microsporidia_text_mining/tree/main/src/3_train_pipelines/microsp_host_relation_extraction

I've been training with:

```
python -m spacy train ./{name of model}.cfg --output {name of model}_model --paths.train ./train.spacy --paths.dev ./valid.spacy
```

and evaluating on a holdout set with:

```
python -m spacy evaluate {model name}_model ./test.spacy
```

Thank you in advance! Every bit of advice helps :)
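For context on the numbers above: the per-label F1 that `spacy evaluate` reports is the harmonic mean of precision and recall over predicted entity spans. A minimal sketch of that calculation (the counts below are made up for illustration, not taken from this project):

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from true-positive, false-positive,
    and false-negative span counts for one entity label."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical counts for a single label such as parasite species
p, r, f = prf(tp=85, fp=15, fn=15)
print(f"P={p:.2%}  R={r:.2%}  F1={f:.2%}")
```

This is also why a label's F1 can drop even when precision improves: a model that predicts fewer spans overall loses recall, and F1 punishes the imbalance.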
-
Hey, thanks for linking to your complete project, but a link to just the config files, so we can view them without downloading, would be more helpful.
https://github.com/Jason-B-Jiang/microsporidia_text_mining/tree/main/src/3_train_pipelines/microsp_host_relation_extraction
I haven't looked things over in detail yet, but it's possible that your specific docs simply differ from the training data used for scispaCy, even if both are "scientific texts".
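One quick, model-free way to sanity-check that domain-mismatch hypothesis is to measure how much vocabulary your docs share with a sample of text resembling each model's training corpus. This is only a rough sketch with naive whitespace-style tokenization and toy inline strings (real corpus samples would be needed for a meaningful number):

```python
import re

def vocab_overlap(texts_a: list[str], texts_b: list[str]) -> float:
    """Jaccard overlap between the lowercased word vocabularies of two text samples."""
    def vocab(texts: list[str]) -> set[str]:
        return {w for t in texts for w in re.findall(r"[a-z]+", t.lower())}
    a, b = vocab(texts_a), vocab(texts_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Toy examples only, not real corpus samples
my_docs = ["Nosema bombycis infects the silkworm Bombyx mori."]
web_like = ["The quick brown fox jumps over the lazy dog."]
print(f"overlap: {vocab_overlap(my_docs, web_like):.2f}")
```

A low overlap with the corpus a model was trained on would support the mismatch explanation; a high overlap would point toward something else, like annotation or config differences.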