Poorer performance for NER trained with generic language model, compared to larger domain-specific model #11725
-
Hi all, I've been training new NER models on top of existing language models to detect parasite species and their hosts in scientific texts. I adapted the script from https://github.com/explosion/projects/blob/v3/pipelines/ner_demo_replace/scripts/create_config.py to replace and retrain the NER component for each language model I chose.

Counterintuitively, I've seen poorer performance from language models tailored to scientific text than from a more generic model like en_core_web_lg. For instance, F1 for parasite species is 84.96% for en_core_web_lg, but only 77.29% and 77.36% for en_core_sci_lg and en_core_sci_scibert from scispaCy, respectively. Does anyone have insight into why this is happening?

For reference, here is a link to download the config files and binary training files: https://minhaskamal.github.io/DownGit/#/home?url=https://github.com/Jason-B-Jiang/microsporidia_text_mining/tree/main/src/3_train_pipelines/microsp_host_relation_extraction

I've been training with:

```
python -m spacy train ./{name of model}.cfg --output {name of model}_model --paths.train ./train.spacy --paths.dev ./valid.spacy
```

and evaluating on a holdout set with:

```
python -m spacy evaluate {model name}_model ./test.spacy
```

Thank you in advance! Every bit of advice helps :)
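For context on the numbers above: the per-label F1 that `spacy evaluate` reports is the harmonic mean of precision and recall over predicted entity spans. A minimal sketch of that calculation (the counts below are made up for illustration, not taken from this project):

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from true-positive, false-positive,
    and false-negative span counts for one entity label."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical counts for a single label such as parasite species
p, r, f = prf(tp=85, fp=15, fn=15)
print(f"P={p:.2%}  R={r:.2%}  F1={f:.2%}")
```

This is also why a label's F1 can drop even when precision improves: a model that predicts fewer spans overall loses recall, and F1 punishes the imbalance.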
-
Hey, thanks for linking to your complete project, but a link to just the config files, so we can view them without downloading, would be more helpful.
https://github.com/Jason-B-Jiang/microsporidia_text_mining/tree/main/src/3_train_pipelines/microsp_host_relation_extraction
I haven't looked things over in detail yet, but it's possible that your specific docs simply differ from the training data used for scispaCy, even if both are "scientific texts".
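One quick, model-free way to sanity-check that domain-mismatch hypothesis is to measure how much vocabulary your docs share with a sample of text resembling each model's training corpus. This is only a rough sketch with naive whitespace-style tokenization and toy inline strings (real corpus samples would be needed for a meaningful number):

```python
import re

def vocab_overlap(texts_a: list[str], texts_b: list[str]) -> float:
    """Jaccard overlap between the lowercased word vocabularies of two text samples."""
    def vocab(texts: list[str]) -> set[str]:
        return {w for t in texts for w in re.findall(r"[a-z]+", t.lower())}
    a, b = vocab(texts_a), vocab(texts_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Toy examples only, not real corpus samples
my_docs = ["Nosema bombycis infects the silkworm Bombyx mori."]
web_like = ["The quick brown fox jumps over the lazy dog."]
print(f"overlap: {vocab_overlap(my_docs, web_like):.2f}")
```

A low overlap with the corpus a model was trained on would support the mismatch explanation; a high overlap would point toward something else, like annotation or config differences.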