Overfitted model #9236

farrandi · 2021-09-17T17:41:20Z

Hello, I trained a new model. It does relatively well on data it has seen before but poorly on new data, so it has trouble generalizing. I suspect this is because of overfitting.

Here are my data debug and training results:

============================ Data file validation ============================
✔ Pipeline can be initialized with data
✔ Corpus is loadable

=============================== Training stats ===============================
Language: en
Training pipeline: sentencizer, tok2vec, ner
Frozen components: sentencizer
11046 training docs
4735 evaluation docs
⚠ 877 training examples also in evaluation data

============================== Vocab & Vectors ==============================
ℹ 351669 total word(s) in the data (13483 unique)
⚠ 886 misaligned tokens in the training data
⚠ 367 misaligned tokens in the dev data
ℹ No word vectors present in the package

========================== Named Entity Recognition ==========================
ℹ 6 label(s)
0 missing value(s) (tokens with '-' label)
⚠ 356 entity span(s) with punctuation
✔ Good amount of examples for all labels
✔ Examples without occurrences available for all labels
✔ No entities consisting of or starting/ending with whitespace
Entity spans consisting of or starting/ending with punctuation can not be
trained with a noise level > 0.

================================== Summary ==================================
✔ 5 checks passed
⚠ 4 warnings

✔ Created output directory: models/second_iter_2
ℹ Saving to output directory: models/second_iter_2
ℹ Using CPU

=========================== Initializing pipeline ===========================
[2021-09-16 17:11:19,957] [INFO] Set up nlp object from config
[2021-09-16 17:11:19,965] [INFO] Pipeline: ['sentencizer', 'tok2vec', 'ner']
[2021-09-16 17:11:20,257] [INFO] Created vocabulary
[2021-09-16 17:11:20,258] [INFO] Finished initializing nlp object
[2021-09-16 17:11:28,636] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['sentencizer', 'tok2vec', 'ner']
ℹ Frozen components: ['sentencizer']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  SENTS_F  SENTS_P  SENTS_R  ENTS_F  ENTS_P  ENTS_R  SCORE
---  ------  ------------  --------  -------  -------  -------  ------  ------  ------  ------
  0       0          0.00     61.27     0.00     0.00     0.00    0.59    2.69    0.33    0.00
  0     200        384.67   7407.70     0.00     0.00     0.00   92.01   91.78   92.23    0.46
  0     400        324.05   1453.70     0.00     0.00     0.00   97.30   97.14   97.47    0.49
  0     600        293.77    870.04     0.00     0.00     0.00   98.10   97.74   98.47    0.49
  0     800        281.56    642.34     0.00     0.00     0.00   98.46   98.38   98.54    0.49
  0    1000        327.97    549.47     0.00     0.00     0.00   98.31   97.82   98.80    0.49
  0    1200        314.19    543.25     0.00     0.00     0.00   99.21   99.08   99.34    0.50
  0    1400        242.58    410.51     0.00     0.00     0.00   98.88   98.81   98.95    0.49
  1    1600        345.26    457.07     0.00     0.00     0.00   99.31   99.24   99.39    0.50
  1    1800        440.99    482.07     0.00     0.00     0.00   99.37   99.28   99.46    0.50
  1    2000        335.41    321.95     0.00     0.00     0.00   99.42   99.43   99.41    0.50
  1    2200        478.47    441.86     0.00     0.00     0.00   99.44   99.40   99.48    0.50
  1    2400        472.65    439.53     0.00     0.00     0.00   99.47   99.52   99.41    0.50
  2    2600        583.42    475.27     0.00     0.00     0.00   99.58   99.46   99.70    0.50
  2    2800        641.44    534.79     0.00     0.00     0.00   99.62   99.51   99.73    0.50
  3    3000        487.92    471.47     0.00     0.00     0.00   99.72   99.66   99.78    0.50
  3    3200        642.01    468.65     0.00     0.00     0.00   99.68   99.63   99.73    0.50
  4    3400        591.73    388.93     0.00     0.00     0.00   99.74   99.68   99.80    0.50
  4    3600        602.04    334.65     0.00     0.00     0.00   99.74   99.69   99.80    0.50
  4    3800       1067.99    424.30     0.00     0.00     0.00   99.73   99.69   99.76    0.50
  5    4000        463.26    254.14     0.00     0.00     0.00   99.71   99.68   99.75    0.50
  5    4200        777.19    354.29     0.00     0.00     0.00   99.71   99.62   99.80    0.50
  6    4400        718.54    293.97     0.00     0.00     0.00   99.72   99.61   99.82    0.50
  6    4600        604.01    202.31     0.00     0.00     0.00   99.73   99.67   99.80    0.50
  7    4800        837.69    241.53     0.00     0.00     0.00   99.71   99.64   99.79    0.50
  7    5000        789.24    223.25     0.00     0.00     0.00   99.75   99.68   99.81    0.50
  8    5200        967.52    262.19     0.00     0.00     0.00   99.74   99.69   99.80    0.50
  8    5400        846.13    175.65     0.00     0.00     0.00   99.76   99.73   99.79    0.50
  9    5600        932.26    195.28     0.00     0.00     0.00   99.72   99.67   99.76    0.50
  9    5800        760.36    169.80     0.00     0.00     0.00   99.74   99.67   99.80    0.50
 10    6000       1085.28    241.37     0.00     0.00     0.00   99.77   99.74   99.79    0.50
 10    6200        888.86    174.76     0.00     0.00     0.00   99.76   99.69   99.83    0.50
 11    6400       1019.41    235.78     0.00     0.00     0.00   99.71   99.63   99.80    0.50
 11    6600        952.28    204.88     0.00     0.00     0.00   99.69   99.64   99.75    0.50
 12    6800        921.77    166.79     0.00     0.00     0.00   99.73   99.68   99.78    0.50
 12    7000       1098.47    175.60     0.00     0.00     0.00   99.76   99.69   99.83    0.50
 13    7200       1154.65    186.99     0.00     0.00     0.00   99.74   99.69   99.79    0.50
 13    7400        784.15    123.85     0.00     0.00     0.00   99.73   99.71   99.74    0.50
 13    7600        955.72    161.21     0.00     0.00     0.00   99.76   99.69   99.83    0.50
✔ Saved pipeline to output directory
models/second_iter_2/model-last

A score above 99% suggests overfitting. What are the common methods in spaCy to generalize better or to reduce overfitting?

polm · 2021-09-18T06:59:59Z

The most common cause of overfitting is a tiny dataset, but that doesn't appear to be your problem, so there are some other things you'll have to investigate.

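One thing is that you have a lot of misaligned tokens (886 in the training data and 367 in the dev data); you should look into why that's happening. Another is that you have significant overlap between your training and dev sets (877 training examples also appear in the evaluation data). That won't cause overfitting directly, but it can inflate your dev scores, so you'll want to fix it.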
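
As a rough sketch of how the overlap could be tracked down, assuming your examples can be compared by their raw text (the function name `split_overlap` and the sample sentences are made up for illustration, and this is not how `debug data` itself is implemented):

```python
def split_overlap(train_texts, dev_texts):
    """Return (clean_train, overlap).

    clean_train: training texts that do not appear in the dev set.
    overlap: training texts that also appear in the dev set.
    Texts are compared after stripping surrounding whitespace.
    """
    dev_set = {text.strip() for text in dev_texts}
    clean_train = [t for t in train_texts if t.strip() not in dev_set]
    overlap = [t for t in train_texts if t.strip() in dev_set]
    return clean_train, overlap


# Hypothetical toy data, only to show the call.
train = ["Apple opened a store.", "Bob moved to Paris.", "Call me at noon."]
dev = ["Bob moved to Paris.", "She works at Google."]

clean_train, overlap = split_overlap(train, dev)
print(f"{len(overlap)} training example(s) also in dev data")
```

Dropping the shared texts from one side (here the training side) keeps the dev set as an honest check on unseen examples.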

Putting aside those problems, what kind of generalization issues is your model having specifically? Is it sensitive to case changes or something similar? If so, you'll want to look at data augmentation, which is an easy and important way to build robustness.
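
To make the idea concrete, here is a minimal sketch of lower-casing augmentation, assuming your examples are `(text, entities)` pairs with character offsets. The helper `lowercase_augment` is hypothetical. spaCy v3 also ships built-in augmenters (for example `spacy.lower_case.v1`) that can be set in the training config, which is probably the better route in practice; this only illustrates what such an augmenter does.

```python
import random


def lowercase_augment(examples, level=0.3, seed=0):
    """Yield every example, plus a lower-cased copy with probability `level`.

    examples: iterable of (text, [(start, end, label), ...]) pairs.
    """
    rng = random.Random(seed)
    for text, entities in examples:
        yield text, entities
        lowered = text.lower()
        # Character offsets only stay valid if lower-casing keeps the length,
        # which is not guaranteed for every Unicode character.
        same_length = len(lowered) == len(text)
        if lowered != text and same_length and rng.random() < level:
            yield lowered, entities


# Hypothetical example; level=1.0 forces the augmented copy to be emitted.
examples = [("Bob moved to Paris.", [(0, 3, "PERSON"), (13, 18, "GPE")])]
augmented = list(lowercase_augment(examples, level=1.0))
for text, entities in augmented:
    print(text, [text[start:end] for start, end, _ in entities])
```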

Another thing you can do is check what your labelled examples look like. You have a lot of training data, but how many different words actually occur under a given label? Maybe you don't have enough variation in your training data for some reason. Without seeing the data, it's hard to say more about this. I think this is especially likely given that you're getting 99 F1 even on your dev set.
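
One way to check this, sketched under the assumption that your annotations are available as `(text, entities)` pairs with character offsets (the helper name and the toy sentences are invented for illustration):

```python
from collections import Counter, defaultdict


def surface_forms_by_label(examples):
    """Map each label to a Counter of the lower-cased strings annotated with it."""
    forms = defaultdict(Counter)
    for text, entities in examples:
        for start, end, label in entities:
            forms[label][text[start:end].lower()] += 1
    return forms


# Hypothetical toy data.
examples = [
    ("Bob moved to Paris.", [(0, 3, "PERSON"), (13, 18, "GPE")]),
    ("Bob flew to Paris.", [(0, 3, "PERSON"), (12, 17, "GPE")]),
    ("Ann lives in Rome.", [(0, 3, "PERSON"), (13, 17, "GPE")]),
]

forms = surface_forms_by_label(examples)
for label, counter in sorted(forms.items()):
    mentions = sum(counter.values())
    print(f"{label}: {len(counter)} unique form(s) in {mentions} mention(s)")
```

If a label has thousands of mentions but only a handful of unique forms, the model may simply be memorizing those strings.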

2 replies

farrandi Sep 20, 2021
Author

Thank you for the reply! I will try to clean up the training dataset.
Other than those mentioned, are there any methods to stop training earlier, so that I can keep the model from before the NER loss starts increasing?

svlandeg Sep 21, 2021
Maintainer

Typically you'd have overfitting when the loss (on your training dataset) still improves (i.e. goes DOWN), while the performance on your dev set (p/r/F/acc) deteriorates (decreases). This doesn't actually seem to be the case here. From the table you've shared, it looks like the loss is still going down and the performance on the dev set is just really good.

If you've found that the model does not generalize well to new data, your dev set probably isn't representative enough of the whole challenge, and the best course of action would be to improve upon that. And as Paul said - make sure the training set itself is representative (varied) as well.

The "best" model that spaCy saves (in the `model-best` directory) is the one with the best performance on the dev set, regardless of loss, so this approach already takes the issue of overfitting into account.
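
On stopping earlier: the `[training]` block of the config has a `patience` setting, which ends training once the dev score has not improved for that many steps. The sketch below is only an illustration of that idea, not spaCy's actual code. The function name is made up, the scores are the last ENTS_F values from the table above (spaCy itself tracks the combined SCORE column), and `patience=1600` is an assumed value.

```python
def best_step_with_patience(history, patience):
    """Return (best_step, best_score, stopped_at).

    history: list of (step, dev_score) pairs in training order.
    stopped_at is the step at which training would stop, or None if the
    patience limit is never reached.
    """
    best_step, best_score = None, float("-inf")
    for step, score in history:
        if score > best_score:
            best_step, best_score = step, score
        elif step - best_step >= patience:
            return best_step, best_score, step
    return best_step, best_score, None


# ENTS_F at steps 6000 to 7600, copied from the training log above.
history = [
    (6000, 99.77), (6200, 99.76), (6400, 99.71), (6600, 99.69), (6800, 99.73),
    (7000, 99.76), (7200, 99.74), (7400, 99.73), (7600, 99.76),
]

result = best_step_with_patience(history, patience=1600)
print(result)
```

A smaller `patience` stops sooner, but either way the checkpoint worth keeping is the one with the best dev score, not the last one.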
