Question on feature assignment strategies for optimizing morphologizer #11682
-
Hello, I was wondering whether the developers or an expert spaCy user know if it makes a difference in the training data for the parser or morphologizer whether tokens are maximally specified for the presence or absence of features. On the one hand, optional morphological features could be explicitly marked as present or absent on each token, so that saying nothing about a feature (say, because of broken context or a corrupt form) means it may still be present. On the other hand, we could allow only a positive/existential value for the feature, so that saying nothing means either that it isn't present or that we don't know. As an example of the latter: if verbs can all take a subordinative suffix -ni, I would say SubSuff=Yes when it is clearly present on a verb, but say nothing when it is either clearly absent from a complete form or I don't want to commit either way. Using the former strategy, I would always include SubSuff=No on all verbs where I commit to saying there is no suffix, and saying nothing means I don't commit. Does this make a difference, particularly if I have a small training set?
Replies: 1 comment 1 reply
-
Honestly, it's hard to say. When confronted with alternatives like this, I would try both and compare the results. For annotations, you can create the more detailed annotations first, then convert them to the less detailed ones, and try the same pipeline on both. You do need to be careful about annotating "unclear" and "definitely no" (or some other combination) the same way, since the model then cannot differentiate those cases. If you can separate definite and ambiguous cases, it's OK to use an intermediate label for the indefinite state that a later component can clean up.
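As a rough illustration of the "annotate in detail, then convert down" idea, here is a minimal sketch in plain Python. It assumes the features are stored as UD-style FEATS strings (`Name=Value` pairs joined by `|`, with `_` for no features); the helper name `drop_negative_features` and the `Tense=Past` feature are made up for the example, and only `SubSuff` comes from the question.

```python
def drop_negative_features(feats: str, negative_value: str = "No") -> str:
    """Turn a maximally specified FEATS string into a positive-only one.

    Hypothetical helper: removes every Name=Value pair whose value equals
    `negative_value`, and returns "_" if nothing is left.
    """
    if feats in ("", "_"):
        return "_"
    kept = [
        pair
        for pair in feats.split("|")
        if pair.partition("=")[2] != negative_value
    ]
    return "|".join(kept) if kept else "_"


# Detailed annotations: "No" is a commitment, silence means "not committing".
detailed = [
    "SubSuff=Yes|Tense=Past",  # suffix clearly present
    "SubSuff=No|Tense=Past",   # suffix clearly absent
    "Tense=Past",              # annotator does not commit either way
    "SubSuff=No",
]

# Less detailed variant derived from the same data.
positive_only = [drop_negative_features(f) for f in detailed]
```

Note that after conversion the second and third items are identical, which is exactly the "unclear" versus "definitely no" collapse described above. Training the same pipeline once on `detailed` and once on `positive_only` would give the comparison between the two strategies.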