Question on feature assignment strategies for optimizing morphologizer #11682

megamattc · 2022-10-20T19:56:47Z

Hello,

I was wondering whether the developers or an experienced spaCy user know if it makes a difference to the training data for the parser or morphologizer whether tokens are maximally specified for the presence or absence of features.

There are two strategies I have in mind. In the first, optional morphological features are explicitly marked as present or absent on each token, so leaving a feature out means I am not committing either way (say, because of broken context or a corrupt form), and the feature may still be present. In the second, only a positive (existential) value is allowed, so leaving the feature out means either that it isn't present or that we don't know.

As an example, suppose all verbs can take a subordinative suffix -ni. Under the second strategy, I mark SubSuff=Yes when the suffix is clearly present on a verb, and say nothing when it is either clearly absent from a complete form or I don't want to commit either way. Under the first strategy, I would also include SubSuff=No on every verb where I commit to there being no suffix, and saying nothing would mean I don't commit.

Does this make a difference, particularly if I have a small training set?

Answered by polm

Oct 21, 2022

Honestly it's hard to say. When confronted with alternatives like this, I would try both and compare the results. For the annotations, you can create the more detailed version first, convert it to the less detailed one, and then try the same pipeline on both.
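As a rough illustration of that conversion step, here is a minimal sketch that derives the positive-only scheme from the fully specified one by dropping explicit `No` values from UD-style feature strings. The helper name `drop_negative_values`, the placeholder token forms, and the `Tense=Past` feature are assumptions made up for the example; only `SubSuff` comes from the question.

```python
# Hypothetical helper: turn fully specified annotations into positive-only
# ones by removing Feature=No pairs from a "Feat=Val|Feat=Val" string.
def drop_negative_values(feats: str, negative: str = "No") -> str:
    """Return feats without pairs whose value equals `negative`.

    Uses "_" for an empty feature set, following the CoNLL-U convention.
    """
    if feats in ("", "_"):
        return "_"
    kept = []
    for pair in feats.split("|"):
        _name, _sep, value = pair.partition("=")
        if value != negative:
            kept.append(pair)
    return "|".join(kept) if kept else "_"


# Detailed scheme: Yes / No / unspecified are three distinct states.
detailed = [
    ("verb_a", "SubSuff=Yes|Tense=Past"),  # suffix clearly present
    ("verb_b", "SubSuff=No|Tense=Past"),   # suffix clearly absent
    ("verb_c", "Tense=Past"),              # damaged form, no commitment
]

# Coarse scheme: "clearly absent" and "no commitment" become identical.
coarse = [(form, drop_negative_values(feats)) for form, feats in detailed]

for form, feats in coarse:
    print(form, feats)
```

Writing both versions out as separate training corpora and training the same config on each gives a like-for-like comparison. Note that the conversion only works in this direction: the coarse scheme cannot be expanded back into the detailed one.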

You do need to be careful about annotating "unclear" and "definitely no" (or some other combination) the same way, as the model then cannot differentiate those cases. If you can separate definite and ambiguous cases, it's OK to use an intermediate label for the indefinite state that can be cleaned up by a later component.
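One way to picture the intermediate-label idea is the sketch below, which keeps a third value for uncommitted cases during training and strips it from predictions afterwards. The value `Unk` and the function `clean_prediction` are hypothetical names chosen for the example, not spaCy or UD built-ins, and the predictions are hand-written stand-ins rather than real model output.

```python
from typing import Dict, List

UNSURE = "Unk"  # hypothetical placeholder value for "no commitment"


def clean_prediction(morph: Dict[str, str], unsure: str = UNSURE) -> Dict[str, str]:
    """Drop features whose value is the placeholder, keeping definite ones."""
    return {name: value for name, value in morph.items() if value != unsure}


# Stand-ins for predicted feature sets on three verbs.
predicted: List[Dict[str, str]] = [
    {"SubSuff": "Yes"},  # suffix definitely present
    {"SubSuff": "No"},   # suffix definitely absent
    {"SubSuff": "Unk"},  # annotator would not have committed
]

cleaned = [clean_prediction(morph) for morph in predicted]
print(cleaned)
```

In a spaCy pipeline, logic like this could sit in a custom component placed after the morphologizer, so that the model still learns three distinct states while downstream code only sees the definite ones.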

polm · 2022-10-21T06:29:22Z

1 reply

megamattc Oct 21, 2022
Author

Ok. Thank you!
