
Miscellaneous fixes to the x-transformers implementation #79

Open
wants to merge 8 commits into main
Commits on Oct 7, 2024

  1. Detect model blowout

    Skip backward if loss is NaN.
    Stop training if too many batches are skipped.
    Waino committed Oct 7, 2024 · 5a73b4a
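
    A minimal sketch of the blowout detection described above, assuming a generic PyTorch training loop; the names (max_skipped_batches, loss_fn) are illustrative, not the actual options introduced in this PR:

    ```python
    import torch

    def train_epoch(model, optimizer, loss_fn, batches, max_skipped_batches=10):
        """Skip backward on a non-finite loss; abort after too many skips."""
        skipped = 0
        for batch in batches:
            optimizer.zero_grad()
            output = model(batch["input"])
            loss = loss_fn(output, batch["target"])

            # Blowout detection: backpropagating a NaN/inf loss would poison
            # the weights, so skip this batch instead of calling backward().
            if not torch.isfinite(loss):
                skipped += 1
                if skipped >= max_skipped_batches:
                    raise RuntimeError(
                        f"Stopping training: {skipped} batches skipped due to non-finite loss"
                    )
                continue

            loss.backward()
            optimizer.step()
    ```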
  2. Bugfix to validation

    Waino committed Oct 7, 2024 · 17b6ced
  3. Remove more obsolete opts

    Waino committed Oct 7, 2024 · 4f6620c
  4. 7229141
  5. Bugfix: Statistics inherits n_correct from previous instance

    The default value must be either zero or None, depending on whether
    accuracy is reported.
    Waino committed Oct 7, 2024 · 203d4d5
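
    A sketch of the kind of fix described above, assuming an OpenNMT-style Statistics accumulator (the class shape below is illustrative, not the repository's exact implementation): n_correct defaults to None when accuracy is not reported and must be passed as 0 when it is, so a new instance never carries over a count from a previous one.

    ```python
    class Statistics:
        """Accumulator for loss/accuracy statistics (illustrative sketch)."""

        def __init__(self, loss=0.0, n_words=0, n_correct=None):
            # None means "accuracy not reported"; callers that do report
            # accuracy must pass n_correct=0 explicitly, so nothing is
            # inherited from a previously constructed instance.
            self.loss = loss
            self.n_words = n_words
            self.n_correct = n_correct

        def update(self, other):
            self.loss += other.loss
            self.n_words += other.n_words
            if other.n_correct is not None:
                self.n_correct = (self.n_correct or 0) + other.n_correct

        def accuracy(self):
            if self.n_correct is None or self.n_words == 0:
                return None
            return 100.0 * self.n_correct / self.n_words
    ```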
  6. 83c8c26

Commits on Oct 14, 2024

  1. Distributed component for TransformerWrapper

    Parameters in the TransformerWrapper, e.g. to_logits, need their own
    distributed component and optimizer.
    Waino committed Oct 14, 2024 · 97bc2d9
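
    A rough sketch of the idea, using the public x-transformers API: parameters that live directly on the TransformerWrapper (token embedding, to_logits) do not belong to the inner attn_layers, so they need their own parameter group or optimizer when each component is optimized separately. The grouping below is illustrative and not the PR's actual distributed-component code.

    ```python
    import torch
    from x_transformers import TransformerWrapper, Decoder

    model = TransformerWrapper(
        num_tokens=32000,
        max_seq_len=1024,
        attn_layers=Decoder(dim=512, depth=6, heads=8),
    )

    # Parameters of the inner attention stack are assumed to be handled by
    # their own per-component optimizers elsewhere; everything that lives
    # directly on the wrapper (e.g. token embedding, to_logits) would
    # otherwise be left without an optimizer.
    attn_param_ids = {id(p) for p in model.attn_layers.parameters()}
    wrapper_params = [p for p in model.parameters() if id(p) not in attn_param_ids]

    wrapper_optimizer = torch.optim.AdamW(wrapper_params, lr=1e-4)
    ```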
  2. State dict fixes

    The adapter injection code was causing parameter duplication.
    
    Another issue: to normalize or not to normalize?
    We compute a normalization based on either tokens or sents, but never
    apply it. The effect can be compensated for with the learning rate, as
    long as batches are approximately the same size. Excessively high
    learning rates lead to gradient clipping, which is especially
    detrimental because each component is clipped individually.
    
    Clipping deterministically requires one of the following:
    - access to gradients for all parameters of the entire model (infeasible)
    - component local clipping (current approach)
    - communicating a clipping factor across devices (maybe we should do
      this? see the sketch below)
    Waino committed Oct 14, 2024 · 3a16fa0
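
    The third option above (communicating a clipping factor across devices) could look roughly like this sketch, assuming torch.distributed is already initialized: each device contributes the squared gradient norm of its local component, the sums are all-reduced, and every device then scales by the same factor. Function and argument names are illustrative.

    ```python
    import torch
    import torch.distributed as dist

    def clip_grad_norm_across_devices(local_params, max_norm, device):
        """Clip with a single clipping factor agreed on by all devices (sketch)."""
        grads = [p.grad for p in local_params if p.grad is not None]

        # Squared L2 norm of the gradients held on this device.
        local_sq_norm = torch.zeros(1, device=device)
        for g in grads:
            local_sq_norm += g.norm(2) ** 2

        # Sum squared norms over all devices so that every device computes
        # the same global norm, and hence the same clipping factor.
        dist.all_reduce(local_sq_norm, op=dist.ReduceOp.SUM)
        total_norm = local_sq_norm.sqrt()

        clip_factor = max_norm / (total_norm + 1e-6)
        if clip_factor < 1.0:
            for g in grads:
                g.mul_(clip_factor)
        return total_norm
    ```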