
Miscellaneous fixes to the x-transformers implementation #79

Open
wants to merge 8 commits into main
Commits on Oct 7, 2024

  1. Detect model blowout

    Skip backward if loss is NaN.
    Stop training if too many batches are skipped.
    Waino committed Oct 7, 2024 · 5a73b4a
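
    A minimal sketch of the blowout detection described above, assuming a generic PyTorch training loop; the names (max_skipped_batches, loss_fn) are illustrative, not the actual options introduced in this PR:

    ```python
    import torch

    def train_epoch(model, optimizer, loss_fn, batches, max_skipped_batches=10):
        """Skip backward on a non-finite loss; abort after too many skips."""
        skipped = 0
        for batch in batches:
            optimizer.zero_grad()
            output = model(batch["input"])
            loss = loss_fn(output, batch["target"])

            # Blowout detection: backpropagating a NaN/inf loss would poison
            # the weights, so skip this batch instead of calling backward().
            if not torch.isfinite(loss):
                skipped += 1
                if skipped >= max_skipped_batches:
                    raise RuntimeError(
                        f"Stopping training: {skipped} batches skipped due to non-finite loss"
                    )
                continue

            loss.backward()
            optimizer.step()
    ```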
  2. Bugfix to validation

    Waino committed Oct 7, 2024 · 17b6ced
  3. Remove more obsolete opts

    Waino committed Oct 7, 2024 · 4f6620c
  4. 7229141
  5. Bugfix: Statistics inherits n_correct from previous instance

    The default value must be either zero or None, depending on whether
    accuracy is reported.
    Waino committed Oct 7, 2024 · 203d4d5
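
    A sketch of the kind of fix described above, assuming an OpenNMT-style Statistics accumulator (the class shape below is illustrative, not the repository's exact implementation): n_correct defaults to None when accuracy is not reported and must be passed as 0 when it is, so a new instance never carries over a count from a previous one.

    ```python
    class Statistics:
        """Accumulator for loss/accuracy statistics (illustrative sketch)."""

        def __init__(self, loss=0.0, n_words=0, n_correct=None):
            # None means "accuracy not reported"; callers that do report
            # accuracy must pass n_correct=0 explicitly, so nothing is
            # inherited from a previously constructed instance.
            self.loss = loss
            self.n_words = n_words
            self.n_correct = n_correct

        def update(self, other):
            self.loss += other.loss
            self.n_words += other.n_words
            if other.n_correct is not None:
                self.n_correct = (self.n_correct or 0) + other.n_correct

        def accuracy(self):
            if self.n_correct is None or self.n_words == 0:
                return None
            return 100.0 * self.n_correct / self.n_words
    ```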
  6. 83c8c26

Commits on Oct 14, 2024

  1. Distributed component for TransformerWrapper

    Parameters in the TransformerWrapper, e.g. to_logits, need their own
    distributed component and optimizer.
    Waino committed Oct 14, 2024 · 97bc2d9
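
    A rough sketch of the idea, using the public x-transformers API: parameters that live directly on the TransformerWrapper (token embedding, to_logits) do not belong to the inner attn_layers, so they need their own parameter group or optimizer when each component is optimized separately. The grouping below is illustrative and not the PR's actual distributed-component code.

    ```python
    import torch
    from x_transformers import TransformerWrapper, Decoder

    model = TransformerWrapper(
        num_tokens=32000,
        max_seq_len=1024,
        attn_layers=Decoder(dim=512, depth=6, heads=8),
    )

    # Parameters of the inner attention stack are assumed to be handled by
    # their own per-component optimizers elsewhere; everything that lives
    # directly on the wrapper (e.g. token embedding, to_logits) would
    # otherwise be left without an optimizer.
    attn_param_ids = {id(p) for p in model.attn_layers.parameters()}
    wrapper_params = [p for p in model.parameters() if id(p) not in attn_param_ids]

    wrapper_optimizer = torch.optim.AdamW(wrapper_params, lr=1e-4)
    ```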
  2. State dict fixes

    The adapter injection code was causing parameter duplication.
    
    Another issue: to normalize or not to normalize?
    We compute a normalization based on either tokens or sents, but never
    apply it. The effect can be compensated for with the learning rate, as
    long as batches are approximately the same size. Excessively high
    learning rates lead to gradient clipping, which is especially
    detrimental because each component is clipped individually.
    
    Clipping deterministically requires one of the following:
    - access to gradients for all parameters of the entire model (infeasible)
    - component local clipping (current approach)
    - communicating a clipping factor across devices (maybe we should do
      this? see the sketch below)
    Waino committed Oct 14, 2024 · 3a16fa0
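
    The third option above (communicating a clipping factor across devices) could look roughly like this sketch, assuming torch.distributed is already initialized: each device contributes the squared gradient norm of its local component, the sums are all-reduced, and every device then scales by the same factor. Function and argument names are illustrative.

    ```python
    import torch
    import torch.distributed as dist

    def clip_grad_norm_across_devices(local_params, max_norm, device):
        """Clip with a single clipping factor agreed on by all devices (sketch)."""
        grads = [p.grad for p in local_params if p.grad is not None]

        # Squared L2 norm of the gradients held on this device.
        local_sq_norm = torch.zeros(1, device=device)
        for g in grads:
            local_sq_norm += g.norm(2) ** 2

        # Sum squared norms over all devices so that every device computes
        # the same global norm, and hence the same clipping factor.
        dist.all_reduce(local_sq_norm, op=dist.ReduceOp.SUM)
        total_norm = local_sq_norm.sqrt()

        clip_factor = max_norm / (total_norm + 1e-6)
        if clip_factor < 1.0:
            for g in grads:
                g.mul_(clip_factor)
        return total_norm
    ```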