
Supporting other data types (e.g. video) #53

Open
shakibyzn opened this issue Feb 6, 2024 · 5 comments
Labels
question Further information is requested

Comments

@shakibyzn

Hi,

Is it possible to use Mammoth for other seq2seq problems, such as multilingual video/image captioning? What I have in mind is to prepare video features in this format (batch, n_frames, emb_size) and read them using src_embeddings in the training script while keeping the rest unchanged in the config.yml (e.g., tgt_vocab).

@TimotheeMickus
Collaborator

TimotheeMickus commented Mar 1, 2024

Hi!
Apologies for the late answer.

Other input types are currently not implemented, although we've had this request more than once. We don't really have enough hands to look into it properly; hence up until now, we've focused on text-only applications.

It should however be feasible with a reasonably small amount of changes to the codebase, depending on exactly what you're looking for. If a hacky solution and vanilla transformer layers are good enough, then you can try the following to train a model:

  1. override the function for file reading here:

    def read_examples_from_files(

    and in particular adapt the closure _make_example_dict to retrieve your data properly:

    def _make_example_dict(packed):
        """Helper function to convert lines to dicts"""
        src_str, tgt_str = packed
        return {
            'src': tokenize_fn(src_str, side='src'),
            'tgt': tokenize_fn(tgt_str, side='tgt') if tgt_str is not None else None,
            # 'align': None,
        }

    How to do that concretely depends on how your data is formatted.

  2. tweak the data collator function here:

    def collate_fn(self, examples):

    Batched tensors are expected to be sequence-first, so your features would need to end up in the shape (n_frames, batch_size, model_dim).

  3. turn off mapping input tokens to embeddings; cf. here for the encoder if I'm not mistaken:

    emb = self.embeddings(src)
    emb = emb.transpose(0, 1).contiguous()

  4. use a default sentence-level batching function (--batch_type sents). You might also need to pass some dummy variables for the source vocab, or comment out the relevant section.
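The steps above can be sketched roughly as follows. None of these function names exist in Mammoth; they are made-up stand-ins that only illustrate the shapes involved (NumPy arrays stand in for torch tensors):

```python
import numpy as np

def load_video_features(paths):
    """Step 1 stand-in: read precomputed per-clip frame features
    (each of shape (n_frames, emb_size)) instead of tokenizing text."""
    return [np.load(p) for p in paths]

def collate_video(examples, emb_size):
    """Steps 2-3 stand-in: pad clips to the longest one in the batch,
    stack them, and move to the sequence-first layout
    (n_frames, batch_size, emb_size) that the encoder expects, since
    the embedding lookup is bypassed for precomputed features."""
    max_len = max(ex.shape[0] for ex in examples)
    batch = np.zeros((len(examples), max_len, emb_size), dtype=np.float32)
    for i, ex in enumerate(examples):
        batch[i, : ex.shape[0]] = ex
    # (batch, n_frames, emb) -> (n_frames, batch, emb)
    return batch.transpose(1, 0, 2)
```

The actual collate_fn would also need to build the target-side tensors and whatever mask/length bookkeeping the rest of the pipeline expects.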

If you'd like to have a look at doing that more properly, external contributions are very much welcome!

@TimotheeMickus TimotheeMickus added the question Further information is requested label Mar 1, 2024
@shakibyzn
Author

Thank you for your detailed reply. I've been working on this for a month now and I'm able to train it properly. I haven't run many experiments with it yet, and I'm trying different hyperparameters to see if I can reach acceptable performance. One question: isn't it possible to use the loss as the criterion for early stopping?

@TimotheeMickus
Collaborator

Yes, early stopping should be supported out of the box.

Assuming you want to evaluate every 10k steps, and stop training if it no longer improves after 5 evaluation loops, then:

  1. provide a path_valid_src and a path_valid_tgt in your task definitions to enable validation loops
  2. add the following to your YAML config:
early_stopping: 5
early_stopping_criteria: ppl
valid_steps: 10000

The early stopper will evaluate perplexity on the validation dataset(s), which should be equivalent to the cross-entropy used to train the model.
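Put together, the relevant fragment of the YAML config might look like this; the task name and file paths are hypothetical placeholders, not taken from the Mammoth docs:

```yaml
tasks:
  train_en-de:
    path_src: data/train.en
    path_tgt: data/train.de
    path_valid_src: data/valid.en   # enables the validation loop
    path_valid_tgt: data/valid.de

early_stopping: 5             # stop after 5 evaluations without improvement
early_stopping_criteria: ppl  # perplexity, equivalent to training cross-entropy
valid_steps: 10000            # validate every 10k steps
```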
If you need something finer-grained than that, then you can implement a Scorer object:

class Scorer(object):

and then register it among the default scorers and scorer builders:

DEFAULT_SCORERS = [PPLScorer(), AccuracyScorer()]
SCORER_BUILDER = {"ppl": PPLScorer, "accuracy": AccuracyScorer}
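For illustration, a custom scorer could look roughly like the sketch below. The exact Scorer interface in Mammoth may differ; this sketch assumes a scorer is called with lists of hypothesis and reference strings and returns a float, and ExactMatchScorer is a made-up toy metric, not part of the library:

```python
class Scorer:
    """Minimal stand-in for the library's Scorer base class
    (assumed interface, see the actual source for the real one)."""
    def __call__(self, hypotheses, references):
        raise NotImplementedError

class ExactMatchScorer(Scorer):
    """Toy metric: fraction of hypotheses identical to their reference."""
    def __call__(self, hypotheses, references):
        matches = sum(h == r for h, r in zip(hypotheses, references))
        return matches / max(len(references), 1)
```

It would then be exposed alongside the built-ins, e.g. by adding an "exact_match" entry to SCORER_BUILDER.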

@shakibyzn
Author

Thank you.

@TimotheeMickus
Collaborator

No worries! Don't hesitate to share your code if you want us to include it in the library; we would welcome a pull request if you have something working.
