
Commit

Update docs for ConcatDataset (#1181)
joecummings authored Jul 15, 2024
1 parent 1d88c22 commit f17335a
Showing 2 changed files with 20 additions and 3 deletions.
5 changes: 3 additions & 2 deletions docs/source/tutorials/datasets.rst
@@ -395,8 +395,9 @@ you can also add more advanced behavior.
Multiple in-memory datasets
---------------------------

-It is also possible to train on multiple datasets and configure them individually.
-You can even mix instruct and chat datasets or other custom datasets.
+It is also possible to train on multiple datasets and configure them individually using
+our :class:`~torchtune.datasets.ConcatDataset` interface. You can even mix instruct and chat datasets
+or other custom datasets.

.. code-block:: yaml
18 changes: 17 additions & 1 deletion torchtune/datasets/_concat.py
@@ -41,13 +41,29 @@ class ConcatDataset(Dataset):
         _indexes (List[Tuple[int, int, int]]): A list of tuples where each tuple contains the starting index, the
             ending index, and the dataset index for quick lookup and access during indexing operations.

-    Example:
+    Examples:
         >>> dataset1 = MyCustomDataset(params1)
         >>> dataset2 = MyCustomDataset(params2)
         >>> concat_dataset = ConcatDataset([dataset1, dataset2])
         >>> print(len(concat_dataset)) # Total length of both datasets
         >>> data_point = concat_dataset[1500] # Accesses an element from the appropriate dataset

+        This can also be accomplished by passing in a list of datasets to the YAML config::
+
+            dataset:
+              - _component_: torchtune.datasets.instruct_dataset
+                source: vicgalle/alpaca-gpt4
+                template: torchtune.data.AlpacaInstructTemplate
+                split: train
+                train_on_input: True
+              - _component_: torchtune.datasets.instruct_dataset
+                source: samsum
+                template: torchtune.data.SummarizeTemplate
+                column_map: {"output": "summary"}
+                output: summary
+                split: train
+                train_on_input: False
+
     This class primarily focuses on providing a unified interface to access elements from multiple datasets,
     enhancing the flexibility in handling diverse data sources for training machine learning models.
     """
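
The docstring above describes ``_indexes`` as a list of (start, end, dataset index) tuples used to route a global index to the right dataset. The following is a minimal sketch of that lookup idea, not torchtune's actual implementation: the class name ``SimpleConcat`` is hypothetical, plain Python lists stand in for datasets, and the end index is assumed to be exclusive.

```python
from typing import Any, List, Sequence, Tuple


class SimpleConcat:
    """Hypothetical stand-in for illustration; not torchtune's ConcatDataset."""

    def __init__(self, datasets: Sequence[Sequence[Any]]) -> None:
        self._datasets = list(datasets)
        # One (start, end, dataset_idx) tuple per dataset; end is exclusive (assumption).
        self._indexes: List[Tuple[int, int, int]] = []
        cumulative = 0
        for dataset_idx, dataset in enumerate(self._datasets):
            next_cumulative = cumulative + len(dataset)
            self._indexes.append((cumulative, next_cumulative, dataset_idx))
            cumulative = next_cumulative
        self._len = cumulative

    def __len__(self) -> int:
        # Total number of elements across all datasets.
        return self._len

    def __getitem__(self, index: int) -> Any:
        # Find the range that owns the global index, then convert to a local index.
        for start, end, dataset_idx in self._indexes:
            if start <= index < end:
                return self._datasets[dataset_idx][index - start]
        raise IndexError(f"index {index} out of range for length {self._len}")


combined = SimpleConcat([["a0", "a1", "a2"], ["b0", "b1"]])
print(len(combined))  # 5
print(combined[3])    # b0
```

A linear scan over the ranges is enough when only a handful of datasets are combined. With many datasets, a binary search over the start offsets would be the usual alternative.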