adding first rules draft for llama2_70b_lora #536

Merged
36 changes: 34 additions & 2 deletions training_rules.adoc
@@ -84,6 +84,7 @@ The benchmark suite consists of the benchmarks shown in the following table.
| |NLP |Wikipedia 2020/01/01
| |Large language model |c4/en/3.0.1
|Commerce |Recommendation |Criteo 1TB Click Logs (multi-hot variant)
|Graphs | Node classification | IGBH-Full
|===

MLCommons provides a reference implementation of each benchmark, which includes the following elements:
@@ -141,7 +142,9 @@ The closed division models and quality targets are:
|Language | Speech recognition | RNN-T | 0.058 Word Error Rate
| |NLP |BERT |0.720 Mask-LM accuracy
| |Large Language Model |GPT3 |2.69 log perplexity
| |Large Language Model |Llama2-70B-LoRA |0.925 Eval loss
|Commerce |Recommendation |DLRMv2 (DCNv2) |0.80275 AUC
|Graphs | Node classification|R-GAT | 72.0 % classification
|===

Closed division benchmarks must be referred to using the benchmark name plus the term Closed, e.g. “for the Image Classification Closed benchmark, the system achieved a result of 7.2.”
@@ -300,6 +303,16 @@ The MLPerf verifier scripts checks all hyperparameters except those with names m
|gpt3 |adam |opt_init_checkpoint_step |ceil(4000 * 1536 / batch_size) |first step after loading initial checkpoint |See PR (From NV and Google, TODO Link)
|gpt3 |adam |opt_base_learning_rate |constrained based on global_batch_size |refer to next table in section "GPT3 learning rates" |See PR (From NV and Google, TODO Link)
|gpt3 |adam |opt_end_learning_rate |10% of opt_base_learning_rate |learning rate at the last step of decay period |See PR (From NV and Google, TODO Link)
|llama2_70b_lora |adamw |global_batch_size |unconstrained |batch size in sequences |See PR (From NV and Habana, TODO Link)
|llama2_70b_lora |adamw |opt_gradient_clip_norm |fixed to reference (0.3) |Gradients are clipped above this norm threshold. |See PR (From Habana, TODO Link)
|llama2_70b_lora |adamw |lora_dropout |fixed to reference (0.1) |dropout probability applied to the LoRA layers |See PR (From Habana, TODO Link)
|llama2_70b_lora |adamw |sequence_length |8196 |the sequence length - fixed to reference |See PR (From Habana, TODO Link)
|llama2_70b_lora |adamw |lora_alpha |fixed to reference (32) |scaling factor for the LoRA weight matrices |See PR (From Habana, TODO Link)
|llama2_70b_lora |adamw |opt_weight_decay |fixed to reference (0.0001) |weight decay |See PR (From Habana, TODO Link)
|llama2_70b_lora |adamw |gradient_accumulation_steps |unconstrained |Number of forward/backward steps between optimizer steps. |See PR (From Habana, TODO Link)
|llama2_70b_lora |adamw |opt_learning_rate_warmup_ratio |unconstrained |Fraction of the total training steps used for linear warmup during initial checkpoint generation. This only affects the learning rate curve in the benchmarking region. |See PR (From Habana, TODO Link)
|llama2_70b_lora |adamw |opt_learning_rate_training_steps |unconstrained |Step at which the end of the cosine learning rate curve is reached. Learning rate cosine decay is in the range (opt_learning_rate_warmup_steps + 1, opt_learning_rate_decay_steps]. |See PR (From Habana, TODO Link)
|llama2_70b_lora |adamw |opt_base_learning_rate |unconstrained |base learning rate |See PR (From Habana, TODO Link)
|maskrcnn |sgd |global_batch_size |arbitrary constant |global version of reference SOLVER.IMS_PER_BATCH |link:https://github.com/mlperf/training/blob/00570abf77d351e474d57830014f6a3e501dece1/object_detection/pytorch/maskrcnn_benchmark/data/build.py#L112[reference code]
|maskrcnn |sgd |opt_learning_rate_decay_factor$$*$$ |fixed to reference (0.1) |learning rate decay factor |link:https://github.com/mlperf/training/blob/00570abf77d351e474d57830014f6a3e501dece1/object_detection/pytorch/maskrcnn_benchmark/solver/build.py#L13[reference code]
|maskrcnn |sgd |opt_learning_rate_decay_steps$$*$$ |(60000, 80000) * (1 + K / 10) * 16 / global_batch_size where K is integer |Steps at which learning rate is decayed |link:https://github.com/mlperf/training/blob/00570abf77d351e474d57830014f6a3e501dece1/object_detection/pytorch/maskrcnn_benchmark/solver/build.py#L26[reference code]
@@ -388,9 +401,12 @@ The MLPerf verifier scripts checks all hyperparameters except those with names m
|unet3d |sgd |evaluation_input_shape |fixed to reference |evaluation input shape |reference --val_input_shape
|unet3d |sgd |data_train_samples |fixed to reference |number of training samples | N/A
|unet3d |sgd |data_eval_samples |fixed to reference |number of evaluation samples | N/A
|gnn |adam |global_batch_size |arbitrary constant |global batch size |link:https://github.com/alibaba/graphlearn-for-pytorch/blob/main/examples/igbh/train_rgnn_multi_gpu.py#L293[reference code]
|gnn |adam |opt_base_learning_rate |unconstrained |base learning rate|link:https://github.com/alibaba/graphlearn-for-pytorch/blob/main/examples/igbh/train_rgnn_multi_gpu.py#L296[reference code]
|===
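
The way the llama2_70b_lora rows above combine `opt_base_learning_rate`, `opt_learning_rate_warmup_ratio`, and `opt_learning_rate_training_steps` can be sketched as follows. This is a minimal illustration, not the reference implementation: the function name `lora_learning_rate`, the rounding of the warmup length, and the decay to zero are assumptions made for the example.

```python
import math


def lora_learning_rate(step, base_lr, warmup_ratio, training_steps):
    """Linear warmup followed by cosine decay (illustrative sketch).

    Assumptions not stated in the rules: the warmup length is
    round(warmup_ratio * training_steps), and the cosine curve decays to 0.
    `step` is 0-indexed.
    """
    warmup_steps = round(warmup_ratio * training_steps)
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Clamp so steps past the end of the schedule stay at the final value.
    decay_span = max(training_steps - warmup_steps, 1)
    progress = min((step - warmup_steps) / decay_span, 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

With `base_lr=4e-4`, `warmup_ratio=0.1`, and `training_steps=100`, the rate ramps up over the first 10 steps, peaks at `4e-4`, and follows a half cosine down to 0 at step 100.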

OPEN: Hyperparameters and optimizer may be freely changed.


==== GPT3 hyperparameter constraints

@@ -446,7 +462,9 @@ CLOSED: The same quality measure as the reference implementation must be used. T
|Language|Speech recognition |RNN-T|Every 1 epoch
| |NLP |BERT| eval_interval_samples=FLOOR(0.05*(230.23*GBS+3000000), 25000), skipping 0
| |Large Language Model |GPT3| Every 24576 sequences, or every CEIL(24576 / global_batch_size) steps if 24576 is not divisible by GBS
| |Large Language Model |Llama2_70B_LoRA| Every 384 sequences, or every CEIL(384 / global_batch_size) steps if 384 is not divisible by GBS, skipping the first 3 evaluations
|Commerce|Recommendation |DLRMv2 (DCNv2)|Every FLOOR(TOTAL_TRAINING_SAMPLES / (GLOBAL_BATCH_SIZE * NUM_EVAL)) samples, where TOTAL_TRAINING_SAMPLES = 4195197692 and NUM_EVAL = 20
|Graphs|Node classification|R-GAT|Evaluate 20 times per epoch
|===
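
The Llama2_70B_LoRA evaluation cadence in the table above can be turned into a step count as sketched below. The helper names are hypothetical, and the reading of "skipping first 3 evaluations" as "the first three evaluation points are not run" is an assumption, not something the rules text spells out.

```python
import math


def eval_interval_steps(global_batch_size, eval_every_sequences=384):
    """Number of optimizer steps between evaluations.

    Equals the exact quotient when the batch size divides the interval,
    and CEIL(eval_every_sequences / global_batch_size) otherwise.
    """
    return math.ceil(eval_every_sequences / global_batch_size)


def should_evaluate(step, global_batch_size, skip_first=3):
    """Return True if an evaluation is due after `step` (1-indexed) steps.

    Assumption: the first `skip_first` evaluation points are skipped.
    """
    interval = eval_interval_steps(global_batch_size)
    if step % interval != 0:
        return False
    return step // interval > skip_first
```

For example, a global batch size of 8 gives an interval of 48 steps, while a batch size of 10 does not divide 384 and gives CEIL(38.4) = 39 steps.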

OPEN: An arbitrary stopping criterion may be used, including but not limited to the closed quality measure, a different quality measure, the number of epochs, or a fixed time. However, the reported results must include the geometric mean of the final quality as measured by the closed quality measure.
@@ -498,7 +516,9 @@ Each benchmark result is based on a set of run results. The number of results fo
|Language |NLP |10
| |Speech recognition |10
| |Large language model |3
| |Large language model Fine Tune (LoRA) |10
|Commerce |Recommendation |10
|Graphs|Node classification|10
|===

Each benchmark result is computed by dropping the fastest and slowest runs, then taking the mean of the remaining times. For this purpose, a single non-converging run may be treated as the slowest run and dropped. A benchmark result is invalid if there is more than one non-converging run.
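
The scoring rule in the paragraph above can be sketched as follows. The function name `benchmark_result` and the use of `None` to mark a non-converging run are assumptions made for this illustration. Returning `None` when too few runs remain is also an assumption.

```python
def benchmark_result(run_times):
    """Compute a benchmark result from per-run times.

    `run_times` holds one entry per run: a time, or None for a run that
    did not converge (this encoding is an assumption of the sketch).
    Returns the mean after dropping the fastest and slowest runs, or None
    if the result is invalid.
    """
    non_converged = sum(1 for t in run_times if t is None)
    if non_converged > 1:
        return None  # more than one non-converging run invalidates the result
    converged = sorted(t for t in run_times if t is not None)
    if non_converged == 1:
        # The non-converging run counts as the slowest; drop only the fastest.
        kept = converged[1:]
    else:
        kept = converged[1:-1]
    if not kept:
        return None  # too few runs left to form a result (assumption)
    return sum(kept) / len(kept)
```

For five runs taking 10, 12, 11, 13, and 9 minutes, the 9 and 13 minute runs are dropped and the result is the mean of 10, 11, and 12.
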
@@ -560,12 +580,14 @@ To extract submission convergence points, logs should report epochs as follows.
| RN50 | Epoch
| BERT | Training sample (integer)
| GPT3 | Training token starting from 0 (integer)
| Llama2_70B_LoRA | Training sample (integer)
| DLRMv2 (DCNv2) | Training iteration as the fraction of a total number of iterations for one epoch (0.05, 0.1, 0.15, ..., 1.0)
| Stable-Diffusion | Training sample (integer)
| SSD (RetinaNet) | Epoch
| Mask-RCNN | Epoch
| RNN-T | Epoch
| UNET3D | Epoch
| R-GAT | Training iteration as the fraction of a total number of iterations for one epoch (0.05, 0.1, 0.15, ..., 1.0)
|===
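
For the benchmarks in the table above that report progress as a fraction of one epoch (DLRMv2 and R-GAT), the reporting points can be sketched as below. The function name `convergence_points` is hypothetical. Rounding fractional positions up to a whole iteration and using 20 points per epoch by default are assumptions of the example.

```python
def convergence_points(iterations_per_epoch, num_evals=20):
    """Return (iteration, epoch_fraction) pairs for one epoch.

    Fractions run 0.05, 0.1, ..., 1.0 for the default num_evals=20.
    Assumption: a fractional position is rounded up to a whole iteration.
    """
    points = []
    for i in range(1, num_evals + 1):
        # Integer ceiling avoids floating-point drift in the iteration index.
        iteration = -(-i * iterations_per_epoch // num_evals)
        points.append((iteration, round(i / num_evals, 2)))
    return points
```

With 1000 iterations per epoch, the first point is iteration 50 at fraction 0.05 and the last is iteration 1000 at fraction 1.0.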

=== Handling RCP Failures
@@ -591,6 +613,16 @@ The SWG must come to majority consensus to approve a submission that fails the R

== Appendix: Benchmark Specific Rules [[benchmark_specific_rules]]

* Node Classification
** Timed region: Graph and feature loading, training, and evaluation are all timed. Graph partitioning for multi-node runs is not timed.
** Node features are in fp32 in the dataset, but lower precisions are allowed. Feature precision can be converted offline.
** Any sparse format may be used for storing the graph. Offline conversion is allowed.
** Graph partitioning algorithm and locality:
*** Any general non-data-aware partitioning algorithm may be used, provided it is reproducible, either by using a fixed seed or by being deterministic.
*** Each graph node’s features may be read from disk by exactly one training node. Other training nodes that need those features must fetch them over the network.
** Caching: Graph caching is allowed, but feature caching is not allowed.
** Sampler: Submitters are not expected to exactly match reference sampler implementation due to known framework differences, but must meet RCP criteria.
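
One way to meet the partitioning constraints in the Node Classification rules above is a seeded random assignment, sketched here. This is only an illustration under stated assumptions: the function name `partition_nodes` is hypothetical, and submitters may use any other general, reproducible, non-data-aware algorithm.

```python
import random


def partition_nodes(num_nodes, num_partitions, seed=0):
    """Assign each graph node to exactly one partition, reproducibly.

    The assignment depends only on the node count, the partition count,
    and the seed, never on node features or labels (non-data-aware).
    Returns a list where owner[node_id] is the owning partition.
    """
    rng = random.Random(seed)  # fixed seed makes the shuffle reproducible
    node_ids = list(range(num_nodes))
    rng.shuffle(node_ids)
    owner = [0] * num_nodes
    for position, node_id in enumerate(node_ids):
        # Round-robin over the shuffled order keeps partitions balanced.
        owner[node_id] = position % num_partitions
    return owner
```

In this sketch, `owner[node_id]` names the single training node that would read that node's features from disk. Any other training node would request them over the network.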

* Stable Diffusion
** 10 runs per submission
** Checkpoint must be collected every 512,000 images. CEIL(512000 / global_batch_size) if 512000 is not divisible by GBS.
@@ -718,4 +750,4 @@ MLPerf recommends calculating _utilization_ as `model_tensor_flops / (peak_syste

Use of `hardware_tensor_flops` (defined as model_tensor_flops plus operations added due to activation recomputation), instead of `model_tensor_flops` is strongly discouraged because those are not useful flops for the model. If `hardware_tensor_flops` are used for calculating utilization, it is recommended to also provide an accompanying calculation with `model_tensor_flops`.

Note _utilization_ is not an official MLPerf metric.