Merge pull request #536 from itayhubara/update_rules_for_llama_70b_lora
adding first rules draft for llama2_70b_lora
nv-rborkar authored Apr 5, 2024
2 parents 4e65e34 + 0346e20 commit 03320d0
Showing 1 changed file with 15 additions and 1 deletion.
16 changes: 15 additions & 1 deletion training_rules.adoc
@@ -142,6 +142,7 @@ The closed division models and quality targets are:
|Language | Speech recognition | RNN-T | 0.058 Word Error Rate
| |NLP |BERT |0.720 Mask-LM accuracy
| |Large Language Model |GPT3 |2.69 log perplexity
| |Large Language Model |Llama2-70B-LoRA |0.925 eval loss
|Commerce |Recommendation |DLRMv2 (DCNv2) |0.80275 AUC
|Graphs | Node classification|R-GAT | 72.0 % classification
|===
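The quality target above is a threshold on evaluation loss rather than a score to maximize. A minimal sketch of how a run loop might test it is below; `run_eval` is a hypothetical stand-in for the benchmark's evaluation pass, not part of the rules.

[source,python]
----
# Hedged sketch: checking the Llama2-70B-LoRA quality target from the table above.
TARGET_EVAL_LOSS = 0.925

def has_converged(run_eval) -> bool:
    # run_eval() is a hypothetical callable returning the current eval loss.
    eval_loss = run_eval()
    return eval_loss <= TARGET_EVAL_LOSS
----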
@@ -302,6 +303,16 @@ The MLPerf verifier script checks all hyperparameters except those with names m
|gpt3 |adam |opt_init_checkpoint_step |ceil(4000 * 1536 / batch_size) |first step after loading initial checkpoint |See PR (From NV and Google, TODO Link)
|gpt3 |adam |opt_base_learning_rate |constrained based on global_batch_size |refer to next table in section "GPT3 learning rates" |See PR (From NV and Google, TODO Link)
|gpt3 |adam |opt_end_learning_rate |10% of opt_base_learning_rate |learning rate at the last step of decay period |See PR (From NV and Google, TODO Link)
|llama2_70b_lora |adamw |global_batch_size |unconstrained |batch size in sequences |See PR (From NV and Habana, TODO Link)
|llama2_70b_lora |adamw |opt_gradient_clip_norm |fixed to reference (0.3) |gradients are clipped above this norm threshold |See PR (From Habana, TODO Link)
|llama2_70b_lora |adamw |lora_dropout |fixed to reference (0.1) |dropout probability applied to the LoRA layers |See PR (From Habana, TODO Link)
|llama2_70b_lora |adamw |sequence_length |fixed to reference (8196) |sequence length in tokens |See PR (From Habana, TODO Link)
|llama2_70b_lora |adamw |lora_alpha |fixed to reference (32) |scaling factor for the LoRA weight matrices |See PR (From Habana, TODO Link)
|llama2_70b_lora |adamw |opt_weight_decay |fixed to reference (0.0001) |weight decay |See PR (From Habana, TODO Link)
|llama2_70b_lora |adamw |gradient_accumulation_steps |unconstrained |number of fwd/bwd passes between optimizer steps |See PR (From Habana, TODO Link)
|llama2_70b_lora |adamw |opt_learning_rate_warmup_ratio |unconstrained |ratio of training steps used for linear warmup during initial checkpoint generation; this only affects the learning rate curve in the benchmarking region |See PR (From Habana, TODO Link)
|llama2_70b_lora |adamw |opt_learning_rate_training_steps |unconstrained |step at which the end of the cosine learning rate curve is reached; the cosine decay spans the range (opt_learning_rate_warmup_steps + 1, opt_learning_rate_decay_steps] |See PR (From Habana, TODO Link)
|llama2_70b_lora |adamw |opt_base_learning_rate |unconstrained |base learning rate |See PR (From Habana, TODO Link)
|maskrcnn |sgd |global_batch_size |arbitrary constant |global version of reference SOLVER.IMS_PER_BATCH |link:https://github.com/mlperf/training/blob/00570abf77d351e474d57830014f6a3e501dece1/object_detection/pytorch/maskrcnn_benchmark/data/build.py#L112[reference code]
|maskrcnn |sgd |opt_learning_rate_decay_factor$$*$$ |fixed to reference (0.1) |learning rate decay factor |link:https://github.com/mlperf/training/blob/00570abf77d351e474d57830014f6a3e501dece1/object_detection/pytorch/maskrcnn_benchmark/solver/build.py#L13[reference code]
|maskrcnn |sgd |opt_learning_rate_decay_steps$$*$$ |(60000, 80000) * (1 + K / 10) * 16 / global_batch_size where K is integer |Steps at which learning rate is decayed |link:https://github.com/mlperf/training/blob/00570abf77d351e474d57830014f6a3e501dece1/object_detection/pytorch/maskrcnn_benchmark/solver/build.py#L26[reference code]
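For orientation, the sketch below shows one way the fixed llama2_70b_lora hyperparameters above could be wired into a PyTorch / Hugging Face PEFT setup. Only the values marked "rule" come from the table; the LoRA rank, target modules, learning rate, and step counts are assumptions chosen for illustration, not requirements of these rules.

[source,python]
----
# Hedged sketch, not the reference implementation.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-hf")
lora_cfg = LoraConfig(
    r=16,                                 # assumption: rank is not fixed by the table
    lora_alpha=32,                        # rule: fixed to reference (32)
    lora_dropout=0.1,                     # rule: fixed to reference (0.1)
    target_modules=["q_proj", "v_proj"],  # assumption
)
model = get_peft_model(model, lora_cfg)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=4e-4,                              # opt_base_learning_rate: unconstrained (value assumed)
    weight_decay=0.0001,                  # rule: fixed to reference (0.0001)
)
total_steps, warmup_ratio = 1024, 0.05    # both unconstrained; values assumed
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(warmup_ratio * total_steps),
    num_training_steps=total_steps,       # opt_learning_rate_training_steps
)

# Inside the training loop, gradients would be clipped at the fixed norm of 0.3:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.3)
----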
@@ -451,6 +462,7 @@ CLOSED: The same quality measure as the reference implementation must be used. T
|Language|Speech recognition |RNN-T|Every 1 epoch
| |NLP |BERT| eval_interval_samples=FLOOR(0.05*(230.23*GBS+3000000), 25000), skipping 0
| |Large Language Model |GPT3| Every 24576 sequences. CEIL(24576 / global_batch_size) steps if 24576 is not divisible by GBS
| |Large Language Model |Llama2_70B_LoRA| Every 384 sequences, i.e. CEIL(384 / global_batch_size) steps if 384 is not divisible by GBS, skipping the first 3 evaluations
|Commerce|Recommendation |DLRMv2 (DCNv2)|Every FLOOR(TOTAL_TRAINING_SAMPLES / (GLOBAL_BATCH_SIZE * NUM_EVAL)) samples, where TOTAL_TRAINING_SAMPLES = 4195197692 and NUM_EVAL = 20
|Graphs|Node classification|R-GAT|Evaluate 20 times per epoch
|===
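The per-step eval cadence for the language models above follows directly from the global batch size. A small illustrative helper, with the batch sizes in the example calls assumed rather than prescribed:

[source,python]
----
# Hedged sketch: eval intervals (in optimizer steps) implied by the table above.
import math

def gpt3_eval_interval_steps(gbs: int) -> int:
    # Every 24576 sequences; round up when 24576 is not divisible by GBS.
    return math.ceil(24576 / gbs)

def llama2_70b_lora_eval_interval_steps(gbs: int) -> int:
    # Every 384 sequences; per the table, the first 3 evaluations are skipped.
    return math.ceil(384 / gbs)

print(llama2_70b_lora_eval_interval_steps(8))    # -> 48 steps between evals
print(llama2_70b_lora_eval_interval_steps(144))  # -> 3 steps between evals
----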
@@ -504,6 +516,7 @@ Each benchmark result is based on a set of run results. The number of results fo
|Language |NLP |10
| |Speech recognition |10
| |Large language model |3
| |Large language model fine-tuning (LoRA) |10
|Commerce |Recommendation |10
|Graphs|Node classification|10
|===
@@ -567,6 +580,7 @@ To extract submission convergence points, logs should report epochs as follows.
| RN50 | Epoch
| BERT | Training sample (integer)
| GPT3 | Training token starting from 0 (integer)
| Llama2_70B_LoRA | Training sample (integer)
| DLRMv2 (DCNv2) | Training iteration as the fraction of a total number of iterations for one epoch (0.05, 0.1, 0.15, ..., 1.0)
| Stable-Diffusion | Training sample (integer)
| SSD (RetinaNet) | Epoch
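As a hedged illustration of the reporting unit above, a Llama2_70B_LoRA submission would log its convergence points against a cumulative training-sample counter. The sketch below uses the `mlperf_logging` package; the specific key, value, and sample count are assumptions for illustration only.

[source,python]
----
# Hedged sketch: reporting an eval point with training samples as the epoch unit.
from mlperf_logging import mllog

mllogger = mllog.get_mllogger()
samples_seen = 3072  # assumption: cumulative training samples at this eval point
mllogger.event(
    key=mllog.constants.EVAL_ACCURACY,
    value=0.93,                            # assumption: measured eval loss
    metadata={"epoch_num": samples_seen},  # epoch reported as an integer sample count
)
----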
@@ -736,4 +750,4 @@ MLPerf recommends calculating _utilization_ as `model_tensor_flops / (peak_syste

Use of `hardware_tensor_flops` (defined as model_tensor_flops plus operations added due to activation recomputation), instead of `model_tensor_flops` is strongly discouraged because those are not useful flops for the model. If `hardware_tensor_flops` are used for calculating utilization, it is recommended to also provide an accompanying calculation with `model_tensor_flops`.
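A minimal worked example of the recommended calculation, with all numbers below being placeholders rather than measured values:

[source,python]
----
# Hedged sketch of the recommended utilization formula.
def utilization(model_tensor_flops: float,
                peak_system_tensor_flops_per_second: float,
                runtime_in_seconds: float) -> float:
    # model_tensor_flops deliberately excludes activation-recomputation flops.
    return model_tensor_flops / (peak_system_tensor_flops_per_second * runtime_in_seconds)

# Placeholder numbers: 1e21 model flops on a 1e18 FLOP/s system over 2000 s.
print(utilization(1.0e21, 1.0e18, 2000.0))  # -> 0.5
----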

Note _utilization_ is not an official MLPerf metric.
