diff --git a/training_rules.adoc b/training_rules.adoc
index 27c0508..176de33 100644
--- a/training_rules.adoc
+++ b/training_rules.adoc
@@ -84,6 +84,7 @@ The benchmark suite consists of the benchmarks shown in the following table.
 | |NLP |Wikipedia 2020/01/01
 | |Large language model |c4/en/3.0.1
 |Commerce |Recommendation |Criteo 1TB Click Logs (multi-hot variant)
+|Graphs |Node classification |IGBH-Full
 |===
 
 MLCommons provides a reference implementation of each benchmark, which includes the following elements:
@@ -141,7 +142,9 @@ The closed division models and quality targets are:
 |Language | Speech recognition | RNN-T | 0.058 Word Error Rate
 | |NLP |BERT |0.720 Mask-LM accuracy
 | |Large Language Model |GPT3 |2.69 log perplexity
+| |Large Language Model |Llama2-70B-LoRA |0.925 Eval loss
 |Commerce |Recommendation |DLRMv2 (DCNv2) |0.80275 AUC
+|Graphs |Node classification |R-GAT |72.0% classification accuracy
 |===
 
 Closed division benchmarks must be referred to using the benchmark name plus the term Closed, e.g. “for the Image Classification Closed benchmark, the system achieved a result of 7.2.”
@@ -300,6 +303,16 @@ The MLPerf verifier scripts checks all hyperparameters except those with names m
 |gpt3 |adam |opt_init_checkpoint_step |ceil(4000 * 1536 / batch_size) |first step after loading initial checkpoint |See PR (From NV and Google, TODO Link)
 |gpt3 |adam |opt_base_learning_rate |constrained based on global_batch_size |refer to next table in section "GPT3 learning rates" |See PR (From NV and Google, TODO Link)
 |gpt3 |adam |opt_end_learning_rate |10% of opt_base_learning_rate |learning rate at the last step of decay period |See PR (From NV and Google, TODO Link)
+ |llama2_70b_lora |adamw |global_batch_size |unconstrained |batch size in sequences |See PR (From NV and Habana, TODO Link)
+ |llama2_70b_lora |adamw |opt_gradient_clip_norm |fixed to reference (0.3) |Gradients are clipped above this norm threshold. |See PR (From Habana, TODO Link)
+ |llama2_70b_lora |adamw |lora_dropout |fixed to reference (0.1) |LoRA dropout probability |See PR (From Habana, TODO Link)
+ |llama2_70b_lora |adamw |sequence_length |fixed to reference (8196) |sequence length |See PR (From Habana, TODO Link)
+ |llama2_70b_lora |adamw |lora_alpha |fixed to reference (32) |scaling factor for the LoRA weight matrices |See PR (From Habana, TODO Link)
+ |llama2_70b_lora |adamw |opt_weight_decay |fixed to reference (0.0001) |weight decay |See PR (From Habana, TODO Link)
+ |llama2_70b_lora |adamw |gradient_accumulation_steps |unconstrained |Number of fwd/bwd steps between optimizer steps. |See PR (From Habana, TODO Link)
+ |llama2_70b_lora |adamw |opt_learning_rate_warmup_ratio |unconstrained |Ratio of warmup steps to total training steps for linear warmup during initial checkpoint generation. This only affects the learning rate curve in the benchmarking region. |See PR (From Habana, TODO Link)
+ |llama2_70b_lora |adamw |opt_learning_rate_training_steps |unconstrained |Step at which the end of the cosine learning rate curve is reached. Learning rate cosine decay is in the range (opt_learning_rate_warmup_steps + 1, opt_learning_rate_decay_steps]. |See PR (From Habana, TODO Link)
+ |llama2_70b_lora |adamw |opt_base_learning_rate |unconstrained |base learning rate |See PR (From Habana, TODO Link)
 |maskrcnn |sgd |global_batch_size |arbitrary constant |global version of reference SOLVER.IMS_PER_BATCH |link:https://github.com/mlperf/training/blob/00570abf77d351e474d57830014f6a3e501dece1/object_detection/pytorch/maskrcnn_benchmark/data/build.py#L112[reference code]
 |maskrcnn |sgd |opt_learning_rate_decay_factor$$*$$ |fixed to reference (0.1) |learning rate decay factor |link:https://github.com/mlperf/training/blob/00570abf77d351e474d57830014f6a3e501dece1/object_detection/pytorch/maskrcnn_benchmark/solver/build.py#L13[reference code]
 |maskrcnn |sgd |opt_learning_rate_decay_steps$$*$$ |(60000, 80000) * (1 + K / 10) * 16 / global_batch_size where K is integer |Steps at which learning rate is decayed |link:https://github.com/mlperf/training/blob/00570abf77d351e474d57830014f6a3e501dece1/object_detection/pytorch/maskrcnn_benchmark/solver/build.py#L26[reference code]
@@ -388,9 +401,12 @@ The MLPerf verifier scripts checks all hyperparameters except those with names m
 |unet3d |sgd |evaluation_input_shape |fixed to reference |evaluation input shape |reference --val_input_shape
 |unet3d |sgd |data_train_samples |fixed to reference |number of training samples | N/A
 |unet3d |sgd |data_eval_samples |fixed to reference |number of evaluation samples | N/A
+ |gnn |adam |global_batch_size |arbitrary constant |global batch size |link:https://github.com/alibaba/graphlearn-for-pytorch/blob/main/examples/igbh/train_rgnn_multi_gpu.py#L293[reference code]
+ |gnn |adam |opt_base_learning_rate |unconstrained |base learning rate |link:https://github.com/alibaba/graphlearn-for-pytorch/blob/main/examples/igbh/train_rgnn_multi_gpu.py#L296[reference code]
 |===
 
-OPEN: Hyperparameters and optimizer may be freely changed.
+OPEN: Hyperparameters and optimizer may be freely changed.
+
 
 ==== GPT3 hyperparameter constraints
 
@@ -446,7 +462,9 @@ CLOSED: The same quality measure as the reference implementation must be used. T
 |Language|Speech recognition |RNN-T|Every 1 epoch
 | |NLP |BERT| eval_interval_samples=FLOOR(0.05*(230.23*GBS+3000000), 25000), skipping 0
 | |large Language Model |GPT3| Every 24576 sequences. CEIL(24576 / global_batch_size) if 24576 is not divisible by GBS
+| |large Language Model |Llama2_70B_LoRA| Every 384 sequences, CEIL(384 / global_batch_size) steps if 384 is not divisible by GBS, skipping the first 3 evaluations
 |Commerce|Recommendation |DLRMv2 (DCNv2)|Every FLOOR(TOTAL_TRAINING_SAMPLES / (GLOBAL_BATCH_SIZE * NUM_EVAL) samples, where TOTAL_TRAINING_SAMPLES = 4195197692 and NUM_EVAL = 20
+|Graphs|Node classification|R-GAT|Evaluate 20 times per epoch
 |===
 
 OPEN: An arbitrary stopping criteria may be used, including but not limited to the closed quality measure, a different quality measure, the number of epochs, or a fixed time. However, the reported results must include the geometric mean of the final quality as measured by the closed quality measure.
@@ -498,7 +516,9 @@ Each benchmark result is based on a set of run results. The number of results fo
 |Language |NLP |10
 | |Speech recognition |10
 | |Large language model |3
+| |Large language model Fine Tune (LoRA) |10
 |Commerce |Recommendation |10
+|Graphs |Node classification |10
 |===
 
 Each benchmark result is computed by dropping the fastest and slowest runs, then taking the mean of the remaining times. For this purpose, a single non-converging run may be treated as the slowest run and dropped. A benchmark result is invalid if there is more than one non-converging run.
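+
+For illustration only, a minimal sketch of this scoring rule (an assumed helper, not part of the reference implementations or the compliance tooling):
+
+[source,python]
+----
+def benchmark_result(run_times, converged):
+    """run_times: wall-clock time of each run; converged: whether each run reached the quality target."""
+    failures = converged.count(False)
+    if failures > 1:
+        return None  # invalid result: more than one non-converging run
+    times = sorted(t for t, ok in zip(run_times, converged) if ok)
+    if failures == 0:
+        times = times[:-1]  # drop the slowest run
+    # A single non-converging run counts as the slowest run and is already excluded above.
+    kept = times[1:]  # drop the fastest run
+    return sum(kept) / len(kept)
+----
+
+For example, with 10 runs of which exactly one fails to converge, the reported result is the mean of the 8 converging runs that remain after the fastest converging run is dropped.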
@@ -560,12 +580,14 @@ To extract submission convergence points, logs should report epochs as follows.
 | RN50 | Epoch
 | BERT | Training sample (integer)
 | GPT3 | Training token starting from 0 (integer)
+| Llama2_70B_LoRA | Training sample (integer)
 | DLRMv2 (DCNv2) | Training iteration as the fraction of a total number of iterations for one epoch (0.05, 0.1, 0.15, ..., 1.0)
 | Stable-Diffusion | Training sample (integer)
 | SSD (RetinaNet) | Epoch
 | Mask-RCNN | Epoch
 | RNN-T | Epoch
 | UNET3D | Epoch
+| R-GAT | Training iteration as the fraction of a total number of iterations for one epoch (0.05, 0.1, 0.15, ..., 1.0)
 |===
 
 === Handling RCP Failures
@@ -591,6 +613,16 @@ The SWG must come to majority consensus to approve a submission that fails the R
 == Appendix: Benchmark Specific Rules
 [[benchmark_specific_rules]]
 
+* Node Classification
+** Timed region: Graph and feature loading, training, and evaluation are all timed. Graph partitioning for multi-node runs is not timed.
+** Node features are in fp32 in the dataset, but lower precisions are allowed. Feature precision can be converted offline.
+** Any sparse format may be used for storing the graph. Offline conversion is allowed.
+** Graph partitioning algorithm and locality:
+*** Any general non-data-aware partitioning algorithm that is reproducible, either using a fixed seed or a deterministic algorithm, is allowed.
+*** We require that each graph node’s feature can only be read from disk on one exclusive training node. Other training nodes that need this graph node’s feature should fetch it over the network.
+** Caching: Graph caching is allowed, but feature caching is not allowed.
+** Sampler: Submitters are not expected to exactly match the reference sampler implementation due to known framework differences, but must meet the RCP criteria.
+
 * Stable Diffusion
 ** 10 runs per submission
 ** Checkpoint must be collected every 512,000 images. CEIL(512000 / global_batch_size) if 512000 is not divisible by GBS.
@@ -718,4 +750,4 @@ MLPerf recommends calculating _utilization_ as `model_tensor_flops / (peak_syste
 
 Use of `hardware_tensor_flops` (defined as model_tensor_flops plus operations added due to activation recomputation), instead of `model_tensor_flops` is strongly discouraged because those are not useful flops for the model. If `hardware_tensor_flops` are used for calculating utilization, it is recommended to also provide an accompanying calculation with `model_tensor_flops`.
 
-Note _utilization_ is not an official MLPerf metric.
+Note _utilization_ is not an official MLPerf metric.
\ No newline at end of file
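
As a worked illustration of the recommended _utilization_ calculation (assuming the truncated expression above is `model_tensor_flops / (peak_system_tensor_flops_per_second * run_time)`), a short sketch follows; every figure in it is hypothetical rather than taken from any submission:

[source,python]
----
# Hypothetical figures, for illustration only.
model_tensor_flops = 1.5e20          # tensor FLOPs the model requires for the benchmarked run
peak_flops_per_second = 8 * 989e12   # e.g. 8 accelerators with a (made-up) 989 TFLOPS peak each
run_time_seconds = 12 * 3600         # a 12-hour run

utilization = model_tensor_flops / (peak_flops_per_second * run_time_seconds)
print(f"utilization = {utilization:.1%}")  # about 44% with these made-up numbers
----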