[NVIDIA] Added transformer engine support and GPU optimizations #1391

terrykong · 2023-09-08T22:17:02Z

Added Transformer Engine + FP8 support
Updated T5x and JAX (version 0.4.11)
A100 Perf gains!
- 80% speedup - T5-small
- 23% speedup - T5-large
- 18% speedup - T5-xl
- 40% speedup - T5-xxl
H100 support, with gains over A100
- 2.08x faster - T5-large
- 2.24x faster - T5-xl
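
The figures above use two notations: percent speedups for A100 and "Nx faster" ratios for H100 over A100. Below is a minimal sketch for putting the A100 numbers on the same ratio scale. It assumes (the PR does not state this) that an "N% speedup" means throughput multiplied by 1 + N/100. The helper name `percent_to_ratio` is illustrative only.

```python
def percent_to_ratio(percent_speedup: float) -> float:
    """Convert an 'N% speedup' into a ratio, under the assumed meaning 1 + N/100."""
    return 1.0 + percent_speedup / 100.0


# A100 figures quoted in the PR description, as percent speedups.
a100_percent = {"T5-small": 80, "T5-large": 23, "T5-xl": 18, "T5-xxl": 40}

# Same figures expressed as ratios, comparable in form to the H100 "Nx faster" numbers.
a100_ratio = {name: percent_to_ratio(p) for name, p in a100_percent.items()}

if __name__ == "__main__":
    for name, ratio in a100_ratio.items():
        print(f"{name}: {ratio:.2f}x")
```

The H100 ratios are stated relative to A100, so they are not combined with the A100 figures here; the baseline for each comparison is not spelled out in the description.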

google-cla · 2023-09-08T22:17:09Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Co-authored-by: Sahil Jain <sahilj@nvidia.com>
Co-authored-by: Terry Kong <terryk@nvidia.com>
Co-authored-by: Yu-Hang Tang <yuhangt@nvidia.com>
Co-authored-by: Ming Huang <mingh@nvidia.com>
Co-authored-by: Frederic Bastien <fbastien@nvidia.com>
Co-authored-by: Sharath Turuvekere Sreenivas <sharatht@nvidia.com>
Co-authored-by: Xiaowei Ren <xren@nvidia.com>
Co-authored-by: Ryan Jeng <rjeng@nvidia.com>
Co-authored-by: Reese Wang <rewang@nvidia.com>

…e_fp8 (#8)

* Allows ENABLE_TE env var to control whether TE code path is invoked
* Changes enabled -> enable_fp8 to be more consistent with PAX and avoid confusion with ENABLE_TE
* Remove UNINSTALL_TE logic since it is no longer required

---------

Co-authored-by: NVIDIA <jax@nvidia.com>
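
As a rough illustration of the first point in the commit message above, here is a minimal sketch of gating a code path on an `ENABLE_TE` environment variable. It is not the code from this PR. The helper names `env_flag` and `select_code_path`, the set of accepted truthy strings, and the default of "off" are all assumptions made for the example.

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable as an on/off switch (hypothetical helper)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    # Accepted truthy spellings are an assumption for this sketch.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def select_code_path() -> str:
    """Pick which implementation to run based on ENABLE_TE."""
    # ENABLE_TE decides whether the Transformer Engine path is taken at all.
    # A setting such as enable_fp8 would be a separate choice inside that path.
    return "transformer_engine" if env_flag("ENABLE_TE") else "baseline"


if __name__ == "__main__":
    print(select_code_path())
```

Keeping the switch for the whole TE path separate from the FP8 setting matches the commit's stated goal of avoiding confusion between `ENABLE_TE` and `enable_fp8`.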

… of input variables (#9)

* Update multiprocess scripts
* No longer need UNINSTALL_TE
* Removes slurm scripts as the source of truth has moved to rosetta
* Adds "Finished" message to multiprocess scripts
* Remove BENCHMARK_ARGS which is no longer used
* Fix typo in BENCHMARK_MODE and straggling if keyword
* Address comments

Signed-off-by: Reese Wang <rewang@nvidia.com>

terrykong mentioned this pull request Sep 15, 2023

[NVIDIA] Added transformer engine support and GPU optimizations #1385

Closed

terrykong force-pushed the patch/t5x_te_in_contrib_noindent branch 2 times, most recently from c82b280 to 443df5e on October 9, 2023 20:23

ashors1 force-pushed the patch/t5x_te_in_contrib_noindent branch from 4346418 to d08a684 on November 28, 2023 06:43

olupton mentioned this pull request Jan 4, 2024

T5X rosetta nightlies are broken NVIDIA/JAX-Toolbox#448

Closed

ashors1 force-pushed the patch/t5x_te_in_contrib_noindent branch from 74d742f to 79bf053 on January 8, 2024 21:38

terrykong force-pushed the patch/t5x_te_in_contrib_noindent branch from 79bf053 to 3ca8e34 on February 21, 2024 17:31

terrykong and others added 18 commits March 8, 2024 09:49

UNINSTALL_TE in fine-tuning scripts now defaults to no-action

dcbbb37

remove use_gda from LegacyCheckpointManager in train.py for fp8

db6fc55

Allow singlenode scripts to tee to stdout for better indication of training status

a39a08e

Explicitly specify self_attn_mask_type

39e637f

Disables check for packing by the te_helper util since not all dataset configs use packing (CV/Multimodal)

d016f83

Corrected T5x large baselines

83a2b20

Updated T5x-large MNLI and SQUAD baselines

Add t5-large FP8 logs

5944f07

Fix missing fp8_meta_collection in the eval stage.

2d2fbe8

Remove redundant code.

4a86f76

Fix deprecation warning about TE.

7b878db

Updates TE api from te.extend_* to te.flax.extend_* (#7)

4c60477

Co-authored-by: NVIDIA <jax@nvidia.com>

Adapting to TE/JAX/Custom_partitioning.

4abe3e5

Running Partitioner.compile within Mesh context-manager

bfa6313

Force initial flax mutables to be a frozen dict (#11)

189868b

update rng dtype in predict_batch

06be7c2

terrykong force-pushed the patch/t5x_te_in_contrib_noindent branch from 3ca8e34 to 06be7c2 on March 8, 2024 17:57

Change decoder attn mask type to padding_causal

339b034

Signed-off-by: Reese Wang <rewang@nvidia.com>
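
The last commit above changes the decoder attention mask type to padding_causal. As a rough illustration of what such a mask expresses, the sketch below combines a causal constraint with a padding constraint for one sequence. It is not Transformer Engine's implementation. The function name `padding_causal_mask`, the `valid_len` parameter, and the convention that True means "may attend" are all assumptions made for this example.

```python
from typing import List


def padding_causal_mask(seq_len: int, valid_len: int) -> List[List[bool]]:
    """Build a seq_len x seq_len mask; mask[q][k] is True when query q may attend to key k.

    Positions at index >= valid_len are treated as padding (illustrative convention).
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            causal_ok = k <= q                             # no attending to future positions
            not_padding = q < valid_len and k < valid_len  # both positions hold real tokens
            row.append(causal_ok and not_padding)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Print the mask for a length-4 sequence whose last position is padding.
    for row in padding_causal_mask(seq_len=4, valid_len=3):
        print("".join("x" if allowed else "." for allowed in row))
```

With `seq_len=4` and `valid_len=3`, the first three rows form a lower triangle and the padded last row and last column are fully masked, which is the combination a plain causal mask alone would not provide.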
