[NVIDIA] Added transformer engine support and GPU optimizations #1385

terrykong · 2023-08-26T23:21:08Z

Added Transformer Engine + FP8 support
Updated T5x and jax version=0.4.11
A100 Perf gains!
- 80% speedup - T5-small
- 23% speedup - T5-large
- 18% speedup - T5-xl
- 40% speedup - T5-xxl
H100 support, with gains over A100
- 2.08x faster - T5-large
- 2.24x faster - T5-xl
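
The A100 numbers above are given as percentages while the H100 numbers are given as multipliers, so the two lists are not directly comparable at a glance. The sketch below puts them on one scale. It assumes "N% speedup" means throughput rose by N% over a baseline (the description does not state the baseline or the metric), and the helper names are hypothetical, not part of T5X or Transformer Engine.

```python
# Hypothetical helpers; the "N% speedup" interpretation is an assumption.

def percent_speedup_to_factor(percent: float) -> float:
    """Convert an 'N% speedup' figure to a multiplier (assumes N% more throughput)."""
    return 1.0 + percent / 100.0


def factor_to_percent_speedup(factor: float) -> float:
    """Convert an 'Fx faster' multiplier to a percentage gain."""
    return (factor - 1.0) * 100.0


# Figures copied from the PR description.
a100_gains_percent = {"T5-small": 80, "T5-large": 23, "T5-xl": 18, "T5-xxl": 40}
h100_over_a100_factor = {"T5-large": 2.08, "T5-xl": 2.24}

for model, pct in a100_gains_percent.items():
    print(f"{model}: {percent_speedup_to_factor(pct):.2f}x on A100")

for model, factor in h100_over_a100_factor.items():
    print(f"{model}: {factor_to_percent_speedup(factor):.0f}% gain, H100 over A100")
```

If "speedup" instead refers to a reduction in step time, the conversion is different, so treat the output as a reading aid only.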

google-cla · 2023-08-26T23:21:15Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Co-authored-by: Sahil Jain <sahilj@nvidia.com>
Co-authored-by: Terry Kong <terryk@nvidia.com>
Co-authored-by: Yu-Hang Tang <yuhangt@nvidia.com>
Co-authored-by: Ming Huang <mingh@nvidia.com>
Co-authored-by: Frederic Bastien <fbastien@nvidia.com>
Co-authored-by: Sharath Turuvekere Sreenivas <sharatht@nvidia.com>
Co-authored-by: Xiaowei Ren <xren@nvidia.com>
Co-authored-by: Ryan Jeng <rjeng@nvidia.com>
Co-authored-by: Reese Wang <rewang@nvidia.com>


jon-chuang · 2023-09-06T09:16:18Z

Hello, out of curiosity (while I understand it may not be tested), would this in theory be able to support training/fine-tuning for models built on top of t5x like Flan-UL2?

I guess yes, as it is simply a t5x model with specific config?

terrykong · 2023-09-06T18:03:45Z

@jon-chuang Yes, I believe that's correct given my understanding of the followup architectures to T5: UL2/Flan-T5/Flan-UL2. As long as the core model is the same and only the objective/inputs&targets change, those finetunings should also benefit.

terrykong · 2023-09-15T04:21:38Z

Closing in favor of #1391

terrykong and others added 7 commits August 26, 2023 17:07

UNINSTALL_TE in fine-tuning scripts now defaults to no-action

336d640

remove use_gda from LegacyCheckpointManager in train.py for fp8

462b9fb

Allow singlenode scripts to tee to stdout for better indication of training status

ad28880

Explicit specify self_attn_mask_type

d250182

Disables check for packing by the te_helper util since not all dataset configs use packing (CV/Multimodal)

985ff36

Corrected T5x large baselines

Updated T5x-large MNLI and SQUAD baselines

1fa57af

terrykong force-pushed the patch/t5x_te_in_contrib_noindent branch from 80ae059 to 1fa57af on August 27, 2023 00:08

terrykong closed this Sep 15, 2023
