Compute-intensive graph codegen
NV GPU
On the CUDA platform, we now support fusing a GEMM with its following element-wise ops (e.g. GELU, transpose) and generating code for the fused pattern based on CUTLASS. Experiments show that this feature achieves up to 1.1x speedup on the BERT model.
Please set `DISC_ENABLE_COMPUTE_INTENSIVE_FUSE=true` if you want to try this feature.
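A minimal sketch of trying this on a GEMM + GELU pattern, assuming the `torch_blade.optimize` entry point from the BladeDISC quickstart; the toy model and input shapes are only illustrative:

```python
import os

# The flag is read when BladeDISC compiles the model, so set it before
# compilation is triggered (exporting it in the shell also works).
os.environ["DISC_ENABLE_COMPUTE_INTENSIVE_FUSE"] = "true"

import torch
import torch_blade  # BladeDISC's PyTorch frontend

# A tiny GEMM + GELU pattern, the kind of subgraph this fusion targets.
model = torch.nn.Sequential(
    torch.nn.Linear(768, 3072),
    torch.nn.GELU(),
).cuda().eval()
example_input = torch.randn(128, 768, device="cuda")

with torch.no_grad():
    # Compile with BladeDISC; the GEMM and its GELU epilogue can now be
    # fused and lowered to a CUTLASS-based kernel.
    opt_model = torch_blade.optimize(model, allow_tracing=True,
                                     model_inputs=(example_input,))
    print(opt_model(example_input).shape)
```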
AArch64
We introduced the MLIR Transform Dialect into BladeDISC to directly generate code for compute-intensive subgraphs. The integration is complete and has been preliminarily verified on the AArch64 platform. The figure below shows some preliminary experimental results. More details can be found here and here.
Please set `DISC_ENABLE_TRANSFORM_SCHEDULE=true` if you want to try this feature on AArch64.
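Enabling it follows the same pattern as the CUDA flag above; a minimal sketch, again assuming the `torch_blade.optimize` entry point and an illustrative toy model:

```python
import os

# Like the CUDA flag above, set it before compilation is triggered.
os.environ["DISC_ENABLE_TRANSFORM_SCHEDULE"] = "true"

import torch
import torch_blade

model = torch.nn.Linear(512, 512).eval()   # CPU model on an AArch64 host
example_input = torch.randn(32, 512)

with torch.no_grad():
    opt_model = torch_blade.optimize(model, allow_tracing=True,
                                     model_inputs=(example_input,))
```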
CPU benchmark
We previously integrated PyTorch Benchmark into our CI flow for the CUDA platform. In this release, we further ran it on Ice Lake (x86) and Yitian 710 (AArch64) to track the optimization quality and robustness of BladeDISC across various models. See the following summary reports:
We also added an end-to-end example showing how to perform BladeDISC quantization optimization on a PyTorch model. See the following summary reports:
PyTorch training
In PyTorch 2.0, BladeDISC can be easily integrated as one of the compiler backends to accelerate the execution of PyTorch programs. Users only need a one-line code change to enable BladeDISC: torch.compile(backend='aot_disc'). BladeDISC achieves 1.2x speedup on the BERT model. For the Stable Diffusion fine-tune task, its performance slightly exceeds eager mode. We are still actively improving the performance.
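A minimal sketch of a training step with the `aot_disc` backend; the toy model, optimizer, and data are placeholders, and depending on the installation an `import torch_blade` may be needed to register the backend:

```python
import torch
import torch_blade  # may be needed to register the 'aot_disc' backend

# Toy model and data; any nn.Module works the same way.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.GELU(),
    torch.nn.Linear(256, 2),
)

# The one-line change: choose BladeDISC as the torch.compile backend.
compiled_model = torch.compile(model, backend="aot_disc")

optimizer = torch.optim.AdamW(compiled_model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

x = torch.randn(32, 128)
y = torch.randint(0, 2, (32,))

# Standard training step; gradients flow through the compiled module.
optimizer.zero_grad()
loss = loss_fn(compiled_model(x), y)
loss.backward()
optimizer.step()
```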
Feature Column CodeGen
We have added automatic fusion and codegen support for FeatureColumn graphs in TensorFlow, which are widely used in recommendation systems. Specifically, we support lowering of the most common sparse ops in FeatureColumn graphs (e.g. Where, SparseReshape, SparseFillEmptyRows and SparseSegmentReduction), as well as fusion optimizations for Where and SparseSegmentReduction. Experiments show that BladeDISC achieves 1.2x speedup. Note that there is still around a 20% performance gap compared with the manually optimized version; however, BladeDISC provides more flexibility. Please refer to here for more details. A minimal usage sketch follows the table below.
| opt | latency (ms) | speed-up |
|---|---|---|
| baseline | 9.77 | - |
| hand-write-fusion-opt | 6.79 | 1.43x |
| disc | 8.05 | 1.21x |
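A minimal sketch of a FeatureColumn graph that exercises these sparse ops, assuming the two-line `disc.enable()` TensorFlow integration from the BladeDISC quickstart; the column definition and inputs are only illustrative:

```python
import tensorflow as tf

import blade_disc.tf as disc
disc.enable()  # turn on BladeDISC for this TensorFlow process

# A typical FeatureColumn pattern from a recommendation model: sparse string
# ids hashed into buckets and mean-combined into a dense embedding. Running
# it produces Where / SparseFillEmptyRows / SparseSegmentReduction ops that
# BladeDISC can now lower and fuse.
ids = tf.feature_column.categorical_column_with_hash_bucket(
    "user_tags", hash_bucket_size=10000)
emb = tf.feature_column.embedding_column(ids, dimension=16, combiner="mean")
dense = tf.keras.layers.DenseFeatures([emb])

features = {"user_tags": tf.ragged.constant([["a", "b"], ["c"]]).to_sparse()}

@tf.function
def embed(features):
    return dense(features)

print(embed(features).shape)  # (2, 16)
```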
TorchQuant
We provide a lightweight model quantization tool to support quantization for different backends. Please refer to the details here for more information. In particular, TorchQuant can be used for post-training quantization (PTQ), quantization-aware training (QAT), and mixed-precision quantization. The model exported by TorchQuant can also be easily optimized by BladeDISC to get better execution performance on various platforms.
BladeDISC quantization optimization
We provide an end-to-end example on CPU platforms (e.g. x86, AArch64) to demonstrate how to quantize a PyTorch model using TorchQuant and then optimize the quantized model using BladeDISC. The performance numbers are shown in the CPU benchmark section.
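A hypothetical sketch of the end-to-end flow. The TorchQuant names used here (`Quantizer`, `calib`, `quantize`) are placeholders for a typical PTQ workflow and may differ from the actual API (see the linked example), while the final step assumes the `torch_blade.optimize` entry point:

```python
import torch
import torch_blade

# NOTE: `Quantizer`, `calib` and `quantize` below are placeholders for a
# typical TorchQuant PTQ flow; the real entry points may differ.
from torch_quant import Quantizer

model = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.ReLU()).eval()
calib_batches = [torch.randn(8, 128) for _ in range(16)]

quantizer = Quantizer()
calib_model = quantizer.calib(model)          # instrumented model for calibration
with torch.no_grad():
    for batch in calib_batches:               # collect activation statistics
        calib_model(batch)
quant_model = quantizer.quantize(model)       # export a (fake-)quantized model

# Hand the quantized model to BladeDISC for codegen on x86 / AArch64.
with torch.no_grad():
    opt_model = torch_blade.optimize(quant_model, allow_tracing=True,
                                     model_inputs=(calib_batches[0],))
```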
Stable Diffusion model optimization
We provide a set of optimizations for the Stable Diffusion model based on BladeDISC. Compared with the official diffusers implementation, BladeDISC shows a 2.42x-3.05x speedup for the pipelines (e.g. Text2Img, Img2Img) and reduces memory usage by up to 80%. More details can be found here.
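As one illustrative way to experiment, the `aot_disc` backend can be applied to the UNet of a diffusers pipeline; this sketch is not the full set of optimizations referenced above, and the model id is only an example:

```python
import torch
import torch_blade  # may be needed to register the 'aot_disc' backend
from diffusers import StableDiffusionPipeline

# Compile the UNet, which dominates the pipeline's compute time.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipe.unet = torch.compile(pipe.unet, backend="aot_disc")

image = pipe("a photo of an astronaut riding a horse").images[0]
image.save("astronaut.png")
```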
Other improvements
Fixed a performance bug in the convolution op on the CUDA platform, which greatly improves the performance of ConvNets.
Ongoing work
We are now actively optimizing large language models (LLMs) with BladeDISC.