Releases: alibaba/BladeDISC
BladeDISC 0.4.0
Compute-intensive graph codegen
NV GPU
On the CUDA platform, we now support fusing GEMM with its following element-wise ops (e.g. GELU, transpose) and generating code for the fused pattern based on CUTLASS. Experiments show that this feature achieves up to 1.1x speedup on the BERT model.
Please set `DISC_ENABLE_COMPUTE_INTENSIVE_FUSE=true` if you want to try this feature.
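As a minimal sketch of how one might try this flag, assuming the TorchDynamo integration described in the PyTorch training section below (the model builder is a placeholder, and the environment variable can equally be exported in the shell before launching the process):

```python
import os
# The flag must be set before BladeDISC compilation is triggered.
os.environ["DISC_ENABLE_COMPUTE_INTENSIVE_FUSE"] = "true"

import torch
import torch_blade  # registers the BladeDISC backends with TorchDynamo

model = build_bert_model().eval().cuda()               # placeholder: any GEMM-heavy model
compiled_model = torch.compile(model, backend='disc')  # BladeDISC backend, as in the 0.3.0 notes
```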
AArch64
We introduced the MLIR Transform Dialect into BladeDISC to directly generate code for compute-intensive subgraphs. The integration is complete and preliminary verification has been carried out on the AArch64 platform. The figure below shows some preliminary experimental results. More details can be found here and here.
Please set `DISC_ENABLE_TRANSFORM_SCHEDULE=true` if you want to try this feature on AArch64.
CPU benchmark
We had previously integrated PyTorch Benchmark into our CI flow for the CUDA platform. In this release, we additionally run it on Ice Lake (x86) and Yitian 710 (AArch64) to track the optimization quality and robustness of BladeDISC across various models. See the following summary reports:
We also added an end-to-end example showing how to perform BladeDISC quantization optimization on a PyTorch model. See the following summary reports:
PyTorch training
In PyTorch 2.0, BladeDISC can easily be integrated as one of the compiler backends to accelerate the execution of PyTorch programs. Users only need a one-line code change to enable BladeDISC: `torch.compile(backend='aot_disc')`. BladeDISC achieves a 1.2x speedup on the BERT model. For the Stable Diffusion fine-tune task, its performance slightly exceeds eager mode. We are still actively improving the performance.
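A minimal training-loop sketch of that one-line change; `MyModel`, `dataloader`, and `loss_fn` are placeholders, and the only BladeDISC-specific pieces are the `torch_blade` import and the `aot_disc` backend string:

```python
import torch
import torch_blade  # registers the BladeDISC backends with torch.compile

model = MyModel().cuda()                                    # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
compiled_model = torch.compile(model, backend='aot_disc')   # the one-line change

for inputs, labels in dataloader:                           # placeholder data loader
    optimizer.zero_grad()
    loss = loss_fn(compiled_model(inputs), labels)          # placeholder loss function
    loss.backward()
    optimizer.step()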
Feature Column CodeGen
We now provide automatic fusion and codegen support for FeatureColumn graphs in TensorFlow, which are widely used in recommendation systems. Specifically, we support lowering of the most common sparse ops (e.g. Where, SparseReshape, SparseFillEmptyRows and SparseSegmentReduction) in the FeatureColumn graph, plus fusion optimizations for Where and SparseSegmentReduction. Experiments show that BladeDISC achieves a 1.2x speedup. Note that there is still an around 20% performance gap compared with the manually optimized version, but BladeDISC provides more flexibility. Please refer to here for more details.
| opt | latency (ms) | speedup |
| --- | --- | --- |
| baseline | 9.77 | - |
| hand-write-fusion-opt | 6.79 | 1.43x |
| disc | 8.05 | 1.21x |
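A minimal sketch of what enabling BladeDISC looks like in a TensorFlow program that uses feature columns. The `blade_disc_tf` import and `disc.enable()` call follow the project's TensorFlow quickstart (an assumption here), and the toy feature-column graph below only stands in for a real recommendation model:

```python
import tensorflow as tf
import blade_disc_tf as disc  # assumption: BladeDISC's TensorFlow integration module

disc.enable()  # turn on BladeDISC clustering and codegen for subsequent TF execution

# A toy FeatureColumn graph; real recommendation models use many sparse columns.
item_id = tf.feature_column.categorical_column_with_hash_bucket(
    "item_id", hash_bucket_size=1000)
columns = [tf.feature_column.embedding_column(item_id, dimension=8)]

features = {"item_id": tf.constant([["item_a"], ["item_b"]])}
dense = tf.keras.layers.DenseFeatures(columns)(features)
```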
TorchQuant
We provide a lightweight model quantization tool to support quantization for different backends. Please refer to the details here for more information. In particular, TorchQuant can be used for post-training quantization (PTQ), quantization-aware training (QAT), and mixed-precision quantization. The model exported by TorchQuant can be easily optimized by BladeDISC to get better execution performance on various platforms.
BladeDISC quantization optimization
We provide an end-to-end example on CPU platforms (e.g. x86, AArch64) to demonstrate how to quantize a PyTorch model using TorchQuant and then optimize the quantized model using BladeDISC. The performance numbers are shown in the CPU benchmark section.
Stable Diffusion model optimization
We provide a set of optimizations for the Stable Diffusion model based on BladeDISC. Compared with the official diffusers implementation, BladeDISC shows 2.42x-3.05x speedup for the pipelines (e.g. Text2Img, Img2Img) and reduces memory usage by up to 80%. More details can be found here.
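One plausible way to try BladeDISC on a diffusers pipeline is to compile the UNet (the dominant cost) with the DISC backend; the optimizations shipped in this release may go through a different path, so treat this purely as a sketch:

```python
import torch
import torch_blade  # registers the BladeDISC backends
from diffusers import StableDiffusionPipeline

# Any Stable Diffusion checkpoint works; this one is just an example.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Compile the UNet with the BladeDISC backend; the rest of the pipeline is unchanged.
pipe.unet = torch.compile(pipe.unet, backend='disc')

image = pipe("a photo of an astronaut riding a horse").images[0]
```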
Other improvements
Fixed a performance bug of the convolution op on the CUDA platform, which greatly improves the performance of ConvNets.
Ongoing work
We are now actively optimizing large language models (LLMs) with BladeDISC.
BladeDISC 0.3.0: Announcing PyTorch 2.0 Compilation Support
Our previous release, v0.2.0, mainly delivered the GPU AStitch optimization. Now we are proud to announce the release of BladeDISC v0.3.0.
Highlights
We have done the following over the last six months:
- Initial support of PyTorch 2.0 compilation;
- Contributed TorchToMHLO to Torch-MLIR, together with the ByteDance AML team;
- Added quantization compilation;
- Added compilation support for the Alibaba ARM-based Yitian 710;
- Improved memory-intensive kernel code generation on GPGPU;
- Added shape constraint IR and optimizations.
PyTorch 2.0 and Dynamic Compilation
Over the past half year, to better support PyTorch dynamic compilation, we have:
- Kept focusing on feature support for PyTorch 2.0;
- Collaborated with the Torch-MLIR community;
- Refined the architecture of TorchBlade compilation.
TorchDynamo Compilation
One can now run BladeDISC compilation with just two lines modified in PyTorch 2.0:
import torch_blade # one extra line that registers the BladeDISC backend
model = ...
compiled_model = torch.compile(model, backend='disc')
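Invoking the compiled module is unchanged (the input shape below is arbitrary, since the model above is a placeholder); the first call triggers TorchDynamo graph capture and BladeDISC compilation, and subsequent compatible calls reuse the compiled artifact:

```python
x = torch.randn(8, 128, device='cuda')  # arbitrary example input
y = compiled_model(x)                   # first call compiles; later calls reuse the result
```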
TorchBenchmark
We added PyTorch Benchmark to track the optimization quality and robustness of BladeDISC across various models. See the following summary reports (with BlaDNN):
TorchMLIR(MHLO) and Dynamic Shapes
BladeDISC works closely with the mlir-hlo project. Part of its building blocks, including the MHLO op definitions, TF-to-MHLO conversions, and some general-purpose passes, have been upstreamed to the mlir-hlo repository. In this release, the BladeDISC dev team cooperated with the community to add Torch-to-MHLO conversion to Torch-MLIR, with a particular focus on fully dynamic shape features. See RFC: llvm/torch-mlir#999. We welcome community developers interested in joining this effort.
TorchBlade now converts PyTorch workloads to MHLO based on Torch-MLIR, and then compiles the MHLO modules via the BladeDISC compiler.
PyTorch Compiled Training
We released PyTorch compiled training based on PyTorch 2.0; you can find examples under BladeDISC/examples/PyTorch/Train. These features are not yet stable and are under heavy development. Please keep watching this if you are interested.
EasyCV/NLP Model Compilation
- BEVFormer is a pure-vision model for self-driving cars. BladeDISC speeds up the model by 1.42x in EasyCV.
- BladeDISC also supports diffusion models in this release. It accelerates the end-to-end inference time by 3x in PAI-Diffusion (EasyNLP).
Quantization (Experimental)
We have completed a series of preliminary explorations of combining compilation and quantization, including early solutions and performance verification on multiple hardware platforms, including x86 and ARM. The following table summarizes the results.
| Model | Shape | Device | Before (PyTorch/FP32) | After (Int8 + Compilation) |
| --- | --- | --- | --- | --- |
| bert-mini | 8*64 | g6r / Ampere Altra / 1 core | 135.9 ms | 39.6 ms |
| bert-mini | 8*64 | g8m / Yitian / 1 core | 127.8 ms | 31.1 ms |
| bert-mini | 8*64 | hfg7 / Cooper Lake 8369 / 1 core | 37.5 ms | 21.5 ms |
We will support more hardware (e.g. CUDA) and provide concrete examples of how to quantize PyTorch/TensorFlow models in the near future. We will also continue to improve the inference performance of quantized models.
Improvement in Compilation
Alibaba ARM-Based Yitian 710
We have further improved support for ARM-based CPUs (especially Alibaba's Yitian) with a series of improvements:
- Added support for BF16/int8 GEMM/Conv, making full use of the capabilities of the Yitian hardware;
- Made a series of enhancements to the ARM Compute Library to solve usability issues in dynamic-shape and high-concurrency scenarios;
- Improved the quality of CodeGen for memory-intensive operators, including Stitch-CPU's support for operator reshaping and op-duplication strategies.
Improvement on Mem-Intensive CodeGen
A series of in-depth optimizations is provided for code generation of memory-intensive computation on GPUs. It can bring up to 2x performance gain in inference scenarios on a single LayerNorm layer. This feature can be enabled by setting the environment variable `DISC_MEM_INTENSIVE_OPT_EXPERIMENTAL=true`.
Shape Constraint IR
We have completed the design and development of the shape constraint IR. By introducing shape constraints into the IR, we can fully exploit the structural shape constraints contained in the computation graph, which will benefit the optimization of dynamic shape compilation. You can refer to the design document if you are interested.
Custom Pattern Matching
We rebuilt the process of wiring a custom library call into BladeDISC based on PDL, which greatly reduces the related development workload.
With the new method, one only needs to provide a PDL pattern description file and a kernel conforming to the BladeDISC runtime interface; the pattern replacement and the corresponding kernel call then work without recompiling BladeDISC.
We have used this mechanism in quantization compilation; you can refer to the examples here and here. In the future, we will further extend it with the help of PDL and the transform dialect so that the CodeGen strategy of a specific pattern can also be injected.
Runtime Abstraction Layer
- Support for large model weights
- Concurrency performance improvements
Ongoing Work
- High-performant GEMM kernel CodeGen based on CUTLASS
- MLIR transform dialect-based CodeGen
- Accelerating sparse recommendation models in TensorFlow
BladeDISC 0.2.0
Performance Optimization
GPU stitch fusion
Makes use of GPU shared memory to fuse a reduce operator with its consumers into one kernel. This helps accommodate complex memory-intensive computations (e.g., LayerNorm, Softmax) in one kernel, reducing off-chip memory traffic and the overhead of kernel scheduling and launching. It implements part of the functionality described in the AStitch paper. It is currently being refactored to enhance robustness and is therefore not enabled by default. Users of BladeDISC can enable it by setting the environment variable `DISC_ENABLE_STITCH=true`.
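A sketch of enabling the stitch fusion when optimizing a model with TorchBlade; the `torch_blade.optimize` call follows the project's PyTorch quickstart (an assumption here), and the model and shapes are placeholders:

```python
import os
os.environ["DISC_ENABLE_STITCH"] = "true"  # enable GPU stitch fusion (off by default)

import torch
import torch_blade

model = MyTransformerEncoder().eval().cuda()         # placeholder module with LayerNorm/Softmax
inputs = (torch.randn(8, 128, 768, device="cuda"),)  # placeholder example inputs
with torch.no_grad():
    # assumption: torch_blade.optimize(model, allow_tracing=..., model_inputs=...) as in the quickstart
    optimized_model = torch_blade.optimize(model, allow_tracing=True, model_inputs=inputs)
```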
Note that we already released the CPU stitch optimization, which is enabled by default, when we open-sourced the BladeDISC project. Refer to the materials for more information about the CPU stitch technique.
GEMM merging
Support two types of GEMM merging optimization: merging two GEMMs that share an operand into a single GEMM, and merging two GEMMs with the same shapes into a batched GEMM. GEMM merging helps increase hardware utilization and reduce kernel launch overhead.
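To illustrate the arithmetic the two merge patterns rely on (this is an illustration of the equivalences, not BladeDISC code): two GEMMs sharing the same left operand equal one GEMM on concatenated right operands, and two same-shape GEMMs equal one batched GEMM:

```python
import torch

a = torch.randn(64, 128)
b0, b1 = torch.randn(128, 256), torch.randn(128, 256)

# Merge pattern 1: shared operand -> a single, wider GEMM.
merged = a @ torch.cat([b0, b1], dim=1)
assert torch.allclose(merged, torch.cat([a @ b0, a @ b1], dim=1), atol=1e-4)

# Merge pattern 2: two same-shape GEMMs -> one batched GEMM.
a0, a1 = torch.randn(64, 128), torch.randn(64, 128)
batched = torch.bmm(torch.stack([a0, a1]), torch.stack([b0, b1]))
assert torch.allclose(batched[0], a0 @ b0, atol=1e-4)
assert torch.allclose(batched[1], a1 @ b1, atol=1e-4)
```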
CPU GEMM/Convolution weight pre-packing optimization
Support weight pre-packing optimization for convolution (via the oneDNN library) and GEMM (via the MKL/oneDNN/ACL libraries) operations.
Convolution layout optimization and transpose elimination
Support transforming the layout of the convolution operator to the most efficient format on the specific device (i.e., either CPU or GPU). Most of the introduced transpose operators can then be eliminated in a subsequent transpose-simplifier pass.
Other optimizations
- Optimize the schedule selection strategy for the reduce operator on GPU to enhance thread-level parallelism.
- Algebraic simplification for operators like power.
- Support fusing a splat constant operator with its consumers, reducing memory access overhead. Refer to the issue for details.
Function Enhancement
CPU end-to-end optimization
Support end-to-end optimization for X86 and AArch64 CPUs.
TorchBlade/TensorFlowBlade clustering and optimizing with TensorRT
Cluster sub-graphs according to the operators supported by TensorRT and apply TensorRT optimization to both TensorFlow and PyTorch models.
Accelerating PyTorch Training
Released a PoC version for accelerating PyTorch training via DISC + Lazy Tensor Core; refer to the related issue and design doc.
Shape analysis and simplifier enhancement
Enhance the shape equality analysis according to dimension values. Add analysis of the collapse and expand relationships between dimensions, which helps identify the dimension mapping between the input and output values of a reshape operator. This is the basic building block for supporting GPU stitch fusion.
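As a concrete illustration of the collapse/expand relationship the analysis recovers (shapes are arbitrary): reshaping a (B, S, H) tensor to (B*S, H) collapses input dimensions 0 and 1 into output dimension 0 and maps dimension 2 straight through, and the inverse reshape expands it back:

```python
import torch

x = torch.randn(4, 16, 256)   # (B, S, H)
y = x.reshape(4 * 16, 256)    # collapse: input dims {0, 1} -> output dim 0; dim 2 -> dim 1
z = y.reshape(4, 16, 256)     # expand: output dim 0 -> dims {0, 1}
assert torch.equal(x, z)
```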
Codegen support for int8 datatype
Support int8 datatype for the code generation of memory-intensive operators (e.g., element-wise, reduce operators).
Toolchain Support and Process Optimization
Replay tool
Support dumping clusters and the corresponding input data so that developers can replay the execution, which is effective for debugging and tuning. Refer to the issue.
CI optimization
Enhanced the CI process of the BladeDISC repo, which helps community members contribute to BladeDISC more conveniently and efficiently.
TorchBlade bazel build
Migrated TorchBlade's compilation toolchain from the original CMake to Bazel, enhancing maintainability.
Other
Example preparation
Prepared a set of commonly used models as examples for BladeDISC, and compared the performance of BladeDISC with TensorRT, XLA, and ONNX Runtime (ORT) on these examples.
Community TF rebase
Rebased BladeDISC onto the newest TensorFlow community code.
Code maintenance
Continuous bug fixing and code refactoring.