Add softcapping support to flash attention #2437

EricLBuehler · 2024-08-22T01:43:21Z

No description provided.
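For readers unfamiliar with the term, softcapping of attention logits usually means squashing each pre-softmax score `x` into the range `(-softcap, softcap)` via `softcap * tanh(x / softcap)`. The sketch below illustrates that formula on a plain slice of scores. It is a minimal illustration under that assumption, not the kernel code from this PR; the names `softcap_logit` and `softcapped_softmax` are hypothetical.

```rust
/// Softcap one attention logit: softcap * tanh(x / softcap).
/// Hypothetical helper for illustration only; not an API from this PR.
fn softcap_logit(x: f32, softcap: f32) -> f32 {
    softcap * (x / softcap).tanh()
}

/// Apply softcapping to a row of pre-softmax scores, then a
/// numerically stable softmax over the capped values.
fn softcapped_softmax(scores: &[f32], softcap: f32) -> Vec<f32> {
    // Squash every score into [-softcap, softcap].
    let capped: Vec<f32> = scores.iter().map(|&s| softcap_logit(s, softcap)).collect();
    // Subtract the row maximum before exponentiating to avoid overflow.
    let max = capped.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = capped.iter().map(|&c| (c - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}
```

Calling `softcapped_softmax(&[1.0, 2.0, 300.0], 50.0)` keeps the ordering of the scores while bounding the largest logit at roughly 50 rather than 300. A fused flash attention kernel would apply the same transform inside the kernel instead of materializing the score matrix.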

* Offset it * Freeze * Offset it * Offset it * Try out vllm impl again * Try out vllm impl again * Try out vllm impl again * Try out vllm impl again * Try out vllm impl again * Try out vllm impl again * Try out vllm impl again * Try out vllm impl again * Try out vllm impl again * Try out vllm impl again * Try out vllm impl again * Try out vllm impl again * Try out vllm impl again * Remove debugs * Polish it up * Polish it up * Clippy * Remove test file * Add config for if neox * Fix bug * Fix bug * Cast cache type on rust side * Cast types * To dtype * Drop temp * Update casting * Update casting * Update casting * Create dtype in bf16 * Check type * Debug * Check dtype * Check dtype * Check dtype * Check dtype * Check dtype * Check dtype * Check dtype * Check dtype * Check dtype * Debug * Debug * Debug * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Check old method * Use mistral slow rope impl * Use mistral slow rope impl * Use mistral slow rope impl * Use mistral slow rope impl * Use mistral slow rope impl * Use mistral slow rope impl * Use mistral slow rope impl * Use mistral slow rope impl * Use mistral slow rope impl * Use mistral slow rope impl * Use mistral slow rope impl * Use mistral slow rope impl * Reseting * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Remove debug * Debug * Debug * Remove debug * Remove debug * Debug * Remove debug * Debug * Remove debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug 
* Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Debug * Try to use 3dim rotemb fused * Try to use 3dim rotemb fused * Remove contig and debug * Check handling * Cleanup * Fix * Remove prints * Lower block dim * Use fused layernorm * Pass batch size * Simplify internal API * Simplify internal API * Try slow * Try candle layer norm * Try candle layer norm * Fix dep of candle layer norm * Reshape input for rank 2 * Reshape input for rank 2 * Fix ref * Code style * Make dep optional * Ensure contig * Ensure contig * Ensure contig * Debug contig dmmv error * Debug contig dmmv error * Debug contig dmmv error * Debug contig dmmv error * Try other method * Try other method * Try other method * Try other method * Try other method * Use typestate to optimize * Use typestate to optimize * Fixes * Fixes * Fixes * Fixes * Fixes * Debug via using slow rmsnorm * Debug via using slow rope * Remove debug * More debugging * Remove debug * Remove debug * Remove debug * Add better error enum * Fix diff marker * Fix some things * Fix some things * Fix some things * Fix dummy backends * Re add from storage noop * Fix removed kvconcat custom op * Fix erroneous feature gate * Complete metal backend refactoring * Check if calling * Check if calling * Update default for force dmmv * Load atomic * Debug * Use mmvq * Update * Add the empty functions * Add rope new_partial function * Make variant of qmatmul pub * Make variant of qmatmul pub * Add the varbuilder set_device function * Only link stdc++ if target has msvc * Only link stdc++ if target has msvc * Only link stdc++ if target has msvc * Only link stdc++ if target has msvc * Handle case of device mapping * Handle case of device mapping * Add getter * Fix * Fix * Support nvcc flags in flash attn * Support nvcc flags in flash attn * Support nvcc flags in flash attn * Support nvcc flags in flash 
attn * Support nvcc flags in flash attn * Fixes * Fixes * Fix the tests * Fix the tests

* Support flash-attn in quantized phi3. (huggingface#2194) * Use flash-attn in gemma. (huggingface#2195) * Use flash-attn in gemma. * Fix flash-attn for head dim 256. * Remove candle-layer-norm --------- Co-authored-by: Laurent Mazare <laurent.mazare@gmail.com>

* define structs * construct ResidualConvUnit * forward() for ResidualConvUnit * implement FeatureFusionBlock * implement Scratch * implement DPTHead * add identity module * implement forward for DTPHead * add get_intermediate_layers to DinoVisionTransformer * implement DepthAnythingV2 * some minor tweaks * fix compile errors * fix var builder prefixes * setup initial example * use fixed patch size of 37 (518 / 14) * debugged until output * print min and max values * add some dynamism to the output location * scale input image * extract prep function * extract output path function * normalize image with magic mean and std * add spectral coloring * squeeze in the right place * make enterpolation optional * use bail instead of panic * omit unnecessary Shape call * remove empty curly braces * use bail instead of assert * use vb and pp * remove closures * extract config object * Apply rustfmt. * Fix some clippy lints. * More lints. * Use the array methods. --------- Co-authored-by: laurent <laurent.mazare@gmail.com>

* feat(gemm): implement Gemm operator in candle-onnx * feat(onnx): Add support for ArgMax operator in candle-onnx * Apply rustfmt. * Remove argmax as it was already present. --------- Co-authored-by: Laurent <laurent.mazare@gmail.com>

* Add: DINOv2Reg4 with PlantCLEF2024 weights and example ( See https://arxiv.org/abs/2309.16588 and https://zenodo.org/records/10848263 ) * Remove extra files + update README to download them + remove extra lines * minor fix (README remove extra spaces) * minor fix (README: Fix image url) * Modif: Add back interpolate_pos_encoding() + fix when no interpolation + remove extra comments + Update README ( source image changed and so the predictions ) * Fix: Improve code lisibility with '$ cargo clippy' and '$ cargo fmt' * Another clippy fix. --------- Co-authored-by: x-VEspit <vincent.espitalier@cirad.fr> Co-authored-by: laurent <laurent.mazare@gmail.com>

* Add the flux autoencoder. * Add the encoder down-blocks. * Upsampling in the decoder. * Sketch the flow matching model. * More flux model. * Add some of the positional embeddings. * Add the rope embeddings. * Add the sampling functions. * Add the flux example. * Fix the T5 bits. * Proper T5 tokenizer. * Clip encoder path fix. * Get the clip embeddings. * No configurable weights in layer norm. * More weights related fixes. * Yet another shape fix. * DType fix. * Fix a couple more shape issues. * DType fixes. * Fix the latent dims. * Fix more shape issues. * Autoencoder fixes. * Get some generations out. * Bugfix. * T5 padding. * Clippy fix. * Add the decode only mode. * Fix. * More fixes. * Finally get some generations to work. * Add readme.

* add models support and example for THUDM/glm-4 * fix the ci report * fmt * fix * Update README.org * Update README.org * fmt * Update README.org * README.md add codegeex4 * README.md add glm4 * Typo. * change expect into ? --------- Co-authored-by: Laurent Mazare <laurent.mazare@gmail.com>

* Soft NMS with thresholds * NMS Test * Soft nms w/ boxes removed below threshold * Soft nms test * No longer removing bounding boxes to fit Soft-NMS focus * Initialize confidence * Added comments * Refactored out updating based on IOU/sigma * Score_threshold -> confidence_threshold for clarity * Remove bboxes below confidence threshold * Softnms basic functionality test * Softnms confidence decay test * Softnms confidence threshold test * Softnms no overlapping bbox test * Testing confidence after no overlap test * Single bbox and no bbox tests * Signify test completion * Handling result of test functions * Checking all pairs of bboxes instead of a forward pass * Equal confidence overlap test * Clarified tests for implementation * No longer dropping boxes, just setting to 0.0 * Formatted w/ cargo

EricLBuehler and others added 30 commits May 15, 2024 15:10

Merge remote-tracking branch 'upstream/main'

4e82fab

Merge remote-tracking branch 'upstream/main'

37cafcc

fix issue with cuda header file for A10G (#5)

5892fac

Merge remote-tracking branch 'upstream/main'

9b151f5

Merge

38f8d9e

Merge

c10fc33

Merge remote-tracking branch 'upstream/main'

527ebcc

Merge remote-tracking branch 'upstream/main'

bfc197b

Merge remote-tracking branch 'upstream/main'

0c2ac76

Merge remote-tracking branch 'upstream/main'

cb3dbc2

Add a set_dtype method

faa9435

Merge remote-tracking branch 'upstream/main'

462d948

Merge remote-tracking branch 'upstream/main'

5c06acd

Add more capability to slice_assign (#7)

696acaa

Implement unfold (#8)

0936406

* Add unfold * Format

Merge remote-tracking branch 'upstream/main'

636de1d

Bump cudarc to 0.11.5 (#10)

f52e234

Add QTensor::quantize_onto (#12)

bb8f6f0

* Add the quantize_onto api * Take ref * Clippy * Format * Add error checking

implement Slice op (huggingface#2260)

5b04d96

Fix the fast bf16 gemm cublas kernels. (huggingface#2274)

f7095bb

* Use flash-attn in gemma. * Fix for the fast bf16 cublas gemm. * Fix some clippy lints. * Fix another lint. * Proper clippy fix.

Fix a bug in the metal implementation of col2im1d. (huggingface#2284)

b55b360

make up for the missing last token output of phi2 example (huggingface#2299)

b438cba

Patch metal function

b7a3e34

Complete merge

c967be9

Expose cublas handle

9e09d7f

EricLBuehler and others added 27 commits August 7, 2024 17:06

Simplify things a bit

27ca77e

Mistral.rs GPTQ dev PR (#14)

7ad6494

* Add i32 dtype for cpu and cuda, with kernels * Fix cuda i32 * Fix cpu i32 * Add cuda map impls for i32 * Start to add to metal * Add the kernels * Oops * Fix dtype cast in safetensors * Oops * Oops * Add bf16 to i32 and vice versa casts

Fix on metal

6f0e190

Simplify handling of flux modulations. (huggingface#2394)

0a146d7

optimize gradient for silu a bit (huggingface#2393)

0f55c37

Support the flux-dev model too. (huggingface#2395)

aef4eba

Support for mistral-nemo. (huggingface#2396)

c301efa

Add the MMDiT model of Stable Diffusion 3 (huggingface#2397)

f8e2b36

* add mmdit of stable diffusion 3 lint add comments * correct a misplaced comment * fix cargo fmt * fix clippy error * use bail! instead of assert! * use get_on_dim in splitting qkv

Add the import script for the T5 tokenizer. (huggingface#2399)

0e78d29

fix: usage of actions/checkout@v2 (huggingface#2403)

1b796b9

* chore: changes from formatting on save * fix: usage of `actions/checkout@v2`

Fix issues in the encodec example README.md (huggingface#2407)

c9cdd54

Also squeeze the first dimension of the codes tensor in the example file to get the expected three dimensions.

Add documentation examples for Tensor::i and Tensor::narrow methods (huggingface#2308)

de719a2

* Add documentation examples for `Tensor` methods * Apply fmt. * Cosmetic tweaks. --------- Co-authored-by: Laurent <laurent.mazare@gmail.com>

Add Based LLM from Hazy Research. (huggingface#2411)

2e72a3d

Fix the device for the bert attention mask. (huggingface#2414)

d7a9bd0

Clippy fixes. (huggingface#2415)

3d40ffc

* Clippy fixes. * Bump the web_sys required version.

Update flash_fwd_launch_template.h with fix for kernels (#16)

c5c5d49

Build fixes

2386e4e

Merge branch 'sdpa'

a38053f

Add GGUF BF16 support (#17)

1b1974e

* Add GGUF bf16 type support * Add non avx impl for vec_dot_bf16 * Fix from_u32 * Fix loading * Fix dequant of bf16

Expose the softcap methods

d632eb5

Add some tests

da095a6

Merge remote-tracking branch 'upstream/main'

36bd9f9

Complete merge

6fbddd6

Merge branch 'main' into flash_attn_softcap

a3431d1

EricLBuehler closed this Aug 22, 2024

EricLBuehler deleted the flash_attn_softcap branch August 22, 2024 01:43

EricLBuehler restored the flash_attn_softcap branch August 22, 2024 01:44
