Add some fast Metal MLX SDPA kernels #2584

Open

EricLBuehler wants to merge 2 commits into main
Conversation

EricLBuehler (Member) commented Oct 29, 2024

This PR adds some MLX SDPA kernels on Metal.

I observe roughly a 26% performance improvement with Llama 3.1 8B at q4k and q8_0 when testing through mistral.rs on my Candle fork. I updated the quantized_llama.rs file here to use the new function.

This PR adds a function, candle_nn::ops::sdpa. The MLX attention kernels don't support masking yet, so the performance gains are only for decoding on Metal. If and when they gain masking support, I'll update the kernels; otherwise, we can explore using llama.cpp's Flash Attention kernels for Metal.
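For reference, a minimal sketch of how the new op might be called from model code. The argument order `(q, k, v, scale, softcapping)` and the use of `1.0` to mean "no softcapping" are assumptions for illustration, not a confirmed signature:

```rust
use candle::{Result, Tensor};

// Hypothetical decode-time usage of the fused op; shapes follow the
// (bs, heads, seq, head_dim) convention used elsewhere in this PR.
fn attend(q: &Tensor, k: &Tensor, v: &Tensor, head_dim: usize) -> Result<Tensor> {
    let scale = 1.0 / (head_dim as f32).sqrt();
    // Assumed argument order: (q, k, v, scale, softcapping); 1.0 is taken
    // here to mean "softcapping disabled".
    candle_nn::ops::sdpa(q, k, v, scale, 1.0)
}
```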

EricLBuehler and others added 2 commits October 29, 2024 06:34
* Sketch the sdpa kernel

* Add full sdpa kernel,

* Add test

* Add vectorized kernel for decoding

* Update tests

* Add some docs

* Fix sdpa_vector names

* Add softcapping for vectorized sdpa

* Add softcapping for full sdpa

* Add support for head dim 32, 96, 256

* Add support for head dim 32, 96, 256

* Update docs

* Add update notice

* Clippy and format
@@ -7,5 +7,6 @@
     "candle-pyo3"
   ],
   "python.testing.unittestEnabled": false,
-  "python.testing.pytestEnabled": true
+  "python.testing.pytestEnabled": true,
+  "rust-analyzer.cargo.features": ["metal"]
Collaborator
Wouldn't that break the editor for non metal users?

// q = (bs, qhead, seq, hidden)
// k/v = (bs, kv_head, seq, hidden)

let _hidden = q_shape[q_shape.len() - 1];
Collaborator
If variables such as _hidden are not planned to be used, maybe remove them?

Comment on lines +1812 to +1831
encoder.set_buffer(0, Some(&q_buffer), q_offset as NSUInteger);
encoder.set_buffer(1, Some(&k_buffer), k_offset as NSUInteger);
encoder.set_buffer(2, Some(&v_buffer), v_offset as NSUInteger);
encoder.set_buffer(3, Some(&output), 0);

encoder.set_bytes(
    4,
    std::mem::size_of::<MLXFastAttentionParams>() as u64,
    &params as *const MLXFastAttentionParams as *const c_void,
);
encoder.set_bytes(
    6,
    (std::mem::size_of::<i32>() * batch_shape.len()) as u64,
    batch_shape.as_ptr() as *const i32 as *const c_void,
);
encoder.set_bytes(
    7,
    (std::mem::size_of::<usize>() * batch_strides.len()) as u64,
    batch_strides.as_ptr() as *const c_void,
);
Collaborator
Couldn't the EncoderParam trait or better the set_params! macro be used to simplify this?

Comment on lines +1920 to +1949
encoder.set_buffer(0, Some(&q_buffer), q_offset as NSUInteger);
encoder.set_buffer(1, Some(&k_buffer), k_offset as NSUInteger);
encoder.set_buffer(2, Some(&v_buffer), v_offset as NSUInteger);
encoder.set_buffer(3, Some(&output), 0);

encoder.set_bytes(
    4,
    std::mem::size_of::<i32>() as u64,
    &gqa_factor as *const i32 as *const c_void,
);
encoder.set_bytes(
    5,
    std::mem::size_of::<i32>() as u64,
    &n as *const i32 as *const c_void,
);
encoder.set_bytes(
    6,
    std::mem::size_of::<usize>() as u64,
    &stride as *const usize as *const c_void,
);
encoder.set_bytes(
    7,
    std::mem::size_of::<f32>() as u64,
    &alpha as *const f32 as *const c_void,
);
encoder.set_bytes(
    8,
    std::mem::size_of::<f32>() as u64,
    &softcapping as *const f32 as *const c_void,
);
Collaborator
Same comment as above, set_params! should be an easy win here.
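For illustration, a sketch of how the vector-kernel argument setup above might collapse with set_params!. This assumes the macro assigns consecutive argument indices starting at 0 (matching slots 0-8 above), that the offsets are plain usize values, and that EncoderParam impls exist for the scalar types involved:

```rust
// Buffers take (buffer, offset) pairs; scalars are passed by value and
// bound in order as argument indices 0..=8.
set_params!(
    encoder,
    (
        (&q_buffer, q_offset),
        (&k_buffer, k_offset),
        (&v_buffer, v_offset),
        &output,
        gqa_factor,
        n,
        stride,
        alpha,
        softcapping
    )
);
```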

    candle::bail!("query `n_heads` must be a multiple of `n_kv_heads`");
}

let k_head = k_l.dims()[k_l.dims().len() - 1];
Collaborator
k_l.dim(D::Minus1)? would be simpler and make for a better error message than a panic (the same applies to a bunch of places in this function)
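A minimal sketch of that suggestion (assuming `k_l` exposes the `dim` accessor the comment references):

```rust
use candle::D;

// Resolves the last dimension through `dim`, which returns a Result with a
// descriptive shape error instead of panicking on an out-of-bounds index.
let k_head = k_l.dim(D::Minus1)?;
```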


impl candle::CustomOp3 for Sdpa {
    fn name(&self) -> &'static str {
        "sdpa"
Collaborator
Let's call this metal-sdpa instead.
