TP sharding v2 #216

Narsil · 2023-07-21T15:11:09Z

Second, updated version of TP sharding.

This has NOT been tested for accuracy (because I haven't implemented all_reduce).
I'm currently waiting for a simple API to get the CUDA storage so I can call all_reduce directly and recreate a tensor from the result. (That way I don't have to make everything `pub`.)

I tried to keep the modifications minimal in var_builder for now as this is not the purpose of this PR.

Narsil · 2023-07-26T10:25:59Z

candle-examples/examples/llama_multiprocess/model.rs

+struct AllReduce {
+    comm: Rc<Comm>,
+}
+
+/// This is actually not safe: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/threadsafety.html
+/// But for this example purposes, this will work
+unsafe impl Sync for AllReduce {}
+/// This is actually not safe: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/threadsafety.html
+/// But for this example purposes, this will work
+unsafe impl Send for AllReduce {}
+
+impl CustomOp1 for AllReduce {
+    fn name(&self) -> &'static str {
+        "allreduce"
+    }
+
+    fn cpu_fwd(&self, _s: &CpuStorage, _l: &Layout) -> Result<(CpuStorage, Shape)> {
+        todo!("implement allreduce for cpu is not necessary for single node");
+    }
+
+    #[cfg(feature = "cuda")]
+    fn cuda_fwd(
+        &self,
+        s: &candle::CudaStorage,
+        l: &Layout,
+    ) -> Result<(candle::CudaStorage, Shape)> {
+        use candle::cuda_backend::WrapErr;
+        let elem_count = l.shape().elem_count();
+        let dev = s.device().clone();
+        let s = s.as_cuda_slice::<f16>()?;
+        // let s = match l.contiguous_offsets() {
+        //     None => Err(Error::Wrapped("input has to be contiguous".into()))?,
+        //     Some((o1, o2)) => s.slice(o1..o2),
+        // };
+        let mut dst = unsafe { dev.alloc::<f16>(elem_count) }.w()?;
+        self.comm.all_reduce(s, &mut dst, &ReduceOp::Sum).unwrap();
+        let dst = candle::CudaStorage::wrap_cuda_slice(dst, dev);
+        Ok((dst, l.shape().clone()))
+    }
+}
+
+fn all_reduce_sum(x: &Tensor, comm: &Rc<Comm>) -> Result<Tensor> {
+    x.custom_op1(AllReduce { comm: comm.clone() })
+}
+


This is the core of the change: `AllReduce` wraps the NCCL all-reduce as a `CustomOp1`.
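For context, here is a minimal sketch of where a sum all-reduce fits in tensor-parallel linear layers. It is plain Rust with no candle or NCCL dependency, and the ranks are simulated by a loop inside one process. All function names (`matvec`, `all_reduce_sum_sim`, `row_parallel`, `col_parallel`) are hypothetical and are not part of this PR; the sketch assumes the dimensions divide evenly by the world size.

```rust
/// Multiply a row vector `x` by a row-major matrix `w` (x.len() rows).
fn matvec(x: &[f32], w: &[Vec<f32>]) -> Vec<f32> {
    let cols = w[0].len();
    let mut out = vec![0.0; cols];
    for (xi, row) in x.iter().zip(w) {
        for (o, wij) in out.iter_mut().zip(row) {
            *o += xi * wij;
        }
    }
    out
}

/// Stand-in for an all-reduce with a sum op (hypothetical helper):
/// returns the elementwise sum of every rank's partial result.
fn all_reduce_sum_sim(partials: &[Vec<f32>]) -> Vec<f32> {
    let mut out = vec![0.0; partials[0].len()];
    for p in partials {
        for (o, v) in out.iter_mut().zip(p) {
            *o += v;
        }
    }
    out
}

/// Row-parallel linear: each rank holds a block of input rows of `w`
/// and the matching slice of `x`; partial outputs must be summed.
fn row_parallel(x: &[f32], w: &[Vec<f32>], world_size: usize) -> Vec<f32> {
    let chunk = w.len() / world_size;
    let partials: Vec<Vec<f32>> = (0..world_size)
        .map(|rank| {
            let r = rank * chunk..(rank + 1) * chunk;
            matvec(&x[r.clone()], &w[r])
        })
        .collect();
    all_reduce_sum_sim(&partials)
}

/// Column-parallel linear: each rank holds a block of output columns;
/// the per-rank outputs are concatenated, with no reduction needed.
fn col_parallel(x: &[f32], w: &[Vec<f32>], world_size: usize) -> Vec<f32> {
    let chunk = w[0].len() / world_size;
    (0..world_size)
        .flat_map(|rank| {
            let shard: Vec<Vec<f32>> = w
                .iter()
                .map(|row| row[rank * chunk..(rank + 1) * chunk].to_vec())
                .collect();
            matvec(x, &shard)
        })
        .collect()
}
```

With a 4x2 weight matrix split across two simulated ranks, both `row_parallel` and `col_parallel` should reproduce the unsharded `matvec` result. Only the row-parallel path needs the sum, which is the step the `AllReduce` op above performs on the GPU.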

TP Row/Col

candle-nn/src/var_builder.rs

Narsil requested a review from LaurentMazare July 21, 2023 15:11

Narsil mentioned this pull request Jul 21, 2023

TP sharded - Multiprocess #107

Closed

Narsil force-pushed the llama_multiprocess2 branch from 7ac49b5 to d384f9f Compare July 25, 2023 19:29

Narsil commented Jul 26, 2023

LaurentMazare reviewed Jul 27, 2023

Narsil added 6 commits July 27, 2023 09:58

TP sharding v2

1735e48

Fixed TP sharded version.

ed58de7

PyO3 is back.

b7814f6

Tensor are not necessarily sendable (CustomOp1).

1553b58

Removing inner dependency on safetensors.

7c7e6ba

Putting back Send + Sync

25a2086

Narsil force-pushed the llama_multiprocess2 branch from ee41ce7 to 25a2086 Compare July 27, 2023 09:26

Fixing slice errors + comments.

952eca6

Narsil requested a review from LaurentMazare July 27, 2023 15:03

Added comment about offsets.

8435a99

LaurentMazare approved these changes Jul 28, 2023

Narsil merged commit 4f260ef into main Jul 28, 2023
10 checks passed

LaurentMazare deleted the llama_multiprocess2 branch August 15, 2023 20:10
