No way to copy a tensor from gpu to cpu to pre allocated array. #1388

Open
LukePoga opened this issue Oct 19, 2024 · 4 comments

@LukePoga

There doesn't appear to be any way to transfer a result tensor into an existing CPU float array. The code below requires a new memory allocation:

    var cpuResult = gpuResult.cpu();
    float[] result = cpuResult.data<float>().ToArray();

If this is part of a loop, that is a lot of wasted memory allocation and time! Below is how libraries normally do this, e.g. CUDA:

float[] cpuResult; // ..... pre-allocated further up
gpuResult.CopyToHost(cpuResult);
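
For comparison, here is a minimal sketch of the current TorchSharp pattern inside a loop, with a new managed array allocated on every iteration (RunModel, Consume, and steps are just illustrative placeholders):

// Every pass through the loop allocates a fresh managed array,
// even though the result size never changes.
for (int step = 0; step < steps; step++)
{
    torch.Tensor gpuResult = RunModel(step);             // some GPU computation (placeholder)
    using var cpuResult = gpuResult.cpu();               // device-to-host transfer
    float[] result = cpuResult.data<float>().ToArray();  // new float[] every iteration
    Consume(result);                                     // downstream use (placeholder)
}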

Maybe I missed this CopyTo, because it's kind of essential for any GPU-type library (?!)

Is this project maintained?

@haytham2597 commented Oct 21, 2024

One way to make a "fast" copy without overloading memory is to make the tensor contiguous and add this to TensorAccessor.cs in the ToArray() function:

if (_tensor.is_contiguous()) {
    // This is very fast, and works VERY WELL.
    var shps = _tensor.shape;
    long TempCount = 1;
    for (int i = 0; i < shps.Length; i++)
        TempCount *= shps[i]; // The numel is simply the product of the shape dimensions.
    unsafe {
        // Bulk-copy the contiguous native buffer into a new managed array.
        return new Span<T>(_tensor_data_ptr.ToPointer(), Convert.ToInt32(TempCount)).ToArray();
    }
}

I added this in one commit of my Autocast pull request. I am still trying to figure out how to apply the same idea when the tensor is not contiguous, because with this approach I always need to make the tensor contiguous to get the faster copy:

torch.Tensor te = /* ... */;
float[] arr = te.contiguous().data<float>().ToArray();

I noticed that if the tensor is not contiguous, the Numel method gets called every time, so it is always recomputed.

Edit: Oh sorry, I misunderstood what you meant; I think CopyTo will work. You mean like this?

float[] data = new float[h * w * 3]; // Pre-allocated at the top of the function, for example.

// ... intense functions and processing ...

tenGPU.data<float>().CopyTo(data); // `tenGPU` is a torch.Tensor allocated on the GPU.

I will test this. If that does not work, I will soon investigate how to do it.

@haytham2597

I recently tested this and it works well.

@LukePoga
Author

Great, thanks. I don't know why I didn't see CopyTo before.

tenGPU.data<float>().CopyTo(data); 

But it's not faster. This takes 340 ms for 12,000,000 floats. That is about 150 MB/s, which is extremely slow for PCIe bandwidth. Why is it so slow?
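
For reference, here is a minimal sketch of how this can be timed, assuming (as in the comments above) that data<float>() can be read directly from a CUDA tensor; the tensor size and variable names are illustrative:

using System;
using System.Diagnostics;
using static TorchSharp.torch;

// 12,000,000 floats is about 48 MB, matching the measurement above.
var tenGPU = rand(new long[] { 12_000_000 }).cuda();
float[] data = new float[12_000_000];   // pre-allocated destination

var sw = Stopwatch.StartNew();
tenGPU.data<float>().CopyTo(data);      // element-by-element copy through the accessor
sw.Stop();
Console.WriteLine($"CopyTo took {sw.ElapsedMilliseconds} ms");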

@haytham2597

@LukePoga
Because in TensorAccessor.cs (line 41) you can see that it calls _tensor.numel(), and inside the loop in GetSubsequentIndices, for example, Numel is called every time. That uses a lot of CPU and slows things down: the function is called many times, and the copy also iterates over the pointer array one element at a time, assigning into the pre-allocated array. So my solution was to modify that TensorAccessor for a fast copy, but it only works if the tensor is contiguous, so before CopyTo or ToArray() you should create a contiguous tensor like this:

torch.Tensor tenGPU = /* ... */;
// ...
tenGPU = tenGPU.contiguous();
// After that you can call tenGPU.data<T>().ToArray() or CopyTo (T is the element type).
tenGPU.data<float>().ToArray(); // or tenGPU.data<float>().CopyTo(preAllocatedArray)

My fast TensorAccessor pre-computes Numel once (simply the product of all shape elements) and then creates a complete copy without looping over the pointer array:

// From my branch of TorchSharp/Utils/TensorAccessor.cs
unsafe {
    return new Span<T>(_tensor_data_ptr.ToPointer(), Convert.ToInt32(TempCount)).ToArray();
}

This does not iterate over the array assigning values index by index; it creates a complete copy in one go.

Soon I will make a PR for a fast TensorAccessor, but as a reminder, it will only be fast if the tensor is contiguous.
For the non-contiguous case I still need to figure it out; maybe it can be made a bit quicker by pre-computing Numel.
The non-contiguous case is more complex because of the strides.
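
Putting the pieces of this thread together, a possible helper for the original use case could look like this (CopyToHost here is just an illustrative name, not an existing TorchSharp API):

using static TorchSharp.torch;

static class TensorCopy
{
    // Copy a GPU tensor into a pre-allocated managed array:
    // make it contiguous (a no-op if it already is), move it to the CPU,
    // then let the accessor fill the existing buffer instead of allocating a new one.
    public static void CopyToHost(Tensor gpuTensor, float[] destination)
    {
        using var cpuTensor = gpuTensor.contiguous().cpu();
        cpuTensor.data<float>().CopyTo(destination);
    }
}

Whether this is actually fast depends on whether the accessor takes a bulk path for contiguous CPU tensors, which is exactly what the fast TensorAccessor change above is meant to provide.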
