blas2cuda

This is a library that intercepts calls to CPU BLAS kernels and runs their equivalents on the GPU in a CUDA environment.

Compiling

$ export CUDA=...
$ meson -DCUDA=$CUDA build && ninja -C build

How it works

TODO: expand this section

Allocation tracking

This step is done once:
- track ALL object allocations
- remove object-specific information, leaving only the calls to malloc()
- save the result to a file

Object tracking

In code (blas2cuda):
- define a custom object manager (struct objmngr)
- any time an allocation file is loaded, we use this memory manager
Any time malloc() is called:
1. the object tracker compares the call info (the requested size, plus the instruction pointer obtained with libunwind) with the allocation list
2. if the call info matches, it allocates the object using the custom memory manager defined for the call, and tracks the object
3. if the call info doesn't match, malloc() behaves normally
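
The matching step above can be sketched in C as follows. This is a minimal illustration, not the code in blas2cuda: `struct alloc_record`, `find_alloc()` and `should_track()` are hypothetical names, the allocation list is assumed to be a flat array already loaded from the allocation file, and the instruction pointer is passed in as a plain integer instead of being read with libunwind.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry of the allocation list: one recorded malloc() call. */
struct alloc_record {
    size_t    size; /* size that was requested from malloc() */
    uintptr_t ip;   /* instruction pointer of the call site */
};

/* Return the record matching (size, ip), or NULL if the call is unknown. */
static const struct alloc_record *
find_alloc(const struct alloc_record *list, size_t count,
           size_t size, uintptr_t ip)
{
    for (size_t i = 0; i < count; i++) {
        if (list[i].size == size && list[i].ip == ip)
            return &list[i];
    }
    return NULL;
}

/* True if this malloc() call should go to the custom memory manager;
 * false if it should be handled normally. */
static bool
should_track(const struct alloc_record *list, size_t count,
             size_t size, uintptr_t ip)
{
    return find_alloc(list, count, size, ip) != NULL;
}
```

A linear scan keeps the sketch short. Since the lookup would run on every malloc() call, a hash table keyed on the (size, instruction pointer) pair is the more likely choice for a large allocation list.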

Motivation for object tracking

Without object tracking, every kernel call would have to copy its data to the GPU, invoke the kernel, and copy the result back to the CPU. For a series of calls to kernels that aren't computation-intensive (Level 1 and Level 2 BLAS calls are vector-vector and matrix-vector operations), throughput degrades significantly because the time to transfer data dominates the time spent computing.
- This is why NVBLAS, a similar project, only intercepts computation-intensive Level 3 matrix-matrix operations, where the computation dominates data transfer.
- However, even then the data is still copied back and forth on every call.
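
To make the transfer-versus-compute argument concrete, here is a rough back-of-the-envelope model in C. It uses the standard operation counts for AXPY (Level 1) and square GEMM (Level 3) and assumes, as a simplification, that every operand is copied to the GPU and the result copied back on each call. The function names are made up for this sketch, and n is assumed to be greater than zero.

```c
#include <stddef.h>

/* AXPY (Level 1): y = a*x + y on vectors of length n.
 * Returns floating-point operations per element transferred. */
static double axpy_flops_per_element(size_t n)
{
    double flops = 2.0 * (double)n; /* one multiply and one add per element */
    double moved = 3.0 * (double)n; /* x and y to the GPU, y back */
    return flops / moved;
}

/* GEMM (Level 3): C = a*A*B + b*C on n-by-n matrices.
 * Returns floating-point operations per element transferred. */
static double gemm_flops_per_element(size_t n)
{
    double nn    = (double)n;
    double flops = 2.0 * nn * nn * nn; /* n multiply-adds per output element */
    double moved = 4.0 * nn * nn;      /* A, B and C to the GPU, C back */
    return flops / moved;
}
```

Under this model the AXPY ratio stays at 2/3 regardless of n, while the GEMM ratio grows as n/2 (512 for n = 1024), which is why only Level 3 calls can amortize an explicit copy.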
blas2cuda uses object tracking to distinguish memory objects that are used in BLAS kernels from other memory objects we don't care about.
When malloc() is called for an object we care about, we use cudaMallocManaged() instead and return a memory address that is shared between the CPU and GPU. This memory is a managed object, and a later call to free() on it will use cudaFree() instead.
By intercepting the right calls, we can tell when these memory objects are later used in kernels, and avoid copying.
Instead of explicit copying, a page faulting mechanism is used to move data between the CPU and GPU.
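
The bookkeeping this implies can be sketched in C as below. This is an assumption-laden illustration, not the blas2cuda implementation: the fixed-size pointer table and all function names are hypothetical, and `managed_alloc()` / `managed_release()` are plain-malloc stand-ins so the sketch builds without the CUDA toolkit. In a real build they would wrap cudaMallocManaged(&ptr, size, cudaMemAttachGlobal) and cudaFree(ptr).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_MANAGED 1024

/* Hypothetical table of currently live managed objects. */
static void  *managed[MAX_MANAGED];
static size_t n_managed;

/* Stand-ins for cudaMallocManaged() and cudaFree() (see lead-in). */
static void *managed_alloc(size_t size) { return malloc(size); }
static void  managed_release(void *ptr) { free(ptr); }

/* Allocate a managed object and remember its address. */
static void *alloc_managed_object(size_t size)
{
    if (n_managed == MAX_MANAGED)
        return NULL;
    void *ptr = managed_alloc(size);
    if (ptr != NULL)
        managed[n_managed++] = ptr;
    return ptr;
}

/* A BLAS wrapper can call this to decide whether a copy is needed. */
static bool is_managed(const void *ptr)
{
    for (size_t i = 0; i < n_managed; i++) {
        if (managed[i] == ptr)
            return true;
    }
    return false;
}

/* Replacement for free(): managed objects go to the managed release
 * function, everything else goes to the normal free(). */
static void tracked_free(void *ptr)
{
    for (size_t i = 0; i < n_managed; i++) {
        if (managed[i] == ptr) {
            managed[i] = managed[--n_managed]; /* swap-remove the entry */
            managed_release(ptr);
            return;
        }
    }
    free(ptr);
}
```

An intercepted BLAS call would check `is_managed()` on each operand: managed operands can be handed to the GPU kernel directly, while unmanaged ones would still need an explicit copy or a fallback to the CPU routine.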

(Outdated) Running a program

$ ./blas2cuda.sh <objtrackfile> <program>

