
GPU Challenges

Brandon Echols edited this page Oct 24, 2019 · 1 revision

This wiki is meant for developers who are trying to parallelize portions of the code using a GPU. This page documents potential areas of improvement, challenges, and potential solutions.

Areas of Improvement

The area that would benefit the most from GPU processing is the GaussianFitter. Most of the program's runtime is spent in the GaussianFitter class, and fitting a waveform involves a large number of mathematical operations. Because the Peaks class is used inside GaussianFitter, it will also need a GPU version.

Challenges

GSL not on GPU

The GSL library used for GaussianFitter calculations is currently only compiled for x86 architecture. There is a partial GPU port of this library called cuSL; however, it does not contain the gsl_multifit_nlinear functions. These functions would therefore need to be ported to the GPU before the GaussianFitter class can be.

GaussianFitter on GPU

When porting the GaussianFitter functions, some are easier than others. The functions func_df, func_fvv, func_f, guess_peaks, and calculateFirstDifferences do not contain race conditions. The function gaussian_sum, however, has a shared variable called sum that would introduce a race condition unless that variable is made atomic, which makes parallelizing it less effective.

One way to avoid an atomic sum variable in gaussian_sum would be to split one of the for loops into chunks, assign each chunk to a GPU thread that calculates a partial sum, and then combine the partial sums to produce the whole sum.

Additionally, there are a couple of options for using the GPU with this class. One would be to avoid parallelizing the GSL functions themselves and instead focus on the calculation helper functions that are passed to the GSL library as function pointers. With this approach, the developer avoids porting the GSL code: the only GaussianFitter functions that would need to be ported to the GPU are func_df, func_fvv, func_f, and gaussian_sum, and the only GSL functions are gsl_matrix_set, gsl_vector_set, and gsl_vector_get. The other option would be to use an open-source library called Gpufit, which implements the Levenberg-Marquardt curve-fitting algorithm on the GPU.

Limit cudaMalloc and cudaFree calls

Most of the time spent using a GPU goes to allocating memory on the device and copying data over to it. Therefore, the number of cudaMalloc and cudaFree calls should be limited. cudaMemcpy calls should also be limited where possible, although copying does not take nearly as long as allocating and freeing memory.

One potential solution is to allocate memory for a waveform once before processing begins, and then copy data into that buffer each time a new waveform is fitted. Another idea is to process the waveforms in batches of 'x' waveforms. Both keep the memory copying between CPU and GPU to a minimum, increasing performance.

Using threads with a single GPU

The main challenge arises when multiple threads try to pass data to the GPU and launch kernels. This is a problem because the GPU expects the calling CPU thread to hold the current context, which tells the GPU that this thread is in charge of making calls.

There are a couple of ways around this issue. One is to set up a task queue serviced by a single thread that is the only one to interact with the GPU; all other threads send a task (a waveform in this case) to the queue. The problem with this solution is that it essentially makes the GPU portion of the program single-threaded again, which defeats the purpose of parallelism in the first place (note: it is difficult to say whether this idea would yield a performance increase). Another way would be a thread-safe context class that threads call to reserve the GPU when they need calculations done and to release it when the calculation is finished.