Optimization potential #138

BenBE · 2022-01-24T21:37:26Z

Doing some rudimentary profiling I noticed, we are wasting a huge amount of time of the main thread in essentially 3 functions:

alpha_blend (~40%)
convert_rgb_to_yuyv (~25%)
cap.retrieve (~20%)

Values are relative time for main, which takes up roughly the same amount of time, we actually spend processing images (main ~32.75%, bs_maskgen_process ~33.95%, total runtime).

On the positive: bs_maskgen_process spends ~95% of the time waiting for processing by TFLite.

FWIW: Timings heavily affected by running under callgrind, but looking at the code of alpha_blend and convert_rgb_to_yuyv I'm not really surprised of these results. A rewrite of these functions using some vectoring should yield quite a bit of improvement.

The text was updated successfully, but these errors were encountered:

floe · 2022-01-25T08:54:55Z

Ah, that does sound like SSE and friends could do a good job speeding things up. I guess the most portable way to do that is using gcc intrinsics?

BenBE · 2022-01-25T09:09:15Z

Either GCC intrinsics (as portable as they are) or we try to look into things and see if OpenCV/XNN already offers those two methods with an optimized version we could use.

phlash · 2022-01-26T17:14:44Z

I'm aware that those were written because OpenCV doesn't do them.. but happy to be proven wrong if we up the minimum OCV version 😄

Found this Q&A: https://answers.opencv.org/question/211941/how-to-use-simd-feature-of-opencv-as-third-party-library/

I'm also aware that OCV will try to use OpenCL and GPU if they are available (it collided with TFLite when used in multiple threads!).. more research needed!

[edited to add] Found this which looks near for alpha_blend: https://stackoverflow.com/questions/13128696/which-is-the-most-efficient-way-to-do-alpha-mask-in-opencv

There is also libvolk for portable SIMD kernels (as used by Gnuradio and others), which is a standard package on many Linuxen.

BenBE · 2022-01-26T17:26:29Z

I don't have any objections on using a library for those particular vectorized kernels, as doing those optimizations ourselves would be more work than sensible.

Also check if other places in the current code base (e.g. the conv_bias stuff) could benefit from using the library.

martok · 2022-01-29T20:29:11Z

Just talked with @BenBE, time for my regularly scheduled one-off maybe-useful comment :)

alpha_blend (~40%)

OpenCV has a well-hidden function for that, blendLinear. Only difficulty here would be that it requires the mask and anti-mask, both as float.
So the call would look like cv::blendLinear(bg, raw, mask, Scalar::all(1.0)-mask, raw).

I don't actually know if this will be much faster, but they do have vectorised and OCL implementations.

convert_rgb_to_yuyv (~25%)

Really surprised that there is no convert_rgb_to_yuyv equivalent in cv, but it seems that unrolling the loop a few times in a way that explicitly looks like interleaved moves could be enough to get the compiler to generate the vectorised version without manual trickery.

phlash · 2022-01-29T21:48:15Z

OK! Thanks for the pointer towards cv::blendLinear, I've just tried it and the performance gain seems marginal (~0.8 of previous run time). Digging in, I find that ~1/3 of the time is converting/inverting the mask, the rest blending, so it seems blendLinear is not implicitly using SIMD/OCL? Going back to local loop in a separate source file and playing with gcc intrinsics..

BenBE · 2022-01-29T22:15:19Z

If the initial conversion from uint8_t to float takes so much time, there is a good chance we can remove that conversion time when generating the mask in the first place, as the output from TFlite actually is float already, thus we already spend time to create the uint8_t version, but could just skip this part …

phlash · 2022-01-30T13:27:10Z

Yep - when we get down to shaving a few % that's worth it - looking for 10x improvement through SIMD first 😄

phlash · 2022-01-30T14:52:47Z

So: I tried MMX, loading a pixel at a time (3 channels) into an _m64 executing the multiplies, adds & a shift to divide down, then extracted the pixel back to memory - at least 5x slower than a loop 😞

I went back to the original loop code, separated into it's own source file and applied -march=native -O9, 3-4x faster 😄 This will do for me - it's easier than SIMD, portable and guaranteed not to be worse than we have?

BenBE · 2022-01-30T16:05:54Z

Doing some proper calculations, it's quite obvious from hardware architecture alone that loading 3 bytes at a time into one MMX register for processing is not gonna cut it: You'll have at least tons of unaligned memory accesses and half the pipeline usually unoccupied. Given that we know based on the size restrictions introduced by MJPG from the capture device, we know we'll always have a multiple of 16 pixels*, thus could unroll the per-pixel operation 16x and process 48 bytes at a time.

If we then use some SSE magic, we get an inner kernel that could perform this blending operation with the constraints of just 7 registers** needed.

// Setup:
uint8_t* a,b; // Source image
uint8_t* c; // Mask image
uint8_t* d; // Destination image

// Kernel (advance by a+=48; b+=48; c+=16; d+=48; per iteration; prefetch linerar)
xmm0 = *(a + 0);
xmm1 = *(a + 16);
xmm2 = *(a + 32);
separateRGB(xmm0, xmm1, xmm2 --(xmm6)--> xmm0, xmm1, xmm2);

xmm3 = *(b + 0);
xmm4 = *(b + 16);
xmm5 = *(b + 32);
separateRGB(xmm3, xmm4, xmm5 --(xmm6)--> xmm3, xmm4, xmm5);

xmm6 = *(c + 0); // ***

xmm0 = xmm0 * xmm6 >> 8;
xmm1 = xmm1 * xmm6 >> 8;
xmm2 = xmm2 * xmm6 >> 8;

xmm6 = ~xmm6;

xmm3 = xmm3 * xmm6 >> 8;
xmm4 = xmm4 * xmm6 >> 8;
xmm5 = xmm5 * xmm6 >> 8;

xmm0 += xmm3;
xmm1 += xmm4;
xmm2 += xmm5;

mergeRGB(xmm0, xmm1, xmm2 --(xmm6)--> xmm0, xmm1, xmm2);
*(d + 0) = xmm0;
*(d + 16) = xmm1;
*(d + 32) = xmm2;

Proper pipelining of the instructions inside the kernel loop should allow for the actual memory fetches and register assignments to run with minimal congestion.

Actually we get multiples of 64 pixels, as width and height of the image are both restricted to multiples of 8. We currently don't enforce this, but the backing camera device will given we mostly use MJPG or related codecs for video input.

** Technically we have 8 available for SSE, but with clever ordering of things you might be able to get down to only requiring 7 to hold all (intermediate) data.

*** If you happen to get this move scheduled before loading the second image you could start linear-blending the source image while still loading the data for the target image.

phlash · 2022-02-01T17:09:09Z

Sounds like you know waay more than I ever will @BenBE 😄 I've never looked into SIMD stuff before now, so my feeble attempt was through cut/paste/fixup from stack overflow of an example MMX alpha blend function.

My current pr in #139 reduces the blend time to just under 1ms for the majority of executions (I'm simply using the execution timing built into the app), I would be delighted to see SIMD code based on the above analysis, then we can decide if the additional complexity is worth the time saving?

BenBE · 2022-02-01T17:59:10Z

Just noticed you could even skip the channel reordering, because the pixel scaling is uniform across all channels, which makes the kernel even simpler …

Will have to look into the actual GCC intrinsic stuff though.

BenBE added help wanted Extra attention is needed experimental Things we want to play with outside the stable build chore Things that need to happen but don't change the behaviour labels Jan 24, 2022

phlash mentioned this issue Jan 30, 2022

Performance enhancements #139

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimization potential #138

Optimization potential #138

BenBE commented Jan 24, 2022

floe commented Jan 25, 2022

BenBE commented Jan 25, 2022

phlash commented Jan 26, 2022 •

edited

Loading

BenBE commented Jan 26, 2022

martok commented Jan 29, 2022

phlash commented Jan 29, 2022

BenBE commented Jan 29, 2022

phlash commented Jan 30, 2022

phlash commented Jan 30, 2022

BenBE commented Jan 30, 2022

phlash commented Feb 1, 2022

BenBE commented Feb 1, 2022

Optimization potential #138

Optimization potential #138

Comments

BenBE commented Jan 24, 2022

floe commented Jan 25, 2022

BenBE commented Jan 25, 2022

phlash commented Jan 26, 2022 • edited Loading

BenBE commented Jan 26, 2022

martok commented Jan 29, 2022

phlash commented Jan 29, 2022

BenBE commented Jan 29, 2022

phlash commented Jan 30, 2022

phlash commented Jan 30, 2022

BenBE commented Jan 30, 2022

phlash commented Feb 1, 2022

BenBE commented Feb 1, 2022

phlash commented Jan 26, 2022 •

edited

Loading