Replies: 2 comments 7 replies
-
Ported IDCT code: https://gist.github.com/br3aker/6480df12bcbaad4940ea0cabb9a4f6a5

Example benchmark:

```csharp
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;

public class DCT_Benchmark
{
    // Block8x8 / Block8x8F come from ImageSharp; DCT_ImageSharp / DCT_16bit from the gist above.
    Block8x8[] inputInt16 = new Block8x8[5000];
    Block8x8F[] inputFloat32 = new Block8x8F[5000];

    [Benchmark(Description = "Float IDCT")]
    public void FloatBlock()
    {
        for (int i = 0; i < inputFloat32.Length; i++)
        {
            DCT_ImageSharp.IDCT8x8_Avx(ref inputFloat32[i]);
        }
    }

    [Benchmark(Description = "16bit IDCT")]
    public void Fast16bit()
    {
        for (int i = 0; i < inputInt16.Length; i++)
        {
            DCT_16bit.IDCT_Avx_16bit(ref Unsafe.As<Block8x8, short>(ref inputInt16[i]));
        }
    }
}
```
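For reference, a minimal sketch of how a benchmark class like this is typically executed with BenchmarkDotNet; the Program/Main wrapper is illustrative and not part of the gist:

```csharp
using BenchmarkDotNet.Running;

public static class Program
{
    public static void Main()
    {
        // Runs both IDCT benchmarks and prints the summary table.
        BenchmarkRunner.Run<DCT_Benchmark>();
    }
}
```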
-
Yeah, this is unexpected to me as well, but I would have to dig very deep to form any ideas about this particular implementation. I'd rather say it's not worth it, and I agree that we can stay with floating point for now. I also agree that we should shift our focus elsewhere; I would guess Huffman decoding is where the biggest potential is today. Tagging @saucecontrol in case he's around.
-
Hi!
I've ported the libjpeg-turbo fixed-point IDCT for performance testing and actually got a lot worse results, measured in two scenarios:
- With 2 transpose calls / 5000 8x8 blocks
- With 1 transpose call (current ImageSharp implementation) / 5000 8x8 blocks
Yes, the 32-bit float IDCT has twice as much data to process, so the 16-bit version's throughput should be a lot better, but it's not. The backbone of this implementation is the fact that AVX has an FMA-like combined multiply-and-add which takes 16-bit operands and produces 32-bit output. While it's nice to do a multiply/add at half the cost of the usual float operations, we lose the 16-bit values, so we have to do it twice for the low and high halves and then repack the results from 32-bit back to 16-bit, which takes a lot of extra time on top of the float implementation.
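To make that widen/repack round trip concrete, here is a minimal sketch of one fixed-point butterfly step written with the .NET AVX2 intrinsics. It is not the ported gist code; the Q13 scale, the constant layout, and the `RotateQ13` name are illustrative assumptions.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class FixedPointSketch
{
    // Computes out[i] = (a[i] * c1 + b[i] * c2 + round) >> 13 for 16 lanes of 16-bit values.
    public static Vector256<short> RotateQ13(Vector256<short> a, Vector256<short> b, short c1, short c2)
    {
        const int ScaleBits = 13;
        Vector256<int> rounding = Vector256.Create(1 << (ScaleBits - 1));
        Vector256<short> consts = Vector256.Create(
            c1, c2, c1, c2, c1, c2, c1, c2,
            c1, c2, c1, c2, c1, c2, c1, c2);

        // Interleave a and b so the multiply-add sees (a[i], b[i]) pairs.
        Vector256<short> lo = Avx2.UnpackLow(a, b);
        Vector256<short> hi = Avx2.UnpackHigh(a, b);

        // vpmaddwd: 16-bit products summed pairwise into 32-bit lanes.
        // Cheap on its own, but it has to run twice (low and high halves).
        Vector256<int> prodLo = Avx2.MultiplyAddAdjacent(lo, consts);
        Vector256<int> prodHi = Avx2.MultiplyAddAdjacent(hi, consts);

        // Descale and repack the 32-bit sums back to 16-bit - the extra work the float path avoids.
        prodLo = Avx2.ShiftRightArithmetic(Avx2.Add(prodLo, rounding), ScaleBits);
        prodHi = Avx2.ShiftRightArithmetic(Avx2.Add(prodHi, rounding), ScaleBits);
        return Avx2.PackSignedSaturate(prodLo, prodHi);
    }
}
```

Because both the unpack and the pack operate per 128-bit lane, the packed result comes out in the original element order, so no extra permute is needed for this particular step.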
The only benefit for 16-bit fixed point is the transpose, which is a lot faster than the 32-bit float transpose, but we only have 1 transpose call per IDCT call instead of 2 (FDCT will be implemented soon, I promise).
This is counterintuitive: I thought fewer page faults and less memory pressure would do the trick, but it's simply slower. At first I thought the JIT did a bad job at the IL -> asm level, but it looks OK at first glance: no random vector sets or other symptoms of slow SIMD asm code. Even if there are a couple of redundant instructions, that certainly won't negate this performance difference. And does 2x lower memory consumption really matter nowadays? Especially with baseline jpegs, which don't store the entire spectral buffer?
Yet another problem with 16-bit DCTs is that they can only work in the 8-bit sample path, i.e. 12-bit jpegs must use the float DCT. It basically means we would need 2 spectral -> color converters, or an extra virtual call (see the sketch below). Add on top of that possible scaling converters which produce 4x4/2x2/1x1 color blocks from an 8x8 spectral block, for example; each of them would need separate float and 16-bit int implementations.
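A rough sketch of what that duplication looks like; these type names are hypothetical and are not ImageSharp's actual converter API:

```csharp
// Hypothetical illustration only: every spectral -> color converter shape
// would fork into a float variant and a 16-bit variant (or pay a virtual
// call per block), and the same fork repeats for each scaled variant
// (4x4 / 2x2 / 1x1 output blocks).
public abstract class SpectralToColorConverter
{
    public abstract void ConvertBlock(int blockIndex);
}

public sealed class FloatSpectralConverter : SpectralToColorConverter
{
    // Float IDCT + color conversion: works for both 8-bit and 12-bit samples.
    public override void ConvertBlock(int blockIndex) { }
}

public sealed class Int16SpectralConverter : SpectralToColorConverter
{
    // 16-bit fixed-point IDCT + color conversion: 8-bit samples only.
    public override void ConvertBlock(int blockIndex) { }
}
```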
So here's my somewhat-of-a proposal: get over it IMO, the float IDCT/FDCT routines are fast and simple. There are at least 2 other ways to get a lot better performance than these fixed-point shenanigans, which I'll explain a bit later - a bit busy atm.
@antonfirsov @JimBobSquarePants