Replies: 2 comments 7 replies
-
Ported IDCT code: https://gist.github.com/br3aker/6480df12bcbaad4940ea0cabb9a4f6a5

Example benchmark:

```csharp
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;

public class DCT_Benchmark
{
    // Block8x8 / Block8x8F come from ImageSharp; DCT_ImageSharp / DCT_16bit from the gist above.
    Block8x8[] inputInt16 = new Block8x8[5000];
    Block8x8F[] inputFloat32 = new Block8x8F[5000];

    [Benchmark(Description = "Float IDCT")]
    public void FloatBlock()
    {
        for (int i = 0; i < inputFloat32.Length; i++)
        {
            DCT_ImageSharp.IDCT8x8_Avx(ref inputFloat32[i]);
        }
    }

    [Benchmark(Description = "16bit IDCT")]
    public void Fast16bit()
    {
        for (int i = 0; i < inputInt16.Length; i++)
        {
            DCT_16bit.IDCT_Avx_16bit(ref Unsafe.As<Block8x8, short>(ref inputInt16[i]));
        }
    }
}
```
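For reference, a minimal sketch of how a benchmark class like this is typically executed with BenchmarkDotNet; the Program/Main wrapper is illustrative and not part of the gist:

```csharp
using BenchmarkDotNet.Running;

public static class Program
{
    public static void Main()
    {
        // Runs both IDCT benchmarks and prints the summary table.
        BenchmarkRunner.Run<DCT_Benchmark>();
    }
}
```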
-
Yeah, this is unexpected to me as well, but I would have to dig very deep to form any ideas about this particular implementation. I'd rather say it's not worth it, and I agree that we can stay with floating point for now. I also agree that we should shift our focus elsewhere; I would guess Huffman decoding is where the biggest potential is today. Tagging @saucecontrol in case he's around.
-
Hi!
I've ported the libjpeg-turbo fixed-point IDCT for performance testing and actually got a lot worse results, measured in two scenarios:
- With 2 transpose calls / 5000 8x8 blocks
- With 1 transpose call (current ImageSharp implementation) / 5000 8x8 blocks
Yes, the 32-bit float IDCT has twice as much data to process, so the 16-bit version's throughput should be a lot better, but it's not. The backbone of this implementation is the fact that AVX has an FMA-like combined multiply-and-add which takes 16-bit operands and produces 32-bit output. While it's nice to do a multiply/add at half the cost of the usual float operations, we lose the 16-bit values, so we have to do it twice for the low and high halves and then repack the results from 32-bit back to 16-bit, which takes a lot of extra time on top of the float implementation.
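To make that widen/repack round trip concrete, here is a minimal sketch of one fixed-point butterfly step written with the .NET AVX2 intrinsics. It is not the ported gist code; the Q13 scale, the constant layout, and the `RotateQ13` name are illustrative assumptions.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class FixedPointSketch
{
    // Computes out[i] = (a[i] * c1 + b[i] * c2 + round) >> 13 for 16 lanes of 16-bit values.
    public static Vector256<short> RotateQ13(Vector256<short> a, Vector256<short> b, short c1, short c2)
    {
        const int ScaleBits = 13;
        Vector256<int> rounding = Vector256.Create(1 << (ScaleBits - 1));
        Vector256<short> consts = Vector256.Create(
            c1, c2, c1, c2, c1, c2, c1, c2,
            c1, c2, c1, c2, c1, c2, c1, c2);

        // Interleave a and b so the multiply-add sees (a[i], b[i]) pairs.
        Vector256<short> lo = Avx2.UnpackLow(a, b);
        Vector256<short> hi = Avx2.UnpackHigh(a, b);

        // vpmaddwd: 16-bit products summed pairwise into 32-bit lanes.
        // Cheap on its own, but it has to run twice (low and high halves).
        Vector256<int> prodLo = Avx2.MultiplyAddAdjacent(lo, consts);
        Vector256<int> prodHi = Avx2.MultiplyAddAdjacent(hi, consts);

        // Descale and repack the 32-bit sums back to 16-bit - the extra work the float path avoids.
        prodLo = Avx2.ShiftRightArithmetic(Avx2.Add(prodLo, rounding), ScaleBits);
        prodHi = Avx2.ShiftRightArithmetic(Avx2.Add(prodHi, rounding), ScaleBits);
        return Avx2.PackSignedSaturate(prodLo, prodHi);
    }
}
```

Because both the unpack and the pack operate per 128-bit lane, the packed result comes out in the original element order, so no extra permute is needed for this particular step.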
The only benefit for 16-bit fixed point is the transpose, which is a lot faster than the 32-bit float transpose, but we only have 1 transpose call per IDCT call instead of 2 (FDCT will be implemented soon, I promise).
This is counterintuitive: I thought fewer page faults and less memory pressure would do the trick, but it's simply slower. At first I thought the JIT did a bad job at the IL -> asm level, but it looks OK at first glance: no random vector sets or other symptoms of slow SIMD asm code. Even if there are a couple of redundant instructions, that certainly won't negate this performance difference. And does 2x lower memory consumption really matter nowadays? Especially with baseline jpegs, which don't store the entire spectral buffer?
Yet another problem with 16-bit DCTs is that they can only work in the 8-bit sample path, i.e. 12-bit jpegs must use the float DCT. It basically means we would need 2 spectral -> color converters, or an extra virtual call (see the sketch below). Add on top of that possible scaling converters which produce 4x4/2x2/1x1 color blocks from an 8x8 spectral block, for example; each of them would need separate float and 16-bit int implementations.
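A rough sketch of what that duplication looks like; these type names are hypothetical and are not ImageSharp's actual converter API:

```csharp
// Hypothetical illustration only: every spectral -> color converter shape
// would fork into a float variant and a 16-bit variant (or pay a virtual
// call per block), and the same fork repeats for each scaled variant
// (4x4 / 2x2 / 1x1 output blocks).
public abstract class SpectralToColorConverter
{
    public abstract void ConvertBlock(int blockIndex);
}

public sealed class FloatSpectralConverter : SpectralToColorConverter
{
    // Float IDCT + color conversion: works for both 8-bit and 12-bit samples.
    public override void ConvertBlock(int blockIndex) { }
}

public sealed class Int16SpectralConverter : SpectralToColorConverter
{
    // 16-bit fixed-point IDCT + color conversion: 8-bit samples only.
    public override void ConvertBlock(int blockIndex) { }
}
```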
So here's my somewhat-of-a proposal: get over it IMO, the float IDCT/FDCT routines are fast and simple. There are at least 2 other ways to get a lot better performance than these fixed-point shenanigans, which I'll explain a bit later - a bit busy atm.
@antonfirsov @JimBobSquarePants