avcodec/x86/vp3dsp: Port ff_put_vp_no_rnd_pixels8_l2_mmx to SSE2

This allows using pavgb to reduce the number of instructions needed
to calculate the average; processing two rows at a time via movhps
reduces the number of pxor and pavgb instructions even further and
turned out to be beneficial.
This patch also avoids a load, as the constant used here can easily
be generated at runtime.
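
For illustration only (this is not the assembly from the patch): pavgb
computes a byte average that rounds up, so a truncating ("no_rnd")
average can be obtained by inverting both inputs, averaging, and
inverting the result. The scalar C sketch below models one byte lane.
It assumes the runtime-generated constant is an all-ones mask used with
pxor, which the message does not state explicitly; all function names
are hypothetical. A mask-and-shift formulation that needs a 0xFE
constant is included for comparison.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one byte lane of pavgb: (a + b + 1) >> 1, rounds up. */
static uint8_t pavgb_lane(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* Truncating average via the rounding one: invert, average, invert.
 * ASSUMPTION: the all-ones mask stands in for the constant that the
 * commit message says is generated at runtime. */
static uint8_t avg_no_rnd_via_pavgb(uint8_t a, uint8_t b)
{
    const uint8_t ones = 0xFF;
    return (uint8_t)(pavgb_lane((uint8_t)(a ^ ones), (uint8_t)(b ^ ones)) ^ ones);
}

/* Mask-and-shift formulation of the same truncating average;
 * it needs a 0xFE constant instead of pavgb. */
static uint8_t avg_no_rnd_mask_shift(uint8_t a, uint8_t b)
{
    return (uint8_t)((a & b) + (((a ^ b) & 0xFE) >> 1));
}

/* Exhaustively compare both formulations against (a + b) >> 1. */
static int count_mismatches(void)
{
    int bad = 0;
    for (int a = 0; a < 256; a++) {
        for (int b = 0; b < 256; b++) {
            const uint8_t want = (uint8_t)((a + b) >> 1);
            if (avg_no_rnd_via_pavgb((uint8_t)a, (uint8_t)b) != want ||
                avg_no_rnd_mask_shift((uint8_t)a, (uint8_t)b) != want)
                bad++;
        }
    }
    return bad;
}

/* Convenience self-check over all 65536 input pairs. */
static void self_check(void)
{
    assert(count_mismatches() == 0);
}
```

Calling self_check() exercises every pair of byte values. In a SIMD
version the two input inversions and the output inversion would each be
a single pxor against the all-ones register, which is why handling two
rows per register lowers the pxor and pavgb count per row.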

Old benchmarks:
put_no_rnd_pixels_l2_c:                                 13.3 ( 1.00x)
put_no_rnd_pixels_l2_mmx:                               11.6 ( 1.15x)

New benchmarks:
put_no_rnd_pixels_l2_c:                                 13.4 ( 1.00x)
put_no_rnd_pixels_l2_sse2:                               7.5 ( 1.77x)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Andreas Rheinhardt
2026-04-12 22:39:04 +02:00
parent 37bc3a237b
commit 84b9de0633
3 changed files with 28 additions and 36 deletions


@@ -47,7 +47,7 @@ static void vp3_check_put_no_rnd_pixels_l2(const VP3DSPContext *const vp3dsp)
BUF_SIZE = MAX_STRIDE * (HEIGHT - 1) + WIDTH,
SRC_BUF_SIZE = BUF_SIZE + (WIDTH - 1), ///< WIDTH-1 to use misaligned input
};
-declare_func_emms(AV_CPU_FLAG_MMX, void, uint8_t *dst,
+declare_func(void, uint8_t *dst,
const uint8_t *a, const uint8_t *b,
ptrdiff_t stride, int h);