Commit Graph

52871 Commits

Author SHA1 Message Date
Manuel Lauss
3945d100ef avcodec/sanm: remove unused SANMFrameHeader
Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>
2025-10-09 10:29:00 +02:00
Manuel Lauss
2ef26c30eb avcodec/sanm: implement BL16 subcodecs 1 and 7
Both of these encode a quarter-sized keyframe, with missing pixels
interpolated from the immediate neighbours.

Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>
2025-10-09 10:29:00 +02:00
Manuel Lauss
b1a7f8b7cf avcodec/sanm: factor out the ANIM decoding into separate function
Mainly for readability. No functional changes.

Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>
2025-10-09 10:29:00 +02:00
Manuel Lauss
49c552d066 avcodec/sanm: restructure SANM like the other block codecs
Restructure the SANM (or BL16 as LucasArts calls it) decoder to make it
look like the others, as it is basically a development of old_codec47
for rgb565 values.

No functional changes.

Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>
2025-10-09 10:29:00 +02:00
Manuel Lauss
4d5e87eaa4 Revert "avcodec/sanm: Check w,h,left,top"
This reverts commit 134fbfd1dc.

As it breaks valid uses of this in Rebel Assault 1 videos.

Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>
2025-10-09 10:29:00 +02:00
Manuel Lauss
75b6937527 avcodec/sanm: reset rotate_code every iteration
and eliminate the explicit reset in the other decoders that
don't need it.

Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>
2025-10-09 10:29:00 +02:00
Manuel Lauss
de7db62acc avcodec/sanm: rename process_block to codec47_block
The new name better indicates where it belongs.

Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>
2025-10-09 10:29:00 +02:00
Manuel Lauss
043dafc4c2 avcodec/sanm: codec37/47/48 size checks
Add more size checks to old_codec37/47/48, esp. the headers.

Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>
2025-10-09 10:29:00 +02:00
Manuel Lauss
f98cd66b4b avcodec/sanm: codec47: read the small codebook
codec47 carries a 4-byte small codebook in its header. Read those
4 bytes into context member instead of awkwardly redirecting the
bytestream pointer every time it needs to be accessed.

Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>
2025-10-09 10:29:00 +02:00
Manuel Lauss
72e6206c88 avcodec/sanm: partially fix codec48
The mv check introduced with d5bdb0b705 broke MotS videos:
- their height (300 lines) is 37.5 blocks; unfortunately the videos try to
  access up to 1 block more.
  Extend the mv check to the aligned_height, which fixes most artifacts.
- don't return an error when an mv is invalid; rather skip the (sub)block.
  Gets rid of almost all artifacts.

Some artifacts still remain, esp in space scenes where the original
encoder apparently fetched black pixels from outside of the aligned
height.  An increase of the buffer size by 8 lines will fix that later.

Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>
2025-10-09 10:29:00 +02:00
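As a rough illustration of the motion-vector check described in 72e6206c88 (not the decoder's actual code): the sketch below assumes 8-line blocks (300 / 8 = 37.5, as stated above), rounds the height up to a whole number of blocks, and reports whether a block fetched through a motion vector stays inside the aligned area. The names `align_up` and `mv_block_ok` and the width of 640 are hypothetical.

```python
BLOCK = 8  # assumed block size: 300 lines / 8 = 37.5 blocks


def align_up(value, block=BLOCK):
    """Round value up to the next multiple of block."""
    return (value + block - 1) // block * block


def mv_block_ok(x, y, mvx, mvy, width, height, block=BLOCK):
    """Return True if the source block addressed by (mvx, mvy) lies entirely
    inside the frame, with the height rounded up to whole blocks."""
    aligned_h = align_up(height, block)
    sx, sy = x + mvx, y + mvy
    return 0 <= sx and sx + block <= width and 0 <= sy and sy + block <= aligned_h


# A block in the last (partial) block row of a 300-line video:
print(align_up(300))                        # 304
print(mv_block_ok(0, 296, 0, 0, 640, 300))  # True: ends at line 304
print(mv_block_ok(0, 296, 0, 1, 640, 300))  # False: would reach line 305
```

Following the second bullet of the commit message, a caller would skip the block when the check fails rather than reject the whole frame.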
Manuel Lauss
24ce42b406 avcodec/sanm: codec4 improvements
- don't draw outside the buffers
- don't wrap around when coordinates go over the edge

Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>
2025-10-09 10:29:00 +02:00
Manuel Lauss
dfe4a0626f avcodec/sanm: codec31 improvements
- don't draw outside the buffers
- don't wrap around when coordinates go over the edge
  this is especially noticeable in e.g. the O1OPEN.ANM and C1C3PO.ANM
  RA1 files, where planets wrap around.

Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>
2025-10-09 10:29:00 +02:00
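The two bullet points that recur in these codec improvement commits amount to clipping what is drawn against the buffer instead of letting it spill over or wrap. A minimal sketch of that idea on a flat, row-major buffer follows; `draw_run` and its parameters are hypothetical and do not mirror the decoder's functions.

```python
def draw_run(buf, width, height, x, y, pixels):
    """Copy a horizontal run of pixels to (x, y) in a row-major buffer,
    dropping everything that falls outside the width x height area."""
    if y < 0 or y >= height:
        return 0                       # whole run is above or below the buffer
    start = max(x, 0)                  # clip at the left edge
    end = min(x + len(pixels), width)  # clip at the right edge
    if start >= end:
        return 0
    row = y * width
    buf[row + start:row + end] = pixels[start - x:end - x]
    return end - start                 # number of pixels actually written


buf = [0] * (4 * 2)                    # 4x2 buffer
written = draw_run(buf, 4, 2, 2, 0, [7, 7, 7, 7])
print(written, buf)                    # 2 [0, 0, 7, 7, 0, 0, 0, 0]
```

Without the right-edge clip, the last two pixels of the run would land at the start of the next row, which is the wrap-around effect the commit messages mention.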
Manuel Lauss
da4b88494c avcodec/sanm: codec1 improvements
- don't draw outside the buffers
- don't wrap around when coordinates go over the edge
  this is especially noticeable in e.g. the O1OPEN.ANM and C1C3PO.ANM
  RA1 files, where planets wrap around.

Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>
2025-10-09 10:28:59 +02:00
Manuel Lauss
d18c25f1a9 avcodec/sanm: codec21 improvements
- don't draw outside the buffers
- don't wrap around when coordinates go over the edge

Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>
2025-10-09 10:28:59 +02:00
Manuel Lauss
67b28acba3 avcodec/sanm: codec23 improvements
- don't draw outside the buffers
- don't wrap around when coordinates go over the edge

Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>
2025-10-09 10:28:59 +02:00
James Almer
4377affc28 avcodec/hevc/refs: don't unconditionally discard non-IRAP frames if no IRAP frame was seen before
Should fix issue #20661

Signed-off-by: James Almer <jamrial@gmail.com>
2025-10-09 02:52:46 +00:00
Andreas Rheinhardt
378d5bb08a avcodec/x86/fpel: Add blocksize x blocksize avg/put functions
This commit deduplicates the wrappers around the fpel functions
for copying whole blocks (i.e. height equaling width). It does
this in a manner which avoids having push/pop function arguments
when the calling convention forces one to pass them on the stack
(as in 32bit systems).

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-08 20:40:53 +02:00
Andreas Rheinhardt
ad498f9759 avcodec/x86/cavsdsp: Remove MMXEXT Qpeldsp
Superseded by SSE2. Saves about 11630B here.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-08 20:40:08 +02:00
Andreas Rheinhardt
650098955e avcodec/x86/cavs_qpel: Add SSE2 vertical motion compensation
This is not based on the MMXEXT one, because the latter is quite
suboptimal: Motion vector types mc01 and mc03 (vertical motion vectors
with remainder of one quarter or three quarter) use different neighboring
lines for interpolation: mc01 uses two lines above and two lines below,
mc03 one line above and three lines below. The MMXEXT code uses
a common macro for all of them and therefore reads six lines
before it processes them (even reading lines which are not used
at all), leading to severe register pressure.

Another difference to the old code is that the positive and negative
parts of the sum to calculate are accumulated separately and
the subtraction is performed with unsigned saturation, so
that one can avoid biasing the sum.

The fact that the mc01 and mc03 filter coefficients are mirrors
of each other has been exploited to reduce mc01 to mc03.

But of course the most important difference between
this code and the MMXEXT one is that XMM registers allow processing
eight words at a time, ideal for 8x8 subblocks,
whereas the MMXEXT code processes them in 4x8 or 4x16 blocks.

Benchmarks:
avg_cavs_qpel_pixels_tab[0][4]_c:                      917.0 ( 1.00x)
avg_cavs_qpel_pixels_tab[0][4]_mmxext:                 222.0 ( 4.13x)
avg_cavs_qpel_pixels_tab[0][4]_sse2:                    89.0 (10.31x)
avg_cavs_qpel_pixels_tab[0][12]_c:                     885.7 ( 1.00x)
avg_cavs_qpel_pixels_tab[0][12]_mmxext:                223.2 ( 3.97x)
avg_cavs_qpel_pixels_tab[0][12]_sse2:                   88.5 (10.01x)
avg_cavs_qpel_pixels_tab[1][4]_c:                      222.4 ( 1.00x)
avg_cavs_qpel_pixels_tab[1][4]_mmxext:                  57.2 ( 3.89x)
avg_cavs_qpel_pixels_tab[1][4]_sse2:                    23.3 ( 9.55x)
avg_cavs_qpel_pixels_tab[1][12]_c:                     216.0 ( 1.00x)
avg_cavs_qpel_pixels_tab[1][12]_mmxext:                 57.4 ( 3.76x)
avg_cavs_qpel_pixels_tab[1][12]_sse2:                   22.6 ( 9.56x)
put_cavs_qpel_pixels_tab[0][4]_c:                      750.9 ( 1.00x)
put_cavs_qpel_pixels_tab[0][4]_mmxext:                 210.4 ( 3.57x)
put_cavs_qpel_pixels_tab[0][4]_sse2:                    84.2 ( 8.92x)
put_cavs_qpel_pixels_tab[0][12]_c:                     731.6 ( 1.00x)
put_cavs_qpel_pixels_tab[0][12]_mmxext:                210.7 ( 3.47x)
put_cavs_qpel_pixels_tab[0][12]_sse2:                   84.1 ( 8.70x)
put_cavs_qpel_pixels_tab[1][4]_c:                      191.7 ( 1.00x)
put_cavs_qpel_pixels_tab[1][4]_mmxext:                  53.8 ( 3.56x)
put_cavs_qpel_pixels_tab[1][4]_sse2:                    24.5 ( 7.83x)
put_cavs_qpel_pixels_tab[1][12]_c:                     179.1 ( 1.00x)
put_cavs_qpel_pixels_tab[1][12]_mmxext:                 53.9 ( 3.32x)
put_cavs_qpel_pixels_tab[1][12]_sse2:                   24.0 ( 7.47x)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-08 20:40:08 +02:00
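The separate accumulation of positive and negative terms mentioned in 650098955e can be modelled in scalar code. The sketch below borrows the filter taps quoted in cc2e2f12ca (-1, -2, 96, 42, -7, rounding constant 64, shift by 7) purely as an example and emulates an unsigned saturating 16-bit subtraction, the kind of operation x86 provides as `psubusw`. It is a model of the idea, not a transcription of the assembly, and the helper names are hypothetical.

```python
def sub_sat_u16(a, b):
    """Unsigned 16-bit subtraction that saturates at 0 instead of wrapping."""
    return max(a - b, 0) & 0xFFFF


def filter_split(s0, s1, s2, s3, s4):
    """Evaluate (-s0 - 2*s1 + 96*s2 + 42*s3 - 7*s4 + 64) >> 7, clipped to 0..255,
    keeping the positive and negative terms in separate unsigned sums."""
    pos = 96 * s2 + 42 * s3 + 64   # at most 35254, fits in 16 unsigned bits
    neg = s0 + 2 * s1 + 7 * s4     # at most 2550
    return min(sub_sat_u16(pos, neg) >> 7, 255)


def filter_reference(s0, s1, s2, s3, s4):
    """Same filter with unbounded integers, for comparison."""
    total = -s0 - 2 * s1 + 96 * s2 + 42 * s3 - 7 * s4 + 64
    return min(max(total >> 7, 0), 255)


print(filter_split(255, 255, 0, 0, 255))   # 0   (negative sum saturates)
print(filter_split(0, 0, 255, 255, 0))     # 255 (large sum clips high)
print(filter_split(10, 20, 100, 120, 30))  # 112
```

Because the saturating subtraction already maps every negative result to 0, no bias has to be added to keep the intermediate value non-negative.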
Andreas Rheinhardt
74a88c0c11 avcodec/x86/cavsdsp: Add SSE2 mc20 horizontal motion compensation
Basically a direct port of the MMXEXT one. The main difference
is of course that one can process eight pixels (unpacked to words)
at a time, leading to speedups.

avg_cavs_qpel_pixels_tab[0][2]_c:                      700.1 ( 1.00x)
avg_cavs_qpel_pixels_tab[0][2]_mmxext:                 158.1 ( 4.43x)
avg_cavs_qpel_pixels_tab[0][2]_sse2:                    86.0 ( 8.14x)
avg_cavs_qpel_pixels_tab[1][2]_c:                      171.9 ( 1.00x)
avg_cavs_qpel_pixels_tab[1][2]_mmxext:                  39.4 ( 4.36x)
avg_cavs_qpel_pixels_tab[1][2]_sse2:                    21.7 ( 7.92x)
put_cavs_qpel_pixels_tab[0][2]_c:                      525.7 ( 1.00x)
put_cavs_qpel_pixels_tab[0][2]_mmxext:                 148.5 ( 3.54x)
put_cavs_qpel_pixels_tab[0][2]_sse2:                    75.2 ( 6.99x)
put_cavs_qpel_pixels_tab[1][2]_c:                      129.5 ( 1.00x)
put_cavs_qpel_pixels_tab[1][2]_mmxext:                  36.7 ( 3.53x)
put_cavs_qpel_pixels_tab[1][2]_sse2:                    19.0 ( 6.81x)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-08 20:40:08 +02:00
Andreas Rheinhardt
cc2e2f12ca avcodec/x86/cavsdsp: Fix vertical qpel motion compensation
The prediction involves terms of the form
(-1 * s0 - 2 * s1 + 96 * s2 + 42 * s3 - 7 * s4 + 64) >> 7,
where the s values are in the range of 0..255.
The sum can have values in the range -2550..35190, which
does not fit into a signed 16bit integer. The code uses
an arithmetic right shift, which does not yield the correct
result for values >= 2^15; such values should be clipped
to 255, yet are clipped to 0 instead.

Fix this by shifting the values by 4096, so that the range
is positive, use a logical right shift and subtract 32.

bunny.mp4 from the FATE suite can be used to reproduce the problem.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-08 20:40:08 +02:00
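The failure mode and the fix described in cc2e2f12ca can be reproduced with a small scalar model. The sketch emulates 16-bit lanes by masking; the helper names are hypothetical, and the code only models the arithmetic described in the commit message, not the actual assembly.

```python
def as_int16(v):
    """Reinterpret the low 16 bits of v as a signed 16-bit integer."""
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v


def clip_u8(v):
    """Clip v to the 0..255 pixel range."""
    return min(max(v, 0), 255)


def old_result(total):
    """Arithmetic right shift on a signed 16-bit lane (the buggy behaviour)."""
    return clip_u8(as_int16(total + 64) >> 7)


def new_result(total):
    """Bias by 4096, logical right shift on an unsigned lane, then subtract 32."""
    biased = (total + 64 + 4096) & 0xFFFF  # 1610..39350, always positive
    return clip_u8((biased >> 7) - 32)


total = 96 * 255 + 42 * 255                # 35190, the maximum of the sum
print(old_result(total))                   # 0   (wrapped to a negative value)
print(new_result(total))                   # 255 (correctly clipped)
```

The bias works because 4096 is exactly 32 << 7, so subtracting 32 after the shift undoes it without changing the rounding.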
Andreas Rheinhardt
ec2fe94b3f avcodec/cavs: Remove unused parameter
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-08 20:40:08 +02:00
Michael Niedermayer
7896cc67c1 avcodec/exr: Check that DWA has 3 channels
The implementation hardcodes access to 3 channels, so we need to check that.
Fixes: out of array access
Fixes: BIGSLEEP-445394503-crash.exr

Found-by: Google Big Sleep
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2025-10-08 00:27:49 +00:00
Michael Niedermayer
c911e00011 avcodec/exr: Round dc_w/h up
Without rounding them up there are too few dc coeffs for the blocks.
We do not know if this way of handling odd dimensions is correct, as we have
no such DWA sample.
Thus we ask the user for a sample if she encounters such a file.

Fixes: out of array access
Fixes: BIGSLEEP-445392027-crash.exr

Found-by: Google Big Sleep
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2025-10-08 00:27:49 +00:00
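A short illustration of why the rounding in c911e00011 matters: with truncating division, a dimension that is not a multiple of the block size yields fewer coefficients than there are blocks. The block size of 8 and the function names are assumptions made for this sketch, not values taken from the commit.

```python
BLOCK = 8  # assumed block size for this sketch


def blocks_truncated(size, block=BLOCK):
    """Block count with truncating division: drops a trailing partial block."""
    return size // block


def blocks_rounded_up(size, block=BLOCK):
    """Block count rounded up: counts a trailing partial block."""
    return (size + block - 1) // block


width = 21                       # not a multiple of the block size
print(blocks_truncated(width))   # 2
print(blocks_rounded_up(width))  # 3
```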
Michael Niedermayer
8e078826da avcodec/exr: check ac_size
Fixes: out of array read
Fixes: dwa_uncompress.py.crash.exr

The code will read from the ac data even if ac_size is 0, thus that case
is not implemented and we ask for a sample and error out cleanly

Found-by: Google Big Sleep

Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2025-10-08 00:27:49 +00:00
Baptiste Coudurier
ef60d5ac32 general: fix warning 'av_malloc_array' sizes specified with 'sizeof'
in the earlier argument and not in the later argument [-Wcalloc-transposed-args]

Fixes trac ticket #11620
2025-10-07 14:51:46 -07:00
Andreas Rheinhardt
00225e9ebc avcodec/x86/h264_qpel: Simplify macros
1. Remove the OP parameter from the QPEL_H264* macros. These are
a remnant of inline assembly and were forgotten in
610e00b359.
2. Pass the instruction set extension for the shift5 function
explicitly in the macro instead of using magic #defines.
3. Likewise, avoid magic #defines for (8|16)_v_lowpass_ssse3.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-07 18:06:40 +02:00
Andreas Rheinhardt
3049694e9f avcodec/x86/h264_qpel: Split hv2_lowpass_sse2 into size 8,16 funcs
This is beneficial size-wise: 384B of new asm functions are more
than outweighed by 416B savings from simpler calls here (for size 16,
the size 8 function had been called twice).
It also makes the code more readable, as it allowed removing
several wrappers in h264_qpel.c.

It is also beneficial performance-wise. Old benchmarks:
avg_h264_qpel_16_mc12_8_c:                            1757.7 ( 1.00x)
avg_h264_qpel_16_mc12_8_sse2:                          197.7 ( 8.89x)
avg_h264_qpel_16_mc12_8_ssse3:                         204.6 ( 8.59x)
avg_h264_qpel_16_mc21_8_c:                            1631.6 ( 1.00x)
avg_h264_qpel_16_mc21_8_sse2:                          276.4 ( 5.90x)
avg_h264_qpel_16_mc21_8_ssse3:                         290.7 ( 5.61x)
avg_h264_qpel_16_mc22_8_c:                            1122.7 ( 1.00x)
avg_h264_qpel_16_mc22_8_sse2:                          179.5 ( 6.25x)
avg_h264_qpel_16_mc22_8_ssse3:                         181.8 ( 6.17x)
avg_h264_qpel_16_mc23_8_c:                            1626.7 ( 1.00x)
avg_h264_qpel_16_mc23_8_sse2:                          276.8 ( 5.88x)
avg_h264_qpel_16_mc23_8_ssse3:                         290.9 ( 5.59x)
avg_h264_qpel_16_mc32_8_c:                            1754.1 ( 1.00x)
avg_h264_qpel_16_mc32_8_sse2:                          193.8 ( 9.05x)
avg_h264_qpel_16_mc32_8_ssse3:                         203.6 ( 8.62x)
put_h264_qpel_16_mc12_8_c:                            1733.6 ( 1.00x)
put_h264_qpel_16_mc12_8_sse2:                          189.6 ( 9.14x)
put_h264_qpel_16_mc12_8_ssse3:                         199.6 ( 8.69x)
put_h264_qpel_16_mc21_8_c:                            1616.0 ( 1.00x)
put_h264_qpel_16_mc21_8_sse2:                          284.3 ( 5.69x)
put_h264_qpel_16_mc21_8_ssse3:                         296.5 ( 5.45x)
put_h264_qpel_16_mc22_8_c:                             963.7 ( 1.00x)
put_h264_qpel_16_mc22_8_sse2:                          169.9 ( 5.67x)
put_h264_qpel_16_mc22_8_ssse3:                         186.1 ( 5.18x)
put_h264_qpel_16_mc23_8_c:                            1607.2 ( 1.00x)
put_h264_qpel_16_mc23_8_sse2:                          275.0 ( 5.84x)
put_h264_qpel_16_mc23_8_ssse3:                         297.8 ( 5.40x)
put_h264_qpel_16_mc32_8_c:                            1734.7 ( 1.00x)
put_h264_qpel_16_mc32_8_sse2:                          189.4 ( 9.16x)
put_h264_qpel_16_mc32_8_ssse3:                         199.4 ( 8.70x)

New benchmarks:
avg_h264_qpel_16_mc12_8_c:                            1743.7 ( 1.00x)
avg_h264_qpel_16_mc12_8_sse2:                          189.7 ( 9.19x)
avg_h264_qpel_16_mc12_8_ssse3:                         204.4 ( 8.53x)
avg_h264_qpel_16_mc21_8_c:                            1637.7 ( 1.00x)
avg_h264_qpel_16_mc21_8_sse2:                          267.7 ( 6.12x)
avg_h264_qpel_16_mc21_8_ssse3:                         291.5 ( 5.62x)
avg_h264_qpel_16_mc22_8_c:                            1150.3 ( 1.00x)
avg_h264_qpel_16_mc22_8_sse2:                          164.6 ( 6.99x)
avg_h264_qpel_16_mc22_8_ssse3:                         182.1 ( 6.32x)
avg_h264_qpel_16_mc23_8_c:                            1635.3 ( 1.00x)
avg_h264_qpel_16_mc23_8_sse2:                          268.5 ( 6.09x)
avg_h264_qpel_16_mc23_8_ssse3:                         298.5 ( 5.48x)
avg_h264_qpel_16_mc32_8_c:                            1740.6 ( 1.00x)
avg_h264_qpel_16_mc32_8_sse2:                          182.6 ( 9.53x)
avg_h264_qpel_16_mc32_8_ssse3:                         201.9 ( 8.62x)
put_h264_qpel_16_mc12_8_c:                            1727.4 ( 1.00x)
put_h264_qpel_16_mc12_8_sse2:                          188.1 ( 9.18x)
put_h264_qpel_16_mc12_8_ssse3:                         199.6 ( 8.65x)
put_h264_qpel_16_mc21_8_c:                            1623.5 ( 1.00x)
put_h264_qpel_16_mc21_8_sse2:                          265.9 ( 6.11x)
put_h264_qpel_16_mc21_8_ssse3:                         299.4 ( 5.42x)
put_h264_qpel_16_mc22_8_c:                             954.0 ( 1.00x)
put_h264_qpel_16_mc22_8_sse2:                          161.8 ( 5.89x)
put_h264_qpel_16_mc22_8_ssse3:                         180.4 ( 5.29x)
put_h264_qpel_16_mc23_8_c:                            1611.2 ( 1.00x)
put_h264_qpel_16_mc23_8_sse2:                          265.8 ( 6.06x)
put_h264_qpel_16_mc23_8_ssse3:                         300.3 ( 5.37x)
put_h264_qpel_16_mc32_8_c:                            1734.5 ( 1.00x)
put_h264_qpel_16_mc32_8_sse2:                          180.0 ( 9.63x)
put_h264_qpel_16_mc32_8_ssse3:                         199.7 ( 8.69x)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-07 18:06:40 +02:00
Andreas Rheinhardt
3ed590c7b9 avcodec/x86/h264_qpel: Port qpel8or16_hv2_lowpass_op_mmxext to SSE2
This means that only blocksize 4 still uses mmx(ext).

Old benchmarks:
avg_h264_qpel_8_mc12_8_c:                              428.4 ( 1.00x)
avg_h264_qpel_8_mc12_8_sse2:                            74.3 ( 5.77x)
avg_h264_qpel_8_mc12_8_ssse3:                           69.3 ( 6.18x)
avg_h264_qpel_8_mc21_8_c:                              401.4 ( 1.00x)
avg_h264_qpel_8_mc21_8_sse2:                            97.8 ( 4.10x)
avg_h264_qpel_8_mc21_8_ssse3:                           93.7 ( 4.28x)
avg_h264_qpel_8_mc22_8_c:                              281.8 ( 1.00x)
avg_h264_qpel_8_mc22_8_sse2:                            66.7 ( 4.23x)
avg_h264_qpel_8_mc22_8_ssse3:                           62.6 ( 4.50x)
avg_h264_qpel_8_mc23_8_c:                              397.2 ( 1.00x)
avg_h264_qpel_8_mc23_8_sse2:                            97.9 ( 4.06x)
avg_h264_qpel_8_mc23_8_ssse3:                           93.7 ( 4.24x)
avg_h264_qpel_8_mc32_8_c:                              432.4 ( 1.00x)
avg_h264_qpel_8_mc32_8_sse2:                            73.9 ( 5.85x)
avg_h264_qpel_8_mc32_8_ssse3:                           69.5 ( 6.22x)
avg_h264_qpel_16_mc12_8_c:                            1756.4 ( 1.00x)
avg_h264_qpel_16_mc12_8_sse2:                          240.0 ( 7.32x)
avg_h264_qpel_16_mc12_8_ssse3:                         204.5 ( 8.59x)
avg_h264_qpel_16_mc21_8_c:                            1635.3 ( 1.00x)
avg_h264_qpel_16_mc21_8_sse2:                          321.2 ( 5.09x)
avg_h264_qpel_16_mc21_8_ssse3:                         288.5 ( 5.67x)
avg_h264_qpel_16_mc22_8_c:                            1130.8 ( 1.00x)
avg_h264_qpel_16_mc22_8_sse2:                          219.4 ( 5.15x)
avg_h264_qpel_16_mc22_8_ssse3:                         182.2 ( 6.21x)
avg_h264_qpel_16_mc23_8_c:                            1622.5 ( 1.00x)
avg_h264_qpel_16_mc23_8_sse2:                          321.3 ( 5.05x)
avg_h264_qpel_16_mc23_8_ssse3:                         289.5 ( 5.60x)
avg_h264_qpel_16_mc32_8_c:                            1762.5 ( 1.00x)
avg_h264_qpel_16_mc32_8_sse2:                          236.1 ( 7.46x)
avg_h264_qpel_16_mc32_8_ssse3:                         205.2 ( 8.59x)
put_h264_qpel_8_mc12_8_c:                              427.2 ( 1.00x)
put_h264_qpel_8_mc12_8_sse2:                            72.1 ( 5.93x)
put_h264_qpel_8_mc12_8_ssse3:                           67.0 ( 6.38x)
put_h264_qpel_8_mc21_8_c:                              402.9 ( 1.00x)
put_h264_qpel_8_mc21_8_sse2:                            95.9 ( 4.20x)
put_h264_qpel_8_mc21_8_ssse3:                           91.9 ( 4.38x)
put_h264_qpel_8_mc22_8_c:                              235.0 ( 1.00x)
put_h264_qpel_8_mc22_8_sse2:                            64.6 ( 3.64x)
put_h264_qpel_8_mc22_8_ssse3:                           60.0 ( 3.92x)
put_h264_qpel_8_mc23_8_c:                              403.6 ( 1.00x)
put_h264_qpel_8_mc23_8_sse2:                            95.9 ( 4.21x)
put_h264_qpel_8_mc23_8_ssse3:                           91.7 ( 4.40x)
put_h264_qpel_8_mc32_8_c:                              430.7 ( 1.00x)
put_h264_qpel_8_mc32_8_sse2:                            72.1 ( 5.97x)
put_h264_qpel_8_mc32_8_ssse3:                           67.0 ( 6.43x)
put_h264_qpel_16_mc12_8_c:                            1724.2 ( 1.00x)
put_h264_qpel_16_mc12_8_sse2:                          230.7 ( 7.47x)
put_h264_qpel_16_mc12_8_ssse3:                         199.8 ( 8.63x)
put_h264_qpel_16_mc21_8_c:                            1613.3 ( 1.00x)
put_h264_qpel_16_mc21_8_sse2:                          327.5 ( 4.93x)
put_h264_qpel_16_mc21_8_ssse3:                         297.2 ( 5.43x)
put_h264_qpel_16_mc22_8_c:                             959.2 ( 1.00x)
put_h264_qpel_16_mc22_8_sse2:                          211.9 ( 4.53x)
put_h264_qpel_16_mc22_8_ssse3:                         186.1 ( 5.15x)
put_h264_qpel_16_mc23_8_c:                            1619.0 ( 1.00x)
put_h264_qpel_16_mc23_8_sse2:                          319.7 ( 5.06x)
put_h264_qpel_16_mc23_8_ssse3:                         299.2 ( 5.41x)
put_h264_qpel_16_mc32_8_c:                            1741.7 ( 1.00x)
put_h264_qpel_16_mc32_8_sse2:                          230.9 ( 7.54x)
put_h264_qpel_16_mc32_8_ssse3:                         199.4 ( 8.74x)

New benchmarks:
avg_h264_qpel_8_mc12_8_c:                              427.2 ( 1.00x)
avg_h264_qpel_8_mc12_8_sse2:                            63.9 ( 6.69x)
avg_h264_qpel_8_mc12_8_ssse3:                           69.2 ( 6.18x)
avg_h264_qpel_8_mc21_8_c:                              399.2 ( 1.00x)
avg_h264_qpel_8_mc21_8_sse2:                            87.7 ( 4.55x)
avg_h264_qpel_8_mc21_8_ssse3:                           93.9 ( 4.25x)
avg_h264_qpel_8_mc22_8_c:                              285.7 ( 1.00x)
avg_h264_qpel_8_mc22_8_sse2:                            56.4 ( 5.07x)
avg_h264_qpel_8_mc22_8_ssse3:                           62.6 ( 4.56x)
avg_h264_qpel_8_mc23_8_c:                              398.6 ( 1.00x)
avg_h264_qpel_8_mc23_8_sse2:                            87.6 ( 4.55x)
avg_h264_qpel_8_mc23_8_ssse3:                           93.8 ( 4.25x)
avg_h264_qpel_8_mc32_8_c:                              425.8 ( 1.00x)
avg_h264_qpel_8_mc32_8_sse2:                            63.8 ( 6.67x)
avg_h264_qpel_8_mc32_8_ssse3:                           69.0 ( 6.17x)
avg_h264_qpel_16_mc12_8_c:                            1748.2 ( 1.00x)
avg_h264_qpel_16_mc12_8_sse2:                          198.5 ( 8.81x)
avg_h264_qpel_16_mc12_8_ssse3:                         203.2 ( 8.60x)
avg_h264_qpel_16_mc21_8_c:                            1638.1 ( 1.00x)
avg_h264_qpel_16_mc21_8_sse2:                          277.4 ( 5.91x)
avg_h264_qpel_16_mc21_8_ssse3:                         291.1 ( 5.63x)
avg_h264_qpel_16_mc22_8_c:                            1140.7 ( 1.00x)
avg_h264_qpel_16_mc22_8_sse2:                          180.3 ( 6.33x)
avg_h264_qpel_16_mc22_8_ssse3:                         181.9 ( 6.27x)
avg_h264_qpel_16_mc23_8_c:                            1629.9 ( 1.00x)
avg_h264_qpel_16_mc23_8_sse2:                          278.0 ( 5.86x)
avg_h264_qpel_16_mc23_8_ssse3:                         291.0 ( 5.60x)
avg_h264_qpel_16_mc32_8_c:                            1752.1 ( 1.00x)
avg_h264_qpel_16_mc32_8_sse2:                          193.7 ( 9.05x)
avg_h264_qpel_16_mc32_8_ssse3:                         203.4 ( 8.61x)
put_h264_qpel_8_mc12_8_c:                              421.8 ( 1.00x)
put_h264_qpel_8_mc12_8_sse2:                            61.7 ( 6.83x)
put_h264_qpel_8_mc12_8_ssse3:                           67.2 ( 6.28x)
put_h264_qpel_8_mc21_8_c:                              396.8 ( 1.00x)
put_h264_qpel_8_mc21_8_sse2:                            85.4 ( 4.65x)
put_h264_qpel_8_mc21_8_ssse3:                           91.6 ( 4.33x)
put_h264_qpel_8_mc22_8_c:                              234.1 ( 1.00x)
put_h264_qpel_8_mc22_8_sse2:                            54.4 ( 4.30x)
put_h264_qpel_8_mc22_8_ssse3:                           60.2 ( 3.89x)
put_h264_qpel_8_mc23_8_c:                              399.2 ( 1.00x)
put_h264_qpel_8_mc23_8_sse2:                            85.5 ( 4.67x)
put_h264_qpel_8_mc23_8_ssse3:                           91.8 ( 4.35x)
put_h264_qpel_8_mc32_8_c:                              422.2 ( 1.00x)
put_h264_qpel_8_mc32_8_sse2:                            61.8 ( 6.83x)
put_h264_qpel_8_mc32_8_ssse3:                           67.0 ( 6.30x)
put_h264_qpel_16_mc12_8_c:                            1720.3 ( 1.00x)
put_h264_qpel_16_mc12_8_sse2:                          189.9 ( 9.06x)
put_h264_qpel_16_mc12_8_ssse3:                         199.9 ( 8.61x)
put_h264_qpel_16_mc21_8_c:                            1624.5 ( 1.00x)
put_h264_qpel_16_mc21_8_sse2:                          285.4 ( 5.69x)
put_h264_qpel_16_mc21_8_ssse3:                         296.4 ( 5.48x)
put_h264_qpel_16_mc22_8_c:                             963.9 ( 1.00x)
put_h264_qpel_16_mc22_8_sse2:                          170.1 ( 5.67x)
put_h264_qpel_16_mc22_8_ssse3:                         186.4 ( 5.17x)
put_h264_qpel_16_mc23_8_c:                            1613.5 ( 1.00x)
put_h264_qpel_16_mc23_8_sse2:                          274.6 ( 5.88x)
put_h264_qpel_16_mc23_8_ssse3:                         300.4 ( 5.37x)
put_h264_qpel_16_mc32_8_c:                            1735.9 ( 1.00x)
put_h264_qpel_16_mc32_8_sse2:                          189.6 ( 9.15x)
put_h264_qpel_16_mc32_8_ssse3:                         199.5 ( 8.70x)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-07 18:06:40 +02:00
Andreas Rheinhardt
617c042093 avcodec/x86/h264_qpel_8bit: Avoid doing unnecessary work
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-07 18:06:40 +02:00
Andreas Rheinhardt
29f439077a avcodec/h264_qpel: Move loop into qpel4_hv_lowpass_v_mmxext()
Every caller calls it three times in a loop, with slightly
modified arguments. So it makes sense to move the loop
into the callee.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-07 18:06:40 +02:00
Andreas Rheinhardt
4539f7e4d4 avcodec/x86/h264_qpel_8bit: Don't duplicate qpel4_hv_lowpass_v_mmxext
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-07 18:06:40 +02:00
Andreas Rheinhardt
3e2d9b73c1 avcodec/h264qpel: Move Snow-only code to snow.c
Blocksize 2 is Snow-only, so move all the code pertaining
to it to snow.c. Also make the put array in H264QpelContext
smaller -- it only needs three sets of 16 function pointers.
This continues 6eb8bc4217
and b0c91c2fba.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-07 18:06:40 +02:00
Andreas Rheinhardt
15a4289b79 avcodec/x86/h264_qpel_8bit: Improve register allocation
None of the other registers need to be preserved at this time,
so six XMM registers are always enough. Forgotten in
fa9ea5113b.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-07 18:06:40 +02:00
Andreas Rheinhardt
dcfef80bd9 avcodec/pngenc: Mark unreachable default switch cases as such
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-07 17:36:25 +02:00
James Almer
6231fa7fb7 avcodec/av1dec: don't emit a warning when parsing isobmff style extradata
No OBUs may be present and it's a valid scenario, so only warn when parsing raw
extradata.

Signed-off-by: James Almer <jamrial@gmail.com>
2025-10-05 22:23:51 -03:00
James Almer
78a16e42bd avcodec/av1dec: don't overwrite container level color information if none is coded in the bitstream
Signed-off-by: James Almer <jamrial@gmail.com>
2025-10-05 13:22:23 -03:00
James Almer
009e4a1c20 avcodec/libdav1d: also consider user defined color information when selecting pix_fmt
Fixes issue #20624.

Signed-off-by: James Almer <jamrial@gmail.com>
2025-10-05 13:22:23 -03:00
James Almer
99034b581f avcodec/dcadsp: constify lfe_samples parameter
Signed-off-by: James Almer <jamrial@gmail.com>
2025-10-04 14:18:30 -03:00
Andreas Rheinhardt
8fad52bd57 avcodec/x86/h264_qpel: Use ptrdiff_t for strides
Avoids having to sign-extend the strides in the assembly
(it also is more correct given that the qpel_mc_func
already uses ptrdiff_t).

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:33 +02:00
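To see why a 32-bit stride has to be sign-extended before it is used in 64-bit address arithmetic, consider the small model below. It treats registers as unsigned 64-bit values and models a missing extension as zero upper bits; the function names and the addresses are hypothetical.

```python
MASK64 = (1 << 64) - 1


def zero_extend32(v):
    """Register image if a 32-bit value is used without sign extension."""
    return v & 0xFFFFFFFF


def sign_extend32(v):
    """32-bit value sign-extended to 64 bits (as an unsigned register image)."""
    v &= 0xFFFFFFFF
    return (v | ~0xFFFFFFFF) & MASK64 if v & 0x80000000 else v


base = 0x1000
stride = -16                                    # a negative stride
wrong = (base + zero_extend32(stride)) & MASK64
right = (base + sign_extend32(stride)) & MASK64
print(hex(wrong))                               # 0x100000ff0
print(hex(right))                               # 0xff0
```

With `ptrdiff_t` the caller already passes a full-width value, so this extension step is not needed in the assembly.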
Andreas Rheinhardt
495c3d03ae avcodec/x86/h264_qpel_10bit: Remove SSE2 "cache64" duplicates
The horizontal 10bit MC SSE2 functions are currently duplicated:
They exist both in ordinary form as well as with a "sse2_cache64"
suffix. A comment in ff_h264qpel_init_x86() indicates that this
is due to older processors not liking accesses that cross cache
lines, yet these functions are identical to the non-cache64
functions (apart from the unavoidable changes in the rip-offset).

The only difference between these functions and the ordinary ones
is that the cache64 ones are created via a special form of the
INIT_XMM macro: "INIT_XMM sse2, cache64". This affects the name
and apparently defines cpuflags_cache64, yet nothing checks for
this, so both versions are identical. So remove the cache64 ones
and treat the remaining ones like ordinary SSE2 functions.

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:33 +02:00
Andreas Rheinhardt
697da64c8e avcodec/x86/h264_qpel: Port pixel8_l2_shift5 from MMXEXT to SSE2
This abides by the ABI (no missing emms) and yields a tiny
performance improvement here.

Old benchmarks:
avg_h264_qpel_8_mc12_8_c:                              419.9 ( 1.00x)
avg_h264_qpel_8_mc12_8_sse2:                            78.9 ( 5.32x)
avg_h264_qpel_8_mc12_8_ssse3:                           71.7 ( 5.86x)
avg_h264_qpel_8_mc32_8_c:                              429.1 ( 1.00x)
avg_h264_qpel_8_mc32_8_sse2:                            76.9 ( 5.58x)
avg_h264_qpel_8_mc32_8_ssse3:                           73.4 ( 5.84x)
put_h264_qpel_8_mc12_8_c:                              424.0 ( 1.00x)
put_h264_qpel_8_mc12_8_sse2:                            78.6 ( 5.40x)
put_h264_qpel_8_mc12_8_ssse3:                           70.6 ( 6.00x)
put_h264_qpel_8_mc32_8_c:                              425.7 ( 1.00x)
put_h264_qpel_8_mc32_8_sse2:                            75.2 ( 5.66x)
put_h264_qpel_8_mc32_8_ssse3:                           70.4 ( 6.05x)

New benchmarks:
avg_h264_qpel_8_mc12_8_c:                              425.7 ( 1.00x)
avg_h264_qpel_8_mc12_8_sse2:                            77.5 ( 5.49x)
avg_h264_qpel_8_mc12_8_ssse3:                           69.8 ( 6.10x)
avg_h264_qpel_8_mc32_8_c:                              423.7 ( 1.00x)
avg_h264_qpel_8_mc32_8_sse2:                            74.6 ( 5.68x)
avg_h264_qpel_8_mc32_8_ssse3:                           71.9 ( 5.89x)
put_h264_qpel_8_mc12_8_c:                              422.2 ( 1.00x)
put_h264_qpel_8_mc12_8_sse2:                            75.8 ( 5.57x)
put_h264_qpel_8_mc12_8_ssse3:                           67.9 ( 6.22x)
put_h264_qpel_8_mc32_8_c:                              421.8 ( 1.00x)
put_h264_qpel_8_mc32_8_sse2:                            72.6 ( 5.81x)
put_h264_qpel_8_mc32_8_ssse3:                           67.7 ( 6.23x)

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:33 +02:00
Andreas Rheinhardt
4ac9162beb avcodec/x86/h264_qpel: Don't use ff_ prefix for static functions
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:33 +02:00
Andreas Rheinhardt
cd077e88d1 avcodec/x86/h264_qpel: Add ff_{avg,put}_h264_qpel16_h_lowpass_l2_sse2()
These functions are currently emulated via four calls to the versions
for 8x8 blocks. In fact, the size savings from the simplified calls
in h264_qpel.c (GCC 1344B, Clang 1280B) more than outweigh the size
of the added functions (512B) here.

It is also beneficial performance-wise. Old benchmarks:
avg_h264_qpel_16_mc11_8_c:                            1414.1 ( 1.00x)
avg_h264_qpel_16_mc11_8_sse2:                          206.2 ( 6.86x)
avg_h264_qpel_16_mc11_8_ssse3:                         177.7 ( 7.96x)
avg_h264_qpel_16_mc13_8_c:                            1417.0 ( 1.00x)
avg_h264_qpel_16_mc13_8_sse2:                          207.4 ( 6.83x)
avg_h264_qpel_16_mc13_8_ssse3:                         178.2 ( 7.95x)
avg_h264_qpel_16_mc21_8_c:                            1632.8 ( 1.00x)
avg_h264_qpel_16_mc21_8_sse2:                          349.3 ( 4.67x)
avg_h264_qpel_16_mc21_8_ssse3:                         291.3 ( 5.60x)
avg_h264_qpel_16_mc23_8_c:                            1640.2 ( 1.00x)
avg_h264_qpel_16_mc23_8_sse2:                          351.3 ( 4.67x)
avg_h264_qpel_16_mc23_8_ssse3:                         290.8 ( 5.64x)
avg_h264_qpel_16_mc31_8_c:                            1411.7 ( 1.00x)
avg_h264_qpel_16_mc31_8_sse2:                          203.4 ( 6.94x)
avg_h264_qpel_16_mc31_8_ssse3:                         178.9 ( 7.89x)
avg_h264_qpel_16_mc33_8_c:                            1409.7 ( 1.00x)
avg_h264_qpel_16_mc33_8_sse2:                          204.6 ( 6.89x)
avg_h264_qpel_16_mc33_8_ssse3:                         178.1 ( 7.92x)
put_h264_qpel_16_mc11_8_c:                            1391.0 ( 1.00x)
put_h264_qpel_16_mc11_8_sse2:                          197.4 ( 7.05x)
put_h264_qpel_16_mc11_8_ssse3:                         176.1 ( 7.90x)
put_h264_qpel_16_mc13_8_c:                            1395.9 ( 1.00x)
put_h264_qpel_16_mc13_8_sse2:                          196.7 ( 7.10x)
put_h264_qpel_16_mc13_8_ssse3:                         177.7 ( 7.85x)
put_h264_qpel_16_mc21_8_c:                            1609.5 ( 1.00x)
put_h264_qpel_16_mc21_8_sse2:                          341.1 ( 4.72x)
put_h264_qpel_16_mc21_8_ssse3:                         289.2 ( 5.57x)
put_h264_qpel_16_mc23_8_c:                            1604.0 ( 1.00x)
put_h264_qpel_16_mc23_8_sse2:                          340.9 ( 4.71x)
put_h264_qpel_16_mc23_8_ssse3:                         289.6 ( 5.54x)
put_h264_qpel_16_mc31_8_c:                            1390.2 ( 1.00x)
put_h264_qpel_16_mc31_8_sse2:                          194.6 ( 7.14x)
put_h264_qpel_16_mc31_8_ssse3:                         176.4 ( 7.88x)
put_h264_qpel_16_mc33_8_c:                            1400.4 ( 1.00x)
put_h264_qpel_16_mc33_8_sse2:                          198.5 ( 7.06x)
put_h264_qpel_16_mc33_8_ssse3:                         176.2 ( 7.95x)

New benchmarks:
avg_h264_qpel_16_mc11_8_c:                            1413.3 ( 1.00x)
avg_h264_qpel_16_mc11_8_sse2:                          171.8 ( 8.23x)
avg_h264_qpel_16_mc11_8_ssse3:                         173.0 ( 8.17x)
avg_h264_qpel_16_mc13_8_c:                            1423.2 ( 1.00x)
avg_h264_qpel_16_mc13_8_sse2:                          172.0 ( 8.27x)
avg_h264_qpel_16_mc13_8_ssse3:                         173.4 ( 8.21x)
avg_h264_qpel_16_mc21_8_c:                            1641.3 ( 1.00x)
avg_h264_qpel_16_mc21_8_sse2:                          322.1 ( 5.10x)
avg_h264_qpel_16_mc21_8_ssse3:                         291.3 ( 5.63x)
avg_h264_qpel_16_mc23_8_c:                            1629.1 ( 1.00x)
avg_h264_qpel_16_mc23_8_sse2:                          323.0 ( 5.04x)
avg_h264_qpel_16_mc23_8_ssse3:                         293.3 ( 5.55x)
avg_h264_qpel_16_mc31_8_c:                            1409.2 ( 1.00x)
avg_h264_qpel_16_mc31_8_sse2:                          172.0 ( 8.19x)
avg_h264_qpel_16_mc31_8_ssse3:                         173.7 ( 8.11x)
avg_h264_qpel_16_mc33_8_c:                            1402.5 ( 1.00x)
avg_h264_qpel_16_mc33_8_sse2:                          172.5 ( 8.13x)
avg_h264_qpel_16_mc33_8_ssse3:                         173.6 ( 8.08x)
put_h264_qpel_16_mc11_8_c:                            1393.7 ( 1.00x)
put_h264_qpel_16_mc11_8_sse2:                          170.4 ( 8.18x)
put_h264_qpel_16_mc11_8_ssse3:                         178.2 ( 7.82x)
put_h264_qpel_16_mc13_8_c:                            1398.0 ( 1.00x)
put_h264_qpel_16_mc13_8_sse2:                          170.2 ( 8.21x)
put_h264_qpel_16_mc13_8_ssse3:                         178.6 ( 7.83x)
put_h264_qpel_16_mc21_8_c:                            1619.6 ( 1.00x)
put_h264_qpel_16_mc21_8_sse2:                          320.6 ( 5.05x)
put_h264_qpel_16_mc21_8_ssse3:                         297.2 ( 5.45x)
put_h264_qpel_16_mc23_8_c:                            1617.4 ( 1.00x)
put_h264_qpel_16_mc23_8_sse2:                          320.0 ( 5.05x)
put_h264_qpel_16_mc23_8_ssse3:                         297.4 ( 5.44x)
put_h264_qpel_16_mc31_8_c:                            1389.7 ( 1.00x)
put_h264_qpel_16_mc31_8_sse2:                          169.9 ( 8.18x)
put_h264_qpel_16_mc31_8_ssse3:                         178.1 ( 7.80x)
put_h264_qpel_16_mc33_8_c:                            1394.0 ( 1.00x)
put_h264_qpel_16_mc33_8_sse2:                          170.9 ( 8.16x)
put_h264_qpel_16_mc33_8_ssse3:                         176.9 ( 7.88x)

Notice that the SSSE3 versions of mc21 and mc23 benefit from
an optimized version of hv2_lowpass.

Also notice that there is no SSE2 version of the purely horizontal
motion compensation. This means that src2 is currently always aligned
when calling the SSE2 functions (and that srcStride is always equal
to the block width). This has not been exploited yet.

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:33 +02:00
Andreas Rheinhardt
4880fa4dca avcodec/x86/h264_qpel_8bit: Remove dead macro
Forgotten in 4011a76494.

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:33 +02:00
Andreas Rheinhardt
35aaf697e9 avcodec/x86/h264_qpel_8bit: Replace qpel8_h_lowpass_l2 MMXEXT by SSE2
Using xmm registers here is very natural, as it allows operating
on eight words at a time. It also saves 48B here
and does not clobber the MMX state.
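
Why eight words per register fit this function so well can be illustrated with SSE2 intrinsics. The following is a rough sketch and not the assembly from this commit: the function name, the parameter list and the helper load8_as_words() are made up for the example, and it assumes an x86 target with SSE2. Each row of eight pixels is widened to eight 16-bit words, filtered with the H.264 six-tap filter, narrowed with saturation and averaged with src2.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Load eight pixels and widen them to eight 16-bit words (hypothetical helper). */
static __m128i load8_as_words(const uint8_t *p)
{
    return _mm_unpacklo_epi8(_mm_loadl_epi64((const __m128i *)p),
                             _mm_setzero_si128());
}

/* Eight pixels per row; reads src[-2] .. src[10] in every row. */
static void put_h_lowpass8_l2_sketch(uint8_t *dst, const uint8_t *src,
                                     const uint8_t *src2, ptrdiff_t dst_stride,
                                     ptrdiff_t src_stride,
                                     ptrdiff_t src2_stride, int h)
{
    assert(h > 0);
    for (int y = 0; y < h; y++) {
        /* Pair up the taps that share a coefficient: 1, -5 and 20. */
        __m128i outer = _mm_add_epi16(load8_as_words(src - 2),
                                      load8_as_words(src + 3));
        __m128i inner = _mm_add_epi16(load8_as_words(src - 1),
                                      load8_as_words(src + 2));
        __m128i mid   = _mm_add_epi16(load8_as_words(src),
                                      load8_as_words(src + 1));
        __m128i sum = _mm_sub_epi16(outer,
                                    _mm_mullo_epi16(inner, _mm_set1_epi16(5)));
        sum = _mm_add_epi16(sum, _mm_mullo_epi16(mid, _mm_set1_epi16(20)));
        /* Round, shift by five bits and saturate to 0..255. */
        sum = _mm_srai_epi16(_mm_add_epi16(sum, _mm_set1_epi16(16)), 5);
        __m128i half = _mm_packus_epi16(sum, sum);
        /* "l2": rounded average with the second source. */
        __m128i avg = _mm_avg_epu8(half,
                                   _mm_loadl_epi64((const __m128i *)src2));
        _mm_storel_epi64((__m128i *)dst, avg);
        dst  += dst_stride;
        src  += src_stride;
        src2 += src2_stride;
    }
}
```

All intermediate sums stay within the signed 16-bit range (between -2550 and 10726), so word arithmetic is sufficient, and no MMX register is touched.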

Old benchmarks (only tests affected by the modified function are shown):
avg_h264_qpel_8_mc11_8_c:                              352.2 ( 1.00x)
avg_h264_qpel_8_mc11_8_sse2:                            70.4 ( 5.00x)
avg_h264_qpel_8_mc11_8_ssse3:                           53.9 ( 6.53x)
avg_h264_qpel_8_mc13_8_c:                              353.3 ( 1.00x)
avg_h264_qpel_8_mc13_8_sse2:                            72.8 ( 4.86x)
avg_h264_qpel_8_mc13_8_ssse3:                           53.8 ( 6.57x)
avg_h264_qpel_8_mc21_8_c:                              404.0 ( 1.00x)
avg_h264_qpel_8_mc21_8_sse2:                           116.1 ( 3.48x)
avg_h264_qpel_8_mc21_8_ssse3:                           94.3 ( 4.28x)
avg_h264_qpel_8_mc23_8_c:                              398.9 ( 1.00x)
avg_h264_qpel_8_mc23_8_sse2:                           118.6 ( 3.36x)
avg_h264_qpel_8_mc23_8_ssse3:                           94.8 ( 4.21x)
avg_h264_qpel_8_mc31_8_c:                              352.7 ( 1.00x)
avg_h264_qpel_8_mc31_8_sse2:                            71.4 ( 4.94x)
avg_h264_qpel_8_mc31_8_ssse3:                           53.8 ( 6.56x)
avg_h264_qpel_8_mc33_8_c:                              354.0 ( 1.00x)
avg_h264_qpel_8_mc33_8_sse2:                            70.6 ( 5.01x)
avg_h264_qpel_8_mc33_8_ssse3:                           53.7 ( 6.59x)
avg_h264_qpel_16_mc11_8_c:                            1417.0 ( 1.00x)
avg_h264_qpel_16_mc11_8_sse2:                          276.9 ( 5.12x)
avg_h264_qpel_16_mc11_8_ssse3:                         178.8 ( 7.92x)
avg_h264_qpel_16_mc13_8_c:                            1427.3 ( 1.00x)
avg_h264_qpel_16_mc13_8_sse2:                          277.4 ( 5.14x)
avg_h264_qpel_16_mc13_8_ssse3:                         179.7 ( 7.94x)
avg_h264_qpel_16_mc21_8_c:                            1634.1 ( 1.00x)
avg_h264_qpel_16_mc21_8_sse2:                          421.3 ( 3.88x)
avg_h264_qpel_16_mc21_8_ssse3:                         291.2 ( 5.61x)
avg_h264_qpel_16_mc23_8_c:                            1627.0 ( 1.00x)
avg_h264_qpel_16_mc23_8_sse2:                          420.8 ( 3.87x)
avg_h264_qpel_16_mc23_8_ssse3:                         291.0 ( 5.59x)
avg_h264_qpel_16_mc31_8_c:                            1418.4 ( 1.00x)
avg_h264_qpel_16_mc31_8_sse2:                          278.5 ( 5.09x)
avg_h264_qpel_16_mc31_8_ssse3:                         178.6 ( 7.94x)
avg_h264_qpel_16_mc33_8_c:                            1407.3 ( 1.00x)
avg_h264_qpel_16_mc33_8_sse2:                          277.6 ( 5.07x)
avg_h264_qpel_16_mc33_8_ssse3:                         179.9 ( 7.82x)
put_h264_qpel_8_mc11_8_c:                              348.1 ( 1.00x)
put_h264_qpel_8_mc11_8_sse2:                            69.1 ( 5.04x)
put_h264_qpel_8_mc11_8_ssse3:                           53.8 ( 6.47x)
put_h264_qpel_8_mc13_8_c:                              349.3 ( 1.00x)
put_h264_qpel_8_mc13_8_sse2:                            69.7 ( 5.01x)
put_h264_qpel_8_mc13_8_ssse3:                           53.7 ( 6.51x)
put_h264_qpel_8_mc21_8_c:                              398.5 ( 1.00x)
put_h264_qpel_8_mc21_8_sse2:                           115.0 ( 3.46x)
put_h264_qpel_8_mc21_8_ssse3:                           95.3 ( 4.18x)
put_h264_qpel_8_mc23_8_c:                              399.9 ( 1.00x)
put_h264_qpel_8_mc23_8_sse2:                           120.8 ( 3.31x)
put_h264_qpel_8_mc23_8_ssse3:                           95.4 ( 4.19x)
put_h264_qpel_8_mc31_8_c:                              350.4 ( 1.00x)
put_h264_qpel_8_mc31_8_sse2:                            69.6 ( 5.03x)
put_h264_qpel_8_mc31_8_ssse3:                           54.2 ( 6.47x)
put_h264_qpel_8_mc33_8_c:                              353.1 ( 1.00x)
put_h264_qpel_8_mc33_8_sse2:                            71.0 ( 4.97x)
put_h264_qpel_8_mc33_8_ssse3:                           54.2 ( 6.51x)
put_h264_qpel_16_mc11_8_c:                            1384.2 ( 1.00x)
put_h264_qpel_16_mc11_8_sse2:                          272.9 ( 5.07x)
put_h264_qpel_16_mc11_8_ssse3:                         178.3 ( 7.76x)
put_h264_qpel_16_mc13_8_c:                            1393.6 ( 1.00x)
put_h264_qpel_16_mc13_8_sse2:                          271.1 ( 5.14x)
put_h264_qpel_16_mc13_8_ssse3:                         178.3 ( 7.82x)
put_h264_qpel_16_mc21_8_c:                            1612.6 ( 1.00x)
put_h264_qpel_16_mc21_8_sse2:                          416.5 ( 3.87x)
put_h264_qpel_16_mc21_8_ssse3:                         289.1 ( 5.58x)
put_h264_qpel_16_mc23_8_c:                            1621.3 ( 1.00x)
put_h264_qpel_16_mc23_8_sse2:                          416.9 ( 3.89x)
put_h264_qpel_16_mc23_8_ssse3:                         289.4 ( 5.60x)
put_h264_qpel_16_mc31_8_c:                            1408.4 ( 1.00x)
put_h264_qpel_16_mc31_8_sse2:                          273.5 ( 5.15x)
put_h264_qpel_16_mc31_8_ssse3:                         176.9 ( 7.96x)
put_h264_qpel_16_mc33_8_c:                            1396.4 ( 1.00x)
put_h264_qpel_16_mc33_8_sse2:                          276.3 ( 5.05x)
put_h264_qpel_16_mc33_8_ssse3:                         176.4 ( 7.92x)

New benchmarks:
avg_h264_qpel_8_mc11_8_c:                              352.1 ( 1.00x)
avg_h264_qpel_8_mc11_8_sse2:                            52.5 ( 6.71x)
avg_h264_qpel_8_mc11_8_ssse3:                           53.9 ( 6.54x)
avg_h264_qpel_8_mc13_8_c:                              350.8 ( 1.00x)
avg_h264_qpel_8_mc13_8_sse2:                            54.7 ( 6.42x)
avg_h264_qpel_8_mc13_8_ssse3:                           54.3 ( 6.46x)
avg_h264_qpel_8_mc21_8_c:                              400.1 ( 1.00x)
avg_h264_qpel_8_mc21_8_sse2:                            98.6 ( 4.06x)
avg_h264_qpel_8_mc21_8_ssse3:                           95.5 ( 4.19x)
avg_h264_qpel_8_mc23_8_c:                              400.4 ( 1.00x)
avg_h264_qpel_8_mc23_8_sse2:                           101.4 ( 3.95x)
avg_h264_qpel_8_mc23_8_ssse3:                           95.9 ( 4.18x)
avg_h264_qpel_8_mc31_8_c:                              352.4 ( 1.00x)
avg_h264_qpel_8_mc31_8_sse2:                            52.9 ( 6.67x)
avg_h264_qpel_8_mc31_8_ssse3:                           54.4 ( 6.48x)
avg_h264_qpel_8_mc33_8_c:                              354.5 ( 1.00x)
avg_h264_qpel_8_mc33_8_sse2:                            52.9 ( 6.70x)
avg_h264_qpel_8_mc33_8_ssse3:                           54.4 ( 6.52x)
avg_h264_qpel_16_mc11_8_c:                            1420.4 ( 1.00x)
avg_h264_qpel_16_mc11_8_sse2:                          204.8 ( 6.93x)
avg_h264_qpel_16_mc11_8_ssse3:                         177.9 ( 7.98x)
avg_h264_qpel_16_mc13_8_c:                            1409.8 ( 1.00x)
avg_h264_qpel_16_mc13_8_sse2:                          206.4 ( 6.83x)
avg_h264_qpel_16_mc13_8_ssse3:                         178.0 ( 7.92x)
avg_h264_qpel_16_mc21_8_c:                            1634.1 ( 1.00x)
avg_h264_qpel_16_mc21_8_sse2:                          349.6 ( 4.67x)
avg_h264_qpel_16_mc21_8_ssse3:                         290.0 ( 5.63x)
avg_h264_qpel_16_mc23_8_c:                            1624.1 ( 1.00x)
avg_h264_qpel_16_mc23_8_sse2:                          350.0 ( 4.64x)
avg_h264_qpel_16_mc23_8_ssse3:                         291.9 ( 5.56x)
avg_h264_qpel_16_mc31_8_c:                            1407.2 ( 1.00x)
avg_h264_qpel_16_mc31_8_sse2:                          205.8 ( 6.84x)
avg_h264_qpel_16_mc31_8_ssse3:                         178.2 ( 7.90x)
avg_h264_qpel_16_mc33_8_c:                            1400.5 ( 1.00x)
avg_h264_qpel_16_mc33_8_sse2:                          206.3 ( 6.79x)
avg_h264_qpel_16_mc33_8_ssse3:                         179.4 ( 7.81x)
put_h264_qpel_8_mc11_8_c:                              349.7 ( 1.00x)
put_h264_qpel_8_mc11_8_sse2:                            50.2 ( 6.96x)
put_h264_qpel_8_mc11_8_ssse3:                           51.3 ( 6.82x)
put_h264_qpel_8_mc13_8_c:                              349.8 ( 1.00x)
put_h264_qpel_8_mc13_8_sse2:                            50.7 ( 6.90x)
put_h264_qpel_8_mc13_8_ssse3:                           51.7 ( 6.76x)
put_h264_qpel_8_mc21_8_c:                              398.0 ( 1.00x)
put_h264_qpel_8_mc21_8_sse2:                            96.5 ( 4.13x)
put_h264_qpel_8_mc21_8_ssse3:                           92.3 ( 4.31x)
put_h264_qpel_8_mc23_8_c:                              401.4 ( 1.00x)
put_h264_qpel_8_mc23_8_sse2:                           102.3 ( 3.92x)
put_h264_qpel_8_mc23_8_ssse3:                           92.8 ( 4.32x)
put_h264_qpel_8_mc31_8_c:                              349.4 ( 1.00x)
put_h264_qpel_8_mc31_8_sse2:                            50.8 ( 6.88x)
put_h264_qpel_8_mc31_8_ssse3:                           51.8 ( 6.75x)
put_h264_qpel_8_mc33_8_c:                              351.1 ( 1.00x)
put_h264_qpel_8_mc33_8_sse2:                            52.2 ( 6.73x)
put_h264_qpel_8_mc33_8_ssse3:                           51.7 ( 6.79x)
put_h264_qpel_16_mc11_8_c:                            1391.1 ( 1.00x)
put_h264_qpel_16_mc11_8_sse2:                          196.6 ( 7.07x)
put_h264_qpel_16_mc11_8_ssse3:                         178.2 ( 7.81x)
put_h264_qpel_16_mc13_8_c:                            1385.2 ( 1.00x)
put_h264_qpel_16_mc13_8_sse2:                          195.6 ( 7.08x)
put_h264_qpel_16_mc13_8_ssse3:                         176.6 ( 7.84x)
put_h264_qpel_16_mc21_8_c:                            1607.5 ( 1.00x)
put_h264_qpel_16_mc21_8_sse2:                          341.0 ( 4.71x)
put_h264_qpel_16_mc21_8_ssse3:                         289.1 ( 5.56x)
put_h264_qpel_16_mc23_8_c:                            1616.7 ( 1.00x)
put_h264_qpel_16_mc23_8_sse2:                          340.8 ( 4.74x)
put_h264_qpel_16_mc23_8_ssse3:                         288.6 ( 5.60x)
put_h264_qpel_16_mc31_8_c:                            1397.6 ( 1.00x)
put_h264_qpel_16_mc31_8_sse2:                          197.3 ( 7.08x)
put_h264_qpel_16_mc31_8_ssse3:                         175.4 ( 7.97x)
put_h264_qpel_16_mc33_8_c:                            1394.3 ( 1.00x)
put_h264_qpel_16_mc33_8_sse2:                          197.7 ( 7.05x)
put_h264_qpel_16_mc33_8_ssse3:                         175.2 ( 7.96x)

As can be seen, the SSE2 version is often neck and neck with the SSSE3
version (which also benefits from a better hv2_lowpass SSSE3
implementation for mc21 and mc23) for 8x8 blocks.
Unsurprisingly, SSSE3 beats SSE2 for 16x16 blocks: For SSE2,
these blocks are processed by calling the 8x8 function four times
whereas SSSE3 has a dedicated function (on x64).
This implementation should also be extendable to an AVX version
for 16x16 blocks.

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:33 +02:00
Andreas Rheinhardt
fa9ea5113b avcodec/x86/h264_qpel_8bit: Optimize branch away
ff_{avg,put}_h264_qpel8or16_hv2_lowpass_ssse3()
currently is almost the disjoint union of the codepaths
for sizes 8 and 16. This size is a compile-time constant
at every callsite. So split the function and avoid
the runtime branch.
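
The idea of the split can be sketched in plain C. This is an illustration under assumptions, not the SSSE3 assembly: the function names and parameter lists are invented for the example, and the row helper simply applies the six-tap filter to 16-bit intermediates with the (v + 512) >> 10 rounding that H.264 uses for the centre position.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One output row: six-tap filter over 16-bit intermediates, then rounding
 * by (v + 512) >> 10 and clipping. Reads tmp[-2] .. tmp[w + 2]. */
static inline void hv2_row(uint8_t *dst, const int16_t *tmp, int w)
{
    for (int x = 0; x < w; x++) {
        int v = tmp[x - 2] - 5 * tmp[x - 1] + 20 * tmp[x] +
                20 * tmp[x + 1] - 5 * tmp[x + 2] + tmp[x + 3];
        v = (v + 512) >> 10;
        dst[x] = (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
    }
}

/* Before: one entry point that selects its code path at run time. */
static void put_hv2_lowpass_8or16(uint8_t *dst, const int16_t *tmp,
                                  ptrdiff_t dst_stride, ptrdiff_t tmp_stride,
                                  int size)
{
    assert(size == 8 || size == 16);
    if (size == 8) {
        for (int y = 0; y < 8; y++)
            hv2_row(dst + y * dst_stride, tmp + y * tmp_stride, 8);
    } else {
        for (int y = 0; y < 16; y++)
            hv2_row(dst + y * dst_stride, tmp + y * tmp_stride, 16);
    }
}

/* After: one function per size; each call site names the one it needs. */
static void put_hv2_lowpass_8(uint8_t *dst, const int16_t *tmp,
                              ptrdiff_t dst_stride, ptrdiff_t tmp_stride)
{
    for (int y = 0; y < 8; y++)
        hv2_row(dst + y * dst_stride, tmp + y * tmp_stride, 8);
}

static void put_hv2_lowpass_16(uint8_t *dst, const int16_t *tmp,
                               ptrdiff_t dst_stride, ptrdiff_t tmp_stride)
{
    for (int y = 0; y < 16; y++)
        hv2_row(dst + y * dst_stride, tmp + y * tmp_stride, 16);
}
```

Since every caller already knows the size at compile time, the split costs nothing at the call sites and drops both the size argument and the branch.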

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:33 +02:00
Andreas Rheinhardt
400203c00c avcodec/x86/h264_qpel: Remove unused parameter from hv2_lowpass funcs
tmpstride is unused. This also allows removing said parameter
from lots of functions in h264_qpel.c.

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:33 +02:00
Andreas Rheinhardt
b84c818c83 avcodec/x86/h264_qpel: Remove constant parameters from shift5 funcs
They are constant since the size 16 version is no longer emulated
via the size 8 version.

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:33 +02:00
Andreas Rheinhardt
810bd3e62a avcodec/x86/h264_qpel: Add ff_{avg,put}_pixels16_l2_shift5_sse2
Up until now this function was emulated via two calls
to ff_{avg,put}_pixels8_l2_shift5_mmxext(). Adding a dedicated
function proved beneficial both size-wise and performance-wise:
The new functions take 192B, yet the simplified calls save
256B with GCC and 320B with Clang here.

This change will also allow further optimizations.
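
A scalar sketch of what an l2_shift5 function computes, and of the old two-call emulation, might look as follows. The names, the parameter lists and the explicit strides are assumptions for illustration only; the sketch further assumes that any rounding bias has already been added to the 16-bit intermediates before the shift.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* dst = rounded average of src8 and the 16-bit intermediate shifted right
 * by five bits and saturated to 0..255. Hypothetical helper. */
static void put_pixels_l2_shift5(uint8_t *dst, const int16_t *src16,
                                 const uint8_t *src8, ptrdiff_t dst_stride,
                                 ptrdiff_t src16_stride, ptrdiff_t src8_stride,
                                 int w, int h)
{
    assert(w > 0 && h > 0);
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int v = src16[x] >> 5;
            v = v < 0 ? 0 : v > 255 ? 255 : v;
            dst[x] = (uint8_t)((v + src8[x] + 1) >> 1);
        }
        dst   += dst_stride;
        src16 += src16_stride;
        src8  += src8_stride;
    }
}

/* Old approach: two 8-wide calls, one for each half of the block. */
static void put_pixels16_l2_shift5_emulated(uint8_t *dst, const int16_t *src16,
                                            const uint8_t *src8,
                                            ptrdiff_t dst_stride,
                                            ptrdiff_t src16_stride,
                                            ptrdiff_t src8_stride)
{
    put_pixels_l2_shift5(dst, src16, src8, dst_stride, src16_stride,
                         src8_stride, 8, 16);
    put_pixels_l2_shift5(dst + 8, src16 + 8, src8 + 8, dst_stride,
                         src16_stride, src8_stride, 8, 16);
}

/* New approach: a single 16-wide call. */
static void put_pixels16_l2_shift5_direct(uint8_t *dst, const int16_t *src16,
                                          const uint8_t *src8,
                                          ptrdiff_t dst_stride,
                                          ptrdiff_t src16_stride,
                                          ptrdiff_t src8_stride)
{
    put_pixels_l2_shift5(dst, src16, src8, dst_stride, src16_stride,
                         src8_stride, 16, 16);
}
```

With a 16-byte-wide register, the single call maps onto one load, shift, pack and average per row instead of two of each.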

Old benchmarks:
avg_h264_qpel_16_mc12_8_c:                            1735.8 ( 1.00x)
avg_h264_qpel_16_mc12_8_sse2:                          300.8 ( 5.77x)
avg_h264_qpel_16_mc12_8_ssse3:                         233.3 ( 7.44x)
avg_h264_qpel_16_mc32_8_c:                            1777.9 ( 1.00x)
avg_h264_qpel_16_mc32_8_sse2:                          275.6 ( 6.45x)
avg_h264_qpel_16_mc32_8_ssse3:                         235.7 ( 7.54x)
put_h264_qpel_16_mc12_8_c:                            1808.2 ( 1.00x)
put_h264_qpel_16_mc12_8_sse2:                          267.2 ( 6.77x)
put_h264_qpel_16_mc12_8_ssse3:                         231.9 ( 7.80x)
put_h264_qpel_16_mc32_8_c:                            1766.9 ( 1.00x)
put_h264_qpel_16_mc32_8_sse2:                          272.9 ( 6.47x)
put_h264_qpel_16_mc32_8_ssse3:                         229.5 ( 7.70x)

New benchmarks:
avg_h264_qpel_16_mc12_8_c:                            1742.3 ( 1.00x)
avg_h264_qpel_16_mc12_8_sse2:                          240.3 ( 7.25x)
avg_h264_qpel_16_mc12_8_ssse3:                         214.8 ( 8.11x)
avg_h264_qpel_16_mc32_8_c:                            1748.0 ( 1.00x)
avg_h264_qpel_16_mc32_8_sse2:                          238.0 ( 7.35x)
avg_h264_qpel_16_mc32_8_ssse3:                         209.2 ( 8.35x)
put_h264_qpel_16_mc12_8_c:                            2014.4 ( 1.00x)
put_h264_qpel_16_mc12_8_sse2:                          243.7 ( 8.27x)
put_h264_qpel_16_mc12_8_ssse3:                         211.5 ( 9.52x)
put_h264_qpel_16_mc32_8_c:                            1800.0 ( 1.00x)
put_h264_qpel_16_mc32_8_sse2:                          238.8 ( 7.54x)
put_h264_qpel_16_mc32_8_ssse3:                         206.7 ( 8.71x)

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-04 07:06:33 +02:00