FFmpeg

mirror of https://mirror.skon.top/https://github.com/FFmpeg/FFmpeg synced 2026-04-23 02:11:14 +08:00

Author	SHA1	Message	Date
Manuel Lauss	3945d100ef	avcodec/sanm: remove unused SANMFrameHeader Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>	2025-10-09 10:29:00 +02:00
Manuel Lauss	2ef26c30eb	avcodec/sanm: implement BL16 subcodecs 1 and 7 Both of these encode a quarter-sized keyframe, with missing pixels interpolated from the immediate neighbours. Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>	2025-10-09 10:29:00 +02:00
Manuel Lauss	b1a7f8b7cf	avcodec/sanm: factor out the ANIM decoding into separate function Mainly for readability. No functional changes. Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>	2025-10-09 10:29:00 +02:00
Manuel Lauss	49c552d066	avcodec/sanm: restructure SANM like the other block codecs Restructure the SANM (or BL16 as LucasArts calls it) decoder to make it look like the others, as it is basically a development of old_codec47 for rgb565 values. No functional changes. Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>	2025-10-09 10:29:00 +02:00
Manuel Lauss	4d5e87eaa4	Revert "avcodec/sanm: Check w,h,left,top" This reverts commit `134fbfd1dc`. As it breaks valid uses of this in Rebel Assault 1 videos. Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>	2025-10-09 10:29:00 +02:00
Manuel Lauss	75b6937527	avcodec/sanm: reset rotate_code every iteration and eliminate the explicit reset in the other decoders that don't need it. Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>	2025-10-09 10:29:00 +02:00
Manuel Lauss	de7db62acc	avcodec/sanm: rename process_block to codec47_block the new name better indicates where it belongs to. Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>	2025-10-09 10:29:00 +02:00
Manuel Lauss	043dafc4c2	avcodec/sanm: codec37/47/48 size checks Add more size checks to old_codec37/47/48, esp. the headers. Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>	2025-10-09 10:29:00 +02:00
Manuel Lauss	f98cd66b4b	avcodec/sanm: codec47: read the small codebook codec47 carries a 4-byte small codebook in its header. Read those 4 bytes into context member instead of awkwardly redirecting the bytestream pointer every time it needs to be accessed. Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>	2025-10-09 10:29:00 +02:00
Manuel Lauss	72e6206c88	avcodec/sanm: partially fix codec48 The mv check introduced with `d5bdb0b705` broke MotS videos: - their height (300 lines) is 37.5 blocks; unfortunately the videos try to access up to 1 block more. Extend the mv check to the aligned_height, which fixes most artifacts. - don't return an error when an mv is invalid; rather skip the (subblock). Gets rid of almost all artifacts. Some artifacts still remain, esp. in space scenes where the original encoder apparently fetched black pixels from outside of the aligned height. An increase of the buffer size by 8 lines will fix that later. Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>	2025-10-09 10:29:00 +02:00
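The height arithmetic in the codec48 message above can be illustrated with a small sketch. It assumes 8x8 blocks (which matches 300 / 8 = 37.5); the helper names `align_up` and `mv_block_in_bounds` and the example width of 640 are hypothetical and are not taken from sanm.c, whose actual check may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Round x up to the next multiple of a (a must be a power of two). */
static int align_up(int x, int a)
{
    assert(a > 0 && (a & (a - 1)) == 0);
    return (x + a - 1) & ~(a - 1);
}

/* Hypothetical bounds check: may a bs x bs block whose top-left corner is
 * (x + mvx, y + mvy) be read from a buffer that is width pixels wide and
 * limit_h lines high? */
static bool mv_block_in_bounds(int x, int y, int mvx, int mvy,
                               int width, int limit_h, int bs)
{
    int sx = x + mvx;
    int sy = y + mvy;
    return sx >= 0 && sy >= 0 && sx + bs <= width && sy + bs <= limit_h;
}
```

With `limit_h = 300`, the block row starting at line 296 is rejected because it ends at line 304. With `limit_h = align_up(300, 8)`, which is 304, the same block is accepted. A caller following the second bullet of the message would skip a block for which the check fails instead of returning an error.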
Manuel Lauss	24ce42b406	avcodec/sanm: codec4 improvements - don't draw outside the buffers - don't wrap around when coordinates go over the edge Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>	2025-10-09 10:29:00 +02:00
Manuel Lauss	dfe4a0626f	avcodec/sanm: codec31 improvements - don't draw outside the buffers - don't wrap around when coordinates go over the edge This is especially noticeable in RA1 files such as O1OPEN.ANM and C1C3PO.ANM, where planets wrap around. Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>	2025-10-09 10:29:00 +02:00
Manuel Lauss	da4b88494c	avcodec/sanm: codec1 improvements - don't draw outside the buffers - don't wrap around when coordinates go over the edge This is especially noticeable in RA1 files such as O1OPEN.ANM and C1C3PO.ANM, where planets wrap around. Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>	2025-10-09 10:28:59 +02:00
Manuel Lauss	d18c25f1a9	avcodec/sanm: codec21 improvements - don't draw outside the buffers - don't wrap around when coordinates go over the edge Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>	2025-10-09 10:28:59 +02:00
Manuel Lauss	67b28acba3	avcodec/sanm: codec23 improvements - don't draw outside the buffers - don't wrap around when coordinates go over the edge Signed-off-by: Manuel Lauss <manuel.lauss@gmail.com>	2025-10-09 10:28:59 +02:00
James Almer	4377affc28	avcodec/hevc/refs: don't unconditionally discard non-IRAP frames if no IRAP frame was seen before Should fix issue #20661 Signed-off-by: James Almer <jamrial@gmail.com>	2025-10-09 02:52:46 +00:00
Andreas Rheinhardt	378d5bb08a	avcodec/x86/fpel: Add blocksize x blocksize avg/put functions This commit deduplicates the wrappers around the fpel functions for copying whole blocks (i.e. height equaling width). It does this in a manner which avoids having push/pop function arguments when the calling convention forces one to pass them on the stack (as in 32bit systems). Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-08 20:40:53 +02:00
Andreas Rheinhardt	ad498f9759	avcodec/x86/cavsdsp: Remove MMXEXT Qpeldsp Superseded by SSE2. Saves about 11630B here. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-08 20:40:08 +02:00
Andreas Rheinhardt	650098955e	avcodec/x86/cavs_qpel: Add SSE2 vertical motion compensation This is not based on the MMXEXT one, because the latter is quite suboptimal: Motion vector types mc01 and mc03 (vertical motion vectors with remainder of one quarter or three quarters) use different neighboring lines for interpolation: mc01 uses two lines above and two lines below, mc03 one line above and three lines below. The MMXEXT code uses a common macro for all of them and therefore reads six lines before it processes them (even reading lines which are not used at all), leading to severe register pressure. Another difference to the old code is that the positive and negative parts of the sum to calculate are accumulated separately and the subtraction is performed with unsigned saturation, so that one can avoid biasing the sum. The fact that the mc01 and mc03 filter coefficients are mirrors of each other has been exploited to reduce mc01 to mc03. But of course the most important difference between this code and the MMXEXT one is that XMM registers allow processing eight words at a time, ideal for 8x8 subblocks, whereas the MMXEXT code processes them in 4x8 or 4x16 blocks. 
Benchmarks: avg_cavs_qpel_pixels_tab[0][4]_c: 917.0 ( 1.00x) avg_cavs_qpel_pixels_tab[0][4]_mmxext: 222.0 ( 4.13x) avg_cavs_qpel_pixels_tab[0][4]_sse2: 89.0 (10.31x) avg_cavs_qpel_pixels_tab[0][12]_c: 885.7 ( 1.00x) avg_cavs_qpel_pixels_tab[0][12]_mmxext: 223.2 ( 3.97x) avg_cavs_qpel_pixels_tab[0][12]_sse2: 88.5 (10.01x) avg_cavs_qpel_pixels_tab[1][4]_c: 222.4 ( 1.00x) avg_cavs_qpel_pixels_tab[1][4]_mmxext: 57.2 ( 3.89x) avg_cavs_qpel_pixels_tab[1][4]_sse2: 23.3 ( 9.55x) avg_cavs_qpel_pixels_tab[1][12]_c: 216.0 ( 1.00x) avg_cavs_qpel_pixels_tab[1][12]_mmxext: 57.4 ( 3.76x) avg_cavs_qpel_pixels_tab[1][12]_sse2: 22.6 ( 9.56x) put_cavs_qpel_pixels_tab[0][4]_c: 750.9 ( 1.00x) put_cavs_qpel_pixels_tab[0][4]_mmxext: 210.4 ( 3.57x) put_cavs_qpel_pixels_tab[0][4]_sse2: 84.2 ( 8.92x) put_cavs_qpel_pixels_tab[0][12]_c: 731.6 ( 1.00x) put_cavs_qpel_pixels_tab[0][12]_mmxext: 210.7 ( 3.47x) put_cavs_qpel_pixels_tab[0][12]_sse2: 84.1 ( 8.70x) put_cavs_qpel_pixels_tab[1][4]_c: 191.7 ( 1.00x) put_cavs_qpel_pixels_tab[1][4]_mmxext: 53.8 ( 3.56x) put_cavs_qpel_pixels_tab[1][4]_sse2: 24.5 ( 7.83x) put_cavs_qpel_pixels_tab[1][12]_c: 179.1 ( 1.00x) put_cavs_qpel_pixels_tab[1][12]_mmxext: 53.9 ( 3.32x) put_cavs_qpel_pixels_tab[1][12]_sse2: 24.0 ( 7.47x) Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-08 20:40:08 +02:00
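The "accumulate positive and negative parts separately, then subtract with unsigned saturation" idea from the message above can be modelled in scalar C. This is only a sketch: it borrows the taps (-1, -2, 96, 42, -7) quoted in the cavsdsp fix elsewhere in this log, and the function names are hypothetical. Whether the SSE2 code arranges its terms exactly this way is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of an unsigned saturating 16-bit subtraction, i.e. what the
 * x86 instruction psubusw computes per word lane: max(a - b, 0). */
static uint16_t sub_sat_u16(uint16_t a, uint16_t b)
{
    return a > b ? (uint16_t)(a - b) : 0;
}

/* One output pixel of the 5-tap filter (-1, -2, 96, 42, -7) with rounding
 * constant 64 and a shift by 7. Positive and negative terms are kept in
 * separate unsigned accumulators, so no bias is needed. */
static uint8_t filter_split(const uint8_t s[5])
{
    /* At most (96 + 42) * 255 + 64 = 35254, which fits in 16 unsigned bits. */
    uint16_t pos = (uint16_t)(96 * s[2] + 42 * s[3] + 64);
    /* At most (1 + 2 + 7) * 255 = 2550. */
    uint16_t neg = (uint16_t)(s[0] + 2 * s[1] + 7 * s[4]);
    assert(neg <= 2550);

    /* A negative true sum saturates to 0, which is also its clipped value. */
    unsigned v = sub_sat_u16(pos, neg) >> 7;
    return v > 255 ? 255 : (uint8_t)v;
}

/* Mirrored taps (-7, 42, 96, -2, -1): the same filter applied to the
 * reversed column, which is how one case can be reduced to the other. */
static uint8_t filter_split_mirrored(const uint8_t s[5])
{
    const uint8_t r[5] = { s[4], s[3], s[2], s[1], s[0] };
    return filter_split(r);
}
```

Because the saturating subtraction already maps every negative sum to 0, the result only needs an upper clip at 255 after the logical shift.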
Andreas Rheinhardt	74a88c0c11	avcodec/x86/cavsdsp: Add SSE2 mc20 horizontal motion compensation Basically a direct port of the MMXEXT one. The main difference is of course that one can process eight pixels (unpacked to words) at a time, leading to speedups. avg_cavs_qpel_pixels_tab[0][2]_c: 700.1 ( 1.00x) avg_cavs_qpel_pixels_tab[0][2]_mmxext: 158.1 ( 4.43x) avg_cavs_qpel_pixels_tab[0][2]_sse2: 86.0 ( 8.14x) avg_cavs_qpel_pixels_tab[1][2]_c: 171.9 ( 1.00x) avg_cavs_qpel_pixels_tab[1][2]_mmxext: 39.4 ( 4.36x) avg_cavs_qpel_pixels_tab[1][2]_sse2: 21.7 ( 7.92x) put_cavs_qpel_pixels_tab[0][2]_c: 525.7 ( 1.00x) put_cavs_qpel_pixels_tab[0][2]_mmxext: 148.5 ( 3.54x) put_cavs_qpel_pixels_tab[0][2]_sse2: 75.2 ( 6.99x) put_cavs_qpel_pixels_tab[1][2]_c: 129.5 ( 1.00x) put_cavs_qpel_pixels_tab[1][2]_mmxext: 36.7 ( 3.53x) put_cavs_qpel_pixels_tab[1][2]_sse2: 19.0 ( 6.81x) Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-08 20:40:08 +02:00
Andreas Rheinhardt	cc2e2f12ca	avcodec/x86/cavsdsp: Fix vertical qpel motion compensation The prediction involves terms of the form (-1 * s0 - 2 * s1 + 96 * s2 + 42 * s3 - 7 * s4 + 64) >> 7, where the s values are in the range of 0..255. The sum can have values in the range -2550..35190, which does not fit into a signed 16bit integer. The code uses an arithmetic right shift, which does not yield the correct result for values >= 2^15; such values should be clipped to 255, yet are clipped to 0 instead. Fix this by shifting the values by 4096, so that the range is positive, use a logical right shift and subtract 32. bunny.mp4 from the FATE suite can be used to reproduce the problem. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-08 20:40:08 +02:00
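The overflow described in the message above, and the bias-based fix, can be reproduced with a scalar model. This is a sketch under stated assumptions: the function names are hypothetical, each 16-bit SIMD lane is modelled by a `uint16_t`, and the rounding constant 64 is included in the asserted range (the message quotes the range without it).

```c
#include <assert.h>
#include <stdint.h>

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Filter sum in full int precision, including the rounding constant. */
static int filter_sum(const uint8_t s[5])
{
    int sum = -1 * s[0] - 2 * s[1] + 96 * s[2] + 42 * s[3] - 7 * s[4] + 64;
    assert(sum >= -2550 + 64 && sum <= 35190 + 64);
    return sum;
}

/* Models the bug: the sum lives in a signed 16-bit lane and is shifted
 * arithmetically, so sums >= 2^15 wrap around to negative values. */
static uint8_t predict_signed16(const uint8_t s[5])
{
    uint16_t lane = (uint16_t)filter_sum(s);   /* wraps modulo 2^16 */
    int as_signed = lane >= 0x8000 ? (int)lane - 0x10000 : (int)lane;
    /* Floor division by 128, i.e. an arithmetic right shift by 7. */
    int shifted = as_signed >= 0 ? as_signed >> 7
                                 : -((-as_signed + 127) >> 7);
    return clip_u8(shifted);
}

/* Models the fix: add 4096 so the lane is never negative, shift logically,
 * then subtract 4096 >> 7 = 32 before clipping. */
static uint8_t predict_biased(const uint8_t s[5])
{
    uint16_t lane = (uint16_t)(filter_sum(s) + 4096);
    return clip_u8((int)(lane >> 7) - 32);
}
```

For the column {0, 0, 255, 255, 0} the sum is 35254, so the signed model yields 0 while the biased model yields the expected 255. Since 4096 is a multiple of 128, adding it before the shift and subtracting 32 afterwards does not change the rounding.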
Andreas Rheinhardt	ec2fe94b3f	avcodec/cavs: Remove unused parameter Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-08 20:40:08 +02:00
Michael Niedermayer	7896cc67c1	avcodec/exr: Check that DWA has 3 channels The implementation hardcodes access to 3 channels, so we need to check that Fixes: out of array access Fixes: BIGSLEEP-445394503-crash.exr Found-by: Google Big Sleep Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2025-10-08 00:27:49 +00:00
Michael Niedermayer	c911e00011	avcodec/exr: Round dc_w/h up Without rounding them up there are too few dc coeffs for the blocks. We do not know if this way of handling odd dimensions is correct, as we have no such DWA sample. thus we ask the user for a sample if she encounters such a file Fixes: out of array access Fixes: BIGSLEEP-445392027-crash.exr Found-by: Google Big Sleep Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2025-10-08 00:27:49 +00:00
Michael Niedermayer	8e078826da	avcodec/exr: check ac_size Fixes: out of array read Fixes: dwa_uncompress.py.crash.exr The code will read from the ac data even if ac_size is 0, thus that case is not implemented and we ask for a sample and error out cleanly Found-by: Google Big Sleep Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>	2025-10-08 00:27:49 +00:00
Baptiste Coudurier	ef60d5ac32	general: fix warning 'av_malloc_array' sizes specified with 'sizeof' in the earlier argument and not in the later argument [-Wcalloc-transposed-args] Fixes trac ticket #11620	2025-10-07 14:51:46 -07:00
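The warning named in the message above concerns argument order: `av_malloc_array()` takes the element count first and the element size second, like `calloc()`. The sketch below uses plain `calloc()` so that it stays self-contained; the helper `alloc_samples` is hypothetical and does not appear in FFmpeg.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate nb zero-initialized 32-bit samples. */
static int32_t *alloc_samples(size_t nb)
{
    assert(nb > 0);
    /* Transposed form that -Wcalloc-transposed-args flags:
     *     calloc(sizeof(int32_t), nb);
     * Preferred form, element count first and element size second: */
    return calloc(nb, sizeof(int32_t));
}
```

Both orders request the same number of bytes, so the change is about satisfying the compiler diagnostic and keeping call sites consistent rather than fixing a runtime bug. The caller releases the buffer with `free()`.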
Andreas Rheinhardt	00225e9ebc	avcodec/x86/h264_qpel: Simplify macros 1. Remove the OP parameter from the QPEL_H264* macros. These are a remnant of inline assembly and were forgotten in `610e00b359`. 2. Pass the instruction set extension for the shift5 function explicitly in the macro instead of using magic #defines. 3. Likewise, avoid magic #defines for (8\|16)_v_lowpass_ssse3. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-07 18:06:40 +02:00
Andreas Rheinhardt	3049694e9f	avcodec/x86/h264_qpel: Split hv2_lowpass_sse2 into size 8,16 funcs This is beneficial size-wise: 384B of new asm functions are more than outweighed by 416B savings from simpler calls here (for size 16, the size 8 function had been called twice). It also makes the code more readable, as it allowed removing several wrappers in h264_qpel.c. It is also beneficial performance-wise. Old benchmarks: avg_h264_qpel_16_mc12_8_c: 1757.7 ( 1.00x) avg_h264_qpel_16_mc12_8_sse2: 197.7 ( 8.89x) avg_h264_qpel_16_mc12_8_ssse3: 204.6 ( 8.59x) avg_h264_qpel_16_mc21_8_c: 1631.6 ( 1.00x) avg_h264_qpel_16_mc21_8_sse2: 276.4 ( 5.90x) avg_h264_qpel_16_mc21_8_ssse3: 290.7 ( 5.61x) avg_h264_qpel_16_mc22_8_c: 1122.7 ( 1.00x) avg_h264_qpel_16_mc22_8_sse2: 179.5 ( 6.25x) avg_h264_qpel_16_mc22_8_ssse3: 181.8 ( 6.17x) avg_h264_qpel_16_mc23_8_c: 1626.7 ( 1.00x) avg_h264_qpel_16_mc23_8_sse2: 276.8 ( 5.88x) avg_h264_qpel_16_mc23_8_ssse3: 290.9 ( 5.59x) avg_h264_qpel_16_mc32_8_c: 1754.1 ( 1.00x) avg_h264_qpel_16_mc32_8_sse2: 193.8 ( 9.05x) avg_h264_qpel_16_mc32_8_ssse3: 203.6 ( 8.62x) put_h264_qpel_16_mc12_8_c: 1733.6 ( 1.00x) put_h264_qpel_16_mc12_8_sse2: 189.6 ( 9.14x) put_h264_qpel_16_mc12_8_ssse3: 199.6 ( 8.69x) put_h264_qpel_16_mc21_8_c: 1616.0 ( 1.00x) put_h264_qpel_16_mc21_8_sse2: 284.3 ( 5.69x) put_h264_qpel_16_mc21_8_ssse3: 296.5 ( 5.45x) put_h264_qpel_16_mc22_8_c: 963.7 ( 1.00x) put_h264_qpel_16_mc22_8_sse2: 169.9 ( 5.67x) put_h264_qpel_16_mc22_8_ssse3: 186.1 ( 5.18x) put_h264_qpel_16_mc23_8_c: 1607.2 ( 1.00x) put_h264_qpel_16_mc23_8_sse2: 275.0 ( 5.84x) put_h264_qpel_16_mc23_8_ssse3: 297.8 ( 5.40x) put_h264_qpel_16_mc32_8_c: 1734.7 ( 1.00x) put_h264_qpel_16_mc32_8_sse2: 189.4 ( 9.16x) put_h264_qpel_16_mc32_8_ssse3: 199.4 ( 8.70x) New benchmarks: avg_h264_qpel_16_mc12_8_c: 1743.7 ( 1.00x) avg_h264_qpel_16_mc12_8_sse2: 189.7 ( 9.19x) avg_h264_qpel_16_mc12_8_ssse3: 204.4 ( 8.53x) avg_h264_qpel_16_mc21_8_c: 1637.7 ( 1.00x) avg_h264_qpel_16_mc21_8_sse2: 267.7 ( 
6.12x) avg_h264_qpel_16_mc21_8_ssse3: 291.5 ( 5.62x) avg_h264_qpel_16_mc22_8_c: 1150.3 ( 1.00x) avg_h264_qpel_16_mc22_8_sse2: 164.6 ( 6.99x) avg_h264_qpel_16_mc22_8_ssse3: 182.1 ( 6.32x) avg_h264_qpel_16_mc23_8_c: 1635.3 ( 1.00x) avg_h264_qpel_16_mc23_8_sse2: 268.5 ( 6.09x) avg_h264_qpel_16_mc23_8_ssse3: 298.5 ( 5.48x) avg_h264_qpel_16_mc32_8_c: 1740.6 ( 1.00x) avg_h264_qpel_16_mc32_8_sse2: 182.6 ( 9.53x) avg_h264_qpel_16_mc32_8_ssse3: 201.9 ( 8.62x) put_h264_qpel_16_mc12_8_c: 1727.4 ( 1.00x) put_h264_qpel_16_mc12_8_sse2: 188.1 ( 9.18x) put_h264_qpel_16_mc12_8_ssse3: 199.6 ( 8.65x) put_h264_qpel_16_mc21_8_c: 1623.5 ( 1.00x) put_h264_qpel_16_mc21_8_sse2: 265.9 ( 6.11x) put_h264_qpel_16_mc21_8_ssse3: 299.4 ( 5.42x) put_h264_qpel_16_mc22_8_c: 954.0 ( 1.00x) put_h264_qpel_16_mc22_8_sse2: 161.8 ( 5.89x) put_h264_qpel_16_mc22_8_ssse3: 180.4 ( 5.29x) put_h264_qpel_16_mc23_8_c: 1611.2 ( 1.00x) put_h264_qpel_16_mc23_8_sse2: 265.8 ( 6.06x) put_h264_qpel_16_mc23_8_ssse3: 300.3 ( 5.37x) put_h264_qpel_16_mc32_8_c: 1734.5 ( 1.00x) put_h264_qpel_16_mc32_8_sse2: 180.0 ( 9.63x) put_h264_qpel_16_mc32_8_ssse3: 199.7 ( 8.69x) Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-07 18:06:40 +02:00
Andreas Rheinhardt	3ed590c7b9	avcodec/x86/h264_qpel: Port qpel8or16_hv2_lowpass_op_mmxext to SSE2 This means that only blocksize 4 still uses mmx(ext). Old benchmarks: avg_h264_qpel_8_mc12_8_c: 428.4 ( 1.00x) avg_h264_qpel_8_mc12_8_sse2: 74.3 ( 5.77x) avg_h264_qpel_8_mc12_8_ssse3: 69.3 ( 6.18x) avg_h264_qpel_8_mc21_8_c: 401.4 ( 1.00x) avg_h264_qpel_8_mc21_8_sse2: 97.8 ( 4.10x) avg_h264_qpel_8_mc21_8_ssse3: 93.7 ( 4.28x) avg_h264_qpel_8_mc22_8_c: 281.8 ( 1.00x) avg_h264_qpel_8_mc22_8_sse2: 66.7 ( 4.23x) avg_h264_qpel_8_mc22_8_ssse3: 62.6 ( 4.50x) avg_h264_qpel_8_mc23_8_c: 397.2 ( 1.00x) avg_h264_qpel_8_mc23_8_sse2: 97.9 ( 4.06x) avg_h264_qpel_8_mc23_8_ssse3: 93.7 ( 4.24x) avg_h264_qpel_8_mc32_8_c: 432.4 ( 1.00x) avg_h264_qpel_8_mc32_8_sse2: 73.9 ( 5.85x) avg_h264_qpel_8_mc32_8_ssse3: 69.5 ( 6.22x) avg_h264_qpel_16_mc12_8_c: 1756.4 ( 1.00x) avg_h264_qpel_16_mc12_8_sse2: 240.0 ( 7.32x) avg_h264_qpel_16_mc12_8_ssse3: 204.5 ( 8.59x) avg_h264_qpel_16_mc21_8_c: 1635.3 ( 1.00x) avg_h264_qpel_16_mc21_8_sse2: 321.2 ( 5.09x) avg_h264_qpel_16_mc21_8_ssse3: 288.5 ( 5.67x) avg_h264_qpel_16_mc22_8_c: 1130.8 ( 1.00x) avg_h264_qpel_16_mc22_8_sse2: 219.4 ( 5.15x) avg_h264_qpel_16_mc22_8_ssse3: 182.2 ( 6.21x) avg_h264_qpel_16_mc23_8_c: 1622.5 ( 1.00x) avg_h264_qpel_16_mc23_8_sse2: 321.3 ( 5.05x) avg_h264_qpel_16_mc23_8_ssse3: 289.5 ( 5.60x) avg_h264_qpel_16_mc32_8_c: 1762.5 ( 1.00x) avg_h264_qpel_16_mc32_8_sse2: 236.1 ( 7.46x) avg_h264_qpel_16_mc32_8_ssse3: 205.2 ( 8.59x) put_h264_qpel_8_mc12_8_c: 427.2 ( 1.00x) put_h264_qpel_8_mc12_8_sse2: 72.1 ( 5.93x) put_h264_qpel_8_mc12_8_ssse3: 67.0 ( 6.38x) put_h264_qpel_8_mc21_8_c: 402.9 ( 1.00x) put_h264_qpel_8_mc21_8_sse2: 95.9 ( 4.20x) put_h264_qpel_8_mc21_8_ssse3: 91.9 ( 4.38x) put_h264_qpel_8_mc22_8_c: 235.0 ( 1.00x) put_h264_qpel_8_mc22_8_sse2: 64.6 ( 3.64x) put_h264_qpel_8_mc22_8_ssse3: 60.0 ( 3.92x) put_h264_qpel_8_mc23_8_c: 403.6 ( 1.00x) put_h264_qpel_8_mc23_8_sse2: 95.9 ( 4.21x) put_h264_qpel_8_mc23_8_ssse3: 91.7 ( 4.40x) 
put_h264_qpel_8_mc32_8_c: 430.7 ( 1.00x) put_h264_qpel_8_mc32_8_sse2: 72.1 ( 5.97x) put_h264_qpel_8_mc32_8_ssse3: 67.0 ( 6.43x) put_h264_qpel_16_mc12_8_c: 1724.2 ( 1.00x) put_h264_qpel_16_mc12_8_sse2: 230.7 ( 7.47x) put_h264_qpel_16_mc12_8_ssse3: 199.8 ( 8.63x) put_h264_qpel_16_mc21_8_c: 1613.3 ( 1.00x) put_h264_qpel_16_mc21_8_sse2: 327.5 ( 4.93x) put_h264_qpel_16_mc21_8_ssse3: 297.2 ( 5.43x) put_h264_qpel_16_mc22_8_c: 959.2 ( 1.00x) put_h264_qpel_16_mc22_8_sse2: 211.9 ( 4.53x) put_h264_qpel_16_mc22_8_ssse3: 186.1 ( 5.15x) put_h264_qpel_16_mc23_8_c: 1619.0 ( 1.00x) put_h264_qpel_16_mc23_8_sse2: 319.7 ( 5.06x) put_h264_qpel_16_mc23_8_ssse3: 299.2 ( 5.41x) put_h264_qpel_16_mc32_8_c: 1741.7 ( 1.00x) put_h264_qpel_16_mc32_8_sse2: 230.9 ( 7.54x) put_h264_qpel_16_mc32_8_ssse3: 199.4 ( 8.74x) New benchmarks: avg_h264_qpel_8_mc12_8_c: 427.2 ( 1.00x) avg_h264_qpel_8_mc12_8_sse2: 63.9 ( 6.69x) avg_h264_qpel_8_mc12_8_ssse3: 69.2 ( 6.18x) avg_h264_qpel_8_mc21_8_c: 399.2 ( 1.00x) avg_h264_qpel_8_mc21_8_sse2: 87.7 ( 4.55x) avg_h264_qpel_8_mc21_8_ssse3: 93.9 ( 4.25x) avg_h264_qpel_8_mc22_8_c: 285.7 ( 1.00x) avg_h264_qpel_8_mc22_8_sse2: 56.4 ( 5.07x) avg_h264_qpel_8_mc22_8_ssse3: 62.6 ( 4.56x) avg_h264_qpel_8_mc23_8_c: 398.6 ( 1.00x) avg_h264_qpel_8_mc23_8_sse2: 87.6 ( 4.55x) avg_h264_qpel_8_mc23_8_ssse3: 93.8 ( 4.25x) avg_h264_qpel_8_mc32_8_c: 425.8 ( 1.00x) avg_h264_qpel_8_mc32_8_sse2: 63.8 ( 6.67x) avg_h264_qpel_8_mc32_8_ssse3: 69.0 ( 6.17x) avg_h264_qpel_16_mc12_8_c: 1748.2 ( 1.00x) avg_h264_qpel_16_mc12_8_sse2: 198.5 ( 8.81x) avg_h264_qpel_16_mc12_8_ssse3: 203.2 ( 8.60x) avg_h264_qpel_16_mc21_8_c: 1638.1 ( 1.00x) avg_h264_qpel_16_mc21_8_sse2: 277.4 ( 5.91x) avg_h264_qpel_16_mc21_8_ssse3: 291.1 ( 5.63x) avg_h264_qpel_16_mc22_8_c: 1140.7 ( 1.00x) avg_h264_qpel_16_mc22_8_sse2: 180.3 ( 6.33x) avg_h264_qpel_16_mc22_8_ssse3: 181.9 ( 6.27x) avg_h264_qpel_16_mc23_8_c: 1629.9 ( 1.00x) avg_h264_qpel_16_mc23_8_sse2: 278.0 ( 5.86x) avg_h264_qpel_16_mc23_8_ssse3: 291.0 ( 5.60x) 
avg_h264_qpel_16_mc32_8_c: 1752.1 ( 1.00x) avg_h264_qpel_16_mc32_8_sse2: 193.7 ( 9.05x) avg_h264_qpel_16_mc32_8_ssse3: 203.4 ( 8.61x) put_h264_qpel_8_mc12_8_c: 421.8 ( 1.00x) put_h264_qpel_8_mc12_8_sse2: 61.7 ( 6.83x) put_h264_qpel_8_mc12_8_ssse3: 67.2 ( 6.28x) put_h264_qpel_8_mc21_8_c: 396.8 ( 1.00x) put_h264_qpel_8_mc21_8_sse2: 85.4 ( 4.65x) put_h264_qpel_8_mc21_8_ssse3: 91.6 ( 4.33x) put_h264_qpel_8_mc22_8_c: 234.1 ( 1.00x) put_h264_qpel_8_mc22_8_sse2: 54.4 ( 4.30x) put_h264_qpel_8_mc22_8_ssse3: 60.2 ( 3.89x) put_h264_qpel_8_mc23_8_c: 399.2 ( 1.00x) put_h264_qpel_8_mc23_8_sse2: 85.5 ( 4.67x) put_h264_qpel_8_mc23_8_ssse3: 91.8 ( 4.35x) put_h264_qpel_8_mc32_8_c: 422.2 ( 1.00x) put_h264_qpel_8_mc32_8_sse2: 61.8 ( 6.83x) put_h264_qpel_8_mc32_8_ssse3: 67.0 ( 6.30x) put_h264_qpel_16_mc12_8_c: 1720.3 ( 1.00x) put_h264_qpel_16_mc12_8_sse2: 189.9 ( 9.06x) put_h264_qpel_16_mc12_8_ssse3: 199.9 ( 8.61x) put_h264_qpel_16_mc21_8_c: 1624.5 ( 1.00x) put_h264_qpel_16_mc21_8_sse2: 285.4 ( 5.69x) put_h264_qpel_16_mc21_8_ssse3: 296.4 ( 5.48x) put_h264_qpel_16_mc22_8_c: 963.9 ( 1.00x) put_h264_qpel_16_mc22_8_sse2: 170.1 ( 5.67x) put_h264_qpel_16_mc22_8_ssse3: 186.4 ( 5.17x) put_h264_qpel_16_mc23_8_c: 1613.5 ( 1.00x) put_h264_qpel_16_mc23_8_sse2: 274.6 ( 5.88x) put_h264_qpel_16_mc23_8_ssse3: 300.4 ( 5.37x) put_h264_qpel_16_mc32_8_c: 1735.9 ( 1.00x) put_h264_qpel_16_mc32_8_sse2: 189.6 ( 9.15x) put_h264_qpel_16_mc32_8_ssse3: 199.5 ( 8.70x) Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-07 18:06:40 +02:00
Andreas Rheinhardt	617c042093	avcodec/x86/h264_qpel_8bit: Avoid doing unnecessary work Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-07 18:06:40 +02:00
Andreas Rheinhardt	29f439077a	avcodec/h264_qpel: Move loop into qpel4_hv_lowpass_v_mmxext() Every caller calls it three times in a loop, with slightly modified arguments. So it makes sense to move the loop into the callee. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-07 18:06:40 +02:00
Andreas Rheinhardt	4539f7e4d4	avcodec/x86/h264_qpel_8bit: Don't duplicate qpel4_hv_lowpass_v_mmxext Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-07 18:06:40 +02:00
Andreas Rheinhardt	3e2d9b73c1	avcodec/h264qpel: Move Snow-only code to snow.c Blocksize 2 is Snow-only, so move all the code pertaining to it to snow.c. Also make the put array in H264QpelContext smaller -- it only needs three sets of 16 function pointers. This continues `6eb8bc4217` and `b0c91c2fba`. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-07 18:06:40 +02:00
Andreas Rheinhardt	15a4289b79	avcodec/x86/h264_qpel_8bit: Improve register allocation None of the other registers need to be preserved at this time, so six XMM registers are always enough. Forgotten in `fa9ea5113b`. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-07 18:06:40 +02:00
Andreas Rheinhardt	dcfef80bd9	avcodec/pngenc: Mark unreachable default switch cases as such Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-07 17:36:25 +02:00
James Almer	6231fa7fb7	avcodec/av1dec: don't emit a warning when parsing isobmff style extradata No OBUs may be present and it's a valid scenario, so only warn when parsing raw extradata. Signed-off-by: James Almer <jamrial@gmail.com>	2025-10-05 22:23:51 -03:00
James Almer	78a16e42bd	avcodec/av1dec: don't overwrite container level color information if none is coded in the bitstream Signed-off-by: James Almer <jamrial@gmail.com>	2025-10-05 13:22:23 -03:00
James Almer	009e4a1c20	avcodec/libdav1d: also consider user defined color information when selecting pix_fmt Fixes issue #20624. Signed-off-by: James Almer <jamrial@gmail.com>	2025-10-05 13:22:23 -03:00
James Almer	99034b581f	avcodec/dcadsp: constify lfe_samples parameter Signed-off-by: James Almer <jamrial@gmail.com>	2025-10-04 14:18:30 -03:00
Andreas Rheinhardt	8fad52bd57	avcodec/x86/h264_qpel: Use ptrdiff_t for strides Avoids having to sign-extend the strides in the assembly (it also is more correct given that the qpel_mc_func already uses ptrdiff_t). Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 07:06:33 +02:00
Andreas Rheinhardt	495c3d03ae	avcodec/x86/h264_qpel_10bit: Remove SSE2 "cache64" duplicates The horizontal 10bit MC SSE2 functions are currently duplicated: They exist both in ordinary form as well as with a "sse2_cache64" suffix. A comment in ff_h264qpel_init_x86() indicates that this is due to older processors not liking accesses that cross cache lines, yet these functions are identical to the non-cache64 functions (apart from the unavoidable changes in the rip-offset). The only difference between these functions and the ordinary ones are that the cache64 ones are created via a special form of the INIT_XMM macro: "INIT_XMM sse2, cache64". This affects the name and apparently defines cpuflags_cache64, yet nothing checks for this, so both versions are identical. So remove the cache64 ones and treat the remaining ones like ordinary SSE2 functions. Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 07:06:33 +02:00
Andreas Rheinhardt	697da64c8e	avcodec/x86/h264_qpel: Port pixel8_l2_shift5 from MMXEXT to SSE2 This abides by the ABI (no missing emms) and yields a tiny performance improvement here. Old benchmarks: avg_h264_qpel_8_mc12_8_c: 419.9 ( 1.00x) avg_h264_qpel_8_mc12_8_sse2: 78.9 ( 5.32x) avg_h264_qpel_8_mc12_8_ssse3: 71.7 ( 5.86x) avg_h264_qpel_8_mc32_8_c: 429.1 ( 1.00x) avg_h264_qpel_8_mc32_8_sse2: 76.9 ( 5.58x) avg_h264_qpel_8_mc32_8_ssse3: 73.4 ( 5.84x) put_h264_qpel_8_mc12_8_c: 424.0 ( 1.00x) put_h264_qpel_8_mc12_8_sse2: 78.6 ( 5.40x) put_h264_qpel_8_mc12_8_ssse3: 70.6 ( 6.00x) put_h264_qpel_8_mc32_8_c: 425.7 ( 1.00x) put_h264_qpel_8_mc32_8_sse2: 75.2 ( 5.66x) put_h264_qpel_8_mc32_8_ssse3: 70.4 ( 6.05x) New benchmarks: avg_h264_qpel_8_mc12_8_c: 425.7 ( 1.00x) avg_h264_qpel_8_mc12_8_sse2: 77.5 ( 5.49x) avg_h264_qpel_8_mc12_8_ssse3: 69.8 ( 6.10x) avg_h264_qpel_8_mc32_8_c: 423.7 ( 1.00x) avg_h264_qpel_8_mc32_8_sse2: 74.6 ( 5.68x) avg_h264_qpel_8_mc32_8_ssse3: 71.9 ( 5.89x) put_h264_qpel_8_mc12_8_c: 422.2 ( 1.00x) put_h264_qpel_8_mc12_8_sse2: 75.8 ( 5.57x) put_h264_qpel_8_mc12_8_ssse3: 67.9 ( 6.22x) put_h264_qpel_8_mc32_8_c: 421.8 ( 1.00x) put_h264_qpel_8_mc32_8_sse2: 72.6 ( 5.81x) put_h264_qpel_8_mc32_8_ssse3: 67.7 ( 6.23x) Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 07:06:33 +02:00
Andreas Rheinhardt	4ac9162beb	avcodec/x86/h264_qpel: Don't use ff_ prefix for static functions Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 07:06:33 +02:00
Andreas Rheinhardt	cd077e88d1	avcodec/x86/h264_qpel: Add ff_{avg,put}_h264_qpel16_h_lowpass_l2_sse2() These functions are currently emulated via four calls to the versions for 8x8 blocks. In fact, the size savings from the simplified calls in h264_qpel.c (GCC 1344B, Clang 1280B) more than outweigh the size of the added functions (512B) here. It is also beneficial performance-wise. Old benchmarks: avg_h264_qpel_16_mc11_8_c: 1414.1 ( 1.00x) avg_h264_qpel_16_mc11_8_sse2: 206.2 ( 6.86x) avg_h264_qpel_16_mc11_8_ssse3: 177.7 ( 7.96x) avg_h264_qpel_16_mc13_8_c: 1417.0 ( 1.00x) avg_h264_qpel_16_mc13_8_sse2: 207.4 ( 6.83x) avg_h264_qpel_16_mc13_8_ssse3: 178.2 ( 7.95x) avg_h264_qpel_16_mc21_8_c: 1632.8 ( 1.00x) avg_h264_qpel_16_mc21_8_sse2: 349.3 ( 4.67x) avg_h264_qpel_16_mc21_8_ssse3: 291.3 ( 5.60x) avg_h264_qpel_16_mc23_8_c: 1640.2 ( 1.00x) avg_h264_qpel_16_mc23_8_sse2: 351.3 ( 4.67x) avg_h264_qpel_16_mc23_8_ssse3: 290.8 ( 5.64x) avg_h264_qpel_16_mc31_8_c: 1411.7 ( 1.00x) avg_h264_qpel_16_mc31_8_sse2: 203.4 ( 6.94x) avg_h264_qpel_16_mc31_8_ssse3: 178.9 ( 7.89x) avg_h264_qpel_16_mc33_8_c: 1409.7 ( 1.00x) avg_h264_qpel_16_mc33_8_sse2: 204.6 ( 6.89x) avg_h264_qpel_16_mc33_8_ssse3: 178.1 ( 7.92x) put_h264_qpel_16_mc11_8_c: 1391.0 ( 1.00x) put_h264_qpel_16_mc11_8_sse2: 197.4 ( 7.05x) put_h264_qpel_16_mc11_8_ssse3: 176.1 ( 7.90x) put_h264_qpel_16_mc13_8_c: 1395.9 ( 1.00x) put_h264_qpel_16_mc13_8_sse2: 196.7 ( 7.10x) put_h264_qpel_16_mc13_8_ssse3: 177.7 ( 7.85x) put_h264_qpel_16_mc21_8_c: 1609.5 ( 1.00x) put_h264_qpel_16_mc21_8_sse2: 341.1 ( 4.72x) put_h264_qpel_16_mc21_8_ssse3: 289.2 ( 5.57x) put_h264_qpel_16_mc23_8_c: 1604.0 ( 1.00x) put_h264_qpel_16_mc23_8_sse2: 340.9 ( 4.71x) put_h264_qpel_16_mc23_8_ssse3: 289.6 ( 5.54x) put_h264_qpel_16_mc31_8_c: 1390.2 ( 1.00x) put_h264_qpel_16_mc31_8_sse2: 194.6 ( 7.14x) put_h264_qpel_16_mc31_8_ssse3: 176.4 ( 7.88x) put_h264_qpel_16_mc33_8_c: 1400.4 ( 1.00x) put_h264_qpel_16_mc33_8_sse2: 198.5 ( 7.06x) put_h264_qpel_16_mc33_8_ssse3: 176.2 
( 7.95x) New benchmarks: avg_h264_qpel_16_mc11_8_c: 1413.3 ( 1.00x) avg_h264_qpel_16_mc11_8_sse2: 171.8 ( 8.23x) avg_h264_qpel_16_mc11_8_ssse3: 173.0 ( 8.17x) avg_h264_qpel_16_mc13_8_c: 1423.2 ( 1.00x) avg_h264_qpel_16_mc13_8_sse2: 172.0 ( 8.27x) avg_h264_qpel_16_mc13_8_ssse3: 173.4 ( 8.21x) avg_h264_qpel_16_mc21_8_c: 1641.3 ( 1.00x) avg_h264_qpel_16_mc21_8_sse2: 322.1 ( 5.10x) avg_h264_qpel_16_mc21_8_ssse3: 291.3 ( 5.63x) avg_h264_qpel_16_mc23_8_c: 1629.1 ( 1.00x) avg_h264_qpel_16_mc23_8_sse2: 323.0 ( 5.04x) avg_h264_qpel_16_mc23_8_ssse3: 293.3 ( 5.55x) avg_h264_qpel_16_mc31_8_c: 1409.2 ( 1.00x) avg_h264_qpel_16_mc31_8_sse2: 172.0 ( 8.19x) avg_h264_qpel_16_mc31_8_ssse3: 173.7 ( 8.11x) avg_h264_qpel_16_mc33_8_c: 1402.5 ( 1.00x) avg_h264_qpel_16_mc33_8_sse2: 172.5 ( 8.13x) avg_h264_qpel_16_mc33_8_ssse3: 173.6 ( 8.08x) put_h264_qpel_16_mc11_8_c: 1393.7 ( 1.00x) put_h264_qpel_16_mc11_8_sse2: 170.4 ( 8.18x) put_h264_qpel_16_mc11_8_ssse3: 178.2 ( 7.82x) put_h264_qpel_16_mc13_8_c: 1398.0 ( 1.00x) put_h264_qpel_16_mc13_8_sse2: 170.2 ( 8.21x) put_h264_qpel_16_mc13_8_ssse3: 178.6 ( 7.83x) put_h264_qpel_16_mc21_8_c: 1619.6 ( 1.00x) put_h264_qpel_16_mc21_8_sse2: 320.6 ( 5.05x) put_h264_qpel_16_mc21_8_ssse3: 297.2 ( 5.45x) put_h264_qpel_16_mc23_8_c: 1617.4 ( 1.00x) put_h264_qpel_16_mc23_8_sse2: 320.0 ( 5.05x) put_h264_qpel_16_mc23_8_ssse3: 297.4 ( 5.44x) put_h264_qpel_16_mc31_8_c: 1389.7 ( 1.00x) put_h264_qpel_16_mc31_8_sse2: 169.9 ( 8.18x) put_h264_qpel_16_mc31_8_ssse3: 178.1 ( 7.80x) put_h264_qpel_16_mc33_8_c: 1394.0 ( 1.00x) put_h264_qpel_16_mc33_8_sse2: 170.9 ( 8.16x) put_h264_qpel_16_mc33_8_ssse3: 176.9 ( 7.88x) Notice that the SSSE3 versions of mc21 and mc23 benefit from an optimized version of hv2_lowpass. Also notice that there is no SSE2 version of the purely horizontal motion compensation. This means that src2 is currently always aligned when calling the SSE2 functions (and that srcStride is always equal to the block width). Yet this has not been exploited (yet). 
Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 07:06:33 +02:00
Andreas Rheinhardt	4880fa4dca	avcodec/x86/h264_qpel_8bit: Remove dead macro Forgotten in `4011a76494`. Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 07:06:33 +02:00
Andreas Rheinhardt	35aaf697e9	avcodec/x86/h264_qpel_8bit: Replace qpel8_h_lowpass_l2 MMXEXT by SSE2 Using xmm registers here is very natural, as it allows to operate on eight words at a time. It also saves 48B here and does not clobber the MMX state. Old benchmarks (only tests affected by the modified function are shown): avg_h264_qpel_8_mc11_8_c: 352.2 ( 1.00x) avg_h264_qpel_8_mc11_8_sse2: 70.4 ( 5.00x) avg_h264_qpel_8_mc11_8_ssse3: 53.9 ( 6.53x) avg_h264_qpel_8_mc13_8_c: 353.3 ( 1.00x) avg_h264_qpel_8_mc13_8_sse2: 72.8 ( 4.86x) avg_h264_qpel_8_mc13_8_ssse3: 53.8 ( 6.57x) avg_h264_qpel_8_mc21_8_c: 404.0 ( 1.00x) avg_h264_qpel_8_mc21_8_sse2: 116.1 ( 3.48x) avg_h264_qpel_8_mc21_8_ssse3: 94.3 ( 4.28x) avg_h264_qpel_8_mc23_8_c: 398.9 ( 1.00x) avg_h264_qpel_8_mc23_8_sse2: 118.6 ( 3.36x) avg_h264_qpel_8_mc23_8_ssse3: 94.8 ( 4.21x) avg_h264_qpel_8_mc31_8_c: 352.7 ( 1.00x) avg_h264_qpel_8_mc31_8_sse2: 71.4 ( 4.94x) avg_h264_qpel_8_mc31_8_ssse3: 53.8 ( 6.56x) avg_h264_qpel_8_mc33_8_c: 354.0 ( 1.00x) avg_h264_qpel_8_mc33_8_sse2: 70.6 ( 5.01x) avg_h264_qpel_8_mc33_8_ssse3: 53.7 ( 6.59x) avg_h264_qpel_16_mc11_8_c: 1417.0 ( 1.00x) avg_h264_qpel_16_mc11_8_sse2: 276.9 ( 5.12x) avg_h264_qpel_16_mc11_8_ssse3: 178.8 ( 7.92x) avg_h264_qpel_16_mc13_8_c: 1427.3 ( 1.00x) avg_h264_qpel_16_mc13_8_sse2: 277.4 ( 5.14x) avg_h264_qpel_16_mc13_8_ssse3: 179.7 ( 7.94x) avg_h264_qpel_16_mc21_8_c: 1634.1 ( 1.00x) avg_h264_qpel_16_mc21_8_sse2: 421.3 ( 3.88x) avg_h264_qpel_16_mc21_8_ssse3: 291.2 ( 5.61x) avg_h264_qpel_16_mc23_8_c: 1627.0 ( 1.00x) avg_h264_qpel_16_mc23_8_sse2: 420.8 ( 3.87x) avg_h264_qpel_16_mc23_8_ssse3: 291.0 ( 5.59x) avg_h264_qpel_16_mc31_8_c: 1418.4 ( 1.00x) avg_h264_qpel_16_mc31_8_sse2: 278.5 ( 5.09x) avg_h264_qpel_16_mc31_8_ssse3: 178.6 ( 7.94x) avg_h264_qpel_16_mc33_8_c: 1407.3 ( 1.00x) avg_h264_qpel_16_mc33_8_sse2: 277.6 ( 5.07x) avg_h264_qpel_16_mc33_8_ssse3: 179.9 ( 7.82x) put_h264_qpel_8_mc11_8_c: 348.1 ( 1.00x) put_h264_qpel_8_mc11_8_sse2: 69.1 ( 5.04x) 
put_h264_qpel_8_mc11_8_ssse3: 53.8 ( 6.47x) put_h264_qpel_8_mc13_8_c: 349.3 ( 1.00x) put_h264_qpel_8_mc13_8_sse2: 69.7 ( 5.01x) put_h264_qpel_8_mc13_8_ssse3: 53.7 ( 6.51x) put_h264_qpel_8_mc21_8_c: 398.5 ( 1.00x) put_h264_qpel_8_mc21_8_sse2: 115.0 ( 3.46x) put_h264_qpel_8_mc21_8_ssse3: 95.3 ( 4.18x) put_h264_qpel_8_mc23_8_c: 399.9 ( 1.00x) put_h264_qpel_8_mc23_8_sse2: 120.8 ( 3.31x) put_h264_qpel_8_mc23_8_ssse3: 95.4 ( 4.19x) put_h264_qpel_8_mc31_8_c: 350.4 ( 1.00x) put_h264_qpel_8_mc31_8_sse2: 69.6 ( 5.03x) put_h264_qpel_8_mc31_8_ssse3: 54.2 ( 6.47x) put_h264_qpel_8_mc33_8_c: 353.1 ( 1.00x) put_h264_qpel_8_mc33_8_sse2: 71.0 ( 4.97x) put_h264_qpel_8_mc33_8_ssse3: 54.2 ( 6.51x) put_h264_qpel_16_mc11_8_c: 1384.2 ( 1.00x) put_h264_qpel_16_mc11_8_sse2: 272.9 ( 5.07x) put_h264_qpel_16_mc11_8_ssse3: 178.3 ( 7.76x) put_h264_qpel_16_mc13_8_c: 1393.6 ( 1.00x) put_h264_qpel_16_mc13_8_sse2: 271.1 ( 5.14x) put_h264_qpel_16_mc13_8_ssse3: 178.3 ( 7.82x) put_h264_qpel_16_mc21_8_c: 1612.6 ( 1.00x) put_h264_qpel_16_mc21_8_sse2: 416.5 ( 3.87x) put_h264_qpel_16_mc21_8_ssse3: 289.1 ( 5.58x) put_h264_qpel_16_mc23_8_c: 1621.3 ( 1.00x) put_h264_qpel_16_mc23_8_sse2: 416.9 ( 3.89x) put_h264_qpel_16_mc23_8_ssse3: 289.4 ( 5.60x) put_h264_qpel_16_mc31_8_c: 1408.4 ( 1.00x) put_h264_qpel_16_mc31_8_sse2: 273.5 ( 5.15x) put_h264_qpel_16_mc31_8_ssse3: 176.9 ( 7.96x) put_h264_qpel_16_mc33_8_c: 1396.4 ( 1.00x) put_h264_qpel_16_mc33_8_sse2: 276.3 ( 5.05x) put_h264_qpel_16_mc33_8_ssse3: 176.4 ( 7.92x) New benchmarks: avg_h264_qpel_8_mc11_8_c: 352.1 ( 1.00x) avg_h264_qpel_8_mc11_8_sse2: 52.5 ( 6.71x) avg_h264_qpel_8_mc11_8_ssse3: 53.9 ( 6.54x) avg_h264_qpel_8_mc13_8_c: 350.8 ( 1.00x) avg_h264_qpel_8_mc13_8_sse2: 54.7 ( 6.42x) avg_h264_qpel_8_mc13_8_ssse3: 54.3 ( 6.46x) avg_h264_qpel_8_mc21_8_c: 400.1 ( 1.00x) avg_h264_qpel_8_mc21_8_sse2: 98.6 ( 4.06x) avg_h264_qpel_8_mc21_8_ssse3: 95.5 ( 4.19x) avg_h264_qpel_8_mc23_8_c: 400.4 ( 1.00x) avg_h264_qpel_8_mc23_8_sse2: 101.4 ( 3.95x) 
avg_h264_qpel_8_mc23_8_ssse3: 95.9 ( 4.18x) avg_h264_qpel_8_mc31_8_c: 352.4 ( 1.00x) avg_h264_qpel_8_mc31_8_sse2: 52.9 ( 6.67x) avg_h264_qpel_8_mc31_8_ssse3: 54.4 ( 6.48x) avg_h264_qpel_8_mc33_8_c: 354.5 ( 1.00x) avg_h264_qpel_8_mc33_8_sse2: 52.9 ( 6.70x) avg_h264_qpel_8_mc33_8_ssse3: 54.4 ( 6.52x) avg_h264_qpel_16_mc11_8_c: 1420.4 ( 1.00x) avg_h264_qpel_16_mc11_8_sse2: 204.8 ( 6.93x) avg_h264_qpel_16_mc11_8_ssse3: 177.9 ( 7.98x) avg_h264_qpel_16_mc13_8_c: 1409.8 ( 1.00x) avg_h264_qpel_16_mc13_8_sse2: 206.4 ( 6.83x) avg_h264_qpel_16_mc13_8_ssse3: 178.0 ( 7.92x) avg_h264_qpel_16_mc21_8_c: 1634.1 ( 1.00x) avg_h264_qpel_16_mc21_8_sse2: 349.6 ( 4.67x) avg_h264_qpel_16_mc21_8_ssse3: 290.0 ( 5.63x) avg_h264_qpel_16_mc23_8_c: 1624.1 ( 1.00x) avg_h264_qpel_16_mc23_8_sse2: 350.0 ( 4.64x) avg_h264_qpel_16_mc23_8_ssse3: 291.9 ( 5.56x) avg_h264_qpel_16_mc31_8_c: 1407.2 ( 1.00x) avg_h264_qpel_16_mc31_8_sse2: 205.8 ( 6.84x) avg_h264_qpel_16_mc31_8_ssse3: 178.2 ( 7.90x) avg_h264_qpel_16_mc33_8_c: 1400.5 ( 1.00x) avg_h264_qpel_16_mc33_8_sse2: 206.3 ( 6.79x) avg_h264_qpel_16_mc33_8_ssse3: 179.4 ( 7.81x) put_h264_qpel_8_mc11_8_c: 349.7 ( 1.00x) put_h264_qpel_8_mc11_8_sse2: 50.2 ( 6.96x) put_h264_qpel_8_mc11_8_ssse3: 51.3 ( 6.82x) put_h264_qpel_8_mc13_8_c: 349.8 ( 1.00x) put_h264_qpel_8_mc13_8_sse2: 50.7 ( 6.90x) put_h264_qpel_8_mc13_8_ssse3: 51.7 ( 6.76x) put_h264_qpel_8_mc21_8_c: 398.0 ( 1.00x) put_h264_qpel_8_mc21_8_sse2: 96.5 ( 4.13x) put_h264_qpel_8_mc21_8_ssse3: 92.3 ( 4.31x) put_h264_qpel_8_mc23_8_c: 401.4 ( 1.00x) put_h264_qpel_8_mc23_8_sse2: 102.3 ( 3.92x) put_h264_qpel_8_mc23_8_ssse3: 92.8 ( 4.32x) put_h264_qpel_8_mc31_8_c: 349.4 ( 1.00x) put_h264_qpel_8_mc31_8_sse2: 50.8 ( 6.88x) put_h264_qpel_8_mc31_8_ssse3: 51.8 ( 6.75x) put_h264_qpel_8_mc33_8_c: 351.1 ( 1.00x) put_h264_qpel_8_mc33_8_sse2: 52.2 ( 6.73x) put_h264_qpel_8_mc33_8_ssse3: 51.7 ( 6.79x) put_h264_qpel_16_mc11_8_c: 1391.1 ( 1.00x) put_h264_qpel_16_mc11_8_sse2: 196.6 ( 7.07x) put_h264_qpel_16_mc11_8_ssse3: 178.2 ( 
7.81x) put_h264_qpel_16_mc13_8_c: 1385.2 ( 1.00x) put_h264_qpel_16_mc13_8_sse2: 195.6 ( 7.08x) put_h264_qpel_16_mc13_8_ssse3: 176.6 ( 7.84x) put_h264_qpel_16_mc21_8_c: 1607.5 ( 1.00x) put_h264_qpel_16_mc21_8_sse2: 341.0 ( 4.71x) put_h264_qpel_16_mc21_8_ssse3: 289.1 ( 5.56x) put_h264_qpel_16_mc23_8_c: 1616.7 ( 1.00x) put_h264_qpel_16_mc23_8_sse2: 340.8 ( 4.74x) put_h264_qpel_16_mc23_8_ssse3: 288.6 ( 5.60x) put_h264_qpel_16_mc31_8_c: 1397.6 ( 1.00x) put_h264_qpel_16_mc31_8_sse2: 197.3 ( 7.08x) put_h264_qpel_16_mc31_8_ssse3: 175.4 ( 7.97x) put_h264_qpel_16_mc33_8_c: 1394.3 ( 1.00x) put_h264_qpel_16_mc33_8_sse2: 197.7 ( 7.05x) put_h264_qpel_16_mc33_8_ssse3: 175.2 ( 7.96x) As can be seen, the SSE2 version is often neck and neck with the SSSE3 version (which also benefits from a better hv2_lowpass SSSE3 implementation for mc21 and mc23) for eight-byte block sizes. Unsurprisingly, SSSE3 beats SSE2 for 16x16 blocks: For SSE2, these blocks are processed by calling the 8x8 function four times whereas SSSE3 has a dedicated function (on x64). This implementation should also be extendable to an AVX version for 16x16 blocks. Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 07:06:33 +02:00
Andreas Rheinhardt	fa9ea5113b	avcodec/x86/h264_qpel_8bit: Optimize branch away ff_{avg,put}_h264_qpel8or16_hv2_lowpass_ssse3() currently is almost the disjoint union of the codepaths for sizes 8 and 16. This size is a compile-time constant at every callsite. So split the function and avoid the runtime branch. Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 07:06:33 +02:00
Andreas Rheinhardt	400203c00c	avcodec/x86/h264_qpel: Remove unused parameter from hv2_lowpass funcs tmpstride is unused. This also allows removing said parameter from lots of functions in h264_qpel.c. Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 07:06:33 +02:00
Andreas Rheinhardt	b84c818c83	avcodec/x86/h264_qpel: Remove constant parameters from shift5 funcs They are constant since the size 16 version is no longer emulated via the size 8 version. Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 07:06:33 +02:00
Andreas Rheinhardt	810bd3e62a	avcodec/x86/h264_qpel: Add ff_{avg,put}_pixels16_l2_shift5_sse2 Up until now this function was emulated via two calls to ff_{avg,put}_pixels8_l2_shift5_mmxext(). Adding a dedicated function proved beneficial both size wise and performance wise: The new functions take 192B, yet the simplified calls save 256B with GCC and 320B with Clang here. This change will also allow further optimizations. Old benchmarks: avg_h264_qpel_16_mc12_8_c: 1735.8 ( 1.00x) avg_h264_qpel_16_mc12_8_sse2: 300.8 ( 5.77x) avg_h264_qpel_16_mc12_8_ssse3: 233.3 ( 7.44x) avg_h264_qpel_16_mc32_8_c: 1777.9 ( 1.00x) avg_h264_qpel_16_mc32_8_sse2: 275.6 ( 6.45x) avg_h264_qpel_16_mc32_8_ssse3: 235.7 ( 7.54x) put_h264_qpel_16_mc12_8_c: 1808.2 ( 1.00x) put_h264_qpel_16_mc12_8_sse2: 267.2 ( 6.77x) put_h264_qpel_16_mc12_8_ssse3: 231.9 ( 7.80x) put_h264_qpel_16_mc32_8_c: 1766.9 ( 1.00x) put_h264_qpel_16_mc32_8_sse2: 272.9 ( 6.47x) put_h264_qpel_16_mc32_8_ssse3: 229.5 ( 7.70x) New benchmarks: avg_h264_qpel_16_mc12_8_c: 1742.3 ( 1.00x) avg_h264_qpel_16_mc12_8_sse2: 240.3 ( 7.25x) avg_h264_qpel_16_mc12_8_ssse3: 214.8 ( 8.11x) avg_h264_qpel_16_mc32_8_c: 1748.0 ( 1.00x) avg_h264_qpel_16_mc32_8_sse2: 238.0 ( 7.35x) avg_h264_qpel_16_mc32_8_ssse3: 209.2 ( 8.35x) put_h264_qpel_16_mc12_8_c: 2014.4 ( 1.00x) put_h264_qpel_16_mc12_8_sse2: 243.7 ( 8.27x) put_h264_qpel_16_mc12_8_ssse3: 211.5 ( 9.52x) put_h264_qpel_16_mc32_8_c: 1800.0 ( 1.00x) put_h264_qpel_16_mc32_8_sse2: 238.8 ( 7.54x) put_h264_qpel_16_mc32_8_ssse3: 206.7 ( 8.71x) Reviewed-by: James Almer <jamrial@gmail.com> Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2025-10-04 07:06:33 +02:00

52871 Commits