FFmpeg

mirror of https://mirror.skon.top/https://github.com/FFmpeg/FFmpeg synced 2026-04-22 05:40:27 +08:00

Author	SHA1	Message	Date
Niklas Haas	a797e30f71	swscale/aarch64/ops: compute SWS_OP_PACK mask directly Instead of implicitly relying on SwsComps.unused, which contains the exact same information. (cf. ff_sws_op_list_update_comps) Signed-off-by: Niklas Haas <git@haasn.dev>	2026-04-16 23:25:17 +02:00
Niklas Haas	6d1e549195	swscale/aarch64/ops: use SWS_OP_NEEDED() instead of next->comps.unused These are basically identical, but the latter is being phased out. Signed-off-by: Niklas Haas <git@haasn.dev>	2026-04-16 23:25:17 +02:00
Niklas Haas	18cc71fc8e	swscale/aarch64/ops: fix SWS_OP_LINEAR mask check The implementation of AARCH64_SWS_OP_LINEAR loops over elements of this mask to determine which output rows to compute. However, it is being set by this loop to `op->comps.unused`, which is a mask of unused input rows. As such, it should be looking at `next->comps.unused` instead. This did not result in problems in practice, because none of the linear matrices happened to trigger this case (more input columns than output rows). Signed-off-by: Niklas Haas <git@haasn.dev>	2026-04-16 23:25:17 +02:00
Niklas Haas	85bef2c2bc	swscale/ops: split SwsConst up into op-specific structs It was a bit clunky, lacked semantic contextual information, and made it harder to reason about the effects of extending this struct. There should be zero runtime overhead as a result of the fact that this is already a big union. I made the changes in this commit by hand, but due to the length and noise level of the commit, I used Opus 4.6 to verify that I did not accidentally introduce any bugs or typos. Signed-off-by: Niklas Haas <git@haasn.dev>	2026-04-02 11:48:15 +00:00
Niklas Haas	32ba5c13de	swscale/ops_chain: split generic setup helpers into op-specific helpers This has the side benefit of not relying on the q2pixel macro to avoid division by zero, since we can now explicitly avoid operating on undefined clear values. Signed-off-by: Niklas Haas <git@haasn.dev>	2026-04-02 11:48:15 +00:00
Ramiro Polla	53537f6cf5	swscale/aarch64: mark CPS kernel functions as indirect branch targets Only the process functions are entered via an indirect _call_ from C. The kernel functions and process_return are dispatched to by indirect _branches_ instead (continuation-passing style design). Make use of the recently added "jumpable" parameter to the function macro in libavutil/aarch64/asm.S to fix these functions when BTI is enabled. Sponsored-by: Sovereign Tech Fund Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>	2026-03-31 11:48:52 +00:00
Ramiro Polla	2517c328fc	swscale/aarch64: add NEON sws_ops backend This commit pieces together the previous few commits to implement the NEON backend for sws_ops. In essence, a tool which runs on the target (sws_ops_aarch64) is used to enumerate all the functions that the backend needs to implement. The list it generates is stored in the repository (ops_entries.c). The list from above is used at build time by a code generator tool (ops_asmgen) to implement all the sws_ops functions the NEON backend supports, and generate a lookup function in C to retrieve the assembly function pointers. At runtime, the NEON backend fetches the function pointers to the assembly functions and chains them together in a continuation-passing style design, similar to the x86 backend. The following speedup is observed from legacy swscale to NEON: A520: Overall speedup=3.780x faster, min=0.137x max=91.928x A720: Overall speedup=4.129x faster, min=0.234x max=92.424x And the following from the C sws_ops implementation to NEON: A520: Overall speedup=5.513x faster, min=0.927x max=14.169x A720: Overall speedup=4.786x faster, min=0.585x max=20.157x The slowdowns from legacy to NEON are the same for C/x86. Mostly low bit-depth conversions that did not perform dithering in legacy. The 0.585x outlier from C to NEON is gbrpf32le -> gbrapf32le, which is mostly memcpy with the C implementation. All other conversions are better. Sponsored-by: Sovereign Tech Fund Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>	2026-03-30 11:38:35 +00:00
Ramiro Polla	534757926f	swscale/aarch64: introduce ops_asmgen for NEON backend The NEON sws_ops backend follows the same continuation-passing style design as the x86 backend. Unlike the C and x86 backends, which implement the various operation functions through the use of templates and preprocessor macros, the NEON backend uses a build-time code generator, which is introduced by this commit. This code generator has two modes of operation: -ops: Generates an assembly file in GNU assembler syntax targeting AArch64, which implements all the sws_ops functions the NEON backend supports. -lookup: Generates a C function with a hierarchical condition chain that returns the pointer to one of the functions generated above, based on a given set of parameters derived from SwsOp. This is the core of the NEON sws_ops backend. Sponsored-by: Sovereign Tech Fund Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>	2026-03-30 11:38:35 +00:00
Ramiro Polla	991611536c	swscale/aarch64: introduce a runtime aarch64 assembler interface The runtime assembler interface provides an instruction-level IR and builder API tailored to the needs of the swscale dynamic pipeline. It is not meant to be a general purpose assembler interface. Currently only a static file backend, which emits GNU assembler text, has been implemented. In the future, this interface will be used to write functions dynamically at runtime. This code will be compiled both for runtime usage to generate optimized functions and for build-time usage to generate static assembly files. Therefore, it must not depend on internal FFmpeg libraries. Sponsored-by: Sovereign Tech Fund Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>	2026-03-30 11:38:35 +00:00
Ramiro Polla	a1bfaa0e78	swscale/aarch64: introduce tool to enumerate sws_ops for NEON backend The NEON sws_ops backend will use a build-time code generator for the various operation functions it needs to implement. This build-time code generator (ops_asmgen) will need a list of the operations that must be implemented. This commit adds a tool (sws_ops_aarch64) that generates such a list (ops_entries.c). The list is generated by iterating over all possible conversion combinations and collecting the parameters for each NEON assembly function that has to be implemented, defined by a unique set of parameters derived from SwsOp. Whenever swscale evolves, with improved optimization passes, new pixel formats, or improvements to the backend itself, this file (ops_entries.c) should be regenerated by running: $ make sws_ops_entries_aarch64 Sponsored-by: Sovereign Tech Fund Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>	2026-03-30 11:38:35 +00:00
David Christle	2c7fe8d8ad	swscale/aarch64: add NEON rgb32tobgr24 and rgb24tobgr32 conversions Add NEON alpha drop/insert using ldp+tbl+stp instead of ld4/st3 and ld3/st4 structure operations. Both use a 2-register sliding-window tbl with post-indexed addressing. Instruction scheduling targets narrow in-order cores (A55) while remaining neutral on wide OoO. Scalar tails use coalesced loads/stores (ldr+strh+lsr+strb for alpha drop, ldrh+ldrb+orr+str for alpha insert) to reduce per-pixel instruction count. Independent instructions placed between loads and dependent operations to fill load-use latency on in-order cores. checkasm --bench on Apple M3 Max (decicycles, 1920px): rgb32tobgr24_c: 114.4 ( 1.00x) rgb32tobgr24_neon: 64.3 ( 1.78x) rgb24tobgr32_c: 128.9 ( 1.00x) rgb24tobgr32_neon: 80.9 ( 1.59x) C baseline is clang auto-vectorized; speedup is over compiler NEON. Signed-off-by: David Christle <dev@christle.is>	2026-03-04 10:30:08 +00:00
David Christle	ddd720ae61	swscale/aarch64: add NEON rgb24tobgr24 byte-swap Add a NEON rgb24tobgr24 using ld3/st3 to swap R and B channels in packed 24bpp RGB buffers. Handles all input sizes with a 16-pixel NEON fast path, 8-pixel NEON cleanup, and scalar tail. checkasm --bench on Apple M3 Max (1920*3 = 5760 bytes): rgb24tobgr24_c: 722.0 ( 1.00x) rgb24tobgr24_neon: 94.9 ( 7.61x) Signed-off-by: David Christle <dev@christle.is>	2026-03-04 10:30:08 +00:00
David Christle	7fab0becab	swscale/aarch64: add NEON YUV420P/YUV422P/YUVA420P to RGB conversion Add ARM64 NEON-accelerated unscaled YUV-to-RGB conversion for planar YUV input formats. This extends the existing NV12/NV21 NEON paths with YUV420P, YUV422P, and YUVA420P support for all packed RGB output formats (ARGB, RGBA, ABGR, BGRA, RGB24, BGR24) and planar GBRP. Register with ff_yuv2rgb_init_aarch64() to also cover the scaled path. checkasm: all 42 sw_yuv2rgb tests pass. Speedup vs C at 1920px width (Apple M3 Max, avg of 20 runs): yuv420p->rgb24: 4.3x yuv420p->argb: 3.1x yuv422p->rgb24: 5.5x yuv422p->argb: 4.1x yuva420p->argb: 3.5x yuva420p->rgba: 3.5x Signed-off-by: David Christle <dev@christle.is>	2026-03-02 13:14:07 +00:00
Arpad Panyik	1f30ff30fb	swscale: Add AArch64 Neon path for xyz12Torgb48 LE Add optimized Neon code path for the little endian case of the xyz12Torgb48 function. The innermost loop processes the data in 4x2 pixel blocks using software gathers with the matrix multiplication and clipping done by Neon. Relative runtime of micro benchmarks after this patch on some Cortex and Neoverse CPU cores: xyz12le_rgb48le X1 X3 X4 X925 V2 16x4_neon: 2.55x 4.34x 3.84x 3.31x 3.22x 32x4_neon: 2.39x 3.63x 3.22x 3.35x 3.29x 64x4_neon: 2.37x 3.31x 2.91x 3.33x 3.27x 128x4_neon: 2.34x 3.28x 2.91x 3.35x 3.24x 256x4_neon: 2.30x 3.17x 2.91x 3.32x 3.10x 512x4_neon: 2.26x 3.10x 2.91x 3.30x 3.07x 1024x4_neon: 2.26x 3.07x 2.96x 3.30x 3.05x 1920x4_neon: 2.26x 3.06x 2.93x 3.28x 3.04x xyz12le_rgb48le A76 A78 A715 A720 A725 16x4_neon: 2.33x 2.28x 2.53x 3.33x 3.19x 32x4_neon: 2.35x 2.18x 2.45x 3.23x 3.24x 64x4_neon: 2.35x 2.16x 2.42x 3.15x 3.21x 128x4_neon: 2.35x 2.13x 2.39x 3.00x 3.09x 256x4_neon: 2.36x 2.12x 2.35x 2.85x 2.99x 512x4_neon: 2.35x 2.14x 2.35x 2.78x 2.95x 1024x4_neon: 2.31x 2.09x 2.33x 2.80x 2.91x 1920x4_neon: 2.30x 2.07x 2.32x 2.81x 2.94x xyz12le_rgb48le A55 A510 A520 16x4_neon: 2.09x 1.92x 2.36x 32x4_neon: 2.05x 1.89x 2.38x 64x4_neon: 2.02x 1.77x 2.35x 128x4_neon: 1.96x 1.74x 2.25x 256x4_neon: 1.90x 1.72x 2.19x 512x4_neon: 1.83x 1.75x 2.16x 1024x4_neon: 1.83x 1.62x 2.15x 1920x4_neon: 1.82x 1.60x 2.15x Signed-off-by: Arpad Panyik <Arpad.Panyik@arm.com>	2025-12-05 10:28:18 +00:00
Dash Santosh	ca2a88c1b3	swscale/output: Implement yuv2nv12cx neon assembly yuv2nv12cX_2_512_accurate_c: 3540.1 ( 1.00x) yuv2nv12cX_2_512_accurate_neon: 408.0 ( 8.68x) yuv2nv12cX_2_512_approximate_c: 3521.4 ( 1.00x) yuv2nv12cX_2_512_approximate_neon: 409.2 ( 8.61x) yuv2nv12cX_4_512_accurate_c: 4740.0 ( 1.00x) yuv2nv12cX_4_512_accurate_neon: 604.4 ( 7.84x) yuv2nv12cX_4_512_approximate_c: 4681.9 ( 1.00x) yuv2nv12cX_4_512_approximate_neon: 603.3 ( 7.76x) yuv2nv12cX_8_512_accurate_c: 7273.1 ( 1.00x) yuv2nv12cX_8_512_accurate_neon: 1012.2 ( 7.19x) yuv2nv12cX_8_512_approximate_c: 7223.0 ( 1.00x) yuv2nv12cX_8_512_approximate_neon: 1015.8 ( 7.11x) yuv2nv12cX_16_512_accurate_c: 13762.0 ( 1.00x) yuv2nv12cX_16_512_accurate_neon: 1761.4 ( 7.81x) yuv2nv12cX_16_512_approximate_c: 13884.0 ( 1.00x) yuv2nv12cX_16_512_approximate_neon: 1766.8 ( 7.86x) Benchmarked on: Snapdragon(R) X Elite - X1E80100 - Qualcomm(R) Oryon(TM) CPU 3417 Mhz, 12 Core(s), 12 Logical Processor(s)	2025-08-12 09:05:00 +00:00
Logaprakash Ramajayam	49477972b7	swscale/aarch64/output: Implement neon assembly for yuv2planeX_10_c_template() yuv2yuvX_8_2_0_512_accurate_c: 2213.4 ( 1.00x) yuv2yuvX_8_2_0_512_accurate_neon: 147.5 (15.01x) yuv2yuvX_8_2_0_512_approximate_c: 2203.9 ( 1.00x) yuv2yuvX_8_2_0_512_approximate_neon: 154.1 (14.30x) yuv2yuvX_8_2_16_512_accurate_c: 2147.2 ( 1.00x) yuv2yuvX_8_2_16_512_accurate_neon: 150.8 (14.24x) yuv2yuvX_8_2_16_512_approximate_c: 2149.7 ( 1.00x) yuv2yuvX_8_2_16_512_approximate_neon: 146.8 (14.64x) yuv2yuvX_8_2_32_512_accurate_c: 2078.9 ( 1.00x) yuv2yuvX_8_2_32_512_accurate_neon: 139.0 (14.95x) yuv2yuvX_8_2_32_512_approximate_c: 2083.7 ( 1.00x) yuv2yuvX_8_2_32_512_approximate_neon: 140.5 (14.84x) yuv2yuvX_8_2_48_512_accurate_c: 2010.7 ( 1.00x) yuv2yuvX_8_2_48_512_accurate_neon: 138.2 (14.55x) yuv2yuvX_8_2_48_512_approximate_c: 2012.6 ( 1.00x) yuv2yuvX_8_2_48_512_approximate_neon: 141.2 (14.26x) yuv2yuvX_10LE_16_0_512_accurate_c: 7874.1 ( 1.00x) yuv2yuvX_10LE_16_0_512_accurate_neon: 831.6 ( 9.47x) yuv2yuvX_10LE_16_0_512_approximate_c: 7918.1 ( 1.00x) yuv2yuvX_10LE_16_0_512_approximate_neon: 836.1 ( 9.47x) yuv2yuvX_10LE_16_16_512_accurate_c: 7630.9 ( 1.00x) yuv2yuvX_10LE_16_16_512_accurate_neon: 804.5 ( 9.49x) yuv2yuvX_10LE_16_16_512_approximate_c: 7724.7 ( 1.00x) yuv2yuvX_10LE_16_16_512_approximate_neon: 808.6 ( 9.55x) yuv2yuvX_10LE_16_32_512_accurate_c: 7436.4 ( 1.00x) yuv2yuvX_10LE_16_32_512_accurate_neon: 780.4 ( 9.53x) yuv2yuvX_10LE_16_32_512_approximate_c: 7366.7 ( 1.00x) yuv2yuvX_10LE_16_32_512_approximate_neon: 780.5 ( 9.44x) yuv2yuvX_10LE_16_48_512_accurate_c: 7099.9 ( 1.00x) yuv2yuvX_10LE_16_48_512_accurate_neon: 761.0 ( 9.33x) yuv2yuvX_10LE_16_48_512_approximate_c: 7097.6 ( 1.00x) yuv2yuvX_10LE_16_48_512_approximate_neon: 754.6 ( 9.41x) Benchmarked on: Snapdragon(R) X Elite - X1E80100 - Qualcomm(R) Oryon(TM) CPU 3417 Mhz, 12 Core(s), 12 Logical Processor(s)	2025-08-12 09:05:00 +00:00
Timo Rothenpieler	262d41c804	all: fix typos found by codespell	2025-08-03 13:48:47 +02:00
Martin Storsjö	73f4668ef8	swscale: aarch64: Simplify the assignment of lumToYV12 We normally don't need else statements here; the common pattern is to assign lower level SIMD implementations first, then conditionally reassign higher level ones afterwards, if supported. Signed-off-by: Martin Storsjö <martin@martin.st>	2025-03-10 14:03:58 +02:00
Krzysztof Pyrkosz	d765e5f043	swscale/aarch64: dotprod implementation of rgba32_to_Y The idea is to split the 16 bit coefficients into lower and upper half, invoke udot for the lower half, shift by 8, and follow by udot for the upper half. Benchmark on A78: bgra_to_y_128_c: 682.0 ( 1.00x) bgra_to_y_128_neon: 181.2 ( 3.76x) bgra_to_y_128_dotprod: 117.8 ( 5.79x) bgra_to_y_1080_c: 5742.5 ( 1.00x) bgra_to_y_1080_neon: 1472.5 ( 3.90x) bgra_to_y_1080_dotprod: 906.5 ( 6.33x) bgra_to_y_1920_c: 10194.0 ( 1.00x) bgra_to_y_1920_neon: 2589.8 ( 3.94x) bgra_to_y_1920_dotprod: 1573.8 ( 6.48x) Signed-off-by: Martin Storsjö <martin@martin.st>	2025-03-04 10:16:44 +02:00
Krzysztof Pyrkosz	38929b824b	swscale/aarch64: Refactor hscale_16_to_15__fs_4 This patch removes the use of stack for temporary state and replaces interleaved ld4 loads with ld1. Before/after: A78 hscale_16_to_15__fs_4_dstW_8_neon: 86.8 ( 1.72x) hscale_16_to_15__fs_4_dstW_24_neon: 147.5 ( 2.73x) hscale_16_to_15__fs_4_dstW_128_neon: 614.0 ( 3.14x) hscale_16_to_15__fs_4_dstW_144_neon: 680.5 ( 3.18x) hscale_16_to_15__fs_4_dstW_256_neon: 1193.2 ( 3.19x) hscale_16_to_15__fs_4_dstW_512_neon: 2305.0 ( 3.27x) hscale_16_to_15__fs_4_dstW_8_neon: 86.0 ( 1.74x) hscale_16_to_15__fs_4_dstW_24_neon: 106.8 ( 3.78x) hscale_16_to_15__fs_4_dstW_128_neon: 404.0 ( 4.81x) hscale_16_to_15__fs_4_dstW_144_neon: 451.8 ( 4.80x) hscale_16_to_15__fs_4_dstW_256_neon: 760.5 ( 5.06x) hscale_16_to_15__fs_4_dstW_512_neon: 1520.0 ( 5.01x) A72 hscale_16_to_15__fs_4_dstW_8_neon: 156.8 ( 1.52x) hscale_16_to_15__fs_4_dstW_24_neon: 217.8 ( 2.52x) hscale_16_to_15__fs_4_dstW_128_neon: 906.8 ( 2.90x) hscale_16_to_15__fs_4_dstW_144_neon: 1014.5 ( 2.91x) hscale_16_to_15__fs_4_dstW_256_neon: 1751.5 ( 2.96x) hscale_16_to_15__fs_4_dstW_512_neon: 3469.3 ( 2.97x) hscale_16_to_15__fs_4_dstW_8_neon: 151.2 ( 1.54x) hscale_16_to_15__fs_4_dstW_24_neon: 173.4 ( 3.15x) hscale_16_to_15__fs_4_dstW_128_neon: 660.0 ( 3.98x) hscale_16_to_15__fs_4_dstW_144_neon: 735.7 ( 4.00x) hscale_16_to_15__fs_4_dstW_256_neon: 1273.5 ( 4.09x) hscale_16_to_15__fs_4_dstW_512_neon: 2488.2 ( 4.16x) Signed-off-by: Martin Storsjö <martin@martin.st>	2025-03-02 01:17:29 +02:00
Martin Storsjö	b137347278	aarch64: Fix a few misindented lines Signed-off-by: Martin Storsjö <martin@martin.st>	2025-02-28 23:23:09 +02:00
Krzysztof Pyrkosz	b92577405b	swscale/aarch64/rgb2rgb_neon: Implemented {yuyv, uyvy}toyuv{420, 422} A78: uyvytoyuv420_neon: 6112.5 ( 6.96x) uyvytoyuv422_neon: 6696.0 ( 6.32x) yuyvtoyuv420_neon: 6113.0 ( 6.95x) yuyvtoyuv422_neon: 6695.2 ( 6.31x) A72: uyvytoyuv420_neon: 9512.1 ( 6.09x) uyvytoyuv422_neon: 9766.8 ( 6.32x) yuyvtoyuv420_neon: 9639.1 ( 6.00x) yuyvtoyuv422_neon: 9779.0 ( 6.03x) A53: uyvytoyuv420_neon: 12720.1 ( 9.10x) uyvytoyuv422_neon: 14282.9 ( 6.71x) yuyvtoyuv420_neon: 12637.4 ( 9.15x) yuyvtoyuv422_neon: 14127.6 ( 6.77x) Signed-off-by: Martin Storsjö <martin@martin.st>	2025-02-17 11:39:42 +02:00
Krzysztof Pyrkosz	64107e22f5	swscale/aarch64/rgb24toyv12: skip early right shift by 2 It's a minor improvement that shaves off 5-8% from the execution time. Instead of shifting by 2 right away and by 7 soon after, shift by 9 one time. Times before and after: A78: rgb24toyv12_16_200_neon: 5366.8 ( 3.62x) rgb24toyv12_128_60_neon: 13574.0 ( 3.34x) rgb24toyv12_512_16_neon: 14463.8 ( 3.33x) rgb24toyv12_1920_4_neon: 13508.2 ( 3.34x) rgb24toyv12_1920_4_negstride_neon: 13525.0 ( 3.34x) rgb24toyv12_16_200_neon: 5293.8 ( 3.66x) rgb24toyv12_128_60_neon: 12955.0 ( 3.50x) rgb24toyv12_512_16_neon: 13784.0 ( 3.50x) rgb24toyv12_1920_4_neon: 12900.8 ( 3.49x) rgb24toyv12_1920_4_negstride_neon: 12902.8 ( 3.49x) A72: rgb24toyv12_16_200_neon: 9695.8 ( 2.50x) rgb24toyv12_128_60_neon: 20286.6 ( 2.70x) rgb24toyv12_512_16_neon: 22276.6 ( 2.57x) rgb24toyv12_1920_4_neon: 19154.1 ( 2.77x) rgb24toyv12_1920_4_negstride_neon: 19055.1 ( 2.78x) rgb24toyv12_16_200_neon: 9214.8 ( 2.65x) rgb24toyv12_128_60_neon: 20731.5 ( 2.65x) rgb24toyv12_512_16_neon: 21145.0 ( 2.70x) rgb24toyv12_1920_4_neon: 17586.5 ( 2.99x) rgb24toyv12_1920_4_negstride_neon: 17571.0 ( 2.98x) A53: rgb24toyv12_16_200_neon: 12880.4 ( 3.76x) rgb24toyv12_128_60_neon: 27776.3 ( 3.94x) rgb24toyv12_512_16_neon: 29411.3 ( 3.94x) rgb24toyv12_1920_4_neon: 27253.1 ( 3.98x) rgb24toyv12_1920_4_negstride_neon: 27474.3 ( 3.95x) rgb24toyv12_16_200_neon: 12196.3 ( 3.95x) rgb24toyv12_128_60_neon: 26943.1 ( 4.07x) rgb24toyv12_512_16_neon: 28642.3 ( 4.07x) rgb24toyv12_1920_4_neon: 26676.6 ( 4.08x) rgb24toyv12_1920_4_negstride_neon: 26713.8 ( 4.07x) Signed-off-by: Martin Storsjö <martin@martin.st>	2025-02-17 10:49:41 +02:00
Krzysztof Pyrkosz	c85a748979	swscale/aarch64/rgb2rgb: Implemented NEON shuf routines The key idea is to pass the pre-generated tables to the TBL instruction and churn through the data 16 bytes at a time. The remaining 4 elements are handled with a specialized block located at the end of the routine. The 3210 variant can be implemented using rev32, but surprisingly it is slower than the generic TBL on A78, but much faster on A72. There may be some room for improvement. Possibly instead of handling last 8 and then 4 bytes separately, we can load these 4 into {v0.s}[2] and process along with the last 8 bytes. Speeds measured with checkasm --test=sw_rgb --bench --runs=10 | grep shuf - A78 shuffle_bytes_0321_c: 75.5 ( 1.00x) shuffle_bytes_0321_neon: 26.5 ( 2.85x) shuffle_bytes_1203_c: 136.2 ( 1.00x) shuffle_bytes_1203_neon: 27.2 ( 5.00x) shuffle_bytes_1230_c: 135.5 ( 1.00x) shuffle_bytes_1230_neon: 28.0 ( 4.84x) shuffle_bytes_2013_c: 138.8 ( 1.00x) shuffle_bytes_2013_neon: 22.0 ( 6.31x) shuffle_bytes_2103_c: 76.5 ( 1.00x) shuffle_bytes_2103_neon: 20.5 ( 3.73x) shuffle_bytes_2130_c: 137.5 ( 1.00x) shuffle_bytes_2130_neon: 28.0 ( 4.91x) shuffle_bytes_3012_c: 138.2 ( 1.00x) shuffle_bytes_3012_neon: 21.5 ( 6.43x) shuffle_bytes_3102_c: 138.2 ( 1.00x) shuffle_bytes_3102_neon: 27.2 ( 5.07x) shuffle_bytes_3210_c: 138.0 ( 1.00x) shuffle_bytes_3210_neon: 22.0 ( 6.27x) shuf3210 using rev32 shuffle_bytes_3210_c: 139.0 ( 1.00x) shuffle_bytes_3210_neon: 28.5 ( 4.88x) - A72 shuffle_bytes_0321_c: 120.0 ( 1.00x) shuffle_bytes_0321_neon: 36.0 ( 3.33x) shuffle_bytes_1203_c: 188.2 ( 1.00x) shuffle_bytes_1203_neon: 37.8 ( 4.99x) shuffle_bytes_1230_c: 195.0 ( 1.00x) shuffle_bytes_1230_neon: 36.0 ( 5.42x) shuffle_bytes_2013_c: 195.8 ( 1.00x) shuffle_bytes_2013_neon: 43.5 ( 4.50x) shuffle_bytes_2103_c: 117.2 ( 1.00x) shuffle_bytes_2103_neon: 53.5 ( 2.19x) shuffle_bytes_2130_c: 203.2 ( 1.00x) shuffle_bytes_2130_neon: 37.8 ( 5.38x) shuffle_bytes_3012_c: 183.8 ( 1.00x) shuffle_bytes_3012_neon: 46.8 ( 3.93x) shuffle_bytes_3102_c: 180.8 ( 1.00x) shuffle_bytes_3102_neon: 37.8 ( 4.79x) shuffle_bytes_3210_c: 195.8 ( 1.00x) shuffle_bytes_3210_neon: 37.8 ( 5.19x) shuf3210 using rev32 shuffle_bytes_3210_c: 194.8 ( 1.00x) shuffle_bytes_3210_neon: 30.8 ( 6.33x) - x13s: shuffle_bytes_0321_c: 49.4 ( 1.00x) shuffle_bytes_0321_neon: 18.1 ( 2.72x) shuffle_bytes_1203_c: 98.4 ( 1.00x) shuffle_bytes_1203_neon: 18.4 ( 5.35x) shuffle_bytes_1230_c: 97.4 ( 1.00x) shuffle_bytes_1230_neon: 19.1 ( 5.09x) shuffle_bytes_2013_c: 101.4 ( 1.00x) shuffle_bytes_2013_neon: 16.9 ( 6.01x) shuffle_bytes_2103_c: 53.9 ( 1.00x) shuffle_bytes_2103_neon: 13.9 ( 3.88x) shuffle_bytes_2130_c: 100.9 ( 1.00x) shuffle_bytes_2130_neon: 19.1 ( 5.27x) shuffle_bytes_3012_c: 97.4 ( 1.00x) shuffle_bytes_3012_neon: 17.1 ( 5.69x) shuffle_bytes_3102_c: 100.9 ( 1.00x) shuffle_bytes_3102_neon: 19.1 ( 5.27x) shuffle_bytes_3210_c: 100.6 ( 1.00x) shuffle_bytes_3210_neon: 16.9 ( 5.96x) shuf3210 using rev32 shuffle_bytes_3210_c: 100.6 ( 1.00x) shuffle_bytes_3210_neon: 18.6 ( 5.40x) Signed-off-by: Martin Storsjö <martin@martin.st>	2025-02-07 12:54:55 +02:00
Krzysztof Pyrkosz	e25a19fc7c	swscale/aarch64/output.S: refactor ff_yuv2plane1_8_neon The benchmarks (before vs after) were gathered using ./tests/checkasm/checkasm --test=sw_scale --bench --runs=6 | grep yuv2yuv1 A78 before: yuv2yuv1_0_512_accurate_c: 2039.5 ( 1.00x) yuv2yuv1_0_512_accurate_neon: 385.5 ( 5.29x) yuv2yuv1_0_512_approximate_c: 2110.5 ( 1.00x) yuv2yuv1_0_512_approximate_neon: 385.5 ( 5.47x) yuv2yuv1_3_512_accurate_c: 2061.2 ( 1.00x) yuv2yuv1_3_512_accurate_neon: 381.2 ( 5.41x) yuv2yuv1_3_512_approximate_c: 2099.2 ( 1.00x) yuv2yuv1_3_512_approximate_neon: 381.2 ( 5.51x) yuv2yuv1_8_512_accurate_c: 2054.2 ( 1.00x) yuv2yuv1_8_512_accurate_neon: 385.5 ( 5.33x) yuv2yuv1_8_512_approximate_c: 2112.2 ( 1.00x) yuv2yuv1_8_512_approximate_neon: 385.5 ( 5.48x) yuv2yuv1_11_512_accurate_c: 2036.0 ( 1.00x) yuv2yuv1_11_512_accurate_neon: 381.2 ( 5.34x) yuv2yuv1_11_512_approximate_c: 2115.0 ( 1.00x) yuv2yuv1_11_512_approximate_neon: 381.2 ( 5.55x) yuv2yuv1_16_512_accurate_c: 2066.5 ( 1.00x) yuv2yuv1_16_512_accurate_neon: 385.5 ( 5.36x) yuv2yuv1_16_512_approximate_c: 2100.8 ( 1.00x) yuv2yuv1_16_512_approximate_neon: 385.5 ( 5.45x) yuv2yuv1_19_512_accurate_c: 2059.8 ( 1.00x) yuv2yuv1_19_512_accurate_neon: 381.2 ( 5.40x) yuv2yuv1_19_512_approximate_c: 2102.8 ( 1.00x) yuv2yuv1_19_512_approximate_neon: 381.2 ( 5.52x) After: yuv2yuv1_0_512_accurate_c: 2206.0 ( 1.00x) yuv2yuv1_0_512_accurate_neon: 139.2 (15.84x) yuv2yuv1_0_512_approximate_c: 2050.0 ( 1.00x) yuv2yuv1_0_512_approximate_neon: 139.2 (14.72x) yuv2yuv1_3_512_accurate_c: 2205.2 ( 1.00x) yuv2yuv1_3_512_accurate_neon: 138.0 (15.98x) yuv2yuv1_3_512_approximate_c: 2052.5 ( 1.00x) yuv2yuv1_3_512_approximate_neon: 138.0 (14.87x) yuv2yuv1_8_512_accurate_c: 2171.0 ( 1.00x) yuv2yuv1_8_512_accurate_neon: 139.2 (15.59x) yuv2yuv1_8_512_approximate_c: 2064.2 ( 1.00x) yuv2yuv1_8_512_approximate_neon: 139.2 (14.82x) yuv2yuv1_11_512_accurate_c: 2164.8 ( 1.00x) yuv2yuv1_11_512_accurate_neon: 138.0 (15.69x) yuv2yuv1_11_512_approximate_c: 2048.8 ( 1.00x) yuv2yuv1_11_512_approximate_neon: 138.0 (14.85x) yuv2yuv1_16_512_accurate_c: 2154.5 ( 1.00x) yuv2yuv1_16_512_accurate_neon: 139.2 (15.47x) yuv2yuv1_16_512_approximate_c: 2047.2 ( 1.00x) yuv2yuv1_16_512_approximate_neon: 139.2 (14.70x) yuv2yuv1_19_512_accurate_c: 2144.5 ( 1.00x) yuv2yuv1_19_512_accurate_neon: 138.0 (15.54x) yuv2yuv1_19_512_approximate_c: 2046.0 ( 1.00x) yuv2yuv1_19_512_approximate_neon: 138.0 (14.83x) A72 before: yuv2yuv1_0_512_accurate_c: 3779.8 ( 1.00x) yuv2yuv1_0_512_accurate_neon: 527.8 ( 7.16x) yuv2yuv1_0_512_approximate_c: 4128.2 ( 1.00x) yuv2yuv1_0_512_approximate_neon: 528.2 ( 7.81x) yuv2yuv1_3_512_accurate_c: 3836.2 ( 1.00x) yuv2yuv1_3_512_accurate_neon: 527.0 ( 7.28x) yuv2yuv1_3_512_approximate_c: 3991.0 ( 1.00x) yuv2yuv1_3_512_approximate_neon: 526.8 ( 7.58x) yuv2yuv1_8_512_accurate_c: 3732.8 ( 1.00x) yuv2yuv1_8_512_accurate_neon: 525.5 ( 7.10x) yuv2yuv1_8_512_approximate_c: 4060.0 ( 1.00x) yuv2yuv1_8_512_approximate_neon: 527.0 ( 7.70x) yuv2yuv1_11_512_accurate_c: 3836.2 ( 1.00x) yuv2yuv1_11_512_accurate_neon: 530.0 ( 7.24x) yuv2yuv1_11_512_approximate_c: 4014.0 ( 1.00x) yuv2yuv1_11_512_approximate_neon: 530.0 ( 7.57x) yuv2yuv1_16_512_accurate_c: 3726.2 ( 1.00x) yuv2yuv1_16_512_accurate_neon: 525.5 ( 7.09x) yuv2yuv1_16_512_approximate_c: 4114.2 ( 1.00x) yuv2yuv1_16_512_approximate_neon: 526.2 ( 7.82x) yuv2yuv1_19_512_accurate_c: 3812.2 ( 1.00x) yuv2yuv1_19_512_accurate_neon: 530.0 ( 7.19x) yuv2yuv1_19_512_approximate_c: 4012.2 ( 1.00x) yuv2yuv1_19_512_approximate_neon: 530.0 ( 7.57x) After: yuv2yuv1_0_512_accurate_c: 3716.8 ( 1.00x) yuv2yuv1_0_512_accurate_neon: 215.1 (17.28x) yuv2yuv1_0_512_approximate_c: 3877.8 ( 1.00x) yuv2yuv1_0_512_approximate_neon: 222.8 (17.40x) yuv2yuv1_3_512_accurate_c: 3717.1 ( 1.00x) yuv2yuv1_3_512_accurate_neon: 217.8 (17.06x) yuv2yuv1_3_512_approximate_c: 3801.6 ( 1.00x) yuv2yuv1_3_512_approximate_neon: 220.3 (17.25x) yuv2yuv1_8_512_accurate_c: 3716.6 ( 1.00x) yuv2yuv1_8_512_accurate_neon: 213.8 (17.38x) yuv2yuv1_8_512_approximate_c: 3831.8 ( 1.00x) yuv2yuv1_8_512_approximate_neon: 218.1 (17.57x) yuv2yuv1_11_512_accurate_c: 3717.1 ( 1.00x) yuv2yuv1_11_512_accurate_neon: 219.1 (16.97x) yuv2yuv1_11_512_approximate_c: 3801.6 ( 1.00x) yuv2yuv1_11_512_approximate_neon: 216.1 (17.59x) yuv2yuv1_16_512_accurate_c: 3716.6 ( 1.00x) yuv2yuv1_16_512_accurate_neon: 213.6 (17.40x) yuv2yuv1_16_512_approximate_c: 3831.6 ( 1.00x) yuv2yuv1_16_512_approximate_neon: 215.1 (17.82x) yuv2yuv1_19_512_accurate_c: 3717.1 ( 1.00x) yuv2yuv1_19_512_accurate_neon: 223.8 (16.61x) yuv2yuv1_19_512_approximate_c: 3801.6 ( 1.00x) yuv2yuv1_19_512_approximate_neon: 219.1 (17.35x) x13s before: yuv2yuv1_0_512_accurate_c: 1435.1 ( 1.00x) yuv2yuv1_0_512_accurate_neon: 221.1 ( 6.49x) yuv2yuv1_0_512_approximate_c: 1405.4 ( 1.00x) yuv2yuv1_0_512_approximate_neon: 219.1 ( 6.41x) yuv2yuv1_3_512_accurate_c: 1418.6 ( 1.00x) yuv2yuv1_3_512_accurate_neon: 215.9 ( 6.57x) yuv2yuv1_3_512_approximate_c: 1405.9 ( 1.00x) yuv2yuv1_3_512_approximate_neon: 224.1 ( 6.27x) yuv2yuv1_8_512_accurate_c: 1433.9 ( 1.00x) yuv2yuv1_8_512_accurate_neon: 218.6 ( 6.56x) yuv2yuv1_8_512_approximate_c: 1412.9 ( 1.00x) yuv2yuv1_8_512_approximate_neon: 218.9 ( 6.46x) yuv2yuv1_11_512_accurate_c: 1449.1 ( 1.00x) yuv2yuv1_11_512_accurate_neon: 217.6 ( 6.66x) yuv2yuv1_11_512_approximate_c: 1410.9 ( 1.00x) yuv2yuv1_11_512_approximate_neon: 221.1 ( 6.38x) yuv2yuv1_16_512_accurate_c: 1402.1 ( 1.00x) yuv2yuv1_16_512_accurate_neon: 214.6 ( 6.53x) yuv2yuv1_16_512_approximate_c: 1422.4 ( 1.00x) yuv2yuv1_16_512_approximate_neon: 222.9 ( 6.38x) yuv2yuv1_19_512_accurate_c: 1421.6 ( 1.00x) yuv2yuv1_19_512_accurate_neon: 217.4 ( 6.54x) yuv2yuv1_19_512_approximate_c: 1421.6 ( 1.00x) yuv2yuv1_19_512_approximate_neon: 221.4 ( 6.42x) After: yuv2yuv1_0_512_accurate_c: 1413.6 ( 1.00x) yuv2yuv1_0_512_accurate_neon: 80.6 (17.53x) yuv2yuv1_0_512_approximate_c: 1455.6 ( 1.00x) yuv2yuv1_0_512_approximate_neon: 80.6 (18.05x) yuv2yuv1_3_512_accurate_c: 1429.1 ( 1.00x) yuv2yuv1_3_512_accurate_neon: 77.4 (18.47x) yuv2yuv1_3_512_approximate_c: 1462.6 ( 1.00x) yuv2yuv1_3_512_approximate_neon: 80.6 (18.14x) yuv2yuv1_8_512_accurate_c: 1425.4 ( 1.00x) yuv2yuv1_8_512_accurate_neon: 77.9 (18.30x) yuv2yuv1_8_512_approximate_c: 1436.6 ( 1.00x) yuv2yuv1_8_512_approximate_neon: 80.9 (17.76x) yuv2yuv1_11_512_accurate_c: 1429.4 ( 1.00x) yuv2yuv1_11_512_accurate_neon: 76.1 (18.78x) yuv2yuv1_11_512_approximate_c: 1447.1 ( 1.00x) yuv2yuv1_11_512_approximate_neon: 78.4 (18.46x) yuv2yuv1_16_512_accurate_c: 1439.9 ( 1.00x) yuv2yuv1_16_512_accurate_neon: 77.6 (18.55x) yuv2yuv1_16_512_approximate_c: 1422.1 ( 1.00x) yuv2yuv1_16_512_approximate_neon: 78.1 (18.20x) yuv2yuv1_19_512_accurate_c: 1447.1 ( 1.00x) yuv2yuv1_19_512_accurate_neon: 78.1 (18.52x) yuv2yuv1_19_512_approximate_c: 1474.4 ( 1.00x) yuv2yuv1_19_512_approximate_neon: 78.1 (18.87x) Signed-off-by: Martin Storsjö <martin@martin.st>	2025-02-07 12:05:06 +02:00
Ramiro Polla	ca889b1328	swscale/aarch64: add neon {lum,chr}ConvertRange16 aarch64 A55: chrRangeFromJpeg16_1920_c: 32684.2 chrRangeFromJpeg16_1920_neon: 8431.2 (3.88x) chrRangeToJpeg16_1920_c: 24996.8 chrRangeToJpeg16_1920_neon: 9395.0 (2.66x) lumRangeFromJpeg16_1920_c: 17305.2 lumRangeFromJpeg16_1920_neon: 4586.5 (3.77x) lumRangeToJpeg16_1920_c: 21144.8 lumRangeToJpeg16_1920_neon: 5069.8 (4.17x) aarch64 A76: chrRangeFromJpeg16_1920_c: 11523.8 chrRangeFromJpeg16_1920_neon: 3367.5 (3.42x) chrRangeToJpeg16_1920_c: 11655.2 chrRangeToJpeg16_1920_neon: 4087.2 (2.85x) lumRangeFromJpeg16_1920_c: 5762.0 lumRangeFromJpeg16_1920_neon: 1815.8 (3.17x) lumRangeToJpeg16_1920_c: 5946.2 lumRangeToJpeg16_1920_neon: 2148.2 (2.77x)	2024-12-05 21:10:29 +01:00
Ramiro Polla	6fe4a4ffb6	swscale/aarch64/range_convert: update neon range_convert functions to new API aarch64 A55: chrRangeFromJpeg8_1920_c: 28835.2 (1.00x) chrRangeFromJpeg8_1920_neon: 5313.9 (5.43x) 5308.4 (5.43x) chrRangeToJpeg8_1920_c: 23074.7 (1.00x) chrRangeToJpeg8_1920_neon: 5551.3 (4.16x) 5549.2 (4.16x) lumRangeFromJpeg8_1920_c: 15389.7 (1.00x) lumRangeFromJpeg8_1920_neon: 3152.3 (4.88x) 3147.7 (4.89x) lumRangeToJpeg8_1920_c: 19227.8 (1.00x) lumRangeToJpeg8_1920_neon: 3628.7 (5.30x) 3630.2 (5.30x) aarch64 A76: chrRangeFromJpeg8_1920_c: 6324.4 (1.00x) chrRangeFromJpeg8_1920_neon: 2344.5 (2.70x) 2304.2 (2.74x) chrRangeToJpeg8_1920_c: 9656.0 (1.00x) chrRangeToJpeg8_1920_neon: 2824.2 (3.42x) 2794.2 (3.46x) lumRangeFromJpeg8_1920_c: 4422.0 (1.00x) lumRangeFromJpeg8_1920_neon: 1104.5 (4.00x) 1106.2 (4.00x) lumRangeToJpeg8_1920_c: 5949.1 (1.00x) lumRangeToJpeg8_1920_neon: 1329.8 (4.47x) 1328.2 (4.48x)	2024-12-05 21:10:29 +01:00
Ramiro Polla	384fe39623	swscale/range_convert: fix mpeg ranges in yuv range conversion for non-8-bit pixel formats There is an issue with the constants used in YUV to YUV range conversion, where the upper bound is not respected when converting to mpeg range. With this commit, the constants are calculated at runtime, depending on the bit depth. This approach also allows us to more easily understand how the constants are derived. For bit depths <= 14, the number of fixed point bits has been set to 14 for all conversions, to simplify the code. For bit depths > 14, the number of fixed point bits has been raised and set to 18, to allow for the conversion to be accurate enough for the mpeg range to be respected. The convert functions now take the conversion constants (coeff and offset) as function arguments. For bit depths <= 14, coeff is unsigned 16-bit and offset is 32-bit. For bit depths > 14, coeff is unsigned 32-bit and offset is 64-bit. x86_64: chrRangeFromJpeg8_1920_c: 2127.4 2125.0 (1.00x) chrRangeFromJpeg16_1920_c: 2325.2 2127.2 (1.09x) chrRangeToJpeg8_1920_c: 3166.9 3168.7 (1.00x) chrRangeToJpeg16_1920_c: 2152.4 3164.8 (0.68x) lumRangeFromJpeg8_1920_c: 1263.0 1302.5 (0.97x) lumRangeFromJpeg16_1920_c: 1080.5 1299.2 (0.83x) lumRangeToJpeg8_1920_c: 1886.8 2112.2 (0.89x) lumRangeToJpeg16_1920_c: 1077.0 1906.5 (0.56x) aarch64 A55: chrRangeFromJpeg8_1920_c: 28835.2 28835.6 (1.00x) chrRangeFromJpeg16_1920_c: 28839.8 32680.8 (0.88x) chrRangeToJpeg8_1920_c: 23074.7 23075.4 (1.00x) chrRangeToJpeg16_1920_c: 17318.9 24996.0 (0.69x) lumRangeFromJpeg8_1920_c: 15389.7 15384.5 (1.00x) lumRangeFromJpeg16_1920_c: 15388.2 17306.7 (0.89x) lumRangeToJpeg8_1920_c: 19227.8 19226.6 (1.00x) lumRangeToJpeg16_1920_c: 15387.0 21146.3 (0.73x) aarch64 A76: chrRangeFromJpeg8_1920_c: 6324.4 6268.1 (1.01x) chrRangeFromJpeg16_1920_c: 6339.9 11521.5 (0.55x) chrRangeToJpeg8_1920_c: 9656.0 9612.8 (1.00x) chrRangeToJpeg16_1920_c: 6340.4 11651.8 (0.54x) lumRangeFromJpeg8_1920_c: 4422.0 4420.8 (1.00x) lumRangeFromJpeg16_1920_c: 4420.9 5762.0 (0.77x) lumRangeToJpeg8_1920_c: 5949.1 5977.5 (1.00x) lumRangeToJpeg16_1920_c: 4446.8 5946.2 (0.75x) NOTE: all simd optimizations for range_convert have been disabled. they will be re-enabled when they are fixed for each architecture. NOTE2: the same issue still exists in rgb2yuv conversions, which is not addressed in this commit.	2024-12-05 21:10:29 +01:00
Ramiro Polla	58bcdeb742	swscale/aarch64/range_convert: saturate output instead of limiting input aarch64 A55: chrRangeFromJpeg8_1920_c: 28836.2 (1.00x) chrRangeFromJpeg8_1920_neon: 5312.6 (5.43x) 5313.9 (5.43x) chrRangeToJpeg8_1920_c: 44196.2 (1.00x) chrRangeToJpeg8_1920_neon: 6034.6 (7.32x) 5551.3 (7.96x) lumRangeFromJpeg8_1920_c: 15388.5 (1.00x) lumRangeFromJpeg8_1920_neon: 3150.7 (4.88x) 3152.3 (4.88x) lumRangeToJpeg8_1920_c: 23069.7 (1.00x) lumRangeToJpeg8_1920_neon: 3873.2 (5.96x) 3628.7 (6.36x) aarch64 A76: chrRangeFromJpeg8_1920_c: 6334.7 (1.00x) chrRangeFromJpeg8_1920_neon: 2264.5 (2.80x) 2344.5 (2.70x) chrRangeToJpeg8_1920_c: 11474.5 (1.00x) chrRangeToJpeg8_1920_neon: 2646.5 (4.34x) 2824.2 (4.06x) lumRangeFromJpeg8_1920_c: 4453.2 (1.00x) lumRangeFromJpeg8_1920_neon: 1104.8 (4.03x) 1104.5 (4.03x) lumRangeToJpeg8_1920_c: 6645.0 (1.00x) lumRangeToJpeg8_1920_neon: 1310.5 (5.07x) 1329.8 (5.00x)	2024-12-05 21:10:29 +01:00
Ramiro Polla	2d1358a84d	swscale/range_convert: saturate output instead of limiting input For bit depths <= 14, the result is saturated to 15 bits. For bit depths > 14, the result is saturated to 19 bits. x86_64: chrRangeFromJpeg8_1920_c: 2126.5 2127.4 (1.00x) chrRangeFromJpeg16_1920_c: 2331.4 2325.2 (1.00x) chrRangeToJpeg8_1920_c: 3163.0 3166.9 (1.00x) chrRangeToJpeg16_1920_c: 3163.7 2152.4 (1.47x) lumRangeFromJpeg8_1920_c: 1262.2 1263.0 (1.00x) lumRangeFromJpeg16_1920_c: 1079.5 1080.5 (1.00x) lumRangeToJpeg8_1920_c: 1860.5 1886.8 (0.99x) lumRangeToJpeg16_1920_c: 1910.2 1077.0 (1.77x) aarch64 A55: chrRangeFromJpeg8_1920_c: 28836.2 28835.2 (1.00x) chrRangeFromJpeg16_1920_c: 28840.1 28839.8 (1.00x) chrRangeToJpeg8_1920_c: 44196.2 23074.7 (1.92x) chrRangeToJpeg16_1920_c: 36527.3 17318.9 (2.11x) lumRangeFromJpeg8_1920_c: 15388.5 15389.7 (1.00x) lumRangeFromJpeg16_1920_c: 15389.3 15388.2 (1.00x) lumRangeToJpeg8_1920_c: 23069.7 19227.8 (1.20x) lumRangeToJpeg16_1920_c: 19227.8 15387.0 (1.25x) aarch64 A76: chrRangeFromJpeg8_1920_c: 6334.7 6324.4 (1.00x) chrRangeFromJpeg16_1920_c: 6336.0 6339.9 (1.00x) chrRangeToJpeg8_1920_c: 11474.5 9656.0 (1.19x) chrRangeToJpeg16_1920_c: 9640.5 6340.4 (1.52x) lumRangeFromJpeg8_1920_c: 4453.2 4422.0 (1.01x) lumRangeFromJpeg16_1920_c: 4414.2 4420.9 (1.00x) lumRangeToJpeg8_1920_c: 6645.0 5949.1 (1.12x) lumRangeToJpeg16_1920_c: 6005.2 4446.8 (1.35x) NOTE: all simd optimizations for range_convert have been disabled except for x86, which already had the same behaviour. they will be re-enabled when they are fixed for each architecture.	2024-12-05 21:10:29 +01:00
Niklas Haas	2d077f9acd	swscale/internal: group user-facing options together This is a preliminary step to separating these into a new struct. This commit contains no functional changes, it is a pure search-and-replace. Sponsored-by: Sovereign Tech Fund Signed-off-by: Niklas Haas <git@haasn.dev>	2024-11-21 12:49:56 +01:00
Ramiro Polla	f7ee0195df	swscale/range_convert: drop redundant conditionals from arch-specific init functions These conditions are already checked for in the main init function.	2024-10-27 13:20:56 +01:00
Ramiro Polla	7728b3357d	swscale/range_convert: call arch-specific init functions from main init function This commit also fixes the issue that the call to ff_sws_init_range_convert() from sws_init_swscale() was not setting up the arch-specific optimizations.	2024-10-27 13:20:56 +01:00
Niklas Haas	67adb30322	swscale: rename SwsContext to SwsInternal And preserve the public SwsContext as separate name. The motivation here is that I want to turn SwsContext into a public struct, while keeping the internal implementation hidden. Additionally, I also want to be able to use multiple internal implementations, e.g. for GPU devices. This commit does not include any functional changes. For the most part, it is a simple rename. The only complications arise from the public facing API functions, which preserve their current type (and hence require an additional unwrapping step internally), and the checkasm test framework, which directly accesses SwsInternal. For consistency, the affected functions that need to maintain a distinction have generally been changed to refer to the SwsContext as sws, and the SwsInternal as c. In an upcoming commit, I will provide a backing definition for the public SwsContext, and update `sws_internal()` to dereference the internal struct instead of merely casting it. Sponsored-by: Sovereign Tech Fund Signed-off-by: Niklas Haas <git@haasn.dev>	2024-10-24 22:50:00 +02:00
Martin Storsjö	b9145fcab2	swscale: Fix aarch64 and i386 compilation failures This unbreaks builds after `c1a0e65763`, which broke with errors like src/libswscale/aarch64/rgb2rgb.c:66:25: error: incompatible function pointer types assigning to 'void ()(const uint8_t , uint8_t , uint8_t , uint8_t , int, int, int, int, int, const int32_t )' (aka 'void ()(const unsigned char , unsigned char , unsigned char , unsigned char , int, int, int, int, int, const int )') from 'void (const uint8_t , uint8_t , uint8_t , uint8_t , int, int, int, int, int, int32_t )' (aka 'void (const unsigned char , unsigned char , unsigned char , unsigned char , int, int, int, int, int, int )') [-Wincompatible-function-pointer-types] 66 \| ff_rgb24toyv12 = rgb24toyv12; \| ^ ~~~~~~~~~~~ and src/libswscale/aarch64/swscale_unscaled.c:213:29: error: incompatible function pointer types assigning to 'SwsFunc' (aka 'int ()(struct SwsContext , const unsigned char const , const int , int, int, unsigned char const , const int )') from 'int (SwsContext , const uint8_t const , const int , int, int, const uint8_t *, const int )' (aka 'int (struct SwsContext , const unsigned char const , const int , int, int, const unsigned char *, const int )') [-Wincompatible-function-pointer-types] 213 \| c->convert_unscaled = nv24_to_yuv420p_neon_wrapper; \| ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Martin Storsjö <martin@martin.st>	2024-10-08 09:29:07 +03:00
Niklas Haas	c1a0e65763	swscale/internal: constify SwsFunc I want to move away from having random leaf processing functions mutate plane pointers, and while we're at it, we might as well make the strides and tables const as well. Sponsored-by: Sovereign Tech Fund Signed-off-by: Niklas Haas <git@haasn.dev>	2024-10-07 19:51:34 +02:00
Zhao Zhili	e18b46d95f	swscale/aarch64: Fix rgb24toyv12 only works with aligned width Since `c0666d8b`, rgb24toyv12 is broken for width non-aligned to 16. Add a simple wrapper to handle the non-aligned part. Co-authored-by: johzzy <hellojinqiang@gmail.com> Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>	2024-09-24 10:24:14 +08:00
Ramiro Polla	c0666d8bed	swscale/aarch64/rgb2rgb: add neon implementation for rgb24toyv12 A55 A76 rgb24toyv12_16_200_c: 36890.6 17275.5 rgb24toyv12_16_200_neon: 12460.1 ( 2.96x) 5360.8 ( 3.22x) rgb24toyv12_128_60_c: 83205.1 39884.8 rgb24toyv12_128_60_neon: 27468.4 ( 3.03x) 13552.5 ( 2.94x) rgb24toyv12_512_16_c: 88111.6 42346.8 rgb24toyv12_512_16_neon: 29126.6 ( 3.03x) 14411.2 ( 2.94x) rgb24toyv12_1920_4_c: 82068.1 39620.0 rgb24toyv12_1920_4_neon: 27011.6 ( 3.04x) 13492.2 ( 2.94x)	2024-09-06 23:11:13 +02:00
Ramiro Polla	d8848325a6	swscale/aarch64/rgb2rgb: add deinterleaveBytes neon implementation A55 A76 deinterleave_bytes_c: 70342.0 34497.5 deinterleave_bytes_neon: 21594.5 ( 3.26x) 5535.2 ( 6.23x) deinterleave_bytes_aligned_c: 71340.8 34651.2 deinterleave_bytes_aligned_neon: 8616.8 ( 8.28x) 3996.2 ( 8.67x)	2024-09-06 23:05:09 +02:00
Ramiro Polla	420d443600	swscale/aarch64: cosmetics fix (spaces inside curly braces)	2024-08-26 11:07:49 +02:00
Ramiro Polla	52887683e9	swscale/aarch64: add nv24/nv42 to yuv420p unscaled converter A55 A76 nv24_yuv420p_128_c: 4956.1 1267.0 nv24_yuv420p_128_neon: 3109.1 ( 1.59x) 640.0 ( 1.98x) nv24_yuv420p_1920_c: 35728.4 11736.2 nv24_yuv420p_1920_neon: 8011.1 ( 4.46x) 2436.0 ( 4.82x) nv42_yuv420p_128_c: 4956.4 1270.5 nv42_yuv420p_128_neon: 3074.6 ( 1.61x) 639.5 ( 1.99x) nv42_yuv420p_1920_c: 35685.9 11732.5 nv42_yuv420p_1920_neon: 7995.1 ( 4.46x) 2437.2 ( 4.81x)	2024-08-26 11:04:46 +02:00
Martin Storsjö	cfe0a36352	libswscale: aarch64: Fix the indentation of some macro invocations Signed-off-by: Martin Storsjö <martin@martin.st>	2024-08-22 14:40:30 +03:00
Ramiro Polla	181cd260db	swscale/aarch64/yuv2rgb: add neon yuv42{0,2}p -> gbrp unscaled colorspace converters checkasm --bench on a Raspberry Pi 5 Model B Rev 1.0: yuv420p_gbrp_128_c: 1243.0 yuv420p_gbrp_128_neon: 453.5 yuv420p_gbrp_1920_c: 18165.5 yuv420p_gbrp_1920_neon: 6700.0 yuv422p_gbrp_128_c: 1463.5 yuv422p_gbrp_128_neon: 471.5 yuv422p_gbrp_1920_c: 21343.7 yuv422p_gbrp_1920_neon: 6743.5	2024-08-18 22:26:17 +02:00
Zhao Zhili	4d90a76986	swscale/aarch64: Add argb/abgr to yuv Test on Apple M1 with kperf: : -O3 : -O3 -fno-vectorize abgr_to_uv_8_c : 19.4 : 26.1 abgr_to_uv_8_neon : 29.9 : 51.1 abgr_to_uv_128_c : 146.4 : 558.9 abgr_to_uv_128_neon : 85.1 : 83.4 abgr_to_uv_1080_c : 1162.6 : 4786.4 abgr_to_uv_1080_neon : 819.6 : 826.6 abgr_to_uv_1920_c : 2063.6 : 8492.1 abgr_to_uv_1920_neon : 1435.1 : 1447.1 abgr_to_uv_half_8_c : 16.4 : 11.4 abgr_to_uv_half_8_neon : 35.6 : 20.4 abgr_to_uv_half_128_c : 108.6 : 359.4 abgr_to_uv_half_128_neon : 75.4 : 42.6 abgr_to_uv_half_1080_c : 883.4 : 2885.6 abgr_to_uv_half_1080_neon : 460.6 : 481.1 abgr_to_uv_half_1920_c : 1553.6 : 5106.9 abgr_to_uv_half_1920_neon : 817.6 : 820.4 abgr_to_y_8_c : 6.1 : 26.4 abgr_to_y_8_neon : 40.6 : 6.4 abgr_to_y_128_c : 99.9 : 390.1 abgr_to_y_128_neon : 67.4 : 55.9 abgr_to_y_1080_c : 735.9 : 3170.4 abgr_to_y_1080_neon : 534.6 : 536.6 abgr_to_y_1920_c : 1279.4 : 6016.4 abgr_to_y_1920_neon : 932.6 : 927.6 Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>	2024-07-05 16:32:31 +08:00
Zhao Zhili	52422133ae	swscale/aarch64: Add bgra/rgba to yuv Test on Apple M1 with kperf : -O3 : -O3 -fno-vectorize bgra_to_uv_8_c : 13.4 : 27.5 bgra_to_uv_8_neon : 37.4 : 41.7 bgra_to_uv_128_c : 155.9 : 550.2 bgra_to_uv_128_neon : 91.7 : 92.7 bgra_to_uv_1080_c : 1173.2 : 4558.2 bgra_to_uv_1080_neon : 822.7 : 809.5 bgra_to_uv_1920_c : 2078.2 : 8115.2 bgra_to_uv_1920_neon : 1437.7 : 1438.7 bgra_to_uv_half_8_c : 17.9 : 14.2 bgra_to_uv_half_8_neon : 37.4 : 10.5 bgra_to_uv_half_128_c : 103.9 : 326.0 bgra_to_uv_half_128_neon : 73.9 : 68.7 bgra_to_uv_half_1080_c : 850.2 : 3732.0 bgra_to_uv_half_1080_neon : 484.2 : 490.0 bgra_to_uv_half_1920_c : 1479.2 : 4942.7 bgra_to_uv_half_1920_neon : 824.2 : 824.7 bgra_to_y_8_c : 8.2 : 29.5 bgra_to_y_8_neon : 18.2 : 32.7 bgra_to_y_128_c : 101.4 : 361.5 bgra_to_y_128_neon : 74.9 : 73.7 bgra_to_y_1080_c : 739.4 : 3018.0 bgra_to_y_1080_neon : 613.4 : 544.2 bgra_to_y_1920_c : 1298.7 : 5326.0 bgra_to_y_1920_neon : 918.7 : 934.2 Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>	2024-07-05 16:32:31 +08:00
Zhao Zhili	b8b71be07a	swscale/aarch64: Add bgr24 to yuv Test on Apple M1 with kperf : -O3 : -O3 -fno-vectorize bgr24_to_uv_8_c : 28.5 : 52.5 bgr24_to_uv_8_neon : 54.5 : 59.7 bgr24_to_uv_128_c : 294.0 : 830.7 bgr24_to_uv_128_neon : 99.7 : 112.0 bgr24_to_uv_1080_c : 965.0 : 6624.0 bgr24_to_uv_1080_neon : 751.5 : 754.7 bgr24_to_uv_1920_c : 1693.2 : 11554.5 bgr24_to_uv_1920_neon : 1292.5 : 1307.5 bgr24_to_uv_half_8_c : 54.2 : 37.0 bgr24_to_uv_half_8_neon : 27.2 : 22.5 bgr24_to_uv_half_128_c : 127.2 : 392.5 bgr24_to_uv_half_128_neon : 63.0 : 52.0 bgr24_to_uv_half_1080_c : 880.2 : 3329.0 bgr24_to_uv_half_1080_neon : 401.5 : 390.7 bgr24_to_uv_half_1920_c : 1585.7 : 6390.7 bgr24_to_uv_half_1920_neon : 694.7 : 698.7 bgr24_to_y_8_c : 21.7 : 22.5 bgr24_to_y_8_neon : 797.2 : 25.5 bgr24_to_y_128_c : 88.0 : 280.5 bgr24_to_y_128_neon : 63.7 : 55.0 bgr24_to_y_1080_c : 616.7 : 2208.7 bgr24_to_y_1080_neon : 900.0 : 452.0 bgr24_to_y_1920_c : 1093.2 : 3894.7 bgr24_to_y_1920_neon : 777.2 : 767.5 Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>	2024-07-05 16:32:31 +08:00
Ramiro Polla	75f1a8e071	swscale/aarch64: add neon {lum,chr}ConvertRange chrRangeFromJpeg_8_c: 29.2 chrRangeFromJpeg_8_neon: 19.5 chrRangeFromJpeg_24_c: 80.5 chrRangeFromJpeg_24_neon: 34.0 chrRangeFromJpeg_128_c: 413.7 chrRangeFromJpeg_128_neon: 156.0 chrRangeFromJpeg_144_c: 471.0 chrRangeFromJpeg_144_neon: 174.2 chrRangeFromJpeg_256_c: 842.0 chrRangeFromJpeg_256_neon: 305.5 chrRangeFromJpeg_512_c: 1699.0 chrRangeFromJpeg_512_neon: 608.0 chrRangeToJpeg_8_c: 51.7 chrRangeToJpeg_8_neon: 22.7 chrRangeToJpeg_24_c: 149.7 chrRangeToJpeg_24_neon: 38.0 chrRangeToJpeg_128_c: 761.7 chrRangeToJpeg_128_neon: 176.7 chrRangeToJpeg_144_c: 866.2 chrRangeToJpeg_144_neon: 198.7 chrRangeToJpeg_256_c: 1516.5 chrRangeToJpeg_256_neon: 348.7 chrRangeToJpeg_512_c: 3067.2 chrRangeToJpeg_512_neon: 692.7 lumRangeFromJpeg_8_c: 24.0 lumRangeFromJpeg_8_neon: 17.0 lumRangeFromJpeg_24_c: 56.7 lumRangeFromJpeg_24_neon: 21.0 lumRangeFromJpeg_128_c: 294.5 lumRangeFromJpeg_128_neon: 76.7 lumRangeFromJpeg_144_c: 332.5 lumRangeFromJpeg_144_neon: 86.7 lumRangeFromJpeg_256_c: 586.0 lumRangeFromJpeg_256_neon: 152.2 lumRangeFromJpeg_512_c: 1190.0 lumRangeFromJpeg_512_neon: 298.0 lumRangeToJpeg_8_c: 31.7 lumRangeToJpeg_8_neon: 19.5 lumRangeToJpeg_24_c: 83.5 lumRangeToJpeg_24_neon: 24.2 lumRangeToJpeg_128_c: 440.5 lumRangeToJpeg_128_neon: 91.0 lumRangeToJpeg_144_c: 504.2 lumRangeToJpeg_144_neon: 101.0 lumRangeToJpeg_256_c: 879.7 lumRangeToJpeg_256_neon: 177.2 lumRangeToJpeg_512_c: 1794.2 lumRangeToJpeg_512_neon: 354.0	2024-06-18 23:12:41 +02:00
Zhao Zhili	9dac8495b0	swscale/aarch64: Add rgb24 to yuv implementation Test on Apple M1: rgb24_to_uv_8_c: 0.0 rgb24_to_uv_8_neon: 0.2 rgb24_to_uv_128_c: 1.0 rgb24_to_uv_128_neon: 0.5 rgb24_to_uv_1080_c: 7.0 rgb24_to_uv_1080_neon: 5.7 rgb24_to_uv_1920_c: 12.5 rgb24_to_uv_1920_neon: 9.5 rgb24_to_uv_half_8_c: 0.2 rgb24_to_uv_half_8_neon: 0.2 rgb24_to_uv_half_128_c: 1.0 rgb24_to_uv_half_128_neon: 0.5 rgb24_to_uv_half_1080_c: 6.2 rgb24_to_uv_half_1080_neon: 3.0 rgb24_to_uv_half_1920_c: 11.2 rgb24_to_uv_half_1920_neon: 5.2 rgb24_to_y_8_c: 0.2 rgb24_to_y_8_neon: 0.0 rgb24_to_y_128_c: 0.5 rgb24_to_y_128_neon: 0.5 rgb24_to_y_1080_c: 4.7 rgb24_to_y_1080_neon: 3.2 rgb24_to_y_1920_c: 8.0 rgb24_to_y_1920_neon: 5.7 On Pixel 6: rgb24_to_uv_8_c: 30.7 rgb24_to_uv_8_neon: 56.9 rgb24_to_uv_128_c: 213.9 rgb24_to_uv_128_neon: 173.2 rgb24_to_uv_1080_c: 1649.9 rgb24_to_uv_1080_neon: 1424.4 rgb24_to_uv_1920_c: 2907.9 rgb24_to_uv_1920_neon: 2480.7 rgb24_to_uv_half_8_c: 36.2 rgb24_to_uv_half_8_neon: 33.4 rgb24_to_uv_half_128_c: 167.9 rgb24_to_uv_half_128_neon: 99.4 rgb24_to_uv_half_1080_c: 1293.9 rgb24_to_uv_half_1080_neon: 778.7 rgb24_to_uv_half_1920_c: 2292.7 rgb24_to_uv_half_1920_neon: 1328.7 rgb24_to_y_8_c: 19.7 rgb24_to_y_8_neon: 27.7 rgb24_to_y_128_c: 129.9 rgb24_to_y_128_neon: 96.7 rgb24_to_y_1080_c: 995.4 rgb24_to_y_1080_neon: 767.7 rgb24_to_y_1920_c: 1747.4 rgb24_to_y_1920_neon: 1337.2 Note both tests use clang as compiler, which has vectorization enabled by default with -O3. Reviewed-by: Rémi Denis-Courmont <remi@remlab.net> Reviewed-by: Martin Storsjö <martin@martin.st> Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>	2024-06-11 01:12:09 +08:00
xufuji456	cc86343b96	lavc/hevcdsp_qpel_neon: using movi.16b instead of movi.2d Building iOS platform with arm64, the compiler has a warning: "instruction movi.2d with immediate #0 may not function correctly on this CPU, converting to movi.16b" Signed-off-by: xufuji456 <839789740@qq.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2023-11-28 15:54:49 +02:00
Martin Storsjö	a76b409dd0	aarch64: Reindent all assembly to 8/24 column indentation libavcodec/aarch64/vc1dsp_neon.S is skipped here, as it intentionally uses a layered indentation style to visually show how different unrolled/interleaved phases fit together. Signed-off-by: Martin Storsjö <martin@martin.st>	2023-10-21 23:25:54 +03:00
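
The range_convert commits above (384fe39623 and 2d1358a84d) describe deriving the conversion constants at runtime from the bit depth, using 14 fractional bits for depths <= 14 and 18 for larger depths, and saturating the output. The following is a minimal sketch of that idea for luma only, not the actual libswscale code. The names `RangeCoeffs`, `range_shift`, `lum_from_jpeg_coeffs`, `lum_to_jpeg_coeffs` and `apply_range` are hypothetical, the constants are kept in wide types for simplicity, and the sketch works on plain sample values and clamps to the sample range rather than to 15 or 19 bits as the commits describe.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the conversion constants (not a libswscale type). */
typedef struct RangeCoeffs {
    uint32_t coeff;   /* fixed-point scale factor */
    int64_t  offset;  /* fixed-point offset, including the rounding term */
    int      shift;   /* number of fractional bits */
} RangeCoeffs;

/* Fractional bits as stated in the commit message: 14 up to 14-bit, else 18. */
static int range_shift(int depth)
{
    return depth <= 14 ? 14 : 18;
}

/* Luma, full (JPEG) range -> limited (MPEG) range. */
static RangeCoeffs lum_from_jpeg_coeffs(int depth)
{
    assert(depth >= 8 && depth <= 16);
    const int64_t full = (1 << depth) - 1;      /* full-range white   */
    const int64_t lo   = 16  << (depth - 8);    /* limited-range black */
    const int64_t hi   = 235 << (depth - 8);    /* limited-range white */
    RangeCoeffs k;
    k.shift  = range_shift(depth);
    /* coeff = (hi - lo) / full, rounded, in fixed point */
    k.coeff  = (uint32_t)((((hi - lo) << k.shift) + full / 2) / full);
    k.offset = (lo << k.shift) + ((int64_t)1 << (k.shift - 1));
    return k;
}

/* Luma, limited (MPEG) range -> full (JPEG) range. */
static RangeCoeffs lum_to_jpeg_coeffs(int depth)
{
    assert(depth >= 8 && depth <= 16);
    const int64_t full = (1 << depth) - 1;
    const int64_t lo   = 16  << (depth - 8);
    const int64_t hi   = 235 << (depth - 8);
    RangeCoeffs k;
    k.shift  = range_shift(depth);
    /* coeff = full / (hi - lo), rounded, in fixed point */
    k.coeff  = (uint32_t)(((full << k.shift) + (hi - lo) / 2) / (hi - lo));
    k.offset = ((int64_t)1 << (k.shift - 1)) - lo * (int64_t)k.coeff;
    return k;
}

/* Apply the constants to one sample and saturate the output. */
static int apply_range(int v, RangeCoeffs k, int depth)
{
    const int64_t max = (1 << depth) - 1;
    const int64_t acc = (int64_t)v * k.coeff + k.offset;
    if (acc < 0)                 /* clamp before shifting a negative value */
        return 0;
    const int64_t r = acc >> k.shift;
    return r > max ? (int)max : (int)r;
}
```

With these constants, `apply_range(255, lum_from_jpeg_coeffs(8), 8)` maps full-range white to the limited-range upper bound, which is the property the first commit says was previously violated for non-8-bit formats. Saturating the output, rather than limiting the input, keeps out-of-range inputs such as a limited-range value above 235 from overflowing the result.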

77 Commits