FFmpeg

mirror of https://mirror.skon.top/https://github.com/FFmpeg/FFmpeg synced 2026-04-30 13:50:50 +08:00

Author	SHA1	Message	Date
Arpad Panyik	1f30ff30fb	swscale: Add AArch64 Neon path for xyz12Torgb48 LE Add optimized Neon code path for the little endian case of the xyz12Torgb48 function. The innermost loop processes the data in 4x2 pixel blocks using software gathers with the matrix multiplication and clipping done by Neon. Relative runtime of micro benchmarks after this patch on some Cortex and Neoverse CPU cores: xyz12le_rgb48le X1 X3 X4 X925 V2 16x4_neon: 2.55x 4.34x 3.84x 3.31x 3.22x 32x4_neon: 2.39x 3.63x 3.22x 3.35x 3.29x 64x4_neon: 2.37x 3.31x 2.91x 3.33x 3.27x 128x4_neon: 2.34x 3.28x 2.91x 3.35x 3.24x 256x4_neon: 2.30x 3.17x 2.91x 3.32x 3.10x 512x4_neon: 2.26x 3.10x 2.91x 3.30x 3.07x 1024x4_neon: 2.26x 3.07x 2.96x 3.30x 3.05x 1920x4_neon: 2.26x 3.06x 2.93x 3.28x 3.04x xyz12le_rgb48le A76 A78 A715 A720 A725 16x4_neon: 2.33x 2.28x 2.53x 3.33x 3.19x 32x4_neon: 2.35x 2.18x 2.45x 3.23x 3.24x 64x4_neon: 2.35x 2.16x 2.42x 3.15x 3.21x 128x4_neon: 2.35x 2.13x 2.39x 3.00x 3.09x 256x4_neon: 2.36x 2.12x 2.35x 2.85x 2.99x 512x4_neon: 2.35x 2.14x 2.35x 2.78x 2.95x 1024x4_neon: 2.31x 2.09x 2.33x 2.80x 2.91x 1920x4_neon: 2.30x 2.07x 2.32x 2.81x 2.94x xyz12le_rgb48le A55 A510 A520 16x4_neon: 2.09x 1.92x 2.36x 32x4_neon: 2.05x 1.89x 2.38x 64x4_neon: 2.02x 1.77x 2.35x 128x4_neon: 1.96x 1.74x 2.25x 256x4_neon: 1.90x 1.72x 2.19x 512x4_neon: 1.83x 1.75x 2.16x 1024x4_neon: 1.83x 1.62x 2.15x 1920x4_neon: 1.82x 1.60x 2.15x Signed-off-by: Arpad Panyik <Arpad.Panyik@arm.com>	2025-12-05 10:28:18 +00:00
Dash Santosh	ca2a88c1b3	swscale/output: Implement yuv2nv12cx neon assembly yuv2nv12cX_2_512_accurate_c: 3540.1 ( 1.00x) yuv2nv12cX_2_512_accurate_neon: 408.0 ( 8.68x) yuv2nv12cX_2_512_approximate_c: 3521.4 ( 1.00x) yuv2nv12cX_2_512_approximate_neon: 409.2 ( 8.61x) yuv2nv12cX_4_512_accurate_c: 4740.0 ( 1.00x) yuv2nv12cX_4_512_accurate_neon: 604.4 ( 7.84x) yuv2nv12cX_4_512_approximate_c: 4681.9 ( 1.00x) yuv2nv12cX_4_512_approximate_neon: 603.3 ( 7.76x) yuv2nv12cX_8_512_accurate_c: 7273.1 ( 1.00x) yuv2nv12cX_8_512_accurate_neon: 1012.2 ( 7.19x) yuv2nv12cX_8_512_approximate_c: 7223.0 ( 1.00x) yuv2nv12cX_8_512_approximate_neon: 1015.8 ( 7.11x) yuv2nv12cX_16_512_accurate_c: 13762.0 ( 1.00x) yuv2nv12cX_16_512_accurate_neon: 1761.4 ( 7.81x) yuv2nv12cX_16_512_approximate_c: 13884.0 ( 1.00x) yuv2nv12cX_16_512_approximate_neon: 1766.8 ( 7.86x) Benchmarked on: Snapdragon(R) X Elite - X1E80100 - Qualcomm(R) Oryon(TM) CPU 3417 Mhz, 12 Core(s), 12 Logical Processor(s)	2025-08-12 09:05:00 +00:00
Logaprakash Ramajayam	49477972b7	swscale/aarch64/output: Implement neon assembly for yuv2planeX_10_c_template() yuv2yuvX_8_2_0_512_accurate_c: 2213.4 ( 1.00x) yuv2yuvX_8_2_0_512_accurate_neon: 147.5 (15.01x) yuv2yuvX_8_2_0_512_approximate_c: 2203.9 ( 1.00x) yuv2yuvX_8_2_0_512_approximate_neon: 154.1 (14.30x) yuv2yuvX_8_2_16_512_accurate_c: 2147.2 ( 1.00x) yuv2yuvX_8_2_16_512_accurate_neon: 150.8 (14.24x) yuv2yuvX_8_2_16_512_approximate_c: 2149.7 ( 1.00x) yuv2yuvX_8_2_16_512_approximate_neon: 146.8 (14.64x) yuv2yuvX_8_2_32_512_accurate_c: 2078.9 ( 1.00x) yuv2yuvX_8_2_32_512_accurate_neon: 139.0 (14.95x) yuv2yuvX_8_2_32_512_approximate_c: 2083.7 ( 1.00x) yuv2yuvX_8_2_32_512_approximate_neon: 140.5 (14.84x) yuv2yuvX_8_2_48_512_accurate_c: 2010.7 ( 1.00x) yuv2yuvX_8_2_48_512_accurate_neon: 138.2 (14.55x) yuv2yuvX_8_2_48_512_approximate_c: 2012.6 ( 1.00x) yuv2yuvX_8_2_48_512_approximate_neon: 141.2 (14.26x) yuv2yuvX_10LE_16_0_512_accurate_c: 7874.1 ( 1.00x) yuv2yuvX_10LE_16_0_512_accurate_neon: 831.6 ( 9.47x) yuv2yuvX_10LE_16_0_512_approximate_c: 7918.1 ( 1.00x) yuv2yuvX_10LE_16_0_512_approximate_neon: 836.1 ( 9.47x) yuv2yuvX_10LE_16_16_512_accurate_c: 7630.9 ( 1.00x) yuv2yuvX_10LE_16_16_512_accurate_neon: 804.5 ( 9.49x) yuv2yuvX_10LE_16_16_512_approximate_c: 7724.7 ( 1.00x) yuv2yuvX_10LE_16_16_512_approximate_neon: 808.6 ( 9.55x) yuv2yuvX_10LE_16_32_512_accurate_c: 7436.4 ( 1.00x) yuv2yuvX_10LE_16_32_512_accurate_neon: 780.4 ( 9.53x) yuv2yuvX_10LE_16_32_512_approximate_c: 7366.7 ( 1.00x) yuv2yuvX_10LE_16_32_512_approximate_neon: 780.5 ( 9.44x) yuv2yuvX_10LE_16_48_512_accurate_c: 7099.9 ( 1.00x) yuv2yuvX_10LE_16_48_512_accurate_neon: 761.0 ( 9.33x) yuv2yuvX_10LE_16_48_512_approximate_c: 7097.6 ( 1.00x) yuv2yuvX_10LE_16_48_512_approximate_neon: 754.6 ( 9.41x) Benchmarked on: Snapdragon(R) X Elite - X1E80100 - Qualcomm(R) Oryon(TM) CPU 3417 Mhz, 12 Core(s), 12 Logical Processor(s)	2025-08-12 09:05:00 +00:00
Martin Storsjö	73f4668ef8	swscale: aarch64: Simplify the assignment of lumToYV12 We normally don't need else statements here; the common pattern is to assign lower level SIMD implementations first, then conditionally reassign higher level ones afterwards, if supported. Signed-off-by: Martin Storsjö <martin@martin.st>	2025-03-10 14:03:58 +02:00
Krzysztof Pyrkosz	d765e5f043	swscale/aarch64: dotprod implementation of rgba32_to_Y The idea is to split the 16 bit coefficients into lower and upper half, invoke udot for the lower half, shift by 8, and follow by udot for the upper half. Benchmark on A78: bgra_to_y_128_c: 682.0 ( 1.00x) bgra_to_y_128_neon: 181.2 ( 3.76x) bgra_to_y_128_dotprod: 117.8 ( 5.79x) bgra_to_y_1080_c: 5742.5 ( 1.00x) bgra_to_y_1080_neon: 1472.5 ( 3.90x) bgra_to_y_1080_dotprod: 906.5 ( 6.33x) bgra_to_y_1920_c: 10194.0 ( 1.00x) bgra_to_y_1920_neon: 2589.8 ( 3.94x) bgra_to_y_1920_dotprod: 1573.8 ( 6.48x) Signed-off-by: Martin Storsjö <martin@martin.st>	2025-03-04 10:16:44 +02:00
Ramiro Polla	ca889b1328	swscale/aarch64: add neon {lum,chr}ConvertRange16 aarch64 A55: chrRangeFromJpeg16_1920_c: 32684.2 chrRangeFromJpeg16_1920_neon: 8431.2 (3.88x) chrRangeToJpeg16_1920_c: 24996.8 chrRangeToJpeg16_1920_neon: 9395.0 (2.66x) lumRangeFromJpeg16_1920_c: 17305.2 lumRangeFromJpeg16_1920_neon: 4586.5 (3.77x) lumRangeToJpeg16_1920_c: 21144.8 lumRangeToJpeg16_1920_neon: 5069.8 (4.17x) aarch64 A76: chrRangeFromJpeg16_1920_c: 11523.8 chrRangeFromJpeg16_1920_neon: 3367.5 (3.42x) chrRangeToJpeg16_1920_c: 11655.2 chrRangeToJpeg16_1920_neon: 4087.2 (2.85x) lumRangeFromJpeg16_1920_c: 5762.0 lumRangeFromJpeg16_1920_neon: 1815.8 (3.17x) lumRangeToJpeg16_1920_c: 5946.2 lumRangeToJpeg16_1920_neon: 2148.2 (2.77x)	2024-12-05 21:10:29 +01:00
Ramiro Polla	6fe4a4ffb6	swscale/aarch64/range_convert: update neon range_convert functions to new API aarch64 A55: chrRangeFromJpeg8_1920_c: 28835.2 (1.00x) chrRangeFromJpeg8_1920_neon: 5313.9 (5.43x) 5308.4 (5.43x) chrRangeToJpeg8_1920_c: 23074.7 (1.00x) chrRangeToJpeg8_1920_neon: 5551.3 (4.16x) 5549.2 (4.16x) lumRangeFromJpeg8_1920_c: 15389.7 (1.00x) lumRangeFromJpeg8_1920_neon: 3152.3 (4.88x) 3147.7 (4.89x) lumRangeToJpeg8_1920_c: 19227.8 (1.00x) lumRangeToJpeg8_1920_neon: 3628.7 (5.30x) 3630.2 (5.30x) aarch64 A76: chrRangeFromJpeg8_1920_c: 6324.4 (1.00x) chrRangeFromJpeg8_1920_neon: 2344.5 (2.70x) 2304.2 (2.74x) chrRangeToJpeg8_1920_c: 9656.0 (1.00x) chrRangeToJpeg8_1920_neon: 2824.2 (3.42x) 2794.2 (3.46x) lumRangeFromJpeg8_1920_c: 4422.0 (1.00x) lumRangeFromJpeg8_1920_neon: 1104.5 (4.00x) 1106.2 (4.00x) lumRangeToJpeg8_1920_c: 5949.1 (1.00x) lumRangeToJpeg8_1920_neon: 1329.8 (4.47x) 1328.2 (4.48x)	2024-12-05 21:10:29 +01:00
Ramiro Polla	384fe39623	swscale/range_convert: fix mpeg ranges in yuv range conversion for non-8-bit pixel formats There is an issue with the constants used in YUV to YUV range conversion, where the upper bound is not respected when converting to mpeg range. With this commit, the constants are calculated at runtime, depending on the bit depth. This approach also allows us to more easily understand how the constants are derived. For bit depths <= 14, the number of fixed point bits has been set to 14 for all conversions, to simplify the code. For bit depths > 14, the number of fixed points bits has been raised and set to 18, to allow for the conversion to be accurate enough for the mpeg range to be respected. The convert functions now take the conversion constants (coeff and offset) as function arguments. For bit depths <= 14, coeff is unsigned 16-bit and offset is 32-bit. For bit depths > 14, coeff is unsigned 32-bit and offset is 64-bit. x86_64: chrRangeFromJpeg8_1920_c: 2127.4 2125.0 (1.00x) chrRangeFromJpeg16_1920_c: 2325.2 2127.2 (1.09x) chrRangeToJpeg8_1920_c: 3166.9 3168.7 (1.00x) chrRangeToJpeg16_1920_c: 2152.4 3164.8 (0.68x) lumRangeFromJpeg8_1920_c: 1263.0 1302.5 (0.97x) lumRangeFromJpeg16_1920_c: 1080.5 1299.2 (0.83x) lumRangeToJpeg8_1920_c: 1886.8 2112.2 (0.89x) lumRangeToJpeg16_1920_c: 1077.0 1906.5 (0.56x) aarch64 A55: chrRangeFromJpeg8_1920_c: 28835.2 28835.6 (1.00x) chrRangeFromJpeg16_1920_c: 28839.8 32680.8 (0.88x) chrRangeToJpeg8_1920_c: 23074.7 23075.4 (1.00x) chrRangeToJpeg16_1920_c: 17318.9 24996.0 (0.69x) lumRangeFromJpeg8_1920_c: 15389.7 15384.5 (1.00x) lumRangeFromJpeg16_1920_c: 15388.2 17306.7 (0.89x) lumRangeToJpeg8_1920_c: 19227.8 19226.6 (1.00x) lumRangeToJpeg16_1920_c: 15387.0 21146.3 (0.73x) aarch64 A76: chrRangeFromJpeg8_1920_c: 6324.4 6268.1 (1.01x) chrRangeFromJpeg16_1920_c: 6339.9 11521.5 (0.55x) chrRangeToJpeg8_1920_c: 9656.0 9612.8 (1.00x) chrRangeToJpeg16_1920_c: 6340.4 11651.8 (0.54x) lumRangeFromJpeg8_1920_c: 4422.0 4420.8 (1.00x) lumRangeFromJpeg16_1920_c: 4420.9 5762.0 (0.77x) lumRangeToJpeg8_1920_c: 5949.1 5977.5 (1.00x) lumRangeToJpeg16_1920_c: 4446.8 5946.2 (0.75x) NOTE: all simd optimizations for range_convert have been disabled. they will be re-enabled when they are fixed for each architecture. NOTE2: the same issue still exists in rgb2yuv conversions, which is not addressed in this commit.	2024-12-05 21:10:29 +01:00
Ramiro Polla	58bcdeb742	swscale/aarch64/range_convert: saturate output instead of limiting input aarch64 A55: chrRangeFromJpeg8_1920_c: 28836.2 (1.00x) chrRangeFromJpeg8_1920_neon: 5312.6 (5.43x) 5313.9 (5.43x) chrRangeToJpeg8_1920_c: 44196.2 (1.00x) chrRangeToJpeg8_1920_neon: 6034.6 (7.32x) 5551.3 (7.96x) lumRangeFromJpeg8_1920_c: 15388.5 (1.00x) lumRangeFromJpeg8_1920_neon: 3150.7 (4.88x) 3152.3 (4.88x) lumRangeToJpeg8_1920_c: 23069.7 (1.00x) lumRangeToJpeg8_1920_neon: 3873.2 (5.96x) 3628.7 (6.36x) aarch64 A76: chrRangeFromJpeg8_1920_c: 6334.7 (1.00x) chrRangeFromJpeg8_1920_neon: 2264.5 (2.80x) 2344.5 (2.70x) chrRangeToJpeg8_1920_c: 11474.5 (1.00x) chrRangeToJpeg8_1920_neon: 2646.5 (4.34x) 2824.2 (4.06x) lumRangeFromJpeg8_1920_c: 4453.2 (1.00x) lumRangeFromJpeg8_1920_neon: 1104.8 (4.03x) 1104.5 (4.03x) lumRangeToJpeg8_1920_c: 6645.0 (1.00x) lumRangeToJpeg8_1920_neon: 1310.5 (5.07x) 1329.8 (5.00x)	2024-12-05 21:10:29 +01:00
Ramiro Polla	2d1358a84d	swscale/range_convert: saturate output instead of limiting input For bit depths <= 14, the result is saturated to 15 bits. For bit depths > 14, the result is saturated to 19 bits. x86_64: chrRangeFromJpeg8_1920_c: 2126.5 2127.4 (1.00x) chrRangeFromJpeg16_1920_c: 2331.4 2325.2 (1.00x) chrRangeToJpeg8_1920_c: 3163.0 3166.9 (1.00x) chrRangeToJpeg16_1920_c: 3163.7 2152.4 (1.47x) lumRangeFromJpeg8_1920_c: 1262.2 1263.0 (1.00x) lumRangeFromJpeg16_1920_c: 1079.5 1080.5 (1.00x) lumRangeToJpeg8_1920_c: 1860.5 1886.8 (0.99x) lumRangeToJpeg16_1920_c: 1910.2 1077.0 (1.77x) aarch64 A55: chrRangeFromJpeg8_1920_c: 28836.2 28835.2 (1.00x) chrRangeFromJpeg16_1920_c: 28840.1 28839.8 (1.00x) chrRangeToJpeg8_1920_c: 44196.2 23074.7 (1.92x) chrRangeToJpeg16_1920_c: 36527.3 17318.9 (2.11x) lumRangeFromJpeg8_1920_c: 15388.5 15389.7 (1.00x) lumRangeFromJpeg16_1920_c: 15389.3 15388.2 (1.00x) lumRangeToJpeg8_1920_c: 23069.7 19227.8 (1.20x) lumRangeToJpeg16_1920_c: 19227.8 15387.0 (1.25x) aarch64 A76: chrRangeFromJpeg8_1920_c: 6334.7 6324.4 (1.00x) chrRangeFromJpeg16_1920_c: 6336.0 6339.9 (1.00x) chrRangeToJpeg8_1920_c: 11474.5 9656.0 (1.19x) chrRangeToJpeg16_1920_c: 9640.5 6340.4 (1.52x) lumRangeFromJpeg8_1920_c: 4453.2 4422.0 (1.01x) lumRangeFromJpeg16_1920_c: 4414.2 4420.9 (1.00x) lumRangeToJpeg8_1920_c: 6645.0 5949.1 (1.12x) lumRangeToJpeg16_1920_c: 6005.2 4446.8 (1.35x) NOTE: all simd optimizations for range_convert have been disabled except for x86, which already had the same behaviour. they will be re-enabled when they are fixed for each architecture.	2024-12-05 21:10:29 +01:00
Niklas Haas	2d077f9acd	swscale/internal: group user-facing options together This is a preliminary step to separating these into a new struct. This commit contains no functional changes, it is a pure search-and-replace. Sponsored-by: Sovereign Tech Fund Signed-off-by: Niklas Haas <git@haasn.dev>	2024-11-21 12:49:56 +01:00
Ramiro Polla	f7ee0195df	swscale/range_convert: drop redundant conditionals from arch-specific init functions These conditions are already checked for in the main init function.	2024-10-27 13:20:56 +01:00
Ramiro Polla	7728b3357d	swscale/range_convert: call arch-specific init functions from main init function This commit also fixes the issue that the call to ff_sws_init_range_convert() from sws_init_swscale() was not setting up the arch-specific optimizations.	2024-10-27 13:20:56 +01:00
Niklas Haas	67adb30322	swscale: rename SwsContext to SwsInternal And preserve the public SwsContext as separate name. The motivation here is that I want to turn SwsContext into a public struct, while keeping the internal implementation hidden. Additionally, I also want to be able to use multiple internal implementations, e.g. for GPU devices. This commit does not include any functional changes. For the most part, it is a simple rename. The only complications arise from the public facing API functions, which preserve their current type (and hence require an additional unwrapping step internally), and the checkasm test framework, which directly accesses SwsInternal. For consistency, the affected functions that need to maintain a distinction have generally been changed to refer to the SwsContext as sws, and the SwsInternal as c. In an upcoming commit, I will provide a backing definition for the public SwsContext, and update `sws_internal()` to dereference the internal struct instead of merely casting it. Sponsored-by: Sovereign Tech Fund Signed-off-by: Niklas Haas <git@haasn.dev>	2024-10-24 22:50:00 +02:00
Zhao Zhili	4d90a76986	swscale/aarch64: Add argb/abgr to yuv Test on Apple M1 with kperf: : -O3 : -O3 -fno-vectorize abgr_to_uv_8_c : 19.4 : 26.1 abgr_to_uv_8_neon : 29.9 : 51.1 abgr_to_uv_128_c : 146.4 : 558.9 abgr_to_uv_128_neon : 85.1 : 83.4 abgr_to_uv_1080_c : 1162.6 : 4786.4 abgr_to_uv_1080_neon : 819.6 : 826.6 abgr_to_uv_1920_c : 2063.6 : 8492.1 abgr_to_uv_1920_neon : 1435.1 : 1447.1 abgr_to_uv_half_8_c : 16.4 : 11.4 abgr_to_uv_half_8_neon : 35.6 : 20.4 abgr_to_uv_half_128_c : 108.6 : 359.4 abgr_to_uv_half_128_neon : 75.4 : 42.6 abgr_to_uv_half_1080_c : 883.4 : 2885.6 abgr_to_uv_half_1080_neon : 460.6 : 481.1 abgr_to_uv_half_1920_c : 1553.6 : 5106.9 abgr_to_uv_half_1920_neon : 817.6 : 820.4 abgr_to_y_8_c : 6.1 : 26.4 abgr_to_y_8_neon : 40.6 : 6.4 abgr_to_y_128_c : 99.9 : 390.1 abgr_to_y_128_neon : 67.4 : 55.9 abgr_to_y_1080_c : 735.9 : 3170.4 abgr_to_y_1080_neon : 534.6 : 536.6 abgr_to_y_1920_c : 1279.4 : 6016.4 abgr_to_y_1920_neon : 932.6 : 927.6 Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>	2024-07-05 16:32:31 +08:00
Zhao Zhili	52422133ae	swscale/aarch64: Add bgra/rgba to yuv Test on Apple M1 with kperf : -O3 : -O3 -fno-vectorize bgra_to_uv_8_c : 13.4 : 27.5 bgra_to_uv_8_neon : 37.4 : 41.7 bgra_to_uv_128_c : 155.9 : 550.2 bgra_to_uv_128_neon : 91.7 : 92.7 bgra_to_uv_1080_c : 1173.2 : 4558.2 bgra_to_uv_1080_neon : 822.7 : 809.5 bgra_to_uv_1920_c : 2078.2 : 8115.2 bgra_to_uv_1920_neon : 1437.7 : 1438.7 bgra_to_uv_half_8_c : 17.9 : 14.2 bgra_to_uv_half_8_neon : 37.4 : 10.5 bgra_to_uv_half_128_c : 103.9 : 326.0 bgra_to_uv_half_128_neon : 73.9 : 68.7 bgra_to_uv_half_1080_c : 850.2 : 3732.0 bgra_to_uv_half_1080_neon : 484.2 : 490.0 bgra_to_uv_half_1920_c : 1479.2 : 4942.7 bgra_to_uv_half_1920_neon : 824.2 : 824.7 bgra_to_y_8_c : 8.2 : 29.5 bgra_to_y_8_neon : 18.2 : 32.7 bgra_to_y_128_c : 101.4 : 361.5 bgra_to_y_128_neon : 74.9 : 73.7 bgra_to_y_1080_c : 739.4 : 3018.0 bgra_to_y_1080_neon : 613.4 : 544.2 bgra_to_y_1920_c : 1298.7 : 5326.0 bgra_to_y_1920_neon : 918.7 : 934.2 Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>	2024-07-05 16:32:31 +08:00
Zhao Zhili	b8b71be07a	swscale/aarch64: Add bgr24 to yuv Test on Apple M1 with kperf : -O3 : -O3 -fno-vectorize bgr24_to_uv_8_c : 28.5 : 52.5 bgr24_to_uv_8_neon : 54.5 : 59.7 bgr24_to_uv_128_c : 294.0 : 830.7 bgr24_to_uv_128_neon : 99.7 : 112.0 bgr24_to_uv_1080_c : 965.0 : 6624.0 bgr24_to_uv_1080_neon : 751.5 : 754.7 bgr24_to_uv_1920_c : 1693.2 : 11554.5 bgr24_to_uv_1920_neon : 1292.5 : 1307.5 bgr24_to_uv_half_8_c : 54.2 : 37.0 bgr24_to_uv_half_8_neon : 27.2 : 22.5 bgr24_to_uv_half_128_c : 127.2 : 392.5 bgr24_to_uv_half_128_neon : 63.0 : 52.0 bgr24_to_uv_half_1080_c : 880.2 : 3329.0 bgr24_to_uv_half_1080_neon : 401.5 : 390.7 bgr24_to_uv_half_1920_c : 1585.7 : 6390.7 bgr24_to_uv_half_1920_neon : 694.7 : 698.7 bgr24_to_y_8_c : 21.7 : 22.5 bgr24_to_y_8_neon : 797.2 : 25.5 bgr24_to_y_128_c : 88.0 : 280.5 bgr24_to_y_128_neon : 63.7 : 55.0 bgr24_to_y_1080_c : 616.7 : 2208.7 bgr24_to_y_1080_neon : 900.0 : 452.0 bgr24_to_y_1920_c : 1093.2 : 3894.7 bgr24_to_y_1920_neon : 777.2 : 767.5 Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>	2024-07-05 16:32:31 +08:00
Ramiro Polla	75f1a8e071	swscale/aarch64: add neon {lum,chr}ConvertRange chrRangeFromJpeg_8_c: 29.2 chrRangeFromJpeg_8_neon: 19.5 chrRangeFromJpeg_24_c: 80.5 chrRangeFromJpeg_24_neon: 34.0 chrRangeFromJpeg_128_c: 413.7 chrRangeFromJpeg_128_neon: 156.0 chrRangeFromJpeg_144_c: 471.0 chrRangeFromJpeg_144_neon: 174.2 chrRangeFromJpeg_256_c: 842.0 chrRangeFromJpeg_256_neon: 305.5 chrRangeFromJpeg_512_c: 1699.0 chrRangeFromJpeg_512_neon: 608.0 chrRangeToJpeg_8_c: 51.7 chrRangeToJpeg_8_neon: 22.7 chrRangeToJpeg_24_c: 149.7 chrRangeToJpeg_24_neon: 38.0 chrRangeToJpeg_128_c: 761.7 chrRangeToJpeg_128_neon: 176.7 chrRangeToJpeg_144_c: 866.2 chrRangeToJpeg_144_neon: 198.7 chrRangeToJpeg_256_c: 1516.5 chrRangeToJpeg_256_neon: 348.7 chrRangeToJpeg_512_c: 3067.2 chrRangeToJpeg_512_neon: 692.7 lumRangeFromJpeg_8_c: 24.0 lumRangeFromJpeg_8_neon: 17.0 lumRangeFromJpeg_24_c: 56.7 lumRangeFromJpeg_24_neon: 21.0 lumRangeFromJpeg_128_c: 294.5 lumRangeFromJpeg_128_neon: 76.7 lumRangeFromJpeg_144_c: 332.5 lumRangeFromJpeg_144_neon: 86.7 lumRangeFromJpeg_256_c: 586.0 lumRangeFromJpeg_256_neon: 152.2 lumRangeFromJpeg_512_c: 1190.0 lumRangeFromJpeg_512_neon: 298.0 lumRangeToJpeg_8_c: 31.7 lumRangeToJpeg_8_neon: 19.5 lumRangeToJpeg_24_c: 83.5 lumRangeToJpeg_24_neon: 24.2 lumRangeToJpeg_128_c: 440.5 lumRangeToJpeg_128_neon: 91.0 lumRangeToJpeg_144_c: 504.2 lumRangeToJpeg_144_neon: 101.0 lumRangeToJpeg_256_c: 879.7 lumRangeToJpeg_256_neon: 177.2 lumRangeToJpeg_512_c: 1794.2 lumRangeToJpeg_512_neon: 354.0	2024-06-18 23:12:41 +02:00
Zhao Zhili	9dac8495b0	swscale/aarch64: Add rgb24 to yuv implementation Test on Apple M1: rgb24_to_uv_8_c: 0.0 rgb24_to_uv_8_neon: 0.2 rgb24_to_uv_128_c: 1.0 rgb24_to_uv_128_neon: 0.5 rgb24_to_uv_1080_c: 7.0 rgb24_to_uv_1080_neon: 5.7 rgb24_to_uv_1920_c: 12.5 rgb24_to_uv_1920_neon: 9.5 rgb24_to_uv_half_8_c: 0.2 rgb24_to_uv_half_8_neon: 0.2 rgb24_to_uv_half_128_c: 1.0 rgb24_to_uv_half_128_neon: 0.5 rgb24_to_uv_half_1080_c: 6.2 rgb24_to_uv_half_1080_neon: 3.0 rgb24_to_uv_half_1920_c: 11.2 rgb24_to_uv_half_1920_neon: 5.2 rgb24_to_y_8_c: 0.2 rgb24_to_y_8_neon: 0.0 rgb24_to_y_128_c: 0.5 rgb24_to_y_128_neon: 0.5 rgb24_to_y_1080_c: 4.7 rgb24_to_y_1080_neon: 3.2 rgb24_to_y_1920_c: 8.0 rgb24_to_y_1920_neon: 5.7 On Pixel 6: rgb24_to_uv_8_c: 30.7 rgb24_to_uv_8_neon: 56.9 rgb24_to_uv_128_c: 213.9 rgb24_to_uv_128_neon: 173.2 rgb24_to_uv_1080_c: 1649.9 rgb24_to_uv_1080_neon: 1424.4 rgb24_to_uv_1920_c: 2907.9 rgb24_to_uv_1920_neon: 2480.7 rgb24_to_uv_half_8_c: 36.2 rgb24_to_uv_half_8_neon: 33.4 rgb24_to_uv_half_128_c: 167.9 rgb24_to_uv_half_128_neon: 99.4 rgb24_to_uv_half_1080_c: 1293.9 rgb24_to_uv_half_1080_neon: 778.7 rgb24_to_uv_half_1920_c: 2292.7 rgb24_to_uv_half_1920_neon: 1328.7 rgb24_to_y_8_c: 19.7 rgb24_to_y_8_neon: 27.7 rgb24_to_y_128_c: 129.9 rgb24_to_y_128_neon: 96.7 rgb24_to_y_1080_c: 995.4 rgb24_to_y_1080_neon: 767.7 rgb24_to_y_1920_c: 1747.4 rgb24_to_y_1920_neon: 1337.2 Note both tests use clang as compiler, which has vectorization enabled by default with -O3. Reviewed-by: Rémi Denis-Courmont <remi@remlab.net> Reviewed-by: Martin Storsjö <martin@martin.st> Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>	2024-06-11 01:12:09 +08:00
Hubert Mazur	2537fdc510	sw_scale: Add specializations for hscale 16 to 19 Provide arm64 neon optimized implementations for hscale16To19 with filter sizes 4, 8 and X4. The tests and benchmarks run on AWS Graviton 2 instances. The results from a checkasm tool are shown below. hscale_16_to_19__fs_4_dstW_512_c: 6216.0 hscale_16_to_19__fs_4_dstW_512_neon: 2257.0 hscale_16_to_19__fs_8_dstW_512_c: 10417.7 hscale_16_to_19__fs_8_dstW_512_neon: 3112.5 hscale_16_to_19__fs_12_dstW_512_c: 14890.5 hscale_16_to_19__fs_12_dstW_512_neon: 3899.0 hscale_16_to_19__fs_16_dstW_512_c: 19006.5 hscale_16_to_19__fs_16_dstW_512_neon: 5341.2 hscale_16_to_19__fs_32_dstW_512_c: 36629.5 hscale_16_to_19__fs_32_dstW_512_neon: 9502.7 hscale_16_to_19__fs_40_dstW_512_c: 45477.5 hscale_16_to_19__fs_40_dstW_512_neon: 11552.0 (Note, the checkasm tests for these functions haven't been merged since they fail on x86.) Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-11-01 15:24:58 +02:00
Hubert Mazur	9ccf8c5bfc	sw_scale: Add specializations for hscale 16 to 15 Add arm64 neon implementations for hscale 16 to 15 with filter sizes 4, 8 and X4. The tests and benchmarks run on AWS Graviton 2 instances. The results from a checkasm tool are shown below. hscale_16_to_15__fs_4_dstW_512_c: 6703.5 hscale_16_to_15__fs_4_dstW_512_neon: 2298.0 hscale_16_to_15__fs_8_dstW_512_c: 10983.0 hscale_16_to_15__fs_8_dstW_512_neon: 3216.5 hscale_16_to_15__fs_12_dstW_512_c: 15526.0 hscale_16_to_15__fs_12_dstW_512_neon: 3993.0 hscale_16_to_15__fs_16_dstW_512_c: 20183.5 hscale_16_to_15__fs_16_dstW_512_neon: 5369.7 hscale_16_to_15__fs_32_dstW_512_c: 39315.2 hscale_16_to_15__fs_32_dstW_512_neon: 9511.2 hscale_16_to_15__fs_40_dstW_512_c: 48995.7 hscale_16_to_15__fs_40_dstW_512_neon: 11570.0 (Note, the checkasm tests for these functions haven't been merged since they fail on x86.) Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-11-01 15:24:53 +02:00
Hubert Mazur	1e9cfa5bb0	sw_scale: Add specializations for hscale 8 to 19 Add arm64 neon implementations for hscale 8 to 19 with filter sizes 4, 4X and 8. Both implementations are based on very similar ones dedicated to hscale 8 to 15. The major changes refer to saving the data - instead of writing the result as int16_t it is done with int32_t. These functions are heavily inspired by patches provided by J. Swinney and M. Storsjö for hscale8to15 which were slightly adapted for hscale8to19. The tests and benchmarks run on AWS Graviton 2 instances. The results from a checkasm tool are shown below. hscale_8_to_19__fs_4_dstW_512_c: 5663.2 hscale_8_to_19__fs_4_dstW_512_neon: 1259.7 hscale_8_to_19__fs_8_dstW_512_c: 9306.0 hscale_8_to_19__fs_8_dstW_512_neon: 2020.2 hscale_8_to_19__fs_12_dstW_512_c: 12932.7 hscale_8_to_19__fs_12_dstW_512_neon: 2462.5 hscale_8_to_19__fs_16_dstW_512_c: 16844.2 hscale_8_to_19__fs_16_dstW_512_neon: 4671.2 hscale_8_to_19__fs_32_dstW_512_c: 32803.7 hscale_8_to_19__fs_32_dstW_512_neon: 5474.2 hscale_8_to_19__fs_40_dstW_512_c: 40948.0 hscale_8_to_19__fs_40_dstW_512_neon: 6669.7 Signed-off-by: Hubert Mazur <hum@semihalf.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-11-01 15:24:43 +02:00
Swinney, Jonathan	0d7caa5b09	swscale/aarch64: add vscale specializations This commit adds new code paths for vscale when filterSize is 2, 4, or 8. By using specialized code with unrolling to match the filterSize we can improve performance. On AWS c7g (Graviton 3, Neoverse V1) instances: before after yuv2yuvX_2_0_512_accurate_neon: 558.8 268.9 yuv2yuvX_4_0_512_accurate_neon: 637.5 434.9 yuv2yuvX_8_0_512_accurate_neon: 1144.8 806.2 yuv2yuvX_16_0_512_accurate_neon: 2080.5 1853.7 Signed-off-by: Jonathan Swinney <jswinney@amazon.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-08-16 13:40:42 +03:00
Swinney, Jonathan	75ffca7eef	libswscale/aarch64: add another hscale specialization This specialization handles the case where filtersize is 4 mod 8, e.g. 12, 20, etc. Aarch64 was previously using the c function for this case. This implementation speeds up that case significantly. hscale_8_to_15__fs_12_dstW_512_c: 6234.1 hscale_8_to_15__fs_12_dstW_512_neon: 1505.6 Signed-off-by: Jonathan Swinney <jswinney@amazon.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-08-16 12:08:38 +03:00
Swinney, Jonathan	0ea61725b1	swscale/aarch64: add hscale specializations This patch adds code to support specializations of the hscale function and adds a specialization for filterSize == 4. ff_hscale8to15_4_neon is a complete rewrite. Since the main bottleneck here is loading the data from src, this data is loaded a whole block ahead and stored back to the stack to be loaded again with ld4. This arranges the data for most efficient use of the vector instructions and removes the need for completion adds at the end. The number of iterations of the C per iteration of the assembly is increased from 4 to 8, but because of the prefetching, there must be a special section without prefetching when dstW < 16. This improves speed on Graviton 2 (Neoverse N1) dramatically in the case where previously fs=8 would have been required. before: hscale_8_to_15__fs_8_dstW_512_neon: 1962.8 after : hscale_8_to_15__fs_4_dstW_512_neon: 1220.9 Signed-off-by: Jonathan Swinney <jswinney@amazon.com> Signed-off-by: Martin Storsjö <martin@martin.st>	2022-05-28 01:09:05 +03:00
Andreas Rheinhardt	f3c197b129	Include attributes.h directly Some files currently rely on libavutil/cpu.h to include it for them; yet said file won't include it any more after the currently deprecated functions are removed, so include attributes.h directly. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2021-04-19 14:34:10 +02:00
Josh de Kock	718c8f9aa5	swscale: fix NEON hscale init The NEON hscale function only supports X8 filter sizes and should only be selected when these are being used. At the moment filterAlign is set to 8 but in the future when extra NEON assembly for specific sizes is added they will need to have checks here too. The immediate usecase for this change is making the hscale checkasm test easier and without NEON specific edge-cases (x86 already has these guards). Signed-off-by: Josh de Kock <josh@itanimul.li>	2020-05-15 10:29:30 +01:00
Clément Bœsch	c921f4f687	sws/aarch64: add ff_yuv2planeX_8_neon	2016-04-11 16:27:19 +02:00
Clément Bœsch	040598218f	sws/aarch64: restore ff_hscale_8_to_15_neon() Fix final scaling and required filter alignment. Pass FATE.	2016-04-05 12:00:36 +02:00
Clément Bœsch	eadaef2a63	sws/aarch64: disable ff_hscale_8_to_15_neon temporarily Looks broken.	2016-04-01 17:33:01 +02:00
Clément Bœsch	263eb76bdf	sws/aarch64: add ff_hscale_8_to_15_neon ./ffmpeg -nostats -f lavfi -i testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null - before: t:0.489726 avg:0.489883 max:0.491852 min:0.489482 after: t:0.256515 avg:0.256458 max:0.256999 min:0.253755	2016-03-31 10:12:55 +02:00
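
Commit 73f4668ef8 above describes the usual init pattern: assign the lowest-level implementation first, then conditionally reassign higher-level ones, with no else branches. The following is a minimal sketch of that pattern only. The flag names, the `lum_fn` type, `select_lum()` and the three stand-in functions are hypothetical and are not FFmpeg's actual identifiers.

```c
#include <stdint.h>

/* Hypothetical CPU feature flags, for illustration only. */
enum { CPU_FLAG_NEON = 1 << 0, CPU_FLAG_DOTPROD = 1 << 1 };

typedef void (*lum_fn)(uint8_t *dst, const uint8_t *src, int width);

/* Stand-ins for the C, Neon and dotprod variants. All three just copy,
 * since only the selection logic is being shown here. */
static void lum_c(uint8_t *dst, const uint8_t *src, int width)
{
    for (int i = 0; i < width; i++)
        dst[i] = src[i];
}

static void lum_neon(uint8_t *dst, const uint8_t *src, int width)
{
    lum_c(dst, src, width);
}

static void lum_dotprod(uint8_t *dst, const uint8_t *src, int width)
{
    lum_c(dst, src, width);
}

/* Start from the baseline, then let each supported extension overwrite
 * the pointer in increasing order of preference. */
static lum_fn select_lum(int cpu_flags)
{
    lum_fn fn = lum_c;

    if (cpu_flags & CPU_FLAG_NEON)
        fn = lum_neon;
    if (cpu_flags & CPU_FLAG_DOTPROD)
        fn = lum_dotprod;

    return fn;
}
```

Because later checks simply overwrite earlier assignments, adding another extension level only needs one more `if`, which is the simplification the commit message refers to.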

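Commits 384fe39623 and 2d1358a84d above describe range-conversion constants (coeff and offset) that are derived at runtime from the bit depth, with 14 fixed-point bits for bit depths <= 14, and a result that is saturated on output. The sketch below illustrates that idea for luma, limited (mpeg) range to full (jpeg) range. It is an assumption-laden illustration rather than the swscale code: it works on plain samples, saturates to the sample range rather than to 15 bits, covers only bit depths <= 14, and `RangeConst`, `lum_to_full_consts()` and `lum_to_full()` are hypothetical names.

```c
#include <stdint.h>

#define FIX_BITS 14 /* fixed-point precision the commit names for depths <= 14 */

typedef struct {
    uint32_t coeff;  /* scale factor in FIX_BITS fixed point */
    int64_t  offset; /* includes the rounding term */
} RangeConst;

/* Derive constants so that 16 << (depth - 8) maps to 0 and
 * 235 << (depth - 8) maps to the largest sample value. */
static RangeConst lum_to_full_consts(int bit_depth)
{
    int     shift     = bit_depth - 8;
    int64_t src_min   = 16LL  << shift;
    int64_t src_max   = 235LL << shift;
    int64_t dst_max   = (1LL << bit_depth) - 1;
    int64_t src_range = src_max - src_min;
    RangeConst k;

    /* Rounded division: coeff ~= dst_max / src_range * 2^FIX_BITS. */
    k.coeff  = (uint32_t)(((dst_max << FIX_BITS) + src_range / 2) / src_range);
    k.offset = (1LL << (FIX_BITS - 1)) - src_min * (int64_t)k.coeff;
    return k;
}

/* Convert one sample. The input is not limited beforehand; the output
 * is saturated to the valid sample range instead. */
static int lum_to_full(int sample, int bit_depth, RangeConst k)
{
    int64_t max = (1LL << bit_depth) - 1;
    int64_t v   = (int64_t)sample * k.coeff + k.offset;

    if (v < 0)
        return 0;
    v >>= FIX_BITS;
    return v > max ? (int)max : (int)v;
}
```

For 8-bit input this yields a coeff of 19077, which fits the unsigned 16-bit coeff the commit message mentions for bit depths <= 14.
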
31 Commits
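
Commit d765e5f043 above splits 16-bit coefficients into lower and upper halves so that a dot product over 8-bit lanes can be used for each half. The scalar sketch below shows only the arithmetic identity behind that split. `udot4()` is a hypothetical stand-in that emulates a single lane of such an instruction, and recombining with a left shift of the upper partial sum is an assumption made here to keep the result exact; the order and direction of the shift in the actual assembly may differ.

```c
#include <stdint.h>

/* Emulates one 32-bit lane of an unsigned 8-bit dot product:
 * acc + a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3]. */
static uint32_t udot4(uint32_t acc, const uint8_t a[4], const uint8_t b[4])
{
    for (int i = 0; i < 4; i++)
        acc += (uint32_t)a[i] * b[i];
    return acc;
}

/* Reference: direct dot product of 8-bit pixels with 16-bit coefficients. */
static uint32_t dot_ref(const uint8_t px[4], const uint16_t coeff[4])
{
    uint32_t sum = 0;
    for (int i = 0; i < 4; i++)
        sum += (uint32_t)px[i] * coeff[i];
    return sum;
}

/* Split each coefficient into low and high bytes, run one 8-bit dot
 * product per half, and recombine the two partial sums. */
static uint32_t dot_split(const uint8_t px[4], const uint16_t coeff[4])
{
    uint8_t lo[4], hi[4];

    for (int i = 0; i < 4; i++) {
        lo[i] = (uint8_t)(coeff[i] & 0xff);
        hi[i] = (uint8_t)(coeff[i] >> 8);
    }
    return udot4(0, px, lo) + (udot4(0, px, hi) << 8);
}
```

The two functions agree because each coefficient equals `lo + 256 * hi`, and the largest possible upper partial sum (4 * 255 * 255) still fits in 32 bits after the shift.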