FFmpeg

mirror of https://mirror.skon.top/https://github.com/FFmpeg/FFmpeg synced 2026-05-01 06:13:08 +08:00

Author	SHA1	Message	Date
Martin Storsjö	74cfcd1c69	aarch64/vvc: Fix DCE undefined references with MSVC This fixes compiling with MSVC for aarch64 after `510999f6b0`. While MSVC does do dead code elimination for function references within e.g. "if (0)", it doesn't do that for functions referenced within a static function, even if that static function itself ends up unused. A reproduction example: void missing(void); void (*func_ptr)(void); static void wrapper(void) { missing(); } void init(int cpu_flags) { if (0) { func_ptr = wrapper; } } If "wrapper" is entirely unreferenced, then MSVC doesn't produce any reference to the symbol "missing". Also, if we do "func_ptr = missing;" then the reference to missing is also eliminated. But when the function is referenced from within a static function, MSVC keeps the reference to the symbol even if the reference to the static function itself can be eliminated.	2026-03-05 11:57:40 +02:00
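The reproduction in the message above leaves `missing()` reachable only through an unused static wrapper. One way to make sure no reference survives on any compiler is to compile the wrapper out with the preprocessor. The sketch below is only an illustration of that idea, not the change made in `74cfcd1c69`; the `HAVE_MISSING` macro is a hypothetical stand-in for a build-time feature switch.

```c
#include <stddef.h>

/* Hypothetical build-time switch; stands in for a configure-generated macro. */
#ifndef HAVE_MISSING
#define HAVE_MISSING 0
#endif

void missing(void);             /* only defined when HAVE_MISSING is 1 */
void (*func_ptr)(void) = NULL;

#if HAVE_MISSING
/* Compiled out entirely when the feature is off, so the object file
 * cannot contain a reference to missing(). */
static void wrapper(void)
{
    missing();
}
#endif

void init(int cpu_flags)
{
    (void)cpu_flags;
#if HAVE_MISSING
    func_ptr = wrapper;
#endif
}
```

With `HAVE_MISSING` left at 0, the translation unit links without a definition of `missing()`, regardless of how aggressively the compiler removes dead code.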
Georgii Zagoruiko	510999f6b0	aarch64/vvc: sme2 optimisation of alf_filter_luma() 8/10/12 bit Apple M4: vvc_alf_filter_luma_8x8_8_c: 347.3 ( 1.00x) vvc_alf_filter_luma_8x8_8_neon: 138.7 ( 2.50x) vvc_alf_filter_luma_8x8_8_sme2: 134.5 ( 2.58x) vvc_alf_filter_luma_8x8_10_c: 299.8 ( 1.00x) vvc_alf_filter_luma_8x8_10_neon: 129.8 ( 2.31x) vvc_alf_filter_luma_8x8_10_sme2: 128.6 ( 2.33x) vvc_alf_filter_luma_8x8_12_c: 293.0 ( 1.00x) vvc_alf_filter_luma_8x8_12_neon: 126.8 ( 2.31x) vvc_alf_filter_luma_8x8_12_sme2: 126.3 ( 2.32x) vvc_alf_filter_luma_16x16_8_c: 1386.1 ( 1.00x) vvc_alf_filter_luma_16x16_8_neon: 560.3 ( 2.47x) vvc_alf_filter_luma_16x16_8_sme2: 540.1 ( 2.57x) vvc_alf_filter_luma_16x16_10_c: 1200.3 ( 1.00x) vvc_alf_filter_luma_16x16_10_neon: 515.6 ( 2.33x) vvc_alf_filter_luma_16x16_10_sme2: 531.3 ( 2.26x) vvc_alf_filter_luma_16x16_12_c: 1223.8 ( 1.00x) vvc_alf_filter_luma_16x16_12_neon: 510.7 ( 2.40x) vvc_alf_filter_luma_16x16_12_sme2: 524.9 ( 2.33x) vvc_alf_filter_luma_32x32_8_c: 5488.8 ( 1.00x) vvc_alf_filter_luma_32x32_8_neon: 2233.4 ( 2.46x) vvc_alf_filter_luma_32x32_8_sme2: 1093.6 ( 5.02x) vvc_alf_filter_luma_32x32_10_c: 4738.0 ( 1.00x) vvc_alf_filter_luma_32x32_10_neon: 2057.5 ( 2.30x) vvc_alf_filter_luma_32x32_10_sme2: 1053.6 ( 4.50x) vvc_alf_filter_luma_32x32_12_c: 4808.3 ( 1.00x) vvc_alf_filter_luma_32x32_12_neon: 1981.2 ( 2.43x) vvc_alf_filter_luma_32x32_12_sme2: 1047.7 ( 4.59x) vvc_alf_filter_luma_64x64_8_c: 22116.8 ( 1.00x) vvc_alf_filter_luma_64x64_8_neon: 8951.0 ( 2.47x) vvc_alf_filter_luma_64x64_8_sme2: 4225.2 ( 5.23x) vvc_alf_filter_luma_64x64_10_c: 19072.8 ( 1.00x) vvc_alf_filter_luma_64x64_10_neon: 8448.1 ( 2.26x) vvc_alf_filter_luma_64x64_10_sme2: 4225.8 ( 4.51x) vvc_alf_filter_luma_64x64_12_c: 19312.6 ( 1.00x) vvc_alf_filter_luma_64x64_12_neon: 8270.9 ( 2.34x) vvc_alf_filter_luma_64x64_12_sme2: 4245.4 ( 4.55x) vvc_alf_filter_luma_128x128_8_c: 88530.5 ( 1.00x) vvc_alf_filter_luma_128x128_8_neon: 35686.3 ( 2.48x) vvc_alf_filter_luma_128x128_8_sme2: 16961.2 ( 5.22x) vvc_alf_filter_luma_128x128_10_c: 76904.9 ( 1.00x) vvc_alf_filter_luma_128x128_10_neon: 32439.5 ( 2.37x) vvc_alf_filter_luma_128x128_10_sme2: 16845.6 ( 4.57x) vvc_alf_filter_luma_128x128_12_c: 77363.3 ( 1.00x) vvc_alf_filter_luma_128x128_12_neon: 32907.5 ( 2.35x) vvc_alf_filter_luma_128x128_12_sme2: 17018.1 ( 4.55x)	2026-03-04 23:52:58 +02:00
Georgii Zagoruiko	90431417cb	aarch64/vvc: Optimisations of put_luma_hv() functions for 10/12-bit Apple M2: put_luma_hv_10_4x4_c: 36.3 ( 1.00x) put_luma_hv_10_8x8_c: 82.9 ( 1.00x) put_luma_hv_10_8x8_neon: 34.9 ( 2.37x) put_luma_hv_10_16x16_c: 239.2 ( 1.00x) put_luma_hv_10_16x16_neon: 119.0 ( 2.01x) put_luma_hv_10_32x32_c: 900.3 ( 1.00x) put_luma_hv_10_32x32_neon: 429.3 ( 2.10x) put_luma_hv_10_64x64_c: 2984.7 ( 1.00x) put_luma_hv_10_64x64_neon: 1736.2 ( 1.72x) put_luma_hv_10_128x128_c: 11194.2 ( 1.00x) put_luma_hv_10_128x128_neon: 6357.3 ( 1.76x) put_luma_hv_12_4x4_c: 35.9 ( 1.00x) put_luma_hv_12_8x8_c: 82.6 ( 1.00x) put_luma_hv_12_8x8_neon: 34.3 ( 2.41x) put_luma_hv_12_16x16_c: 240.2 ( 1.00x) put_luma_hv_12_16x16_neon: 115.3 ( 2.08x) put_luma_hv_12_32x32_c: 787.7 ( 1.00x) put_luma_hv_12_32x32_neon: 414.2 ( 1.90x) put_luma_hv_12_64x64_c: 3058.4 ( 1.00x) put_luma_hv_12_64x64_neon: 1592.3 ( 1.92x) put_luma_hv_12_128x128_c: 11350.8 ( 1.00x) put_luma_hv_12_128x128_neon: 6378.3 ( 1.78x) RPi4: put_luma_hv_10_4x4_c: 637.8 ( 1.00x) put_luma_hv_10_8x8_c: 1044.9 ( 1.00x) put_luma_hv_10_8x8_neon: 483.7 ( 2.16x) put_luma_hv_10_16x16_c: 3098.0 ( 1.00x) put_luma_hv_10_16x16_neon: 1603.1 ( 1.93x) put_luma_hv_10_32x32_c: 10054.8 ( 1.00x) put_luma_hv_10_32x32_neon: 5843.6 ( 1.72x) put_luma_hv_10_64x64_c: 40506.2 ( 1.00x) put_luma_hv_10_64x64_neon: 24384.0 ( 1.66x) put_luma_hv_10_128x128_c: 130604.2 ( 1.00x) put_luma_hv_10_128x128_neon: 99746.6 ( 1.31x) put_luma_hv_12_4x4_c: 638.2 ( 1.00x) put_luma_hv_12_8x8_c: 1074.6 ( 1.00x) put_luma_hv_12_8x8_neon: 482.6 ( 2.23x) put_luma_hv_12_16x16_c: 3094.0 ( 1.00x) put_luma_hv_12_16x16_neon: 1602.5 ( 1.93x) put_luma_hv_12_32x32_c: 10034.4 ( 1.00x) put_luma_hv_12_32x32_neon: 5843.3 ( 1.72x) put_luma_hv_12_64x64_c: 40447.5 ( 1.00x) put_luma_hv_12_64x64_neon: 24377.2 ( 1.66x) put_luma_hv_12_128x128_c: 130610.4 ( 1.00x) put_luma_hv_12_128x128_neon: 99765.8 ( 1.31x)	2026-03-04 12:53:16 +00:00
Andreas Rheinhardt	dc65dcec22	avcodec/vvc/inter: Combine offsets early For bi-predicted weighted averages, only the sum of the two offsets is ever used, so add the two early. Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>	2026-02-25 12:08:33 +01:00
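The observation in the message above can be illustrated with a scalar sketch: if the two offsets only ever appear as `o0 + o1`, the caller can add them once and pass a single value. The function names, the 8-bit clip and the rounding formula below are assumptions chosen for illustration and are not taken from the FFmpeg sources; an arithmetic right shift is assumed for negative intermediate values.

```c
#include <stdint.h>

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Both offsets are passed in, but they are only used as a sum. */
static uint8_t w_avg_two_offsets(int a, int b, int w0, int w1,
                                 int o0, int o1, int shift)
{
    const int rnd = (o0 + o1 + 1) * (1 << (shift - 1));
    return clip_u8((a * w0 + b * w1 + rnd) >> shift);
}

/* The caller has already added the offsets; one parameter fewer. */
static uint8_t w_avg_combined(int a, int b, int w0, int w1,
                              int o_sum, int shift)
{
    const int rnd = (o_sum + 1) * (1 << (shift - 1));
    return clip_u8((a * w0 + b * w1 + rnd) >> shift);
}
```

Calling `w_avg_combined(a, b, w0, w1, o0 + o1, shift)` gives the same result as `w_avg_two_offsets(a, b, w0, w1, o0, o1, shift)` for every input, so the addition can move out of the per-block code.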
Georgii Zagoruiko	8acdffa22c	aarch64/vvc: Optimisations of put_luma_v() functions for 10/12-bit RPi4 (auto-vectorisation is on) put_luma_v_10_4x4_c: 303.3 ( 1.00x) put_luma_v_10_4x4_neon: 55.7 ( 5.45x) put_luma_v_10_8x8_c: 1106.7 ( 1.00x) put_luma_v_10_8x8_neon: 163.8 ( 6.76x) put_luma_v_10_16x16_c: 2242.1 ( 1.00x) put_luma_v_10_16x16_neon: 672.7 ( 3.33x) put_luma_v_10_32x32_c: 7057.3 ( 1.00x) put_luma_v_10_32x32_neon: 2731.3 ( 2.58x) put_luma_v_10_64x64_c: 25699.8 ( 1.00x) put_luma_v_10_64x64_neon: 12145.6 ( 2.12x) put_luma_v_10_128x128_c: 90694.6 ( 1.00x) put_luma_v_10_128x128_neon: 44862.4 ( 2.02x) put_luma_v_12_4x4_c: 304.4 ( 1.00x) put_luma_v_12_4x4_neon: 55.6 ( 5.47x) put_luma_v_12_8x8_c: 1107.4 ( 1.00x) put_luma_v_12_8x8_neon: 164.7 ( 6.72x) put_luma_v_12_16x16_c: 2235.8 ( 1.00x) put_luma_v_12_16x16_neon: 672.5 ( 3.32x) put_luma_v_12_32x32_c: 7049.2 ( 1.00x) put_luma_v_12_32x32_neon: 2731.6 ( 2.58x) put_luma_v_12_64x64_c: 25706.5 ( 1.00x) put_luma_v_12_64x64_neon: 12145.0 ( 2.12x) put_luma_v_12_128x128_c: 90672.5 ( 1.00x) put_luma_v_12_128x128_neon: 44857.1 ( 2.02x) Apple M4 (auto-vectorisation is on): put_luma_v_10_4x4_c: 25.6 ( 1.00x) put_luma_v_10_4x4_neon: 3.1 ( 8.18x) put_luma_v_10_8x8_c: 34.7 ( 1.00x) put_luma_v_10_8x8_neon: 10.5 ( 3.32x) put_luma_v_10_16x16_c: 103.9 ( 1.00x) put_luma_v_10_16x16_neon: 42.3 ( 2.45x) put_luma_v_10_32x32_c: 399.7 ( 1.00x) put_luma_v_10_32x32_neon: 161.8 ( 2.47x) put_luma_v_10_64x64_c: 1276.7 ( 1.00x) put_luma_v_10_64x64_neon: 840.1 ( 1.52x) put_luma_v_10_128x128_c: 4981.3 ( 1.00x) put_luma_v_10_128x128_neon: 3008.0 ( 1.66x) put_luma_v_12_4x4_c: 23.6 ( 1.00x) put_luma_v_12_4x4_neon: 2.0 (11.84x) put_luma_v_12_8x8_c: 31.8 ( 1.00x) put_luma_v_12_8x8_neon: 12.4 ( 2.55x) put_luma_v_12_16x16_c: 100.8 ( 1.00x) put_luma_v_12_16x16_neon: 44.9 ( 2.25x) put_luma_v_12_32x32_c: 331.1 ( 1.00x) put_luma_v_12_32x32_neon: 175.2 ( 1.89x) put_luma_v_12_64x64_c: 1227.1 ( 1.00x) put_luma_v_12_64x64_neon: 712.7 ( 1.72x) put_luma_v_12_128x128_c: 5149.1 ( 1.00x) put_luma_v_12_128x128_neon: 2809.3 ( 1.83x)	2026-01-08 17:35:55 +00:00
Georgii Zagoruiko	f790de2a87	aarch64/vvc: Optimisations of put_luma_h() functions for 10/12-bit RPi4 (auto-vectorisation is turned on) put_luma_h_10_4x4_c: 282.8 ( 1.00x) put_luma_h_10_8x8_c: 1069.5 ( 1.00x) put_luma_h_10_8x8_neon: 207.5 ( 5.15x) put_luma_h_10_16x16_c: 1999.6 ( 1.00x) put_luma_h_10_16x16_neon: 777.5 ( 2.57x) put_luma_h_10_32x32_c: 6612.9 ( 1.00x) put_luma_h_10_32x32_neon: 3201.6 ( 2.07x) put_luma_h_10_64x64_c: 25059.0 ( 1.00x) put_luma_h_10_64x64_neon: 13623.5 ( 1.84x) put_luma_h_10_128x128_c: 91310.1 ( 1.00x) put_luma_h_10_128x128_neon: 50358.3 ( 1.81x) put_luma_h_12_4x4_c: 282.1 ( 1.00x) put_luma_h_12_8x8_c: 1068.4 ( 1.00x) put_luma_h_12_8x8_neon: 207.7 ( 5.14x) put_luma_h_12_16x16_c: 1998.0 ( 1.00x) put_luma_h_12_16x16_neon: 777.5 ( 2.57x) put_luma_h_12_32x32_c: 6612.0 ( 1.00x) put_luma_h_12_32x32_neon: 3201.6 ( 2.07x) put_luma_h_12_64x64_c: 25036.8 ( 1.00x) put_luma_h_12_64x64_neon: 13595.1 ( 1.84x) put_luma_h_12_128x128_c: 91305.8 ( 1.00x) put_luma_h_12_128x128_neon: 50359.7 ( 1.81x) Apple M2 Air (auto-vectorisation is turned on) put_luma_h_10_4x4_c: 0.3 ( 1.00x) put_luma_h_10_8x8_c: 1.0 ( 1.00x) put_luma_h_10_8x8_neon: 0.4 ( 2.59x) put_luma_h_10_16x16_c: 2.9 ( 1.00x) put_luma_h_10_16x16_neon: 1.4 ( 2.01x) put_luma_h_10_32x32_c: 9.4 ( 1.00x) put_luma_h_10_32x32_neon: 5.8 ( 1.62x) put_luma_h_10_64x64_c: 35.6 ( 1.00x) put_luma_h_10_64x64_neon: 23.6 ( 1.51x) put_luma_h_10_128x128_c: 131.1 ( 1.00x) put_luma_h_10_128x128_neon: 92.6 ( 1.42x) put_luma_h_12_4x4_c: 0.3 ( 1.00x) put_luma_h_12_8x8_c: 1.0 ( 1.00x) put_luma_h_12_8x8_neon: 0.4 ( 2.58x) put_luma_h_12_16x16_c: 2.9 ( 1.00x) put_luma_h_12_16x16_neon: 1.4 ( 2.00x) put_luma_h_12_32x32_c: 9.4 ( 1.00x) put_luma_h_12_32x32_neon: 5.8 ( 1.61x) put_luma_h_12_64x64_c: 35.3 ( 1.00x) put_luma_h_12_64x64_neon: 23.3 ( 1.52x) put_luma_h_12_128x128_c: 131.2 ( 1.00x) put_luma_h_12_128x128_neon: 92.4 ( 1.42x)	2025-11-24 21:22:55 +00:00
Krzysztof Pyrkosz	03c054d43c	avcodec/aarch64/vvc: Implement dmvr_v_8 A72 dmvr_v_8_12x20_neon: 207.0 ( 4.15x) dmvr_v_8_20x12_neon: 170.4 ( 4.37x) dmvr_v_8_20x20_neon: 273.4 ( 4.58x) A53 dmvr_v_8_12x20_neon: 450.6 ( 4.21x) dmvr_v_8_20x12_neon: 342.8 ( 3.70x) dmvr_v_8_20x20_neon: 550.9 ( 3.79x)	2025-09-23 11:20:20 +00:00
Krzysztof Pyrkosz	56a638d836	avcodec/aarch64/vvc: Unroll vvc_bdof_grad_filter_8x_neon Before and after: A53: apply_bdof_8_16x8_neon: 2733.1 ( 4.88x) apply_bdof_8_16x16_neon: 5458.6 ( 4.86x) apply_bdof_10_16x8_neon: 2789.8 ( 4.64x) apply_bdof_10_16x16_neon: 5523.8 ( 4.68x) apply_bdof_12_16x8_neon: 2792.8 ( 4.58x) apply_bdof_12_16x16_neon: 5519.5 ( 4.63x) apply_bdof_8_16x8_neon: 2571.8 ( 5.12x) apply_bdof_8_16x16_neon: 5173.3 ( 5.12x) apply_bdof_10_16x8_neon: 2635.1 ( 4.87x) apply_bdof_10_16x16_neon: 5243.0 ( 4.89x) apply_bdof_12_16x8_neon: 2613.0 ( 4.89x) apply_bdof_12_16x16_neon: 5231.7 ( 4.90x) A78: apply_bdof_8_16x8_neon: 565.3 ( 8.43x) apply_bdof_8_16x16_neon: 1109.5 ( 8.60x) apply_bdof_10_16x8_neon: 568.2 ( 7.92x) apply_bdof_10_16x16_neon: 1114.1 ( 8.08x) apply_bdof_12_16x8_neon: 570.2 ( 7.87x) apply_bdof_12_16x16_neon: 1116.3 ( 8.03x) apply_bdof_8_16x8_neon: 541.4 ( 8.81x) apply_bdof_8_16x16_neon: 1065.9 ( 8.97x) apply_bdof_10_16x8_neon: 543.2 ( 8.32x) apply_bdof_10_16x16_neon: 1071.5 ( 8.39x) apply_bdof_12_16x8_neon: 544.2 ( 8.25x) apply_bdof_12_16x16_neon: 1074.1 ( 8.37x)	2025-09-23 11:20:11 +00:00
Krzysztof Pyrkosz	f1a155d975	avcodec/aarch64/vvc: Optimize dmvr_hv_10 Before and after on A53: dmvr_hv_10_12x20_neon: 1838.2 ( 3.02x) dmvr_hv_10_20x12_neon: 1330.2 ( 1.83x) dmvr_hv_10_20x20_neon: 2148.2 ( 1.85x) dmvr_hv_12_12x20_neon: 1839.2 ( 3.02x) dmvr_hv_12_20x12_neon: 1330.6 ( 1.83x) dmvr_hv_12_20x20_neon: 2147.2 ( 1.85x) dmvr_hv_10_12x20_neon: 1755.0 ( 3.17x) dmvr_hv_10_20x12_neon: 1165.8 ( 2.09x) dmvr_hv_10_20x20_neon: 1876.1 ( 2.12x) dmvr_hv_12_12x20_neon: 1754.4 ( 3.17x) dmvr_hv_12_20x12_neon: 1167.8 ( 2.09x) dmvr_hv_12_20x20_neon: 1878.8 ( 2.12x)	2025-09-21 19:39:27 +00:00
Georgii Zagoruiko	4fbacb3944	avcodec/aarch64/vvc: Optimised version of classify function. Macbook Air (M2): vvc_alf_classify_8x8_8_c: 2.6 ( 1.00x) vvc_alf_classify_8x8_8_neon: 1.0 ( 2.47x) vvc_alf_classify_8x8_10_c: 2.7 ( 1.00x) vvc_alf_classify_8x8_10_neon: 0.9 ( 2.98x) vvc_alf_classify_8x8_12_c: 2.7 ( 1.00x) vvc_alf_classify_8x8_12_neon: 0.9 ( 2.97x) vvc_alf_classify_16x16_8_c: 7.3 ( 1.00x) vvc_alf_classify_16x16_8_neon: 3.4 ( 2.12x) vvc_alf_classify_16x16_10_c: 4.3 ( 1.00x) vvc_alf_classify_16x16_10_neon: 2.9 ( 1.47x) vvc_alf_classify_16x16_12_c: 4.3 ( 1.00x) vvc_alf_classify_16x16_12_neon: 3.0 ( 1.44x) vvc_alf_classify_32x32_8_c: 13.7 ( 1.00x) vvc_alf_classify_32x32_8_neon: 10.7 ( 1.29x) vvc_alf_classify_32x32_10_c: 12.3 ( 1.00x) vvc_alf_classify_32x32_10_neon: 8.7 ( 1.42x) vvc_alf_classify_32x32_12_c: 12.2 ( 1.00x) vvc_alf_classify_32x32_12_neon: 8.7 ( 1.40x) vvc_alf_classify_64x64_8_c: 45.8 ( 1.00x) vvc_alf_classify_64x64_8_neon: 37.1 ( 1.23x) vvc_alf_classify_64x64_10_c: 41.3 ( 1.00x) vvc_alf_classify_64x64_10_neon: 32.8 ( 1.26x) vvc_alf_classify_64x64_12_c: 41.4 ( 1.00x) vvc_alf_classify_64x64_12_neon: 32.4 ( 1.28x) vvc_alf_classify_128x128_8_c: 163.7 ( 1.00x) vvc_alf_classify_128x128_8_neon: 138.3 ( 1.18x) vvc_alf_classify_128x128_10_c: 149.1 ( 1.00x) vvc_alf_classify_128x128_10_neon: 120.3 ( 1.24x) vvc_alf_classify_128x128_12_c: 148.7 ( 1.00x) vvc_alf_classify_128x128_12_neon: 119.4 ( 1.25x) RPi4 (Cortex-A72): vvc_alf_classify_8x8_8_c: 1251.6 ( 1.00x) vvc_alf_classify_8x8_8_neon: 700.7 ( 1.79x) vvc_alf_classify_8x8_10_c: 1141.9 ( 1.00x) vvc_alf_classify_8x8_10_neon: 659.7 ( 1.73x) vvc_alf_classify_8x8_12_c: 1075.8 ( 1.00x) vvc_alf_classify_8x8_12_neon: 658.7 ( 1.63x) vvc_alf_classify_16x16_8_c: 3574.1 ( 1.00x) vvc_alf_classify_16x16_8_neon: 1849.8 ( 1.93x) vvc_alf_classify_16x16_10_c: 3270.0 ( 1.00x) vvc_alf_classify_16x16_10_neon: 1786.1 ( 1.83x) vvc_alf_classify_16x16_12_c: 3271.7 ( 1.00x) vvc_alf_classify_16x16_12_neon: 1785.5 ( 1.83x) vvc_alf_classify_32x32_8_c: 12451.9 ( 1.00x) vvc_alf_classify_32x32_8_neon: 5984.3 ( 2.08x) vvc_alf_classify_32x32_10_c: 11428.9 ( 1.00x) vvc_alf_classify_32x32_10_neon: 5756.3 ( 1.99x) vvc_alf_classify_32x32_12_c: 11252.8 ( 1.00x) vvc_alf_classify_32x32_12_neon: 5755.7 ( 1.96x) vvc_alf_classify_64x64_8_c: 47625.5 ( 1.00x) vvc_alf_classify_64x64_8_neon: 21071.9 ( 2.26x) vvc_alf_classify_64x64_10_c: 44576.3 ( 1.00x) vvc_alf_classify_64x64_10_neon: 21544.7 ( 2.07x) vvc_alf_classify_64x64_12_c: 44600.5 ( 1.00x) vvc_alf_classify_64x64_12_neon: 21491.2 ( 2.08x) vvc_alf_classify_128x128_8_c: 192143.3 ( 1.00x) vvc_alf_classify_128x128_8_neon: 82387.6 ( 2.33x) vvc_alf_classify_128x128_10_c: 177583.1 ( 1.00x) vvc_alf_classify_128x128_10_neon: 81628.8 ( 2.18x) vvc_alf_classify_128x128_12_c: 177582.2 ( 1.00x) vvc_alf_classify_128x128_12_neon: 81625.1 ( 2.18x)	2025-09-09 22:13:04 +01:00
Krzysztof Pyrkosz	de25cb4603	avcodec/aarch64/vvc: Optimize vvc_apply_bdof_block_8x Before and after: A53: apply_bdof_8_8x16_neon: 3320.5 ( 4.02x) apply_bdof_10_8x16_neon: 3317.8 ( 3.90x) apply_bdof_12_8x16_neon: 3303.6 ( 3.91x) apply_bdof_8_8x16_neon: 3168.1 ( 4.23x) apply_bdof_10_8x16_neon: 3127.8 ( 4.13x) apply_bdof_12_8x16_neon: 3119.3 ( 4.18x) A72: apply_bdof_8_8x16_neon: 1827.4 ( 5.02x) apply_bdof_10_8x16_neon: 1838.5 ( 4.89x) apply_bdof_12_8x16_neon: 1841.1 ( 4.83x) apply_bdof_8_8x16_neon: 1691.6 ( 5.46x) apply_bdof_10_8x16_neon: 1695.9 ( 5.23x) apply_bdof_12_8x16_neon: 1695.4 ( 5.29x) A78 apply_bdof_8_8x16_neon: 648.9 ( 7.43x) apply_bdof_10_8x16_neon: 646.1 ( 7.04x) apply_bdof_12_8x16_neon: 643.8 ( 7.04x) apply_bdof_8_8x16_neon: 603.2 ( 7.97x) apply_bdof_10_8x16_neon: 604.1 ( 7.52x) apply_bdof_12_8x16_neon: 604.5 ( 7.52x)	2025-09-09 16:37:28 +00:00
Krzysztof Pyrkosz	7b21bde34c	avcodec/aarch64/vvc: Implemented dmvr_h_10 A78: dmvr_h_10_12x20_neon: 82.2 ( 6.49x) dmvr_h_10_20x12_neon: 69.9 ( 3.66x) dmvr_h_10_20x20_neon: 112.5 ( 3.74x) dmvr_h_12_12x20_neon: 81.4 ( 6.51x) dmvr_h_12_20x12_neon: 69.2 ( 3.74x) dmvr_h_12_20x20_neon: 110.2 ( 3.85x) A72: dmvr_h_10_12x20_neon: 234.1 ( 4.67x) dmvr_h_10_20x12_neon: 221.4 ( 3.48x) dmvr_h_10_20x20_neon: 356.9 ( 3.59x) dmvr_h_12_12x20_neon: 234.1 ( 4.67x) dmvr_h_12_20x12_neon: 221.5 ( 3.53x) dmvr_h_12_20x20_neon: 357.0 ( 3.64x)	2025-09-08 17:51:20 +00:00
Krzysztof Pyrkosz	189e841cfd	avcodec/aarch64/vvc: Implement dmvr_h_8 A78: dmvr_h_8_12x20_neon: 76.6 ( 4.31x) dmvr_h_8_20x12_neon: 65.8 ( 3.49x) dmvr_h_8_20x20_neon: 106.6 ( 3.62x) A72: dmvr_h_8_12x20_neon: 190.6 ( 4.40x) dmvr_h_8_20x12_neon: 171.1 ( 4.31x) dmvr_h_8_20x20_neon: 275.1 ( 4.50x)	2025-09-08 17:51:20 +00:00
Krzysztof Pyrkosz	fb4407797e	Replace uxtl with umull in dmvr_hv_8 Before and after on A78: dmvr_hv_8_12x20_neon: 205.3 ( 5.21x) dmvr_hv_8_20x12_neon: 171.8 ( 3.15x) dmvr_hv_8_20x20_neon: 282.7 ( 3.11x) dmvr_hv_8_12x20_neon: 172.7 ( 5.58x) dmvr_hv_8_20x12_neon: 133.3 ( 3.36x) dmvr_hv_8_20x20_neon: 214.6 ( 3.40x)	2025-09-05 07:20:15 +00:00
Zhao Zhili	6ce02bcc3a	avcodec/aarch64/vvc: Optimize apply_bdof Before this patch, prof_grad_filter calculates gh[0], gh[1], gv[0], gv[1] and saves them to the stack. derive_bdof_vx_vy loads them from the stack and calculates gh[0] + gh[1], gv[0] + gv[1]. apply_bdof_min_block loads them from the stack and calculates gh[0] - gh[1], gv[0] - gv[1]. This patch adds bdof_grad_filter, which calculates gh[0] + gh[1], gh[0] - gh[1], gv[0] + gv[1], gv[0] - gv[1] and saves them to the stack, so derive_bdof_vx_vy and apply_bdof_min_block can use the results directly. prof_grad_filter is kept for reuse by other functions in the future. Benchmark on rpi5 with gcc 12 Before After -------------------------------------------------------------------- apply_bdof_8_8x16_c: \| 7431.4 ( 1.00x) \| 7371.7 ( 1.00x) apply_bdof_8_8x16_neon: \| 1175.4 ( 6.32x) \| 1036.3 ( 7.11x) apply_bdof_8_16x8_c: \| 7182.2 ( 1.00x) \| 7201.1 ( 1.00x) apply_bdof_8_16x8_neon: \| 1021.7 ( 7.03x) \| 879.9 ( 8.18x) apply_bdof_8_16x16_c: \| 14577.1 ( 1.00x) \| 14589.3 ( 1.00x) apply_bdof_8_16x16_neon: \| 2012.8 ( 7.24x) \| 1743.3 ( 8.37x) apply_bdof_10_8x16_c: \| 7292.4 ( 1.00x) \| 7308.5 ( 1.00x) apply_bdof_10_8x16_neon: \| 1156.3 ( 6.31x) \| 1045.3 ( 6.99x) apply_bdof_10_16x8_c: \| 7112.4 ( 1.00x) \| 7214.4 ( 1.00x) apply_bdof_10_16x8_neon: \| 1007.6 ( 7.06x) \| 904.8 ( 7.97x) apply_bdof_10_16x16_c: \| 14363.3 ( 1.00x) \| 14476.4 ( 1.00x) apply_bdof_10_16x16_neon: \| 1986.9 ( 7.23x) \| 1783.1 ( 8.12x) apply_bdof_12_8x16_c: \| 7433.3 ( 1.00x) \| 7374.7 ( 1.00x) apply_bdof_12_8x16_neon: \| 1155.9 ( 6.43x) \| 1040.8 ( 7.09x) apply_bdof_12_16x8_c: \| 7171.1 ( 1.00x) \| 7376.3 ( 1.00x) apply_bdof_12_16x8_neon: \| 1010.8 ( 7.09x) \| 899.4 ( 8.20x) apply_bdof_12_16x16_c: \| 14515.5 ( 1.00x) \| 14731.5 ( 1.00x) apply_bdof_12_16x16_neon: \| 1988.4 ( 7.30x) \| 1785.2 ( 8.25x)	2025-09-03 06:55:37 +00:00
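The restructuring described above amounts to storing the sums and differences of the two gradient arrays once, so the two later stages read ready-made values instead of each recomputing from the raw gradients. A scalar sketch of that idea follows; `grad_sum_diff` and its signature are hypothetical and do not mirror the NEON `bdof_grad_filter`.

```c
#include <stdint.h>

/* Hypothetical helper: writes g0 + g1 and g0 - g1 so that consumers that
 * need only the sum or only the difference can load it directly. */
static void grad_sum_diff(const int16_t *g0, const int16_t *g1,
                          int16_t *sum, int16_t *diff, int n)
{
    for (int i = 0; i < n; i++) {
        sum[i]  = (int16_t)(g0[i] + g1[i]);
        diff[i] = (int16_t)(g0[i] - g1[i]);
    }
}
```

The memory footprint is the same as storing `g0` and `g1` separately, but each consumer saves one load and one arithmetic operation per element.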
Zhao Zhili	2e92417603	avcodec/aarch64/vvc: Optimize derive_bdof_vx_vy Implement line tricks and pixel tricks. See comments in inter.S for details. Benchmark on rpi5 with gcc 12 Before After ----------------------------------------------------------------- apply_bdof_8_8x16_c: \| 7375.5 ( 1.00x) \| 7473.8 ( 1.00x) apply_bdof_8_8x16_neon: \| 1875.1 ( 3.93x) \| 1135.8 ( 6.58x) apply_bdof_8_16x8_c: \| 7273.9 ( 1.00x) \| 7204.0 ( 1.00x) apply_bdof_8_16x8_neon: \| 1738.2 ( 4.18x) \| 1013.0 ( 7.11x) apply_bdof_8_16x16_c: \| 14744.9 ( 1.00x) \| 14712.6 ( 1.00x) apply_bdof_8_16x16_neon: \| 3446.7 ( 4.28x) \| 1997.7 ( 7.36x) apply_bdof_10_8x16_c: \| 7352.4 ( 1.00x) \| 7485.7 ( 1.00x) apply_bdof_10_8x16_neon: \| 1861.0 ( 3.95x) \| 1134.1 ( 6.60x) apply_bdof_10_16x8_c: \| 7330.5 ( 1.00x) \| 7232.8 ( 1.00x) apply_bdof_10_16x8_neon: \| 1747.2 ( 4.20x) \| 1002.6 ( 7.21x) apply_bdof_10_16x16_c: \| 14522.4 ( 1.00x) \| 14664.8 ( 1.00x) apply_bdof_10_16x16_neon: \| 3490.5 ( 4.16x) \| 1978.4 ( 7.41x) apply_bdof_12_8x16_c: \| 7389.0 ( 1.00x) \| 7380.1 ( 1.00x) apply_bdof_12_8x16_neon: \| 1861.3 ( 3.97x) \| 1134.0 ( 6.51x) apply_bdof_12_16x8_c: \| 7283.1 ( 1.00x) \| 7336.9 ( 1.00x) apply_bdof_12_16x8_neon: \| 1749.1 ( 4.16x) \| 1002.3 ( 7.32x) apply_bdof_12_16x16_c: \| 14580.7 ( 1.00x) \| 14502.7 ( 1.00x) apply_bdof_12_16x16_neon: \| 3472.9 ( 4.20x) \| 1978.3 ( 7.33x) Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>	2025-09-03 06:55:37 +00:00
Zhao Zhili	39786f8cd5	aarch64/h26x: optimize sao_band_filter int8_t[] is enough for offset_table of 8-bit streams. On rpi5: Before After hevc_sao_band_8_8_c: 252.3 ( 1.00x) 252.3 ( 1.00x) hevc_sao_band_8_8_neon: 95.8 ( 2.63x) 61.0 ( 4.57x) hevc_sao_band_16_8_c: 875.2 ( 1.00x) 864.9 ( 1.00x) hevc_sao_band_16_8_neon: 317.5 ( 2.76x) 150.0 ( 6.26x) hevc_sao_band_32_8_c: 3853.5 ( 1.00x) 3871.6 ( 1.00x) hevc_sao_band_32_8_neon: 1222.3 ( 3.15x) 550.6 ( 7.39x) hevc_sao_band_48_8_c: 8203.6 ( 1.00x) 8182.6 ( 1.00x) hevc_sao_band_48_8_neon: 2685.7 ( 3.05x) 1185.8 ( 7.36x) hevc_sao_band_64_8_c: 14023.0 ( 1.00x) 14038.9 ( 1.00x) hevc_sao_band_64_8_neon: 4783.2 ( 2.93x) 2078.4 ( 7.15x) Reviewed-by: Martin Storsjö <martin@martin.st> Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>	2025-04-29 15:11:45 +08:00
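To illustrate why a byte-sized table suffices for 8-bit content, here is a scalar sketch of a SAO band filter that keeps its 32-entry offset table in `int8_t`. It assumes the usual band layout (the pixel value shifted right by 3 selects one of 32 bands, four consecutive bands carry an offset) and small offsets that fit in a signed byte. The function name and parameter list are hypothetical and are not the FFmpeg prototype.

```c
#include <stdint.h>

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical 8-bit band filter: offsets[] holds the four band offsets,
 * left_class is the index of the first band that receives one. */
static void sao_band_8bit(uint8_t *dst, const uint8_t *src, int n,
                          const int8_t offsets[4], int left_class)
{
    int8_t table[32] = { 0 };   /* 32 bytes instead of 32 int16_t entries */

    for (int k = 0; k < 4; k++)
        table[(k + left_class) & 31] = offsets[k];

    for (int i = 0; i < n; i++)
        dst[i] = clip_u8(src[i] + table[src[i] >> 3]);
}
```

A 32-byte table is small enough to be held in two 128-bit vector registers, which is one plausible reason the narrower type helps a SIMD implementation; the commit message itself does not spell out the mechanism.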
Krzysztof Pyrkosz	f9b8f30680	avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12} This patch replaces integer widening with halving addition, and multi-step "emulated" rounding shift with a single asm instruction doing exactly that. Benchmarks before and after: A78 avg_8_64x64_neon: 2686.2 ( 6.12x) avg_8_128x128_neon: 10734.2 ( 5.88x) avg_10_64x64_neon: 2536.8 ( 5.40x) avg_10_128x128_neon: 10079.0 ( 5.22x) avg_12_64x64_neon: 2548.2 ( 5.38x) avg_12_128x128_neon: 10133.8 ( 5.19x) avg_8_64x64_neon: 897.8 (18.26x) avg_8_128x128_neon: 3608.5 (17.37x) avg_10_32x32_neon: 444.2 ( 8.51x) avg_10_64x64_neon: 1711.8 ( 8.00x) avg_12_64x64_neon: 1706.2 ( 8.02x) avg_12_128x128_neon: 7010.0 ( 7.46x) A72 avg_8_64x64_neon: 5823.4 ( 3.88x) avg_8_128x128_neon: 17430.5 ( 4.73x) avg_10_64x64_neon: 5228.1 ( 3.71x) avg_10_128x128_neon: 16722.2 ( 4.17x) avg_12_64x64_neon: 5379.1 ( 3.51x) avg_12_128x128_neon: 16715.7 ( 4.17x) avg_8_64x64_neon: 2006.5 (10.61x) avg_8_128x128_neon: 9158.7 ( 8.96x) avg_10_64x64_neon: 3357.7 ( 5.60x) avg_10_128x128_neon: 12411.7 ( 5.56x) avg_12_64x64_neon: 3317.5 ( 5.67x) avg_12_128x128_neon: 12358.5 ( 5.58x) A53 avg_8_64x64_neon: 8327.8 ( 5.18x) avg_8_128x128_neon: 31631.3 ( 5.34x) avg_10_64x64_neon: 8783.5 ( 4.98x) avg_10_128x128_neon: 32617.0 ( 5.25x) avg_12_64x64_neon: 8686.0 ( 5.06x) avg_12_128x128_neon: 32487.5 ( 5.25x) avg_8_64x64_neon: 6032.3 ( 7.17x) avg_8_128x128_neon: 22008.5 ( 7.69x) avg_10_64x64_neon: 7738.0 ( 5.68x) avg_10_128x128_neon: 27813.8 ( 6.14x) avg_12_64x64_neon: 7844.5 ( 5.60x) avg_12_128x128_neon: 26999.5 ( 6.34x) Signed-off-by: Martin Storsjö <martin@martin.st>	2025-03-07 15:51:20 +02:00
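The substitution described above relies on an arithmetic identity: adding two values, adding a rounding constant and shifting by `shift` gives the same result as halving the sum first and then applying a rounding shift by `shift - 1`. The scalar model below checks that identity. The function names are hypothetical, the final pixel clip is omitted, and an arithmetic right shift is assumed for negative values.

```c
#include <assert.h>
#include <stdint.h>

/* Widening path: the sum of two int16_t values needs 17 bits, so it is
 * formed in 32 bits before the rounding constant and shift are applied. */
static int avg_widening(int16_t a, int16_t b, int shift)
{
    int32_t sum = (int32_t)a + b;
    return (sum + (1 << (shift - 1))) >> shift;
}

/* Halving path: (a + b) >> 1 always fits in 16 bits again, so no wider
 * lanes are needed; one rounding shift by shift - 1 finishes the job. */
static int avg_halving(int16_t a, int16_t b, int shift)
{
    assert(shift >= 2);
    int16_t half = (int16_t)(((int32_t)a + b) >> 1);
    return (half + (1 << (shift - 2))) >> (shift - 1);
}
```

Because the intermediate stays in 16-bit lanes, a vector implementation can process twice as many elements per register as the widening variant, which is consistent with the larger gains reported for the 8-bit functions.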
Krzysztof Pyrkosz	71a91485fa	avcodec/aarch64/vvc: Optimize NEON version of vvc_dmvr This patch replaces blocks of instructions performing rounding and widening shifts with one-liners achieving the same result. Before and after on A78 dmvr_8_12x20_neon: 86.2 ( 6.90x) dmvr_8_20x12_neon: 94.8 ( 5.93x) dmvr_8_20x20_neon: 141.5 ( 6.50x) dmvr_12_12x20_neon: 158.0 ( 3.76x) dmvr_12_20x12_neon: 151.2 ( 3.73x) dmvr_12_20x20_neon: 247.2 ( 3.71x) dmvr_hv_8_12x20_neon: 423.2 ( 3.75x) dmvr_hv_8_20x12_neon: 434.0 ( 3.69x) dmvr_hv_8_20x20_neon: 706.0 ( 3.69x) dmvr_8_12x20_neon: 77.2 ( 7.70x) dmvr_8_20x12_neon: 66.5 ( 8.49x) dmvr_8_20x20_neon: 92.2 ( 9.90x) dmvr_12_12x20_neon: 80.2 ( 7.38x) dmvr_12_20x12_neon: 58.2 ( 9.59x) dmvr_12_20x20_neon: 90.0 (10.15x) dmvr_hv_8_12x20_neon: 369.0 ( 4.34x) dmvr_hv_8_20x12_neon: 355.8 ( 4.49x) dmvr_hv_8_20x20_neon: 574.2 ( 4.51x) Signed-off-by: Martin Storsjö <martin@martin.st>	2025-03-04 10:35:31 +02:00
Zhao Zhili	952508ae05	aarch64/vvc: Add apply_bdof Test on rpi 5 with gcc 12: apply_bdof_8_8x16_c: 7315.2 ( 1.00x) apply_bdof_8_8x16_neon: 1876.8 ( 3.90x) apply_bdof_8_16x8_c: 7170.5 ( 1.00x) apply_bdof_8_16x8_neon: 1752.8 ( 4.09x) apply_bdof_8_16x16_c: 14695.2 ( 1.00x) apply_bdof_8_16x16_neon: 3490.5 ( 4.21x) apply_bdof_10_8x16_c: 7371.5 ( 1.00x) apply_bdof_10_8x16_neon: 1863.8 ( 3.96x) apply_bdof_10_16x8_c: 7172.0 ( 1.00x) apply_bdof_10_16x8_neon: 1766.0 ( 4.06x) apply_bdof_10_16x16_c: 14551.5 ( 1.00x) apply_bdof_10_16x16_neon: 3576.0 ( 4.07x) apply_bdof_12_8x16_c: 7236.5 ( 1.00x) apply_bdof_12_8x16_neon: 1863.8 ( 3.88x) apply_bdof_12_16x8_c: 7316.5 ( 1.00x) apply_bdof_12_16x8_neon: 1758.8 ( 4.16x) apply_bdof_12_16x16_c: 14691.2 ( 1.00x) apply_bdof_12_16x16_neon: 3480.5 ( 4.22x)	2024-12-21 11:54:44 +08:00
Martin Storsjö	2bb00ef59c	aarch64: vvc: Fix building the dmvr_hv assembly with older MSVC versions Explicitly use ldur for unaligned offsets; newer versions of armasm64 implicitly convert ldr to ldur as necessary, but older versions require it explicitly written out. This fixes these build errors: ffmpeg\libavcodec\aarch64\vvc\inter.o.asm(2039) : error A2518: operand 2: Memory offset must be aligned ldr s5, [x1, #1] ffmpeg\libavcodec\aarch64\vvc\inter.o.asm(2250) : error A2518: operand 2: Memory offset must be aligned ldr d7, [x1, #2] Signed-off-by: Martin Storsjö <martin@martin.st>	2024-12-18 13:45:09 +02:00
Zhao Zhili	40feba5f77	aarch64/vvc: Fix clip in alf Fix test failure: ./tests/checkasm/checkasm --test=vvc_alf 3607569773	2024-12-10 21:00:47 +08:00
Zhao Zhili	91436638de	aarch64/vvc: Use faster clip operation Replace sqxtn+smin+smax by sqxtun+umin.	2024-12-10 21:00:47 +08:00
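The instruction swap named above can be modelled in scalar C. `sqxtn` saturates a wider signed value to a signed narrow type, after which `smin` and `smax` clamp it to the pixel range; `sqxtun` saturates directly to an unsigned narrow type, which already clamps negatives to zero, so a single `umin` remains. The sketch assumes 32-bit inputs narrowed to 16 bits and a non-negative `max_pixel` that fits in `int16_t`; the helper names are hypothetical.

```c
#include <stdint.h>

/* Scalar model of sqxtn: saturate a signed 32-bit value to int16_t. */
static int16_t sat_s16(int32_t v)
{
    return v < INT16_MIN ? INT16_MIN : v > INT16_MAX ? INT16_MAX : (int16_t)v;
}

/* Scalar model of sqxtun: saturate a signed 32-bit value to uint16_t. */
static uint16_t sat_u16(int32_t v)
{
    return v < 0 ? 0 : v > UINT16_MAX ? UINT16_MAX : (uint16_t)v;
}

/* Three steps: signed narrow, min with max_pixel, max with 0. */
static uint16_t clip_sqxtn_smin_smax(int32_t v, int16_t max_pixel)
{
    int16_t n = sat_s16(v);
    n = n < max_pixel ? n : max_pixel;
    n = n > 0 ? n : 0;
    return (uint16_t)n;
}

/* Two steps: unsigned-saturating narrow, then unsigned min. */
static uint16_t clip_sqxtun_umin(int32_t v, int16_t max_pixel)
{
    uint16_t n = sat_u16(v);
    return n < (uint16_t)max_pixel ? n : (uint16_t)max_pixel;
}
```

Both functions return the same value for every 32-bit input under the stated assumption on `max_pixel`, so the shorter sequence saves one instruction per vector.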
Zhao Zhili	bfed5f6b7d	aarch64/vvc: Reuse ff_vvc_put_pel_pixels for chroma	2024-12-10 21:00:47 +08:00
Zhao Zhili	5988a2729b	aarch64/vvc: Add dmvr dmvr_8_12x20_c: 1.5 ( 1.00x) dmvr_8_12x20_neon: 0.2 ( 6.56x) dmvr_8_20x12_c: 1.0 ( 1.00x) dmvr_8_20x12_neon: 0.2 ( 4.33x) dmvr_8_20x20_c: 1.7 ( 1.00x) dmvr_8_20x20_neon: 0.5 ( 3.63x) dmvr_12_12x20_c: 2.2 ( 1.00x) dmvr_12_12x20_neon: 0.5 ( 4.68x) dmvr_12_20x12_c: 2.0 ( 1.00x) dmvr_12_20x12_neon: 0.5 ( 4.16x) dmvr_12_20x20_c: 3.7 ( 1.00x) dmvr_12_20x20_neon: 0.7 ( 5.14x) Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>	2024-10-01 10:28:54 +08:00
Zhao Zhili	bcd65ebd8f	aarch64/vvc: Add dmvr_hv dmvr_hv_8_12x20_c: 8.0 ( 1.00x) dmvr_hv_8_12x20_neon: 1.2 ( 6.62x) dmvr_hv_8_20x12_c: 8.0 ( 1.00x) dmvr_hv_8_20x12_neon: 0.9 ( 8.37x) dmvr_hv_8_20x20_c: 12.9 ( 1.00x) dmvr_hv_8_20x20_neon: 1.7 ( 7.62x) dmvr_hv_10_12x20_c: 7.0 ( 1.00x) dmvr_hv_10_12x20_neon: 1.7 ( 4.09x) dmvr_hv_10_20x12_c: 7.0 ( 1.00x) dmvr_hv_10_20x12_neon: 1.7 ( 4.09x) dmvr_hv_10_20x20_c: 11.2 ( 1.00x) dmvr_hv_10_20x20_neon: 2.7 ( 4.15x) dmvr_hv_12_12x20_c: 6.5 ( 1.00x) dmvr_hv_12_12x20_neon: 1.7 ( 3.79x) dmvr_hv_12_20x12_c: 6.5 ( 1.00x) dmvr_hv_12_20x12_neon: 1.7 ( 3.79x) dmvr_hv_12_20x20_c: 10.2 ( 1.00x) dmvr_hv_12_20x20_neon: 2.2 ( 4.64x) Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>	2024-10-01 10:28:54 +08:00
Zhao Zhili	0ba9e8d0d4	aarch64/vvc: Add w_avg w_avg_8_2x2_c: 0.0 ( 0.00x) w_avg_8_2x2_neon: 0.0 ( 0.00x) w_avg_8_4x4_c: 0.2 ( 1.00x) w_avg_8_4x4_neon: 0.0 ( 0.00x) w_avg_8_8x8_c: 1.2 ( 1.00x) w_avg_8_8x8_neon: 0.2 ( 5.00x) w_avg_8_16x16_c: 4.2 ( 1.00x) w_avg_8_16x16_neon: 0.8 ( 5.67x) w_avg_8_32x32_c: 16.2 ( 1.00x) w_avg_8_32x32_neon: 2.5 ( 6.50x) w_avg_8_64x64_c: 64.5 ( 1.00x) w_avg_8_64x64_neon: 9.0 ( 7.17x) w_avg_8_128x128_c: 269.5 ( 1.00x) w_avg_8_128x128_neon: 35.5 ( 7.59x) w_avg_10_2x2_c: 0.2 ( 1.00x) w_avg_10_2x2_neon: 0.2 ( 1.00x) w_avg_10_4x4_c: 0.2 ( 1.00x) w_avg_10_4x4_neon: 0.2 ( 1.00x) w_avg_10_8x8_c: 1.0 ( 1.00x) w_avg_10_8x8_neon: 0.2 ( 4.00x) w_avg_10_16x16_c: 4.2 ( 1.00x) w_avg_10_16x16_neon: 0.8 ( 5.67x) w_avg_10_32x32_c: 16.2 ( 1.00x) w_avg_10_32x32_neon: 2.5 ( 6.50x) w_avg_10_64x64_c: 66.2 ( 1.00x) w_avg_10_64x64_neon: 10.0 ( 6.62x) w_avg_10_128x128_c: 277.8 ( 1.00x) w_avg_10_128x128_neon: 39.8 ( 6.99x) w_avg_12_2x2_c: 0.0 ( 0.00x) w_avg_12_2x2_neon: 0.2 ( 0.00x) w_avg_12_4x4_c: 0.2 ( 1.00x) w_avg_12_4x4_neon: 0.0 ( 0.00x) w_avg_12_8x8_c: 1.2 ( 1.00x) w_avg_12_8x8_neon: 0.5 ( 2.50x) w_avg_12_16x16_c: 4.8 ( 1.00x) w_avg_12_16x16_neon: 0.8 ( 6.33x) w_avg_12_32x32_c: 17.0 ( 1.00x) w_avg_12_32x32_neon: 2.8 ( 6.18x) w_avg_12_64x64_c: 64.0 ( 1.00x) w_avg_12_64x64_neon: 10.0 ( 6.40x) w_avg_12_128x128_c: 269.2 ( 1.00x) w_avg_12_128x128_neon: 42.0 ( 6.41x) Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>	2024-10-01 10:28:54 +08:00
Zhao Zhili	3f84d1d1fb	aarch64/vvc: Add avg avg_8_2x2_c: 0.2 ( 1.00x) avg_8_2x2_neon: 0.2 ( 1.00x) avg_8_4x4_c: 0.2 ( 1.00x) avg_8_4x4_neon: 0.2 ( 1.00x) avg_8_8x8_c: 0.9 ( 1.00x) avg_8_8x8_neon: 0.2 ( 5.29x) avg_8_16x16_c: 3.7 ( 1.00x) avg_8_16x16_neon: 0.7 ( 5.44x) avg_8_32x32_c: 14.9 ( 1.00x) avg_8_32x32_neon: 1.7 ( 8.91x) avg_8_64x64_c: 59.7 ( 1.00x) avg_8_64x64_neon: 6.9 ( 8.62x) avg_8_128x128_c: 254.7 ( 1.00x) avg_8_128x128_neon: 26.9 ( 9.46x) avg_10_2x2_c: 0.2 ( 1.00x) avg_10_2x2_neon: 0.2 ( 1.00x) avg_10_4x4_c: 0.2 ( 1.00x) avg_10_4x4_neon: 0.2 ( 1.00x) avg_10_8x8_c: 0.9 ( 1.00x) avg_10_8x8_neon: 0.2 ( 5.29x) avg_10_16x16_c: 3.4 ( 1.00x) avg_10_16x16_neon: 0.4 ( 8.06x) avg_10_32x32_c: 13.9 ( 1.00x) avg_10_32x32_neon: 1.9 ( 7.23x) avg_10_64x64_c: 54.2 ( 1.00x) avg_10_64x64_neon: 8.4 ( 6.43x) avg_10_128x128_c: 232.4 ( 1.00x) avg_10_128x128_neon: 30.9 ( 7.52x) avg_12_2x2_c: 0.0 ( 0.00x) avg_12_2x2_neon: 0.2 ( 0.00x) avg_12_4x4_c: 0.4 ( 1.00x) avg_12_4x4_neon: 0.2 ( 2.43x) avg_12_8x8_c: 0.7 ( 1.00x) avg_12_8x8_neon: 0.2 ( 3.86x) avg_12_16x16_c: 3.7 ( 1.00x) avg_12_16x16_neon: 0.4 ( 8.65x) avg_12_32x32_c: 13.7 ( 1.00x) avg_12_32x32_neon: 2.2 ( 6.29x) avg_12_64x64_c: 53.9 ( 1.00x) avg_12_64x64_neon: 7.7 ( 7.03x) avg_12_128x128_c: 270.9 ( 1.00x) avg_12_128x128_neon: 30.4 ( 8.90x)	2024-09-14 16:36:34 +08:00
Zhao Zhili	1be5a2374f	aarch64/vvc: Add put_epel_hv On Apple M1: put_chroma_hv_8_4x4_c: 1.7 ( 1.00x) put_chroma_hv_8_4x4_neon: 0.2 ( 7.67x) put_chroma_hv_8_8x8_c: 5.5 ( 1.00x) put_chroma_hv_8_8x8_neon: 0.5 (11.53x) put_chroma_hv_8_16x16_c: 18.5 ( 1.00x) put_chroma_hv_8_16x16_neon: 1.5 (12.53x) put_chroma_hv_8_32x32_c: 72.5 ( 1.00x) put_chroma_hv_8_32x32_neon: 4.7 (15.34x) put_chroma_hv_8_64x64_c: 274.0 ( 1.00x) put_chroma_hv_8_64x64_neon: 18.5 (14.83x) put_chroma_hv_8_128x128_c: 1058.7 ( 1.00x) put_chroma_hv_8_128x128_neon: 75.2 (14.07x) On Android Pixel 8 Pro: put_chroma_hv_8_4x4_c: 1.2 ( 1.00x) put_chroma_hv_8_4x4_neon: 0.0 ( 0.00x) put_chroma_hv_8_4x4_i8mm: 0.2 ( 5.00x) put_chroma_hv_8_8x8_c: 4.0 ( 1.00x) put_chroma_hv_8_8x8_neon: 0.5 ( 8.00x) put_chroma_hv_8_8x8_i8mm: 0.5 ( 8.00x) put_chroma_hv_8_16x16_c: 15.2 ( 1.00x) put_chroma_hv_8_16x16_neon: 2.5 ( 6.10x) put_chroma_hv_8_16x16_i8mm: 2.2 ( 6.78x) put_chroma_hv_8_32x32_c: 61.0 ( 1.00x) put_chroma_hv_8_32x32_neon: 9.8 ( 6.26x) put_chroma_hv_8_32x32_i8mm: 8.5 ( 7.18x) put_chroma_hv_8_64x64_c: 229.5 ( 1.00x) put_chroma_hv_8_64x64_neon: 38.5 ( 5.96x) put_chroma_hv_8_64x64_i8mm: 34.0 ( 6.75x) put_chroma_hv_8_128x128_c: 919.8 ( 1.00x) put_chroma_hv_8_128x128_neon: 154.5 ( 5.95x) put_chroma_hv_8_128x128_i8mm: 140.0 ( 6.57x)	2024-09-14 16:36:34 +08:00
Zhao Zhili	0dcf204e5d	aarch64/vvc: Add put_epel_h i8mm put_chroma_h_8_4x4_c: 0.4 ( 1.00x) put_chroma_h_8_4x4_neon: 0.0 ( 0.00x) put_chroma_h_8_4x4_i8mm: 0.1 ( 2.67x) put_chroma_h_8_8x8_c: 1.6 ( 1.00x) put_chroma_h_8_8x8_neon: 0.1 (11.00x) put_chroma_h_8_8x8_i8mm: 0.1 (11.00x) put_chroma_h_8_16x16_c: 6.9 ( 1.00x) put_chroma_h_8_16x16_neon: 1.1 ( 6.00x) put_chroma_h_8_16x16_i8mm: 0.7 (10.62x) put_chroma_h_8_32x32_c: 27.6 ( 1.00x) put_chroma_h_8_32x32_neon: 4.7 ( 5.95x) put_chroma_h_8_32x32_i8mm: 4.4 ( 6.28x) put_chroma_h_8_64x64_c: 116.2 ( 1.00x) put_chroma_h_8_64x64_neon: 19.1 ( 6.07x) put_chroma_h_8_64x64_i8mm: 17.1 ( 6.77x) put_chroma_h_8_128x128_c: 466.6 ( 1.00x) put_chroma_h_8_128x128_neon: 81.4 ( 5.73x) put_chroma_h_8_128x128_i8mm: 71.7 ( 6.51x)	2024-09-14 16:36:34 +08:00
Zhao Zhili	41a1885f7a	aarch64/vvc: Add put_epel_h put_chroma_h_8_4x4_c: 0.2 ( 1.00x) put_chroma_h_8_4x4_neon: 0.2 ( 1.00x) put_chroma_h_8_8x8_c: 0.8 ( 1.00x) put_chroma_h_8_8x8_neon: 0.2 ( 3.00x) put_chroma_h_8_16x16_c: 3.8 ( 1.00x) put_chroma_h_8_16x16_neon: 0.8 ( 5.00x) put_chroma_h_8_32x32_c: 12.5 ( 1.00x) put_chroma_h_8_32x32_neon: 2.2 ( 5.56x) put_chroma_h_8_64x64_c: 47.0 ( 1.00x) put_chroma_h_8_64x64_neon: 8.8 ( 5.37x) put_chroma_h_8_128x128_c: 200.2 ( 1.00x) put_chroma_h_8_128x128_neon: 31.8 ( 6.31x)	2024-09-14 16:36:34 +08:00
Zhao Zhili	260e1b4b62	aarch64/vvc: Add sad sad_8x16_c: 0.8 ( 1.00x) sad_8x16_neon: 0.2 ( 3.00x) sad_16x8_c: 0.5 ( 1.00x) sad_16x8_neon: 0.2 ( 2.00x) sad_16x16_c: 1.5 ( 1.00x) sad_16x16_neon: 0.2 ( 6.00x)	2024-09-14 16:36:34 +08:00
Zhao Zhili	5ac6925803	aarch64/vvc: Add put_qpel_hv With Apple M1 (no i8mm): put_luma_hv_8_4x4_c: 2.2 ( 1.00x) put_luma_hv_8_4x4_neon: 0.8 ( 3.00x) put_luma_hv_8_8x8_c: 7.0 ( 1.00x) put_luma_hv_8_8x8_neon: 0.8 ( 9.33x) put_luma_hv_8_16x16_c: 22.8 ( 1.00x) put_luma_hv_8_16x16_neon: 2.5 ( 9.10x) put_luma_hv_8_32x32_c: 84.8 ( 1.00x) put_luma_hv_8_32x32_neon: 9.5 ( 8.92x) put_luma_hv_8_64x64_c: 333.0 ( 1.00x) put_luma_hv_8_64x64_neon: 35.5 ( 9.38x) put_luma_hv_8_128x128_c: 1294.5 ( 1.00x) put_luma_hv_8_128x128_neon: 137.8 ( 9.40x) With Pixel 8 Pro: put_luma_hv_8_4x4_c: 5.0 ( 1.00x) put_luma_hv_8_4x4_neon: 0.8 ( 6.67x) put_luma_hv_8_4x4_i8mm: 0.2 (20.00x) put_luma_hv_8_8x8_c: 13.2 ( 1.00x) put_luma_hv_8_8x8_neon: 1.2 (10.60x) put_luma_hv_8_8x8_i8mm: 1.2 (10.60x) put_luma_hv_8_16x16_c: 44.2 ( 1.00x) put_luma_hv_8_16x16_neon: 4.5 ( 9.83x) put_luma_hv_8_16x16_i8mm: 4.2 (10.41x) put_luma_hv_8_32x32_c: 160.8 ( 1.00x) put_luma_hv_8_32x32_neon: 17.5 ( 9.19x) put_luma_hv_8_32x32_i8mm: 16.0 (10.05x) put_luma_hv_8_64x64_c: 611.2 ( 1.00x) put_luma_hv_8_64x64_neon: 68.0 ( 8.99x) put_luma_hv_8_64x64_i8mm: 62.2 ( 9.82x) put_luma_hv_8_128x128_c: 2384.8 ( 1.00x) put_luma_hv_8_128x128_neon: 268.8 ( 8.87x) put_luma_hv_8_128x128_i8mm: 245.8 ( 9.70x)	2024-09-14 16:36:34 +08:00
Zhao Zhili	a0b52afd32	aarch64/vvc: Add put_qpel_vx put_luma_v_8_4x4_c: 1.0 ( 1.00x) put_luma_v_8_4x4_neon: 0.0 ( 0.00x) put_luma_v_8_8x8_c: 3.5 ( 1.00x) put_luma_v_8_8x8_neon: 0.5 ( 7.00x) put_luma_v_8_16x16_c: 13.8 ( 1.00x) put_luma_v_8_16x16_neon: 1.2 (11.00x) put_luma_v_8_32x32_c: 54.2 ( 1.00x) put_luma_v_8_32x32_neon: 5.0 (10.85x) put_luma_v_8_64x64_c: 217.5 ( 1.00x) put_luma_v_8_64x64_neon: 18.8 (11.60x) put_luma_v_8_128x128_c: 886.2 ( 1.00x) put_luma_v_8_128x128_neon: 74.0 (11.98x)	2024-09-14 16:36:34 +08:00
Zhao Zhili	9f6c8eb412	aarch64/vvc: Add put_qpel_hx i8mm Benchmark on Android pixel 8 with -fno-vectorize put_luma_h_8_4x4_c: 0.2 ( 1.00x) put_luma_h_8_4x4_neon: 0.2 ( 1.00x) put_luma_h_8_4x4_i8mm: 0.0 ( 0.00x) put_luma_h_8_8x8_c: 1.5 ( 1.00x) put_luma_h_8_8x8_neon: 0.5 ( 3.00x) put_luma_h_8_8x8_i8mm: 0.5 ( 3.00x) put_luma_h_8_16x16_c: 6.2 ( 1.00x) put_luma_h_8_16x16_neon: 2.0 ( 3.12x) put_luma_h_8_16x16_i8mm: 1.5 ( 4.17x) put_luma_h_8_32x32_c: 25.5 ( 1.00x) put_luma_h_8_32x32_neon: 9.0 ( 2.83x) put_luma_h_8_32x32_i8mm: 6.8 ( 3.78x) put_luma_h_8_64x64_c: 99.8 ( 1.00x) put_luma_h_8_64x64_neon: 35.2 ( 2.83x) put_luma_h_8_64x64_i8mm: 27.2 ( 3.66x) put_luma_h_8_128x128_c: 422.0 ( 1.00x) put_luma_h_8_128x128_neon: 138.5 ( 3.05x) put_luma_h_8_128x128_i8mm: 109.2 ( 3.86x)	2024-09-14 16:36:34 +08:00
Zhao Zhili	25448d1716	aarch64/vvc: Add put_pel/put_pel_uni/put_pel_uni_w put_luma_pixels_8_4x4_c: 0.2 ( 1.00x) put_luma_pixels_8_4x4_neon: 0.2 ( 1.00x) put_luma_pixels_8_8x8_c: 0.7 ( 1.00x) put_luma_pixels_8_8x8_neon: 0.2 ( 3.22x) put_luma_pixels_8_16x16_c: 2.2 ( 1.00x) put_luma_pixels_8_16x16_neon: 0.2 ( 9.89x) put_luma_pixels_8_32x32_c: 8.2 ( 1.00x) put_luma_pixels_8_32x32_neon: 1.2 ( 6.71x) put_luma_pixels_8_64x64_c: 33.7 ( 1.00x) put_luma_pixels_8_64x64_neon: 2.5 (13.63x) put_luma_pixels_8_128x128_c: 145.5 ( 1.00x) put_luma_pixels_8_128x128_neon: 10.2 (14.23x) put_uni_pixels_luma_8_4x4_c: 0.5 ( 1.00x) put_uni_pixels_luma_8_4x4_neon: 0.0 ( 0.00x) put_uni_pixels_luma_8_8x8_c: 0.5 ( 1.00x) put_uni_pixels_luma_8_8x8_neon: 0.2 ( 2.11x) put_uni_pixels_luma_8_16x16_c: 1.2 ( 1.00x) put_uni_pixels_luma_8_16x16_neon: 0.2 ( 5.44x) put_uni_pixels_luma_8_32x32_c: 3.0 ( 1.00x) put_uni_pixels_luma_8_32x32_neon: 0.5 ( 6.26x) put_uni_pixels_luma_8_64x64_c: 3.0 ( 1.00x) put_uni_pixels_luma_8_64x64_neon: 1.7 ( 1.72x) put_uni_pixels_luma_8_128x128_c: 6.5 ( 1.00x) put_uni_pixels_luma_8_128x128_neon: 6.5 ( 1.00x)	2024-09-14 16:36:34 +08:00
Zhao Zhili	20f2bf5530	aarch64/vvc: Add put_qpel_h_* and put_qpel_uni_h_* Just share hevc implementation. checkasm --test=vvc_mc --benchmark: put_luma_h_8_4x4_c: 0.2 ( 1.00x) put_luma_h_8_4x4_neon: 0.2 ( 1.00x) put_luma_h_8_8x8_c: 1.0 ( 1.00x) put_luma_h_8_8x8_neon: 0.2 ( 4.33x) put_luma_h_8_16x16_c: 3.2 ( 1.00x) put_luma_h_8_16x16_neon: 1.2 ( 2.63x) put_luma_h_8_32x32_c: 13.7 ( 1.00x) put_luma_h_8_32x32_neon: 4.0 ( 3.45x) put_luma_h_8_64x64_c: 48.2 ( 1.00x) put_luma_h_8_64x64_neon: 15.7 ( 3.07x) put_luma_h_8_128x128_c: 203.5 ( 1.00x) put_luma_h_8_128x128_neon: 62.0 ( 3.28x) put_uni_h_luma_8_4x4_c: 0.2 ( 1.00x) put_uni_h_luma_8_4x4_neon: 0.2 ( 1.00x) put_uni_h_luma_8_8x8_c: 1.5 ( 1.00x) put_uni_h_luma_8_8x8_neon: 0.2 ( 6.56x) put_uni_h_luma_8_16x16_c: 5.7 ( 1.00x) put_uni_h_luma_8_16x16_neon: 1.2 ( 4.67x) put_uni_h_luma_8_32x32_c: 24.0 ( 1.00x) put_uni_h_luma_8_32x32_neon: 4.7 ( 5.07x) put_uni_h_luma_8_64x64_c: 90.0 ( 1.00x) put_uni_h_luma_8_64x64_neon: 17.0 ( 5.30x) put_uni_h_luma_8_128x128_c: 357.7 ( 1.00x) put_uni_h_luma_8_128x128_neon: 67.5 ( 5.30x)	2024-09-14 16:36:34 +08:00
Zhao Zhili	4c0372281b	aarch64/vvc: Bind h26x/sao filter implementation to vvc Reviewed-by: Martin Storsjö <martin@martin.st>	2024-08-31 16:07:50 +08:00
Martin Storsjö	4acb9b7d10	aarch64: vvc: Fix unnecessary extra spaces Signed-off-by: Martin Storsjö <martin@martin.st>	2024-07-23 16:04:28 +03:00
Martin Storsjö	99598629e8	aarch64: vvc: Consistently use # for immediate constants Signed-off-by: Martin Storsjö <martin@martin.st>	2024-07-23 15:24:37 +03:00
Martin Storsjö	400843151d	aarch64: vvc: Fix compilation of alf.S with MSVC 2022 17.7 and older Use the "ldur" instruction explicitly, instead of having the assembler implicitly convert "ldr" instructions to "ldur". This fixes build errors like these: libavcodec\aarch64\vvc\alf.o.asm(1023) : error A2518: operand 2: Memory offset must be aligned ldr q22, [x3, #24] libavcodec\aarch64\vvc\alf.o.asm(1024) : error A2518: operand 2: Memory offset must be aligned ldr q24, [x2, #24] libavcodec\aarch64\vvc\alf.o.asm(1393) : error A2518: operand 2: Memory offset must be aligned ldr q22, [x3, #24] libavcodec\aarch64\vvc\alf.o.asm(1394) : error A2518: operand 2: Memory offset must be aligned ldr q24, [x2, #24] Signed-off-by: Martin Storsjö <martin@martin.st>	2024-07-23 15:24:33 +03:00
Zhao Zhili	2d4ef304c9	avcodec/vvc: Add aarch64 neon optimization for ALF vvc_alf_filter_chroma_4x4_8_c: 3.0 vvc_alf_filter_chroma_4x4_8_neon: 1.0 vvc_alf_filter_chroma_4x4_10_c: 2.7 vvc_alf_filter_chroma_4x4_10_neon: 1.0 vvc_alf_filter_chroma_4x4_12_c: 2.7 vvc_alf_filter_chroma_4x4_12_neon: 1.0 vvc_alf_filter_chroma_8x8_8_c: 10.2 vvc_alf_filter_chroma_8x8_8_neon: 3.0 vvc_alf_filter_chroma_8x8_10_c: 10.0 vvc_alf_filter_chroma_8x8_10_neon: 2.5 vvc_alf_filter_chroma_8x8_12_c: 10.0 vvc_alf_filter_chroma_8x8_12_neon: 2.5 vvc_alf_filter_chroma_16x16_8_c: 41.7 vvc_alf_filter_chroma_16x16_8_neon: 11.2 vvc_alf_filter_chroma_16x16_10_c: 39.0 vvc_alf_filter_chroma_16x16_10_neon: 10.0 vvc_alf_filter_chroma_16x16_12_c: 40.2 vvc_alf_filter_chroma_16x16_12_neon: 10.2 vvc_alf_filter_chroma_32x32_8_c: 162.0 vvc_alf_filter_chroma_32x32_8_neon: 45.0 vvc_alf_filter_chroma_32x32_10_c: 155.5 vvc_alf_filter_chroma_32x32_10_neon: 39.5 vvc_alf_filter_chroma_32x32_12_c: 155.5 vvc_alf_filter_chroma_32x32_12_neon: 40.0 vvc_alf_filter_chroma_64x64_8_c: 646.0 vvc_alf_filter_chroma_64x64_8_neon: 175.5 vvc_alf_filter_chroma_64x64_10_c: 708.2 vvc_alf_filter_chroma_64x64_10_neon: 166.7 vvc_alf_filter_chroma_64x64_12_c: 619.2 vvc_alf_filter_chroma_64x64_12_neon: 157.2 vvc_alf_filter_chroma_128x128_8_c: 2611.5 vvc_alf_filter_chroma_128x128_8_neon: 698.2 vvc_alf_filter_chroma_128x128_10_c: 2470.0 vvc_alf_filter_chroma_128x128_10_neon: 616.0 vvc_alf_filter_chroma_128x128_12_c: 2531.5 vvc_alf_filter_chroma_128x128_12_neon: 620.2 vvc_alf_filter_luma_8x8_8_c: 25.2 vvc_alf_filter_luma_8x8_8_neon: 4.2 vvc_alf_filter_luma_8x8_10_c: 18.5 vvc_alf_filter_luma_8x8_10_neon: 4.0 vvc_alf_filter_luma_8x8_12_c: 19.0 vvc_alf_filter_luma_8x8_12_neon: 4.0 vvc_alf_filter_luma_16x16_8_c: 106.5 vvc_alf_filter_luma_16x16_8_neon: 16.2 vvc_alf_filter_luma_16x16_10_c: 75.2 vvc_alf_filter_luma_16x16_10_neon: 14.7 vvc_alf_filter_luma_16x16_12_c: 79.7 vvc_alf_filter_luma_16x16_12_neon: 14.7 vvc_alf_filter_luma_32x32_8_c: 400.5 vvc_alf_filter_luma_32x32_8_neon: 63.2 vvc_alf_filter_luma_32x32_10_c: 299.2 vvc_alf_filter_luma_32x32_10_neon: 57.7 vvc_alf_filter_luma_32x32_12_c: 299.2 vvc_alf_filter_luma_32x32_12_neon: 57.7 vvc_alf_filter_luma_64x64_8_c: 1602.5 vvc_alf_filter_luma_64x64_8_neon: 251.7 vvc_alf_filter_luma_64x64_10_c: 1197.0 vvc_alf_filter_luma_64x64_10_neon: 235.5 vvc_alf_filter_luma_64x64_12_c: 1220.2 vvc_alf_filter_luma_64x64_12_neon: 235.7 vvc_alf_filter_luma_128x128_8_c: 6570.2 vvc_alf_filter_luma_128x128_8_neon: 1007.7 vvc_alf_filter_luma_128x128_10_c: 4822.7 vvc_alf_filter_luma_128x128_10_neon: 936.2 vvc_alf_filter_luma_128x128_12_c: 4791.2 vvc_alf_filter_luma_128x128_12_neon: 938.5 Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>	2024-07-22 21:09:56 +08:00

42 Commits