Martin Storsjö
74cfcd1c69
aarch64/vvc: Fix DCE undefined references with MSVC
...
This fixes compiling with MSVC for aarch64 after
510999f6b0.
While MSVC does do dead code elimination for function references
within e.g. "if (0)", it doesn't do that for functions referenced
within a static function, even if that static function itself ends
up not used.
A reproduction example:
    void missing(void);
    void (*func_ptr)(void);
    static void wrapper(void) {
        missing();
    }
    void init(int cpu_flags) {
        if (0) {
            func_ptr = wrapper;
        }
    }
If "wrapper" is entirely unreferenced, then MSVC doesn't produce
any reference to the symbol "missing". Also, if we do
"func_ptr = missing;" then the reference to missing is also
eliminated. But when the function is referenced from within a
static function, MSVC keeps the reference to the symbol, even if
the reference to the static function itself can be eliminated.
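One way to avoid relying on the compiler here is to guard the static wrapper with the preprocessor, so the call is never compiled when the external symbol is unavailable. The following is a minimal sketch under that assumption; HAVE_MISSING is a hypothetical configuration macro, and this is not necessarily how the commit itself resolves the issue.

```c
#include <stddef.h>

/* Hypothetical config macro: 1 when the object defining missing() is linked in. */
#ifndef HAVE_MISSING
#define HAVE_MISSING 0
#endif

void (*func_ptr)(void);

#if HAVE_MISSING
void missing(void);

/* Compiled only when missing() exists, so no compiler ever sees the call otherwise. */
static void wrapper(void)
{
    missing();
}
#endif

void init(int cpu_flags)
{
    (void)cpu_flags; /* unused in this reduced example */
#if HAVE_MISSING
    func_ptr = wrapper;
#endif
}
```

With HAVE_MISSING left at 0, the translation unit contains no reference to "missing" at all, regardless of how aggressively the compiler eliminates dead code.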
2026-03-05 11:57:40 +02:00
Georgii Zagoruiko
510999f6b0
aarch64/vvc: sme2 optimisation of alf_filter_luma() 8/10/12 bit
...
Apple M4:
vvc_alf_filter_luma_8x8_8_c: 347.3 ( 1.00x)
vvc_alf_filter_luma_8x8_8_neon: 138.7 ( 2.50x)
vvc_alf_filter_luma_8x8_8_sme2: 134.5 ( 2.58x)
vvc_alf_filter_luma_8x8_10_c: 299.8 ( 1.00x)
vvc_alf_filter_luma_8x8_10_neon: 129.8 ( 2.31x)
vvc_alf_filter_luma_8x8_10_sme2: 128.6 ( 2.33x)
vvc_alf_filter_luma_8x8_12_c: 293.0 ( 1.00x)
vvc_alf_filter_luma_8x8_12_neon: 126.8 ( 2.31x)
vvc_alf_filter_luma_8x8_12_sme2: 126.3 ( 2.32x)
vvc_alf_filter_luma_16x16_8_c: 1386.1 ( 1.00x)
vvc_alf_filter_luma_16x16_8_neon: 560.3 ( 2.47x)
vvc_alf_filter_luma_16x16_8_sme2: 540.1 ( 2.57x)
vvc_alf_filter_luma_16x16_10_c: 1200.3 ( 1.00x)
vvc_alf_filter_luma_16x16_10_neon: 515.6 ( 2.33x)
vvc_alf_filter_luma_16x16_10_sme2: 531.3 ( 2.26x)
vvc_alf_filter_luma_16x16_12_c: 1223.8 ( 1.00x)
vvc_alf_filter_luma_16x16_12_neon: 510.7 ( 2.40x)
vvc_alf_filter_luma_16x16_12_sme2: 524.9 ( 2.33x)
vvc_alf_filter_luma_32x32_8_c: 5488.8 ( 1.00x)
vvc_alf_filter_luma_32x32_8_neon: 2233.4 ( 2.46x)
vvc_alf_filter_luma_32x32_8_sme2: 1093.6 ( 5.02x)
vvc_alf_filter_luma_32x32_10_c: 4738.0 ( 1.00x)
vvc_alf_filter_luma_32x32_10_neon: 2057.5 ( 2.30x)
vvc_alf_filter_luma_32x32_10_sme2: 1053.6 ( 4.50x)
vvc_alf_filter_luma_32x32_12_c: 4808.3 ( 1.00x)
vvc_alf_filter_luma_32x32_12_neon: 1981.2 ( 2.43x)
vvc_alf_filter_luma_32x32_12_sme2: 1047.7 ( 4.59x)
vvc_alf_filter_luma_64x64_8_c: 22116.8 ( 1.00x)
vvc_alf_filter_luma_64x64_8_neon: 8951.0 ( 2.47x)
vvc_alf_filter_luma_64x64_8_sme2: 4225.2 ( 5.23x)
vvc_alf_filter_luma_64x64_10_c: 19072.8 ( 1.00x)
vvc_alf_filter_luma_64x64_10_neon: 8448.1 ( 2.26x)
vvc_alf_filter_luma_64x64_10_sme2: 4225.8 ( 4.51x)
vvc_alf_filter_luma_64x64_12_c: 19312.6 ( 1.00x)
vvc_alf_filter_luma_64x64_12_neon: 8270.9 ( 2.34x)
vvc_alf_filter_luma_64x64_12_sme2: 4245.4 ( 4.55x)
vvc_alf_filter_luma_128x128_8_c: 88530.5 ( 1.00x)
vvc_alf_filter_luma_128x128_8_neon: 35686.3 ( 2.48x)
vvc_alf_filter_luma_128x128_8_sme2: 16961.2 ( 5.22x)
vvc_alf_filter_luma_128x128_10_c: 76904.9 ( 1.00x)
vvc_alf_filter_luma_128x128_10_neon: 32439.5 ( 2.37x)
vvc_alf_filter_luma_128x128_10_sme2: 16845.6 ( 4.57x)
vvc_alf_filter_luma_128x128_12_c: 77363.3 ( 1.00x)
vvc_alf_filter_luma_128x128_12_neon: 32907.5 ( 2.35x)
vvc_alf_filter_luma_128x128_12_sme2: 17018.1 ( 4.55x)
2026-03-04 23:52:58 +02:00
Georgii Zagoruiko
90431417cb
aarch64/vvc: Optimisations of put_luma_hv() functions for 10/12-bit
...
Apple M2:
put_luma_hv_10_4x4_c: 36.3 ( 1.00x)
put_luma_hv_10_8x8_c: 82.9 ( 1.00x)
put_luma_hv_10_8x8_neon: 34.9 ( 2.37x)
put_luma_hv_10_16x16_c: 239.2 ( 1.00x)
put_luma_hv_10_16x16_neon: 119.0 ( 2.01x)
put_luma_hv_10_32x32_c: 900.3 ( 1.00x)
put_luma_hv_10_32x32_neon: 429.3 ( 2.10x)
put_luma_hv_10_64x64_c: 2984.7 ( 1.00x)
put_luma_hv_10_64x64_neon: 1736.2 ( 1.72x)
put_luma_hv_10_128x128_c: 11194.2 ( 1.00x)
put_luma_hv_10_128x128_neon: 6357.3 ( 1.76x)
put_luma_hv_12_4x4_c: 35.9 ( 1.00x)
put_luma_hv_12_8x8_c: 82.6 ( 1.00x)
put_luma_hv_12_8x8_neon: 34.3 ( 2.41x)
put_luma_hv_12_16x16_c: 240.2 ( 1.00x)
put_luma_hv_12_16x16_neon: 115.3 ( 2.08x)
put_luma_hv_12_32x32_c: 787.7 ( 1.00x)
put_luma_hv_12_32x32_neon: 414.2 ( 1.90x)
put_luma_hv_12_64x64_c: 3058.4 ( 1.00x)
put_luma_hv_12_64x64_neon: 1592.3 ( 1.92x)
put_luma_hv_12_128x128_c: 11350.8 ( 1.00x)
put_luma_hv_12_128x128_neon: 6378.3 ( 1.78x)
RPi4:
put_luma_hv_10_4x4_c: 637.8 ( 1.00x)
put_luma_hv_10_8x8_c: 1044.9 ( 1.00x)
put_luma_hv_10_8x8_neon: 483.7 ( 2.16x)
put_luma_hv_10_16x16_c: 3098.0 ( 1.00x)
put_luma_hv_10_16x16_neon: 1603.1 ( 1.93x)
put_luma_hv_10_32x32_c: 10054.8 ( 1.00x)
put_luma_hv_10_32x32_neon: 5843.6 ( 1.72x)
put_luma_hv_10_64x64_c: 40506.2 ( 1.00x)
put_luma_hv_10_64x64_neon: 24384.0 ( 1.66x)
put_luma_hv_10_128x128_c: 130604.2 ( 1.00x)
put_luma_hv_10_128x128_neon: 99746.6 ( 1.31x)
put_luma_hv_12_4x4_c: 638.2 ( 1.00x)
put_luma_hv_12_8x8_c: 1074.6 ( 1.00x)
put_luma_hv_12_8x8_neon: 482.6 ( 2.23x)
put_luma_hv_12_16x16_c: 3094.0 ( 1.00x)
put_luma_hv_12_16x16_neon: 1602.5 ( 1.93x)
put_luma_hv_12_32x32_c: 10034.4 ( 1.00x)
put_luma_hv_12_32x32_neon: 5843.3 ( 1.72x)
put_luma_hv_12_64x64_c: 40447.5 ( 1.00x)
put_luma_hv_12_64x64_neon: 24377.2 ( 1.66x)
put_luma_hv_12_128x128_c: 130610.4 ( 1.00x)
put_luma_hv_12_128x128_neon: 99765.8 ( 1.31x)
2026-03-04 12:53:16 +00:00
Andreas Rheinhardt
dc65dcec22
avcodec/vvc/inter: Combine offsets early
...
For bi-predicted weighted averages, only the sum
of the two offsets is ever used, so add the two early.
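The idea can be sketched in C as follows. This is a simplified illustration assuming the usual explicit bi-prediction weighting form, in which the two offsets only appear as o0 + o1; the struct, field, and function names are hypothetical and do not come from the decoder.

```c
#include <stdint.h>

/* Hypothetical parameter bundle; field names are illustrative only. */
typedef struct BiWeight {
    int w0, w1;   /* per-reference weights */
    int offset;   /* o0 + o1, summed once at setup time */
    int log2_wd;  /* weight denominator shift */
} BiWeight;

/* Combine the two offsets early: nothing downstream needs them separately. */
static BiWeight bi_weight_init(int w0, int o0, int w1, int o1, int log2_wd)
{
    BiWeight bw = { w0, w1, o0 + o1, log2_wd };
    return bw;
}

static int clip_pixel(int v, int bit_depth)
{
    const int max = (1 << bit_depth) - 1;
    return v < 0 ? 0 : v > max ? max : v;
}

/* Simplified per-sample weighted average (intermediate precision scaling omitted).
 * Assumes an arithmetic right shift for negative sums. */
static int bi_weighted_sample(const BiWeight *bw, int p0, int p1, int bit_depth)
{
    const int rounding = (bw->offset + 1) * (1 << bw->log2_wd);
    const int sum      = p0 * bw->w0 + p1 * bw->w1 + rounding;
    return clip_pixel(sum >> (bw->log2_wd + 1), bit_depth);
}
```

Storing only the sum means one fewer value to carry per block and one fewer addition in the per-block setup.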
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-02-25 12:08:33 +01:00
Georgii Zagoruiko
8acdffa22c
aarch64/vvc: Optimisations of put_luma_v() functions for 10/12-bit
...
RPi4 (auto-vectorisation is on)
put_luma_v_10_4x4_c: 303.3 ( 1.00x)
put_luma_v_10_4x4_neon: 55.7 ( 5.45x)
put_luma_v_10_8x8_c: 1106.7 ( 1.00x)
put_luma_v_10_8x8_neon: 163.8 ( 6.76x)
put_luma_v_10_16x16_c: 2242.1 ( 1.00x)
put_luma_v_10_16x16_neon: 672.7 ( 3.33x)
put_luma_v_10_32x32_c: 7057.3 ( 1.00x)
put_luma_v_10_32x32_neon: 2731.3 ( 2.58x)
put_luma_v_10_64x64_c: 25699.8 ( 1.00x)
put_luma_v_10_64x64_neon: 12145.6 ( 2.12x)
put_luma_v_10_128x128_c: 90694.6 ( 1.00x)
put_luma_v_10_128x128_neon: 44862.4 ( 2.02x)
put_luma_v_12_4x4_c: 304.4 ( 1.00x)
put_luma_v_12_4x4_neon: 55.6 ( 5.47x)
put_luma_v_12_8x8_c: 1107.4 ( 1.00x)
put_luma_v_12_8x8_neon: 164.7 ( 6.72x)
put_luma_v_12_16x16_c: 2235.8 ( 1.00x)
put_luma_v_12_16x16_neon: 672.5 ( 3.32x)
put_luma_v_12_32x32_c: 7049.2 ( 1.00x)
put_luma_v_12_32x32_neon: 2731.6 ( 2.58x)
put_luma_v_12_64x64_c: 25706.5 ( 1.00x)
put_luma_v_12_64x64_neon: 12145.0 ( 2.12x)
put_luma_v_12_128x128_c: 90672.5 ( 1.00x)
put_luma_v_12_128x128_neon: 44857.1 ( 2.02x)
Apple M4 (auto-vectorisation is on):
put_luma_v_10_4x4_c: 25.6 ( 1.00x)
put_luma_v_10_4x4_neon: 3.1 ( 8.18x)
put_luma_v_10_8x8_c: 34.7 ( 1.00x)
put_luma_v_10_8x8_neon: 10.5 ( 3.32x)
put_luma_v_10_16x16_c: 103.9 ( 1.00x)
put_luma_v_10_16x16_neon: 42.3 ( 2.45x)
put_luma_v_10_32x32_c: 399.7 ( 1.00x)
put_luma_v_10_32x32_neon: 161.8 ( 2.47x)
put_luma_v_10_64x64_c: 1276.7 ( 1.00x)
put_luma_v_10_64x64_neon: 840.1 ( 1.52x)
put_luma_v_10_128x128_c: 4981.3 ( 1.00x)
put_luma_v_10_128x128_neon: 3008.0 ( 1.66x)
put_luma_v_12_4x4_c: 23.6 ( 1.00x)
put_luma_v_12_4x4_neon: 2.0 (11.84x)
put_luma_v_12_8x8_c: 31.8 ( 1.00x)
put_luma_v_12_8x8_neon: 12.4 ( 2.55x)
put_luma_v_12_16x16_c: 100.8 ( 1.00x)
put_luma_v_12_16x16_neon: 44.9 ( 2.25x)
put_luma_v_12_32x32_c: 331.1 ( 1.00x)
put_luma_v_12_32x32_neon: 175.2 ( 1.89x)
put_luma_v_12_64x64_c: 1227.1 ( 1.00x)
put_luma_v_12_64x64_neon: 712.7 ( 1.72x)
put_luma_v_12_128x128_c: 5149.1 ( 1.00x)
put_luma_v_12_128x128_neon: 2809.3 ( 1.83x)
2026-01-08 17:35:55 +00:00
Georgii Zagoruiko
f790de2a87
aarch64/vvc: Optimisations of put_luma_h() functions for 10/12-bit
...
RPi4 (auto-vectorisation is turned on)
put_luma_h_10_4x4_c: 282.8 ( 1.00x)
put_luma_h_10_8x8_c: 1069.5 ( 1.00x)
put_luma_h_10_8x8_neon: 207.5 ( 5.15x)
put_luma_h_10_16x16_c: 1999.6 ( 1.00x)
put_luma_h_10_16x16_neon: 777.5 ( 2.57x)
put_luma_h_10_32x32_c: 6612.9 ( 1.00x)
put_luma_h_10_32x32_neon: 3201.6 ( 2.07x)
put_luma_h_10_64x64_c: 25059.0 ( 1.00x)
put_luma_h_10_64x64_neon: 13623.5 ( 1.84x)
put_luma_h_10_128x128_c: 91310.1 ( 1.00x)
put_luma_h_10_128x128_neon: 50358.3 ( 1.81x)
put_luma_h_12_4x4_c: 282.1 ( 1.00x)
put_luma_h_12_8x8_c: 1068.4 ( 1.00x)
put_luma_h_12_8x8_neon: 207.7 ( 5.14x)
put_luma_h_12_16x16_c: 1998.0 ( 1.00x)
put_luma_h_12_16x16_neon: 777.5 ( 2.57x)
put_luma_h_12_32x32_c: 6612.0 ( 1.00x)
put_luma_h_12_32x32_neon: 3201.6 ( 2.07x)
put_luma_h_12_64x64_c: 25036.8 ( 1.00x)
put_luma_h_12_64x64_neon: 13595.1 ( 1.84x)
put_luma_h_12_128x128_c: 91305.8 ( 1.00x)
put_luma_h_12_128x128_neon: 50359.7 ( 1.81x)
Apple M2 Air (auto-vectorisation is turned on)
put_luma_h_10_4x4_c: 0.3 ( 1.00x)
put_luma_h_10_8x8_c: 1.0 ( 1.00x)
put_luma_h_10_8x8_neon: 0.4 ( 2.59x)
put_luma_h_10_16x16_c: 2.9 ( 1.00x)
put_luma_h_10_16x16_neon: 1.4 ( 2.01x)
put_luma_h_10_32x32_c: 9.4 ( 1.00x)
put_luma_h_10_32x32_neon: 5.8 ( 1.62x)
put_luma_h_10_64x64_c: 35.6 ( 1.00x)
put_luma_h_10_64x64_neon: 23.6 ( 1.51x)
put_luma_h_10_128x128_c: 131.1 ( 1.00x)
put_luma_h_10_128x128_neon: 92.6 ( 1.42x)
put_luma_h_12_4x4_c: 0.3 ( 1.00x)
put_luma_h_12_8x8_c: 1.0 ( 1.00x)
put_luma_h_12_8x8_neon: 0.4 ( 2.58x)
put_luma_h_12_16x16_c: 2.9 ( 1.00x)
put_luma_h_12_16x16_neon: 1.4 ( 2.00x)
put_luma_h_12_32x32_c: 9.4 ( 1.00x)
put_luma_h_12_32x32_neon: 5.8 ( 1.61x)
put_luma_h_12_64x64_c: 35.3 ( 1.00x)
put_luma_h_12_64x64_neon: 23.3 ( 1.52x)
put_luma_h_12_128x128_c: 131.2 ( 1.00x)
put_luma_h_12_128x128_neon: 92.4 ( 1.42x)
2025-11-24 21:22:55 +00:00
Krzysztof Pyrkosz
03c054d43c
avcodec/aarch64/vvc: Implement dmvr_v_8
...
A72
dmvr_v_8_12x20_neon: 207.0 ( 4.15x)
dmvr_v_8_20x12_neon: 170.4 ( 4.37x)
dmvr_v_8_20x20_neon: 273.4 ( 4.58x)
A53
dmvr_v_8_12x20_neon: 450.6 ( 4.21x)
dmvr_v_8_20x12_neon: 342.8 ( 3.70x)
dmvr_v_8_20x20_neon: 550.9 ( 3.79x)
2025-09-23 11:20:20 +00:00
Krzysztof Pyrkosz
56a638d836
avcodec/aarch64/vvc: Unroll vvc_bdof_grad_filter_8x_neon
...
Before and after:
A53:
apply_bdof_8_16x8_neon: 2733.1 ( 4.88x)
apply_bdof_8_16x16_neon: 5458.6 ( 4.86x)
apply_bdof_10_16x8_neon: 2789.8 ( 4.64x)
apply_bdof_10_16x16_neon: 5523.8 ( 4.68x)
apply_bdof_12_16x8_neon: 2792.8 ( 4.58x)
apply_bdof_12_16x16_neon: 5519.5 ( 4.63x)
apply_bdof_8_16x8_neon: 2571.8 ( 5.12x)
apply_bdof_8_16x16_neon: 5173.3 ( 5.12x)
apply_bdof_10_16x8_neon: 2635.1 ( 4.87x)
apply_bdof_10_16x16_neon: 5243.0 ( 4.89x)
apply_bdof_12_16x8_neon: 2613.0 ( 4.89x)
apply_bdof_12_16x16_neon: 5231.7 ( 4.90x)
A78:
apply_bdof_8_16x8_neon: 565.3 ( 8.43x)
apply_bdof_8_16x16_neon: 1109.5 ( 8.60x)
apply_bdof_10_16x8_neon: 568.2 ( 7.92x)
apply_bdof_10_16x16_neon: 1114.1 ( 8.08x)
apply_bdof_12_16x8_neon: 570.2 ( 7.87x)
apply_bdof_12_16x16_neon: 1116.3 ( 8.03x)
apply_bdof_8_16x8_neon: 541.4 ( 8.81x)
apply_bdof_8_16x16_neon: 1065.9 ( 8.97x)
apply_bdof_10_16x8_neon: 543.2 ( 8.32x)
apply_bdof_10_16x16_neon: 1071.5 ( 8.39x)
apply_bdof_12_16x8_neon: 544.2 ( 8.25x)
apply_bdof_12_16x16_neon: 1074.1 ( 8.37x)
2025-09-23 11:20:11 +00:00
Krzysztof Pyrkosz
f1a155d975
avcodec/aarch64/vvc: Optimize dmvr_hv_10
...
Before and after on A53:
dmvr_hv_10_12x20_neon: 1838.2 ( 3.02x)
dmvr_hv_10_20x12_neon: 1330.2 ( 1.83x)
dmvr_hv_10_20x20_neon: 2148.2 ( 1.85x)
dmvr_hv_12_12x20_neon: 1839.2 ( 3.02x)
dmvr_hv_12_20x12_neon: 1330.6 ( 1.83x)
dmvr_hv_12_20x20_neon: 2147.2 ( 1.85x)
dmvr_hv_10_12x20_neon: 1755.0 ( 3.17x)
dmvr_hv_10_20x12_neon: 1165.8 ( 2.09x)
dmvr_hv_10_20x20_neon: 1876.1 ( 2.12x)
dmvr_hv_12_12x20_neon: 1754.4 ( 3.17x)
dmvr_hv_12_20x12_neon: 1167.8 ( 2.09x)
dmvr_hv_12_20x20_neon: 1878.8 ( 2.12x)
2025-09-21 19:39:27 +00:00
Georgii Zagoruiko
4fbacb3944
avcodec/aarch64/vvc: Optimised version of classify function.
...
Macbook Air (M2):
vvc_alf_classify_8x8_8_c: 2.6 ( 1.00x)
vvc_alf_classify_8x8_8_neon: 1.0 ( 2.47x)
vvc_alf_classify_8x8_10_c: 2.7 ( 1.00x)
vvc_alf_classify_8x8_10_neon: 0.9 ( 2.98x)
vvc_alf_classify_8x8_12_c: 2.7 ( 1.00x)
vvc_alf_classify_8x8_12_neon: 0.9 ( 2.97x)
vvc_alf_classify_16x16_8_c: 7.3 ( 1.00x)
vvc_alf_classify_16x16_8_neon: 3.4 ( 2.12x)
vvc_alf_classify_16x16_10_c: 4.3 ( 1.00x)
vvc_alf_classify_16x16_10_neon: 2.9 ( 1.47x)
vvc_alf_classify_16x16_12_c: 4.3 ( 1.00x)
vvc_alf_classify_16x16_12_neon: 3.0 ( 1.44x)
vvc_alf_classify_32x32_8_c: 13.7 ( 1.00x)
vvc_alf_classify_32x32_8_neon: 10.7 ( 1.29x)
vvc_alf_classify_32x32_10_c: 12.3 ( 1.00x)
vvc_alf_classify_32x32_10_neon: 8.7 ( 1.42x)
vvc_alf_classify_32x32_12_c: 12.2 ( 1.00x)
vvc_alf_classify_32x32_12_neon: 8.7 ( 1.40x)
vvc_alf_classify_64x64_8_c: 45.8 ( 1.00x)
vvc_alf_classify_64x64_8_neon: 37.1 ( 1.23x)
vvc_alf_classify_64x64_10_c: 41.3 ( 1.00x)
vvc_alf_classify_64x64_10_neon: 32.8 ( 1.26x)
vvc_alf_classify_64x64_12_c: 41.4 ( 1.00x)
vvc_alf_classify_64x64_12_neon: 32.4 ( 1.28x)
vvc_alf_classify_128x128_8_c: 163.7 ( 1.00x)
vvc_alf_classify_128x128_8_neon: 138.3 ( 1.18x)
vvc_alf_classify_128x128_10_c: 149.1 ( 1.00x)
vvc_alf_classify_128x128_10_neon: 120.3 ( 1.24x)
vvc_alf_classify_128x128_12_c: 148.7 ( 1.00x)
vvc_alf_classify_128x128_12_neon: 119.4 ( 1.25x)
RPi4 (Cortex-A72):
vvc_alf_classify_8x8_8_c: 1251.6 ( 1.00x)
vvc_alf_classify_8x8_8_neon: 700.7 ( 1.79x)
vvc_alf_classify_8x8_10_c: 1141.9 ( 1.00x)
vvc_alf_classify_8x8_10_neon: 659.7 ( 1.73x)
vvc_alf_classify_8x8_12_c: 1075.8 ( 1.00x)
vvc_alf_classify_8x8_12_neon: 658.7 ( 1.63x)
vvc_alf_classify_16x16_8_c: 3574.1 ( 1.00x)
vvc_alf_classify_16x16_8_neon: 1849.8 ( 1.93x)
vvc_alf_classify_16x16_10_c: 3270.0 ( 1.00x)
vvc_alf_classify_16x16_10_neon: 1786.1 ( 1.83x)
vvc_alf_classify_16x16_12_c: 3271.7 ( 1.00x)
vvc_alf_classify_16x16_12_neon: 1785.5 ( 1.83x)
vvc_alf_classify_32x32_8_c: 12451.9 ( 1.00x)
vvc_alf_classify_32x32_8_neon: 5984.3 ( 2.08x)
vvc_alf_classify_32x32_10_c: 11428.9 ( 1.00x)
vvc_alf_classify_32x32_10_neon: 5756.3 ( 1.99x)
vvc_alf_classify_32x32_12_c: 11252.8 ( 1.00x)
vvc_alf_classify_32x32_12_neon: 5755.7 ( 1.96x)
vvc_alf_classify_64x64_8_c: 47625.5 ( 1.00x)
vvc_alf_classify_64x64_8_neon: 21071.9 ( 2.26x)
vvc_alf_classify_64x64_10_c: 44576.3 ( 1.00x)
vvc_alf_classify_64x64_10_neon: 21544.7 ( 2.07x)
vvc_alf_classify_64x64_12_c: 44600.5 ( 1.00x)
vvc_alf_classify_64x64_12_neon: 21491.2 ( 2.08x)
vvc_alf_classify_128x128_8_c: 192143.3 ( 1.00x)
vvc_alf_classify_128x128_8_neon: 82387.6 ( 2.33x)
vvc_alf_classify_128x128_10_c: 177583.1 ( 1.00x)
vvc_alf_classify_128x128_10_neon: 81628.8 ( 2.18x)
vvc_alf_classify_128x128_12_c: 177582.2 ( 1.00x)
vvc_alf_classify_128x128_12_neon: 81625.1 ( 2.18x)
2025-09-09 22:13:04 +01:00
Krzysztof Pyrkosz
de25cb4603
avcodec/aarch64/vvc: Optimize vvc_apply_bdof_block_8x
...
Before and after:
A53:
apply_bdof_8_8x16_neon: 3320.5 ( 4.02x)
apply_bdof_10_8x16_neon: 3317.8 ( 3.90x)
apply_bdof_12_8x16_neon: 3303.6 ( 3.91x)
apply_bdof_8_8x16_neon: 3168.1 ( 4.23x)
apply_bdof_10_8x16_neon: 3127.8 ( 4.13x)
apply_bdof_12_8x16_neon: 3119.3 ( 4.18x)
A72:
apply_bdof_8_8x16_neon: 1827.4 ( 5.02x)
apply_bdof_10_8x16_neon: 1838.5 ( 4.89x)
apply_bdof_12_8x16_neon: 1841.1 ( 4.83x)
apply_bdof_8_8x16_neon: 1691.6 ( 5.46x)
apply_bdof_10_8x16_neon: 1695.9 ( 5.23x)
apply_bdof_12_8x16_neon: 1695.4 ( 5.29x)
A78
apply_bdof_8_8x16_neon: 648.9 ( 7.43x)
apply_bdof_10_8x16_neon: 646.1 ( 7.04x)
apply_bdof_12_8x16_neon: 643.8 ( 7.04x)
apply_bdof_8_8x16_neon: 603.2 ( 7.97x)
apply_bdof_10_8x16_neon: 604.1 ( 7.52x)
apply_bdof_12_8x16_neon: 604.5 ( 7.52x)
2025-09-09 16:37:28 +00:00
Krzysztof Pyrkosz
7b21bde34c
avcodec/aarch64/vvc: Implemented dmvr_h_10
...
A78:
dmvr_h_10_12x20_neon: 82.2 ( 6.49x)
dmvr_h_10_20x12_neon: 69.9 ( 3.66x)
dmvr_h_10_20x20_neon: 112.5 ( 3.74x)
dmvr_h_12_12x20_neon: 81.4 ( 6.51x)
dmvr_h_12_20x12_neon: 69.2 ( 3.74x)
dmvr_h_12_20x20_neon: 110.2 ( 3.85x)
A72:
dmvr_h_10_12x20_neon: 234.1 ( 4.67x)
dmvr_h_10_20x12_neon: 221.4 ( 3.48x)
dmvr_h_10_20x20_neon: 356.9 ( 3.59x)
dmvr_h_12_12x20_neon: 234.1 ( 4.67x)
dmvr_h_12_20x12_neon: 221.5 ( 3.53x)
dmvr_h_12_20x20_neon: 357.0 ( 3.64x)
2025-09-08 17:51:20 +00:00
Krzysztof Pyrkosz
189e841cfd
avcodec/aarch64/vvc: Implement dmvr_h_8
...
A78:
dmvr_h_8_12x20_neon: 76.6 ( 4.31x)
dmvr_h_8_20x12_neon: 65.8 ( 3.49x)
dmvr_h_8_20x20_neon: 106.6 ( 3.62x)
A72:
dmvr_h_8_12x20_neon: 190.6 ( 4.40x)
dmvr_h_8_20x12_neon: 171.1 ( 4.31x)
dmvr_h_8_20x20_neon: 275.1 ( 4.50x)
2025-09-08 17:51:20 +00:00
Krzysztof Pyrkosz
fb4407797e
Replace uxtl with umull in dmvr_hv_8
...
Before and after on A78:
dmvr_hv_8_12x20_neon: 205.3 ( 5.21x)
dmvr_hv_8_20x12_neon: 171.8 ( 3.15x)
dmvr_hv_8_20x20_neon: 282.7 ( 3.11x)
dmvr_hv_8_12x20_neon: 172.7 ( 5.58x)
dmvr_hv_8_20x12_neon: 133.3 ( 3.36x)
dmvr_hv_8_20x20_neon: 214.6 ( 3.40x)
2025-09-05 07:20:15 +00:00
Zhao Zhili
6ce02bcc3a
avcodec/aarch64/vvc: Optimize apply_bdof
...
Before this patch, prof_grad_filter calculated
gh[0], gh[1], gv[0], gv[1] and saved them to the stack.
derive_bdof_vx_vy loaded them from the stack and calculated
gh[0] + gh[1], gv[0] + gv[1].
apply_bdof_min_block loaded them from the stack and calculated
gh[0] - gh[1], gv[0] - gv[1].
This patch adds bdof_grad_filter, which calculates gh[0] + gh[1],
gh[0] - gh[1], gv[0] + gv[1], gv[0] - gv[1], and saves them to the
stack, so derive_bdof_vx_vy and apply_bdof_min_block can use the
results directly.
prof_grad_filter is kept for reuse by other functions in the future.
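A scalar C sketch of the restructuring, for one sample position: the sums and differences are formed once where the gradients are produced, and the two consumers read them back directly. The GradCombo type and combine_gradients() name are hypothetical; the real code is NEON assembly operating on whole rows.

```c
#include <stdint.h>

/* Hypothetical container for the four precomputed combinations. */
typedef struct GradCombo {
    int16_t gh_sum;   /* gh[0] + gh[1], consumed when deriving vx/vy */
    int16_t gh_diff;  /* gh[0] - gh[1], consumed when applying the offset */
    int16_t gv_sum;   /* gv[0] + gv[1] */
    int16_t gv_diff;  /* gv[0] - gv[1] */
} GradCombo;

/* Form all four combinations in one pass instead of storing the raw
 * gradients and recombining them in each consumer. */
static GradCombo combine_gradients(int gh0, int gh1, int gv0, int gv1)
{
    GradCombo c;
    c.gh_sum  = (int16_t)(gh0 + gh1);
    c.gh_diff = (int16_t)(gh0 - gh1);
    c.gv_sum  = (int16_t)(gv0 + gv1);
    c.gv_diff = (int16_t)(gv0 - gv1);
    return c;
}
```

The storage footprint is unchanged (four values per position either way), but each consumer drops one arithmetic step per loaded pair.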
Benchmark on rpi5 with gcc 12
Before After
--------------------------------------------------------------------
apply_bdof_8_8x16_c: | 7431.4 ( 1.00x) | 7371.7 ( 1.00x)
apply_bdof_8_8x16_neon: | 1175.4 ( 6.32x) | 1036.3 ( 7.11x)
apply_bdof_8_16x8_c: | 7182.2 ( 1.00x) | 7201.1 ( 1.00x)
apply_bdof_8_16x8_neon: | 1021.7 ( 7.03x) | 879.9 ( 8.18x)
apply_bdof_8_16x16_c: | 14577.1 ( 1.00x) | 14589.3 ( 1.00x)
apply_bdof_8_16x16_neon: | 2012.8 ( 7.24x) | 1743.3 ( 8.37x)
apply_bdof_10_8x16_c: | 7292.4 ( 1.00x) | 7308.5 ( 1.00x)
apply_bdof_10_8x16_neon: | 1156.3 ( 6.31x) | 1045.3 ( 6.99x)
apply_bdof_10_16x8_c: | 7112.4 ( 1.00x) | 7214.4 ( 1.00x)
apply_bdof_10_16x8_neon: | 1007.6 ( 7.06x) | 904.8 ( 7.97x)
apply_bdof_10_16x16_c: | 14363.3 ( 1.00x) | 14476.4 ( 1.00x)
apply_bdof_10_16x16_neon: | 1986.9 ( 7.23x) | 1783.1 ( 8.12x)
apply_bdof_12_8x16_c: | 7433.3 ( 1.00x) | 7374.7 ( 1.00x)
apply_bdof_12_8x16_neon: | 1155.9 ( 6.43x) | 1040.8 ( 7.09x)
apply_bdof_12_16x8_c: | 7171.1 ( 1.00x) | 7376.3 ( 1.00x)
apply_bdof_12_16x8_neon: | 1010.8 ( 7.09x) | 899.4 ( 8.20x)
apply_bdof_12_16x16_c: | 14515.5 ( 1.00x) | 14731.5 ( 1.00x)
apply_bdof_12_16x16_neon: | 1988.4 ( 7.30x) | 1785.2 ( 8.25x)
2025-09-03 06:55:37 +00:00
Zhao Zhili
2e92417603
avcodec/aarch64/vvc: Optimize derive_bdof_vx_vy
...
Implement line tricks and pixel tricks. See comments in inter.S
for details.
Benchmark on rpi5 with gcc 12
Before After
-----------------------------------------------------------------
apply_bdof_8_8x16_c: | 7375.5 ( 1.00x) | 7473.8 ( 1.00x)
apply_bdof_8_8x16_neon: | 1875.1 ( 3.93x) | 1135.8 ( 6.58x)
apply_bdof_8_16x8_c: | 7273.9 ( 1.00x) | 7204.0 ( 1.00x)
apply_bdof_8_16x8_neon: | 1738.2 ( 4.18x) | 1013.0 ( 7.11x)
apply_bdof_8_16x16_c: | 14744.9 ( 1.00x) | 14712.6 ( 1.00x)
apply_bdof_8_16x16_neon: | 3446.7 ( 4.28x) | 1997.7 ( 7.36x)
apply_bdof_10_8x16_c: | 7352.4 ( 1.00x) | 7485.7 ( 1.00x)
apply_bdof_10_8x16_neon: | 1861.0 ( 3.95x) | 1134.1 ( 6.60x)
apply_bdof_10_16x8_c: | 7330.5 ( 1.00x) | 7232.8 ( 1.00x)
apply_bdof_10_16x8_neon: | 1747.2 ( 4.20x) | 1002.6 ( 7.21x)
apply_bdof_10_16x16_c: | 14522.4 ( 1.00x) | 14664.8 ( 1.00x)
apply_bdof_10_16x16_neon: | 3490.5 ( 4.16x) | 1978.4 ( 7.41x)
apply_bdof_12_8x16_c: | 7389.0 ( 1.00x) | 7380.1 ( 1.00x)
apply_bdof_12_8x16_neon: | 1861.3 ( 3.97x) | 1134.0 ( 6.51x)
apply_bdof_12_16x8_c: | 7283.1 ( 1.00x) | 7336.9 ( 1.00x)
apply_bdof_12_16x8_neon: | 1749.1 ( 4.16x) | 1002.3 ( 7.32x)
apply_bdof_12_16x16_c: | 14580.7 ( 1.00x) | 14502.7 ( 1.00x)
apply_bdof_12_16x16_neon: | 3472.9 ( 4.20x) | 1978.3 ( 7.33x)
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2025-09-03 06:55:37 +00:00
Zhao Zhili
39786f8cd5
aarch64/h26x: optimize sao_band_filter
...
int8_t[] is enough for offset_table of 8 bit streams.
On rpi5:
Before After
hevc_sao_band_8_8_c: 252.3 ( 1.00x) 252.3 ( 1.00x)
hevc_sao_band_8_8_neon: 95.8 ( 2.63x) 61.0 ( 4.57x)
hevc_sao_band_16_8_c: 875.2 ( 1.00x) 864.9 ( 1.00x)
hevc_sao_band_16_8_neon: 317.5 ( 2.76x) 150.0 ( 6.26x)
hevc_sao_band_32_8_c: 3853.5 ( 1.00x) 3871.6 ( 1.00x)
hevc_sao_band_32_8_neon: 1222.3 ( 3.15x) 550.6 ( 7.39x)
hevc_sao_band_48_8_c: 8203.6 ( 1.00x) 8182.6 ( 1.00x)
hevc_sao_band_48_8_neon: 2685.7 ( 3.05x) 1185.8 ( 7.36x)
hevc_sao_band_64_8_c: 14023.0 ( 1.00x) 14038.9 ( 1.00x)
hevc_sao_band_64_8_neon: 4783.2 ( 2.93x) 2078.4 ( 7.15x)
Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2025-04-29 15:11:45 +08:00
Krzysztof Pyrkosz
f9b8f30680
avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12}
...
This patch replaces integer widening with halving addition, and
multi-step "emulated" rounding shift with a single asm instruction doing
exactly that.
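The arithmetic behind this can be checked with a scalar C model. The sketch below assumes the average has the form (a + b + rounding) >> shift with shift >= 2, models a halving add as a floor of (a + b) / 2, and omits the final clip; the function names are hypothetical and the actual instruction selection in the assembly may differ.

```c
#include <assert.h>

/* Reference form: widen, add, add the rounding offset, shift.
 * Assumes an arithmetic right shift for negative sums. */
static int avg_widening(int a, int b, int shift)
{
    return (a + b + (1 << (shift - 1))) >> shift;
}

/* Halving add first (the sum never needs an extra bit), then a single
 * rounding shift by one bit less. */
static int avg_halving(int a, int b, int shift)
{
    assert(shift >= 2);
    const int half = (a + b) >> 1;
    return (half + (1 << (shift - 2))) >> (shift - 1);
}

/* Returns 1 if both forms agree for every pair in [lo, hi]. */
static int avg_forms_agree(int lo, int hi, int shift)
{
    for (int a = lo; a <= hi; a++)
        for (int b = lo; b <= hi; b++)
            if (avg_widening(a, b, shift) != avg_halving(a, b, shift))
                return 0;
    return 1;
}
```

The bit dropped by the halving add can only matter if a + b plus the rounding offset is an exact multiple of 2^shift, which is impossible when a + b is odd, so the two forms agree for any shift of at least 2.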
Benchmarks before and after:
A78
avg_8_64x64_neon: 2686.2 ( 6.12x)
avg_8_128x128_neon: 10734.2 ( 5.88x)
avg_10_64x64_neon: 2536.8 ( 5.40x)
avg_10_128x128_neon: 10079.0 ( 5.22x)
avg_12_64x64_neon: 2548.2 ( 5.38x)
avg_12_128x128_neon: 10133.8 ( 5.19x)
avg_8_64x64_neon: 897.8 (18.26x)
avg_8_128x128_neon: 3608.5 (17.37x)
avg_10_32x32_neon: 444.2 ( 8.51x)
avg_10_64x64_neon: 1711.8 ( 8.00x)
avg_12_64x64_neon: 1706.2 ( 8.02x)
avg_12_128x128_neon: 7010.0 ( 7.46x)
A72
avg_8_64x64_neon: 5823.4 ( 3.88x)
avg_8_128x128_neon: 17430.5 ( 4.73x)
avg_10_64x64_neon: 5228.1 ( 3.71x)
avg_10_128x128_neon: 16722.2 ( 4.17x)
avg_12_64x64_neon: 5379.1 ( 3.51x)
avg_12_128x128_neon: 16715.7 ( 4.17x)
avg_8_64x64_neon: 2006.5 (10.61x)
avg_8_128x128_neon: 9158.7 ( 8.96x)
avg_10_64x64_neon: 3357.7 ( 5.60x)
avg_10_128x128_neon: 12411.7 ( 5.56x)
avg_12_64x64_neon: 3317.5 ( 5.67x)
avg_12_128x128_neon: 12358.5 ( 5.58x)
A53
avg_8_64x64_neon: 8327.8 ( 5.18x)
avg_8_128x128_neon: 31631.3 ( 5.34x)
avg_10_64x64_neon: 8783.5 ( 4.98x)
avg_10_128x128_neon: 32617.0 ( 5.25x)
avg_12_64x64_neon: 8686.0 ( 5.06x)
avg_12_128x128_neon: 32487.5 ( 5.25x)
avg_8_64x64_neon: 6032.3 ( 7.17x)
avg_8_128x128_neon: 22008.5 ( 7.69x)
avg_10_64x64_neon: 7738.0 ( 5.68x)
avg_10_128x128_neon: 27813.8 ( 6.14x)
avg_12_64x64_neon: 7844.5 ( 5.60x)
avg_12_128x128_neon: 26999.5 ( 6.34x)
Signed-off-by: Martin Storsjö <martin@martin.st>
2025-03-07 15:51:20 +02:00
Krzysztof Pyrkosz
71a91485fa
avcodec/aarch64/vvc: Optimize NEON version of vvc_dmvr
...
This patch replaces blocks of instructions performing rounding and
widening shifts with one-liners achieving the same result.
Before and after on A78
dmvr_8_12x20_neon: 86.2 ( 6.90x)
dmvr_8_20x12_neon: 94.8 ( 5.93x)
dmvr_8_20x20_neon: 141.5 ( 6.50x)
dmvr_12_12x20_neon: 158.0 ( 3.76x)
dmvr_12_20x12_neon: 151.2 ( 3.73x)
dmvr_12_20x20_neon: 247.2 ( 3.71x)
dmvr_hv_8_12x20_neon: 423.2 ( 3.75x)
dmvr_hv_8_20x12_neon: 434.0 ( 3.69x)
dmvr_hv_8_20x20_neon: 706.0 ( 3.69x)
dmvr_8_12x20_neon: 77.2 ( 7.70x)
dmvr_8_20x12_neon: 66.5 ( 8.49x)
dmvr_8_20x20_neon: 92.2 ( 9.90x)
dmvr_12_12x20_neon: 80.2 ( 7.38x)
dmvr_12_20x12_neon: 58.2 ( 9.59x)
dmvr_12_20x20_neon: 90.0 (10.15x)
dmvr_hv_8_12x20_neon: 369.0 ( 4.34x)
dmvr_hv_8_20x12_neon: 355.8 ( 4.49x)
dmvr_hv_8_20x20_neon: 574.2 ( 4.51x)
Signed-off-by: Martin Storsjö <martin@martin.st>
2025-03-04 10:35:31 +02:00
Zhao Zhili
952508ae05
aarch64/vvc: Add apply_bdof
...
Test on rpi 5 with gcc 12:
apply_bdof_8_8x16_c: 7315.2 ( 1.00x)
apply_bdof_8_8x16_neon: 1876.8 ( 3.90x)
apply_bdof_8_16x8_c: 7170.5 ( 1.00x)
apply_bdof_8_16x8_neon: 1752.8 ( 4.09x)
apply_bdof_8_16x16_c: 14695.2 ( 1.00x)
apply_bdof_8_16x16_neon: 3490.5 ( 4.21x)
apply_bdof_10_8x16_c: 7371.5 ( 1.00x)
apply_bdof_10_8x16_neon: 1863.8 ( 3.96x)
apply_bdof_10_16x8_c: 7172.0 ( 1.00x)
apply_bdof_10_16x8_neon: 1766.0 ( 4.06x)
apply_bdof_10_16x16_c: 14551.5 ( 1.00x)
apply_bdof_10_16x16_neon: 3576.0 ( 4.07x)
apply_bdof_12_8x16_c: 7236.5 ( 1.00x)
apply_bdof_12_8x16_neon: 1863.8 ( 3.88x)
apply_bdof_12_16x8_c: 7316.5 ( 1.00x)
apply_bdof_12_16x8_neon: 1758.8 ( 4.16x)
apply_bdof_12_16x16_c: 14691.2 ( 1.00x)
apply_bdof_12_16x16_neon: 3480.5 ( 4.22x)
2024-12-21 11:54:44 +08:00
Martin Storsjö
2bb00ef59c
aarch64: vvc: Fix building the dmvr_hv assembly with older MSVC versions
...
Explicitly use ldur for unaligned offsets; newer versions of
armasm64 implicitly convert ldr to ldur as necessary, but older
versions require it explicitly written out.
This fixes these build errors:
ffmpeg\libavcodec\aarch64\vvc\inter.o.asm(2039) :
error A2518: operand 2: Memory offset must be aligned
ldr s5, [x1, #1]
ffmpeg\libavcodec\aarch64\vvc\inter.o.asm(2250) :
error A2518: operand 2: Memory offset must be aligned
ldr d7, [x1, #2]
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-12-18 13:45:09 +02:00
Zhao Zhili
40feba5f77
aarch64/vvc: Fix clip in alf
...
Fix test failure:
./tests/checkasm/checkasm --test=vvc_alf 3607569773
2024-12-10 21:00:47 +08:00
Zhao Zhili
91436638de
aarch64/vvc: Use faster clip operation
...
Replace sqxtn+smin+smax by sqxtun+umin.
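A per-lane scalar model in C shows why the shorter sequence gives the same result. This sketch assumes 32-bit lanes narrowed to 16 bits and a pixel maximum no larger than 32767; the helper names are hypothetical and only mirror the saturation behaviour of the named instructions.

```c
#include <stdint.h>

/* Models a signed saturating narrow to int16_t (as sqxtn does per lane). */
static int16_t sat_s16(int32_t x)
{
    return (int16_t)(x < INT16_MIN ? INT16_MIN : x > INT16_MAX ? INT16_MAX : x);
}

/* Models a signed-to-unsigned saturating narrow to uint16_t (as sqxtun does per lane). */
static uint16_t sat_u16(int32_t x)
{
    return (uint16_t)(x < 0 ? 0 : x > UINT16_MAX ? UINT16_MAX : x);
}

/* Three steps: signed narrow, clamp to max, clamp to zero. */
static int clip_three_step(int32_t x, int max)
{
    int16_t v = sat_s16(x);
    if (v > max)
        v = (int16_t)max;
    if (v < 0)
        v = 0;
    return v;
}

/* Two steps: the unsigned narrow already clamps at zero, so only the
 * upper bound is left to apply. */
static int clip_two_step(int32_t x, int max)
{
    const uint16_t v = sat_u16(x);
    return v < max ? v : max;
}
```

Both functions compute a clamp of x to [0, max]; the second gets the lower bound for free from the narrowing step.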
2024-12-10 21:00:47 +08:00
Zhao Zhili
bfed5f6b7d
aarch64/vvc: Reuse ff_vvc_put_pel_pixels for chroma
2024-12-10 21:00:47 +08:00
Zhao Zhili
5988a2729b
aarch64/vvc: Add dmvr
...
dmvr_8_12x20_c: 1.5 ( 1.00x)
dmvr_8_12x20_neon: 0.2 ( 6.56x)
dmvr_8_20x12_c: 1.0 ( 1.00x)
dmvr_8_20x12_neon: 0.2 ( 4.33x)
dmvr_8_20x20_c: 1.7 ( 1.00x)
dmvr_8_20x20_neon: 0.5 ( 3.63x)
dmvr_12_12x20_c: 2.2 ( 1.00x)
dmvr_12_12x20_neon: 0.5 ( 4.68x)
dmvr_12_20x12_c: 2.0 ( 1.00x)
dmvr_12_20x12_neon: 0.5 ( 4.16x)
dmvr_12_20x20_c: 3.7 ( 1.00x)
dmvr_12_20x20_neon: 0.7 ( 5.14x)
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-10-01 10:28:54 +08:00
Zhao Zhili
bcd65ebd8f
aarch64/vvc: Add dmvr_hv
...
dmvr_hv_8_12x20_c: 8.0 ( 1.00x)
dmvr_hv_8_12x20_neon: 1.2 ( 6.62x)
dmvr_hv_8_20x12_c: 8.0 ( 1.00x)
dmvr_hv_8_20x12_neon: 0.9 ( 8.37x)
dmvr_hv_8_20x20_c: 12.9 ( 1.00x)
dmvr_hv_8_20x20_neon: 1.7 ( 7.62x)
dmvr_hv_10_12x20_c: 7.0 ( 1.00x)
dmvr_hv_10_12x20_neon: 1.7 ( 4.09x)
dmvr_hv_10_20x12_c: 7.0 ( 1.00x)
dmvr_hv_10_20x12_neon: 1.7 ( 4.09x)
dmvr_hv_10_20x20_c: 11.2 ( 1.00x)
dmvr_hv_10_20x20_neon: 2.7 ( 4.15x)
dmvr_hv_12_12x20_c: 6.5 ( 1.00x)
dmvr_hv_12_12x20_neon: 1.7 ( 3.79x)
dmvr_hv_12_20x12_c: 6.5 ( 1.00x)
dmvr_hv_12_20x12_neon: 1.7 ( 3.79x)
dmvr_hv_12_20x20_c: 10.2 ( 1.00x)
dmvr_hv_12_20x20_neon: 2.2 ( 4.64x)
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-10-01 10:28:54 +08:00
Zhao Zhili
0ba9e8d0d4
aarch64/vvc: Add w_avg
...
w_avg_8_2x2_c: 0.0 ( 0.00x)
w_avg_8_2x2_neon: 0.0 ( 0.00x)
w_avg_8_4x4_c: 0.2 ( 1.00x)
w_avg_8_4x4_neon: 0.0 ( 0.00x)
w_avg_8_8x8_c: 1.2 ( 1.00x)
w_avg_8_8x8_neon: 0.2 ( 5.00x)
w_avg_8_16x16_c: 4.2 ( 1.00x)
w_avg_8_16x16_neon: 0.8 ( 5.67x)
w_avg_8_32x32_c: 16.2 ( 1.00x)
w_avg_8_32x32_neon: 2.5 ( 6.50x)
w_avg_8_64x64_c: 64.5 ( 1.00x)
w_avg_8_64x64_neon: 9.0 ( 7.17x)
w_avg_8_128x128_c: 269.5 ( 1.00x)
w_avg_8_128x128_neon: 35.5 ( 7.59x)
w_avg_10_2x2_c: 0.2 ( 1.00x)
w_avg_10_2x2_neon: 0.2 ( 1.00x)
w_avg_10_4x4_c: 0.2 ( 1.00x)
w_avg_10_4x4_neon: 0.2 ( 1.00x)
w_avg_10_8x8_c: 1.0 ( 1.00x)
w_avg_10_8x8_neon: 0.2 ( 4.00x)
w_avg_10_16x16_c: 4.2 ( 1.00x)
w_avg_10_16x16_neon: 0.8 ( 5.67x)
w_avg_10_32x32_c: 16.2 ( 1.00x)
w_avg_10_32x32_neon: 2.5 ( 6.50x)
w_avg_10_64x64_c: 66.2 ( 1.00x)
w_avg_10_64x64_neon: 10.0 ( 6.62x)
w_avg_10_128x128_c: 277.8 ( 1.00x)
w_avg_10_128x128_neon: 39.8 ( 6.99x)
w_avg_12_2x2_c: 0.0 ( 0.00x)
w_avg_12_2x2_neon: 0.2 ( 0.00x)
w_avg_12_4x4_c: 0.2 ( 1.00x)
w_avg_12_4x4_neon: 0.0 ( 0.00x)
w_avg_12_8x8_c: 1.2 ( 1.00x)
w_avg_12_8x8_neon: 0.5 ( 2.50x)
w_avg_12_16x16_c: 4.8 ( 1.00x)
w_avg_12_16x16_neon: 0.8 ( 6.33x)
w_avg_12_32x32_c: 17.0 ( 1.00x)
w_avg_12_32x32_neon: 2.8 ( 6.18x)
w_avg_12_64x64_c: 64.0 ( 1.00x)
w_avg_12_64x64_neon: 10.0 ( 6.40x)
w_avg_12_128x128_c: 269.2 ( 1.00x)
w_avg_12_128x128_neon: 42.0 ( 6.41x)
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-10-01 10:28:54 +08:00
Zhao Zhili
3f84d1d1fb
aarch64/vvc: Add avg
...
avg_8_2x2_c: 0.2 ( 1.00x)
avg_8_2x2_neon: 0.2 ( 1.00x)
avg_8_4x4_c: 0.2 ( 1.00x)
avg_8_4x4_neon: 0.2 ( 1.00x)
avg_8_8x8_c: 0.9 ( 1.00x)
avg_8_8x8_neon: 0.2 ( 5.29x)
avg_8_16x16_c: 3.7 ( 1.00x)
avg_8_16x16_neon: 0.7 ( 5.44x)
avg_8_32x32_c: 14.9 ( 1.00x)
avg_8_32x32_neon: 1.7 ( 8.91x)
avg_8_64x64_c: 59.7 ( 1.00x)
avg_8_64x64_neon: 6.9 ( 8.62x)
avg_8_128x128_c: 254.7 ( 1.00x)
avg_8_128x128_neon: 26.9 ( 9.46x)
avg_10_2x2_c: 0.2 ( 1.00x)
avg_10_2x2_neon: 0.2 ( 1.00x)
avg_10_4x4_c: 0.2 ( 1.00x)
avg_10_4x4_neon: 0.2 ( 1.00x)
avg_10_8x8_c: 0.9 ( 1.00x)
avg_10_8x8_neon: 0.2 ( 5.29x)
avg_10_16x16_c: 3.4 ( 1.00x)
avg_10_16x16_neon: 0.4 ( 8.06x)
avg_10_32x32_c: 13.9 ( 1.00x)
avg_10_32x32_neon: 1.9 ( 7.23x)
avg_10_64x64_c: 54.2 ( 1.00x)
avg_10_64x64_neon: 8.4 ( 6.43x)
avg_10_128x128_c: 232.4 ( 1.00x)
avg_10_128x128_neon: 30.9 ( 7.52x)
avg_12_2x2_c: 0.0 ( 0.00x)
avg_12_2x2_neon: 0.2 ( 0.00x)
avg_12_4x4_c: 0.4 ( 1.00x)
avg_12_4x4_neon: 0.2 ( 2.43x)
avg_12_8x8_c: 0.7 ( 1.00x)
avg_12_8x8_neon: 0.2 ( 3.86x)
avg_12_16x16_c: 3.7 ( 1.00x)
avg_12_16x16_neon: 0.4 ( 8.65x)
avg_12_32x32_c: 13.7 ( 1.00x)
avg_12_32x32_neon: 2.2 ( 6.29x)
avg_12_64x64_c: 53.9 ( 1.00x)
avg_12_64x64_neon: 7.7 ( 7.03x)
avg_12_128x128_c: 270.9 ( 1.00x)
avg_12_128x128_neon: 30.4 ( 8.90x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
1be5a2374f
aarch64/vvc: Add put_epel_hv
...
On Apple M1:
put_chroma_hv_8_4x4_c: 1.7 ( 1.00x)
put_chroma_hv_8_4x4_neon: 0.2 ( 7.67x)
put_chroma_hv_8_8x8_c: 5.5 ( 1.00x)
put_chroma_hv_8_8x8_neon: 0.5 (11.53x)
put_chroma_hv_8_16x16_c: 18.5 ( 1.00x)
put_chroma_hv_8_16x16_neon: 1.5 (12.53x)
put_chroma_hv_8_32x32_c: 72.5 ( 1.00x)
put_chroma_hv_8_32x32_neon: 4.7 (15.34x)
put_chroma_hv_8_64x64_c: 274.0 ( 1.00x)
put_chroma_hv_8_64x64_neon: 18.5 (14.83x)
put_chroma_hv_8_128x128_c: 1058.7 ( 1.00x)
put_chroma_hv_8_128x128_neon: 75.2 (14.07x)
On Android Pixel 8 Pro:
put_chroma_hv_8_4x4_c: 1.2 ( 1.00x)
put_chroma_hv_8_4x4_neon: 0.0 ( 0.00x)
put_chroma_hv_8_4x4_i8mm: 0.2 ( 5.00x)
put_chroma_hv_8_8x8_c: 4.0 ( 1.00x)
put_chroma_hv_8_8x8_neon: 0.5 ( 8.00x)
put_chroma_hv_8_8x8_i8mm: 0.5 ( 8.00x)
put_chroma_hv_8_16x16_c: 15.2 ( 1.00x)
put_chroma_hv_8_16x16_neon: 2.5 ( 6.10x)
put_chroma_hv_8_16x16_i8mm: 2.2 ( 6.78x)
put_chroma_hv_8_32x32_c: 61.0 ( 1.00x)
put_chroma_hv_8_32x32_neon: 9.8 ( 6.26x)
put_chroma_hv_8_32x32_i8mm: 8.5 ( 7.18x)
put_chroma_hv_8_64x64_c: 229.5 ( 1.00x)
put_chroma_hv_8_64x64_neon: 38.5 ( 5.96x)
put_chroma_hv_8_64x64_i8mm: 34.0 ( 6.75x)
put_chroma_hv_8_128x128_c: 919.8 ( 1.00x)
put_chroma_hv_8_128x128_neon: 154.5 ( 5.95x)
put_chroma_hv_8_128x128_i8mm: 140.0 ( 6.57x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
0dcf204e5d
aarch64/vvc: Add put_epel_h i8mm
...
put_chroma_h_8_4x4_c: 0.4 ( 1.00x)
put_chroma_h_8_4x4_neon: 0.0 ( 0.00x)
put_chroma_h_8_4x4_i8mm: 0.1 ( 2.67x)
put_chroma_h_8_8x8_c: 1.6 ( 1.00x)
put_chroma_h_8_8x8_neon: 0.1 (11.00x)
put_chroma_h_8_8x8_i8mm: 0.1 (11.00x)
put_chroma_h_8_16x16_c: 6.9 ( 1.00x)
put_chroma_h_8_16x16_neon: 1.1 ( 6.00x)
put_chroma_h_8_16x16_i8mm: 0.7 (10.62x)
put_chroma_h_8_32x32_c: 27.6 ( 1.00x)
put_chroma_h_8_32x32_neon: 4.7 ( 5.95x)
put_chroma_h_8_32x32_i8mm: 4.4 ( 6.28x)
put_chroma_h_8_64x64_c: 116.2 ( 1.00x)
put_chroma_h_8_64x64_neon: 19.1 ( 6.07x)
put_chroma_h_8_64x64_i8mm: 17.1 ( 6.77x)
put_chroma_h_8_128x128_c: 466.6 ( 1.00x)
put_chroma_h_8_128x128_neon: 81.4 ( 5.73x)
put_chroma_h_8_128x128_i8mm: 71.7 ( 6.51x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
41a1885f7a
aarch64/vvc: Add put_epel_h
...
put_chroma_h_8_4x4_c: 0.2 ( 1.00x)
put_chroma_h_8_4x4_neon: 0.2 ( 1.00x)
put_chroma_h_8_8x8_c: 0.8 ( 1.00x)
put_chroma_h_8_8x8_neon: 0.2 ( 3.00x)
put_chroma_h_8_16x16_c: 3.8 ( 1.00x)
put_chroma_h_8_16x16_neon: 0.8 ( 5.00x)
put_chroma_h_8_32x32_c: 12.5 ( 1.00x)
put_chroma_h_8_32x32_neon: 2.2 ( 5.56x)
put_chroma_h_8_64x64_c: 47.0 ( 1.00x)
put_chroma_h_8_64x64_neon: 8.8 ( 5.37x)
put_chroma_h_8_128x128_c: 200.2 ( 1.00x)
put_chroma_h_8_128x128_neon: 31.8 ( 6.31x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
260e1b4b62
aarch64/vvc: Add sad
...
sad_8x16_c: 0.8 ( 1.00x)
sad_8x16_neon: 0.2 ( 3.00x)
sad_16x8_c: 0.5 ( 1.00x)
sad_16x8_neon: 0.2 ( 2.00x)
sad_16x16_c: 1.5 ( 1.00x)
sad_16x16_neon: 0.2 ( 6.00x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
5ac6925803
aarch64/vvc: Add put_qpel_hv
...
With Apple M1 (no i8mm):
put_luma_hv_8_4x4_c: 2.2 ( 1.00x)
put_luma_hv_8_4x4_neon: 0.8 ( 3.00x)
put_luma_hv_8_8x8_c: 7.0 ( 1.00x)
put_luma_hv_8_8x8_neon: 0.8 ( 9.33x)
put_luma_hv_8_16x16_c: 22.8 ( 1.00x)
put_luma_hv_8_16x16_neon: 2.5 ( 9.10x)
put_luma_hv_8_32x32_c: 84.8 ( 1.00x)
put_luma_hv_8_32x32_neon: 9.5 ( 8.92x)
put_luma_hv_8_64x64_c: 333.0 ( 1.00x)
put_luma_hv_8_64x64_neon: 35.5 ( 9.38x)
put_luma_hv_8_128x128_c: 1294.5 ( 1.00x)
put_luma_hv_8_128x128_neon: 137.8 ( 9.40x)
With Pixel 8 Pro:
put_luma_hv_8_4x4_c: 5.0 ( 1.00x)
put_luma_hv_8_4x4_neon: 0.8 ( 6.67x)
put_luma_hv_8_4x4_i8mm: 0.2 (20.00x)
put_luma_hv_8_8x8_c: 13.2 ( 1.00x)
put_luma_hv_8_8x8_neon: 1.2 (10.60x)
put_luma_hv_8_8x8_i8mm: 1.2 (10.60x)
put_luma_hv_8_16x16_c: 44.2 ( 1.00x)
put_luma_hv_8_16x16_neon: 4.5 ( 9.83x)
put_luma_hv_8_16x16_i8mm: 4.2 (10.41x)
put_luma_hv_8_32x32_c: 160.8 ( 1.00x)
put_luma_hv_8_32x32_neon: 17.5 ( 9.19x)
put_luma_hv_8_32x32_i8mm: 16.0 (10.05x)
put_luma_hv_8_64x64_c: 611.2 ( 1.00x)
put_luma_hv_8_64x64_neon: 68.0 ( 8.99x)
put_luma_hv_8_64x64_i8mm: 62.2 ( 9.82x)
put_luma_hv_8_128x128_c: 2384.8 ( 1.00x)
put_luma_hv_8_128x128_neon: 268.8 ( 8.87x)
put_luma_hv_8_128x128_i8mm: 245.8 ( 9.70x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
a0b52afd32
aarch64/vvc: Add put_qpel_vx
...
put_luma_v_8_4x4_c: 1.0 ( 1.00x)
put_luma_v_8_4x4_neon: 0.0 ( 0.00x)
put_luma_v_8_8x8_c: 3.5 ( 1.00x)
put_luma_v_8_8x8_neon: 0.5 ( 7.00x)
put_luma_v_8_16x16_c: 13.8 ( 1.00x)
put_luma_v_8_16x16_neon: 1.2 (11.00x)
put_luma_v_8_32x32_c: 54.2 ( 1.00x)
put_luma_v_8_32x32_neon: 5.0 (10.85x)
put_luma_v_8_64x64_c: 217.5 ( 1.00x)
put_luma_v_8_64x64_neon: 18.8 (11.60x)
put_luma_v_8_128x128_c: 886.2 ( 1.00x)
put_luma_v_8_128x128_neon: 74.0 (11.98x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
9f6c8eb412
aarch64/vvc: Add put_qpel_hx i8mm
...
Benchmark on a Pixel 8 (Android) with -fno-vectorize:
put_luma_h_8_4x4_c: 0.2 ( 1.00x)
put_luma_h_8_4x4_neon: 0.2 ( 1.00x)
put_luma_h_8_4x4_i8mm: 0.0 ( 0.00x)
put_luma_h_8_8x8_c: 1.5 ( 1.00x)
put_luma_h_8_8x8_neon: 0.5 ( 3.00x)
put_luma_h_8_8x8_i8mm: 0.5 ( 3.00x)
put_luma_h_8_16x16_c: 6.2 ( 1.00x)
put_luma_h_8_16x16_neon: 2.0 ( 3.12x)
put_luma_h_8_16x16_i8mm: 1.5 ( 4.17x)
put_luma_h_8_32x32_c: 25.5 ( 1.00x)
put_luma_h_8_32x32_neon: 9.0 ( 2.83x)
put_luma_h_8_32x32_i8mm: 6.8 ( 3.78x)
put_luma_h_8_64x64_c: 99.8 ( 1.00x)
put_luma_h_8_64x64_neon: 35.2 ( 2.83x)
put_luma_h_8_64x64_i8mm: 27.2 ( 3.66x)
put_luma_h_8_128x128_c: 422.0 ( 1.00x)
put_luma_h_8_128x128_neon: 138.5 ( 3.05x)
put_luma_h_8_128x128_i8mm: 109.2 ( 3.86x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
25448d1716
aarch64/vvc: Add put_pel/put_pel_uni/put_pel_uni_w
...
put_luma_pixels_8_4x4_c: 0.2 ( 1.00x)
put_luma_pixels_8_4x4_neon: 0.2 ( 1.00x)
put_luma_pixels_8_8x8_c: 0.7 ( 1.00x)
put_luma_pixels_8_8x8_neon: 0.2 ( 3.22x)
put_luma_pixels_8_16x16_c: 2.2 ( 1.00x)
put_luma_pixels_8_16x16_neon: 0.2 ( 9.89x)
put_luma_pixels_8_32x32_c: 8.2 ( 1.00x)
put_luma_pixels_8_32x32_neon: 1.2 ( 6.71x)
put_luma_pixels_8_64x64_c: 33.7 ( 1.00x)
put_luma_pixels_8_64x64_neon: 2.5 (13.63x)
put_luma_pixels_8_128x128_c: 145.5 ( 1.00x)
put_luma_pixels_8_128x128_neon: 10.2 (14.23x)
put_uni_pixels_luma_8_4x4_c: 0.5 ( 1.00x)
put_uni_pixels_luma_8_4x4_neon: 0.0 ( 0.00x)
put_uni_pixels_luma_8_8x8_c: 0.5 ( 1.00x)
put_uni_pixels_luma_8_8x8_neon: 0.2 ( 2.11x)
put_uni_pixels_luma_8_16x16_c: 1.2 ( 1.00x)
put_uni_pixels_luma_8_16x16_neon: 0.2 ( 5.44x)
put_uni_pixels_luma_8_32x32_c: 3.0 ( 1.00x)
put_uni_pixels_luma_8_32x32_neon: 0.5 ( 6.26x)
put_uni_pixels_luma_8_64x64_c: 3.0 ( 1.00x)
put_uni_pixels_luma_8_64x64_neon: 1.7 ( 1.72x)
put_uni_pixels_luma_8_128x128_c: 6.5 ( 1.00x)
put_uni_pixels_luma_8_128x128_neon: 6.5 ( 1.00x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
20f2bf5530
aarch64/vvc: Add put_qpel_h_* and put_qpel_uni_h_*
...
Just share the HEVC implementation.
checkasm --test=vvc_mc --benchmark:
put_luma_h_8_4x4_c: 0.2 ( 1.00x)
put_luma_h_8_4x4_neon: 0.2 ( 1.00x)
put_luma_h_8_8x8_c: 1.0 ( 1.00x)
put_luma_h_8_8x8_neon: 0.2 ( 4.33x)
put_luma_h_8_16x16_c: 3.2 ( 1.00x)
put_luma_h_8_16x16_neon: 1.2 ( 2.63x)
put_luma_h_8_32x32_c: 13.7 ( 1.00x)
put_luma_h_8_32x32_neon: 4.0 ( 3.45x)
put_luma_h_8_64x64_c: 48.2 ( 1.00x)
put_luma_h_8_64x64_neon: 15.7 ( 3.07x)
put_luma_h_8_128x128_c: 203.5 ( 1.00x)
put_luma_h_8_128x128_neon: 62.0 ( 3.28x)
put_uni_h_luma_8_4x4_c: 0.2 ( 1.00x)
put_uni_h_luma_8_4x4_neon: 0.2 ( 1.00x)
put_uni_h_luma_8_8x8_c: 1.5 ( 1.00x)
put_uni_h_luma_8_8x8_neon: 0.2 ( 6.56x)
put_uni_h_luma_8_16x16_c: 5.7 ( 1.00x)
put_uni_h_luma_8_16x16_neon: 1.2 ( 4.67x)
put_uni_h_luma_8_32x32_c: 24.0 ( 1.00x)
put_uni_h_luma_8_32x32_neon: 4.7 ( 5.07x)
put_uni_h_luma_8_64x64_c: 90.0 ( 1.00x)
put_uni_h_luma_8_64x64_neon: 17.0 ( 5.30x)
put_uni_h_luma_8_128x128_c: 357.7 ( 1.00x)
put_uni_h_luma_8_128x128_neon: 67.5 ( 5.30x)
2024-09-14 16:36:34 +08:00
Zhao Zhili
4c0372281b
aarch64/vvc: Bind h26x/sao filter implementation to vvc
...
Reviewed-by: Martin Storsjö <martin@martin.st>
2024-08-31 16:07:50 +08:00
Martin Storsjö
4acb9b7d10
aarch64: vvc: Fix unnecessary extra spaces
...
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-07-23 16:04:28 +03:00
Martin Storsjö
99598629e8
aarch64: vvc: Consistently use # for immediate constants
...
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-07-23 15:24:37 +03:00
Martin Storsjö
400843151d
aarch64: vvc: Fix compilation of alf.S with MSVC 2022 17.7 and older
...
Use the "ldur" instruction explicitly, instead of having the
assembler implicitly convert "ldr" instructions to "ldur".
This fixes build errors like these:
libavcodec\aarch64\vvc\alf.o.asm(1023) : error A2518: operand 2: Memory offset must be aligned
ldr q22, [x3, #24]
libavcodec\aarch64\vvc\alf.o.asm(1024) : error A2518: operand 2: Memory offset must be aligned
ldr q24, [x2, #24]
libavcodec\aarch64\vvc\alf.o.asm(1393) : error A2518: operand 2: Memory offset must be aligned
ldr q22, [x3, #24]
libavcodec\aarch64\vvc\alf.o.asm(1394) : error A2518: operand 2: Memory offset must be aligned
ldr q24, [x2, #24]
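For illustration only, here is a minimal C sketch of the encoding rule behind these errors. It assumes the standard A64 immediate forms: "ldr" with an unsigned offset takes a 12-bit immediate scaled by the access size (16 bytes for a q register), while "ldur" takes a signed 9-bit unscaled byte offset. The helper names are hypothetical and not part of FFmpeg or any assembler.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: can "ldr Rt, [Xn, #offset]" use the scaled
 * unsigned-offset form? The immediate is 12 bits, scaled by the
 * access size in bytes (16 for a q register). */
static inline bool ldr_scaled_encodable(int64_t offset, int64_t access_bytes)
{
    if (offset < 0 || offset % access_bytes != 0)
        return false;                     /* negative or misaligned offset */
    return offset / access_bytes <= 4095; /* must fit in 12 bits */
}

/* Hypothetical helper: can "ldur Rt, [Xn, #offset]" encode the offset?
 * The immediate is a signed 9-bit byte offset with no scaling. */
static inline bool ldur_encodable(int64_t offset)
{
    return offset >= -256 && offset <= 255;
}
```

With offset 24 and a 16-byte access, ldr_scaled_encodable() returns false and ldur_encodable() returns true, which is why writing "ldur q22, [x3, #24]" explicitly avoids relying on the assembler to make the substitution.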
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-07-23 15:24:33 +03:00
Zhao Zhili
2d4ef304c9
avcodec/vvc: Add aarch64 neon optimization for ALF
...
vvc_alf_filter_chroma_4x4_8_c: 3.0
vvc_alf_filter_chroma_4x4_8_neon: 1.0
vvc_alf_filter_chroma_4x4_10_c: 2.7
vvc_alf_filter_chroma_4x4_10_neon: 1.0
vvc_alf_filter_chroma_4x4_12_c: 2.7
vvc_alf_filter_chroma_4x4_12_neon: 1.0
vvc_alf_filter_chroma_8x8_8_c: 10.2
vvc_alf_filter_chroma_8x8_8_neon: 3.0
vvc_alf_filter_chroma_8x8_10_c: 10.0
vvc_alf_filter_chroma_8x8_10_neon: 2.5
vvc_alf_filter_chroma_8x8_12_c: 10.0
vvc_alf_filter_chroma_8x8_12_neon: 2.5
vvc_alf_filter_chroma_16x16_8_c: 41.7
vvc_alf_filter_chroma_16x16_8_neon: 11.2
vvc_alf_filter_chroma_16x16_10_c: 39.0
vvc_alf_filter_chroma_16x16_10_neon: 10.0
vvc_alf_filter_chroma_16x16_12_c: 40.2
vvc_alf_filter_chroma_16x16_12_neon: 10.2
vvc_alf_filter_chroma_32x32_8_c: 162.0
vvc_alf_filter_chroma_32x32_8_neon: 45.0
vvc_alf_filter_chroma_32x32_10_c: 155.5
vvc_alf_filter_chroma_32x32_10_neon: 39.5
vvc_alf_filter_chroma_32x32_12_c: 155.5
vvc_alf_filter_chroma_32x32_12_neon: 40.0
vvc_alf_filter_chroma_64x64_8_c: 646.0
vvc_alf_filter_chroma_64x64_8_neon: 175.5
vvc_alf_filter_chroma_64x64_10_c: 708.2
vvc_alf_filter_chroma_64x64_10_neon: 166.7
vvc_alf_filter_chroma_64x64_12_c: 619.2
vvc_alf_filter_chroma_64x64_12_neon: 157.2
vvc_alf_filter_chroma_128x128_8_c: 2611.5
vvc_alf_filter_chroma_128x128_8_neon: 698.2
vvc_alf_filter_chroma_128x128_10_c: 2470.0
vvc_alf_filter_chroma_128x128_10_neon: 616.0
vvc_alf_filter_chroma_128x128_12_c: 2531.5
vvc_alf_filter_chroma_128x128_12_neon: 620.2
vvc_alf_filter_luma_8x8_8_c: 25.2
vvc_alf_filter_luma_8x8_8_neon: 4.2
vvc_alf_filter_luma_8x8_10_c: 18.5
vvc_alf_filter_luma_8x8_10_neon: 4.0
vvc_alf_filter_luma_8x8_12_c: 19.0
vvc_alf_filter_luma_8x8_12_neon: 4.0
vvc_alf_filter_luma_16x16_8_c: 106.5
vvc_alf_filter_luma_16x16_8_neon: 16.2
vvc_alf_filter_luma_16x16_10_c: 75.2
vvc_alf_filter_luma_16x16_10_neon: 14.7
vvc_alf_filter_luma_16x16_12_c: 79.7
vvc_alf_filter_luma_16x16_12_neon: 14.7
vvc_alf_filter_luma_32x32_8_c: 400.5
vvc_alf_filter_luma_32x32_8_neon: 63.2
vvc_alf_filter_luma_32x32_10_c: 299.2
vvc_alf_filter_luma_32x32_10_neon: 57.7
vvc_alf_filter_luma_32x32_12_c: 299.2
vvc_alf_filter_luma_32x32_12_neon: 57.7
vvc_alf_filter_luma_64x64_8_c: 1602.5
vvc_alf_filter_luma_64x64_8_neon: 251.7
vvc_alf_filter_luma_64x64_10_c: 1197.0
vvc_alf_filter_luma_64x64_10_neon: 235.5
vvc_alf_filter_luma_64x64_12_c: 1220.2
vvc_alf_filter_luma_64x64_12_neon: 235.7
vvc_alf_filter_luma_128x128_8_c: 6570.2
vvc_alf_filter_luma_128x128_8_neon: 1007.7
vvc_alf_filter_luma_128x128_10_c: 4822.7
vvc_alf_filter_luma_128x128_10_neon: 936.2
vvc_alf_filter_luma_128x128_12_c: 4791.2
vvc_alf_filter_luma_128x128_12_neon: 938.5
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-07-22 21:09:56 +08:00