77 Commits

Author SHA1 Message Date
Niklas Haas
a797e30f71 swscale/aarch64/ops: compute SWS_OP_PACK mask directly
Instead of implicitly relying on SwsComps.unused, which contains the exact
same information. (cf. ff_sws_op_list_update_comps)

Signed-off-by: Niklas Haas <git@haasn.dev>
2026-04-16 23:25:17 +02:00
Niklas Haas
6d1e549195 swscale/aarch64/ops: use SWS_OP_NEEDED() instead of next->comps.unused
These are basically identical, but the latter is being phased out.

Signed-off-by: Niklas Haas <git@haasn.dev>
2026-04-16 23:25:17 +02:00
Niklas Haas
18cc71fc8e swscale/aarch64/ops: fix SWS_OP_LINEAR mask check
The implementation of AARCH64_SWS_OP_LINEAR loops over elements of this mask
to determine which *output* rows to compute. However, it is being set by this
loop to `op->comps.unused`, which is a mask of unused *input* rows. As such,
it should be looking at `next->comps.unused` instead.

This did not result in problems in practice, because none of the linear
matrices happened to trigger this case (more input columns than output rows).

Signed-off-by: Niklas Haas <git@haasn.dev>
2026-04-16 23:25:17 +02:00
Niklas Haas
85bef2c2bc swscale/ops: split SwsConst up into op-specific structs
It was a bit clunky, lacked semantic contextual information, and made it
harder to reason about the effects of extending this struct. There should be
zero runtime overhead as a result of the fact that this is already a big
union.

I made the changes in this commit by hand, but due to the length and noise
level of the commit, I used Opus 4.6 to verify that I did not accidentally
introduce any bugs or typos.

Signed-off-by: Niklas Haas <git@haasn.dev>
2026-04-02 11:48:15 +00:00
Niklas Haas
32ba5c13de swscale/ops_chain: split generic setup helpers into op-specific helpers
This has the side benefit of not relying on the q2pixel macro to avoid division
by zero, since we can now explicitly avoid operating on undefined clear values.

Signed-off-by: Niklas Haas <git@haasn.dev>
2026-04-02 11:48:15 +00:00
Ramiro Polla
53537f6cf5 swscale/aarch64: mark CPS kernel functions as indirect branch targets
Only the process functions are entered via an indirect _call_ from C.
The kernel functions and process_return are dispatched to by indirect
_branches_ instead (continuation-passing style design).

Make use of the recently added "jumpable" parameter to the function
macro in libavutil/aarch64/asm.S to fix these functions when BTI is
enabled.

Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
2026-03-31 11:48:52 +00:00
Ramiro Polla
2517c328fc swscale/aarch64: add NEON sws_ops backend
This commit pieces together the previous few commits to implement the
NEON backend for sws_ops.

In essence, a tool which runs on the target (sws_ops_aarch64) is used
to enumerate all the functions that the backend needs to implement. The
list it generates is stored in the repository (ops_entries.c).

The list from above is used at build time by a code generator tool
(ops_asmgen) to implement all the sws_ops functions the NEON backend
supports, and generate a lookup function in C to retrieve the assembly
function pointers.

At runtime, the NEON backend fetches the function pointers to the
assembly functions and chains them together in a continuation-passing
style design, similar to the x86 backend.

The following speedup is observed from legacy swscale to NEON:
A520: Overall speedup=3.780x faster, min=0.137x max=91.928x
A720: Overall speedup=4.129x faster, min=0.234x max=92.424x

And the following from the C sws_ops implementation to NEON:
A520: Overall speedup=5.513x faster, min=0.927x max=14.169x
A720: Overall speedup=4.786x faster, min=0.585x max=20.157x

The slowdowns from legacy to NEON are the same for C/x86. Mostly low
bit-depth conversions that did not perform dithering in legacy.

The 0.585x outlier from C to NEON is gbrpf32le -> gbrapf32le, which is
mostly memcpy with the C implementation. All other conversions are
better.

Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
2026-03-30 11:38:35 +00:00
Ramiro Polla
534757926f swscale/aarch64: introduce ops_asmgen for NEON backend
The NEON sws_ops backend follows the same continuation-passing style
design as the x86 backend.

Unlike the C and x86 backends, which implement the various operation
functions through the use of templates and preprocessor macros, the
NEON backend uses a build-time code generator, which is introduced by
this commit.

This code generator has two modes of operation:
 -ops:
  Generates an assembly file in GNU assembler syntax targeting AArch64,
  which implements all the sws_ops functions the NEON backend supports.
 -lookup:
  Generates a C function with a hierarchical condition chain that
  returns the pointer to one of the functions generated above, based on
  a given set of parameters derived from SwsOp.

This is the core of the NEON sws_ops backend.

Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
2026-03-30 11:38:35 +00:00
Ramiro Polla
991611536c swscale/aarch64: introduce a runtime aarch64 assembler interface
The runtime assembler interface provides an instruction-level IR and
builder API tailored to the needs of the swscale dynamic pipeline.
It is not meant to be a general purpose assembler interface.

Currently only a static file backend, which emits GNU assembler text,
has been implemented. In the future, this interface will be used to
write functions dynamically at runtime.

This code will be compiled both for runtime usage to generate optimized
functions and for build-time usage to generate static assembly files.
Therefore, it must not depend on internal FFmpeg libraries.

Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
2026-03-30 11:38:35 +00:00
Ramiro Polla
a1bfaa0e78 swscale/aarch64: introduce tool to enumerate sws_ops for NEON backend
The NEON sws_ops backend will use a build-time code generator for the
various operation functions it needs to implement. This build time code
generator (ops_asmgen) will need a list of the operations that must be
implemented. This commit adds a tool (sws_ops_aarch64) that generates
such a list (ops_entries.c).

The list is generated by iterating over all possible conversion
combinations and collecting the parameters for each NEON assembly
function that has to be implemented, defined by an unique set of
parameters derived from SwsOp. Whenever swscale evolves, with improved
optimization passes, new pixel formats, or improvements to the backend
itself, this file (ops_entries.c) should be regenerated by running:
    $ make sws_ops_entries_aarch64

Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
2026-03-30 11:38:35 +00:00
David Christle
2c7fe8d8ad swscale/aarch64: add NEON rgb32tobgr24 and rgb24tobgr32 conversions
Add NEON alpha drop/insert using ldp+tbl+stp instead of ld4/st3 and
ld3/st4 structure operations. Both use a 2-register sliding-window
tbl with post-indexed addressing. Instruction scheduling targets
narrow in-order cores (A55) while remaining neutral on wide OoO.

Scalar tails use coalesced loads/stores (ldr+strh+lsr+strb for alpha
drop, ldrh+ldrb+orr+str for alpha insert) to reduce per-pixel
instruction count. Independent instructions placed between loads and
dependent operations to fill load-use latency on in-order cores.

checkasm --bench on Apple M3 Max (decicycles, 1920px):
  rgb32tobgr24_c:    114.4 ( 1.00x)
  rgb32tobgr24_neon:  64.3 ( 1.78x)
  rgb24tobgr32_c:    128.9 ( 1.00x)
  rgb24tobgr32_neon:  80.9 ( 1.59x)

C baseline is clang auto-vectorized; speedup is over compiler NEON.

Signed-off-by: David Christle <dev@christle.is>
2026-03-04 10:30:08 +00:00
David Christle
ddd720ae61 swscale/aarch64: add NEON rgb24tobgr24 byte-swap
Add a NEON rgb24tobgr24 using ld3/st3 to swap R and B channels in
packed 24bpp RGB buffers. Handles all input sizes with a 16-pixel
NEON fast path, 8-pixel NEON cleanup, and scalar tail.

checkasm --bench on Apple M3 Max (1920*3 = 5760 bytes):
  rgb24tobgr24_c:    722.0 ( 1.00x)
  rgb24tobgr24_neon:  94.9 ( 7.61x)

Signed-off-by: David Christle <dev@christle.is>
2026-03-04 10:30:08 +00:00
David Christle
7fab0becab swscale/aarch64: add NEON YUV420P/YUV422P/YUVA420P to RGB conversion
Add ARM64 NEON-accelerated unscaled YUV-to-RGB conversion for planar
YUV input formats. This extends the existing NV12/NV21 NEON paths with
YUV420P, YUV422P, and YUVA420P support for all packed RGB output
formats (ARGB, RGBA, ABGR, BGRA, RGB24, BGR24) and planar GBRP.

Register with ff_yuv2rgb_init_aarch64() to also cover the scaled path.

checkasm: all 42 sw_yuv2rgb tests pass.
Speedup vs C at 1920px width (Apple M3 Max, avg of 20 runs):
  yuv420p->rgb24:   4.3x    yuv420p->argb:   3.1x
  yuv422p->rgb24:   5.5x    yuv422p->argb:   4.1x
  yuva420p->argb:   3.5x    yuva420p->rgba:  3.5x

Signed-off-by: David Christle <dev@christle.is>
2026-03-02 13:14:07 +00:00
Arpad Panyik
1f30ff30fb swscale: Add AArch64 Neon path for xyz12Torgb48 LE
Add optimized Neon code path for the little endian case of the
xyz12Torgb48 function. The innermost loop processes the data in 4x2
pixel blocks using software gathers with the matrix multiplication
and clipping done by Neon.

Relative runtime of micro benchmarks after this patch on some
Cortex and Neoverse CPU cores:

 xyz12le_rgb48le    X1      X3      X4    X925      V2
 16x4_neon:       2.55x   4.34x   3.84x   3.31x   3.22x
 32x4_neon:       2.39x   3.63x   3.22x   3.35x   3.29x
 64x4_neon:       2.37x   3.31x   2.91x   3.33x   3.27x
 128x4_neon:      2.34x   3.28x   2.91x   3.35x   3.24x
 256x4_neon:      2.30x   3.17x   2.91x   3.32x   3.10x
 512x4_neon:      2.26x   3.10x   2.91x   3.30x   3.07x
 1024x4_neon:     2.26x   3.07x   2.96x   3.30x   3.05x
 1920x4_neon:     2.26x   3.06x   2.93x   3.28x   3.04x

 xyz12le_rgb48le   A76     A78    A715    A720    A725
 16x4_neon:       2.33x   2.28x   2.53x   3.33x   3.19x
 32x4_neon:       2.35x   2.18x   2.45x   3.23x   3.24x
 64x4_neon:       2.35x   2.16x   2.42x   3.15x   3.21x
 128x4_neon:      2.35x   2.13x   2.39x   3.00x   3.09x
 256x4_neon:      2.36x   2.12x   2.35x   2.85x   2.99x
 512x4_neon:      2.35x   2.14x   2.35x   2.78x   2.95x
 1024x4_neon:     2.31x   2.09x   2.33x   2.80x   2.91x
 1920x4_neon:     2.30x   2.07x   2.32x   2.81x   2.94x

 xyz12le_rgb48le   A55    A510    A520
 16x4_neon:       2.09x   1.92x   2.36x
 32x4_neon:       2.05x   1.89x   2.38x
 64x4_neon:       2.02x   1.77x   2.35x
 128x4_neon:      1.96x   1.74x   2.25x
 256x4_neon:      1.90x   1.72x   2.19x
 512x4_neon:      1.83x   1.75x   2.16x
 1024x4_neon:     1.83x   1.62x   2.15x
 1920x4_neon:     1.82x   1.60x   2.15x

Signed-off-by: Arpad Panyik <Arpad.Panyik@arm.com>
2025-12-05 10:28:18 +00:00
Dash Santosh
ca2a88c1b3 swscale/output: Implement yuv2nv12cx neon assembly
yuv2nv12cX_2_512_accurate_c:                          3540.1 ( 1.00x)
yuv2nv12cX_2_512_accurate_neon:                        408.0 ( 8.68x)
yuv2nv12cX_2_512_approximate_c:                       3521.4 ( 1.00x)
yuv2nv12cX_2_512_approximate_neon:                     409.2 ( 8.61x)
yuv2nv12cX_4_512_accurate_c:                          4740.0 ( 1.00x)
yuv2nv12cX_4_512_accurate_neon:                        604.4 ( 7.84x)
yuv2nv12cX_4_512_approximate_c:                       4681.9 ( 1.00x)
yuv2nv12cX_4_512_approximate_neon:                     603.3 ( 7.76x)
yuv2nv12cX_8_512_accurate_c:                          7273.1 ( 1.00x)
yuv2nv12cX_8_512_accurate_neon:                       1012.2 ( 7.19x)
yuv2nv12cX_8_512_approximate_c:                       7223.0 ( 1.00x)
yuv2nv12cX_8_512_approximate_neon:                    1015.8 ( 7.11x)
yuv2nv12cX_16_512_accurate_c:                        13762.0 ( 1.00x)
yuv2nv12cX_16_512_accurate_neon:                      1761.4 ( 7.81x)
yuv2nv12cX_16_512_approximate_c:                     13884.0 ( 1.00x)
yuv2nv12cX_16_512_approximate_neon:                   1766.8 ( 7.86x)

Benchmarked on:
Snapdragon(R) X Elite - X1E80100 - Qualcomm(R) Oryon(TM) CPU
3417 Mhz, 12 Core(s), 12 Logical Processor(s)
2025-08-12 09:05:00 +00:00
Logaprakash Ramajayam
49477972b7 swscale/aarch64/output: Implement neon assembly for yuv2planeX_10_c_template()
yuv2yuvX_8_2_0_512_accurate_c:                        2213.4 ( 1.00x)
yuv2yuvX_8_2_0_512_accurate_neon:                      147.5 (15.01x)
yuv2yuvX_8_2_0_512_approximate_c:                     2203.9 ( 1.00x)
yuv2yuvX_8_2_0_512_approximate_neon:                   154.1 (14.30x)
yuv2yuvX_8_2_16_512_accurate_c:                       2147.2 ( 1.00x)
yuv2yuvX_8_2_16_512_accurate_neon:                     150.8 (14.24x)
yuv2yuvX_8_2_16_512_approximate_c:                    2149.7 ( 1.00x)
yuv2yuvX_8_2_16_512_approximate_neon:                  146.8 (14.64x)
yuv2yuvX_8_2_32_512_accurate_c:                       2078.9 ( 1.00x)
yuv2yuvX_8_2_32_512_accurate_neon:                     139.0 (14.95x)
yuv2yuvX_8_2_32_512_approximate_c:                    2083.7 ( 1.00x)
yuv2yuvX_8_2_32_512_approximate_neon:                  140.5 (14.84x)
yuv2yuvX_8_2_48_512_accurate_c:                       2010.7 ( 1.00x)
yuv2yuvX_8_2_48_512_accurate_neon:                     138.2 (14.55x)
yuv2yuvX_8_2_48_512_approximate_c:                    2012.6 ( 1.00x)
yuv2yuvX_8_2_48_512_approximate_neon:                  141.2 (14.26x)
yuv2yuvX_10LE_16_0_512_accurate_c:                    7874.1 ( 1.00x)
yuv2yuvX_10LE_16_0_512_accurate_neon:                  831.6 ( 9.47x)
yuv2yuvX_10LE_16_0_512_approximate_c:                 7918.1 ( 1.00x)
yuv2yuvX_10LE_16_0_512_approximate_neon:               836.1 ( 9.47x)
yuv2yuvX_10LE_16_16_512_accurate_c:                   7630.9 ( 1.00x)
yuv2yuvX_10LE_16_16_512_accurate_neon:                 804.5 ( 9.49x)
yuv2yuvX_10LE_16_16_512_approximate_c:                7724.7 ( 1.00x)
yuv2yuvX_10LE_16_16_512_approximate_neon:              808.6 ( 9.55x)
yuv2yuvX_10LE_16_32_512_accurate_c:                   7436.4 ( 1.00x)
yuv2yuvX_10LE_16_32_512_accurate_neon:                 780.4 ( 9.53x)
yuv2yuvX_10LE_16_32_512_approximate_c:                7366.7 ( 1.00x)
yuv2yuvX_10LE_16_32_512_approximate_neon:              780.5 ( 9.44x)
yuv2yuvX_10LE_16_48_512_accurate_c:                   7099.9 ( 1.00x)
yuv2yuvX_10LE_16_48_512_accurate_neon:                 761.0 ( 9.33x)
yuv2yuvX_10LE_16_48_512_approximate_c:                7097.6 ( 1.00x)
yuv2yuvX_10LE_16_48_512_approximate_neon:              754.6 ( 9.41x)

Benchmarked on:
Snapdragon(R) X Elite - X1E80100 - Qualcomm(R) Oryon(TM) CPU
3417 Mhz, 12 Core(s), 12 Logical Processor(s)
2025-08-12 09:05:00 +00:00
Timo Rothenpieler
262d41c804 all: fix typos found by codespell 2025-08-03 13:48:47 +02:00
Martin Storsjö
73f4668ef8 swscale: aarch64: Simplify the assignment of lumToYV12
We normally don't need else statements here; the common pattern
is to assign lower level SIMD implementations first, then
conditionally reassign higher level ones afterwards, if supported.

Signed-off-by: Martin Storsjö <martin@martin.st>
2025-03-10 14:03:58 +02:00
Krzysztof Pyrkosz
d765e5f043 swscale/aarch64: dotprod implementation of rgba32_to_Y
The idea is to split the 16 bit coefficients into lower and upper half,
invoke udot for the lower half, shift by 8, and follow by udot for the
upper half.

Benchmark on A78:
bgra_to_y_128_c:                                       682.0 ( 1.00x)
bgra_to_y_128_neon:                                    181.2 ( 3.76x)
bgra_to_y_128_dotprod:                                 117.8 ( 5.79x)
bgra_to_y_1080_c:                                     5742.5 ( 1.00x)
bgra_to_y_1080_neon:                                  1472.5 ( 3.90x)
bgra_to_y_1080_dotprod:                                906.5 ( 6.33x)
bgra_to_y_1920_c:                                    10194.0 ( 1.00x)
bgra_to_y_1920_neon:                                  2589.8 ( 3.94x)
bgra_to_y_1920_dotprod:                               1573.8 ( 6.48x)

Signed-off-by: Martin Storsjö <martin@martin.st>
2025-03-04 10:16:44 +02:00
Krzysztof Pyrkosz
38929b824b swscale/aarch64: Refactor hscale_16_to_15__fs_4
This patch removes the use of stack for temporary state and replaces
interleaved ld4 loads with ld1.

Before/after:

A78
hscale_16_to_15__fs_4_dstW_8_neon:                      86.8 ( 1.72x)
hscale_16_to_15__fs_4_dstW_24_neon:                    147.5 ( 2.73x)
hscale_16_to_15__fs_4_dstW_128_neon:                   614.0 ( 3.14x)
hscale_16_to_15__fs_4_dstW_144_neon:                   680.5 ( 3.18x)
hscale_16_to_15__fs_4_dstW_256_neon:                  1193.2 ( 3.19x)
hscale_16_to_15__fs_4_dstW_512_neon:                  2305.0 ( 3.27x)

hscale_16_to_15__fs_4_dstW_8_neon:                      86.0 ( 1.74x)
hscale_16_to_15__fs_4_dstW_24_neon:                    106.8 ( 3.78x)
hscale_16_to_15__fs_4_dstW_128_neon:                   404.0 ( 4.81x)
hscale_16_to_15__fs_4_dstW_144_neon:                   451.8 ( 4.80x)
hscale_16_to_15__fs_4_dstW_256_neon:                   760.5 ( 5.06x)
hscale_16_to_15__fs_4_dstW_512_neon:                  1520.0 ( 5.01x)

A72
hscale_16_to_15__fs_4_dstW_8_neon:                     156.8 ( 1.52x)
hscale_16_to_15__fs_4_dstW_24_neon:                    217.8 ( 2.52x)
hscale_16_to_15__fs_4_dstW_128_neon:                   906.8 ( 2.90x)
hscale_16_to_15__fs_4_dstW_144_neon:                  1014.5 ( 2.91x)
hscale_16_to_15__fs_4_dstW_256_neon:                  1751.5 ( 2.96x)
hscale_16_to_15__fs_4_dstW_512_neon:                  3469.3 ( 2.97x)

hscale_16_to_15__fs_4_dstW_8_neon:                     151.2 ( 1.54x)
hscale_16_to_15__fs_4_dstW_24_neon:                    173.4 ( 3.15x)
hscale_16_to_15__fs_4_dstW_128_neon:                   660.0 ( 3.98x)
hscale_16_to_15__fs_4_dstW_144_neon:                   735.7 ( 4.00x)
hscale_16_to_15__fs_4_dstW_256_neon:                  1273.5 ( 4.09x)
hscale_16_to_15__fs_4_dstW_512_neon:                  2488.2 ( 4.16x)

Signed-off-by: Martin Storsjö <martin@martin.st>
2025-03-02 01:17:29 +02:00
Martin Storsjö
b137347278 aarch64: Fix a few misindented lines
Signed-off-by: Martin Storsjö <martin@martin.st>
2025-02-28 23:23:09 +02:00
Krzysztof Pyrkosz
b92577405b swscale/aarch64/rgb2rgb_neon: Implemented {yuyv, uyvy}toyuv{420, 422}
A78:
uyvytoyuv420_neon:                                    6112.5 ( 6.96x)
uyvytoyuv422_neon:                                    6696.0 ( 6.32x)
yuyvtoyuv420_neon:                                    6113.0 ( 6.95x)
yuyvtoyuv422_neon:                                    6695.2 ( 6.31x)

A72:
uyvytoyuv420_neon:                                    9512.1 ( 6.09x)
uyvytoyuv422_neon:                                    9766.8 ( 6.32x)
yuyvtoyuv420_neon:                                    9639.1 ( 6.00x)
yuyvtoyuv422_neon:                                    9779.0 ( 6.03x)

A53:
uyvytoyuv420_neon:                                   12720.1 ( 9.10x)
uyvytoyuv422_neon:                                   14282.9 ( 6.71x)
yuyvtoyuv420_neon:                                   12637.4 ( 9.15x)
yuyvtoyuv422_neon:                                   14127.6 ( 6.77x)

Signed-off-by: Martin Storsjö <martin@martin.st>
2025-02-17 11:39:42 +02:00
Krzysztof Pyrkosz
64107e22f5 swscale/aarch64/rgb24toyv12: skip early right shift by 2
It's a minor improvement that shaves off 5-8% from the execution time.
Instead of shifting by 2 right away and by 7 soon after, shift by 9 one
time.

Times before and after:

A78:
rgb24toyv12_16_200_neon:                              5366.8 ( 3.62x)
rgb24toyv12_128_60_neon:                             13574.0 ( 3.34x)
rgb24toyv12_512_16_neon:                             14463.8 ( 3.33x)
rgb24toyv12_1920_4_neon:                             13508.2 ( 3.34x)
rgb24toyv12_1920_4_negstride_neon:                   13525.0 ( 3.34x)

rgb24toyv12_16_200_neon:                              5293.8 ( 3.66x)
rgb24toyv12_128_60_neon:                             12955.0 ( 3.50x)
rgb24toyv12_512_16_neon:                             13784.0 ( 3.50x)
rgb24toyv12_1920_4_neon:                             12900.8 ( 3.49x)
rgb24toyv12_1920_4_negstride_neon:                   12902.8 ( 3.49x)

A72:
rgb24toyv12_16_200_neon:                              9695.8 ( 2.50x)
rgb24toyv12_128_60_neon:                             20286.6 ( 2.70x)
rgb24toyv12_512_16_neon:                             22276.6 ( 2.57x)
rgb24toyv12_1920_4_neon:                             19154.1 ( 2.77x)
rgb24toyv12_1920_4_negstride_neon:                   19055.1 ( 2.78x)

rgb24toyv12_16_200_neon:                              9214.8 ( 2.65x)
rgb24toyv12_128_60_neon:                             20731.5 ( 2.65x)
rgb24toyv12_512_16_neon:                             21145.0 ( 2.70x)
rgb24toyv12_1920_4_neon:                             17586.5 ( 2.99x)
rgb24toyv12_1920_4_negstride_neon:                   17571.0 ( 2.98x)

A53:
rgb24toyv12_16_200_neon:                             12880.4 ( 3.76x)
rgb24toyv12_128_60_neon:                             27776.3 ( 3.94x)
rgb24toyv12_512_16_neon:                             29411.3 ( 3.94x)
rgb24toyv12_1920_4_neon:                             27253.1 ( 3.98x)
rgb24toyv12_1920_4_negstride_neon:                   27474.3 ( 3.95x)

rgb24toyv12_16_200_neon:                             12196.3 ( 3.95x)
rgb24toyv12_128_60_neon:                             26943.1 ( 4.07x)
rgb24toyv12_512_16_neon:                             28642.3 ( 4.07x)
rgb24toyv12_1920_4_neon:                             26676.6 ( 4.08x)
rgb24toyv12_1920_4_negstride_neon:                   26713.8 ( 4.07x)

Signed-off-by: Martin Storsjö <martin@martin.st>
2025-02-17 10:49:41 +02:00
Krzysztof Pyrkosz
c85a748979 swscale/aarch64/rgb2rgb: Implemented NEON shuf routines
The key idea is to pass the pre-generated tables to the TBL instruction
and churn through the data 16 bytes at a time. The remaining 4 elements
are handled with a specialized block located at the end of the routine.

The 3210 variant can be implemented using rev32, but surprisingly it is
slower than the generic TBL on A78, but much faster on A72.

There may be some room for improvement. Possibly instead of handling
last 8 and then 4 bytes separately, we can load these 4 into {v0.s}[2]
and process along with the last 8 bytes.

Speeds measured with checkasm --test=sw_rgb --bench --runs=10 | grep shuf

- A78
shuffle_bytes_0321_c:                                   75.5 ( 1.00x)
shuffle_bytes_0321_neon:                                26.5 ( 2.85x)
shuffle_bytes_1203_c:                                  136.2 ( 1.00x)
shuffle_bytes_1203_neon:                                27.2 ( 5.00x)
shuffle_bytes_1230_c:                                  135.5 ( 1.00x)
shuffle_bytes_1230_neon:                                28.0 ( 4.84x)
shuffle_bytes_2013_c:                                  138.8 ( 1.00x)
shuffle_bytes_2013_neon:                                22.0 ( 6.31x)
shuffle_bytes_2103_c:                                   76.5 ( 1.00x)
shuffle_bytes_2103_neon:                                20.5 ( 3.73x)
shuffle_bytes_2130_c:                                  137.5 ( 1.00x)
shuffle_bytes_2130_neon:                                28.0 ( 4.91x)
shuffle_bytes_3012_c:                                  138.2 ( 1.00x)
shuffle_bytes_3012_neon:                                21.5 ( 6.43x)
shuffle_bytes_3102_c:                                  138.2 ( 1.00x)
shuffle_bytes_3102_neon:                                27.2 ( 5.07x)
shuffle_bytes_3210_c:                                  138.0 ( 1.00x)
shuffle_bytes_3210_neon:                                22.0 ( 6.27x)

shuf3210 using rev32
shuffle_bytes_3210_c:                                  139.0 ( 1.00x)
shuffle_bytes_3210_neon:                                28.5 ( 4.88x)

- A72
shuffle_bytes_0321_c:                                  120.0 ( 1.00x)
shuffle_bytes_0321_neon:                                36.0 ( 3.33x)
shuffle_bytes_1203_c:                                  188.2 ( 1.00x)
shuffle_bytes_1203_neon:                                37.8 ( 4.99x)
shuffle_bytes_1230_c:                                  195.0 ( 1.00x)
shuffle_bytes_1230_neon:                                36.0 ( 5.42x)
shuffle_bytes_2013_c:                                  195.8 ( 1.00x)
shuffle_bytes_2013_neon:                                43.5 ( 4.50x)
shuffle_bytes_2103_c:                                  117.2 ( 1.00x)
shuffle_bytes_2103_neon:                                53.5 ( 2.19x)
shuffle_bytes_2130_c:                                  203.2 ( 1.00x)
shuffle_bytes_2130_neon:                                37.8 ( 5.38x)
shuffle_bytes_3012_c:                                  183.8 ( 1.00x)
shuffle_bytes_3012_neon:                                46.8 ( 3.93x)
shuffle_bytes_3102_c:                                  180.8 ( 1.00x)
shuffle_bytes_3102_neon:                                37.8 ( 4.79x)
shuffle_bytes_3210_c:                                  195.8 ( 1.00x)
shuffle_bytes_3210_neon:                                37.8 ( 5.19x)

shuf3210 using rev32
shuffle_bytes_3210_c:                                  194.8 ( 1.00x)
shuffle_bytes_3210_neon:                                30.8 ( 6.33x)

- x13s:
shuffle_bytes_0321_c:                                   49.4 ( 1.00x)
shuffle_bytes_0321_neon:                                18.1 ( 2.72x)
shuffle_bytes_1203_c:                                   98.4 ( 1.00x)
shuffle_bytes_1203_neon:                                18.4 ( 5.35x)
shuffle_bytes_1230_c:                                   97.4 ( 1.00x)
shuffle_bytes_1230_neon:                                19.1 ( 5.09x)
shuffle_bytes_2013_c:                                  101.4 ( 1.00x)
shuffle_bytes_2013_neon:                                16.9 ( 6.01x)
shuffle_bytes_2103_c:                                   53.9 ( 1.00x)
shuffle_bytes_2103_neon:                                13.9 ( 3.88x)
shuffle_bytes_2130_c:                                  100.9 ( 1.00x)
shuffle_bytes_2130_neon:                                19.1 ( 5.27x)
shuffle_bytes_3012_c:                                   97.4 ( 1.00x)
shuffle_bytes_3012_neon:                                17.1 ( 5.69x)
shuffle_bytes_3102_c:                                  100.9 ( 1.00x)
shuffle_bytes_3102_neon:                                19.1 ( 5.27x)
shuffle_bytes_3210_c:                                  100.6 ( 1.00x)
shuffle_bytes_3210_neon:                                16.9 ( 5.96x)

shuf3210 using rev32
shuffle_bytes_3210_c:                                  100.6 ( 1.00x)
shuffle_bytes_3210_neon:                                18.6 ( 5.40x)

Signed-off-by: Martin Storsjö <martin@martin.st>
2025-02-07 12:54:55 +02:00
Krzysztof Pyrkosz
e25a19fc7c swscale/aarch64/output.S: refactor ff_yuv2plane1_8_neon
The benchmarks (before vs after) were gathered using
./tests/checkasm/checkasm --test=sw_scale --bench --runs=6 | grep yuv2yuv1

A78 before:
yuv2yuv1_0_512_accurate_c:                            2039.5 ( 1.00x)
yuv2yuv1_0_512_accurate_neon:                          385.5 ( 5.29x)
yuv2yuv1_0_512_approximate_c:                         2110.5 ( 1.00x)
yuv2yuv1_0_512_approximate_neon:                       385.5 ( 5.47x)
yuv2yuv1_3_512_accurate_c:                            2061.2 ( 1.00x)
yuv2yuv1_3_512_accurate_neon:                          381.2 ( 5.41x)
yuv2yuv1_3_512_approximate_c:                         2099.2 ( 1.00x)
yuv2yuv1_3_512_approximate_neon:                       381.2 ( 5.51x)
yuv2yuv1_8_512_accurate_c:                            2054.2 ( 1.00x)
yuv2yuv1_8_512_accurate_neon:                          385.5 ( 5.33x)
yuv2yuv1_8_512_approximate_c:                         2112.2 ( 1.00x)
yuv2yuv1_8_512_approximate_neon:                       385.5 ( 5.48x)
yuv2yuv1_11_512_accurate_c:                           2036.0 ( 1.00x)
yuv2yuv1_11_512_accurate_neon:                         381.2 ( 5.34x)
yuv2yuv1_11_512_approximate_c:                        2115.0 ( 1.00x)
yuv2yuv1_11_512_approximate_neon:                      381.2 ( 5.55x)
yuv2yuv1_16_512_accurate_c:                           2066.5 ( 1.00x)
yuv2yuv1_16_512_accurate_neon:                         385.5 ( 5.36x)
yuv2yuv1_16_512_approximate_c:                        2100.8 ( 1.00x)
yuv2yuv1_16_512_approximate_neon:                      385.5 ( 5.45x)
yuv2yuv1_19_512_accurate_c:                           2059.8 ( 1.00x)
yuv2yuv1_19_512_accurate_neon:                         381.2 ( 5.40x)
yuv2yuv1_19_512_approximate_c:                        2102.8 ( 1.00x)
yuv2yuv1_19_512_approximate_neon:                      381.2 ( 5.52x)

After:
yuv2yuv1_0_512_accurate_c:                            2206.0 ( 1.00x)
yuv2yuv1_0_512_accurate_neon:                          139.2 (15.84x)
yuv2yuv1_0_512_approximate_c:                         2050.0 ( 1.00x)
yuv2yuv1_0_512_approximate_neon:                       139.2 (14.72x)
yuv2yuv1_3_512_accurate_c:                            2205.2 ( 1.00x)
yuv2yuv1_3_512_accurate_neon:                          138.0 (15.98x)
yuv2yuv1_3_512_approximate_c:                         2052.5 ( 1.00x)
yuv2yuv1_3_512_approximate_neon:                       138.0 (14.87x)
yuv2yuv1_8_512_accurate_c:                            2171.0 ( 1.00x)
yuv2yuv1_8_512_accurate_neon:                          139.2 (15.59x)
yuv2yuv1_8_512_approximate_c:                         2064.2 ( 1.00x)
yuv2yuv1_8_512_approximate_neon:                       139.2 (14.82x)
yuv2yuv1_11_512_accurate_c:                           2164.8 ( 1.00x)
yuv2yuv1_11_512_accurate_neon:                         138.0 (15.69x)
yuv2yuv1_11_512_approximate_c:                        2048.8 ( 1.00x)
yuv2yuv1_11_512_approximate_neon:                      138.0 (14.85x)
yuv2yuv1_16_512_accurate_c:                           2154.5 ( 1.00x)
yuv2yuv1_16_512_accurate_neon:                         139.2 (15.47x)
yuv2yuv1_16_512_approximate_c:                        2047.2 ( 1.00x)
yuv2yuv1_16_512_approximate_neon:                      139.2 (14.70x)
yuv2yuv1_19_512_accurate_c:                           2144.5 ( 1.00x)
yuv2yuv1_19_512_accurate_neon:                         138.0 (15.54x)
yuv2yuv1_19_512_approximate_c:                        2046.0 ( 1.00x)
yuv2yuv1_19_512_approximate_neon:                      138.0 (14.83x)

A72 before:
yuv2yuv1_0_512_accurate_c:                            3779.8 ( 1.00x)
yuv2yuv1_0_512_accurate_neon:                          527.8 ( 7.16x)
yuv2yuv1_0_512_approximate_c:                         4128.2 ( 1.00x)
yuv2yuv1_0_512_approximate_neon:                       528.2 ( 7.81x)
yuv2yuv1_3_512_accurate_c:                            3836.2 ( 1.00x)
yuv2yuv1_3_512_accurate_neon:                          527.0 ( 7.28x)
yuv2yuv1_3_512_approximate_c:                         3991.0 ( 1.00x)
yuv2yuv1_3_512_approximate_neon:                       526.8 ( 7.58x)
yuv2yuv1_8_512_accurate_c:                            3732.8 ( 1.00x)
yuv2yuv1_8_512_accurate_neon:                          525.5 ( 7.10x)
yuv2yuv1_8_512_approximate_c:                         4060.0 ( 1.00x)
yuv2yuv1_8_512_approximate_neon:                       527.0 ( 7.70x)
yuv2yuv1_11_512_accurate_c:                           3836.2 ( 1.00x)
yuv2yuv1_11_512_accurate_neon:                         530.0 ( 7.24x)
yuv2yuv1_11_512_approximate_c:                        4014.0 ( 1.00x)
yuv2yuv1_11_512_approximate_neon:                      530.0 ( 7.57x)
yuv2yuv1_16_512_accurate_c:                           3726.2 ( 1.00x)
yuv2yuv1_16_512_accurate_neon:                         525.5 ( 7.09x)
yuv2yuv1_16_512_approximate_c:                        4114.2 ( 1.00x)
yuv2yuv1_16_512_approximate_neon:                      526.2 ( 7.82x)
yuv2yuv1_19_512_accurate_c:                           3812.2 ( 1.00x)
yuv2yuv1_19_512_accurate_neon:                         530.0 ( 7.19x)
yuv2yuv1_19_512_approximate_c:                        4012.2 ( 1.00x)
yuv2yuv1_19_512_approximate_neon:                      530.0 ( 7.57x)

After:
yuv2yuv1_0_512_accurate_c:                            3716.8 ( 1.00x)
yuv2yuv1_0_512_accurate_neon:                          215.1 (17.28x)
yuv2yuv1_0_512_approximate_c:                         3877.8 ( 1.00x)
yuv2yuv1_0_512_approximate_neon:                       222.8 (17.40x)
yuv2yuv1_3_512_accurate_c:                            3717.1 ( 1.00x)
yuv2yuv1_3_512_accurate_neon:                          217.8 (17.06x)
yuv2yuv1_3_512_approximate_c:                         3801.6 ( 1.00x)
yuv2yuv1_3_512_approximate_neon:                       220.3 (17.25x)
yuv2yuv1_8_512_accurate_c:                            3716.6 ( 1.00x)
yuv2yuv1_8_512_accurate_neon:                          213.8 (17.38x)
yuv2yuv1_8_512_approximate_c:                         3831.8 ( 1.00x)
yuv2yuv1_8_512_approximate_neon:                       218.1 (17.57x)
yuv2yuv1_11_512_accurate_c:                           3717.1 ( 1.00x)
yuv2yuv1_11_512_accurate_neon:                         219.1 (16.97x)
yuv2yuv1_11_512_approximate_c:                        3801.6 ( 1.00x)
yuv2yuv1_11_512_approximate_neon:                      216.1 (17.59x)
yuv2yuv1_16_512_accurate_c:                           3716.6 ( 1.00x)
yuv2yuv1_16_512_accurate_neon:                         213.6 (17.40x)
yuv2yuv1_16_512_approximate_c:                        3831.6 ( 1.00x)
yuv2yuv1_16_512_approximate_neon:                      215.1 (17.82x)
yuv2yuv1_19_512_accurate_c:                           3717.1 ( 1.00x)
yuv2yuv1_19_512_accurate_neon:                         223.8 (16.61x)
yuv2yuv1_19_512_approximate_c:                        3801.6 ( 1.00x)
yuv2yuv1_19_512_approximate_neon:                      219.1 (17.35x)

x13s before:
yuv2yuv1_0_512_accurate_c:                            1435.1 ( 1.00x)
yuv2yuv1_0_512_accurate_neon:                          221.1 ( 6.49x)
yuv2yuv1_0_512_approximate_c:                         1405.4 ( 1.00x)
yuv2yuv1_0_512_approximate_neon:                       219.1 ( 6.41x)
yuv2yuv1_3_512_accurate_c:                            1418.6 ( 1.00x)
yuv2yuv1_3_512_accurate_neon:                          215.9 ( 6.57x)
yuv2yuv1_3_512_approximate_c:                         1405.9 ( 1.00x)
yuv2yuv1_3_512_approximate_neon:                       224.1 ( 6.27x)
yuv2yuv1_8_512_accurate_c:                            1433.9 ( 1.00x)
yuv2yuv1_8_512_accurate_neon:                          218.6 ( 6.56x)
yuv2yuv1_8_512_approximate_c:                         1412.9 ( 1.00x)
yuv2yuv1_8_512_approximate_neon:                       218.9 ( 6.46x)
yuv2yuv1_11_512_accurate_c:                           1449.1 ( 1.00x)
yuv2yuv1_11_512_accurate_neon:                         217.6 ( 6.66x)
yuv2yuv1_11_512_approximate_c:                        1410.9 ( 1.00x)
yuv2yuv1_11_512_approximate_neon:                      221.1 ( 6.38x)
yuv2yuv1_16_512_accurate_c:                           1402.1 ( 1.00x)
yuv2yuv1_16_512_accurate_neon:                         214.6 ( 6.53x)
yuv2yuv1_16_512_approximate_c:                        1422.4 ( 1.00x)
yuv2yuv1_16_512_approximate_neon:                      222.9 ( 6.38x)
yuv2yuv1_19_512_accurate_c:                           1421.6 ( 1.00x)
yuv2yuv1_19_512_accurate_neon:                         217.4 ( 6.54x)
yuv2yuv1_19_512_approximate_c:                        1421.6 ( 1.00x)
yuv2yuv1_19_512_approximate_neon:                      221.4 ( 6.42x)

After:
yuv2yuv1_0_512_accurate_c:                            1413.6 ( 1.00x)
yuv2yuv1_0_512_accurate_neon:                           80.6 (17.53x)
yuv2yuv1_0_512_approximate_c:                         1455.6 ( 1.00x)
yuv2yuv1_0_512_approximate_neon:                        80.6 (18.05x)
yuv2yuv1_3_512_accurate_c:                            1429.1 ( 1.00x)
yuv2yuv1_3_512_accurate_neon:                           77.4 (18.47x)
yuv2yuv1_3_512_approximate_c:                         1462.6 ( 1.00x)
yuv2yuv1_3_512_approximate_neon:                        80.6 (18.14x)
yuv2yuv1_8_512_accurate_c:                            1425.4 ( 1.00x)
yuv2yuv1_8_512_accurate_neon:                           77.9 (18.30x)
yuv2yuv1_8_512_approximate_c:                         1436.6 ( 1.00x)
yuv2yuv1_8_512_approximate_neon:                        80.9 (17.76x)
yuv2yuv1_11_512_accurate_c:                           1429.4 ( 1.00x)
yuv2yuv1_11_512_accurate_neon:                          76.1 (18.78x)
yuv2yuv1_11_512_approximate_c:                        1447.1 ( 1.00x)
yuv2yuv1_11_512_approximate_neon:                       78.4 (18.46x)
yuv2yuv1_16_512_accurate_c:                           1439.9 ( 1.00x)
yuv2yuv1_16_512_accurate_neon:                          77.6 (18.55x)
yuv2yuv1_16_512_approximate_c:                        1422.1 ( 1.00x)
yuv2yuv1_16_512_approximate_neon:                       78.1 (18.20x)
yuv2yuv1_19_512_accurate_c:                           1447.1 ( 1.00x)
yuv2yuv1_19_512_accurate_neon:                          78.1 (18.52x)
yuv2yuv1_19_512_approximate_c:                        1474.4 ( 1.00x)
yuv2yuv1_19_512_approximate_neon:                       78.1 (18.87x)

Signed-off-by: Martin Storsjö <martin@martin.st>
2025-02-07 12:05:06 +02:00
Ramiro Polla
ca889b1328 swscale/aarch64: add neon {lum,chr}ConvertRange16
aarch64 A55:
chrRangeFromJpeg16_1920_c:    32684.2
chrRangeFromJpeg16_1920_neon:  8431.2 (3.88x)
chrRangeToJpeg16_1920_c:      24996.8
chrRangeToJpeg16_1920_neon:    9395.0 (2.66x)
lumRangeFromJpeg16_1920_c:    17305.2
lumRangeFromJpeg16_1920_neon:  4586.5 (3.77x)
lumRangeToJpeg16_1920_c:      21144.8
lumRangeToJpeg16_1920_neon:    5069.8 (4.17x)

aarch64 A76:
chrRangeFromJpeg16_1920_c:    11523.8
chrRangeFromJpeg16_1920_neon:  3367.5 (3.42x)
chrRangeToJpeg16_1920_c:      11655.2
chrRangeToJpeg16_1920_neon:    4087.2 (2.85x)
lumRangeFromJpeg16_1920_c:     5762.0
lumRangeFromJpeg16_1920_neon:  1815.8 (3.17x)
lumRangeToJpeg16_1920_c:       5946.2
lumRangeToJpeg16_1920_neon:    2148.2 (2.77x)
2024-12-05 21:10:29 +01:00
Ramiro Polla
6fe4a4ffb6 swscale/aarch64/range_convert: update neon range_convert functions to new API
aarch64 A55:
chrRangeFromJpeg8_1920_c:    28835.2 (1.00x)
chrRangeFromJpeg8_1920_neon:  5313.9 (5.43x)  5308.4 (5.43x)
chrRangeToJpeg8_1920_c:      23074.7 (1.00x)
chrRangeToJpeg8_1920_neon:    5551.3 (4.16x)  5549.2 (4.16x)
lumRangeFromJpeg8_1920_c:    15389.7 (1.00x)
lumRangeFromJpeg8_1920_neon:  3152.3 (4.88x)  3147.7 (4.89x)
lumRangeToJpeg8_1920_c:      19227.8 (1.00x)
lumRangeToJpeg8_1920_neon:    3628.7 (5.30x)  3630.2 (5.30x)

aarch64 A76:
chrRangeFromJpeg8_1920_c:    6324.4 (1.00x)
chrRangeFromJpeg8_1920_neon: 2344.5 (2.70x) 2304.2 (2.74x)
chrRangeToJpeg8_1920_c:      9656.0 (1.00x)
chrRangeToJpeg8_1920_neon:   2824.2 (3.42x) 2794.2 (3.46x)
lumRangeFromJpeg8_1920_c:    4422.0 (1.00x)
lumRangeFromJpeg8_1920_neon: 1104.5 (4.00x) 1106.2 (4.00x)
lumRangeToJpeg8_1920_c:      5949.1 (1.00x)
lumRangeToJpeg8_1920_neon:   1329.8 (4.47x) 1328.2 (4.48x)
2024-12-05 21:10:29 +01:00
Ramiro Polla
384fe39623 swscale/range_convert: fix mpeg ranges in yuv range conversion for non-8-bit pixel formats
There is an issue with the constants used in YUV to YUV range conversion,
where the upper bound is not respected when converting to mpeg range.

With this commit, the constants are calculated at runtime, depending on
the bit depth. This approach also allows us to more easily understand how
the constants are derived.

For bit depths <= 14, the number of fixed point bits has been set to 14
for all conversions, to simplify the code.
For bit depths > 14, the number of fixed points bits has been raised and
set to 18, to allow for the conversion to be accurate enough for the mpeg
range to be respected.

The convert functions now take the conversion constants (coeff and offset)
as function arguments.
For bit depths <= 14, coeff is unsigned 16-bit and offset is 32-bit.
For bit depths > 14, coeff is unsigned 32-bit and offset is 64-bit.

x86_64:
chrRangeFromJpeg8_1920_c:    2127.4   2125.0  (1.00x)
chrRangeFromJpeg16_1920_c:   2325.2   2127.2  (1.09x)
chrRangeToJpeg8_1920_c:      3166.9   3168.7  (1.00x)
chrRangeToJpeg16_1920_c:     2152.4   3164.8  (0.68x)
lumRangeFromJpeg8_1920_c:    1263.0   1302.5  (0.97x)
lumRangeFromJpeg16_1920_c:   1080.5   1299.2  (0.83x)
lumRangeToJpeg8_1920_c:      1886.8   2112.2  (0.89x)
lumRangeToJpeg16_1920_c:     1077.0   1906.5  (0.56x)

aarch64 A55:
chrRangeFromJpeg8_1920_c:   28835.2  28835.6  (1.00x)
chrRangeFromJpeg16_1920_c:  28839.8  32680.8  (0.88x)
chrRangeToJpeg8_1920_c:     23074.7  23075.4  (1.00x)
chrRangeToJpeg16_1920_c:    17318.9  24996.0  (0.69x)
lumRangeFromJpeg8_1920_c:   15389.7  15384.5  (1.00x)
lumRangeFromJpeg16_1920_c:  15388.2  17306.7  (0.89x)
lumRangeToJpeg8_1920_c:     19227.8  19226.6  (1.00x)
lumRangeToJpeg16_1920_c:    15387.0  21146.3  (0.73x)

aarch64 A76:
chrRangeFromJpeg8_1920_c:    6324.4   6268.1  (1.01x)
chrRangeFromJpeg16_1920_c:   6339.9  11521.5  (0.55x)
chrRangeToJpeg8_1920_c:      9656.0   9612.8  (1.00x)
chrRangeToJpeg16_1920_c:     6340.4  11651.8  (0.54x)
lumRangeFromJpeg8_1920_c:    4422.0   4420.8  (1.00x)
lumRangeFromJpeg16_1920_c:   4420.9   5762.0  (0.77x)
lumRangeToJpeg8_1920_c:      5949.1   5977.5  (1.00x)
lumRangeToJpeg16_1920_c:     4446.8   5946.2  (0.75x)

NOTE: all simd optimizations for range_convert have been disabled.
      they will be re-enabled when they are fixed for each architecture.

NOTE2: the same issue still exists in rgb2yuv conversions, which is not
       addressed in this commit.
2024-12-05 21:10:29 +01:00
Ramiro Polla
58bcdeb742 swscale/aarch64/range_convert: saturate output instead of limiting input
aarch64 A55:
chrRangeFromJpeg8_1920_c:    28836.2 (1.00x)
chrRangeFromJpeg8_1920_neon:  5312.6 (5.43x)  5313.9 (5.43x)
chrRangeToJpeg8_1920_c:      44196.2 (1.00x)
chrRangeToJpeg8_1920_neon:    6034.6 (7.32x)  5551.3 (7.96x)
lumRangeFromJpeg8_1920_c:    15388.5 (1.00x)
lumRangeFromJpeg8_1920_neon:  3150.7 (4.88x)  3152.3 (4.88x)
lumRangeToJpeg8_1920_c:      23069.7 (1.00x)
lumRangeToJpeg8_1920_neon:    3873.2 (5.96x)  3628.7 (6.36x)

aarch64 A76:
chrRangeFromJpeg8_1920_c:     6334.7 (1.00x)
chrRangeFromJpeg8_1920_neon:  2264.5 (2.80x)  2344.5 (2.70x)
chrRangeToJpeg8_1920_c:      11474.5 (1.00x)
chrRangeToJpeg8_1920_neon:    2646.5 (4.34x)  2824.2 (4.06x)
lumRangeFromJpeg8_1920_c:     4453.2 (1.00x)
lumRangeFromJpeg8_1920_neon:  1104.8 (4.03x)  1104.5 (4.03x)
lumRangeToJpeg8_1920_c:       6645.0 (1.00x)
lumRangeToJpeg8_1920_neon:    1310.5 (5.07x)  1329.8 (5.00x)
2024-12-05 21:10:29 +01:00
Ramiro Polla
2d1358a84d swscale/range_convert: saturate output instead of limiting input
For bit depths <= 14, the result is saturated to 15 bits.
For bit depths > 14, the result is saturated to 19 bits.

x86_64:
chrRangeFromJpeg8_1920_c:    2126.5   2127.4  (1.00x)
chrRangeFromJpeg16_1920_c:   2331.4   2325.2  (1.00x)
chrRangeToJpeg8_1920_c:      3163.0   3166.9  (1.00x)
chrRangeToJpeg16_1920_c:     3163.7   2152.4  (1.47x)
lumRangeFromJpeg8_1920_c:    1262.2   1263.0  (1.00x)
lumRangeFromJpeg16_1920_c:   1079.5   1080.5  (1.00x)
lumRangeToJpeg8_1920_c:      1860.5   1886.8  (0.99x)
lumRangeToJpeg16_1920_c:     1910.2   1077.0  (1.77x)

aarch64 A55:
chrRangeFromJpeg8_1920_c:   28836.2  28835.2  (1.00x)
chrRangeFromJpeg16_1920_c:  28840.1  28839.8  (1.00x)
chrRangeToJpeg8_1920_c:     44196.2  23074.7  (1.92x)
chrRangeToJpeg16_1920_c:    36527.3  17318.9  (2.11x)
lumRangeFromJpeg8_1920_c:   15388.5  15389.7  (1.00x)
lumRangeFromJpeg16_1920_c:  15389.3  15388.2  (1.00x)
lumRangeToJpeg8_1920_c:     23069.7  19227.8  (1.20x)
lumRangeToJpeg16_1920_c:    19227.8  15387.0  (1.25x)

aarch64 A76:
chrRangeFromJpeg8_1920_c:    6334.7   6324.4  (1.00x)
chrRangeFromJpeg16_1920_c:   6336.0   6339.9  (1.00x)
chrRangeToJpeg8_1920_c:     11474.5   9656.0  (1.19x)
chrRangeToJpeg16_1920_c:     9640.5   6340.4  (1.52x)
lumRangeFromJpeg8_1920_c:    4453.2   4422.0  (1.01x)
lumRangeFromJpeg16_1920_c:   4414.2   4420.9  (1.00x)
lumRangeToJpeg8_1920_c:      6645.0   5949.1  (1.12x)
lumRangeToJpeg16_1920_c:     6005.2   4446.8  (1.35x)

NOTE: all simd optimizations for range_convert have been disabled
      except for x86, which already had the same behaviour.
      they will be re-enabled when they are fixed for each architecture.
2024-12-05 21:10:29 +01:00
Niklas Haas
2d077f9acd swscale/internal: group user-facing options together
This is a preliminary step to separating these into a new struct. This
commit contains no functional changes, it is a pure search-and-replace.

Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
2024-11-21 12:49:56 +01:00
Ramiro Polla
f7ee0195df swscale/range_convert: drop redundant conditionals from arch-specific init functions
These conditions are already checked for in the main init function.
2024-10-27 13:20:56 +01:00
Ramiro Polla
7728b3357d swscale/range_convert: call arch-specific init functions from main init function
This commit also fixes the issue that the call to ff_sws_init_range_convert()
from sws_init_swscale() was not setting up the arch-specific optimizations.
2024-10-27 13:20:56 +01:00
Niklas Haas
67adb30322 swscale: rename SwsContext to SwsInternal
And preserve the public SwsContext as separate name. The motivation here
is that I want to turn SwsContext into a public struct, while keeping the
internal implementation hidden. Additionally, I also want to be able to
use multiple internal implementations, e.g. for GPU devices.

This commit does not include any functional changes. For the most part, it is
a simple rename. The only complications arise from the public facing API
functions, which preserve their current type (and hence require an additional
unwrapping step internally), and the checkasm test framework, which directly
accesses SwsInternal.

For consistency, the affected functions that need to maintain a distionction
have generally been changed to refer to the SwsContext as *sws, and the
SwsInternal as *c.

In an upcoming commit, I will provide a backing definition for the public
SwsContext, and update `sws_internal()` to dereference the internal struct
instead of merely casting it.

Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
2024-10-24 22:50:00 +02:00
Martin Storsjö
b9145fcab2 swscale: Fix aarch64 and i386 compilation failures
This unbreaks builds after c1a0e65763,
which broke with errors like

src/libswscale/aarch64/rgb2rgb.c:66:25: error: incompatible function pointer types assigning to 'void (*)(const uint8_t *, uint8_t *, uint8_t *, uint8_t *, int, int, int, int, int, const int32_t *)' (aka 'void (*)(const unsigned char *, unsigned char *, unsigned char *, unsigned char *, int, int, int, int, int, const int *)') from 'void (const uint8_t *, uint8_t *, uint8_t *, uint8_t *, int, int, int, int, int, int32_t *)' (aka 'void (const unsigned char *, unsigned char *, unsigned char *, unsigned char *, int, int, int, int, int, int *)') [-Wincompatible-function-pointer-types]
   66 |         ff_rgb24toyv12  = rgb24toyv12;
      |                         ^ ~~~~~~~~~~~

and

src/libswscale/aarch64/swscale_unscaled.c:213:29: error: incompatible function pointer types assigning to 'SwsFunc' (aka 'int (*)(struct SwsContext *, const unsigned char *const *, const int *, int, int, unsigned char *const *, const int *)') from 'int (SwsContext *, const uint8_t *const *, const int *, int, int, const uint8_t **, const int *)' (aka 'int (struct SwsContext *, const unsigned char *const *, const int *, int, int, const unsigned char **, const int *)') [-Wincompatible-function-pointer-types]
  213 |         c->convert_unscaled = nv24_to_yuv420p_neon_wrapper;
      |                             ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Martin Storsjö <martin@martin.st>
2024-10-08 09:29:07 +03:00
Niklas Haas
c1a0e65763 swscale/internal: constify SwsFunc
I want to move away from having random leaf processing functions mutate
plane pointers, and while we're at it, we might as well make the strides
and tables const as well.

Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
2024-10-07 19:51:34 +02:00
Zhao Zhili
e18b46d95f swscale/aarch64: Fix rgb24toyv12 only works with aligned width
Since c0666d8b, rgb24toyv12 is broken for width non-aligned to 16.
Add a simple wrapper to handle the non-aligned part.

Co-authored-by: johzzy <hellojinqiang@gmail.com>
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-09-24 10:24:14 +08:00
Ramiro Polla
c0666d8bed swscale/aarch64/rgb2rgb: add neon implementation for rgb24toyv12
A55               A76
rgb24toyv12_16_200_c:     36890.6           17275.5
rgb24toyv12_16_200_neon:  12460.1 ( 2.96x)   5360.8 ( 3.22x)
rgb24toyv12_128_60_c:     83205.1           39884.8
rgb24toyv12_128_60_neon:  27468.4 ( 3.03x)  13552.5 ( 2.94x)
rgb24toyv12_512_16_c:     88111.6           42346.8
rgb24toyv12_512_16_neon:  29126.6 ( 3.03x)  14411.2 ( 2.94x)
rgb24toyv12_1920_4_c:     82068.1           39620.0
rgb24toyv12_1920_4_neon:  27011.6 ( 3.04x)  13492.2 ( 2.94x)
2024-09-06 23:11:13 +02:00
Ramiro Polla
d8848325a6 swscale/aarch64/rgb2rgb: add deinterleaveBytes neon implementation
A55               A76
deinterleave_bytes_c:             70342.0           34497.5
deinterleave_bytes_neon:          21594.5 ( 3.26x)   5535.2 ( 6.23x)
deinterleave_bytes_aligned_c:     71340.8           34651.2
deinterleave_bytes_aligned_neon:   8616.8 ( 8.28x)   3996.2 ( 8.67x)
2024-09-06 23:05:09 +02:00
Ramiro Polla
420d443600 swscale/aarch64: cosmetics fix (spaces inside curly braces) 2024-08-26 11:07:49 +02:00
Ramiro Polla
52887683e9 swscale/aarch64: add nv24/nv42 to yuv420p unscaled converter
A55               A76
nv24_yuv420p_128_c:       4956.1            1267.0
nv24_yuv420p_128_neon:    3109.1 ( 1.59x)    640.0 ( 1.98x)
nv24_yuv420p_1920_c:     35728.4           11736.2
nv24_yuv420p_1920_neon:   8011.1 ( 4.46x)   2436.0 ( 4.82x)
nv42_yuv420p_128_c:       4956.4            1270.5
nv42_yuv420p_128_neon:    3074.6 ( 1.61x)    639.5 ( 1.99x)
nv42_yuv420p_1920_c:     35685.9           11732.5
nv42_yuv420p_1920_neon:   7995.1 ( 4.46x)   2437.2 ( 4.81x)
2024-08-26 11:04:46 +02:00
Martin Storsjö
cfe0a36352 libswscale: aarch64: Fix the indentation of some macro invocations
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-08-22 14:40:30 +03:00
Ramiro Polla
181cd260db swscale/aarch64/yuv2rgb: add neon yuv42{0,2}p -> gbrp unscaled colorspace converters
checkasm --bench on a Raspberry Pi 5 Model B Rev 1.0:
yuv420p_gbrp_128_c: 1243.0
yuv420p_gbrp_128_neon: 453.5
yuv420p_gbrp_1920_c: 18165.5
yuv420p_gbrp_1920_neon: 6700.0
yuv422p_gbrp_128_c: 1463.5
yuv422p_gbrp_128_neon: 471.5
yuv422p_gbrp_1920_c: 21343.7
yuv422p_gbrp_1920_neon: 6743.5
2024-08-18 22:26:17 +02:00
Zhao Zhili
4d90a76986 swscale/aarch64: Add argb/abgr to yuv
Test on Apple M1 with kperf:
				: -O3		: -O3 -fno-vectorize
abgr_to_uv_8_c			: 19.4		: 26.1
abgr_to_uv_8_neon		: 29.9		: 51.1
abgr_to_uv_128_c		: 146.4		: 558.9
abgr_to_uv_128_neon		: 85.1		: 83.4
abgr_to_uv_1080_c		: 1162.6	: 4786.4
abgr_to_uv_1080_neon		: 819.6		: 826.6
abgr_to_uv_1920_c		: 2063.6	: 8492.1
abgr_to_uv_1920_neon		: 1435.1	: 1447.1
abgr_to_uv_half_8_c		: 16.4		: 11.4
abgr_to_uv_half_8_neon		: 35.6		: 20.4
abgr_to_uv_half_128_c		: 108.6		: 359.4
abgr_to_uv_half_128_neon	: 75.4		: 42.6
abgr_to_uv_half_1080_c		: 883.4		: 2885.6
abgr_to_uv_half_1080_neon	: 460.6		: 481.1
abgr_to_uv_half_1920_c		: 1553.6	: 5106.9
abgr_to_uv_half_1920_neon	: 817.6		: 820.4
abgr_to_y_8_c			: 6.1		: 26.4
abgr_to_y_8_neon		: 40.6		: 6.4
abgr_to_y_128_c			: 99.9		: 390.1
abgr_to_y_128_neon		: 67.4		: 55.9
abgr_to_y_1080_c		: 735.9		: 3170.4
abgr_to_y_1080_neon		: 534.6		: 536.6
abgr_to_y_1920_c		: 1279.4	: 6016.4
abgr_to_y_1920_neon		: 932.6		: 927.6

Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-07-05 16:32:31 +08:00
Zhao Zhili
52422133ae swscale/aarch64: Add bgra/rgba to yuv
Test on Apple M1 with kperf
				: -O3		: -O3 -fno-vectorize
bgra_to_uv_8_c			: 13.4		: 27.5
bgra_to_uv_8_neon		: 37.4		: 41.7
bgra_to_uv_128_c		: 155.9		: 550.2
bgra_to_uv_128_neon		: 91.7		: 92.7
bgra_to_uv_1080_c		: 1173.2	: 4558.2
bgra_to_uv_1080_neon		: 822.7		: 809.5
bgra_to_uv_1920_c		: 2078.2	: 8115.2
bgra_to_uv_1920_neon		: 1437.7	: 1438.7
bgra_to_uv_half_8_c		: 17.9		: 14.2
bgra_to_uv_half_8_neon		: 37.4		: 10.5
bgra_to_uv_half_128_c		: 103.9		: 326.0
bgra_to_uv_half_128_neon	: 73.9		: 68.7
bgra_to_uv_half_1080_c		: 850.2		: 3732.0
bgra_to_uv_half_1080_neon	: 484.2		: 490.0
bgra_to_uv_half_1920_c		: 1479.2	: 4942.7
bgra_to_uv_half_1920_neon	: 824.2		: 824.7
bgra_to_y_8_c			: 8.2		: 29.5
bgra_to_y_8_neon		: 18.2		: 32.7
bgra_to_y_128_c			: 101.4		: 361.5
bgra_to_y_128_neon		: 74.9		: 73.7
bgra_to_y_1080_c		: 739.4		: 3018.0
bgra_to_y_1080_neon		: 613.4		: 544.2
bgra_to_y_1920_c		: 1298.7	: 5326.0
bgra_to_y_1920_neon		: 918.7		: 934.2

Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-07-05 16:32:31 +08:00
Zhao Zhili
b8b71be07a swscale/aarch64: Add bgr24 to yuv
Test on Apple M1 with kperf
				: -O3		: -O3 -fno-vectorize
bgr24_to_uv_8_c			: 28.5		: 52.5
bgr24_to_uv_8_neon		: 54.5		: 59.7
bgr24_to_uv_128_c		: 294.0		: 830.7
bgr24_to_uv_128_neon		: 99.7		: 112.0
bgr24_to_uv_1080_c		: 965.0		: 6624.0
bgr24_to_uv_1080_neon		: 751.5		: 754.7
bgr24_to_uv_1920_c		: 1693.2	: 11554.5
bgr24_to_uv_1920_neon		: 1292.5	: 1307.5
bgr24_to_uv_half_8_c		: 54.2		: 37.0
bgr24_to_uv_half_8_neon		: 27.2		: 22.5
bgr24_to_uv_half_128_c		: 127.2		: 392.5
bgr24_to_uv_half_128_neon	: 63.0		: 52.0
bgr24_to_uv_half_1080_c		: 880.2		: 3329.0
bgr24_to_uv_half_1080_neon	: 401.5		: 390.7
bgr24_to_uv_half_1920_c		: 1585.7	: 6390.7
bgr24_to_uv_half_1920_neon	: 694.7		: 698.7
bgr24_to_y_8_c			: 21.7		: 22.5
bgr24_to_y_8_neon		: 797.2		: 25.5
bgr24_to_y_128_c		: 88.0		: 280.5
bgr24_to_y_128_neon		: 63.7		: 55.0
bgr24_to_y_1080_c		: 616.7		: 2208.7
bgr24_to_y_1080_neon		: 900.0		: 452.0
bgr24_to_y_1920_c		: 1093.2	: 3894.7
bgr24_to_y_1920_neon		: 777.2		: 767.5

Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-07-05 16:32:31 +08:00
Ramiro Polla
75f1a8e071 swscale/aarch64: add neon {lum,chr}ConvertRange
chrRangeFromJpeg_8_c: 29.2
chrRangeFromJpeg_8_neon: 19.5
chrRangeFromJpeg_24_c: 80.5
chrRangeFromJpeg_24_neon: 34.0
chrRangeFromJpeg_128_c: 413.7
chrRangeFromJpeg_128_neon: 156.0
chrRangeFromJpeg_144_c: 471.0
chrRangeFromJpeg_144_neon: 174.2
chrRangeFromJpeg_256_c: 842.0
chrRangeFromJpeg_256_neon: 305.5
chrRangeFromJpeg_512_c: 1699.0
chrRangeFromJpeg_512_neon: 608.0
chrRangeToJpeg_8_c: 51.7
chrRangeToJpeg_8_neon: 22.7
chrRangeToJpeg_24_c: 149.7
chrRangeToJpeg_24_neon: 38.0
chrRangeToJpeg_128_c: 761.7
chrRangeToJpeg_128_neon: 176.7
chrRangeToJpeg_144_c: 866.2
chrRangeToJpeg_144_neon: 198.7
chrRangeToJpeg_256_c: 1516.5
chrRangeToJpeg_256_neon: 348.7
chrRangeToJpeg_512_c: 3067.2
chrRangeToJpeg_512_neon: 692.7
lumRangeFromJpeg_8_c: 24.0
lumRangeFromJpeg_8_neon: 17.0
lumRangeFromJpeg_24_c: 56.7
lumRangeFromJpeg_24_neon: 21.0
lumRangeFromJpeg_128_c: 294.5
lumRangeFromJpeg_128_neon: 76.7
lumRangeFromJpeg_144_c: 332.5
lumRangeFromJpeg_144_neon: 86.7
lumRangeFromJpeg_256_c: 586.0
lumRangeFromJpeg_256_neon: 152.2
lumRangeFromJpeg_512_c: 1190.0
lumRangeFromJpeg_512_neon: 298.0
lumRangeToJpeg_8_c: 31.7
lumRangeToJpeg_8_neon: 19.5
lumRangeToJpeg_24_c: 83.5
lumRangeToJpeg_24_neon: 24.2
lumRangeToJpeg_128_c: 440.5
lumRangeToJpeg_128_neon: 91.0
lumRangeToJpeg_144_c: 504.2
lumRangeToJpeg_144_neon: 101.0
lumRangeToJpeg_256_c: 879.7
lumRangeToJpeg_256_neon: 177.2
lumRangeToJpeg_512_c: 1794.2
lumRangeToJpeg_512_neon: 354.0
2024-06-18 23:12:41 +02:00
Zhao Zhili
9dac8495b0 swscale/aarch64: Add rgb24 to yuv implementation
Test on Apple M1:

rgb24_to_uv_8_c: 0.0
rgb24_to_uv_8_neon: 0.2
rgb24_to_uv_128_c: 1.0
rgb24_to_uv_128_neon: 0.5
rgb24_to_uv_1080_c: 7.0
rgb24_to_uv_1080_neon: 5.7
rgb24_to_uv_1920_c: 12.5
rgb24_to_uv_1920_neon: 9.5
rgb24_to_uv_half_8_c: 0.2
rgb24_to_uv_half_8_neon: 0.2
rgb24_to_uv_half_128_c: 1.0
rgb24_to_uv_half_128_neon: 0.5
rgb24_to_uv_half_1080_c: 6.2
rgb24_to_uv_half_1080_neon: 3.0
rgb24_to_uv_half_1920_c: 11.2
rgb24_to_uv_half_1920_neon: 5.2
rgb24_to_y_8_c: 0.2
rgb24_to_y_8_neon: 0.0
rgb24_to_y_128_c: 0.5
rgb24_to_y_128_neon: 0.5
rgb24_to_y_1080_c: 4.7
rgb24_to_y_1080_neon: 3.2
rgb24_to_y_1920_c: 8.0
rgb24_to_y_1920_neon: 5.7

On Pixel 6:

rgb24_to_uv_8_c: 30.7
rgb24_to_uv_8_neon: 56.9
rgb24_to_uv_128_c: 213.9
rgb24_to_uv_128_neon: 173.2
rgb24_to_uv_1080_c: 1649.9
rgb24_to_uv_1080_neon: 1424.4
rgb24_to_uv_1920_c: 2907.9
rgb24_to_uv_1920_neon: 2480.7
rgb24_to_uv_half_8_c: 36.2
rgb24_to_uv_half_8_neon: 33.4
rgb24_to_uv_half_128_c: 167.9
rgb24_to_uv_half_128_neon: 99.4
rgb24_to_uv_half_1080_c: 1293.9
rgb24_to_uv_half_1080_neon: 778.7
rgb24_to_uv_half_1920_c: 2292.7
rgb24_to_uv_half_1920_neon: 1328.7
rgb24_to_y_8_c: 19.7
rgb24_to_y_8_neon: 27.7
rgb24_to_y_128_c: 129.9
rgb24_to_y_128_neon: 96.7
rgb24_to_y_1080_c: 995.4
rgb24_to_y_1080_neon: 767.7
rgb24_to_y_1920_c: 1747.4
rgb24_to_y_1920_neon: 1337.2

Note both tests use clang as compiler, which has vectorization
enabled by default with -O3.

Reviewed-by: Rémi Denis-Courmont <remi@remlab.net>
Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-06-11 01:12:09 +08:00
xufuji456
cc86343b96 lavc/hevcdsp_qpel_neon: using movi.16b instead of movi.2d
Building iOS platform with arm64, the compiler has a warning: "instruction movi.2d with immediate #0 may not function correctly on this CPU, converting to movi.16b"

Signed-off-by: xufuji456 <839789740@qq.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-11-28 15:54:49 +02:00
Martin Storsjö
a76b409dd0 aarch64: Reindent all assembly to 8/24 column indentation
libavcodec/aarch64/vc1dsp_neon.S is skipped here, as it intentionally
uses a layered indentation style to visually show how different
unrolled/interleaved phases fit together.

Signed-off-by: Martin Storsjö <martin@martin.st>
2023-10-21 23:25:54 +03:00