FFmpeg

mirror of https://mirror.skon.top/https://github.com/FFmpeg/FFmpeg synced 2026-04-21 21:30:24 +08:00

Files

David Christle 2c7fe8d8ad swscale/aarch64: add NEON rgb32tobgr24 and rgb24tobgr32 conversions

Add NEON alpha drop/insert using ldp+tbl+stp instead of ld4/st3 and
ld3/st4 structure operations. Both use a 2-register sliding-window
tbl with post-indexed addressing. Instruction scheduling targets
narrow in-order cores (A55) while remaining neutral on wide OoO.

Scalar tails use coalesced loads/stores (ldr+strh+lsr+strb for alpha
drop, ldrh+ldrb+orr+str for alpha insert) to reduce per-pixel
instruction count. Independent instructions placed between loads and
dependent operations to fill load-use latency on in-order cores.

checkasm --bench on Apple M3 Max (decicycles, 1920px):
  rgb32tobgr24_c:    114.4 ( 1.00x)
  rgb32tobgr24_neon:  64.3 ( 1.78x)
  rgb24tobgr32_c:    128.9 ( 1.00x)
  rgb24tobgr32_neon:  80.9 ( 1.59x)

C baseline is clang auto-vectorized; speedup is over compiler NEON.

Signed-off-by: David Christle <dev@christle.is>

2026-03-04 10:30:08 +00:00

asm-offsets.h

swscale: Add AArch64 Neon path for xyz12Torgb48 LE

2025-12-05 10:28:18 +00:00

hscale.S

all: fix typos found by codespell

2025-08-03 13:48:47 +02:00

input.S

swscale/aarch64: dotprod implementation of rgba32_to_Y

2025-03-04 10:16:44 +02:00

Makefile

swscale: Add AArch64 Neon path for xyz12Torgb48 LE