Files
David Christle 2c7fe8d8ad swscale/aarch64: add NEON rgb32tobgr24 and rgb24tobgr32 conversions
Add NEON alpha drop/insert using ldp+tbl+stp instead of ld4/st3 and
ld3/st4 structure operations. Both use a 2-register sliding-window
tbl with post-indexed addressing. Instruction scheduling targets
narrow in-order cores (A55) while remaining neutral on wide OoO.

Scalar tails use coalesced loads/stores (ldr+strh+lsr+strb for alpha
drop, ldrh+ldrb+orr+str for alpha insert) to reduce per-pixel
instruction count. Independent instructions placed between loads and
dependent operations to fill load-use latency on in-order cores.

checkasm --bench on Apple M3 Max (decicycles, 1920px):
  rgb32tobgr24_c:    114.4 ( 1.00x)
  rgb32tobgr24_neon:  64.3 ( 1.78x)
  rgb24tobgr32_c:    128.9 ( 1.00x)
  rgb24tobgr32_neon:  80.9 ( 1.59x)

C baseline is clang auto-vectorized; speedup is over compiler NEON.

Signed-off-by: David Christle <dev@christle.is>
2026-03-04 10:30:08 +00:00
..
2025-08-03 13:48:47 +02:00