13 Commits

Niklas Haas
e20a32d730 swscale/x86/ops: align linear kernels with reference backend
See previous commit.

Signed-off-by: Niklas Haas <git@haasn.dev>
2026-04-16 23:24:55 +02:00
Niklas Haas
af2674645f swscale/ops: drop offset from SWS_MASK_ALPHA
This is far more commonly used without an offset than with one, so having the
offset there prevents these special cases from actually doing much good.

Signed-off-by: Niklas Haas <git@haasn.dev>
2026-04-16 23:24:55 +02:00
Niklas Haas
526195e0a3 swscale/x86/ops_float: fix typo in linear_row
First vector is %2, not %3. This was never triggered before because none of
the existing masks hit this exact case.

Signed-off-by: Niklas Haas <git@haasn.dev>
2026-04-16 23:24:55 +02:00
Niklas Haas
2ef01689c4 swscale/x86/ops: add 4x4 transposed kernel for large filters
Above a certain filter size, we can load the offsets as scalars and loop
over filter taps instead. To avoid having to assemble the output register
in memory (or use some horrific sequence of blends and insertions), we process
4 adjacent pixels at a time and do a 4x4 transpose before accumulating the
weights.

Significantly faster than the existing kernels after 2-3 iterations.

Signed-off-by: Niklas Haas <git@haasn.dev>
2026-03-28 18:50:14 +01:00
Niklas Haas
4bf51d6615 swscale/x86/ops: add reference SWS_OP_FILTER_H implementation
This uses a naive gather-based loop, similar to the existing legacy hscale
SIMD. This has provably correct semantics (and avoids overflow as long as
the filter scale is 1 << 14 or so), though it's not particularly fast for
larger filter sizes.

We can specialize this to more efficient implementations in a subset of cases,
but for now, this guarantees a match to the C code.

Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
2026-03-28 18:50:14 +01:00
Niklas Haas
568cdca9cc swscale/x86/ops: implement support for SWS_OP_FILTER_V
Ideally, we would like to be able to specialize these to fixed kernel
sizes as well (e.g. 2 taps), but that only saves a tiny bit of loop overhead
and at the moment I have more pressing things to focus on.

I found that using FMA instead of straight mulps/addps gains about 15%, so
I defined a separate FMA path that can be used when BITEXACT is not specified
(or when we can statically guarantee that the final sum fits into the floating
point range).

Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
2026-03-28 18:50:14 +01:00
Niklas Haas
baac4a1174 swscale/x86/ops: add section comments (cosmetic)
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
2026-03-12 21:02:48 +00:00
Niklas Haas
ce096aa4ee swscale/x86/ops: add support for optional dither indices
Instead of defining multiple patterns for the dither ops, just define a
single generic function that branches internally. The branch is well-predicted
and ridiculously cheap; at least on my end, the cost is within margin of error.

Signed-off-by: Niklas Haas <git@haasn.dev>
2026-02-26 13:09:14 +00:00
Niklas Haas
48ab318f5c swscale/x86/ops: don't preload dither weights
Preloading doesn't actually gain any performance and just makes the code
needlessly complicated. Instead, directly add the indirect address as needed.

Signed-off-by: Niklas Haas <git@haasn.dev>
2026-02-26 13:09:14 +00:00
Niklas Haas
1ec8e6e3ce swscale/x86/ops: split off dither0 special case
I want to rewrite the dither kernel a bit, and this special case is a bit
too annoying and gets in the way.

Signed-off-by: Niklas Haas <git@haasn.dev>
2026-02-26 13:09:14 +00:00
Niklas Haas
3f7e3cedb5 swscale/x86/ops_float: store and load per row dither offset directly
Instead of computing y + N with a hard-coded index offset, calculate the
relative offset as a 16-bit integer in C and add that to the pointer directly.
Since we no longer mask the resulting combined address, this may result in
overread, but that's fine since we over-provisioned the array in the previous
commit.
2025-12-15 14:31:58 +00:00
Niklas Haas
b1c96b99fa swscale/x86/ops_float: remove special case for 2x2 matrix
This is an exceptionally unlikely (in fact, currently impossible) case to
actually hit, and not worth micro-optimizing for. More specifically, having
this special case prevents me from easily adding per-row offsets.
2025-12-15 14:31:58 +00:00
Niklas Haas
982d3a98d0 swscale/x86: add SIMD backend
This covers most 8-bit and 16-bit ops, and some 32-bit ops. It also covers all
floating point operations. While this is not yet 100% coverage, it's good
enough for the vast majority of formats out there.

Of special note is the packed shuffle fast path, which uses pshufb at vector
sizes up to AVX512.
2025-09-01 19:28:36 +02:00