This is far more commonly used without an offset than with one, so having it
there prevents these special cases from doing much good.
Signed-off-by: Niklas Haas <git@haasn.dev>
The first vector is %2, not %3. This was never triggered before because none of
the existing masks hit this exact case.
Signed-off-by: Niklas Haas <git@haasn.dev>
Above a certain filter size, we can load the offsets as scalars and loop
over filter taps instead. To avoid having to assemble the output register
in memory (or use some horrific sequence of blends and insertions), we process
4 adjacent pixels at a time and do a 4x4 transpose before accumulating the
weights.
Significantly faster than the existing kernels after 2-3 iterations.
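A scalar model of the loop structure described above (hypothetical names, not the actual SIMD kernel): rather than assembling one output register per pixel, four adjacent output pixels are processed together while iterating over filter taps; in the SIMD version, the 4x4 transpose of the gathered rows puts the data into exactly this tap-major layout.

```c
#include <assert.h>
#include <stdint.h>

/* Tap-major accumulation over 4 adjacent output pixels. In the real
 * kernel the inner loop body is a single vector multiply-accumulate on
 * the transposed 4x4 block; here it is spelled out as scalars. */
void hscale_4px(int32_t acc[4],
                const uint8_t *src,
                const int16_t *filter,  /* 4 * filter_size weights */
                const int pos[4],       /* per-pixel source offsets */
                int filter_size)
{
    acc[0] = acc[1] = acc[2] = acc[3] = 0;
    for (int j = 0; j < filter_size; j++)   /* taps in the outer loop */
        for (int p = 0; p < 4; p++)         /* 4 pixels in the inner loop */
            acc[p] += src[pos[p] + j] * filter[p * filter_size + j];
}
```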
Signed-off-by: Niklas Haas <git@haasn.dev>
This uses a naive gather-based loop, similar to the existing legacy hscale
SIMD. This has provably correct semantics (and avoids overflow as long as
the filter scale is 1 << 14 or so), though it's not particularly fast for
larger filter sizes.
We can specialize this to more efficient implementations in a subset of cases,
but for now, this guarantees a match to the C code.
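The gather loop in question looks roughly like this in C (hypothetical names, not the actual swscale code): one output pixel at a time, gathering filter_size input samples and accumulating against 16-bit weights. With the weights for each pixel summing to roughly 1 << 14, each product of an unsigned sample and a 15-bit weight stays well inside a 32-bit accumulator, which is the overflow guarantee mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Naive gather-based horizontal scale: for output pixel i, read
 * filter_size input pixels starting at pos[i] and accumulate them
 * against that pixel's run of filter coefficients. */
void hscale_gather(int32_t *dst, int dst_w,
                   const uint8_t *src,
                   const int16_t *filter,  /* dst_w * filter_size weights */
                   const int *pos,         /* per-pixel gather offsets */
                   int filter_size)
{
    for (int i = 0; i < dst_w; i++) {
        int32_t sum = 0;
        for (int j = 0; j < filter_size; j++)
            sum += src[pos[i] + j] * filter[i * filter_size + j];
        dst[i] = sum;
    }
}
```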
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
Ideally, we would like to be able to specialize these to fixed kernel
sizes as well (e.g. 2 taps), but that only saves a tiny bit of loop overhead
and at the moment I have more pressing things to focus on.
I found that using FMA instead of straight mulps/addps gains about 15%, so
I defined a separate FMA path that can be used when BITEXACT is not specified
(or when we can statically guarantee that the final sum fits into the floating
point range).
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
Instead of defining multiple patterns for the dither ops, just define a
single generic function that branches internally. The branch is well-predicted
and ridiculously cheap; at least on my end, its cost is within margin of error.
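The shape of the generic op can be sketched like this (hypothetical signature, not the actual code): the no-dither case is folded into the same function via a runtime check whose outcome is the same on every iteration, so the predictor resolves it essentially for free.

```c
#include <assert.h>
#include <stdint.h>

/* One generic dither op instead of separate dither/no-dither patterns.
 * dither may be NULL; dither_len is assumed to be a power of two. */
void dither_row(uint8_t *dst, const int16_t *src, int w,
                const int16_t *dither, int dither_len)
{
    for (int i = 0; i < w; i++) {
        int v = src[i];
        if (dither)  /* well-predicted: same outcome every iteration */
            v += dither[i & (dither_len - 1)];
        dst[i] = v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
    }
}
```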
Signed-off-by: Niklas Haas <git@haasn.dev>
This doesn't actually gain any performance but makes the code needlessly
complicated. Just directly add the indirect address as needed.
Signed-off-by: Niklas Haas <git@haasn.dev>
Instead of computing y + N with a hard-coded index offset, calculate the
relative offset as a 16-bit integer in C and add that to the pointer directly.
Since we no longer mask the resulting combined address, this may result in an
overread, but that's fine since we over-provisioned the array in the previous
commit.
This is an exceptionally unlikely (in fact, currently impossible) case to
actually hit, and not worth micro-optimizing for. More specifically, having
this special case prevents me from easily adding per-row offsets.
This covers most 8-bit and 16-bit ops, and some 32-bit ops. It also covers all
floating point operations. While this is not yet 100% coverage, it's good
enough for the vast majority of formats out there.
Of special note is the packed shuffle fast path, which uses pshufb at vector
sizes up to AVX512.
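For reference, the semantics the fast path builds on can be modeled in C as follows: pshufb picks each output byte by the low 4 bits of its control byte, or zeroes it when the control byte's high bit is set; the AVX2 and AVX-512 forms apply the same operation independently per 128-bit lane.

```c
#include <assert.h>
#include <stdint.h>

/* C model of pshufb (SSSE3 byte shuffle) on one 128-bit lane. */
void pshufb_model(uint8_t dst[16], const uint8_t src[16],
                  const uint8_t ctl[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = (ctl[i] & 0x80) ? 0 : src[ctl[i] & 0x0F];
}
```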