Makes various pieces of code that expect to get a SWS_OP_READ more robust,
and also allows us to generalize to introduce more input op types in the
future (in particular, I am looking ahead towards filter ops).
Signed-off-by: Niklas Haas <git@haasn.dev>
This code is self-contained and logically distinct from the ops-related
helpers in ops.c, so it belongs in its own file.
Purely cosmetic; no functional change.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
AVFrame just really doesn't have the semantics we want. However, there is a
tangible benefit to having SwsFrame act as a carbon copy of (a subset of)
AVFrame.
This partially reverts commit 67f3627267.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
This has now become fully redundant with AVFrame, especially given the
existence of SwsPassBuffer. Delete it, simplifying a lot of things and
avoiding reinventing the wheel everywhere.
Also generally reduces overhead, since there is less redundant copying
going on.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
The current logic didn't take into account the possible plane shift. Just
re-compute the correctly shifted pointers using the row position.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
Instead, precompute the correctly swizzled data and stride in setup()
and just reference the SwsOpExec fields directly.
To avoid the stack copies in handle_tail() we can introduce a temporary
array to hold just the pointers.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
We often need to dither only a subset of the components. Previously this
was not possible, but we can just use the special value -1 for this.
The main motivating factor is that "unnecessary" dither ops would otherwise
frequently prevent plane splitting, since e.g. a copied alpha plane would
have to come along for the ride through the whole F32/dither pipeline.
Additionally, it somewhat simplifies implementations.
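The idea can be illustrated with a minimal sketch; the names and matrix here
are hypothetical stand-ins, not the actual swscale implementation. A
per-component offset of -1 marks a component that should pass through
undithered:

```c
#include <assert.h>

/* Hypothetical sketch: -1 in a per-component offset field means "do not
 * dither this component"; any other value rotates into a (toy) 1x4
 * dither matrix. Illustrative only. */
static const float matrix4[4] = { 0.0f, 0.5f, 0.25f, 0.75f };

static float dither_component(float value, int comp_offset, int pos)
{
    if (comp_offset < 0)        /* -1: component passes through untouched */
        return value;
    return value + matrix4[(pos + comp_offset) & 3];
}
```

With such a convention, a copied alpha plane keeps its offset at -1 and never
forces the dither op to touch it.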
Signed-off-by: Niklas Haas <git@haasn.dev>
This gives more information about each operation and helps catch issues
earlier on.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
Instead of awkwardly preserving these from the `SwsOp` itself. This
interpretation lessens the risk of bugs as a result of changing the plane
swizzle mask without updating the corresponding components.
After this commit, the plane swizzle mask is automatically taken into
account; i.e. the src_comps mask is always interpreted as if the read op
was in-order (unswizzled).
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
This helper function now also takes into account the plane order, and only
returns true if the SwsOpList is a true no-op (i.e. the input image may be
exactly ref'd to the output, with no change in plane order, etc.)
As pointed out in the code, this is unlikely to actually matter, but still
technically correct.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
This can be used to have the execution code directly swizzle the plane
pointers, instead of swizzling the data via SWS_OP_SWIZZLE. This can be used
to, for example, extract a subset of the input/output planes for partial
processing of split graphs (e.g. subsampled chroma, or independent alpha),
or just to skip an SWS_OP_SWIZZLE operation.
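A minimal sketch of the pointer-level swizzle, with an assumed 4-entry
`order` array (illustrative, not the actual SwsOpExec layout) where
`order[i]` names the source plane feeding output slot i:

```c
#include <stdint.h>

/* Illustrative sketch: apply a plane order (swizzle) to the plane pointer
 * array itself, instead of emitting a data-level SWS_OP_SWIZZLE. The
 * `order` array is a hypothetical stand-in for the new metadata. */
static void swizzle_plane_ptrs(uint8_t *dst[4], uint8_t *const src[4],
                               const int order[4])
{
    for (int i = 0; i < 4; i++)
        dst[i] = src[order[i]];
}
```

Because only four pointers are permuted per pass, this is effectively free
compared to shuffling the pixel data itself.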
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
This optimization is lossy, since it removes important information about the
number of planes to be copied. Instead, move this code to the new, more
correct ff_sws_op_list_is_noop().
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
And use it in ff_sws_compile_pass() instead of hard-coding the check there.
This check will become more sophisticated in the following commits.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
The current logic implicitly pulled the new value range out of SwsComps using
ff_sws_apply_op_q(), but this was quite ill-formed and not very robust. In
particular, it only worked because of the implicit assumption that the value
range was always set to 0b1111...111.
This actually poses a serious problem for 32-bit packed formats, whose
value range does not fit into AVRational. In the past, it only worked
because the value would implicitly overflow to -1, from which SWS_OP_UNPACK
would then correctly extract the bits again.
In general, it's cleaner (and sufficient) to just explicitly reset the value
range on SWS_OP_UNPACK again.
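The overflow in question can be demonstrated in isolation (without AVRational
itself): the value range maximum of a 32-bit packed format, 2^32 - 1, does
not fit into a signed 32-bit numerator and wraps to -1, i.e. the all-ones
bit pattern that the unpack op could still extract correct bits from:

```c
#include <stdint.h>

/* Minimal demonstration of the overflow described above: converting
 * 2^32 - 1 to a signed 32-bit integer is implementation-defined in C,
 * but yields -1 (all bits set) on two's-complement targets. */
static int32_t wrap_to_i32(uint64_t v)
{
    return (int32_t)v;
}
```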
The current behavior of assuming the value range implicitly on SWS_OP_READ
has a number of serious drawbacks and shortcomings:
- It ignores the effects of SWS_OP_RSHIFT, such as for p010 and related
MSB-aligned formats. (This is actually a bug)
- It adds a needless dependency on the "purely informative" src/dst fields
inside SwsOpList.
- It is difficult to reason about when acted upon by SWS_OP_SWAP_BYTES, and
the existing hack of simply ignoring SWAP_BYTES on the value range is not
a very good solution here.
Instead, we need a more principled way for the op list generating code
to communicate extra metadata about the operations read to the optimizer.
I think the simplest way of doing this is to allow the SwsComps field attached
to SWS_OP_READ to carry additional, user-provided information about the values
read.
This requires changing ff_sws_op_list_update_comps() slightly to not completely
overwrite SwsComps on SWS_OP_READ, but instead merge the implicit information
with the explicitly provided one.
I think this is ultimately a better home, since the semantics of this are
not really tied to optimization itself; and because I want to make it an
explicitly supported part of the user-facing API (rather than just an
internal-use field).
The secondary motivating reason here is that I intend to use internal
helpers of `ops.c` inside the next commit. (Though this is a weak reason
on its own, and not sufficient to justify this move by itself.)
To improve decorrelation between components, we offset the dither matrix
slightly for each component. This is currently done by adding a hard-coded
offset of {0, 3, 2, 5} to each of the four components, respectively.
However, this represents a serious challenge when re-ordering SwsDitherOp
past a swizzle, or when splitting an SwsOpList into multiple sub-operations
(e.g. for decoupling luma from subsampled chroma when they are independent).
To fix this on a fundamental level, we have to keep track of the offset per
channel as part of the SwsDitherOp metadata, and respect those values at
runtime.
This commit merely adds the metadata; the update to the underlying backends
will come in a follow-up commit. The FATE change is merely due to the
added offsets in the op list print-out.
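The runtime use of the per-channel offsets can be sketched as follows; the
matrix size, indexing scheme and function name are illustrative assumptions,
not the actual backend code:

```c
/* Sketch: instead of a hard-coded {0, 3, 2, 5}, each channel carries its
 * own offset in the op metadata, and the dither matrix row lookup is
 * rotated by it at runtime. 8x8 matrix assumed for illustration. */
#define MATRIX_SIZE 8

static int dither_index(int y, int x, int chan_offset)
{
    /* rotate the row lookup by the per-channel offset */
    return ((y + chan_offset) % MATRIX_SIZE) * MATRIX_SIZE + x;
}
```

With the offset stored in the metadata, swizzling or splitting the op list
only has to permute the offset array alongside the components.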
This function uses ff_sws_pixel_type_size to switch on the
size of the provided type. However, ff_sws_pixel_type_size returns
a size in bytes (from sizeof()), not a size in bits. Therefore,
this would previously never return the right thing but always
hit the av_unreachable() below.
As the function is entirely unused, just remove it.
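A hypothetical reconstruction of the bug pattern (names and cases are
illustrative stand-ins for the removed function, not the original source):

```c
/* The switch expected a size in bits, but ff_sws_pixel_type_size()
 * returns bytes (from sizeof), so no case could ever match and control
 * always reached the av_unreachable() path. */
static int type_size_bytes(int is_u8)   /* stand-in for ff_sws_pixel_type_size */
{
    return is_u8 ? 1 : 4;               /* sizeof(uint8_t), sizeof(float) */
}

static int matches_bit_case(int is_u8)
{
    switch (type_size_bytes(is_u8)) {
    case 8:  return 1;  /* intended for 8-bit types, never hit */
    case 32: return 1;  /* intended for 32-bit types, never hit */
    default: return 0;  /* always taken (the av_unreachable path) */
    }
}
```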
This fixes compilation with MSVC 2026 18.0 when targeting ARM64,
which previously hit an internal compiler error [1].
[1] https://developercommunity.visualstudio.com/t/Internal-Compiler-Error-targeting-ARM64-/10962922
This covers most 8-bit and 16-bit ops, and some 32-bit ops. It also covers all
floating point operations. While this is not yet 100% coverage, it's good
enough for the vast majority of formats out there.
Of special note is the packed shuffle fast path, which uses pshufb at vector
sizes up to AVX512.
Provides a generic fast path for any operation list that can be decomposed
into a series of memcpy and memset operations.
25% faster than the x86 backend for yuv444p -> yuva444p
33% faster than the x86 backend for gray -> yuvj444p
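The decomposition idea can be sketched like this; the struct and row loop are
illustrative assumptions, not the actual swscale representation:

```c
#include <string.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch: when an op list reduces to "copy these bytes, clear those",
 * each output row becomes a plain memcpy (from the input row) or memset
 * (with a constant fill, e.g. for an opaque alpha plane). */
typedef struct CopyOp {
    int is_clear;       /* 1: memset with `fill`; 0: memcpy from src */
    uint8_t fill;
    size_t len;         /* bytes per row */
} CopyOp;

static void run_row(const CopyOp *op, uint8_t *dst, const uint8_t *src)
{
    if (op->is_clear)
        memset(dst, op->fill, op->len);
    else
        memcpy(dst, src, op->len);
}
```

This explains the speedups quoted above: yuv444p -> yuva444p is three row
copies plus one row fill, with no per-pixel work at all.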
This will serve as a reference for the SIMD backends to come. That said,
with auto-vectorization enabled, the performance of this is not atrocious.
It easily beats the old C code and sometimes even the old SIMD.
In theory, we can dramatically speed it up by using GCC vectors instead of
arrays, but the performance gains from this are too dependent on exact GCC
versions and flags, so in practice it's not a substitute for a SIMD
implementation.
This handles the low-level execution of an op list, and integration into
the SwsGraph infrastructure. To handle frames with insufficient padding in
the stride (or a width smaller than one block size), we use a fallback loop
that pads the last column of pixels using `memcpy` into an appropriately
sized buffer.
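The tail fallback can be sketched as follows, assuming a fixed block size and
a stand-in for the compiled op chain (both illustrative, not the actual
implementation):

```c
#include <string.h>
#include <stdint.h>

#define BLOCK 16  /* assumed SIMD block size, for illustration */

static void process_block(uint8_t *buf)  /* stand-in for the compiled ops */
{
    for (int i = 0; i < BLOCK; i++)
        buf[i] = (uint8_t)(buf[i] + 1);
}

static void process_row_with_tail(uint8_t *row, int width)
{
    int x = 0;
    for (; x + BLOCK <= width; x += BLOCK)
        process_block(&row[x]);
    if (x < width) {                      /* tail narrower than one block */
        uint8_t tmp[BLOCK] = { 0 };
        memcpy(tmp, &row[x], width - x);  /* pad into a temp buffer */
        process_block(tmp);               /* run one full block safely */
        memcpy(&row[x], tmp, width - x);  /* write back only valid pixels */
    }
}
```

The main loop thus never needs bounds checks; only the final partial block
pays for the extra copies.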
See docs/swscale-v2.txt for an in-depth introduction to the new approach.
This commit merely introduces the ops definitions and boilerplate functions.
The subsequent commits will flesh out the underlying implementation.