The major consequence of this is that we start allocating buffers per plane,
instead of allocating one contiguous buffer. This makes the no-op/refcopy
case slightly slower, but doesn't meaningfully affect the rest:
yuva444p -> yuva444p, time=157/1000 us (ref=78/1000 us), speedup=0.497x slower
Overall speedup=1.016x faster, min=0.983x max=1.092x
However, this is a necessary consequence of the desire to allow partial plane
allocations / single plane refcopies. This slowdown also does not affect
vf_scale, which already uses avfilter/framepool.c (via ff_get_video_buffer).
Signed-off-by: Niklas Haas <git@haasn.dev>
The legacy API is defined by sws_init_context(), sws_scale() etc., whereas
the "modern" API is defined by just using sws_scale_frame() without prior
init call.
This int allows us to cleanly distinguish the type of context, paving the
way for some minor refactoring.
As an immediate benefit, we now gain a bunch of explict error checks to
ensure the API is used correctly (i.e. sws_scale() not called before
sws_init_context()).
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
This is a bit more forward-facing than a bare allocation, and importantly,
allows the `swscale/utils.c` code to remain agnostic about how to correctly
uninit this struct.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
Add ARM64 NEON-accelerated unscaled YUV-to-RGB conversion for planar
YUV input formats. This extends the existing NV12/NV21 NEON paths with
YUV420P, YUV422P, and YUVA420P support for all packed RGB output
formats (ARGB, RGBA, ABGR, BGRA, RGB24, BGR24) and planar GBRP.
Register with ff_yuv2rgb_init_aarch64() to also cover the scaled path.
checkasm: all 42 sw_yuv2rgb tests pass.
Speedup vs C at 1920px width (Apple M3 Max, avg of 20 runs):
yuv420p->rgb24: 4.3x yuv420p->argb: 3.1x
yuv422p->rgb24: 5.5x yuv422p->argb: 4.1x
yuva420p->argb: 3.5x yuva420p->rgba: 3.5x
Signed-off-by: David Christle <dev@christle.is>
Up until now, several altivec-specific fields are directly
put into SwsInternal #if HAVE_ALTIVEC is true. These fields
are of altivec-specific vector types and therefore
require altivec specific headers to be included.
Unfortunately, said altivec specific headers redefine
bool in a manner that is incompatible with stdbool.
swscale/ops.h uses bool and this led graph.c and ops.c
to disagree about the layout of structures from ops.h,
leading to heap corruption [1], [2] in the sws-unscaled
FATE test.
Fix this by moving the altivec-specific parts out of SwsInternal
and into a structure that extends SwsInternal and is allocated
jointly with it. Said structure is local to yuv2rgb_altivec.c,
because this is the only file accessing said fields. Should
more files need them, an altivec-specific swscale header would
need to be added.
Thanks to jfiusdq <jfiusdq@proton.me> for analyzing the issue.
[1]: https://fate.ffmpeg.org/report.cgi?slot=ppc64-linux-gcc-14.3-asan&time=20260224065643
[2]: https://fate.ffmpeg.org/report.cgi?slot=ppc64-linux-gcc-14.3&time=20260224051615
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This is in preparation for removing the util_altivec.h inclusion
in swscale_internal.h, which causes problems (on PPC) because
it redefines bool to something different from stdbool.h.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Prepare for xyz12Torgb48 architecture-specific optimizations in
subsequent patches by:
- Grouping XYZ+RGB gamma LUTs and 3x3 matrices into SwsColorXform
(ctx->xyz2rgb and ctx->rgb2xyz), replacing scattered fields.
- Dropping the unused last matrix column giving the same or smaller
SwsInternal size.
- Renaming ff_xyz12Torgb48 and ff_rgb48Toxyz12 and routing calls via
the new per-context function pointer (ctx->xyz12Torgb48 and
ctx->rgb48Toxyz12) in graph.c and swscale.c.
- Adding ff_sws_init_xyzdsp and invoking it in swscale init paths
(normal and unscaled).
- Making fill_xyztables public to ease its setup later in checkasm.
These modifications do not introduce any functional changes.
Signed-off-by: Arpad Panyik <Arpad.Panyik@arm.com>
The current logic uses 12-bit linear light math, which is woefully insufficient
and leads to nasty postarization artifacts. This patch simply switches the
internal logic to 16-bit precision.
This raises the memory requirement of these tables from 32 kB to 272 kB.
All relevant FATE tests updated for improved accuracy.
Fixes: #4829
Signed-off-by: Niklas Haas <git@haasn.dev>
Sponsored-by: Sovereign Tech Fund
This setting can be used to infuence the type of tone and gamut mapping used
internally when color space conversions are required. As discussed at VDD'24,
the default was set to relative colorimetric clipping, which is approximately
associative, surjective and idempotent. As such, it roundtrips well, although
it is strictly speaking not associative on out-of-gamut colors.
There is an issue with the constants used in YUV to YUV range conversion,
where the upper bound is not respected when converting to mpeg range.
With this commit, the constants are calculated at runtime, depending on
the bit depth. This approach also allows us to more easily understand how
the constants are derived.
For bit depths <= 14, the number of fixed point bits has been set to 14
for all conversions, to simplify the code.
For bit depths > 14, the number of fixed points bits has been raised and
set to 18, to allow for the conversion to be accurate enough for the mpeg
range to be respected.
The convert functions now take the conversion constants (coeff and offset)
as function arguments.
For bit depths <= 14, coeff is unsigned 16-bit and offset is 32-bit.
For bit depths > 14, coeff is unsigned 32-bit and offset is 64-bit.
x86_64:
chrRangeFromJpeg8_1920_c: 2127.4 2125.0 (1.00x)
chrRangeFromJpeg16_1920_c: 2325.2 2127.2 (1.09x)
chrRangeToJpeg8_1920_c: 3166.9 3168.7 (1.00x)
chrRangeToJpeg16_1920_c: 2152.4 3164.8 (0.68x)
lumRangeFromJpeg8_1920_c: 1263.0 1302.5 (0.97x)
lumRangeFromJpeg16_1920_c: 1080.5 1299.2 (0.83x)
lumRangeToJpeg8_1920_c: 1886.8 2112.2 (0.89x)
lumRangeToJpeg16_1920_c: 1077.0 1906.5 (0.56x)
aarch64 A55:
chrRangeFromJpeg8_1920_c: 28835.2 28835.6 (1.00x)
chrRangeFromJpeg16_1920_c: 28839.8 32680.8 (0.88x)
chrRangeToJpeg8_1920_c: 23074.7 23075.4 (1.00x)
chrRangeToJpeg16_1920_c: 17318.9 24996.0 (0.69x)
lumRangeFromJpeg8_1920_c: 15389.7 15384.5 (1.00x)
lumRangeFromJpeg16_1920_c: 15388.2 17306.7 (0.89x)
lumRangeToJpeg8_1920_c: 19227.8 19226.6 (1.00x)
lumRangeToJpeg16_1920_c: 15387.0 21146.3 (0.73x)
aarch64 A76:
chrRangeFromJpeg8_1920_c: 6324.4 6268.1 (1.01x)
chrRangeFromJpeg16_1920_c: 6339.9 11521.5 (0.55x)
chrRangeToJpeg8_1920_c: 9656.0 9612.8 (1.00x)
chrRangeToJpeg16_1920_c: 6340.4 11651.8 (0.54x)
lumRangeFromJpeg8_1920_c: 4422.0 4420.8 (1.00x)
lumRangeFromJpeg16_1920_c: 4420.9 5762.0 (0.77x)
lumRangeToJpeg8_1920_c: 5949.1 5977.5 (1.00x)
lumRangeToJpeg16_1920_c: 4446.8 5946.2 (0.75x)
NOTE: all simd optimizations for range_convert have been disabled.
they will be re-enabled when they are fixed for each architecture.
NOTE2: the same issue still exists in rgb2yuv conversions, which is not
addressed in this commit.
As part of a larger, ongoing effort to modernize and partially rewrite
libswscale, it was decided and generally agreed upon to introduce a new
public API for libswscale. This API is designed to be less stateful, more
explicitly defined, and considerably easier to use than the existing one.
Most of the API work has been already accomplished in the previous commits,
this commit merely introduces the ability to use sws_scale_frame()
dynamically, without prior sws_init_context() calls. Instead, the new API
takes frame properties from the frames themselves, and the implementation is
based on the new SwsGraph API, which we simply reinitialize as needed.
This high-level wrapper also recreates the logic that used to live inside
vf_scale for scaling interlaced frames, enabling it to be reused more easily
by end users.
Finally, this function is designed to simply copy refs directly when nothing
needs to be done, substantially improving throughput of the noop fast path.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
Following in the footsteps of the work in the previous commit, it's now
relatively straightforward to expose the options struct publicly as
SwsContext. This is a step towards making this more user friendly, as
well as following API conventions established elsewhere.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
This is a preliminary step to separating these into a new struct. This
commit contains no functional changes, it is a pure search-and-replace.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
This commit also fixes the issue that the call to ff_sws_init_range_convert()
from sws_init_swscale() was not setting up the arch-specific optimizations.
And preserve the public SwsContext as separate name. The motivation here
is that I want to turn SwsContext into a public struct, while keeping the
internal implementation hidden. Additionally, I also want to be able to
use multiple internal implementations, e.g. for GPU devices.
This commit does not include any functional changes. For the most part, it is
a simple rename. The only complications arise from the public facing API
functions, which preserve their current type (and hence require an additional
unwrapping step internally), and the checkasm test framework, which directly
accesses SwsInternal.
For consistency, the affected functions that need to maintain a distionction
have generally been changed to refer to the SwsContext as *sws, and the
SwsInternal as *c.
In an upcoming commit, I will provide a backing definition for the public
SwsContext, and update `sws_internal()` to dereference the internal struct
instead of merely casting it.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
I want to pull options out of SwsInternal, so we need to make this field
a dedicated int that gets updated as appropriate in ff_swscale().
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
Used as an intermediate entry point for the new swscale context. The extra
constification is a consistency measure, as I want to move the memcpy of
stride and plane pointers to the functions that actually need to mutate them.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
I want to move away from having random leaf processing functions mutate
plane pointers, and while we're at it, we might as well make the strides
and tables const as well.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
This is by no means perfect, since at least ddagrab will return scRGB
data with values outside of 0.0f to 1.0f for HDR values.
Its primary purpose is to be able to work with the format at all.
Some of these were made possible by moving several common macros to
libavutil/macros.h.
While just at it, also improve the other headers a bit.
Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Fixes so that fate under 64 bit Windows passes.
These functions replace all ff_hscale8to15_*_ssse3 when avx2 is available.
Signed-off-by: James Almer <jamrial@gmail.com>