This commit pieces together the previous few commits to implement the
NEON backend for sws_ops.
In essence, a tool which runs on the target (sws_ops_aarch64) is used
to enumerate all the functions that the backend needs to implement. The
list it generates is stored in the repository (ops_entries.c).
The list from above is used at build time by a code generator tool
(ops_asmgen) to implement all the sws_ops functions the NEON backend
supports, and generate a lookup function in C to retrieve the assembly
function pointers.
At runtime, the NEON backend fetches the function pointers to the
assembly functions and chains them together in a continuation-passing
style design, similar to the x86 backend.
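The chaining described above can be pictured with a minimal C sketch.
Every name here (Step, StepFn, op_add, op_mul, op_end, run_chain) is
hypothetical and is not the swscale API; the real kernels are NEON
assembly. The sketch only shows the general shape of a
continuation-passing chain, in which each kernel finishes by calling
the next entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration; not the swscale API. */
typedef struct Step Step;
typedef void (*StepFn)(const Step *step, uint32_t *px, size_t n);

struct Step {
    StepFn   fn;   /* kernel to run for this step */
    uint32_t arg;  /* per-step constant */
};

/* Each kernel does its work, then calls the next entry in the chain. */
static void op_add(const Step *s, uint32_t *px, size_t n)
{
    for (size_t i = 0; i < n; i++)
        px[i] += s->arg;
    s[1].fn(&s[1], px, n);  /* continuation */
}

static void op_mul(const Step *s, uint32_t *px, size_t n)
{
    for (size_t i = 0; i < n; i++)
        px[i] *= s->arg;
    s[1].fn(&s[1], px, n);  /* continuation */
}

/* Terminator: ends the chain by simply returning. */
static void op_end(const Step *s, uint32_t *px, size_t n)
{
    (void)s; (void)px; (void)n;
}

/* Entry point: start the chain at its first step. */
static void run_chain(const Step *chain, uint32_t *px, size_t n)
{
    assert(chain && chain->fn);
    chain->fn(chain, px, n);
}
```

A caller would build an array such as { {op_add, 1}, {op_mul, 10},
{op_end, 0} } and pass it to run_chain() together with the data.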
The following speedup is observed from legacy swscale to NEON:
A520: Overall speedup=3.780x faster, min=0.137x max=91.928x
A720: Overall speedup=4.129x faster, min=0.234x max=92.424x
And the following from the C sws_ops implementation to NEON:
A520: Overall speedup=5.513x faster, min=0.927x max=14.169x
A720: Overall speedup=4.786x faster, min=0.585x max=20.157x
The slowdowns from legacy to NEON are the same as those seen for C/x86:
mostly low bit-depth conversions that did not perform dithering in
legacy.
The 0.585x outlier from C to NEON is gbrpf32le -> gbrapf32le, which is
mostly memcpy with the C implementation. All other conversions are
better.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
The NEON sws_ops backend follows the same continuation-passing style
design as the x86 backend.
Unlike the C and x86 backends, which implement the various operation
functions through the use of templates and preprocessor macros, the
NEON backend uses a build-time code generator, which is introduced by
this commit.
This code generator has two modes of operation:
-ops:
Generates an assembly file in GNU assembler syntax targeting AArch64,
which implements all the sws_ops functions the NEON backend supports.
-lookup:
Generates a C function with a hierarchical condition chain that
returns the pointer to one of the functions generated above, based on
a given set of parameters derived from SwsOp.
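As a rough illustration of what a hierarchical condition chain might
look like, here is a minimal C sketch. The parameter set (OpParams),
the enum values and the kernel names are all hypothetical stand-ins;
the real generated function works on parameters derived from SwsOp and
returns pointers to generated assembly functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parameter set; the real one is derived from SwsOp. */
enum OpKind { OP_READ, OP_WRITE, OP_SWIZZLE };

typedef struct OpParams {
    enum OpKind kind;
    int elem_bytes;  /* size of one component in bytes */
    int components;  /* number of components touched */
} OpParams;

typedef void (*KernelFn)(void);

/* Empty stand-ins for the generated assembly functions. */
static void read_u8_x3(void)  {}
static void read_u8_x4(void)  {}
static void read_u16_x3(void) {}
static void write_u8_x3(void) {}

/* Narrow down one parameter per nesting level. */
static KernelFn lookup_kernel(const OpParams *p)
{
    assert(p);
    if (p->kind == OP_READ) {
        if (p->elem_bytes == 1) {
            if (p->components == 3) return read_u8_x3;
            if (p->components == 4) return read_u8_x4;
        } else if (p->elem_bytes == 2) {
            if (p->components == 3) return read_u16_x3;
        }
    } else if (p->kind == OP_WRITE) {
        if (p->elem_bytes == 1 && p->components == 3)
            return write_u8_x3;
    }
    return NULL;  /* no kernel for this parameter set */
}
```

In this sketch a NULL return means that no kernel exists for the given
parameters.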
This is the core of the NEON sws_ops backend.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
The runtime assembler interface provides an instruction-level IR and
builder API tailored to the needs of the swscale dynamic pipeline.
It is not meant to be a general purpose assembler interface.
Currently only a static file backend, which emits GNU assembler text,
has been implemented. In the future, this interface will be used to
write functions dynamically at runtime.
This code will be compiled both for runtime usage to generate optimized
functions and for build-time usage to generate static assembly files.
Therefore, it must not depend on internal FFmpeg libraries.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
The NEON sws_ops backend will use a build-time code generator for the
various operation functions it needs to implement. This build-time code
generator (ops_asmgen) will need a list of the operations that must be
implemented. This commit adds a tool (sws_ops_aarch64) that generates
such a list (ops_entries.c).
The list is generated by iterating over all possible conversion
combinations and collecting the parameters for each NEON assembly
function that has to be implemented, defined by a unique set of
parameters derived from SwsOp. Whenever swscale evolves, with improved
optimization passes, new pixel formats, or improvements to the backend
itself, this file (ops_entries.c) should be regenerated by running:
$ make sws_ops_entries_aarch64
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
This is needed to cover the case where the assembled source doesn't
have a .text section. The NASM documentation suggests adding a $ suffix
to the section name for COMDAT in .text, but this actually requires the
main .text section to exist as well. Also use a less generic suffix for
our dummy sub-section.
Third time's the charm.
Fixes: 80cd067715
Signed-off-by: Kacper Michajłow <kasper93@gmail.com>
The existing fate-lavf-yuv420p.y4m covers only the default format.
Add four entries that pass -pix_fmt explicitly to the lavf_video
macro: yuv422p, yuv444p, yuv411p, and gray.
These exercise the branches in yuv4mpegpipe_write_header() that write
the "C422", "C444", "C411", and "Cmono" chroma descriptor strings in
the stream header. All four are gated on ENCDEC(RAWVIDEO,YUV4MPEGPIPE)
and added to FATE_LAVF_VIDEO_SCALE so they inherit the requirement for
CONFIG_SCALE_FILTER that lavf_video's -auto_conversion_filters needs.
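The pairing between the four new pixel formats and the descriptor
strings they are meant to exercise can be written down as a small
lookup. This is only an illustrative sketch: y4m_chroma_tag() is a
hypothetical helper, not the code in yuv4mpegpipe_write_header(), and
it deliberately returns NULL for every other format.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: maps a -pix_fmt name to the chroma descriptor
 * string the corresponding new test is expected to hit. */
static const char *y4m_chroma_tag(const char *pix_fmt)
{
    assert(pix_fmt);
    if (!strcmp(pix_fmt, "yuv422p")) return "C422";
    if (!strcmp(pix_fmt, "yuv444p")) return "C444";
    if (!strcmp(pix_fmt, "yuv411p")) return "C411";
    if (!strcmp(pix_fmt, "gray"))    return "Cmono";
    return NULL;  /* formats not covered by the four new tests */
}
```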
Reference files were generated from the actual encoder output and
follow the md5+size+CRC format used by the other lavf references.
Signed-off-by: Soham Kute <officialsohamkute@gmail.com>
Add tests/api/api-enc-parser-test.c, a generic encoder+parser round-trip
test that takes codec_name, width, and height on the command line
(defaults: h261 176 144).
Three cases are tested:
garbage - a single av_parser_parse2() call on 8 bytes with no Picture
Start Code; verifies out_size == 0 so the parser emits no spurious data.
bulk - encodes 2 frames, concatenates the raw packets, feeds the whole
buffer to a fresh parser in one call, then flushes. Verifies that
exactly 2 non-empty frames come out and that the parser found the PSC
boundary between them.
split - the same buffer fed in two halves (chunk boundary falls inside
frame 0). Verifies the parser still emits exactly 2 frames when input
arrives incrementally, and that the collected bytes are identical to
the bulk output (checked with memcmp).
Implementation notes: avcodec_get_supported_config() selects the pixel
format; chroma height uses AV_CEIL_RSHIFT with log2_chroma_h from
AVPixFmtDescriptor; data[1] and data[2] are checked independently so
semi-planar formats work; the encoded buffer is given
AV_INPUT_BUFFER_PADDING_SIZE zero bytes at the end; parse_stream()
skips the fed chunk if consumed==0 to prevent an infinite loop.
Two FATE entries in tests/fate/api.mak: QCIF (176x144) and CIF
(352x288), both standard H.261 resolutions.
Signed-off-by: Soham Kute <officialsohamkute@gmail.com>
The original test only mapped the source file and printed its content,
exercising none of the error branches in av_file_map().
Replace it with a test that maps a real file (path via argv[1] for
out-of-tree builds) and verifies it is non-empty, then calls
av_file_map() on a nonexistent file twice: once with log_offset=0 to
confirm the error is logged at AV_LOG_ERROR, and once with log_offset=1
to confirm the level is raised by one, covering the
log_level_offset_offset path in av_vlog(). A custom av_log callback
captures the emitted level independently of the global log level.
The two error cases share a single for() loop to avoid duplication.
Add a FATE entry in tests/fate/libavutil.mak with CMP=null since
there is no fixed stdout to compare.
Signed-off-by: Soham Kute <officialsohamkute@gmail.com>
This is consistent with the pattern used in other files. It is also
needed for the next commit to always include x86util.asm.
Signed-off-by: Kacper Michajłow <kasper93@gmail.com>
Fix the default value of mpegts_original_network_id from 0x0001 to
0xff01 to match the actual code (DVB_PRIVATE_NETWORK_START).
Add the missing hevc_digital_hdtv service type to the
mpegts_service_type option list.
Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
While "cc + 1 & 0xf" is technically correct because addition has
higher precedence than bitwise AND in C, the intent of "(cc + 1) & 0xf"
is not immediately obvious without recalling the precedence table.
Add explicit parentheses to make the intended evaluation order clear
and improve readability.
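To make the difference between the two possible readings concrete,
here is a small C sketch. It assumes cc is a 4-bit counter that wraps
at 16; the helper names are hypothetical. C parses "cc + 1 & 0xf" the
same way as next_cc() below, while misread_cc() shows the grouping a
reader might wrongly assume.

```c
#include <assert.h>

/* What "cc + 1 & 0xf" means: increment, then wrap to 4 bits. */
static int next_cc(int cc)
{
    return (cc + 1) & 0xf;
}

/* The other grouping: "1 & 0xf" is just 1, so nothing wraps. */
static int misread_cc(int cc)
{
    return cc + (1 & 0xf);
}
```

For cc == 15 the first helper wraps around to 0, whereas the second
yields 16.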
Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
Instead of this needlessly complicated dance of allocating on-stack copies
of SwsOpList only to iterate with AVERROR(EAGAIN).
This was originally thought to be useful for compiling multiple ops at once,
but even that can be solved in easier ways.
Signed-off-by: Niklas Haas <git@haasn.dev>
This is now fully redundant with the previous op's output, because unused
components are always marked as garbage on the input side.
Signed-off-by: Niklas Haas <git@haasn.dev>
Needed for the upcoming removal of op->comps.unused[]. This keeps the
dependency array entirely within the ff_sws_op_list_update_comps() function,
apart from being arguably simpler and easier to follow.
Signed-off-by: Niklas Haas <git@haasn.dev>
Just define these directly as integer arrays; there's really no point in
having them re-use SwsSwizzleOp; the only place this was ever even remotely
relevant was in the no-op check, which any decent compiler should already
be capable of optimizing into a single 32-bit comparison.
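A minimal sketch of such a no-op check on a plain integer array,
assuming four 8-bit entries (the element type and the helper name
swizzle_is_noop are assumptions, not taken from the code):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: a swizzle is a no-op when every component maps
 * to itself. With four 8-bit entries, a compiler is free to turn the
 * four comparisons into one 32-bit comparison. */
static int swizzle_is_noop(const uint8_t order[4])
{
    assert(order);
    return order[0] == 0 && order[1] == 1 &&
           order[2] == 2 && order[3] == 3;
}
```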
Signed-off-by: Niklas Haas <git@haasn.dev>
"Reconfiguring filter graph because video parameters changed to yuv420p10le(pc, bt709), 1920x1080, unspecified alph"
Fixup f07573f
Adding a missing space fixed this.
For pre-AVX2, vpbroadcastw is emulated via a load, followed
by two shuffles. Yet given that one always wants to splat
multiple pairs of coefficients which are adjacent in memory,
one can do better than that: Load all of them at once, perform
a punpcklwd with itself and use one pshufd per register.
In case one has to sign-extend the coefficients, too,
one can replace the punpcklwd with one pmovsxbw (instead of one
per register) and use pshufd directly afterwards.
This saved 4816B of .text here.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
8 tap motion compensation functions with both vertical and horizontal
components are under severe register pressure, so that the filter
coefficients have to be put on the stack. Before this commit,
this meant that coefficients for use with pmaddubsw and pmaddwd
were always created. Yet this is completely unnecessary, as
every such register is only used for exactly one purpose and
it is known at compile time which one it is (only 8bit horizontal
filters are used with pmaddubsw), so only prepare that one.
This also allows halving the amount of stack used.
This saves 2432B of .text here.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
It has already been checked before that we are only dealing
with high bitdepth here.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Since ba793127c4,
the x86 mpeg4videodsp code uses ff_emulated_edge_mc_sse2()
instead of ff_emulated_edge_mc_8. This leads to linker errors
when x86asm is disabled. Fix this by also falling back to ff_gmc_c()
in case edge emulation is needed with external SSE2 being unavailable.
An alternative is to go back to ff_emulated_edge_mc_8(), but this
would re-add the ugliness to videodsp for a niche case.
Reported-by: James Almer <jamrial@gmail.com>
Reviewed-by: Hendrik Leppkes <h.leppkes@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Some faulty files have an LCEVC descriptor with a single stream, resulting in
a group being created but never fully populated with the current
implementation.
Signed-off-by: James Almer <jamrial@gmail.com>