Apple VideoToolbox is the dominant producer of hevc-alpha videos, but
early versions generate non-standard VPS extensions that fail to
parse and return AVERROR_INVALIDDATA. Fix this by returning
AVERROR_PATCHWELCOME instead of AVERROR_INVALIDDATA for unsupported
VPS extension configurations. Also set poc_lsb_not_present for the
alpha layer in the fallback path when it has no direct dependency
on the base layer, so that IDR slices on the alpha layer won't
incorrectly read pic_order_cnt_lsb.
Fix #22384
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
ff_frame_new_side_data() may set sd to NULL and return 0 when
side_data_pref() determines that existing side data should be
preferred.
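A minimal sketch of the caller-side check this implies. The helper
new_side_data_mock() and the MockSideData type are hypothetical stand-ins
that only mimic the contract described above (return 0 with *sd left NULL
when existing side data is preferred); they are not FFmpeg's
ff_frame_new_side_data().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a side data entry. */
typedef struct MockSideData {
    int payload;
} MockSideData;

/*
 * Hypothetical mock of the contract described above: returns 0 on
 * success, but leaves *sd == NULL when existing side data is preferred.
 */
static int new_side_data_mock(int prefer_existing, MockSideData *storage,
                              MockSideData **sd)
{
    assert(storage && sd);
    *sd = prefer_existing ? NULL : storage;
    return 0;
}

/* Returns 1 if the payload was written, 0 if skipped, <0 on error. */
static int attach_payload(int prefer_existing, MockSideData *storage,
                          int value)
{
    MockSideData *sd;
    int ret = new_side_data_mock(prefer_existing, storage, &sd);

    if (ret < 0)
        return ret;
    if (!sd)            /* success, but there is nothing to fill in */
        return 0;

    sd->payload = value;
    return 1;
}
```

The point of the pattern is that a zero return value alone does not
guarantee a usable pointer, so both checks are needed before writing.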
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
ff_frame_new_side_data() may set sd to NULL and return 0 when
side_data_pref() determines that existing side data should be
preferred.
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
libvidstab's vsTransformPrepare() takes different internal code paths
for in-place (src == dest) vs. separate-buffer operation. The
separate-buffer path stores a shallow copy of the source frame pointer
in td->src without allocating internal memory (srcMalloced stays 0).
When a subsequent frame takes the in-place path, vsFrameIsNull(&td->src)
is false so vsFrameAllocate() is skipped, and vsFrameCopy() writes into
the stale pointer left over from the previous frame, corrupting memory
that the caller no longer owns.
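The failure mode can be illustrated with a simplified model built only
from the description above. MockFrame, MockTransform and mock_prepare()
are hypothetical names and do not reproduce libvidstab's actual code; the
borrowed buffer is kept alive here so the stray write is observable
without undefined behaviour.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FRAME_BYTES 4

/* Hypothetical, heavily reduced frame and transform state. */
typedef struct MockFrame {
    uint8_t *data;
} MockFrame;

typedef struct MockTransform {
    MockFrame src;                 /* cached source frame */
    int       src_malloced;        /* 1 only if src.data is owned storage */
    uint8_t   owned[FRAME_BYTES];  /* stands in for an internal allocation */
} MockTransform;

static void mock_prepare(MockTransform *td, const MockFrame *src,
                         const MockFrame *dest)
{
    assert(td && src && dest);
    if (src == dest) {                  /* in-place path */
        if (!td->src.data) {            /* "allocate" only when null */
            td->src.data     = td->owned;
            td->src_malloced = 1;
        }
        /* Copies into whatever td->src points at, owned or borrowed. */
        memcpy(td->src.data, src->data, FRAME_BYTES);
    } else {                            /* separate-buffer path */
        td->src = *src;                 /* shallow copy, nothing owned */
    }
}
```

Calling mock_prepare() first with distinct source and destination frames
and then in place with a second frame overwrites the first frame's buffer,
which mirrors the alternation described above.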
Whether a given frame is writable depends on pipeline scheduling and
frame reference management, which can change between FFmpeg versions.
Since FFmpeg 8.1, changes in the scheduler caused some frames to arrive
as non-writable, leading to alternation between in-place and
separate-buffer paths that triggered the bug.
Fix this by marking the input pad with AVFILTERPAD_FLAG_NEEDS_WRITABLE.
Fix #22595
We currently don't have any cases where this is needed, but include
it for completeness and clarity.
These macros for BTI were added in
08b4716a9e.
A later comment in this file, added in
248986a0db, referenced the macro
AARCH64_VALID_JUMP_CALL_TARGET, which had never been added here.
Unit test covering av_video_enc_params_alloc,
av_video_enc_params_block, and
av_video_enc_params_create_side_data.
Tests allocation for all three codec types (VP9, H264, MPEG2) and
the NONE type, with 0 and 4 blocks, with and without size output.
Verifies block getter indexing by writing and reading back
coordinates, dimensions, and delta_qp values. Tests frame-level qp
and delta_qp fields, and side data creation with frame attachment.
Coverage for libavutil/video_enc_params.c: 0.00% -> 86.21%
(remaining uncovered lines are OOM error paths)
Signed-off-by: marcos ashton <marcosashiglesias@gmail.com>
Unit test covering av_detection_bbox_alloc, av_get_detection_bbox,
and av_detection_bbox_create_side_data.
Tests allocation with 0, 1, and 4 bounding boxes, with and without
size output. Verifies bbox getter indexing by writing and reading
back coordinates, labels, and confidence values. Tests classify
fields (labels and confidences), the header source field, and
side data creation with frame attachment.
Coverage for libavutil/detection_bbox.c: 0.00% -> 86.67%
(remaining uncovered lines are OOM error paths)
Signed-off-by: marcos ashton <marcosashiglesias@gmail.com>
Unit test covering all 4 public API functions in libavutil/spherical.c:
av_spherical_alloc, av_spherical_projection_name, av_spherical_from_name,
and av_spherical_tile_bounds.
Tests allocation with and without size output, all 7 projection type
name lookups, projection name round-trip verification, out-of-range
handling, and tile bounds computation for full-frame, quarter-tile,
and centered-tile configurations.
Coverage for libavutil/spherical.c: 0.00% -> 100.00%
Signed-off-by: marcos ashton <marcosashiglesias@gmail.com>
It is only needed in the unlikely codepath. The ordinary one
only uses six xmm registers.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Only the process functions are entered via an indirect _call_ from C.
The kernel functions and process_return are dispatched to by indirect
_branches_ instead (continuation-passing style design).
Make use of the recently added "jumpable" parameter to the function
macro in libavutil/aarch64/asm.S to fix these functions when BTI is
enabled.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
The function macro emits AARCH64_VALID_CALL_TARGET for exported symbols,
marking them as valid destinations for indirect _calls_. Functions that
are reached by indirect _branches_ (i.e. tail-call dispatch chains
where the link register is not set) require AARCH64_VALID_JUMP_TARGET
instead.
This commit adds a "jumpable" parameter to the function macro that, when
set, emits AARCH64_VALID_JUMP_TARGET instead of AARCH64_VALID_CALL_TARGET.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>
Using AMF interfaces in C can be cumbersome and visually difficult to
process in some cases, e.g. object->function(object, args). To improve
code readability, a new macro is added. This commit prepares for future
AMF integration refactoring.
- vf_vpp_amf.c: Remove unused variables.
- vf_amf_common.c: Fix hdrmeta_buffer memory leak.
- hwcontext_amf.c: Fix av_amf_extract_hdr_metadata not picking up light
  metadata if display mastering metadata is not set.
- doc/filters.texi: Remove irrelevant example with HDR metadata for
  vpp_amf.
The use of a code section (.text) was forced by the unreleased NASM
3.02rc3, which made the issue worse by preventing assembling anything
without a code section, including when only data was present.
This works fine for the most part, but using a code (.text) section with
IMAGE_COMDAT_SELECT_ANY causes issues with lib.exe after stripping such
an object:
fatal error LNK1143: invalid or corrupt file: no symbol for COMDAT section 0x2
Essentially it makes our workaround not work in all cases, and while
stripping could be disabled like it already is for MSVC/ICL builds, it
used to work so let's preserve that state.
This makes it incompatible with NASM 3.02rc3 when CV debug info is
generated, but hopefully the upstream fix will be merged before release,
to avoid this regression:
https://github.com/netwide-assembler/nasm/pull/221
Signed-off-by: Kacper Michajłow <kasper93@gmail.com>
Add NEON-optimized implementation for HEVC intra Planar prediction at
8-bit depth, supporting all block sizes (4x4 to 32x32).
Planar prediction implements bilinear interpolation using an incremental
base update: base_{y+1}[x] = base_y[x] - (top[x] - left[N]), reducing
per-row computation from 4 multiply-adds to 1 subtract + 1 multiply.
Uses rshrn for rounded narrowing shifts, eliminating manual rounding
bias. All left[y] values are broadcast in the NEON domain, avoiding
GP-to-NEON transfers.
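The incremental update can be checked against the direct formula with a
scalar sketch. This is not the NEON code from this commit: the function
names are hypothetical, the rounding term is added explicitly instead of
using a rounding shift, and top[] and left[] are assumed to hold N+1
samples each (index N being the top-right and bottom-left reference).

```c
#include <assert.h>
#include <stdint.h>

#define MAX_N 32

/* Direct evaluation of the HEVC planar formula (reference). */
static void planar_direct(uint8_t *dst, int stride, const uint8_t *top,
                          const uint8_t *left, int log2_size)
{
    int n = 1 << log2_size;

    for (int y = 0; y < n; y++)
        for (int x = 0; x < n; x++)
            dst[y * stride + x] =
                ((n - 1 - x) * left[y] + (x + 1) * top[n] +
                 (n - 1 - y) * top[x]  + (y + 1) * left[n] + n)
                >> (log2_size + 1);
}

/* Same result, with the row-invariant terms kept in a running base. */
static void planar_incremental(uint8_t *dst, int stride,
                               const uint8_t *top, const uint8_t *left,
                               int log2_size)
{
    int n = 1 << log2_size;
    int base[MAX_N], dec[MAX_N];

    assert(n <= MAX_N);
    for (int x = 0; x < n; x++) {
        /* base_0[x]: every term of row 0 that does not involve left[y] */
        base[x] = (x + 1) * top[n] + (n - 1) * top[x] + left[n];
        dec[x]  = top[x] - left[n];
    }
    for (int y = 0; y < n; y++) {
        for (int x = 0; x < n; x++) {
            /* one multiply per sample, then the rounded shift */
            dst[y * stride + x] =
                (base[x] + (n - 1 - x) * left[y] + n) >> (log2_size + 1);
            /* base_{y+1}[x] = base_y[x] - (top[x] - left[N]) */
            base[x] -= dec[x];
        }
    }
}
```

Both functions should produce identical output for every block size from
4x4 to 32x32, since the second only reorders the arithmetic.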
4x4 interleaves row computations across 4 rows to break dependencies.
16x16 uses v19-v22 for persistent base/decrement vectors, avoiding
callee-saved register spills. 32x32 processes 8 rows per loop iteration
(4 iterations total) to reduce code size while maintaining full NEON
utilization.
Speedup over C on Apple M4 (checkasm --bench):
4x4: 2.25x 8x8: 6.40x 16x16: 9.72x 32x32: 3.21x
Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
Add NEON-optimized implementation for HEVC intra DC prediction at 8-bit
depth, supporting all block sizes (4x4 to 32x32).
DC prediction computes the average of top and left reference samples
using uaddlv, with urshr for rounded division. For luma blocks smaller
than 32x32, edge smoothing is applied: the first row and column are
blended toward the reference using (ref[i] + 3*dc + 2) >> 2 computed
entirely in the NEON domain. Fill stores use pre-computed address
patterns to break dependency chains.
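For reference, the arithmetic described above can be written as a scalar
sketch. The function name dc_pred_8() and its is_luma parameter are
hypothetical, and the corner sample formula is taken from the HEVC
specification rather than from this commit message; the NEON code itself
is organised differently.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar 8-bit DC prediction for an n x n block, n = 1 << log2_size. */
static void dc_pred_8(uint8_t *dst, int stride, const uint8_t *top,
                      const uint8_t *left, int log2_size, int is_luma)
{
    int n   = 1 << log2_size;
    int sum = n;                    /* rounding term for the division */
    int dc;

    assert(log2_size >= 2 && log2_size <= 5);
    for (int i = 0; i < n; i++)
        sum += top[i] + left[i];
    dc = sum >> (log2_size + 1);    /* average of the 2n references */

    for (int y = 0; y < n; y++)
        for (int x = 0; x < n; x++)
            dst[y * stride + x] = dc;

    /* Edge smoothing: luma blocks smaller than 32x32 only. */
    if (is_luma && n < 32) {
        dst[0] = (left[0] + 2 * dc + top[0] + 2) >> 2;
        for (int x = 1; x < n; x++)
            dst[x] = (top[x] + 3 * dc + 2) >> 2;
        for (int y = 1; y < n; y++)
            dst[y * stride] = (left[y] + 3 * dc + 2) >> 2;
    }
}
```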
Also adds the aarch64 initialization framework (Makefile, pred.c/pred.h
hooks, hevcpred_init_aarch64.c).
Speedup over C on Apple M4 (checkasm --bench):
4x4: 2.28x 8x8: 3.14x 16x16: 3.29x 32x32: 3.02x
Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
Add checkasm test for HEVC intra prediction covering DC, planar, and
angular modes at all block sizes (4x4 to 32x32) for 8-bit and 10-bit
depth.
Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
In case of >8bpp, there is already a zero register available
(for clipping); in case of Unix64, one can simply use an
unused register. Doing so reduces codesize.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Avoids push+pop on Win64; in any case, using registers m0-m7
more often saves codesize.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Avoids push+pop on Win64; in any case, using registers m0-m7
more often saves codesize.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Also use a register in the 0-7 range as clobber reg,
as this reduces codesize (by 51B).
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The height 8 and 16 cases differ from the second BDOF mini block onwards,
but even the beginning of said mini block is the same and can therefore
be deduplicated. This saves 821B here.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
m8 here (corresponding to a mix of sgx2 and sgy2 in derive_bdof_vx_vy
in the C version) is always nonnegative, so the psignd boils down to
a check for m8 being zero. But if an entry of m8 is zero, then
the corresponding entry of m9 is automatically zero, too, as sgx2
being zero implies sgxdi being zero and sgy2 being zero implies sgxgy
and sgydi being zero.* So just remove these redundant instructions.
*: In other words, one could remove the sgx2,sgy2>0 checks from
the end of derive_bdof_vx_vy() as long as av_log2(0) is defined.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This commit pieces together the previous few commits to implement the
NEON backend for sws_ops.
In essence, a tool which runs on the target (sws_ops_aarch64) is used
to enumerate all the functions that the backend needs to implement. The
list it generates is stored in the repository (ops_entries.c).
The list from above is used at build time by a code generator tool
(ops_asmgen) to implement all the sws_ops functions the NEON backend
supports, and generate a lookup function in C to retrieve the assembly
function pointers.
At runtime, the NEON backend fetches the function pointers to the
assembly functions and chains them together in a continuation-passing
style design, similar to the x86 backend.
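The chaining idea can be sketched in plain C. All names below (Step,
k_add, k_shl, k_return, process) are hypothetical and the kernels are toy
operations; the real backend chains generated assembly functions, where
the hand-off to the next kernel is an indirect branch rather than a C
call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Step Step;
typedef void (*KernelFn)(const Step *self, uint8_t *px, size_t n);

/* One entry of the chain: a kernel plus its per-kernel constant. */
struct Step {
    KernelFn fn;
    int      arg;
};

/* Each kernel does its work, then continues into the following step. */
static void k_add(const Step *self, uint8_t *px, size_t n)
{
    for (size_t i = 0; i < n; i++)
        px[i] = (uint8_t)(px[i] + self->arg);
    self[1].fn(&self[1], px, n);    /* continuation */
}

static void k_shl(const Step *self, uint8_t *px, size_t n)
{
    for (size_t i = 0; i < n; i++)
        px[i] = (uint8_t)(px[i] << self->arg);
    self[1].fn(&self[1], px, n);    /* continuation */
}

/* Terminates the chain and returns control to process(). */
static void k_return(const Step *self, uint8_t *px, size_t n)
{
    (void)self; (void)px; (void)n;
}

/* Entry point: the only function reached by a regular call. */
static void process(const Step *chain, uint8_t *px, size_t n)
{
    assert(chain && chain->fn);
    chain->fn(chain, px, n);
}
```

A chain such as { k_add, 3 }, { k_shl, 1 }, { k_return, 0 } applied to a
pixel buffer runs the two operations in order and then returns.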
The following speedup is observed from legacy swscale to NEON:
A520: Overall speedup=3.780x faster, min=0.137x max=91.928x
A720: Overall speedup=4.129x faster, min=0.234x max=92.424x
And the following from the C sws_ops implementation to NEON:
A520: Overall speedup=5.513x faster, min=0.927x max=14.169x
A720: Overall speedup=4.786x faster, min=0.585x max=20.157x
The slowdowns from legacy to NEON are the same as for C/x86. They are
mostly low bit-depth conversions that did not perform dithering in legacy.
The 0.585x outlier from C to NEON is gbrpf32le -> gbrapf32le, which is
mostly memcpy with the C implementation. All other conversions are
better.
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Ramiro Polla <ramiro.polla@gmail.com>