The US country_code path in parse_itut_t35_metadata() reads
the provider_code with bytestream2_get_be16u(), an unchecked
variant that does not validate the remaining length before
reading. When an AV1 stream contains ITU-T T.35
metadata with country_code set to 0xB5 (which is US) and a
payload shorter than 2 bytes, this results in a heap buffer
overflow: the 2-byte read runs past the end of the allocation.
The UK country_code path already guards against this by
checking the remaining length before the unchecked read.
Apply the same pattern to the US country_code path.
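A minimal sketch of the guarded pattern, using a hypothetical stand-in for the GetByteContext reader rather than the real libavcodec API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-in for libavcodec's GetByteContext; the struct and
 * helper names below are illustrative, not the real FFmpeg definitions. */
typedef struct ByteReader {
    const uint8_t *ptr, *end;
} ByteReader;

static size_t bytes_left(const ByteReader *g)
{
    return (size_t)(g->end - g->ptr);
}

/* Unchecked read, like bytestream2_get_be16u(): the caller is
 * responsible for validating the remaining length first. */
static unsigned get_be16u(ByteReader *g)
{
    unsigned v = (unsigned)(g->ptr[0] << 8) | g->ptr[1];
    g->ptr += 2;
    return v;
}

/* The guarded pattern: check the remaining length before the
 * unchecked read, as the UK country_code path already does. */
static int read_provider_code(ByteReader *g, unsigned *provider_code)
{
    if (bytes_left(g) < 2)
        return -1; /* AVERROR_INVALIDDATA in the real code */
    *provider_code = get_be16u(g);
    return 0;
}
```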
Pwno crafted an AV1 IVF with a metadata OBU containing ITU-T T.35
with country_code=0xB5 and a 1-byte payload. Decoding with libdav1d
triggers the overflow. ASan says:
ERROR: AddressSanitizer: heap-buffer-overflow
READ of size 2 at 0x5020000003f0 thread T0
#0 bytestream_get_be16 src/libavcodec/bytestream.h:98
#1 bytestream2_get_be16u src/libavcodec/bytestream.h:98
#2 parse_itut_t35_metadata src/libavcodec/libdav1d.c:376
0x5020000003f1 is located 0 bytes after 1-byte region
Found-by: Pwno
Allows the compiler to optimize the aliasing checks away
and saves 5376B here (GCC 15, -O3).
Also, avoid converting the stride to uint16_t units for >8bpp:
stride /= sizeof(pixel) would use an unsigned division
(i.e. a logical right shift*), which is not what is intended here.
*: if size_t is the unsigned type corresponding to ptrdiff_t
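A self-contained sketch of the pitfall (the pixel typedef and helper names are illustrative): sizeof yields size_t, so the ptrdiff_t stride is converted to unsigned before the division.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t pixel; /* the >8bpp case */

/* stride /= sizeof(pixel): the (possibly negative) ptrdiff_t stride
 * is converted to size_t and divided with an unsigned division,
 * i.e. a logical right shift for a power-of-two divisor. */
static ptrdiff_t bytes_to_pixels_unsigned(ptrdiff_t stride)
{
    stride /= sizeof(pixel); /* unsigned division */
    return stride;
}

/* The intended behavior: a signed division. */
static ptrdiff_t bytes_to_pixels_signed(ptrdiff_t stride)
{
    stride /= (ptrdiff_t)sizeof(pixel); /* signed division */
    return stride;
}
```

For a positive stride both agree; for a negative stride the unsigned variant wraps to a huge positive value instead of -32.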
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Apple VideoToolbox is the dominant producer of hevc-alpha videos, but
early versions generate non-standard VPS extensions that fail to
parse and return AVERROR_INVALIDDATA. Fix this by returning
AVERROR_PATCHWELCOME instead of AVERROR_INVALIDDATA for unsupported
VPS extension configurations. Also set poc_lsb_not_present for
the alpha layer in the fallback path when it has no direct
dependency on the base layer, so that IDR slices on the alpha
layer won't incorrectly read pic_order_cnt_lsb.
Fix #22384
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
ff_frame_new_side_data() may set sd to NULL and return 0 when
side_data_pref() determines that existing side data should be
preferred.
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
ff_frame_new_side_data() may set sd to NULL and return 0 when
side_data_pref() determines that existing side data should be
preferred.
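A sketch of the contract and the resulting caller-side check, using an illustrative stand-in rather than the real ff_frame_new_side_data() signature:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative model of the contract: on success the function may
 * still set *sd to NULL when a preference check decides that the
 * existing side data should be kept. */
typedef struct SideData { unsigned char data[16]; } SideData;

static int new_side_data(int prefer_existing, SideData *storage, SideData **sd)
{
    if (prefer_existing) {
        *sd = NULL; /* success, but nothing to fill in */
        return 0;
    }
    *sd = storage;
    return 0;
}

/* The caller must check the pointer, not just the return value. */
static int attach_metadata(int prefer_existing, SideData *storage)
{
    SideData *sd;
    int ret = new_side_data(prefer_existing, storage, &sd);
    if (ret < 0)
        return ret;
    if (sd) /* the fix: don't dereference a NULL sd on success */
        memset(sd->data, 0xAA, sizeof(sd->data));
    return 0;
}
```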
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
It is only needed in the unlikely codepath. The ordinary one
only uses six xmm registers.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Add NEON-optimized implementation for HEVC intra Planar prediction at
8-bit depth, supporting all block sizes (4x4 to 32x32).
Planar prediction implements bilinear interpolation using an incremental
base update: base_{y+1}[x] = base_y[x] - (top[x] - left[N]), reducing
per-row computation from 4 multiply-adds to 1 subtract + 1 multiply.
Uses rshrn for rounded narrowing shifts, eliminating manual rounding
bias. All left[y] values are broadcast in the NEON domain, avoiding
GP-to-NEON transfers.
4x4 interleaves row computations across 4 rows to break dependencies.
16x16 uses v19-v22 for persistent base/decrement vectors, avoiding
callee-saved register spills. 32x32 processes 8 rows per loop iteration
(4 iterations total) to reduce code size while maintaining full NEON
utilization.
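The incremental scheme can be checked against the direct bilinear formula in scalar C (an 8x8 sketch with illustrative names; the real implementation is NEON assembly):

```c
#include <assert.h>
#include <stdint.h>

#define N 8
#define LOG2N 3

/* Direct form of HEVC planar prediction (bilinear interpolation);
 * top[N] is the top-right sample, left[N] the bottom-left one. */
static void planar_ref(uint8_t dst[N][N], const uint8_t top[N + 1],
                       const uint8_t left[N + 1])
{
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++)
            dst[y][x] = (uint8_t)(((N - 1 - x) * left[y] + (x + 1) * top[N] +
                                   (N - 1 - y) * top[x] + (y + 1) * left[N] +
                                   N) >> (LOG2N + 1));
}

/* Incremental form: fold the row-invariant terms into base[], then
 * update it per row with a single subtract:
 * base_{y+1}[x] = base_y[x] - (top[x] - left[N]). */
static void planar_inc(uint8_t dst[N][N], const uint8_t top[N + 1],
                       const uint8_t left[N + 1])
{
    int base[N];
    for (int x = 0; x < N; x++)
        base[x] = (N - 1) * top[x] + left[N] + (x + 1) * top[N] + N;
    for (int y = 0; y < N; y++) {
        for (int x = 0; x < N; x++)
            dst[y][x] = (uint8_t)((base[x] + (N - 1 - x) * left[y])
                                  >> (LOG2N + 1));
        for (int x = 0; x < N; x++)
            base[x] -= top[x] - left[N]; /* 1 subtract per element */
    }
}
```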
Speedup over C on Apple M4 (checkasm --bench):
4x4: 2.25x 8x8: 6.40x 16x16: 9.72x 32x32: 3.21x
Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
Add NEON-optimized implementation for HEVC intra DC prediction at 8-bit
depth, supporting all block sizes (4x4 to 32x32).
DC prediction computes the average of top and left reference samples
using uaddlv, with urshr for rounded division. For luma blocks smaller
than 32x32, edge smoothing is applied: the first row and column are
blended toward the reference using (ref[i] + 3*dc + 2) >> 2 computed
entirely in the NEON domain. Fill stores use pre-computed address
patterns to break dependency chains.
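A scalar model of the computation the NEON code vectorizes (8x8 luma case, names illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define N 8

/* dc is the rounded average of the 2N reference samples; for luma
 * blocks smaller than 32x32, the first row and column are blended
 * toward the references with (ref + 3*dc + 2) >> 2 and the corner
 * with (left[0] + 2*dc + top[0] + 2) >> 2. */
static void pred_dc_8x8(uint8_t dst[N][N], const uint8_t top[N],
                        const uint8_t left[N])
{
    int sum = 0, dc;
    for (int i = 0; i < N; i++)
        sum += top[i] + left[i];
    dc = (sum + N) >> 4; /* rounded division by 2N; 4 == log2(N) + 1 */

    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++)
            dst[y][x] = (uint8_t)dc;

    /* edge smoothing (luma, size < 32) */
    dst[0][0] = (uint8_t)((left[0] + 2 * dc + top[0] + 2) >> 2);
    for (int x = 1; x < N; x++)
        dst[0][x] = (uint8_t)((top[x] + 3 * dc + 2) >> 2);
    for (int y = 1; y < N; y++)
        dst[y][0] = (uint8_t)((left[y] + 3 * dc + 2) >> 2);
}
```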
Also adds the aarch64 initialization framework (Makefile, pred.c/pred.h
hooks, hevcpred_init_aarch64.c).
Speedup over C on Apple M4 (checkasm --bench):
4x4: 2.28x 8x8: 3.14x 16x16: 3.29x 32x32: 3.02x
Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
In case of >8bpp, there is already a zero register available
(for clipping); in case of Unix64, one can simply use an
unused register. Doing so reduces codesize.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Avoids push+pop on Win64; in any case, using registers m0-m7
more often saves codesize.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Avoids push+pop on Win64; in any case, using registers m0-m7
more often saves codesize.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Also use a register in the 0-7 range as clobber reg,
as this reduces codesize (by 51B).
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The height 8 and 16 cases differ from the second BDOF mini block onwards,
but even the beginning of said mini block is the same and can therefore
be deduplicated. This saves 821B here.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
m8 here (corresponding to a mix of sgx2 and sgy2 in derive_bdof_vx_vy
in the C version) is always nonnegative, so the psignd boils down to
a check for m8 being zero. But if an entry of m8 is zero, then
the corresponding entry of m9 is automatically zero, too, as sgx2
being zero implies sgxdi being zero and sgy2 being zero implies
sgxgy and sgydi being zero.* So just remove these redundant
instructions.
*: In other words, one could remove the sgx2,sgy2>0 checks from
the end of derive_bdof_vx_vy() as long as av_log2(0) is defined.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
For pre-AVX2, vpbroadcastw is emulated via a load, followed
by two shuffles. Yet given that one always wants to splat
multiple pairs of coefficients which are adjacent in memory,
one can do better than that: Load all of them at once, perform
a punpcklwd with itself and use one pshufd per register.
In case one has to sign-extend the coefficients, too,
one can replace the punpcklwd with one pmovsxbw (instead of one
per register) and use pshufd directly afterwards.
This saved 4816B of .text here.
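The emulation can be modeled word-by-word in scalar C (hypothetical helpers; the real code operates on XMM registers, sharing the punpcklwd result across all splats):

```c
#include <assert.h>
#include <stdint.h>

/* Word-level model of punpcklwd: interleave the low four words. */
static void punpcklwd(uint16_t dst[8], const uint16_t a[8],
                      const uint16_t b[8])
{
    uint16_t t[8];
    for (int i = 0; i < 4; i++) {
        t[2 * i]     = a[i];
        t[2 * i + 1] = b[i];
    }
    for (int i = 0; i < 8; i++)
        dst[i] = t[i];
}

/* Dword-level model of pshufd: each 2-bit field of imm selects the
 * source dword for the corresponding destination dword. */
static void pshufd(uint16_t dst[8], const uint16_t src[8], int imm)
{
    uint16_t t[8];
    for (int i = 0; i < 4; i++) {
        int s = (imm >> (2 * i)) & 3;
        t[2 * i]     = src[2 * s];
        t[2 * i + 1] = src[2 * s + 1];
    }
    for (int i = 0; i < 8; i++)
        dst[i] = t[i];
}

/* Broadcast coefficient i (of the low four): punpcklwd with itself
 * duplicates each word into a dword; one pshufd then splats it. */
static void splat_coeff(uint16_t dst[8], const uint16_t coeffs[8], int i)
{
    uint16_t dup[8];
    punpcklwd(dup, coeffs, coeffs); /* c0 c0 c1 c1 c2 c2 c3 c3 */
    pshufd(dst, dup, i * 0x55);     /* all four imm fields == i */
}
```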
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
8 tap motion compensation functions with both vertical and horizontal
components are under severe register pressure, so that the filter
coefficients have to be put on the stack. Before this commit,
this meant that coefficients for use with pmaddubsw and pmaddwd
were always created. Yet this is completely unnecessary, as
every such register is only used for exactly one purpose and
it is known at compile time which one it is (only 8bit horizontal
filters are used with pmaddubsw), so only prepare that one.
This also allows halving the amount of stack used.
This saves 2432B of .text here.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
It has already been checked before that we are only dealing
with high bitdepth here.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Since ba793127c4,
the x86 mpeg4videodsp code uses ff_emulated_edge_mc_sse2()
instead of ff_emulated_edge_mc_8. This leads to linker errors
when x86asm is disabled. Fix this by also falling back to ff_gmc_c()
in case edge emulation is needed with external SSE2 being unavailable.
An alternative is to go back to ff_emulated_edge_mc_8(), but this
would re-add the ugliness to videodsp for a niche case.
Reported-by: James Almer <jamrial@gmail.com>
Reviewed-by: Hendrik Leppkes <h.leppkes@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>