When check_cflags -mvsx fails, the && short-circuit prevents
check_cc from running. Since check_cc is responsible for
disabling vsx on failure, skipping it leaves vsx incorrectly
enabled.
Fix by removing the && so check_cc always executes.
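To illustrate the short-circuit, here is a minimal C model of the two checks. check_cflags_vsx() and check_cc_vsx() are hypothetical stand-ins for the configure shell functions. Only the `&&` semantics carry over; the real checks live in the shell script.

```c
#include <stdbool.h>

/* Hypothetical stand-in for check_cflags -mvsx: just reports success. */
static bool check_cflags_vsx(bool accepted)
{
    return accepted;
}

/* Hypothetical stand-in for check_cc: this is the check that disables vsx. */
static bool check_cc_vsx(bool *vsx_enabled, bool compiles)
{
    if (!compiles)
        *vsx_enabled = false;
    return compiles;
}

/* Buggy chaining: check_cc_vsx() never runs when the first check fails. */
static bool vsx_after_buggy(bool cflags_ok, bool cc_ok)
{
    bool vsx_enabled = true;
    (void)(check_cflags_vsx(cflags_ok) && check_cc_vsx(&vsx_enabled, cc_ok));
    return vsx_enabled;
}

/* Fixed: both checks always run, so a failing compile always disables vsx. */
static bool vsx_after_fixed(bool cflags_ok, bool cc_ok)
{
    bool vsx_enabled = true;
    check_cflags_vsx(cflags_ok);
    check_cc_vsx(&vsx_enabled, cc_ok);
    return vsx_enabled;
}
```

With both checks failing, vsx_after_buggy() leaves vsx enabled, while vsx_after_fixed() disables it.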
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
They have been superseded by SSSE3; the SSE2 version was even disabled
(and segfaults if enabled).
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Compared to the MMX version, this version benefits from wider
registers and pmaddubsw. It also has fewer unnecessary loads
and stores: on x64, the MMX version has 12 unnecessary GPR loads
and six stores per line when the width is eight; for width 16,
there are 17 unnecessary GPR loads and six stores per line.
Even the 32-bit SSSE3 version has only six loads and zero stores
per line more than the x64 version. Furthermore, unlike the MMX
version, the SSSE3 version does not clobber the array of block
pointers passed to it.
Benchmarks:
inner_add_yblock_2_c: 29.2 ( 1.00x)
inner_add_yblock_2_mmx: 32.5 ( 0.90x)
inner_add_yblock_2_ssse3: 28.6 ( 1.02x)
inner_add_yblock_4_c: 85.2 ( 1.00x)
inner_add_yblock_4_mmx: 89.2 ( 0.96x)
inner_add_yblock_4_ssse3: 84.5 ( 1.01x)
inner_add_yblock_8_c: 302.0 ( 1.00x)
inner_add_yblock_8_mmx: 77.0 ( 3.92x)
inner_add_yblock_8_ssse3: 30.6 ( 9.85x)
inner_add_yblock_16_c: 1164.7 ( 1.00x)
inner_add_yblock_16_mmx: 260.4 ( 4.47x)
inner_add_yblock_16_ssse3: 82.3 (14.15x)
Both the MMX and SSSE3 versions leave the size 2 and 4 cases
to ff_snow_inner_add_yblock_c() (but the MMX version has
a prologue at the beginning that it needs to undo before
the call, leading to the higher overhead for these sizes).
I don't know why the SSSE3 version is marginally faster than
the C version in these cases.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Due to an operator precedence problem, the first loop was never
entered; as a result, the second loop initialized everything,
which was not the intent.
This was introduced in 56b8769a1c.
Sorry for this.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Only inner_add_yblock for now.
Hint: This function takes a pointer to an array of pointers as a parameter.
The MMX version clobbers the array in such a way that calling the
function repeatedly with the same arguments (as happens inside bench_new())
leads to buffer overflows and segfaults. Therefore CALL4 had to be
overridden to restore the original pointers. This workaround will be
removed soon, together with the MMX version.
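The workaround can be sketched in plain C: snapshot the pointer array before the benchmark loop and restore it after every call. This shows the idea only and is not checkasm's actual CALL4 macro. block_fn, clobbering_fn(), call_repeatedly() and NB_BLOCKS are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

#define NB_BLOCKS 4

/* Hypothetical signature: a function receiving an array of block pointers. */
typedef void (*block_fn)(uint8_t **block, int stride);

/* Example of a clobbering function: it advances each pointer in place. */
static void clobbering_fn(uint8_t **block, int stride)
{
    for (int i = 0; i < NB_BLOCKS; i++)
        block[i] += stride;
}

/* Call fn repeatedly with the same arguments, restoring the pointer array
 * after each call so that every call sees the original pointers. */
static void call_repeatedly(block_fn fn, uint8_t **block, int stride, int runs)
{
    uint8_t *saved[NB_BLOCKS];

    memcpy(saved, block, sizeof(saved));
    for (int r = 0; r < runs; r++) {
        fn(block, stride);
        memcpy(block, saved, sizeof(saved));
    }
}
```

Without the restore, each iteration would start from pointers already advanced by the previous call, which is how repeated benchmarking runs off the end of the buffers.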
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
It is unnecessary and avoids the src_y parameter;
it also makes this function more ASM-friendly.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The input lines used in ff_snow_inner_add_yblock()
must always be set (because their values are used).
The MMX assembly always relied on this.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This was done in 561a18d3ba
in order to avoid shifts, but that rationale has not applied
since d593e32983. So shift them back;
this is in preparation for using these coefficients together with
pmaddubsw.
Hint: 561a18d3ba also added a block
guarded by "if(LOG2_OBMC_MAX == 8". I removed this check from the
condition (i.e. kept the block), which should not change the output
at all. Yet all FATE tests also pass if the block is removed
entirely, so I don't know whether this block is necessary at all.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Possible now that the SSE2 function is available
even when the stack is not aligned.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
x86-32 has one GPR too few, so one argument needs to be read from
the stack. If the stack needs to be realigned, the original location
of that argument can no longer be accessed, so just request a bit
more stack space and copy the argument to a fixed offset from the
new stack.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Only the lower quadword needs to be rotated, because
the register is zero-extended immediately afterwards anyway.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Write the 24-bit vpcC flags field at the current cursor position after
the version byte. The previous code wrote to p+1 instead of p, leaving
one byte uninitialized between version and flags and shifting all
subsequent fields (profile, level, bitdepth, etc.) by one byte.
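The fix amounts to writing the 24-bit flags at the cursor and then advancing it by three bytes. Below is a minimal sketch of the start of a vpcC payload. wb24() and write_vpcc_head() are hypothetical helpers, not FFmpeg functions, and the sketch covers only version, flags, profile and level.

```c
#include <stdint.h>

/* Hypothetical helper: store a 24-bit big-endian value (FullBox flags). */
static void wb24(uint8_t *p, uint32_t v)
{
    p[0] = (v >> 16) & 0xff;
    p[1] = (v >> 8) & 0xff;
    p[2] = v & 0xff;
}

/* Hypothetical writer: version byte, 24-bit flags, then profile and level.
 * Returns the number of bytes written. */
static int write_vpcc_head(uint8_t *buf, uint8_t version, uint32_t flags,
                           uint8_t profile, uint8_t level)
{
    uint8_t *p = buf;

    *p++ = version;
    wb24(p, flags);   /* at the cursor; writing to p + 1 leaves a gap */
    p += 3;
    *p++ = profile;
    *p++ = level;
    return (int)(p - buf);
}
```

Writing the flags at p + 1 instead leaves buf[1] untouched and pushes profile and level one byte too far.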
Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
Return the actual find_sei_end() error when SEI appending fails instead of
reusing the previous status code. This preserves the real parse failure for
callers instead of reporting malformed SEI handling as success.
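The pattern is to return the helper's own negative code instead of an earlier status variable. In this sketch, find_sei_end_stub() and append_sei() are hypothetical stand-ins for find_sei_end() and its caller, and -EINVAL stands in for whatever error the real parser returns.

```c
#include <errno.h>

/* Hypothetical stand-in for find_sei_end(): negative error on malformed
 * input, otherwise the number of bytes consumed. */
static int find_sei_end_stub(int malformed)
{
    return malformed ? -EINVAL : 16;
}

/* Hypothetical caller; 'status' is an earlier, unrelated result. */
static int append_sei(int status, int malformed)
{
    int ret = find_sei_end_stub(malformed);

    if (ret < 0)
        return ret;   /* propagate the real error, not 'status' */
    return status;
}
```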
Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
This was originally introduced by commit 05d6cc116e. During the FFmpeg-libav
split, this function was refactored by commit 7e350379f8 into
av_buffersrc_add_frame(), replacing av_buffersrc_add_ref(). The new function
did not include the overflow warning, despite the same being done for
buffersink.
Then, when commit a05a44e205 merged the two functions back together, the
libav implementation was favored over the FFmpeg implementation, silently
removing the overflow warning in the process.
This commit re-adds that missing warning.
Signed-off-by: Niklas Haas <git@haasn.dev>
A multiplanar image with storage_bit enabled fails to be exported
to DMA-BUF on the QCOM Turnip driver, which triggers this double free:
```
[Parsed_hwmap_2 @ 0xffff5c002a70] Configure hwmap vulkan -> drm_prime.
[hwmap @ 0xffff5c001180] Filter input: vulkan, 1920x1080 (0).
[AVHWFramesContext @ 0xffff5c004e00] Unable to export the image as a FD!
free(): double free detected in tcache 2
Aborted
```
Additionally, add back an av_unused attribute. Otherwise, the compiler
will complain about unused variables when CUDA is not enabled.
Signed-off-by: nyanmisaka <nst799610810@gmail.com>
This is more about deprecating MMX than about performance: the numbers
are nearly identical on my Zen 4 (1.36x vs. C), but llvm-mca predicts
a >60% performance gain on Intel CPUs newer than Skylake.
Signed-off-by: Zuxy Meng <zuxy.meng@gmail.com>
The original intent here was probably to make the ops code agnostic to
which operation is actually last in the list, but the existence of a
divergence between CONTINUE and FINISH already implies that we hard-code
the assumption that the final operation is a write op.
So we can just massively simplify this with a call/ret pair instead of
awkwardly exporting and then jumping back to the return label. This actually
collapses FINISH down into just a plain RET, since the op kernels already
don't set up any extra stack frame.
Signed-off-by: Niklas Haas <git@haasn.dev>
ff_vk_find_struct returns const void *. Storing the result in const void *drm_create_pnext
fixes the initialization warning, but the later dpb_hwfc->create_pnext = drm_create_pnext
then assigns const void * to void *, triggering the same warning at that line. The right
fix is a (void *) cast at the call site, just as for buf_pnext.
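Casting once at the call site keeps both the initialization and the assignment free of qualifier warnings. The following sketch uses hypothetical stand-ins: find_struct() plays the role of a lookup returning const void *, and struct frames_ctx has a void * create_pnext member.

```c
#include <stddef.h>

/* Hypothetical stand-in for a frames context with a void * member. */
struct frames_ctx {
    void *create_pnext;
};

static const int some_struct = 42;

/* Hypothetical lookup returning const void *, like the text describes. */
static const void *find_struct(void)
{
    return &some_struct;
}

static void set_pnext(struct frames_ctx *hwfc)
{
    /* Going through a const void * temporary would only move the
     * "discards const qualifier" warning to the assignment below.
     * A single cast at the call site avoids it entirely. */
    hwfc->create_pnext = (void *)find_struct();
}
```

The cast is only safe as long as nothing writes through create_pnext, which is the usual situation for a pNext chain that is only read.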
Also restrict the GetPhysicalDeviceImageFormatProperties2 verbose log in
try_export_flags to the DRM modifier path only: when has_mods is false, the log
always printed mod[0]=0x0, which is misleading since no DRM modifier is involved.
Signed-off-by: Tymur Boiko <tboiko@nvidia.com>
HLS EVENT playlists (e.g. Twitch VODs) are seekable but not finished,
so live_start_index causes playback to begin near the end. The first
packet's DTS then becomes first_timestamp, creating a wrong mapping
between timestamps and segments.
Fix this by subtracting the cumulative duration of skipped segments from
first_timestamp so it reflects the true start of the playlist.
Also set the per-stream start_time from first_timestamp so that the
correct time is reported, and reset pts_wrap_reference on seek to
prevent bogus wraparounds.
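The first_timestamp correction is a sum over the skipped segments. Here is a minimal sketch under stated assumptions: segment durations and the DTS share one time base, start_index is the first segment actually played, and playlist_start_timestamp() is a hypothetical name.

```c
#include <stdint.h>

/* Hypothetical sketch: seg_durations[i] is the duration of segment i,
 * start_index the first segment played (e.g. from live_start_index),
 * first_dts the DTS of the first packet read. */
static int64_t playlist_start_timestamp(const int64_t *seg_durations,
                                        int start_index, int64_t first_dts)
{
    int64_t skipped = 0;

    for (int i = 0; i < start_index; i++)
        skipped += seg_durations[i];
    /* Shift back so the result corresponds to the start of the playlist. */
    return first_dts - skipped;
}
```

For example, starting at the fourth of four 10-unit segments with a first DTS of 1000 yields a playlist start of 970.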
Signed-off-by: Kacper Michajłow <kasper93@gmail.com>
Fixes a memory leak: since AV_MEDIA_TYPE_VIDEO == 0, video pools were
skipped by the !pool->type check. We can just remove the entire check, because
Fixes: fe2691b3bb
Reported-by: Kacper Michajłow <kasper93@gmail.com>
Signed-off-by: Niklas Haas <git@haasn.dev>
This reduces the number of malloc() and free() calls, and organizes the
per-buffer data a bit more neatly.
Having a separate struct is also useful in case more per-buffer data
needs to be added.
Signed-off-by: Alexandru Ardelean <aardelean@deviqon.com>
In the loop which allocates the buffers for a V4L2 device, if failure
occurs for a certain buffer (e.g. 3rd of 4 buffers), then the previously
allocated buffers (and the buffer array) would not be freed in
mmap_init(). This would cause a leak.
This change handles the error cases of that loop to free all allocated
resources, so that when mmap_init() fails nothing is leaked.
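The cleanup boils down to unwinding only the buffers that were already set up, then freeing the array. This sketch uses malloc()/free() in place of the V4L2 mmap()/munmap() calls. struct buf_info, alloc_buffers() and free_buffers() are hypothetical names, and fail_at exists only to simulate a failure.

```c
#include <stdlib.h>

/* Hypothetical per-buffer record. */
struct buf_info {
    void  *start;
    size_t length;
};

/* Allocate n buffers of 'size' bytes; 'fail_at' simulates a failure at that
 * index (-1 for none). On failure, everything allocated so far is freed. */
static struct buf_info *alloc_buffers(int n, size_t size, int fail_at)
{
    struct buf_info *bufs = calloc(n, sizeof(*bufs));

    if (!bufs)
        return NULL;
    for (int i = 0; i < n; i++) {
        bufs[i].start = (i == fail_at) ? NULL : malloc(size);
        if (!bufs[i].start) {
            while (i-- > 0)      /* unwind buffers 0 .. i-1 */
                free(bufs[i].start);
            free(bufs);
            return NULL;
        }
        bufs[i].length = size;
    }
    return bufs;
}

static void free_buffers(struct buf_info *bufs, int n)
{
    if (!bufs)
        return;
    for (int i = 0; i < n; i++)
        free(bufs[i].start);
    free(bufs);
}
```

A failure at the third of four buffers frees the first two buffers and the array before returning NULL.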
Signed-off-by: Alexandru Ardelean <aardelean@deviqon.com>
Because the frame pool API doesn't let us access the linesize directly,
we have to "un-translate" the over_read/write back to the nearest multiple
of the pixel size.
Signed-off-by: Niklas Haas <git@haasn.dev>
Allows the pass buffer allocator to make smarter decisions based on the actual
alignment requirements of the specific pass.
Signed-off-by: Niklas Haas <git@haasn.dev>
Matches the semantics of sws_frame_begin(), which also cleans up any
allocated buffers on error.
This issue was introduced by the commit that first allowed
ff_sws_graph_run() to fail.
Fixes: 563cc8216b
The major consequence of this is that we start allocating buffers per plane,
instead of allocating one contiguous buffer. This makes the no-op/refcopy
case slightly slower, but doesn't meaningfully affect the rest:
yuva444p -> yuva444p, time=157/1000 us (ref=78/1000 us), speedup=0.497x slower
Overall speedup=1.016x faster, min=0.983x max=1.092x
However, this is a necessary consequence of the desire to allow partial plane
allocations / single plane refcopies. This slowdown also does not affect
vf_scale, which already uses avfilter/framepool.c (via ff_get_video_buffer).
Signed-off-by: Niklas Haas <git@haasn.dev>
Saves a pointless free/alloc cycle on reinit. For the vast majority of filter
links, this is going to be allocated anyway; and on the occasions that it's not,
the waste is marginal.
Signed-off-by: Niklas Haas <git@haasn.dev>
As per the FFmpeg coding style guidelines, braces should be avoided on
isolated single-line statement bodies.
Signed-off-by: Niklas Haas <git@haasn.dev>
FFALIGN(..., pool->align) = (...) & ~(pool->align - 1), so this condition
equates to: ((...) & ~(align - 1) & (align - 1)), which is trivially 0.
(Note that all expressions are of type `int`.)
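The argument can be checked with the same FFALIGN definition FFmpeg uses. misaligned_bits() is a hypothetical name for the removed condition. The key step is that ~(a - 1) & (a - 1) is 0 for any a, so the expression is 0 no matter what x is.

```c
/* Same definition as libavutil's FFALIGN (a must be a power of two
 * for the result to be an aligned value). */
#define FFALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1))

/* The removed condition: the low bits of an already-aligned value.
 * Equivalent to (...) & ~(a - 1) & (a - 1), which is always 0. */
static int misaligned_bits(int x, int align)
{
    return FFALIGN(x, align) & (align - 1);
}
```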
Signed-off-by: Niklas Haas <git@haasn.dev>
This struct is overall pretty trivial, and there is little to no internal
state or invariants that need to be protected.
Making it public allows e.g. libswscale to allocate buffers for individual
planes directly.
Signed-off-by: Niklas Haas <git@haasn.dev>
This replaces the generic `int format` field and aids in debugging, as
e.g. gdb will tend to translate the strongly typed enums back into
human-readable names automatically.
Signed-off-by: Niklas Haas <git@haasn.dev>