Allocate it via cglobal as usual. This makes the SSE2/AVX functions
available when HAVE_ALIGNED_STACK is false; it also avoids
modifying rsp unnecessarily in the deblock_h_luma_intra_10 functions
on Win64.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
They are a remnant of the MMX functions, which have been removed in
commit 4618f36a24. The MMX functions processed only eight pixels at
a time, so they were called twice via a wrapper; the actual MMX
function had "v8" in its name instead of simply "v".
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Forgotten in 4618f36a24.
Also remove a PASS8ROWS wrapper that seems to have been always
unused.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Implement NEON optimization for HEVC dequant at 8-bit depth.
The NEON implementation uses srshr (Signed Rounding Shift Right) which
does both the add with offset and right shift in a single instruction.
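As a rough scalar illustration of what a rounding shift right computes, the
following sketch adds half of the divisor and then shifts. The function names
are hypothetical and are not taken from the patch; the shift amount is passed
in by the caller rather than derived from the bit depth and block size.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a signed rounding shift right: add the rounding
 * offset (half of 1 << shift), then shift right arithmetically.
 * Assumes 1 <= shift <= 15 and an arithmetic right shift for negative
 * values, which is what common compilers provide. */
static int16_t rounding_shift_right(int16_t x, int shift)
{
    int offset = 1 << (shift - 1);
    return (int16_t)((x + offset) >> shift);
}

/* Hypothetical helper: apply the rounding shift to n coefficients in place. */
static void dequant_sketch(int16_t *coeffs, size_t n, int shift)
{
    for (size_t i = 0; i < n; i++)
        coeffs[i] = rounding_shift_right(coeffs[i], shift);
}
```

A SIMD rounding-shift instruction performs the same add and shift per lane,
which is why no separate offset addition is needed.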
Optimization details:
- 4x4 (16 coeffs): Single load-process-store sequence
- 8x8 (64 coeffs): Fully unrolled, no loop overhead
- 16x16 (256 coeffs): Pipelined load/compute/store to hide memory latency
- 32x32 (1024 coeffs): Pipelined with all available NEON registers
Performance benchmark on Apple M4:
./tests/checkasm/checkasm --test=hevc_dequant --bench
hevc_dequant_4x4_8_c: 11.3 ( 1.00x)
hevc_dequant_4x4_8_neon: 6.3 ( 1.78x)
hevc_dequant_8x8_8_c: 33.9 ( 1.00x)
hevc_dequant_8x8_8_neon: 6.6 ( 5.11x)
hevc_dequant_16x16_8_c: 153.8 ( 1.00x)
hevc_dequant_16x16_8_neon: 9.0 (17.02x)
hevc_dequant_32x32_8_c: 78.1 ( 1.00x)
hevc_dequant_32x32_8_neon: 31.9 ( 2.45x)
Note on Performance Anomaly:
The observation that hevc_dequant_32x32_8_c is faster than 16x16 (78.1 vs 153.8)
is due to Clang auto-vectorizing only for sizes >= 32x32.
Compiler: Apple clang version 17.0.0 (clang-1700.6.3.2)
Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
Previously x265 encoder used its default log level regardless of
FFmpeg's log level setting. Note the log level can be overwritten
by x265-params.
Fix #21462
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
This issue hid under the radar since the codebooks between coupling
modes very often result in identical bit counts regardless of the encoded
data, leading to no frame-level bitstream desyncs except in rare cases.
AAC Mps212 data is parsed immediately after the SBR data, where a loss
of sync in SBR will result in Mps212 being wildly different.
Move motion estimation precision check from standalone
`d3d12va_encode_init_motion_estimation_precision()` function into each
codec's init_sequence_params() to reuse existing feature support
queries.
- fixes AV1 using wrong support structure (SUPPORT instead of SUPPORT1)
- eliminates duplicate setup code
- removes redundant CheckFeatureSupport API call
- no intended functional changes other than bug fix
The spec says:
pMiColStarts is a pointer to an array of TileCols number
of unsigned integers that corresponds to MiColStarts
defined in section 6.8.14 of the [AV1 Specification]
And the unit of MiColStarts is MI(mode info).
So is pMiRowStarts.
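To make the unit concrete, here is a small sketch of the conversion between
superblock, MI and luma sample units. It assumes the AV1 convention that one
MI unit covers 4x4 luma samples, so a 64x64 superblock spans 16 MI units and
a 128x128 superblock spans 32. The helper names are hypothetical.

```c
#include <stdint.h>

/* One MI (mode info) unit covers 4x4 luma samples. */
#define SKETCH_MI_SIZE_LOG2 2

/* Convert a tile start given in superblocks to MI units. */
static uint32_t sb_to_mi(uint32_t start_sb, int use_128x128_superblock)
{
    int sb_shift = use_128x128_superblock ? 5 : 4;
    return start_sb << sb_shift;
}

/* Convert a position in MI units to luma samples. */
static uint32_t mi_to_luma(uint32_t mi)
{
    return mi << SKETCH_MI_SIZE_LOG2;
}
```

For example, a tile starting at the third 64x64 superblock column begins at
MI column 48, which is luma sample column 192.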
Add support for standard -pass and -passlogfile options, matching the behavior
of libx264.
Add the -x265-stats option to specify the stats filename.
Update documentation.
Signed-off-by: Werner Robitza <werner.robitza@gmail.com>
Previously we used the codec or, at decoding time, fb_priv for this, but
fb_priv can be nonzero even if an external frame buffer is not set, so it's
cleaner to use the capability flag directly.
Also check the result of vpx_codec_set_frame_buffer_functions.
Signed-off-by: Marton Balint <cus@passwd.hu>
It is possible that the error happens with the alpha encoder, not the normal
one, so let's always pass the affected encoder to the logging function.
Signed-off-by: Marton Balint <cus@passwd.hu>
The main reason this was written was due to Nvidia. Nvidia always
has a fickle upload path, and seemed to have a shortcut for the
host image upload path. This seems to have been patched out of
recent driver versions.
This upload path relies on the driver keeping the same layout,
down to the stride for the images, which is an assumption that's
not portable.
Rather than relying on this fickle upload path, what we'd like when
we want pure bandwidth is to decouple uploads to a separate queue,
and let the GPU pull the data from RAM via uploads.
It'll be slower with a single-threaded decoder, but currently all
of our compute-based decoders and the decoders that sit underneath
them support frame threading.
When prescale is enabled, time fields are converted to the output
timebase before expression evaluation. This allows option specification
even if the input timebase is unknown.
The setts bsf has an option to change TB. However the filter only
changed the TB and did not rescale the ts and duration, so it
effectively and silently stretched or squeezed the stream.
The pts, dts and duration are now rescaled to maintain temporal fidelity.
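The rescaling amounts to multiplying each value by the ratio of the old and
new timebases. The sketch below is a simplified, self-contained stand-in for
that arithmetic; the type and function names are hypothetical, and it ignores
overflow and special "no timestamp" values, which FFmpeg's own
av_rescale_q() family takes care of.

```c
#include <stdint.h>

/* Hypothetical stand-in for a rational timebase (num/den seconds per tick). */
typedef struct SketchRational {
    int num;
    int den;
} SketchRational;

/* Rescale ts from timebase src to timebase dst, rounding to nearest
 * (halves away from zero). Assumes positive timebases and that the
 * intermediate product fits in 64 bits. */
static int64_t rescale_ts_sketch(int64_t ts, SketchRational src,
                                 SketchRational dst)
{
    int64_t num  = (int64_t)src.num * dst.den;
    int64_t den  = (int64_t)src.den * dst.num;
    int64_t prod = ts * num;

    if (prod >= 0)
        return (prod + den / 2) / den;
    return -((-prod + den / 2) / den);
}
```

Changing only the timebase without this step keeps the tick counts but changes
what each tick means, which is the stretching or squeezing described above.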
ff_mlp_restart_checksum() used the (undocumented) layout
of the CRC tables and therefore broke on x86 when the
clmul implementation added in dc03cffe9c
is used. This commit fixes this (and issue #21506).
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Fix a simple index bug in ff_aac_usac_reset_state() that writes past
the end of ChannelElement.ch[2] for CPE.
ff_aac_usac_reset_state() loops over channels with j < ch, but
incorrectly takes &che->ch[ch]. For CPE (ch == 2) this becomes
che->ch[2], which is one past the end of ChannelElement.ch[2], and the
subsequent memset() causes an intra-object out-of-bounds write.
Index the channel element with the loop variable (j) instead.
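The pattern can be reduced to the following sketch. The struct and function
names are hypothetical simplifications, not the actual decoder types; only
the indexing mirrors the description above.

```c
#include <string.h>

/* Hypothetical, simplified stand-ins for the per-channel and
 * per-element decoder state. */
typedef struct SketchChannel {
    int state[4];
} SketchChannel;

typedef struct SketchElement {
    SketchChannel ch[2];
} SketchElement;

static void reset_state_sketch(SketchElement *che, int nb_channels)
{
    for (int j = 0; j < nb_channels; j++) {
        /* Index with the loop variable. Using nb_channels here would
         * address ch[2] for a two-channel element, one past the end. */
        SketchChannel *sce = &che->ch[j];
        memset(sce->state, 0, sizeof(sce->state));
    }
}
```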
Extract the Sample Aspect Ratio (SAR) from render_width_minus_1 and
render_height_minus_1 in the sequence header.
The AV1 specification defines the render dimensions, which can be used
in conjunction with the coded dimensions to determine the pixel aspect
ratio. This ensures consistent aspect ratio handling for AV1 streams
encapsulated in containers like MP4 or MKV, as observed in the updated
FATE tests where SAR changes from 0/1 to 1/1.
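One way to derive a sample aspect ratio from the two sets of dimensions is
sketched below. This is an assumed derivation rather than the code from the
patch: it treats render_w/render_h as the intended display shape, so that
SAR = (render_w * coded_h) / (render_h * coded_w), reduced to lowest terms.
All names are hypothetical.

```c
#include <stdint.h>

/* Greatest common divisor of two non-negative values. */
static int64_t gcd64(int64_t a, int64_t b)
{
    while (b) {
        int64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Derive SAR from coded and render dimensions, reduced to lowest terms.
 * If any dimension makes both products zero, report 0/1 (unknown). */
static void sar_from_render_size(int coded_w, int coded_h,
                                 int render_w, int render_h,
                                 int64_t *num, int64_t *den)
{
    int64_t n = (int64_t)coded_h * render_w;
    int64_t d = (int64_t)coded_w * render_h;
    int64_t g = gcd64(n, d);

    *num = g ? n / g : 0;
    *den = g ? d / g : 1;
}
```

When the render size equals the coded size the result is 1/1, which matches
the change seen in the FATE references.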
Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
It's rarely respected by implementations, it's fairly new (1 year old),
and it has a scuffed define (neither glslc nor glslang enables the
"GL_EXT_nontemporal_keyword" define if it's enabled, unlike all other extensions).