This function is exported, so it has to abide by the ABI
and therefore issues emms since commit
5b85ca5317. Yet this is
expensive, and using SSE2 instead improves performance.
Also avoid the initial zeroing and the last pointer
increment while at it.
This removes the last usage of MMX from libavutil*.
Old benchmarks:
sad_8x8_0_c: 13.2 ( 1.00x)
sad_8x8_0_mmxext: 27.8 ( 0.48x)
sad_8x8_1_c: 13.2 ( 1.00x)
sad_8x8_1_mmxext: 27.6 ( 0.48x)
sad_8x8_2_c: 13.3 ( 1.00x)
sad_8x8_2_mmxext: 27.6 ( 0.48x)
New benchmarks:
sad_8x8_0_c: 13.3 ( 1.00x)
sad_8x8_0_sse2: 11.7 ( 1.13x)
sad_8x8_1_c: 13.8 ( 1.00x)
sad_8x8_1_sse2: 11.6 ( 1.20x)
sad_8x8_2_c: 13.2 ( 1.00x)
sad_8x8_2_sse2: 11.8 ( 1.12x)
Hint: Using two psadbw, or one psadbw and a movhps, made no difference
in the benchmarks, so I chose the latter due to the smaller code size.
*: except if lavu provides avpriv_emms for other libraries
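For illustration, a rough C intrinsics equivalent of the chosen approach
(the actual implementation is NASM; the function below is made up, not
the real code): two 8-byte rows are packed into one XMM register so a
single psadbw handles both, mirroring the movq+movhps variant:

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

static int sad_8x8_sse2(const uint8_t *src1, ptrdiff_t stride1,
                        const uint8_t *src2, ptrdiff_t stride2)
{
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < 8; i += 2) {
        /* movq + movhps: two rows share one register */
        __m128i a = _mm_unpacklo_epi64(
            _mm_loadl_epi64((const __m128i *)(src1 + (i + 0) * stride1)),
            _mm_loadl_epi64((const __m128i *)(src1 + (i + 1) * stride1)));
        __m128i b = _mm_unpacklo_epi64(
            _mm_loadl_epi64((const __m128i *)(src2 + (i + 0) * stride2)),
            _mm_loadl_epi64((const __m128i *)(src2 + (i + 1) * stride2)));
        acc = _mm_add_epi64(acc, _mm_sad_epu8(a, b));
    }
    /* psadbw leaves one partial sum per 64-bit half; combine them */
    return _mm_cvtsi128_si32(acc) +
           _mm_cvtsi128_si32(_mm_srli_si128(acc, 8));
}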
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
The use of a code section (.text) was forced by the unreleased NASM
3.02rc3, which made the issue worse by refusing to assemble anything
without a code section, including when only data was present.
This works fine for the most part, but using a code (.text) section with
IMAGE_COMDAT_SELECT_ANY causes issues with lib.exe after stripping such
an object:
fatal error LNK1143: invalid or corrupt file: no symbol for COMDAT section 0x2
Essentially it makes our workaround not work in all cases, and while
stripping could be disabled like it already is for MSVC/ICL builds, it
used to work, so let's preserve that state.
This makes it incompatible with NASM 3.02rc3 when CV debug info is
generated, but hopefully the upstream fix will be merged before the
release, to avoid this regression:
https://github.com/netwide-assembler/nasm/pull/221
Signed-off-by: Kacper Michajłow <kasper93@gmail.com>
This is needed to cover the case when the assembled source doesn't have
a .text section. The NASM documentation suggests adding a $ suffix to
the section name for COMDAT in .text, but this actually requires the
main .text section to exist as well. Also use a less generic suffix for
our dummy sub-section.
Third time's the charm.
Fixes: 80cd067715
Signed-off-by: Kacper Michajłow <kasper93@gmail.com>
lavu/x86/pixelutils.h only declares exactly one function,
namely the arch-specific init function. Such declarations
are usually contained in the ordinary header providing
the generic init function, yet the latter is public in this case.
Given that said function is called from exactly one callsite,
the header can be made more useful by moving the actual x86-init
function to it (as a static inline function) and removing
pixelutils_init.c.
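A rough sketch of the resulting header, assuming the usual cpu-flags
dispatch (the function names, table indices and signatures shown here
are illustrative, not verbatim):

#include "libavutil/cpu.h"
#include "libavutil/pixelutils.h"
#include "libavutil/x86/cpu.h"

int ff_pixelutils_sad_8x8_sse2(const uint8_t *src1, ptrdiff_t stride1,
                               const uint8_t *src2, ptrdiff_t stride2);

static inline void pixelutils_sad_init_x86(av_pixelutils_sad_fn *sad,
                                           int aligned)
{
    int cpu_flags = av_get_cpu_flags();
    if (EXTERNAL_SSE2(cpu_flags))
        sad[2] = ff_pixelutils_sad_8x8_sse2;
    /* ... further block sizes and alignment variants ... */
}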
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Memory operands of VEX encoded instructions generally have
no alignment requirement and so can be used in the case where
both inputs are unaligned, too. Furthermore, unaligned load
instructions are as fast as aligned loads (from aligned addresses)
on modern CPUs, in particular those with AVX2.
Therefore it makes no sense to have three different AVX2 sad32x32
functions. So remove two of them (the remaining one is the same
as the old one where src1 was aligned and src2 was not).
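As a C-level illustration of the point (the real function is asm; the
helper below is made up): with VEX encoding a single unaligned path
covers the aligned cases too, since vpsadbw's memory operand needs no
alignment and unaligned loads from aligned addresses carry no penalty:

#include <immintrin.h>
#include <stdint.h>

static __m256i sad_row_avx2(const uint8_t *src1, const uint8_t *src2)
{
    __m256i a = _mm256_loadu_si256((const __m256i *)src1);
    __m256i b = _mm256_loadu_si256((const __m256i *)src2);
    return _mm256_sad_epu8(a, b); /* vpsadbw */
}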
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Implemented the algorithm described in the paper titled
"Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction"
by Intel.
It is not used yet; the integration will be added in a separate commit.
Observed a near-10x speedup on an AMD Zen 4 7950X:
av_crc_c: 22057.0 ( 1.00x)
av_crc_clmul: 2202.8 (10.01x)
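The core of the paper's method is 128-bit folding via carry-less
multiplication. A minimal sketch of one folding step (precomputed
constants and the final Barrett reduction omitted; fold_k holds the
polynomial-specific pair of fold constants):

#include <wmmintrin.h>

/* Multiply the accumulator's two 64-bit halves by the precomputed
 * constants in fold_k and xor in the next 16 input bytes.
 * Requires PCLMULQDQ. */
static __m128i crc_fold128(__m128i acc, __m128i next, __m128i fold_k)
{
    __m128i lo = _mm_clmulepi64_si128(acc, fold_k, 0x00);
    __m128i hi = _mm_clmulepi64_si128(acc, fold_k, 0x11);
    return _mm_xor_si128(_mm_xor_si128(lo, hi), next);
}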
This reverts commits 8f48a62, 9af8782, and bd3e71b.
Commit 8f48a62 extends tx to 2M, resulting in the tx_float bss
section reaching a size of 4M.
This isn't an issue on devices with normal memory sizes and an OS
supporting virtual memory. But it is a real issue for embedded devices
with a realtime OS that may not support virtual memory, e.g. NuttX.
These 4M of bss section map directly to physical memory, which is a
scarce resource on embedded devices.
To do so, simply add these init files to X86ASM-OBJS instead of OBJS
in the Makefile. The former is already used for the actual assembly
files, but using it for the C init files just works, because the build
system derives from the file extension whether a file is C or NASM.
This avoids compiling unused function stubs and also reduces our
reliance on DCE: We don't add %if checks to the asm files except
for AVX, AVX2, FMA3, FMA4, XOP and AVX512, so all the MMX-SSE4
functions will be available. It also allows removing the HAVE_X86ASM
checks in these init files.
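Concretely, the change amounts to something like the following in the
per-arch Makefile (file name is illustrative):

-OBJS        += x86/pixelutils_init.o
+X86ASM-OBJS += x86/pixelutils_init.o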
Reviewed-by: Kacper Michajłow <kasper93@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Before FFmpeg commit 531b0a316b,
FFmpeg used REG_SP as a macro for the stack pointer, yet this
clashed with a REG_SP define in Solaris system headers, so it
was changed to REG_sp and a comment was added for this.
Libav fixed it by adding an FF_ prefix to the macros in
1e9c5bf4c1. FFmpeg switched
to using these prefixes in 9eb3da2f99,
using FF_REG_sp instead of Libav's FF_REG_SP. In said commit
the comment was changed to claim that Solaris system headers
define FF_REG_SP, but this is (most likely) wrong.
This commit removes the wrong comment and renames the (actually unused)
macro to FF_REG_SP to make it consistent with FF_REG_BP.
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
Fixes the use of builtins with Clang for x86_64-pc-windows-msvc, which
does not define __GNUC__ at all. Also, on other targets Clang defines
__GNUC__ to 4 by default, so any feature testing based on that version
is not really valid.
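The robust pattern is to test for the builtin itself instead of a
compiler version, e.g.:

#ifndef __has_builtin
#define __has_builtin(x) 0 /* fallback for compilers without the macro */
#endif

#if __has_builtin(__builtin_ctz)
/* use __builtin_ctz(); this also covers Clang targeting MSVC,
 * where __GNUC__ is not defined at all */
#endif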
Signed-off-by: Kacper Michajłow <kasper93@gmail.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
GCC and Clang are smart enough to emit minss/maxss the same way as these
functions. The only theoretical benefit was on x86_32, where x87 floats
are used, but the penalty of making the clipping opaque to the compiler's
scheduler, plus moving values from mmx regs to xmm and back, will offset
any potential speedup.
x86_32 builds targeting anything made in the last two and a half decades
should use -msse -mfpmath=sse anyway.
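For illustration, a plain C clip like the following already compiles to
maxss + minss with SSE math enabled (a sketch, not the exact libavutil
code):

static inline float clipf_c(float a, float amin, float amax)
{
    /* with -msse -mfpmath=sse (or on x86_64) this becomes maxss + minss */
    return a < amin ? amin : (a > amax ? amax : a);
}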
Signed-off-by: James Almer <jamrial@gmail.com>
The rounds value is constant and can be one of three hardcoded values, so
instead of checking it in every loop iteration, just split the function
into three different implementations, one for each value.
Before:
aes_decrypt_128_aesni: 93.8 (47.58x)
aes_decrypt_192_aesni: 106.9 (49.30x)
aes_decrypt_256_aesni: 109.8 (56.50x)
aes_encrypt_128_aesni: 93.2 (47.70x)
aes_encrypt_192_aesni: 111.1 (48.36x)
aes_encrypt_256_aesni: 113.6 (56.27x)
After:
aes_decrypt_128_aesni: 71.5 (63.31x)
aes_decrypt_192_aesni: 96.8 (55.64x)
aes_decrypt_256_aesni: 106.1 (58.51x)
aes_encrypt_128_aesni: 81.3 (55.92x)
aes_encrypt_192_aesni: 91.2 (59.78x)
aes_encrypt_256_aesni: 109.0 (58.26x)
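A sketch of the idea in intrinsics form (the real code is NASM; the
128-bit variant is shown, with round keys assumed already expanded):

#include <wmmintrin.h>

/* AES-128: the 10-round count is baked in, so there is no per-iteration
 * round check and the loop can be fully unrolled. Requires AES-NI. */
static __m128i aes128_encrypt_block(__m128i block, const __m128i rk[11])
{
    block = _mm_xor_si128(block, rk[0]);        /* initial AddRoundKey */
    for (int i = 1; i < 10; i++)
        block = _mm_aesenc_si128(block, rk[i]);
    return _mm_aesenclast_si128(block, rk[10]); /* final round */
}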
Signed-off-by: James Almer <jamrial@gmail.com>
Should make strict compilers happy.
Also, make AV_COPY128 use integer operations while at it. Removing the
inclusion of immintrin.h ensures far fewer intrinsics-related headers are
included as well, which fixes a clash of defines with some Clang versions.
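For reference, the integer form is simply two 64-bit copies (a sketch;
the real macros go through the av_alias unions to stay
strict-aliasing-safe):

#define AV_COPY128(d, s)                                   \
    do {                                                   \
        AV_COPY64(d, s);                                   \
        AV_COPY64((char *)(d) + 8, (const char *)(s) + 8); \
    } while (0)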
Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: James Almer <jamrial@gmail.com>
Removed by accident in the previous commits. This makes the code only run
when compiled with GCC and Clang, like before. Support for other compilers
like MSVC can be added later.
Signed-off-by: James Almer <jamrial@gmail.com>
This has the benefit of removing any SSE -> AVX penalty that may happen when
the compiler emits VEX encoded instructions.
Signed-off-by: James Almer <jamrial@gmail.com>
When called inside a loop, the inline asm version results in one pxor
being unnecessarily emitted per iteration, as the contents of the
__asm__() block are opaque to the compiler's instruction scheduler.
This is not the case with intrinsics, where the pxor will be emitted only
once by any half-decent compiler.
This also has the benefit of removing any SSE -> AVX penalty that may happen
when the compiler emits VEX encoded instructions.
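To illustrate the difference (hypothetical helpers, not libavutil API):

#include <emmintrin.h>

/* The asm statement is a black box, so a pxor is re-emitted on every
 * call, even inside hot loops: */
static __m128i zero_asm(void)
{
    __m128i v;
    __asm__("pxor %0, %0" : "=x"(v));
    return v;
}

/* The intrinsic participates in normal optimization, so the compiler
 * can hoist or CSE the register zeroing: */
static __m128i zero_intrin(void)
{
    return _mm_setzero_si128();
}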
Signed-off-by: James Almer <jamrial@gmail.com>
Since C11 support is required, those GCC versions can no longer be
supported anyhow. (Clang pretends to be GCC 4.4, but it looks like the
code was intended for old GCC specifically.)
There are lots of files that don't need it: The number of object
files that actually need it went down from 2011 to 884 here.
Keep it for external users in order to not cause breakages.
Also improve the other headers a bit while at it.
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
All versions of MSVC that support C11 (namely >= v19.27)
also support the restrict keyword; therefore av_restrict
is no longer necessary since 75697836b1.
Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
These inline implementations of AV_COPY64, AV_SWAP64 and AV_ZERO64
are known to clobber the FPU state - which has to be restored
with the 'emms' instruction afterwards.
This was known and signaled with the FF_COPY_SWAP_ZERO_USES_MMX
define, which calling code seems to have been supposed to check,
in order to call emms_c() after using them. See
0b1972d409,
29c4c0886d and
df215e5758 for history on earlier
fixes in the same area.
However, new code can use these AV_*64() macros without knowing
about the need to call emms_c().
Just get rid of these dangerous inline assembly snippets; this
doesn't make any difference for 64 bit architectures anyway.
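For context, the removed snippets looked roughly like this (an
illustrative reconstruction, not the verbatim code): an MMX-register
round trip that leaves the x87 state dirty until emms is issued:

#define AV_COPY64(d, s)                        \
    __asm__("movq  %1, %%mm0  \n\t"            \
            "movq  %%mm0, %0  \n\t"            \
            : "=m"(*(uint64_t *)(d))           \
            : "m" (*(const uint64_t *)(s))     \
            : "mm0")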
Signed-off-by: Martin Storsjö <martin@martin.st>
Forgot to do this with the previous commit.
This actually makes the assembly be used.
Still the fastest FFT in the world, 15% faster than FFTW on the
largest available size.
When operating on large blocks of data it's common to repeatedly use
an instruction on multiple registers. Using the REPX macro makes it
easy to quickly write dense code to achieve this without having to
explicitly duplicate the same instruction over and over.
For example,
REPX {paddw x, m4}, m0, m1, m2, m3
REPX {mova [r0+16*x], m5}, 0, 1, 2, 3
will expand to
paddw m0, m4
paddw m1, m4
paddw m2, m4
paddw m3, m4
mova [r0+16*0], m5
mova [r0+16*1], m5
mova [r0+16*2], m5
mova [r0+16*3], m5
Commit taken from x264:
6d10612ab0
Signed-off-by: Frank Plowman <post@frankplowman.com>
Signed-off-by: Anton Khirnov <anton@khirnov.net>
We currently mostly do not empty the MMX state in our MMX
DSP functions; instead we only do so before code that might
be using x87 code. This is a violation of the System V i386 ABI
(and maybe of other ABIs, too):
"The CPU shall be in x87 mode upon entry to a function. Therefore,
every function that uses the MMX registers is required to issue an
emms or femms instruction after using MMX registers, before returning
or calling another function." (See 2.2.1 in [1])
This patch does not intend to change all these functions to abide
by the ABI; it only does so for ff_pixelutils_sad_8x8_mmxext, as this
function can be called by external users, because it is exported
via the pixelutils API. Without this, the following fragment will
assert (on x86/x64):
uint8_t src1[8 * 8], src2[8 * 8];
av_pixelutils_sad_fn fn = av_pixelutils_get_sad_fn(3, 3, 0, NULL);
fn(src1, 8, src2, 8);
av_assert0_fpu();
[1]: https://raw.githubusercontent.com/wiki/hjl-tools/x86-psABI/intel386-psABI-1.1.pdf
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
From x86inc:
> On AMD cpus <=K10, an ordinary ret is slow if it immediately follows either
> a branch or a branch target. So switch to a 2-byte form of ret in that case.
> We can automatically detect "follows a branch", but not a branch target.
> (SSSE3 is a sufficient condition to know that your cpu doesn't have this problem.)
x86inc can automatically determine whether to use REP_RET rather than
RET in most of these cases, so the impact is minimal. Additionally, a few
REP_RETs were used unnecessarily, despite the return being nowhere near a
branch.
The only CPUs affected were AMD K10s, made between 2007 and 2011, 16
years ago and 12 years ago, respectively.
In the future, everyone involved with x86inc should consider dropping
REP_RETs altogether.