612 Commits

Author SHA1 Message Date
Andreas Rheinhardt
e1782fb016 avutil/x86/pixelutils: Don't use mmx in 8x8 SAD
This function is exported, so has to abide by the ABI
and therefore issues emms since commit
5b85ca5317. Yet this is
expensive and using SSE2 instead improves performance.
Also avoid the initial zeroing and the last pointer
increment while just at it.
This removes the last usage of mmx from libavutil*.

Old benchmarks:
sad_8x8_0_c:                                            13.2 ( 1.00x)
sad_8x8_0_mmxext:                                       27.8 ( 0.48x)
sad_8x8_1_c:                                            13.2 ( 1.00x)
sad_8x8_1_mmxext:                                       27.6 ( 0.48x)
sad_8x8_2_c:                                            13.3 ( 1.00x)
sad_8x8_2_mmxext:                                       27.6 ( 0.48x)

New benchmarks:
sad_8x8_0_c:                                            13.3 ( 1.00x)
sad_8x8_0_sse2:                                         11.7 ( 1.13x)
sad_8x8_1_c:                                            13.8 ( 1.00x)
sad_8x8_1_sse2:                                         11.6 ( 1.20x)
sad_8x8_2_c:                                            13.2 ( 1.00x)
sad_8x8_2_sse2:                                         11.8 ( 1.12x)

Hint: Using two psadbw or one psadbw and movhps made no difference
in the benchmarks, so I chose the latter due to smaller codesize.

*: except if lavu provides avpriv_emms for other libraries

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-04-18 21:21:11 +02:00
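
Sketch for e1782fb016: a rough C illustration of the "one psadbw and movhps" approach, written with SSE2 intrinsics and a scalar fallback. The function names and the `SKETCH_HAVE_SSE2` macro are hypothetical; the actual implementation is hand-written NASM and, unlike this sketch, also avoids the initial zeroing of the accumulator.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

/* Scalar reference: sum of absolute differences over an 8x8 block. */
static int sad_8x8_ref(const uint8_t *src1, ptrdiff_t stride1,
                       const uint8_t *src2, ptrdiff_t stride2)
{
    int sum = 0;
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++)
            sum += abs(src1[x] - src2[x]);
        src1 += stride1;
        src2 += stride2;
    }
    return sum;
}

#if SKETCH_HAVE_SSE2
/* Load two 8-byte rows into one XMM register:
 * movq fills the low half, movhps fills the high half. */
static __m128i load_two_rows(const uint8_t *p, ptrdiff_t stride)
{
    __m128i lo = _mm_loadl_epi64((const __m128i *)p);
    return _mm_castps_si128(_mm_loadh_pi(_mm_castsi128_ps(lo),
                                         (const __m64 *)(p + stride)));
}

static int sad_8x8_sse2(const uint8_t *src1, ptrdiff_t stride1,
                        const uint8_t *src2, ptrdiff_t stride2)
{
    __m128i acc = _mm_setzero_si128();
    for (int y = 0; y < 8; y += 2) {
        __m128i a = load_two_rows(src1 + y * stride1, stride1);
        __m128i b = load_two_rows(src2 + y * stride2, stride2);
        /* One psadbw handles two rows at once. */
        acc = _mm_add_epi64(acc, _mm_sad_epu8(a, b));
    }
    /* psadbw leaves one partial sum per 64-bit half; add the halves. */
    acc = _mm_add_epi64(acc, _mm_srli_si128(acc, 8));
    return _mm_cvtsi128_si32(acc);
}
#endif

/* Pick the SSE2 path when the compiler targets it, else the reference. */
static int sad_8x8(const uint8_t *src1, ptrdiff_t stride1,
                   const uint8_t *src2, ptrdiff_t stride2)
{
#if SKETCH_HAVE_SSE2
    return sad_8x8_sse2(src1, stride1, src2, stride2);
#else
    return sad_8x8_ref(src1, stride1, src2, stride2);
#endif
}
```

Since no MMX register is touched, no emms is needed before returning, which is the point of the commit.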
Kacper Michajłow
7d57621b83 avutil/x86/x86util: tone down NASM workaround and use info section
The use of a code section (.text) was forced by the unreleased NASM
3.02rc3, which made the issue worse by preventing assembling anything
without a code section, including when only data was present.

This works fine for the most part, but using a code (.text) section with
IMAGE_COMDAT_SELECT_ANY causes issues with lib.exe after stripping such
an object:
fatal error LNK1143: invalid or corrupt file: no symbol for COMDAT section 0x2

Essentially this means our workaround does not work in all cases, and while
stripping could be disabled like it already is for MSVC/ICL builds, it used
to work, so let's preserve that state.

This makes it incompatible with NASM 3.02rc3 when CV debug info is
generated, but hopefully the upstream fix will be merged before the release,
avoiding this regression:
https://github.com/netwide-assembler/nasm/pull/221

Signed-off-by: Kacper Michajłow <kasper93@gmail.com>
2026-03-30 19:46:53 +02:00
Kacper Michajłow
e54e117998 avutil/x86/x86util: define .text section additionally to COMDAT one
This is needed to cover the case where the assembled source doesn't have
a .text section. The NASM documentation suggests adding a $ suffix to the
section name for COMDAT in .text, but this actually requires the main .text
section to exist as well. Also use a less generic suffix for our dummy
sub-section.

Third time's the charm.

Fixes: 80cd067715
Signed-off-by: Kacper Michajłow <kasper93@gmail.com>
2026-03-30 01:08:45 +02:00
Kacper Michajłow
80cd067715 avutil/x86util: don't produce empty object files on win{32,64}
In cases where the preprocessor would remove all code, nasm would produce
empty object files. This is technically not wrong, but often causes
issues with various tooling:

* NASM fails to emit CodeView debug info when there is no code [1]
* Older VS2022 builds hang on empty files [2]
* GNU binutils `strip` errors out when there are no sections [3]
error: the input file '.o' has no sections

Work around these issues by adding a dummy byte in a COMDAT section,
which is then dropped by the linker, as the `__x86util_notref` symbol is not
referenced from C. [4] IMAGE_COMDAT_SELECT_ANY (2) is used to allow
multiple definitions of the symbol.

This is limited to win{32,64} as this is the target where issues were
observed.

[1] https://github.com/netwide-assembler/nasm/issues/216
[2] https://developercommunity.visualstudio.com/t/MSVC-Hangs-when-compiling-ffmpeg-When-l/10233953
[3] https://trac.ffmpeg.org/ticket/6711
[4] https://www.nasm.us/docs/3.01/nasm09.html#section-9.6.1

Signed-off-by: Kacper Michajłow <kasper93@gmail.com>
2026-03-29 23:00:06 +02:00
Andreas Rheinhardt
5c88f46c92 avutil/x86/aes: Only assemble iff HAVE_AESNI_EXTERNAL
This avoids relying on DCE and works around a NASM bug [1].

[1]: https://github.com/netwide-assembler/nasm/issues/216

Reviewed-by: Kacper Michajłow <kasper93@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-28 23:25:54 +01:00
Shreesh Adiga
952e588600 avutil/crc: refactor helper functions to separate header file
Move the reverse and xnmodp functions to a separate header
so that they can be reused for the aarch64 implementation of av_crc.
2026-03-11 14:03:36 +00:00
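
Sketch for 952e588600: the commit does not show the helpers themselves, so the following is only an assumption about what functions named "reverse" and "xnmodp" typically compute in CRC code: a bit reversal, and x^n mod P over GF(2). The names `reverse_bits32` and `xn_mod_p`, the 32-bit width and the MSB-first polynomial convention are all hypothetical and need not match the real header.

```c
#include <stdint.h>

/* Reverse the bit order of a 32-bit word (bit 0 <-> bit 31, and so on). */
static uint32_t reverse_bits32(uint32_t v)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (v & 1);
        v >>= 1;
    }
    return r;
}

/* Compute x^n mod P over GF(2), where P = x^32 + poly and
 * poly holds the low 32 coefficients, MSB-first. */
static uint32_t xn_mod_p(unsigned n, uint32_t poly)
{
    uint32_t r = 1;                 /* x^0 */
    while (n--) {
        uint32_t carry = r & 0x80000000u;
        r <<= 1;                    /* multiply by x */
        if (carry)
            r ^= poly;              /* reduce modulo P */
    }
    return r;
}
```

Constants of this form are what folding-based CRC implementations precompute for a given polynomial.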
Andreas Rheinhardt
e114c63234 avutil/x86/pixelutils: Avoid near-empty header
lavu/x86/pixelutils.h only declares exactly one function,
namely the arch-specific init function. Such declarations
are usually contained in the ordinary header providing
the generic init function, yet the latter is public in this case.

Given that said function is called from exactly one callsite,
the header can be made more useful by moving the actual x86-init
function to it (as a static inline function) and removing
pixelutils_init.c.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-09 10:17:26 +01:00
Andreas Rheinhardt
c9e056bc85 avutil/x86/pixelutils: Remove pointless AVX2 sad32x32 functions
Memory operands of VEX encoded instructions generally have
no alignment requirement and so can be used in the case where
both inputs are unaligned, too. Furthermore, unaligned load
instructions are as fast as aligned loads (from aligned addresses)
for modern cpus, in particular those with AVX2.

Therefore it makes no sense to have three different AVX2 sad32x32
functions. So remove two of them (the remaining one is the same
as the old one where src1 was aligned and src2 was not).

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-03-09 10:17:26 +01:00
Andreas Rheinhardt
dc03cffe9c avutil/crc: Use x86 clmul for CRC when available
Observed near 10x speedup on AMD Zen4 7950x:
av_crc_c:                                            22057.0 ( 1.00x)
av_crc_clmul:                                         2202.8 (10.01x)

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-01-04 15:49:30 +01:00
Shreesh Adiga
1b6571c765 avutil/crc: add x86 SSE4.2 clmul SIMD implementation for av_crc
Implemented the algorithm described in the paper titled
"Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction"
by Intel.
It is not used yet; the integration will be added in a separate commit.

Observed near 10x speedup on AMD Zen4 7950x:
av_crc_c:                                            22057.0 ( 1.00x)
av_crc_clmul:                                         2202.8 (10.01x)
2026-01-04 15:49:30 +01:00
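
Sketch for 1b6571c765: the PCLMULQDQ instruction performs a carry-less (GF(2) polynomial) multiplication of two 64-bit operands. A software model of the same operation on 32-bit inputs may make the arithmetic easier to follow; `clmul32` is a hypothetical name, and this is not the folding algorithm from the paper, only its basic building block.

```c
#include <stdint.h>

/* Carry-less multiply: like schoolbook multiplication,
 * but partial products are combined with XOR instead of addition. */
static uint64_t clmul32(uint32_t a, uint32_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 32; i++)
        if ((b >> i) & 1)
            r ^= (uint64_t)a << i;
    return r;
}
```

For example, (x + 1) * (x + 1) = x^2 + 1 over GF(2), so `clmul32(3, 3)` yields 5 rather than 9.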
Shreesh Adiga
e382772e4a avutil/cpu: add x86 CPU feature flag for clmul
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-01-04 15:49:30 +01:00
Zhao Zhili
1e2d86201f Revert "avutil/tx_template: extend to 2M"
This reverts commits 8f48a62, 9af8782, and bd3e71b.

Commit 8f48a62 extends tx to 2M, resulting in the tx_float bss
section reaching a size of 4M.

This isn't an issue on devices with normal memory sizes and an OS
supporting virtual memory. But it's a real issue for embedded devices
with a realtime OS, which may not support virtual memory, e.g., NuttX.
These 4M of bss map directly to physical memory, which is a
scarce resource on embedded devices.
2025-12-13 15:14:38 +00:00
Andreas Rheinhardt
59d75bf9e4 avutil/x86/Makefile: Only compile ASM init files when X86ASM is enabled
To do so, simply add these init files to X86ASM-OBJS instead of OBJS
in the Makefile. The former is already used for the actual assembly
files, but using them for the C init files just works, because the build
system uses file extensions to derive whether it is a C or a NASM file.

This avoids compiling unused function stubs and also reduces our
reliance on DCE: We don't add %if checks to the asm files except
for AVX, AVX2, FMA3, FMA4, XOP and AVX512, so all the MMX-SSE4
functions will be available. It also allows removing the HAVE_X86ASM checks
in these init files.

Reviewed-by: Kacper Michajłow <kasper93@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-11-30 22:20:13 +01:00
Andreas Rheinhardt
0ec9c1b68d avutil/x86/x86inc: Use parentheses in has_epilogue
Prevents surprises.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-11-30 00:15:43 +01:00
Andreas Rheinhardt
5bf57a925c avutil/x86/asm: Remove wrong comment, rename FF_REG_sp
Before FFmpeg commit 531b0a316b,
FFmpeg used REG_SP as macro for the stack pointer, yet this
clashed with a REG_SP define in Solaris system headers, so it
was changed to REG_sp and a comment was added for this.

Libav fixed it by adding an FF_ prefix to the macros in
1e9c5bf4c1. FFmpeg switched
to using these prefixes in 9eb3da2f99,
using FF_REG_sp instead of Libav's FF_REG_SP. In said commit
the comment was changed to claim that Solaris system headers
define FF_REG_SP, but this is (most likely) wrong.

This commit removes the wrong comment and renames the (actually unused)
macro to FF_REG_SP to make it consistent with FF_REG_BP.

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-11-18 20:41:13 +01:00
Andreas Rheinhardt
7b5b29910a avutil/x86/cpu: Remove 3dnow flags, macros
Unused since 5ef613bcb0.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2025-10-25 07:27:11 +02:00
Alan Kelly
f4b044bbe3 swscale: Disable avx2 hscale 8to15 on IceLake and below due to Intel Gather Data Sampling mitigation performance loss
Intel provided a microcode update to mitigate this security
vulnerability which has a huge negative performance impact on gather
instructions. This means that hscale 8to15 avx2, which uses gather
extensively, is no longer faster than SSSE3 on impacted CPUs.

https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/gather-data-sampling.html

Broadwell:
hscale_8_to_15__fs_4_dstW_512_c:                      3379.5 ( 1.00x)
hscale_8_to_15__fs_4_dstW_512_sse2:                    615.7 ( 5.49x)
hscale_8_to_15__fs_4_dstW_512_ssse3:                   613.4 ( 5.51x)
hscale_8_to_15__fs_4_dstW_512_avx2:                    495.7 ( 6.82x)

Skylake:
hscale_8_to_15__fs_4_dstW_512_c:                      3411.4 ( 1.00x)
hscale_8_to_15__fs_4_dstW_512_sse2:                    591.0 ( 5.77x)
hscale_8_to_15__fs_4_dstW_512_ssse3:                   591.5 ( 5.77x)
hscale_8_to_15__fs_4_dstW_512_avx2:                   1386.2 ( 2.46x)

Cascade Lake:
hscale_8_to_15__fs_4_dstW_512_c:                      3231.3 ( 1.00x)
hscale_8_to_15__fs_4_dstW_512_sse2:                    517.9 ( 6.24x)
hscale_8_to_15__fs_4_dstW_512_ssse3:                   521.6 ( 6.19x)
hscale_8_to_15__fs_4_dstW_512_avx2:                   1775.0 ( 1.82x)

Sapphire Rapids:
hscale_8_to_15__fs_4_dstW_512_c:                      1840.0 ( 1.00x)
hscale_8_to_15__fs_4_dstW_512_sse2:                    287.9 ( 6.39x)
hscale_8_to_15__fs_4_dstW_512_ssse3:                   293.8 ( 6.26x)
hscale_8_to_15__fs_4_dstW_512_avx2:                    219.2 ( 8.40x)
2025-09-06 20:57:48 +00:00
Kacper Michajłow
43dc443446 avutil/intmath: use AV_HAS_BUILTIN to detect builtin availability
Fixes use of builtins on clang x86_64-pc-windows-msvc, which does not
define __GNUC__ at all. Also, on other targets __GNUC__ is defined to 4 by
default, so any feature testing based on the version is not really valid.

Signed-off-by: Kacper Michajłow <kasper93@gmail.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2025-06-12 14:17:37 +03:00
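
Sketch for 43dc443446: feature detection via `__has_builtin` instead of a compiler version check. `SKETCH_HAS_BUILTIN` and `ctz32` are hypothetical names standing in for AV_HAS_BUILTIN and its users; the real macro lives in libavutil and may be defined differently.

```c
#include <stdint.h>

/* Ask the compiler directly whether a builtin exists;
 * compilers without __has_builtin report "no". */
#ifdef __has_builtin
#define SKETCH_HAS_BUILTIN(x) __has_builtin(x)
#else
#define SKETCH_HAS_BUILTIN(x) 0
#endif

/* Count trailing zero bits; v must be non-zero. */
static int ctz32(uint32_t v)
{
#if SKETCH_HAS_BUILTIN(__builtin_ctz)
    return __builtin_ctz(v);
#else
    int n = 0;
    while (!(v & 1)) {
        v >>= 1;
        n++;
    }
    return n;
#endif
}
```

The check does not depend on the value of `__GNUC__`, so it behaves the same on a clang target that leaves the macro undefined.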
James Almer
b7fc195e7a avutil/x86/intmath: remove inline asm implementations for clip functions
GCC and Clang are smart enough to emit minss/maxss the same way as these functions.
The only theoretical benefit was in x86_32, where x87 floats are used, but the
penalty of making the clipping opaque to the compiler's scheduler plus moving
values from mmx regs to xmm and back will offset any potential speedup.
x86_32 builds targeting anything made in the last two and a half decades
should use -msse -mfpmath=sse anyway.

Signed-off-by: James Almer <jamrial@gmail.com>
2025-06-07 21:14:55 -03:00
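
Sketch for b7fc195e7a: a plain C float clip of the kind the commit says compilers can turn into maxss/minss when SSE math is enabled. `clipf_sketch` is a hypothetical name, not the libavutil function.

```c
/* Clamp a to [amin, amax]; assumes amin <= amax. */
static inline float clipf_sketch(float a, float amin, float amax)
{
    float lo = a > amin ? a : amin;   /* max(a, amin) */
    return lo > amax ? amax : lo;     /* min(lo, amax) */
}
```

Because the body is ordinary C, the compiler can schedule and vectorize around it, which an inline asm block would prevent.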
James Almer
a039726c2a avutil/x86/aes: remove a few branches
The rounds value is constant and can be one of three hardcoded values, so
instead of checking it on every loop, just split the function into three
different implementations for each value.

Before:
aes_decrypt_128_aesni:                                  93.8 (47.58x)
aes_decrypt_192_aesni:                                 106.9 (49.30x)
aes_decrypt_256_aesni:                                 109.8 (56.50x)
aes_encrypt_128_aesni:                                  93.2 (47.70x)
aes_encrypt_192_aesni:                                 111.1 (48.36x)
aes_encrypt_256_aesni:                                 113.6 (56.27x)

After:
aes_decrypt_128_aesni:                                  71.5 (63.31x)
aes_decrypt_192_aesni:                                  96.8 (55.64x)
aes_decrypt_256_aesni:                                 106.1 (58.51x)
aes_encrypt_128_aesni:                                  81.3 (55.92x)
aes_encrypt_192_aesni:                                  91.2 (59.78x)
aes_encrypt_256_aesni:                                 109.0 (58.26x)

Signed-off-by: James Almer <jamrial@gmail.com>
2025-04-10 12:02:34 -03:00
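
Sketch for a039726c2a: the general idea of specializing a function on a constant round count, shown with a toy mixing step rather than real AES. Everything here (`toy_round`, `toy_crypt`, `DEFINE_TOY_CRYPT`) is hypothetical; only the round counts 10, 12 and 14 for AES-128/192/256 are taken from the AES standard.

```c
#include <stdint.h>

/* Placeholder "round" (NOT AES): xor with the key, then rotate left by 5. */
static inline uint32_t toy_round(uint32_t state, uint32_t key)
{
    state ^= key;
    return (state << 5) | (state >> 27);
}

/* Generic version: the round count is a runtime value,
 * so it is checked on every iteration. */
static uint32_t toy_crypt(uint32_t state, const uint32_t *keys, int rounds)
{
    for (int i = 0; i < rounds; i++)
        state = toy_round(state, keys[i]);
    return state;
}

/* Specialized versions: the round count is a compile-time constant,
 * so the compiler is free to unroll the loop completely. */
#define DEFINE_TOY_CRYPT(bits, ROUNDS)                                   \
static uint32_t toy_crypt_##bits(uint32_t state, const uint32_t *keys)  \
{                                                                        \
    for (int i = 0; i < (ROUNDS); i++)                                   \
        state = toy_round(state, keys[i]);                               \
    return state;                                                        \
}

DEFINE_TOY_CRYPT(128, 10)
DEFINE_TOY_CRYPT(192, 12)
DEFINE_TOY_CRYPT(256, 14)
```

The caller selects one of the three variants once, when the key size is known, instead of branching inside the hot loop.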
James Almer
a35b4e8d29 avutil/x86/aes: ignore the upper bits in count
The argument is an int.

Signed-off-by: James Almer <jamrial@gmail.com>
2025-04-06 11:02:09 -03:00
Rodger Combs
2ea3c51795 lavu/aes: add x86 AESNI optimizations
crypto_bench comparison for AES-128-ECB:

lavu_aesni AES-128-ECB  size: 1048576  runs:   1024  time:    0.596 +- 0.081
lavu_c     AES-128-ECB  size: 1048576  runs:   1024  time:   17.007 +- 2.131
crypto     AES-128-ECB  size: 1048576  runs:   1024  time:    0.612 +- 1.857
gcrypt     AES-128-ECB  size: 1048576  runs:   1024  time:    1.123 +- 0.224
tomcrypt   AES-128-ECB  size: 1048576  runs:   1024  time:    9.038 +- 0.790

Improved-By: Henrik Gramner <henrik@gramner.com>
Signed-off-by: James Almer <jamrial@gmail.com>
2025-04-05 20:46:40 -03:00
Lynne
892f64ad9b x86/tx_float: remove HAVE_AVX2_EXTERNAL checks
It'll always be enabled.
Thanks, nasm.
2024-10-06 01:32:49 +02:00
Lynne
b17a240c8d Revert "x86/tx_float: set all operands for shufps"
This reverts commit 74f5fb6db8.
2024-10-06 01:32:49 +02:00
Lynne
24c5a58e55 Revert "x86/tx_float: add missing check for AVX2"
This reverts commit f4097e4c1f.
2024-10-06 01:32:48 +02:00
Lynne
bf643f989b Revert "x86/tx_float: add missing preprocessor wrapper for AVX2 functions"
This reverts commit 750f378bec.
2024-10-06 01:32:48 +02:00
Lynne
b890482d05 Revert "x86/tx_float: change a condition in a preprocessor check"
This reverts commit 0d8f43c74d.
2024-10-06 01:32:47 +02:00
James Almer
9e7a93c6fd x86/intreadwrite: add SSE2 optimized AV_COPY128U
Signed-off-by: James Almer <jamrial@gmail.com>
2024-07-29 23:17:52 -03:00
James Almer
70c6b904be x86/intreadwrite: add missing casts to pointer arguments
Should make strict compilers happy.

Also, make AV_COPY128 use integer operations while at it. Removing the
inclusion of immintrin.h ensures that far fewer intrinsic-related headers are
included as well, which fixes a clash of defines with some Clang versions.

Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: James Almer <jamrial@gmail.com>
2024-07-11 18:24:26 -03:00
James Almer
1a86a7a48d x86/intreadwrite: fix include of config.h
Should fix make checkheaders.

Signed-off-by: James Almer <jamrial@gmail.com>
2024-07-10 13:52:52 -03:00
James Almer
15056dd650 x86/intreadwrite.h: add missing preprocessor checks
Removed by accident in the previous commits. This makes the code only run when
compiled with GCC and Clang like before. Support for other compilers like msvc
can be added later.

Signed-off-by: James Almer <jamrial@gmail.com>
2024-07-10 13:49:21 -03:00
James Almer
bd1bcb07e0 x86/intreadwrite: use intrinsics instead of inline asm for AV_COPY128
This has the benefit of removing any SSE -> AVX penalty that may happen when
the compiler emits VEX encoded instructions.

Signed-off-by: James Almer <jamrial@gmail.com>
2024-07-10 13:25:44 -03:00
James Almer
4a04cca69a x86/intreadwrite: use intrinsics instead of inline asm for AV_ZERO128
When called inside a loop, the inline asm version results in one pxor
unnecessarily emitted per iteration, as the contents of the __asm__() block are
opaque to the compiler's instruction scheduler.
This is not the case with intrinsics, where pxor will be emitted once with any
half-decent compiler.

This also has the benefit of removing any SSE -> AVX penalty that may happen
when the compiler emits VEX encoded instructions.

Signed-off-by: James Almer <jamrial@gmail.com>
2024-07-10 13:25:44 -03:00
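
Sketch for 4a04cca69a and bd1bcb07e0: 128-bit zero and copy helpers written with SSE2 intrinsics, plus a loop in which the compiler can hoist the zero register out of the loop body. The names are hypothetical, and unaligned stores are used here for safety; this is an assumption and may differ from the alignment the real AV_ZERO128/AV_COPY128 macros expect.

```c
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

/* Store 16 zero bytes at dst. */
static inline void zero128(void *dst)
{
#if SKETCH_HAVE_SSE2
    _mm_storeu_si128((__m128i *)dst, _mm_setzero_si128());
#else
    memset(dst, 0, 16);
#endif
}

/* Copy 16 bytes from src to dst. */
static inline void copy128(void *dst, const void *src)
{
#if SKETCH_HAVE_SSE2
    _mm_storeu_si128((__m128i *)dst, _mm_loadu_si128((const __m128i *)src));
#else
    memcpy(dst, src, 16);
#endif
}

/* Zero nb_blocks consecutive 16-byte blocks. With intrinsics the zero
 * register is visible to the optimizer and need not be rebuilt per iteration. */
static void zero_blocks(uint8_t *dst, int nb_blocks)
{
    for (int i = 0; i < nb_blocks; i++)
        zero128(dst + 16 * i);
}
```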
James Almer
4b57ea8fc7 avutil/common: assert that bit position in av_zero_extend is valid
Signed-off-by: James Almer <jamrial@gmail.com>
2024-06-13 20:36:09 -03:00
James Almer
39c90d6466 avutil: rename av_mod_uintp2 to av_zero_extend
It's more descriptive of what it does.

Signed-off-by: James Almer <jamrial@gmail.com>
2024-06-13 20:35:57 -03:00
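
Sketch for 39c90d6466 and 4b57ea8fc7: what "zero extend from bit position p" means, with the bit position checked by an assertion. `zero_extend_sketch` is a hypothetical stand-in for av_zero_extend and assumes a 32-bit `unsigned`.

```c
#include <assert.h>

/* Keep the low p bits of a and clear everything above them. */
static inline unsigned zero_extend_sketch(unsigned a, unsigned p)
{
    assert(p <= 31);                 /* a shift by 32 or more would be undefined */
    return a & ((1U << p) - 1);
}
```

For power-of-two moduli this is the same as `a % (1U << p)`, which is where the old name av_mod_uintp2 came from.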
Rémi Denis-Courmont
0231097d1b lavu/x86: remove GCC 4.4- stuff
Since C11 support is required, those GCC versions can no longer be
supported anyhow. (Clang pretends to be GCC 4.4, but it looks like the
code was intended for old GCC specifically.)
2024-06-13 21:16:16 +03:00
James Almer
a14440867c x86/float_dsp: add SSE2 and AVX versions of scalarproduct_double
Signed-off-by: James Almer <jamrial@gmail.com>
2024-06-03 22:14:55 -03:00
Andreas Rheinhardt
790f793844 avutil/common: Don't auto-include mem.h
There are lots of files that don't need it: The number of object
files that actually need it went down from 2011 to 884 here.

Keep it for external users in order to not cause breakages.

Also improve the other headers a bit while just at it.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2024-03-31 00:08:43 +01:00
Henrik Gramner
afa471d0ef x86: Update x86inc.asm
Make things up-to-date with upstream.

https://code.videolan.org/videolan/x86inc.asm
2024-03-24 14:53:57 +01:00
Henrik Gramner
c3d3f0e697 avutil/x86util: Fix broken pre-SSE4.1 PMINSD emulation
Fixes yadif-16 which allows FATE to pass.

Broken since 2904db9045 (2017).
2024-03-17 13:52:27 +01:00
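
Sketch for c3d3f0e697: one common way to emulate a signed 32-bit minimum without SSE4.1's pminsd is a compare followed by a mask select. The scalar model below shows that logic for a single lane; `min_s32_select` is a hypothetical name, and the actual macro in x86util.asm may be structured differently.

```c
#include <stdint.h>

/* Signed minimum via compare and bitwise select, one lane. */
static int32_t min_s32_select(int32_t a, int32_t b)
{
    /* All-ones where a > b, as pcmpgtd would produce. */
    uint32_t mask = a > b ? 0xFFFFFFFFu : 0u;
    /* Take b where a > b, else a (pand / pandn / por). */
    return (int32_t)(((uint32_t)b & mask) | ((uint32_t)a & ~mask));
}
```

The comparison has to be signed; mixing up the operand order or the select mask silently yields a maximum instead of a minimum.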
Andreas Rheinhardt
c00cd007e8 configure: Remove av_restrict
All versions of MSVC that support C11 (namely >= v19.27)
also support the restrict keyword, therefore av_restrict
is no longer necessary since 75697836b1.

Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2024-03-15 12:51:15 +01:00
Martin Storsjö
7ec2354c38 x86: Remove inline MMX assembly that clobbers the FPU state
These inline implementations of AV_COPY64, AV_SWAP64 and AV_ZERO64
are known to clobber the FPU state - which has to be restored
with the 'emms' instruction afterwards.

This was known and signaled with the FF_COPY_SWAP_ZERO_USES_MMX
define, which calling code seems to have been supposed to check,
in order to call emms_c() after using them. See
0b1972d409,
29c4c0886d and
df215e5758 for history on earlier
fixes in the same area.

However, new code can use these AV_*64() macros without knowing
about the need to call emms_c().

Just get rid of these dangerous inline assembly snippets; this
doesn't make any difference for 64 bit architectures anyway.

Signed-off-by: Martin Storsjö <martin@martin.st>
2024-02-09 23:55:52 +02:00
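
Sketch for 7ec2354c38: portable 64-bit copy, swap and zero helpers that never touch MMX registers and therefore need no emms afterwards. The names are hypothetical and the memcpy-based approach is an assumption for illustration; the generic AV_COPY64, AV_SWAP64 and AV_ZERO64 fallbacks in libavutil may be implemented differently.

```c
#include <stdint.h>
#include <string.h>

/* Copy 8 bytes; compilers typically lower this to one 64-bit move. */
static inline void copy64(void *dst, const void *src)
{
    memcpy(dst, src, 8);
}

/* Zero 8 bytes. */
static inline void zero64(void *dst)
{
    memset(dst, 0, 8);
}

/* Exchange the 8 bytes at a and b. */
static inline void swap64(void *a, void *b)
{
    uint64_t ta, tb;
    memcpy(&ta, a, 8);
    memcpy(&tb, b, 8);
    memcpy(a, &tb, 8);
    memcpy(b, &ta, 8);
}
```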
Lynne
9af87828bd x86/tx_init: properly indicate the extended available transform sizes
Forgot to do this with the previous commit.

This actually makes the assembly get used.

Still the fastest FFT in the world, 15% faster than FFTW on the
largest available size.
2024-02-09 18:08:42 +01:00
Lynne
bd3e71b21e x86/tx_float: enable SIMD for sizes over 131072
The tables for the new sizes were added last year due
to being required for SDR.
However, the assembly was never updated to use them.
2024-02-07 15:20:48 +01:00
Henrik Gramner
ed8ddf0bd3 x86inc: Add REPX macro to repeat instructions/operations
When operating on large blocks of data it's common to repeatedly use
an instruction on multiple registers. Using the REPX macro makes it
easy to quickly write dense code to achieve this without having to
explicitly duplicate the same instruction over and over.

For example,

    REPX {paddw x, m4}, m0, m1, m2, m3
    REPX {mova [r0+16*x], m5}, 0, 1, 2, 3

will expand to

    paddw       m0, m4
    paddw       m1, m4
    paddw       m2, m4
    paddw       m3, m4
    mova [r0+16*0], m5
    mova [r0+16*1], m5
    mova [r0+16*2], m5
    mova [r0+16*3], m5

Commit taken from x264:
6d10612ab0

Signed-off-by: Frank Plowman <post@frankplowman.com>
Signed-off-by: Anton Khirnov <anton@khirnov.net>
2023-11-08 13:49:08 +01:00
Andreas Rheinhardt
5b85ca5317 avutil/x86/pixelutils: Empty MMX state in ff_pixelutils_sad_8x8_mmxext
We currently mostly do not empty the MMX state in our MMX
DSP functions; instead we only do so before code that might
be using x87 code. This is a violation of the System V i386 ABI
(and maybe of other ABIs, too):
"The CPU shall be in x87 mode upon entry to a function. Therefore,
every function that uses the MMX registers is required to issue an
emms or femms instruction after using MMX registers, before returning
or calling another function." (See 2.2.1 in [1])
This patch does not intend to change all these functions to abide
by the ABI; it only does so for ff_pixelutils_sad_8x8_mmxext, as this
function can be called by external users, because it is exported
via the pixelutils API. Without this, the following fragment will
assert (on x86/x64):
    uint8_t src1[8 * 8], src2[8 * 8];
    av_pixelutils_sad_fn fn = av_pixelutils_get_sad_fn(3, 3, 0, NULL);
    fn(src1, 8, src2, 8);
    av_assert0_fpu();

[1]: https://raw.githubusercontent.com/wiki/hjl-tools/x86-psABI/intel386-psABI-1.1.pdf

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2023-11-04 01:26:03 +01:00
Andreas Rheinhardt
f8503b4c33 avutil/internal: Don't auto-include emms.h
Instead include emms.h wherever it is needed.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2023-09-04 11:04:45 +02:00
Lynne
bbe95f7353 x86: replace explicit REP_RETs with RETs
From x86inc:
> On AMD cpus <=K10, an ordinary ret is slow if it immediately follows either
> a branch or a branch target. So switch to a 2-byte form of ret in that case.
> We can automatically detect "follows a branch", but not a branch target.
> (SSSE3 is a sufficient condition to know that your cpu doesn't have this problem.)

x86inc can automatically determine whether to use REP_RET rather than
RET in most of these cases, so impact is minimal. Additionally, a few
REP_RETs were used unnecessarily, despite the return being nowhere near a
branch.

The only CPUs affected were AMD K10s, made between 2007 and 2011, 16
years ago and 12 years ago, respectively.

In the future, everyone involved with x86inc should consider dropping
REP_RETs altogether.
2023-02-01 04:23:55 +01:00
Lynne
90c17a05aa x86/tx_float: fix stray change in 15xM FFT and replace imul->lea
Thanks to rorgoroth for bisecting and kurosu for the lea suggestion.
2022-11-28 16:58:12 +01:00
Lynne
87bae6b018 lavu/tx: refactor to explicitly track and convert lookup table order
Necessary for generalizing PFAs.
2022-11-24 15:58:34 +01:00