This is far more commonly used without an offset than with one, so having it
there prevents these special cases from doing much good.
Signed-off-by: Niklas Haas <git@haasn.dev>
The first vector is %2, not %3. This was never triggered before because none of
the existing masks hit this exact case.
Signed-off-by: Niklas Haas <git@haasn.dev>
Above a certain filter size, we can load the offsets as scalars and loop
over filter taps instead. To avoid having to assemble the output register
in memory (or use some horrific sequence of blends and insertions), we process
4 adjacent pixels at a time and do a 4x4 transpose before accumulating the
weights.
Significantly faster than the existing kernels after 2-3 iterations.
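A scalar model of the loop structure described above (hypothetical names, not the actual SIMD kernel): rather than assembling one output register per pixel, four adjacent output pixels are processed together while iterating over filter taps; in the SIMD version, the 4x4 transpose of the gathered rows puts the data into exactly this tap-major layout.

```c
#include <assert.h>
#include <stdint.h>

/* Tap-major accumulation over 4 adjacent output pixels. In the real
 * kernel the inner loop body is a single vector multiply-accumulate on
 * the transposed 4x4 block; here it is spelled out as scalars. */
void hscale_4px(int32_t acc[4],
                const uint8_t *src,
                const int16_t *filter,  /* 4 * filter_size weights */
                const int pos[4],       /* per-pixel source offsets */
                int filter_size)
{
    acc[0] = acc[1] = acc[2] = acc[3] = 0;
    for (int j = 0; j < filter_size; j++)   /* taps in the outer loop */
        for (int p = 0; p < 4; p++)         /* 4 pixels in the inner loop */
            acc[p] += src[pos[p] + j] * filter[p * filter_size + j];
}
```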
Signed-off-by: Niklas Haas <git@haasn.dev>
This uses a naive gather-based loop, similar to the existing legacy hscale
SIMD. This has provably correct semantics (and avoids overflow as long as
the filter scale is 1 << 14 or so), though it's not particularly fast for
larger filter sizes.
We can specialize this to more efficient implementations in a subset of cases,
but for now, this guarantees a match to the C code.
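The gather loop in question looks roughly like this in C (hypothetical names, not the actual swscale code): one output pixel at a time, gathering filter_size input samples and accumulating against 16-bit weights. With the weights for each pixel summing to roughly 1 << 14, each product of an unsigned sample and a 15-bit weight stays well inside a 32-bit accumulator, which is the overflow guarantee mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Naive gather-based horizontal scale: for output pixel i, read
 * filter_size input pixels starting at pos[i] and accumulate them
 * against that pixel's run of filter coefficients. */
void hscale_gather(int32_t *dst, int dst_w,
                   const uint8_t *src,
                   const int16_t *filter,  /* dst_w * filter_size weights */
                   const int *pos,         /* per-pixel gather offsets */
                   int filter_size)
{
    for (int i = 0; i < dst_w; i++) {
        int32_t sum = 0;
        for (int j = 0; j < filter_size; j++)
            sum += src[pos[i] + j] * filter[i * filter_size + j];
        dst[i] = sum;
    }
}
```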
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
Ideally, we would like to be able to specialize these to fixed kernel
sizes as well (e.g. 2 taps), but that only saves a tiny bit of loop overhead
and at the moment I have more pressing things to focus on.
I found that using FMA instead of straight mulps/addps gains about 15%, so
I defined a separate FMA path that can be used when BITEXACT is not specified
(or when we can statically guarantee that the final sum fits into the floating
point range).
Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
Instead of defining multiple patterns for the dither ops, just define a
single generic function that branches internally. The branch is well-predicted
and ridiculously cheap; at least on my end, its cost is within margin of error.
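The shape of the generic op can be sketched like this (hypothetical signature, not the actual code): the no-dither case is folded into the same function via a runtime check whose outcome is the same on every iteration, so the predictor resolves it essentially for free.

```c
#include <assert.h>
#include <stdint.h>

/* One generic dither op instead of separate dither/no-dither patterns.
 * dither may be NULL; dither_len is assumed to be a power of two. */
void dither_row(uint8_t *dst, const int16_t *src, int w,
                const int16_t *dither, int dither_len)
{
    for (int i = 0; i < w; i++) {
        int v = src[i];
        if (dither)  /* well-predicted: same outcome every iteration */
            v += dither[i & (dither_len - 1)];
        dst[i] = v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
    }
}
```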
Signed-off-by: Niklas Haas <git@haasn.dev>
This doesn't actually gain any performance but makes the code needlessly
complicated. Just directly add the indirect address as needed.
Signed-off-by: Niklas Haas <git@haasn.dev>
Instead of computing y + N with a hard-coded index offset, calculate the
relative offset as a 16-bit integer in C and add that to the pointer directly.
Since we no longer mask the resulting combined address, this may result in an
overread, but that's fine since we over-provisioned the array in the previous
commit.
This is an exceptionally unlikely (in fact, currently impossible) case to
actually hit, and not worth micro-optimizing for. More specifically, having
this special case prevents me from easily adding per-row offsets.
This covers most 8-bit and 16-bit ops, and some 32-bit ops. It also covers all
floating point operations. While this is not yet 100% coverage, it's good
enough for the vast majority of formats out there.
Of special note is the packed shuffle fast path, which uses pshufb at vector
sizes up to AVX512.
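For reference, the semantics the fast path builds on can be modeled in C as follows: pshufb picks each output byte by the low 4 bits of its control byte, or zeroes it when the control byte's high bit is set; the AVX2 and AVX-512 forms apply the same operation independently per 128-bit lane.

```c
#include <assert.h>
#include <stdint.h>

/* C model of pshufb (SSSE3 byte shuffle) on one 128-bit lane. */
void pshufb_model(uint8_t dst[16], const uint8_t src[16],
                  const uint8_t ctl[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = (ctl[i] & 0x80) ? 0 : src[ctl[i] & 0x0F];
}
```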