FFmpeg/libavcodec
Andreas Rheinhardt 2729c52988 avcodec/x86/hevc/deblock: Reduce usage of GPRs
Don't use two GPRs to store two words from xmm registers;
shuffle these words so that they fit into one GPR.
This reduces the number of GPRs used and leads to tiny speedups
here. Also avoid REX prefixes whenever possible (for lines
that needed to be modified anyway).
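
The register-saving idea can be illustrated outside the assembly with a minimal C sketch using SSE2 intrinsics. It is not the code from this commit: the word positions (0 and 4) and the function names are assumptions chosen for illustration only. The first function extracts each word into its own integer register; the second shuffles both words into the low dword and moves them out with a single transfer.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics; also provides _MM_SHUFFLE */
#include <stdint.h>

/* Before: each word lands in its own GPR (two pextrw-style extracts).
 * Word indices 0 and 4 are hypothetical, picked for this sketch. */
static void words_in_two_gprs(__m128i v, uint32_t *w0, uint32_t *w4)
{
    *w0 = (uint32_t)_mm_extract_epi16(v, 0);
    *w4 = (uint32_t)_mm_extract_epi16(v, 4);
}

/* After: shuffle word 4 next to word 0, then move one dword to a GPR. */
static uint32_t words_in_one_gpr(__m128i v)
{
    /* dwords {0, 2, 2, 3}: the low half now holds words 0, 1, 4, 5 */
    __m128i t = _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 2, 2, 0));
    /* low words {0, 2, 2, 3}: words 0 and 4 are now adjacent */
    t = _mm_shufflelo_epi16(t, _MM_SHUFFLE(3, 2, 2, 0));
    return (uint32_t)_mm_cvtsi128_si32(t); /* w0 | (w4 << 16) */
}

/* Check that both variants expose the same two words. */
static void check_packing(void)
{
    __m128i v = _mm_setr_epi16(0x1111, 0x2222, 0x3333, 0x4444,
                               0x5555, 0x6666, 0x7777, (short)0x8888);
    uint32_t w0, w4, packed = words_in_one_gpr(v);

    words_in_two_gprs(v, &w0, &w4);
    assert((packed & 0xffff) == w0);
    assert((packed >> 16) == w4);
}
```

A consumer of the packed value can read the first word from the low 16 bits and reach the second with a 16-bit shift, so only one integer register stays live.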

Old benchmarks:
hevc_h_loop_filter_luma8_skip_c:                        23.8 ( 1.00x)
hevc_h_loop_filter_luma8_skip_sse2:                      8.5 ( 2.80x)
hevc_h_loop_filter_luma8_skip_ssse3:                     7.2 ( 3.29x)
hevc_h_loop_filter_luma8_skip_avx:                       6.4 ( 3.71x)
hevc_h_loop_filter_luma8_strong_c:                     150.4 ( 1.00x)
hevc_h_loop_filter_luma8_strong_sse2:                   34.4 ( 4.37x)
hevc_h_loop_filter_luma8_strong_ssse3:                  34.5 ( 4.36x)
hevc_h_loop_filter_luma8_strong_avx:                    32.3 ( 4.65x)
hevc_h_loop_filter_luma8_weak_c:                       103.2 ( 1.00x)
hevc_h_loop_filter_luma8_weak_sse2:                     34.5 ( 2.99x)
hevc_h_loop_filter_luma8_weak_ssse3:                     7.3 (14.22x)
hevc_h_loop_filter_luma8_weak_avx:                      32.4 ( 3.18x)
hevc_h_loop_filter_luma10_skip_c:                       23.5 ( 1.00x)
hevc_h_loop_filter_luma10_skip_sse2:                     6.6 ( 3.58x)
hevc_h_loop_filter_luma10_skip_ssse3:                    6.1 ( 3.86x)
hevc_h_loop_filter_luma10_skip_avx:                      5.4 ( 4.34x)
hevc_h_loop_filter_luma10_strong_c:                    161.8 ( 1.00x)
hevc_h_loop_filter_luma10_strong_sse2:                  32.2 ( 5.03x)
hevc_h_loop_filter_luma10_strong_ssse3:                 30.4 ( 5.33x)
hevc_h_loop_filter_luma10_strong_avx:                   30.3 ( 5.33x)
hevc_h_loop_filter_luma10_weak_c:                       23.5 ( 1.00x)
hevc_h_loop_filter_luma10_weak_sse2:                     6.6 ( 3.58x)
hevc_h_loop_filter_luma10_weak_ssse3:                    6.1 ( 3.85x)
hevc_h_loop_filter_luma10_weak_avx:                      5.4 ( 4.35x)
hevc_h_loop_filter_luma12_skip_c:                       18.8 ( 1.00x)
hevc_h_loop_filter_luma12_skip_sse2:                     6.6 ( 2.87x)
hevc_h_loop_filter_luma12_skip_ssse3:                    6.1 ( 3.08x)
hevc_h_loop_filter_luma12_skip_avx:                      6.2 ( 3.06x)
hevc_h_loop_filter_luma12_strong_c:                    159.0 ( 1.00x)
hevc_h_loop_filter_luma12_strong_sse2:                  36.3 ( 4.38x)
hevc_h_loop_filter_luma12_strong_ssse3:                 36.1 ( 4.40x)
hevc_h_loop_filter_luma12_strong_avx:                   33.5 ( 4.75x)
hevc_h_loop_filter_luma12_weak_c:                       40.1 ( 1.00x)
hevc_h_loop_filter_luma12_weak_sse2:                    35.5 ( 1.13x)
hevc_h_loop_filter_luma12_weak_ssse3:                   36.1 ( 1.11x)
hevc_h_loop_filter_luma12_weak_avx:                      6.2 ( 6.52x)
hevc_v_loop_filter_luma8_skip_c:                        25.5 ( 1.00x)
hevc_v_loop_filter_luma8_skip_sse2:                     10.6 ( 2.40x)
hevc_v_loop_filter_luma8_skip_ssse3:                    11.4 ( 2.24x)
hevc_v_loop_filter_luma8_skip_avx:                       8.3 ( 3.07x)
hevc_v_loop_filter_luma8_strong_c:                     146.8 ( 1.00x)
hevc_v_loop_filter_luma8_strong_sse2:                   43.9 ( 3.35x)
hevc_v_loop_filter_luma8_strong_ssse3:                  43.7 ( 3.36x)
hevc_v_loop_filter_luma8_strong_avx:                    42.3 ( 3.47x)
hevc_v_loop_filter_luma8_weak_c:                        25.5 ( 1.00x)
hevc_v_loop_filter_luma8_weak_sse2:                     10.6 ( 2.40x)
hevc_v_loop_filter_luma8_weak_ssse3:                    44.0 ( 0.58x)
hevc_v_loop_filter_luma8_weak_avx:                       8.3 ( 3.09x)
hevc_v_loop_filter_luma10_skip_c:                       20.0 ( 1.00x)
hevc_v_loop_filter_luma10_skip_sse2:                    11.3 ( 1.77x)
hevc_v_loop_filter_luma10_skip_ssse3:                   11.0 ( 1.82x)
hevc_v_loop_filter_luma10_skip_avx:                      9.3 ( 2.15x)
hevc_v_loop_filter_luma10_strong_c:                    193.5 ( 1.00x)
hevc_v_loop_filter_luma10_strong_sse2:                  46.1 ( 4.19x)
hevc_v_loop_filter_luma10_strong_ssse3:                 44.2 ( 4.38x)
hevc_v_loop_filter_luma10_strong_avx:                   44.4 ( 4.35x)
hevc_v_loop_filter_luma10_weak_c:                       90.3 ( 1.00x)
hevc_v_loop_filter_luma10_weak_sse2:                    46.3 ( 1.95x)
hevc_v_loop_filter_luma10_weak_ssse3:                   10.8 ( 8.37x)
hevc_v_loop_filter_luma10_weak_avx:                     44.4 ( 2.03x)
hevc_v_loop_filter_luma12_skip_c:                       16.8 ( 1.00x)
hevc_v_loop_filter_luma12_skip_sse2:                    11.8 ( 1.42x)
hevc_v_loop_filter_luma12_skip_ssse3:                   11.7 ( 1.43x)
hevc_v_loop_filter_luma12_skip_avx:                      8.7 ( 1.93x)
hevc_v_loop_filter_luma12_strong_c:                    159.3 ( 1.00x)
hevc_v_loop_filter_luma12_strong_sse2:                  45.3 ( 3.52x)
hevc_v_loop_filter_luma12_strong_ssse3:                 60.3 ( 2.64x)
hevc_v_loop_filter_luma12_strong_avx:                   44.1 ( 3.61x)
hevc_v_loop_filter_luma12_weak_c:                       63.6 ( 1.00x)
hevc_v_loop_filter_luma12_weak_sse2:                    45.3 ( 1.40x)
hevc_v_loop_filter_luma12_weak_ssse3:                   11.7 ( 5.41x)
hevc_v_loop_filter_luma12_weak_avx:                     43.9 ( 1.45x)

New benchmarks:
hevc_h_loop_filter_luma8_skip_c:                        24.2 ( 1.00x)
hevc_h_loop_filter_luma8_skip_sse2:                      8.6 ( 2.82x)
hevc_h_loop_filter_luma8_skip_ssse3:                     7.0 ( 3.46x)
hevc_h_loop_filter_luma8_skip_avx:                       6.8 ( 3.54x)
hevc_h_loop_filter_luma8_strong_c:                     150.4 ( 1.00x)
hevc_h_loop_filter_luma8_strong_sse2:                   33.3 ( 4.52x)
hevc_h_loop_filter_luma8_strong_ssse3:                  32.7 ( 4.61x)
hevc_h_loop_filter_luma8_strong_avx:                    32.7 ( 4.60x)
hevc_h_loop_filter_luma8_weak_c:                       104.0 ( 1.00x)
hevc_h_loop_filter_luma8_weak_sse2:                     33.2 ( 3.13x)
hevc_h_loop_filter_luma8_weak_ssse3:                     7.0 (14.91x)
hevc_h_loop_filter_luma8_weak_avx:                      31.3 ( 3.32x)
hevc_h_loop_filter_luma10_skip_c:                       19.2 ( 1.00x)
hevc_h_loop_filter_luma10_skip_sse2:                     6.2 ( 3.08x)
hevc_h_loop_filter_luma10_skip_ssse3:                    6.2 ( 3.08x)
hevc_h_loop_filter_luma10_skip_avx:                      5.0 ( 3.85x)
hevc_h_loop_filter_luma10_strong_c:                    159.8 ( 1.00x)
hevc_h_loop_filter_luma10_strong_sse2:                  30.0 ( 5.32x)
hevc_h_loop_filter_luma10_strong_ssse3:                 29.2 ( 5.48x)
hevc_h_loop_filter_luma10_strong_avx:                   28.6 ( 5.58x)
hevc_h_loop_filter_luma10_weak_c:                       19.2 ( 1.00x)
hevc_h_loop_filter_luma10_weak_sse2:                     6.2 ( 3.09x)
hevc_h_loop_filter_luma10_weak_ssse3:                    6.2 ( 3.09x)
hevc_h_loop_filter_luma10_weak_avx:                      5.0 ( 3.88x)
hevc_h_loop_filter_luma12_skip_c:                       18.7 ( 1.00x)
hevc_h_loop_filter_luma12_skip_sse2:                     6.2 ( 3.00x)
hevc_h_loop_filter_luma12_skip_ssse3:                    5.7 ( 3.27x)
hevc_h_loop_filter_luma12_skip_avx:                      5.2 ( 3.61x)
hevc_h_loop_filter_luma12_strong_c:                    160.2 ( 1.00x)
hevc_h_loop_filter_luma12_strong_sse2:                  34.2 ( 4.68x)
hevc_h_loop_filter_luma12_strong_ssse3:                 29.3 ( 5.48x)
hevc_h_loop_filter_luma12_strong_avx:                   31.4 ( 5.10x)
hevc_h_loop_filter_luma12_weak_c:                       40.2 ( 1.00x)
hevc_h_loop_filter_luma12_weak_sse2:                    35.2 ( 1.14x)
hevc_h_loop_filter_luma12_weak_ssse3:                   29.3 ( 1.37x)
hevc_h_loop_filter_luma12_weak_avx:                      5.0 ( 8.09x)
hevc_v_loop_filter_luma8_skip_c:                        25.6 ( 1.00x)
hevc_v_loop_filter_luma8_skip_sse2:                     10.2 ( 2.52x)
hevc_v_loop_filter_luma8_skip_ssse3:                    10.5 ( 2.45x)
hevc_v_loop_filter_luma8_skip_avx:                       8.2 ( 3.11x)
hevc_v_loop_filter_luma8_strong_c:                     147.1 ( 1.00x)
hevc_v_loop_filter_luma8_strong_sse2:                   42.6 ( 3.45x)
hevc_v_loop_filter_luma8_strong_ssse3:                  42.4 ( 3.47x)
hevc_v_loop_filter_luma8_strong_avx:                    40.1 ( 3.67x)
hevc_v_loop_filter_luma8_weak_c:                        25.6 ( 1.00x)
hevc_v_loop_filter_luma8_weak_sse2:                     10.6 ( 2.42x)
hevc_v_loop_filter_luma8_weak_ssse3:                    42.7 ( 0.60x)
hevc_v_loop_filter_luma8_weak_avx:                       8.2 ( 3.11x)
hevc_v_loop_filter_luma10_skip_c:                       16.7 ( 1.00x)
hevc_v_loop_filter_luma10_skip_sse2:                    11.0 ( 1.52x)
hevc_v_loop_filter_luma10_skip_ssse3:                   10.5 ( 1.59x)
hevc_v_loop_filter_luma10_skip_avx:                      9.6 ( 1.74x)
hevc_v_loop_filter_luma10_strong_c:                    190.0 ( 1.00x)
hevc_v_loop_filter_luma10_strong_sse2:                  44.8 ( 4.24x)
hevc_v_loop_filter_luma10_strong_ssse3:                 42.3 ( 4.49x)
hevc_v_loop_filter_luma10_strong_avx:                   42.5 ( 4.47x)
hevc_v_loop_filter_luma10_weak_c:                       88.3 ( 1.00x)
hevc_v_loop_filter_luma10_weak_sse2:                    45.7 ( 1.93x)
hevc_v_loop_filter_luma10_weak_ssse3:                   10.5 ( 8.40x)
hevc_v_loop_filter_luma10_weak_avx:                     42.4 ( 2.09x)
hevc_v_loop_filter_luma12_skip_c:                       16.7 ( 1.00x)
hevc_v_loop_filter_luma12_skip_sse2:                    11.7 ( 1.42x)
hevc_v_loop_filter_luma12_skip_ssse3:                   10.5 ( 1.59x)
hevc_v_loop_filter_luma12_skip_avx:                      8.8 ( 1.90x)
hevc_v_loop_filter_luma12_strong_c:                    159.4 ( 1.00x)
hevc_v_loop_filter_luma12_strong_sse2:                  45.2 ( 3.53x)
hevc_v_loop_filter_luma12_strong_ssse3:                 59.3 ( 2.69x)
hevc_v_loop_filter_luma12_strong_avx:                   41.7 ( 3.82x)
hevc_v_loop_filter_luma12_weak_c:                       63.3 ( 1.00x)
hevc_v_loop_filter_luma12_weak_sse2:                    44.9 ( 1.41x)
hevc_v_loop_filter_luma12_weak_ssse3:                   10.5 ( 6.02x)
hevc_v_loop_filter_luma12_weak_avx:                     41.7 ( 1.52x)
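
To put a number on the "tiny speedups", the old and new timings of a function can be compared directly. The following is a small C sketch, not part of the commit: the struct, array and function names are hypothetical, and the four rows are copied from the two tables above. Single checkasm runs are noisy, so the resulting percentages are only a rough indication.

```c
#include <assert.h>
#include <stddef.h>

/* One benchmark entry: timing before and after the change. */
struct bench_row {
    const char *name;
    double old_t;
    double new_t;
};

/* Rows copied from the "Old benchmarks" and "New benchmarks" tables. */
static const struct bench_row rows[] = {
    { "hevc_h_loop_filter_luma8_strong_sse2",   34.4, 33.3 },
    { "hevc_h_loop_filter_luma10_strong_avx",   30.3, 28.6 },
    { "hevc_v_loop_filter_luma8_strong_avx",    42.3, 40.1 },
    { "hevc_v_loop_filter_luma12_strong_ssse3", 60.3, 59.3 },
};

/* Relative time saved, in percent of the old timing. */
static double percent_faster(double old_t, double new_t)
{
    return 100.0 * (old_t - new_t) / old_t;
}

/* Arithmetic mean of the per-row savings. */
static double mean_percent_faster(const struct bench_row *r, size_t n)
{
    double sum = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        sum += percent_faster(r[i].old_t, r[i].new_t);
    return sum / (double)n;
}
```

Calling `mean_percent_faster(rows, sizeof(rows) / sizeof(rows[0]))` on these four rows gives a saving of a few percent, which matches the description in the commit message.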

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2026-01-29 11:54:57 +01:00