The idea is to split the 16 bit coefficients into lower and upper half,
invoke udot for the lower half, shift by 8, and follow by udot for the
upper half.
Benchmark on A78:
bgra_to_y_128_c: 682.0 ( 1.00x)
bgra_to_y_128_neon: 181.2 ( 3.76x)
bgra_to_y_128_dotprod: 117.8 ( 5.79x)
bgra_to_y_1080_c: 5742.5 ( 1.00x)
bgra_to_y_1080_neon: 1472.5 ( 3.90x)
bgra_to_y_1080_dotprod: 906.5 ( 6.33x)
bgra_to_y_1920_c: 10194.0 ( 1.00x)
bgra_to_y_1920_neon: 2589.8 ( 3.94x)
bgra_to_y_1920_dotprod: 1573.8 ( 6.48x)
Signed-off-by: Martin Storsjö <martin@martin.st>