This can be done by reusing the destination array to store
a temporary LUT (with only half the amount of elements, but
double the element size). This relies on certain assumptions
about sizes, but they are always fulfilled for systems supported
by us (sizeof(double) == 8 is needed/guaranteed since commit
3383a53e7d). Furthermore,
sizeof(uint32_t) is always >= four because CHAR_BIT is eight
(in fact, sizeof(uint32_t) is four, because the exact width
types don't have any padding).
Reviewed-by: Zhao Zhili <quinkblack@foxmail.com>
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
ff_cbrt_tableinit{,_fixed}() uses a LUT of doubles of the same size
as the actual LUT to be initialized (8192 elems). Said LUT is only
used to initialize another LUT, but because it is static, the dirty
memory (64KiB) is not released. It is also too large to be put on
the stack.
This commit mitigates this: We only use a LUT for the powers of
odd values, thereby halving its size. The generated LUT stays unchanged.
Reviewed-by: Zhao Zhili <quinkblack@foxmail.com>
Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
This exploits an approach based on the sieve of Eratosthenes, a popular
method for generating prime numbers.
Tables are identical to previous ones.
Tested with FATE with/without --enable-hardcoded-tables.
Sample benchmark (Haswell, GNU/Linux+gcc):
prev:
7860100 decicycles in cbrt_tableinit, 1 runs, 0 skips
7777490 decicycles in cbrt_tableinit, 2 runs, 0 skips
[...]
7582339 decicycles in cbrt_tableinit, 256 runs, 0 skips
7563556 decicycles in cbrt_tableinit, 512 runs, 0 skips
new:
2099480 decicycles in cbrt_tableinit, 1 runs, 0 skips
2044470 decicycles in cbrt_tableinit, 2 runs, 0 skips
[...]
1796544 decicycles in cbrt_tableinit, 256 runs, 0 skips
1791631 decicycles in cbrt_tableinit, 512 runs, 0 skips
Both small and large run count given as this is called once so small run
count may give a better picture, small numbers are fairly consistent,
and there is a consistent downward trend from small to large runs,
at which point it stabilizes to a new value.
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: Ganesh Ajjanagadde <gajjanagadde@gmail.com>
On systems having cbrt, there is no reason to use the slow pow function.
Sample benchmark (x86-64, Haswell, GNU/Linux):
new:
5124920 decicycles in cbrt_tableinit, 1 runs, 0 skips
old:
12321680 decicycles in cbrt_tableinit, 1 runs, 0 skips
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Ganesh Ajjanagadde <gajjanagadde@gmail.com>
Use macros from aac_defines.h for adding suffixes
instead of local macros.
Signed-off-by: Nedeljko Babic <nedeljko.babic@imgtec.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Add fixed point implementation of functions for generating tables
Signed-off-by: Nedeljko Babic <nedeljko.babic@imgtec.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
cbrtf() took floats but it represented 1/3 exactly
and even if not more precission should be better in theory
for the table generation
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
You cannot count on them being present on all systems, and you
cannot include libm.h in a host tool, so just hard code baseline
implementations.
Signed-off-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
* qatar/master:
ffmpeg: get rid of the -vglobal option.
dct32: Add AVX implementation of 32-point DCT
dct32: Change pass 6 permutation to allow for AVX implementation
dct32: port SSE 32-point DCT to YASM
multiple inclusion guard cleanup
avio: document buffer must created with av_malloc() and friends
avio: check AVIOContext malloc failure
swscale: point out an alternative to sws_getContext
svq3: Do initialization after parsing the extradata
add changelog entries for 0.7_beta2
mp3lame: add #include required for AV_RB32 macro.
Conflicts:
Changelog
libavcodec/svq3.c
libavcodec/x86/dct32_sse.c
libavfilter/vsrc_buffer.h
Merged-by: Michael Niedermayer <michaelni@gmx.at>