- Please check http://hogranch.com/mayer/README.html#news for latest news.
- 07 Feb 2016: Patched source file:
- Several users have reported a bug in the get_fft_radices.c file,
- which affects non-SIMD unthreaded builds (i.e. ones for which the
- preprocessor flag USE_ONLY_LARGE_LEAD_RADICES = False based on the logic
- at top of the file); my pre-release testing apparently omitted this
- build mode. The error message looks like this:
- ../src/get_fft_radices.c: In function 'get_fft_radices':
- ../src/get_fft_radices.c:1446:3: error: duplicate case value
- ../src/get_fft_radices.c:1443:3: error: previously used here
- The fix is either to increment by 1 both of the two case values in the
- #ifndef USE_ONLY_LARGE_LEAD_RADICES-wrapped logic following the above
- line, or to use this patched version of the file
- (15KB, md5 checksum = db5d2504d58897229d0366f4749b4131) from my
- dev-branch, which fixes the duplicate case error and further adds some
- radix sets which should help with non-SIMD build performance.
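To illustrate how this kind of error arises, here is a minimal sketch of a switch statement whose cases are partly wrapped in a preprocessor conditional. The function name, set indices, and radix values are hypothetical and are not taken from get_fft_radices.c; only the flag name USE_ONLY_LARGE_LEAD_RADICES comes from the note above. The sketch shows the repaired numbering, with a comment marking where the collision would occur.

```c
#include <assert.h>

/* USE_ONLY_LARGE_LEAD_RADICES is deliberately left undefined here, to mimic
   the non-SIMD unthreaded build mode in which the error appeared. */

/* Hypothetical selector mapping a radix-set index to a leading radix.
   Indices and radices are made up for illustration only. */
static int lead_radix_for_set(int radix_set)
{
    assert(radix_set >= 0);
    switch (radix_set) {
    case 0: return 32;
    case 1: return 16;
#ifndef USE_ONLY_LARGE_LEAD_RADICES
    /* If these two cases were numbered 1 and 2, 'case 1' would appear twice
       whenever the flag is undefined, and the compiler would report
       "error: duplicate case value". Incrementing both by 1 avoids that. */
    case 2: return 8;
    case 3: return 4;
#endif
    default: return 0;   /* no such radix set */
    }
}
```

Because the colliding cases exist only when the flag is undefined, a test matrix that covers only builds with the flag defined would compile cleanly and never see the error.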
- 24 Aug 2015: Beta versions of automated build tools:
- Thanks to Alex Vong for these, which he created as part of a proposed
- Debian packaging of the Mlucas 14.1 release. The resulting tarballs
- also contain numerous bugfixes (mostly minor) which will also be in the
- upcoming v15 release.
- In addition to pristine tarballs with C source files only (and manual
- build procedure documented below in
- "Download and Build the Source Code"), we provide experimental tarballs
- with autotools as well, which for the current release come in 2
- "same package, different compression tools" tarballs:
- mlucas-14.1.tar.gz (3.4 MB) and mlucas-14.1.tar.xz (1.5 MB). If your
- Linux distro (or personal setup) has support for the Xz compression
- package, you'll obviously want to get the second,
- much-smaller-compressed tarball. Once you have downloaded the desired
- tarball, you can optionally verify the integrity of the downloaded
- file. To do so, first import the needed public key via
- $ gpg --keyserver pgp.mit.edu --recv-keys 93518580
- Then, depending on whether you downloaded the .gz or .xz-compressed
- tarball, download mlucas-14.1.tar.gz.sig or mlucas-14.1.tar.xz.sig,
- respectively, to obtain the corresponding detached package signature.
- (Note that despite the extensions these are both simple ascii text
- files.)
- Finally, verify the package using
- $ gpg --verify mlucas-14.1.tar.gz.sig mlucas-14.1.tar.gz
- or
- $ gpg --verify mlucas-14.1.tar.xz.sig mlucas-14.1.tar.xz
- A successful signature verification returns
- "Good signature from Alex Vong <e-mail address>" --
- note that the ensuing
- WARNING: This key is not certified with a trusted signature!
- There is no indication that the signature belongs to the owner.
- is expected.
- The README file in the untarred package (a text file, not to be
- confused with this HTML page which you are reading) contains simple
- instructions on how to compile, run tests and install. If you do not
- understand the instructions, please consult the INSTALL file which
- contains detailed instructions and explanations. You can also search
- for the basic Linux make-package command sequence `./configure && make'
- on the Internet.
- 11 Dec 2014: v14.1 released. Again thanks to Mike Vang for doing
- independent-build and QA work. This release features significant performance
- and accuracy improvements for all recent x86 platforms, and especial gains for
- Intel Haswell (and soon, Broadwell) users.
- First, a note on version numbering: After many years frozen at 3.0x
- (since my x86 code was until recently wildly uncompetitive with Prime95),
- now that I'm < 2x slower (on Haswell+), I have resumed version numbering
- according to the scheme:
- Major index = year - 2000
- Minor index = release # of that year, zero-indexed.
- As before, a patch suffix of x, y, or z following the numeric index indicates
- an [alpha,beta,gamma] (experimental, unstable) code. I consider this
- release stable and it's the 2nd release of the year,
- thus 14.1 = [20]14.[--2][no xyz suffix].
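The numbering scheme above can be sketched as a small formatting routine. The function name and signature are hypothetical and are not part of the Mlucas sources; the sketch assumes the release count within a year is given 1-based, so that subtracting 1 yields the zero-indexed minor index.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes "<major>.<minor>[suffix]" into buf.
   year            full calendar year, e.g. 2014
   release_ordinal 1-based count of releases in that year (2 = 2nd release)
   suffix          'x', 'y' or 'z' for experimental code, '\0' for stable
   Returns the number of characters snprintf would have written. */
static int format_version(char *buf, size_t len, int year,
                          int release_ordinal, char suffix)
{
    assert(year >= 2000 && release_ordinal >= 1);
    int major = year - 2000;          /* major index = year - 2000 */
    int minor = release_ordinal - 1;  /* minor index is zero-indexed */
    if (suffix != '\0')
        return snprintf(buf, len, "%d.%d%c", major, minor, suffix);
    return snprintf(buf, len, "%d.%d", major, minor);
}
```

Calling format_version(buf, sizeof buf, 2014, 2, '\0') fills buf with "14.1", matching the stable second release of 2014 described above.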
- What's new/improved:
- 1. Self-test now has prestored residues for 10000 iterations (at least
- through FFT length 18432K), in addition to the previously-supported 100
- and 1000. As before, to use a non-default #iters (default is 100) for a
- given self-test range, add '-iters [1000 | 10000]' to the command line.
- 2. One no longer needs any special flags like DFT_V* for any FMA-using
- routines - just DUSE_THREADS to enable multithreading and
- DUSE_[SSE2 | AVX | AVX2] to select an x86 SIMD-vector-instruction
- target.
- 3. Propagated fused multiply-add (FMA) optimizations to all key discrete
- Fourier transform (DFT) and related arithmetic macros. "FMA everywhere"
- means Haswell users should see at least a 10% speedup for their AVX2
- builds, compared to plain AVX.
- 4. Overall accuracy should be appreciably better, meaning users should
- see very few roundoff warnings, even for 10000-iteration self-tests.
- 5. The program now only reports roundoff warnings if it encounters a
- fractional part > 0.40625 (previously >= 0.4) during the carry step.
- Some self-tests (meaning exponents right at the upper limit for the
- given FFT length, by definition) were emitting slews of 0.40625
- warnings, but as this error level is nearly always benign, I've
- silenced the warnings for it. Larger errors will still emit warnings as
- before.
- 6. The accuracy-problematic radix-11 DFTs (used to build composite
- leading radices such as 44 and 176) have improved accuracy in SSE2 and
- AVX modes, but will still emit a few roundoff warnings in longer
- self-tests for certain radix combos in those build types. In AVX2 mode,
- however, the fact that "multiplies are free" (assuming we can fuse them
- with adds, which we can to a very large extent in this case) allowed me
- to implement an entirely different radix-11 algorithm which is much more
- multiply-heavy, but has significantly better roundoff properties. Thus
- AVX2 builders will see dramatically lower roundoff errors for FFT
- lengths using the aforementioned leading radices 44 and 176.
- 7. I added large-stride prefetching to all the carry routines, since
- the 2 DFTs (specifically the final-radix-pass of the inverse FFT,
- followed by the normalize/carry step, followed by the
- initial-radix-pass of the subsequent iteration's forward FFT)
- bookending the carry step in those routines access data in large strides and are
- thus problematic for the kinds of default data prefetching done by most
- x86 hardware. That "manual assist" prefetch should provide a nice boost
- (5-10% for me at FFT lengths in the Mdoubles range) for all build
- modes.
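As an illustration of the roundoff-warning change in item 5, the following is a minimal sketch of the old and new threshold tests. It assumes that "fractional part" means the distance from a floating-point value to the nearest integer; the function and macro names are hypothetical and do not come from the Mlucas sources.

```c
#include <assert.h>

#define ROE_WARN_OLD 0.4       /* earlier behavior: warn when error >= 0.4 */
#define ROE_WARN_NEW 0.40625   /* v14.1 behavior: warn when error > 0.40625 */

/* Distance from x to the nearest integer, always in [0, 0.5].
   Uses a plain cast for rounding, so it is only meant for values
   that fit comfortably in a long long. */
static double frac_error(double x)
{
    double nearest = (double)(long long)(x < 0.0 ? x - 0.5 : x + 0.5);
    double err = x - nearest;
    return err < 0.0 ? -err : err;
}

/* Returns 1 if the old rule would have emitted a warning. */
static int warns_old(double x) { return frac_error(x) >= ROE_WARN_OLD; }

/* Returns 1 if the new rule emits a warning. */
static int warns_new(double x) { return frac_error(x) > ROE_WARN_NEW; }
```

A value such as 5.40625 has an error of exactly 0.40625, so under these assumptions the old rule flags it while the new rule stays silent; an error of 0.4375 is flagged by both.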
- Copyright:
- Copyright (C) 2015 Ernst W. Mayer.
- Permission is granted to copy, distribute and/or modify
- this document under the terms of the GNU Free Documentation License,
- Version 1.3 or any later version published by
- the Free Software Foundation; with no Invariant Sections,
- no Front-Cover Texts, and no Back-Cover Texts.