- #!/bin/sh
- #
- # Copyright (C) 2015 Ernst W. Mayer.
- # Permission is granted to copy, distribute and/or modify
- # this document under the terms of the GNU Free Documentation License,
- # Version 1.3 or any later version published by
- # the Free Software Foundation; with no Invariant Sections,
- # no Front-Cover Texts, and no Back-Cover Texts.
- cat <<'EOF'
- Please check http://hogranch.com/mayer/README.html#news for latest news.
- 11 Dec 2014: v14.1 released. Again thanks to Mike Vang for doing independent-
- build and QA work. This release features significant performance and accuracy
- improvements for all recent x86 platforms, and especial gains for Intel
- Haswell (and soon, Broadwell) users.
- First, a note on version numbering: After many years frozen at 3.0x (since my
- x86 code was until recently wildly uncompetitive with Prime95), now that I'm
- < 2x slower (on Haswell+), I have resumed version numbering according to the
- scheme:
- Major index = year - 2000
- Minor index = release # of that year, zero-indexed.
- As before, a patch suffix of x, y, or z following the numeric index indicates
- an [alpha,beta,gamma] (experimental,unstable) code. I consider this release
- stable and it's the 2nd release of the year, thus
- 14.1 = [20]14.[--2][no xyz suffix].
- What's new/improved:
- 1. Self-test now has prestored residues for 10000 iterations
- (at least through FFT length 18432K), in addition to the previously-
- supported 100 and 1000. As before, to use a non-default #iters
- (default is 100) for a given self-test range, add
- '-iters [1000 | 10000]' to the command line.
- 2. One no longer needs any special flags like DFT_V* for any FMA-
- using routines - just -DUSE_THREADS to enable multithreading
- and -DUSE_[SSE2 | AVX | AVX2] to select an x86 SIMD-vector-instruction
- target.
- 3. Propagated Fused multiply-add optimizations to all key discrete
- Fourier transform (DFT) and related arithmetic macros. "FMA everywhere"
- means Haswell users should see at least a 10% speedup for their AVX2
- builds, compared to plain AVX.
- 4. Overall accuracy should be appreciably better, meaning users should
- see very few roundoff warnings, even for 10Kiter self-tests.
- 5. The program now only reports roundoff warnings if it encounters a
- fractional part > 0.40625 (previous was >= 0.4) during the carry step.
- Some self-tests (meaning exponents right at the upper limit for the
- given FFT length, by definition) were emitting slews of 0.40625
- warnings, but as this error level is nearly always benign, I've
- silenced the warnings for it.
- Larger errors will still emit warnings as before.
- 6. The accuracy-problematic radix-11 DFTs (used to build composite
- leading radices such as 44 and 176) have improved accuracy in SSE2
- and AVX modes, but will still emit a few roundoff warnings in longer
- self-tests for certain radix combos in those build types. In AVX2 mode,
- however, the fact that "multiplies are free" (assuming we can fuse
- them with adds, which we can to a very large extent in this case)
- allowed me to implement an entirely different radix-11 algorithm which is
- much more multiply-heavy, but has significantly better roundoff
- properties. Thus AVX2 builders will see dramatically lower roundoff
- errors for FFT lengths using the aforementioned leading radices 44
- and 176.
- 7. I added large-stride prefetching to all the carry routines. Each such
- routine runs the final-radix-pass of the inverse FFT, followed by the
- normalize/carry step, followed by the initial-radix-pass of the
- subsequent iteration's forward FFT, and the 2 DFTs bookending the carry
- step access data in large strides and are thus problematic for the kinds
- of default data prefetching done by most x86 hardware. That "manual
- assist" prefetch should provide a nice boost (5-10% for me at FFT
- lengths in the Mdoubles range) for all build modes.
- 18 Sep 2014: Special thanks to Mike Vang for doing significant amounts of QA
- work and making numerous feature-related suggestions for this version.
- This release features mostly modest changes:
- Restoration of 32-bit USE_SSE2 build mode (GCC/clang only - no Visual
- Studio). But see the comments in the build section regarding the need
- to build some files using GCC (i.e. no pure-Clang builds in 32-bit
- mode).
- A new initial-FFT-pass/final-iFFT-pass radix, 288, which should
- provide a decent (~5%) speedup for folks doing 100Mdigit-range
- assignments (FFT length 18Mdoubles = 18432K).
- To allow for incremental rerun of testcases (e.g. ones which fail to
- match an independent test done by another user/machine, which is the
- standard matching-double-check requirement for "exponent retirement"
- by GIMPS), the program now saves a unique-named bytewise restart file
- every millionth iteration, i.e. if you are testing the Mersenne
- number M(XXX), in addition to the status (log) file pXXX.stat and the
- pair of redundant checkpoint files pXXX and qXXX, you will also see
- files pXXX.1M, pXXX.2M, etc, get deposited as those iteration
- milestones are passed. Note that in order to avoid an unneeded file-
- copy and to minimize the chances of a bad disk sector corrupting
- a run, the way this works is that when it comes time to write the
- checkpoint files for (say) iteration 1010000 (1.01M), the code simply
- renames the current pXXX savefile (containing data for iteration 1M)
- to pXXX.1M, then creates a new pXXX file to write the new-checkpoint
- data to. (The redundant q-savefile is unaffected by this).
- Note also that as these files do pile up quickly on a fast machine,
- especially if disk space is constrained (for instance if you are
- using a smallish SSD rather than a big old-style moving-parts HD),
- you will want to "offload" these Miteration files periodically to
- either a larger drive or backup media, and/or delete them if the
- result double-checks OK.
- 23 Jun 2014: Special thanks to Stephen Searle for doing significant amounts of
- analysis and debug of the code in this version.
- This release features the following major enhancements and changes:
- Continuing the multithread optimizations described in the previous
- release below, new initial-FFT-pass/final-iFFT-pass radices
- 128,144,160,176,192,208,224,240,256, as well as some larger
- experimental radices 768,960,1008 and 4032. The latter are not
- currently useful for LL testing (as the obligatory self-tests which
- create the mlucas.cfg file optimized for the user's machine will
- reveal, by way of absence of said radices in the best-radix-set data
- captured in the .cfg file), but the radices in the 128-256 range
- should provide a benefit for most users, especially for FFT lengths
- of roughly 2048 Kdoubles and larger.
- Fused-multiply-add (FMA) support for Intel Haswell (and beyond).
- Since Intel released their FMA support in the same chip generation they
- used to deploy the AVX2 instructions, use of FMA is triggered via
- -DUSE_AVX2 at compile time. Currently only a limited fraction of the
- key code macros use FMA, but this will continue to expand as I get a
- better sense of where use of FMA is most likely to yield a benefit.
- (This depends sensitively on the details of the particular FFT
- implementation, for example whether a pre-twiddles or post-twiddles
- complex-multiply scheme is used for the various passes of the inverse
- FFT; Mlucas uses the latter, which is nice from a dataflow-symmetry
- and auxiliary-data-indexing perspective, but is not favorable for an
- FMA-based code speedup.)
- A compact-object code scheme for all the carry-step-wrapping DFT
- radices >= 32. This yields a significant throughput boost for older
- and more bandwidth-limited processors such as Core2 and Sandy/Ivy
- Bridge. The speedups are more modest on Haswell, but even there the
- user will at least enjoy the slashed compile times for the larger-radix
- radix**_ditN_cy_dif1.c sourcefiles in question. Compile times (and
- likely run times) for non-SIMD (i.e. scalar-double C code) builds on
- non-x86 hardware will benefit similarly.
- Multiple bugfixes, most related to self-testing and thread-safeness.
- The format for the per-iteration timing data written to the mlucas.cfg
- file created by running the automated self-tests is changed from seconds
- to milliseconds in this version, to provide finer-grained numbers.
- 02 Oct 2013 (Patched rev1 posted 09 May 2014): This release features the
- following major enhancements and changes:
- AVX-instruction-set inline assembly support for 64-bit Linux/GCC and
- MacOS (both GCC and LLVM/clang). This yields nice speedups over the
- SSE2-based SIMD code on Intel chips supporting AVX (Sandy/Ivy Bridge and
- Haswell/Broadwell). Owners of AMD CPUs featuring AVX are welcome to try
- the code out, but should not get their hopes up too much, as AMD's
- implementation of AVX appears to be disappointing in terms of
- performance.
- Although the 32-bit Windows/MSVC and Linux/GCC inline assembler of the
- previous release is still all there, as of this version 32-bit support
- for x86 SIMD builds is officially discontinued. Builders using 64-bit
- Windows should use a *nix virtualization package such as mingw64.
- The previously-available-by-request-only threadpool code is now
- included in the release. See build instructions below for details.
- Several new carry-step-wrapping "initial FFT pass" DFT radices:
- 48,56,64, all fully SIMD-capable. These are added to the existing
- SIMD-capable radix-16,20,24,28,32,36,40,44,52,60 carry-step-wrapping
- DFTs. The reason I added the new radices is related to ongoing
- experience with multithreaded performance: In particular, leading
- radices greater than 32 or so tend to perform quite poorly in
- unthreaded-build mode and for FFT lengths < 2048 Kdoubles (which guided
- most of the codebase evolution until quite recently), but are standouts
- in multithreaded mode and for large FFT lengths.
- Since the parallelization strategy I use for my FFT means that
- "maximum number of independent thread-based work chunks" is directly
- related to the above leading-radix, the emerging manycore
- (GPU and similar) paradigm will be driving adoption of even larger DFT
- radices in future releases.
- Multithreaded (pthread/threadpool) support extended to the non-SIMD
- (i.e. scalar-double) code. This replaces the previous and only-
- partially-working threading model based on the OpenMP API, with its
- weird (and virtually-impossible-to-debug) performance issues and
- opaque interface. For code such as mine, opacity of the threading-
- interface is not advantageous, especially in terms of basic-
- development-and-debug work.
- 04 Feb 2013:
- Lots of SSE2-related enhancements, including inline assembler optimized
- for 64-bit OSes via use of the full 16-XMM-register set. New SSE2-
- supported carry-step-wrapping DFT radices, yielding SSE2-able
- radix-16,20,24,28,32,36,40,44,52,60 carry steps.
- Multithreaded (pthread/threadpool) SSE2 support! This code was used for
- the new-Mersenne-prime verification run described below. The threadpool
- code is not included in the default release; please contact the author
- if you wish to play with multithreaded builds of the code.
- Mlucas SSE2 used to verify the 48th known Mersenne prime. Note that
- the author could have done the verification himself in around 11 days
- on his humble quad-core Sandy Bridge box, but since for new-prime
- verifies such as this wall-clock time is the overriding factor, it
- makes sense to run on the fastest hardware available, even if this is
- relatively less efficient than running on a fewer-core workstation. In
- the present case, Serge Batalov ran the verify in 6 days on a 32-core
- Xeon cluster kindly made available by Novartis Inc. Due to poor scaling
- of the parallel code beyond 4 cores, this represents significantly more
- total cycles (and watt-hours) than a 4-core run would need, but we find
- new Mersenne primes rarely enough that such cycle-wastage is justified.
- (And why hog all the fun, I say - Serge said he hadn't had this much
- computational fun in years.)
- 06 Nov 2009: Well, it took a full year longer than I had hoped, but a tarball
- of the Mlucas v3.0 beta code described in the entry below is finally available.
- This has SSE2 inline assembly support for 32-bit Windows and 32/64-bit Linux,
- but no PrimeNet support (yet) ... the latter will come later this year, if
- things go reasonably according to plan. A GUI will have to wait for at least
- another year. But the code is sufficiently ready for early adopters to run on
- their x86 machines (Win32, 32 and 64-bit Linux and MacOS ... code is most-
- optimized for the latter) and for builders, profilers and assembler experts to
- have a look and send me feedback and suggestions for improvement.
- 15 Sep 2008: Mlucas 3.0 used to verify 45th and 46th known Mersenne primes.
- Note that the verify runs by Tom Duell and Rob Giltrap of Sun Microsystems used
- a pre-beta version of Mlucas 3.0, scheduled for official release later this
- Fall. Key new features of the upcoming release [besides a radically overhauled
- header-file structure and many other code cleanups and bugfixes] include:
- SSE2 inline assembly support [at least for FFT lengths which are powers
- of 2 or divisible by the small odd primes 3, 5 and 7] - this will
- provide a roughly 2x speedup over the previous generic-C build on the
- newer x86 platforms [AMD64, Core2 and beyond]. Initial targets will be
- 32-bit Windows and 32/64-bit Linux, as well as MacOS.
- Platform-independent compact bytewise savefile format - you can now
- transfer savefiles between any systems having a working 3.0 build,
- independently of the Endian-ness and 32-vs-64-bit-ness of the
- platform.
- Copyright:
- Copyright (C) 2015 Ernst W. Mayer.
- Permission is granted to copy, distribute and/or modify
- this document under the terms of the GNU Free Documentation License,
- Version 1.3 or any later version published by
- the Free Software Foundation; with no Invariant Sections,
- no Front-Cover Texts, and no Back-Cover Texts.
- EOF
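The version-numbering scheme described in the 11 Dec 2014 entry (major index = year - 2000, minor index = zero-indexed release number within that year, optional x/y/z suffix for experimental code) can be sketched as a small shell function. This is an illustration only and not code from Mlucas; the function name `mlucas_version` and its argument order are hypothetical.

```sh
#!/bin/sh
# Hypothetical helper (not part of Mlucas): build a version string from
# the scheme given in the news text.
#   major  = year - 2000
#   minor  = release number within that year, zero-indexed
#   suffix = x, y or z for alpha, beta or gamma code; empty if stable
mlucas_version() {
    year=$1        # four-digit release year, e.g. 2014
    nth=$2         # 1 for the first release of the year, 2 for the second
    suffix=${3:-}  # optional experimental-code suffix
    major=$((year - 2000))
    minor=$((nth - 1))
    printf '%s.%s%s\n' "$major" "$minor" "$suffix"
}

mlucas_version 2014 2      # second stable release of 2014: prints 14.1
mlucas_version 2015 1 x    # made-up alpha release: prints 15.0x
```

The first call reproduces the 14.1 worked out in the text; the second uses a made-up release purely to show where the suffix goes.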
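The million-iteration savefile scheme from the 18 Sep 2014 entry (rename the current pXXX file to pXXX.1M instead of copying it, then write a fresh pXXX) can be sketched in shell as follows. This is a rough illustration under stated assumptions, not the program's actual implementation: the function `write_checkpoint`, the 10000-iteration checkpoint interval (inferred from the 1010000 example in the text), the exponent 12345 and the placeholder file contents are all hypothetical, and the redundant q-savefile is left out.

```sh
#!/bin/sh
# Illustrative sketch only; names and the checkpoint interval are assumptions.
INTERVAL=10000       # assumed number of iterations between checkpoints
MILESTONE=1000000    # one persistent savefile per million iterations

# write_checkpoint EXPONENT ITERATION
# Mimics the rename-then-write step for the primary savefile pXXX.
write_checkpoint() {
    exp=$1
    iter=$2
    prev=$((iter - INTERVAL))   # iteration held by the current savefile
    if [ "$prev" -gt 0 ] && [ $((prev % MILESTONE)) -eq 0 ] && [ -f "p$exp" ]; then
        # Rename rather than copy, so the milestone data is never rewritten.
        mv "p$exp" "p$exp.$((prev / MILESTONE))M"
    fi
    # Placeholder payload; a real savefile would hold the bytewise residue.
    printf 'iteration %s\n' "$iter" > "p$exp"
}

# Demo in a scratch directory: checkpoints at 990000, 1000000 and 1010000.
demo_dir=$(mktemp -d) && cd "$demo_dir" || exit 1
iter=990000
while [ "$iter" -le 1010000 ]; do
    write_checkpoint 12345 "$iter"   # 12345 is a made-up exponent
    iter=$((iter + INTERVAL))
done
ls   # lists p12345 and p12345.1M
```

After the demo, p12345.1M holds the iteration-1000000 data and p12345 holds the iteration-1010000 data, matching the sequence of events the entry describes.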