#!/bin/sh
#
# Copyright (C) 2015 Ernst W. Mayer.
# Permission is granted to copy, distribute and/or modify
# this document under the terms of the GNU Free Documentation License,
# Version 1.3 or any later version published by
# the Free Software Foundation; with no Invariant Sections,
# no Front-Cover Texts, and no Back-Cover Texts.
cat <<'EOF'
Please check http://hogranch.com/mayer/README.html#news for latest news.
11 Dec 2014: v14.1 released. Again thanks to Mike Vang for doing
independent-build and QA work. This release features significant performance
and accuracy improvements for all recent x86 platforms, and especially large
gains for Intel Haswell (and soon, Broadwell) users.
First, a note on version numbering: After many years frozen at 3.0x (since my
x86 code was until recently wildly uncompetitive with Prime95), now that I'm
< 2x slower (on Haswell+), I have resumed version numbering according to the
scheme:
Major index = year - 2000
Minor index = release # of that year, zero-indexed.
As before, a patch suffix of x, y, or z following the numeric index indicates
an [alpha,beta,gamma] (experimental,unstable) code. Since I consider this
release stable and it's the 2nd release of the year, we have
14.1 = [20]14.[2-1][no xyz suffix].
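
As a rough illustration of the numbering scheme above (a sketch only, not part
of Mlucas; the function name mlucas_version and its argument order are made up
for this example), the mapping from release date to version string can be
written in a few lines of POSIX shell:

```sh
# Hypothetical helper, not part of Mlucas.
# $1 = release year, $2 = 1-based release count within that year,
# $3 = optional patch suffix (x, y or z) for experimental code.
mlucas_version() {
    major=$(($1 - 2000))    # major index = year - 2000
    minor=$(($2 - 1))       # minor index is zero-indexed
    printf '%s.%s%s\n' "$major" "$minor" "${3:-}"
}

mlucas_version 2014 2       # 2nd release of 2014, stable: prints 14.1
```

Under the same assumptions, an experimental third release of 2014 would come
out as 14.2x via 'mlucas_version 2014 3 x'.
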
What's new/improved:
1. Self-test now has prestored residues for 10000 iterations
(at least through FFT length 18432K), in addition to the previously-supported
100 and 1000. As before, to use a non-default #iters (default is 100) for a
given self-test range, add '-iters [1000 | 10000]' to the command line.
2. One no longer needs any special flags like DFT_V* for any FMA-using
routines - just -DUSE_THREADS to enable multithreading and
-DUSE_[SSE2 | AVX | AVX2] to select an x86 SIMD-vector-instruction target.
3. Propagated fused multiply-add (FMA) optimizations to all key discrete
Fourier transform (DFT) and related arithmetic macros. "FMA everywhere"
means Haswell users should see at least a 10% speedup for their AVX2
builds, compared to plain AVX.
4. Overall accuracy should be appreciably better, meaning users should
see very few roundoff warnings, even for 10000-iteration self-tests.
5. The program now only reports roundoff warnings if it encounters a
fractional part > 0.40625 (previously the threshold was >= 0.4) during the
carry step. Some self-tests (meaning exponents right at the upper limit for
the given FFT length, by definition) were emitting slews of 0.40625
warnings, but as this error level is nearly always benign, I've silenced
the warnings for it. Larger errors will still emit warnings as before.
6. The accuracy-problematic radix-11 DFTs (used to build composite
leading radices such as 44 and 176) have improved accuracy in SSE2
and AVX modes, but will still emit a few roundoff warnings in longer
self-tests for certain radix combos in those build types. In AVX2 mode,
however, the fact that "multiplies are free" (assuming we can fuse
them with adds, which we can to a very large extent in this case)
allowed me to implement an entirely different radix-11 algorithm which is
much more multiply-heavy, but has significantly better roundoff
properties. Thus AVX2 builders will see dramatically lower roundoff
errors for FFT lengths using the aforementioned leading radices 44
and 176.
7. I added large-stride prefetching to all the carry routines, since
the 2 DFTs bookending the carry step in those routines (the final-radix
pass of the inverse FFT, which precedes the normalize/carry step, and the
initial-radix pass of the subsequent iteration's forward FFT, which follows
it) access data in large strides and are thus problematic for the kinds of
default data prefetching done by most x86 hardware. That "manual assist"
prefetch should provide a nice boost (5-10% for me at FFT lengths in the
Mdoubles range) for all build modes.
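
The warning rule in item 5 can be sketched as follows. This is an illustration
in POSIX shell and awk, not the Mlucas carry code; the function name
roundoff_warns is hypothetical, and the sketch assumes that "fractional part"
means the distance of a floating-point value from its nearest integer:

```sh
# Hypothetical sketch, not Mlucas code.
# Succeeds (exit status 0) if the value $1 lies more than 0.40625 away
# from its nearest integer, i.e. if a roundoff warning would be issued.
roundoff_warns() {
    awk -v x="$1" 'BEGIN {
        n = int(x + (x < 0 ? -0.5 : 0.5))   # nearest integer to x
        frac = x - n
        if (frac < 0) frac = -frac           # distance from that integer
        exit !(frac > 0.40625)               # strict ">", so 0.40625 is silent
    }'
}

roundoff_warns 3.45    && echo "3.45: warn"
roundoff_warns 3.40625 || echo "3.40625: silent"
```

The strict inequality is the point of the change: a value sitting exactly at
0.40625 no longer triggers a warning, while anything larger still does.
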
18 Sep 2014: Special thanks to Mike Vang for doing significant amounts of QA
work and making numerous feature-related suggestions for this version.
This release features mostly modest changes:
Restoration of 32-bit USE_SSE2 build mode (GCC/clang only - no Visual
Studio). But see the comments in the build section regarding the need
to build some files using GCC (i.e. no pure-Clang builds in 32-bit mode).
A new initial-FFT-pass/final-iFFT-pass radix 288, which should
provide a decent (~5%) speedup for folks doing 100Mdigit-range
assignments (FFT length 18Mdoubles = 18432K).
To allow for incremental rerun of testcases (e.g. ones which fail to
match an independent test done by another user/machine, which is the
standard matching-double-check requirement for "exponent retirement"
by GIMPS), the program now saves a unique-named bytewise restart file
every millionth iteration. That is, if you are testing the Mersenne
number M(XXX), then in addition to the status (log) file pXXX.stat and the
pair of redundant checkpoint files pXXX and qXXX, you will also see
files pXXX.1M, pXXX.2M, etc, get deposited as those iteration
milestones are passed. Note that in order to avoid an unneeded file-copy
and to minimize the chance of a bad disk sector corrupting a run, the way
this works is that when it comes time to write the checkpoint files for
(say) iteration 1010000 (1.01M), the code simply renames the current pXXX
savefile (containing data for iteration 1M) to pXXX.1M, then creates a new
pXXX file to write the new-checkpoint data to. (The redundant q-savefile is
unaffected by this.)
Note also that as these files do pile up quickly on a fast machine,
especially if disk space is constrained (for instance if you are
using a smallish SSD rather than a big old-style moving-parts HD),
you will want to "offload" these Miteration files periodically to
either a larger drive or backup media, and/or delete them if the
result double-checks OK.
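
The rename-then-write scheme described above can be sketched in POSIX shell.
This is an illustration only, not the actual Mlucas savefile code: the
function name rotate_checkpoint is hypothetical, the checkpoint interval is
passed in as a parameter (the 1010000 example above suggests 10000), and a
one-line text file stands in for the real bytewise savefile:

```sh
# Hypothetical sketch, not Mlucas code.
# $1 = exponent XXX, $2 = current iteration, $3 = checkpoint interval.
rotate_checkpoint() {
    pfile="p$1"
    prev=$(($2 - $3))       # iteration held by the existing pXXX file
    # If the existing savefile sits on a million-iteration milestone,
    # rename it instead of copying it, so no extra file-copy is needed.
    if [ -f "$pfile" ] && [ "$prev" -gt 0 ] &&
       [ $((prev % 1000000)) -eq 0 ]; then
        mv "$pfile" "$pfile.$((prev / 1000000))M"
    fi
    # Create a fresh pXXX file for the new checkpoint data.
    printf 'checkpoint data for iteration %s\n' "$2" > "$pfile"
}

# Example (run in a scratch directory):
#   rotate_checkpoint 12345 1000000 10000   # writes p12345
#   rotate_checkpoint 12345 1010000 10000   # p12345 -> p12345.1M, new p12345
```

Because the milestone file is produced by a rename, the data for iteration 1M
is never rewritten; only the new checkpoint touches fresh disk blocks.
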
23 Jun 2014: Special thanks to Stephen Searle for doing significant amounts of
analysis and debug of the code in this version.
This release features the following major enhancements and changes:
Continuing the multithread optimizations described in the previous
release's entry below, new initial-FFT-pass/final-iFFT-pass radices
128,144,160,176,192,208,224,240,256, as well as some larger
experimental radices 768,960,1008 and 4032. The latter are not
currently useful for LL testing (as the obligatory self-tests which
create the mlucas.cfg file optimized for the user's machine will
reveal, by way of absence of said radices in the best-radix-set data
captured in the .cfg file), but the radices in the 128-256 range
should provide a benefit for most users, especially for FFT lengths
of roughly 2048 Kdoubles and larger.
Fused-multiply-add (FMA) support for Intel Haswell (and beyond).
Since Intel released their FMA support in the same chip generation they
used to deploy the AVX2 instructions, use of FMA is triggered via
-DUSE_AVX2 at compile time. Currently only a limited fraction of the
key code macros use FMA, but this will continue to expand as I get a
better sense of where use of FMA is most likely to yield a benefit.
(This depends sensitively on the details of the particular FFT
implementation, for example whether a pre-twiddles or post-twiddles
complex-multiply scheme is used for the various passes of the inverse
FFT; Mlucas uses the latter, which is nice from a dataflow-symmetry
and auxiliary-data-indexing perspective, but is not favorable for an
FMA-based code speedup.)
A compact-object code scheme for all the carry-step-wrapping DFT
radices >= 32. This yields a significant throughput boost for older
and more bandwidth-limited processors such as Core2 and Sandy/Ivy
Bridge. The speedups are more modest on Haswell, but even there the
user will at least enjoy the slashed compile times for the larger-radix
radix**_ditN_cy_dif1.c sourcefiles in question. Compile times (and likely
run times) for non-SIMD (i.e. scalar-double C code) builds on non-x86
hardware will benefit similarly.
Multiple bugfixes, most related to self-testing and thread-safeness.
The units of the per-iteration timing data written to the mlucas.cfg file
created by running the automated self-tests are changed from seconds
to milliseconds in this version, to provide finer-grained numbers.
02 Oct 2013 (Patched rev1 posted 09 May 2014): This release features the
following major enhancements and changes:
AVX-instruction-set inline assembly support for 64-bit Linux/GCC and MacOS
(both GCC and LLVM/clang). This yields nice speedups over the SSE2-based
SIMD code on Intel chips supporting AVX (Sandy/Ivy Bridge and
Haswell/Broadwell). Owners of AMD CPUs featuring AVX are welcome to try
the code out, but should not get their hopes up too much, as AMD's
implementation of AVX appears to be disappointing in terms of
performance.
Although the 32-bit Windows/MSVC and Linux/GCC inline assembler of the
previous release is still all there, as of this version 32-bit support
for x86 SIMD builds is officially discontinued. Builders using 64-bit
Windows should use a *nix virtualization package such as mingw64.
The previously-available-by-request-only threadpool code is now
included in the release. See build instructions below for details.
Several new carry-step-wrapping "initial FFT pass" DFT radices:
48,56,64, all fully SIMD-capable. These are added to the existing
SIMD-capable radix-16,20,24,28,32,36,40,44,52,60 carry-step-wrapping
DFTs. The reason I added the new radices is related to ongoing
experience with multithreaded performance: In particular, leading
radices greater than 32 or so tend to perform quite poorly
in unthreaded-build mode and for FFT lengths < 2048 Kdoubles
(which guided most of the codebase evolution until quite recently),
but are standouts in multithreaded mode and for large FFT lengths.
Since the parallelization strategy I use for my FFT means that
"maximum number of independent thread-based work chunks" is directly
related to the above leading-radix, the emerging manycore
(GPU and similar) paradigm will be driving adoption of even larger DFT
radices in future releases.
Multithreaded (pthread/threadpool) support extended to the non-SIMD
(i.e. scalar-double) code. This replaces the previous and
only-partially-working threading model based on the OpenMP API, with its
weird (and virtually-impossible-to-debug) performance issues and
opaque interface. For code such as mine, opacity of the threading interface
is not advantageous, especially in terms of basic
development-and-debug work.
04 Feb 2013:
Lots of SSE2-related enhancements, including inline assembler optimized
for 64-bit OSes via use of the full 16-XMM-register set. New
SSE2-supported carry-step-wrapping DFT radices, yielding SSE2-able
radix-16,20,24,28,32,36,40,44,52,60 carry steps.
Multithreaded (pthread/threadpool) SSE2 support! This code was used for
the new-Mersenne-prime verification run described below. The threadpool
code is not included in the default release; please contact the author
if you wish to play with multithreaded builds of the code.
Mlucas SSE2 used to verify the 48th known Mersenne prime. Note that
the author could have done the verification himself in around 11 days
on his humble quad-core Sandy Bridge box, but since for new-prime
verifies such as this wall-clock time is the overriding factor,
it makes sense to run on the fastest hardware available, even if
this is relatively less efficient than running on a fewer-core
workstation. In the present case, Serge Batalov ran the verify in 6
days on a 32-core Xeon cluster kindly made available by Novartis Inc.
Due to poor scaling of the parallel code beyond 4 cores, this
represents significantly more total cycles (and watt-hours) than a
4-core run would need, but we find new Mersenne primes rarely enough
that such cycle-wastage is justified. (And why hog all the fun, I say -
Serge said he hadn't had this much computational fun in years.)
06 Nov 2009: Well, it took a full year longer than I had hoped, but a tarball
of the Mlucas v3.0 beta code described in the entry below is finally available.
This has SSE2 inline assembly support for 32-bit Windows and 32/64-bit Linux,
but no PrimeNet support (yet) ... the latter will come later this year, if
things go reasonably according to plan. A GUI will have to wait for at least
another year. But the code is sufficiently ready for early adopters to run on
their x86 machines (Win32, 32 and 64-bit Linux and MacOS ... code is
most-optimized for the latter) and for builders, profilers and assembler
experts to have a look and send me feedback and suggestions for improvement.
15 Sep 2008: Mlucas 3.0 used to verify 45th and 46th known Mersenne primes.
Note that the verify runs by Tom Duell and Rob Giltrap of Sun Microsystems used
a pre-beta version of Mlucas 3.0, scheduled for official release later this
Fall. Key new features of the upcoming release [besides a radically overhauled
header-file structure and many other code cleanups and bugfixes] include:
SSE2 inline assembly support [at least for FFT lengths which are powers
of 2 or divisible by the small odd primes 3, 5 and 7] - this will
provide a roughly 2x speedup over the previous generic-C build on the
newer x86 platforms [AMD64, Core2 and beyond]. Initial targets will be
32-bit Windows and 32/64-bit Linux, as well as MacOS.
Platform-independent compact bytewise savefile format - you can now
transfer savefiles between any systems having a working 3.0 build,
independently of the Endian-ness and 32-vs-64-bit-ness of the
platform.
Copyright:
Copyright (C) 2015 Ernst W. Mayer.
Permission is granted to copy, distribute and/or modify
this document under the terms of the GNU Free Documentation License,
Version 1.3 or any later version published by
the Free Software Foundation; with no Invariant Sections,
no Front-Cover Texts, and no Back-Cover Texts.
EOF