NEWS

  1. Please check http://hogranch.com/mayer/README.html#news for latest news.
  2. 07 Feb 2016: Patched source file:
  3. Several users have reported a bug in the get_fft_radices.c file,
  4. which affects non-SIMD unthreaded builds (i.e. ones for which the
  5. preprocessor flag USE_ONLY_LARGE_LEAD_RADICES = False based on the logic
  6. at top of the file); my pre-release testing apparently omitted this
  7. build mode. The error message looks like this:
  8. ../src/get_fft_radices.c: In function 'get_fft_radices':
  9. ../src/get_fft_radices.c:1446:3: error: duplicate case value
  10. ../src/get_fft_radices.c:1443:3: error: previously used here
  11. The fix is either to increment both of the two case values by 1 in the
  12. #ifndef USE_ONLY_LARGE_LEAD_RADICES-wrapped logic following the above
  13. line, or to use this patched version of the file
  14. (15KB, md5 checksum = db5d2504d58897229d0366f4749b4131) from my
  15. dev-branch, which fixes the duplicate case error and further adds some
  16. radix sets which should help with non-SIMD build performance.
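
The failure mode described above can be reproduced in miniature. The sketch
below is not the actual get_fft_radices.c logic: the function name
pick_radix_set, the case labels and the return values are all hypothetical.
It only illustrates how labels that exist solely when
USE_ONLY_LARGE_LEAD_RADICES is undefined can collide with a neighboring
label, and how incrementing the wrapped labels by 1 removes the collision.

```c
/* Hypothetical stand-in for a radix-set selector; not Mlucas code.
 * Build with -DUSE_ONLY_LARGE_LEAD_RADICES to drop the extra cases. */
int pick_radix_set(int index)
{
    switch (index) {
    case 0:
        return 8;
    case 1:
        return 16;
#ifndef USE_ONLY_LARGE_LEAD_RADICES
    /* Had these two labels been written as 1 and 2, the first would
     * collide with 'case 1' above and the compiler would report
     * "duplicate case value". Incrementing both by 1 keeps every
     * label unique in this build mode. */
    case 2:
        return 5;
    case 3:
        return 6;
#endif
    default:
        return -1;  /* no such radix set */
    }
}
```

Because the extra labels are only compiled when the flag is undefined, a
collision of this kind stays invisible in any build that defines the flag,
which is consistent with the bug escaping pre-release testing.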
  17. 24 Aug 2015: Beta versions of automated build tools:
  18. Thanks to Alex Vong for these, which he created as part of a proposed
  19. Debian packaging of the Mlucas 14.1 release. The resulting tarballs
  20. also contain numerous bugfixes (mostly minor) which will also be in the
  21. upcoming v15 release.
  22. In addition to pristine tarballs with C source files only (and manual
  23. build procedure documented below in
  24. "Download and Build the Source Code"), we provide experimental tarballs
  25. with autotools as well, which for the current release come in 2
  26. "same package, different compression tools" tarballs:
  27. mlucas-14.1.tar.gz (3.4 MB) and mlucas-14.1.tar.xz (1.5 MB). If your
  28. Linux distro (or personal setup) has support for the Xz compression
  29. package, you'll obviously want to get the second,
  30. much-smaller-compressed tarball. Once you have downloaded the desired
  31. one of the tarballs, optionally, to verify the integrity of the
  32. downloaded file, first import the needed public key via
  33. $ gpg --keyserver pgp.mit.edu --recv-keys 93518580
  34. Then, depending on whether you downloaded the .gz or .xz-compressed
  35. tarball, download mlucas-14.1.tar.gz.sig or mlucas-14.1.tar.xz.sig,
  36. respectively, to obtain the corresponding detached package signature.
  37. (Note that despite the extensions these are both simple ASCII text
  38. files.)
  39. Finally, verify the package using
  40. $ gpg --verify mlucas-14.1.tar.gz.sig mlucas-14.1.tar.gz
  41. or
  42. $ gpg --verify mlucas-14.1.tar.xz.sig mlucas-14.1.tar.xz
  43. A successful signature verification returns
  44. "Good signature from Alex Vong <e-mail address>" --
  45. note that the ensuing
  46. WARNING: This key is not certified with a trusted signature!
  47. There is no indication that the signature belongs to the owner.
  48. is expected.
  49. The README file in the untarred package (a text file, not to be
  50. confused with this HTML page which you are reading) contains simple
  51. instructions on how to compile, run tests and install. If you do not
  52. understand the instructions, please consult the INSTALL file which
  53. contains detailed instructions and explanations. You can also search
  54. for the basic Linux make-package command sequence `./configure && make'
  55. on the Internet.
  56. 11 Dec 2014: v14.1 released. Again thanks to Mike Vang for doing
  57. independent-build and QA work. This release features significant performance
  58. and accuracy improvements for all recent x86 platforms, and especial gains for
  59. Intel Haswell (and soon, Broadwell) users.
  60. First, a note on version numbering: After many years frozen at 3.0x
  61. (since my x86 code was until recently wildly uncompetitive with Prime95),
  62. now that I'm < 2x slower (on Haswell+), I have resumed version numbering according to
  63. the scheme:
  64. Major index = year - 2000
  65. Minor index = release # of that year, zero-indexed.
  66. As before, a patch suffix of x, y, or z following the numeric index indicates
  67. an [alpha,beta,gamma] (experimental,unstable) code. Since I consider this
  68. release stable and it's the 2nd release of the year,
  69. thus 14.1 = [20]14.[--2][no xyz suffix].
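
As a rough illustration of the numbering scheme above, the following sketch
builds a version string from a release year and the 1-based ordinal of the
release within that year. The helper format_version and its parameters are
hypothetical and are not part of Mlucas.

```c
#include <stdio.h>

/* Hypothetical helper, not part of Mlucas. Formats a version string
 * from the release year and the 1-based ordinal of the release within
 * that year. 'suffix' is 'x', 'y' or 'z' for experimental code, or 0
 * for a stable release. Returns the value returned by snprintf. */
int format_version(char *buf, size_t len, int year, int nth_release,
                   char suffix)
{
    int major = year - 2000;      /* major index = year - 2000 */
    int minor = nth_release - 1;  /* minor index is zero-indexed */

    if (suffix)
        return snprintf(buf, len, "%d.%d%c", major, minor, suffix);
    return snprintf(buf, len, "%d.%d", major, minor);
}
```

Calling format_version(buf, sizeof buf, 2014, 2, 0) writes "14.1" into buf,
matching the stable second release of 2014 described above.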
  70. What's new/improved:
  71. 1. Self-test now has prestored residues for 10000 iterations (at least
  72. through FFT length 18432K), in addition to the previously-supported 100
  73. and 1000. As before, to use a non-default #iters (default is 100) for a
  74. given self-test range, add '-iters [1000 | 10000]' to the command line.
  75. 2. One no longer needs any special flags like DFT_V* for any FMA-using
  76. routines - just -DUSE_THREADS to enable multithreading and
  77. -DUSE_[SSE2 | AVX | AVX2] to select an x86 SIMD-vector-instruction
  78. target.
  79. 3. Propagated Fused multiply-add optimizations to all key discrete
  80. Fourier transform (DFT) and related arithmetic macros. "FMA everywhere"
  81. means Haswell users should see at least a 10% speedup for their AVX2
  82. builds, compared to plain AVX.
  83. 4. Overall accuracy should be appreciably better, meaning users should
  84. see very few roundoff warnings, even for 10Kiter self-tests.
  85. 5. The program now only reports roundoff warnings if it encounters a
  86. fractional part > 0.40625 (previously >= 0.4) during the carry step.
  87. Some self-tests (meaning exponents right at the upper limit for the
  88. given FFT length, by definition) were emitting slews of 0.40625
  89. warnings, but as this error level is nearly always benign, I've
  90. silenced the warnings for it. Larger errors will still emit warnings as
  91. before.
  92. 6. The accuracy-problematic radix-11 DFTs (used to build composite
  93. leading radices such as 44 and 176) have improved accuracy in SSE2 and
  94. AVX modes, but will still emit a few roundoff warnings in longer
  95. self-tests for certain radix combos in those build types. In AVX2 mode,
  96. however, the fact that "multiplies are free" (assuming we can fuse them
  97. with adds, which we can to a very large extent in this case) allowed me to
  98. implement an entirely different radix-11 algorithm which is much more
  99. multiply-heavy, but has significantly better roundoff properties. Thus
  100. AVX2 builders will see dramatically lower roundoff errors for FFT
  101. lengths using the aforementioned leading radices 44 and 176.
  102. 7. I added large-stride prefetching to all the carry routines, since
  103. the 2 DFT passes bookending the carry step (specifically the
  104. final-radix pass of the inverse FFT, which precedes the normalize/carry
  105. step, and the initial-radix pass of the subsequent iteration's forward
  106. FFT, which follows it) access data in large strides and are
  107. thus problematic for the kinds of default data prefetching done by most
  108. x86 hardware. That "manual assist" prefetch should provide a nice boost
  109. (5-10% for me at FFT lengths in the Mdoubles range) for all build
  110. modes.
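
The threshold change in item 5 above can be sketched as follows. This is a
minimal illustration and not the Mlucas carry code: the function names are
hypothetical, and the fractional part is taken here to mean the distance
from a floating-point convolution output to its nearest integer, which is
the usual measure of roundoff error in FFT-based multiplication.

```c
/* Hypothetical sketch of the warning rule in item 5; not Mlucas code. */

/* Distance from x to the nearest integer, in [0, 0.5].
 * Assumes |x| is small enough to fit in a long long. */
double frac_error(double x)
{
    double r = x - (double)(long long)x;  /* truncate toward zero */
    if (r < 0.0)
        r = -r;
    if (r > 0.5)
        r = 1.0 - r;                      /* nearer to the next integer */
    return r;
}

/* Old rule: warn at 0.4 and above. */
int roundoff_warning_old(double x)
{
    return frac_error(x) >= 0.4;
}

/* New rule: warn only strictly above 0.40625 (= 13/32, which is
 * exactly representable in binary floating point). */
int roundoff_warning_new(double x)
{
    return frac_error(x) > 0.40625;
}
```

With these definitions an output of 3.40625 triggers the old rule but not
the new one, while 3.4375 still triggers both, which is the behavior item 5
describes: the 0.40625 level is silenced and larger errors still warn.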
  111. Copyright:
  112. Copyright (C) 2015 Ernst W. Mayer.
  113. Permission is granted to copy, distribute and/or modify
  114. this document under the terms of the GNU Free Documentation License,
  115. Version 1.3 or any later version published by
  116. the Free Software Foundation; with no Invariant Sections,
  117. no Front-Cover Texts, and no Back-Cover Texts.