alexvong1995/mlucas: Unofficial mirror of Mlucas using autotools as build system. Please visit the author's homepage http://www.mersenneforum.org/mayer/README.html for more information. @ arm64



Welcome to the Great Internet Mersenne Prime Search! (The not-PC-only version ;)

This ftp site contains Ernst Mayer's C source code for performing Lucas-Lehmer tests and sieve-based trial-factoring (TF) of prime-exponent Mersenne numbers. (Although see notes below about the GPU clients now being preferable for sieving.) In short, everything you need to search for Mersenne primes on your Intel, AMD or non-x86-CPU-based computer!

Mlucas is an open-source program for primality testing of Mersenne numbers in search of a world-record prime.
You may use it to test any suitable number as you wish, but it is preferable that you do so in a coordinated fashion,
as part of the Great Internet Mersenne Prime Search (GIMPS). Note that Mlucas is not (yet)
as efficient as the main GIMPS client, George Woltman's Prime95 program (a.k.a. mprime for the Linux version), but
that program is not truly open-source, and requires the user to abide by the prize-sharing rules set by its author, should a user be lucky enough to find a new prime eligible for one of the monetary prizes offered by the Electronic Frontier Foundation. Prime95 is also only available for platforms based on the x86 processor architecture.

Quick Find Guide:

Recent News: Beta of automated build tools, v14.1 released
Descriptions of previous code releases
General Questions
Download and Build the Source Code
Performance-tune for your machine
Get exponents from PrimeNet
Lucas-Lehmer test
Report results
Tracking your contribution
Just the FAQs, please (Frequently Asked Questions)
GNU Free Documentation License

Recent News:

07 Feb 2016: Patched source file:

Several users have reported a bug in the get_fft_radices.c file, which affects non-SIMD unthreaded builds (i.e. ones for which the preprocessor flag USE_ONLY_LARGE_LEAD_RADICES = False based on the logic at top of the file); my pre-release testing apparently omitted this build mode. The error message looks like this:

../src/get_fft_radices.c: In function 'get_fft_radices':

../src/get_fft_radices.c:1446:3: error: duplicate case value

../src/get_fft_radices.c:1443:3: error: previously used here

The fix is either to increment by 1 both of the case values in the #ifndef USE_ONLY_LARGE_LEAD_RADICES-wrapped logic following the above line, or to use this patched version of the file (15KB, md5 checksum = db5d2504d58897229d0366f4749b4131) from my dev-branch, which fixes the duplicate case error and further adds some radix sets which should help with non-SIMD build performance.

24 Aug 2015: Beta versions of automated build tools:

Thanks to Alex Vong for these, which he created as part of a proposed Debian packaging of the Mlucas 14.1 release. The resulting tarballs also contain numerous bugfixes (mostly minor) which will also be in the upcoming v15 release.

In addition to pristine tarballs with C source files only (and manual build procedure documented below in "Download and Build the Source Code"), we provide experimental tarballs with autotools as well, which for the current release come in 2 "same package, different compression tools" tarballs: mlucas-14.1.tar.gz (3.4 MB) and mlucas-14.1.tar.xz (1.5 MB). If your Linux distro (or personal setup) has support for the Xz compression package, you'll obviously want to get the second, much-smaller-compressed tarball. Once you have downloaded the desired one of the tarballs, optionally, to verify the integrity of the downloaded file, first import the needed public key via

$ gpg --keyserver pgp.mit.edu --recv-keys 93518580

Then, depending on whether you downloaded the .gz or .xz-compressed tarball, download mlucas-14.1.tar.gz.sig or
mlucas-14.1.tar.xz.sig, respectively, to obtain the corresponding detached package signature. (Note that despite the extensions these are both simple ASCII text files.)

Finally, verify the package using

$ gpg --verify mlucas-14.1.tar.gz.sig mlucas-14.1.tar.gz

or

$ gpg --verify mlucas-14.1.tar.xz.sig mlucas-14.1.tar.xz

A successful signature verification returns "Good signature from Alex Vong <e-mail address>" -- note that the ensuing

WARNING: This key is not certified with a trusted signature!

There is no indication that the signature belongs to the owner.

is expected.

The README file in the untarred package (a text file, not to be confused with this HTML page which you are reading)
contains simple instructions on how to compile, run
tests and install. If you do not understand the instructions, please
consult the INSTALL file which contains detailed instructions and
explanations. You can also search for the basic Linux make-package command sequence `./configure && make' on the
Internet.


11 Dec 2014: v14.1 released.
Again thanks to Mike Vang for doing independent-build and QA work. This release features significant performance and accuracy improvements for all recent x86 platforms, and especially gains for Intel Haswell (and soon, Broadwell) users.

First, a note on version numbering: After many years frozen at 3.0x (since my x86 code was until recently wildly uncompetitive with Prime95), now that the code is within a factor of 2 performance-wise (based on head-to-head comparisons on Intel Haswell+), I have resumed version numbering according to the scheme:

Major index = year - 2000

Minor index = release # of that year, zero-indexed.

As before, a patch suffix of x, y, or z following the numeric index indicates an [alpha,beta,gamma] (experimental,unstable) code. I consider this release stable and it is the 2nd non-beta release of the year, thus 14.1 = [20]14.[--2][no xyz suffix].
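The numbering scheme above can be sketched in a few lines of Python. This is only an illustration of the rule as stated; the function name `mlucas_version` and its parameters are hypothetical and are not part of Mlucas.

```python
def mlucas_version(year: int, release_number: int, suffix: str = "") -> str:
    """Compose a version string per the scheme described above.

    year           -- calendar year of the release, e.g. 2014
    release_number -- 1-based count of releases in that year (hypothetical parameter)
    suffix         -- "" for stable code, or "x", "y", "z" for alpha, beta, gamma
    """
    if release_number < 1:
        raise ValueError("release_number is 1-based")
    if suffix not in ("", "x", "y", "z"):
        raise ValueError("suffix must be '', 'x', 'y' or 'z'")
    major = year - 2000          # major index = year - 2000
    minor = release_number - 1   # minor index is zero-indexed
    return f"{major}.{minor}{suffix}"


print(mlucas_version(2014, 2))  # the 2nd stable release of 2014
```

For the release described here, the call above yields 14.1.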

What's new/improved:

Self-test now has prestored residues for 10000 iterations (at least through FFT length 18432K), in addition to the previously-supported 100 and 1000. As before, to use a non-default #iters (default is 100) for a given self-test range, add '-iters [1000 | 10000]' to the command line.
One no longer needs any special flags like DFT_V* for any FMA-using routines - just -DUSE_THREADS to enable multithreading and -DUSE_[SSE2 | AVX | AVX2] to select an x86 SIMD-vector-instruction target.
Propagated Fused multiply-add optimizations to all key discrete Fourier transform (DFT) and related arithmetic macros. "FMA everywhere" means Haswell users should see at least a 10% speedup for their AVX2 builds, compared to plain AVX.
Overall accuracy should be appreciably better, meaning users should see very few roundoff warnings, even for 10Kiter self-tests.
The program now only reports roundoff warnings if it encounters a fractional part > 0.40625 (previous was >= 0.4) during the carry step. Some self-tests (meaning exponents right at the upper limit for the given FFT length, by definition) were emitting slews of 0.40625 warnings, but as this error level is nearly always benign, I've silenced the warnings for it. Larger errors will still emit warnings as before.
The accuracy-problematic radix-11 DFTs (used to build composite leading radices such as 44 and 176) have improved accuracy in SSE2 and AVX modes, but will still emit a few roundoff warnings in longer self-tests for certain radix combos in those build types. In AVX2 mode, however, the fact that "multiplies are free" (assuming we can fuse them with adds, which we can to a very large extent in this case) allowed me to implement an entirely different radix-11 algorithm which is much more multiply-heavy, but has significantly better roundoff properties. Thus AVX2 builders will see dramatically lower roundoff errors for FFT lengths using the aforementioned leading radices 44 and 176.
I added large-stride prefetching to all the carry routines, since the 2 DFTs (specifically the final-radix-pass of the inverse FFT, followed by the normalize/carry step, followed by the initial-radix-pass of the subsequent iteration's forward FFT) bookending the carry step in those access data in large strides and are thus problematic for the kinds of default data prefetching done by most x86 hardware. That "manual assist" prefetch should provide a nice boost (5-10% for me at FFT lengths in the Mdoubles range) for all build modes.
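To make the roundoff-warning rule in the list above concrete, here is a minimal Python sketch. It assumes that the "fractional part" is the distance between a floating-point convolution output and its nearest integer; the function names are hypothetical, and the actual check lives in the C carry routines.

```python
NEW_THRESHOLD = 0.40625  # v14.1 rule: warn only if error > 0.40625
OLD_THRESHOLD = 0.4      # previous rule: warn if error >= 0.4


def roundoff_error(x: float) -> float:
    """Distance from x to the nearest integer, a value in [0, 0.5]."""
    return abs(x - round(x))


def warns_new(x: float) -> bool:
    """Warning decision under the v14.1 rule (strict inequality)."""
    return roundoff_error(x) > NEW_THRESHOLD


def warns_old(x: float) -> bool:
    """Warning decision under the earlier rule (non-strict inequality)."""
    return roundoff_error(x) >= OLD_THRESHOLD


# An output whose error is exactly 0.40625 warned before but is silent now.
sample = 3.40625
print(roundoff_error(sample), warns_old(sample), warns_new(sample))
```

The sample value is exactly representable in binary (0.40625 = 13/32), so the comparison is not disturbed by floating-point representation error.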

Descriptions of recent previous code releases:

General Questions:

For discussions specifically about Mlucas at the above venue, see the Mlucas subforum there.

STEP 1 - DOWNLOAD AND BUILD THE CODE

To do Lucas-Lehmer tests, you'll need to build the latest Mlucas C source code release. First get the release tarball, available in 2 different-zip-based forms:

If your system has bzip2 installed, get Mlucas_12.11.2014.tbz2 -- 2.2 MB, md5 checksum = 8fffcf4222730cb831752c152974740b. (If the file extension looks unfamiliar, note that some people prefer a 'tar.bz2' extension, as would result from a 2-step 'first tar, then bzip2-compress' procedure). Then use 'tar xjf Mlucas_12.11.2014.tbz2' to one-step uncompress/unpack the archive.
Otherwise, get Mlucas_12.11.2014.tgz -- 3.2 MB, md5 checksum = d2cf3d80de68d9bc6361af7ce1942d8b. (If the file extension looks unfamiliar, note that some people prefer a 'tar.gz' extension, as would result from a 2-step 'first tar, then gzip-compress' procedure). Then use 'tar xzf Mlucas_12.11.2014.tgz' to one-step uncompress/unpack the archive.
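If you prefer to check the md5 checksums quoted above programmatically, something like the following Python sketch should do. It uses only the standard `hashlib` module; the helper names `md5_of_file` and `matches_checksum` are made up for this example.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected_hex: str) -> bool:
    """Compare a file's md5 against an expected hex string, ignoring case and padding."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

For the bzip2 tarball one would call `matches_checksum("Mlucas_12.11.2014.tbz2", "8fffcf4222730cb831752c152974740b")`, assuming the file sits in the current directory. Note that md5 only guards against download corruption, not against deliberate tampering.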

The build procedure is so simple, I use no makefile - let's illustrate using a multithreaded x86/SSE2 build under 64-bit linux. Note that there are some inline-assembler macros which use most of the 14 available general-purpose registers and as a consequence result in ran-out-of-registers errors in unthreaded mode, thus our default build mode is multithreaded:

gcc -c -Os -m64 -DUSE_SSE2 -DUSE_THREADS *.c

rm -f rng*.o util.o qfloat.o

gcc -c -O1 -m64 -DUSE_SSE2 -DUSE_THREADS rng*.c util.c qfloat.c

gcc -o Mlucas *.o -lm -lpthread -lrt

We use -Os (targeting a small object code size) because that yields best performance for SSE2 builds on pre-AVX architectures. For AVX builds on Sandy/Ivy Bridge and AVX2 builds on Haswell/Broadwell you may get a smidge better performance using -O3 or -Ofast instead of -Os, but I have not seen a consistent benefit from doing so, and more importantly (depending on your precise compiler/runtime setup) these don't always play nice with the -mavx/-mavx2 flags (detailed below), which generally will yield a 3-5% speedup on the Sandy/Ivy Bridge and Haswell/Broadwell CPU families, respectively.

The 'rm' command line and the 'gcc -O1' following it perform a rebuild of a trio of accuracy/optimization-sensitive files at a lower opt-level. None of these files is critical for performance, so it is recommended to always do this step even though many builds of all files with the higher opt level work just fine.

You should expect to see some compiler warnings, mostly of the "type-punned pointer", "signed/unsigned int", "unused variable" and "variable set but not used" (the latter with GCC 4.7 and later) varieties. The first of these is related to the quad-float emulation code used for high-precision double-float-const-initializations. (I try to keep on top of the latter kinds but with multiple build modes which use various partially-overlapping subsets of variables, were I to set a goal of no such warnings, it would be a nearly full-time job and leave little time for actual new-code development). Others are mainly of the following kind:

[various]: warning: cast from pointer to integer of different size

twopmodq80.c: In function `twopmodq78_3WORD_DOUBLE':

twopmodq80.c:1032: warning: right shift count >= width of type

twopmodq80.c:1032: warning: left shift count is negative

These are similarly benign - the cast warnings are due to some array-alignment code which only needs the bottom few bits of a pointer, and the shift-count warnings are a result of compiler speculation-type optimizations.

The various other (including non-x86) build modes are all slight variants of the above example procedure:

For x86/AVX (Sandy/Ivy Bridge): replace -DUSE_SSE2 with -DUSE_AVX in the above;
For x86/AVX2+FMA3 (Intel Haswell and beyond): replace -DUSE_SSE2 with -DUSE_AVX2 in the above; (sorry, no AMD FMA4 support - but AMD users can look forward to an AVX2+FMA3 - supporting chip release sometime in 2015.)
For non-x86-SIMD (i.e. scalar-double) builds, remove -DUSE_SSE2 in the above;
If using Clang under MacOS, replace 'gcc' with 'clang' in the above, and append '-Xlinker --no-demangle' to the link step. (Note Clang does not need explicit library-linkage of the math, pthread and realtime libraries, i.e. does not need the user to invoke -lm -lpthread -lrt at link time, although doing so is not a problem).
For builds on Sandy/Ivy Bridge, you should see a small benefit from replacing '-Os' with '-Os -mavx' (GCC) or '-Os -march=core-avx' (Clang). For builds on Haswell, you should see a small benefit from replacing '-Os' with '-Os -mavx2' (GCC, but only supported in v4.7 and later; users of GCC 4.6 or earlier should use '-Os -mavx' on Haswell) or '-Os -march=core-avx2' (Clang);
For 32-bit SIMD builds (SSE2 only, no AVX) replace -m64 with -m32 in the above. NOTE that if you are using Clang in 32-bit-mode under OS X, you will likely need to use GCC to compile a small subset of files (most commonly radix*square*c in my builds on a 2008-vintage CoreDuo-based Macbook using Clang 2.0 under OS X 10.6.8) in order to work around Clang 32-bit miscompilation of those files, which leads to "Bus Error" crashes and segfaults at runtime. If you still get crashes, I suggest using GCC to build all files, even though the build time will be 3-5x larger.
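Since the build modes listed above differ only in a couple of flags, the four-command GCC procedure can be generated mechanically. The Python sketch below just assembles the command strings for a chosen mode; `build_commands` and its parameters are hypothetical, it covers the GCC case only, and it omits the optional -mavx/-mavx2 tuning flags.

```python
SIMD_DEFINES = {
    "sse2": "-DUSE_SSE2",
    "avx": "-DUSE_AVX",
    "avx2": "-DUSE_AVX2",
    None: "",  # scalar-double build: no SIMD define at all
}


def build_commands(simd="sse2", bits=64):
    """Return the four shell commands of the manual GCC build as strings."""
    if simd not in SIMD_DEFINES:
        raise ValueError(f"unknown SIMD mode: {simd!r}")
    if bits not in (32, 64):
        raise ValueError("bits must be 32 or 64")
    if bits == 32 and simd in ("avx", "avx2"):
        # The text states that 32-bit SIMD builds are SSE2 only.
        raise ValueError("32-bit SIMD builds support SSE2 only")
    defines = [d for d in (SIMD_DEFINES[simd], "-DUSE_THREADS") if d]
    common = " ".join([f"-m{bits}"] + defines)
    return [
        f"gcc -c -Os {common} *.c",
        "rm -f rng*.o util.o qfloat.o",
        # Rebuild the accuracy-sensitive trio at a lower optimization level.
        f"gcc -c -O1 {common} rng*.c util.c qfloat.c",
        "gcc -o Mlucas *.o -lm -lpthread -lrt",
    ]


for command in build_commands("avx2"):
    print(command)
```

Printing the commands rather than running them keeps the sketch side-effect free; paste the output into a shell in the source directory once it looks right for your platform.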

Once you have successfully linked a binary, I suggest you first try a spot-check at some smallish FFT length, say

time ./Mlucas -fftlen 192 -iters 100 -radset 0 -nthread 2

This particular testcase should produce the following 100-iteration residues, with some platform-dependent variability in the roundoff errors:


100 iterations of M3888517 with FFT length 196608 = 192 K
Res64: 579D593FCE0707B2. AvgMaxErr = 0.260239955. MaxErr = 0.343750000. Program: E14.1
Res mod 2^36     =          67881076658
Res mod 2^35 - 1 =          21674900403
Res mod 2^36 - 1 =          42893438228

[If the residues differ from these internally-pretabulated 100-iteration ones, the code will emit a visually-loud error message.]
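As a quick sanity check on the sample output above, the following Python sketch compares the hexadecimal Res64 value with the decimal "Res mod 2^36" line. It assumes that Res64 holds the low 64 bits of the residue, so that its low 36 bits should equal the residue modulo 2^36; the helper name `low_bits` is hypothetical.

```python
def low_bits(res64_hex: str, nbits: int) -> int:
    """Return the low nbits bits of a hexadecimal Res64 string as an integer."""
    return int(res64_hex, 16) & ((1 << nbits) - 1)


RES64 = "579D593FCE0707B2"    # from the sample output above
RES_MOD_2_36 = 67881076658    # "Res mod 2^36" from the same output

# The two lines agree if the low 36 bits of Res64 match the decimal value.
print(low_bits(RES64, 36) == RES_MOD_2_36)

# The FFT length is quoted both in doubles and in K (units of 1024 doubles).
print(192 * 1024 == 196608)
```

The other two residues (mod 2^35 - 1 and mod 2^36 - 1) depend on the full residue, not just its low 64 bits, so they cannot be cross-checked this way.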

Once you have a working binary you can play with #threads ... it is most efficient to gauge the resulting effect on throughput by running one or more of the self-test FFT-length ranges (e.g. small, medium, large) with a user-set thread count, as detailed below.

STEP 2 - PERFORMANCE-TUNE FOR YOUR MACHINE

For a complete list of Mlucas command line options, type 'Mlucas -h'.

After building the source code, the first thing that should be done is a set of self-tests to make sure the binary works properly on your system. During these self-tests, the code also collects various timing data which allow it to configure itself for optimal performance on your hardware. It does this by saving data about the optimal FFT radix combination at each FFT length tried in the self-test to a configuration file, named mlucas.cfg. Once this file has been generated, it will be read whenever the program is invoked to get the optimal-FFT data (specifically, the optimal set of radices into which to subdivide each FFT length) for the exponent currently being tested.

To perform the needed self-tests for a typical-user setup (which implies that you'll be either doing double-checking or first-time LL testing), first remove or rename any existing mlucas.cfg file from a previous code build/release in the run directory, then type

Mlucas -s m

This tells the program to perform a series of self-tests for FFT lengths in the 'medium' range, which currently (as of Fall 2014) means FFT lengths from 1024K-7680K, which covers Mersenne numbers with exponents from 20M - 143M. Note that the code will automatically do the needed self-tests at a given FFT length if it fails to find an mlucas.cfg file, or fails to find an entry for the FFT length in question in that file (e.g. if you get a double-check assignment for an exponent < 20M), so in fact this step is optional. I still recommend running the above set of self-tests under unloaded or constant-load conditions before starting work on any real assignments, so as to get the most-reliable optimal-FFT data for your machine, and to be able to identify and work around any anomalous timing data. (See example below for illustration of that).

Note: the default in automated self-test mode is to use as many threads as there are detected cores ... to use some number different from that you must add the -nthread flag to the command line. This "user custom" mode also requires you to specify the desired number of self-test iterations; you should use one of the 3 standard values, '-iters 100', '-iters 1000' or '-iters 10000' for which the code stores pretabulated results which it uses to validate (or reject) self-test results. 100 is nice for early-stage testing since the self-test will complete in roughly 1/10th the time, but 1000 is better once you have a good idea of optimal thread count on your system, because it yields a more-precise timing and is better at catching radix sets which may yield an unsafely high level of roundoff error for exponents near the upper limit of what the code allows for a given FFT length.

Thus, to run the small, medium and large self-tests 2-threaded and with 1000 iterations per individual subtest:

./Mlucas -s s -iters 1000 -nthread 2

./Mlucas -s m -iters 1000 -nthread 2

./Mlucas -s l -iters 1000 -nthread 2

To follow that with a 4-threaded set of tests for purposes of timing comparison, first move the 2-threaded mlucas.cfg file under a different name, e.g.

mv mlucas.cfg mlucas.cfg.2thr

./Mlucas -s s -iters 1000 -nthread 4

./Mlucas -s m -iters 1000 -nthread 4

./Mlucas -s l -iters 1000 -nthread 4

mv mlucas.cfg mlucas.cfg.4thr

Once you have determined the best thread count for your system, if this differs from the number of hardware cores, to get the program to use said #nthreads you must create a file 'nthreads.ini' in the same dir as you are running in, enter the desired #threads, and save the file. Some users have reported up to 10% performance gains from using more threads than #cores on their systems (I have seen no such behavior on my Haswell or Macbook, so it is not clear yet what is responsible for such behavior), so this file-read-based thread count setting option is mostly for them.

Lastly, if you have run multiple-thread-count tests as above, prior to starting production LL-testing, link to the desired thread-specific .cfg file. For instance if you have 6 cores but find 4-threaded to yield the best timings using the above namings, 'ln -s mlucas.cfg.4thr mlucas.cfg'.

Additional Notes:

If you want to do the self-tests of the various available radix sets for one particular FFT length, enter

Mlucas -s {FFT length in K} -iters [100 | 1000 | 10000]

For instance, to test all FFT-radix combos supported for FFT length 704K for 10000 iterations each, enter

Mlucas -s 704 -iters 10000

The above single-FFT-length self-test feature is particularly handy if the binary you are using throws errors for one or more particular FFT lengths, which interrupt the complete self-test before it has a chance to complete the configuration file. In that case, after notifying me (please!) the user must skip the offending FFT length and go on to the next-higher one, and in this fashion build a .cfg file one FFT length at a time. (Note that each such test appends to any existing mlucas.cfg file, so make sure to comment out or delete any older entries for a given FFT length after running any new timing tests, if you plan to do any actual "production" LL testing.)

For SIMD-enabled (SSE2 and AVX) Linux/GCC builds running on x86 platforms, the GCC compiler optimizer sometimes messes up the non-SIMD-enabled carry-step FFT radices, so during self-testing of a GCC build you may see occasional error messages of this kind:

M36987271 Roundoff warning on iteration 1, maxerr = 14.000000000000
FATAL ERROR...Halting test of exponent 36987271

As long as the SIMD-enabled radix combinations work - i.e. you get a valid mlucas.cfg file with no "gaps", such errors are ignorable.

[Dec 2011 - Update: For GCC builds I have replaced the old "add and subtract rounding constant" trick for effecting fast double-precision nearest-int with a call to the compiler intrinsic lrint macro. This - even when building in scalar (non-SSE2) mode - inlines a fast SSE2-based DNINT, which is roughly as fast as the above add/subtract trick, and most importantly, is not subject to being "optimized away" by the compiler. Thus, one should no longer see the above kinds of errors in GCC builds; the only remaining caveat related to the fused DFT-pass/normalize-and-propagate-carries routines in question is that the ones which were in the past affected by the above problem are ones where the code has not been SSE2-optimized (based on an expected cost/benefit analysis), so will be relatively slow. That is not an issue, because that simply means those radices will never make it into the "golden" cfg-file sets, thus they will never be used for actual primality tests.]

If you are running multiple copies of Mlucas, a copy of the mlucas.cfg file should be placed into each working directory (i.e. wherever you have a worktodo.ini file). Note that the program can run without this file, but with a proper configuration file (in particular one which was run under unloaded or constant-load conditions) it will run optimally at each runlength.

Format of the mlucas.cfg file:

What is contained in the configuration file? Well, let's let one speak for itself. The following mlucas.cfg file was generated on a 2.8 GHz AMD Opteron running RedHat 64-bit linux. I've italicized and colorized the comments to set them off from the actual optimal-FFT-radix data:


	#
	# mlucas.cfg file
	# Insert comments as desired in lines beginning with a # or // symbol, as long as such commenting occurs below line 1, which is reserved.
	#
	# First non-comment line contains program version used to generate this mlucas.cfg file;
	14.1
	#
	# Remaining non-comment lines contain data about the optimal FFT parameters at each runlength on the host platform.
	# Each line below contains an FFT length in units of Kdoubles (i.e. the number of 8-byte floats used to store the
	# LL test residues for the exponent being tested), the best timing achieved at that FFT length on the host platform
	# and the range of per-iteration worst-case roundoff errors encountered (these should not exceed 0.35 or so), and the
	# optimal set of complex-FFT radices (whose product divided by 512 equals the FFT length in Kdoubles) yielding that timing.
	#
	2048  sec/iter =    0.134  ROE[min,max] = [0.281250000, 0.343750000]  radices =  32 32 32 32  0  0  0  0  0  0  [Any text offset from the list-ending 0 by whitespace is ignored]
	2304  sec/iter =    0.148  ROE[min,max] = [0.242187500, 0.281250000]  radices =  36  8 16 16 16  0  0  0  0  0
	2560  sec/iter =    0.166  ROE[min,max] = [0.281250000, 0.312500000]  radices =  40  8 16 16 16  0  0  0  0  0
	2816  sec/iter =    0.188  ROE[min,max] = [0.328125000, 0.343750000]  radices =  44  8 16 16 16  0  0  0  0  0
	3072  sec/iter =    0.222  ROE[min,max] = [0.250000000, 0.250000000]  radices =  24 16 16 16 16  0  0  0  0  0
	3584  sec/iter =    0.264  ROE[min,max] = [0.281250000, 0.281250000]  radices =  28 16 16 16 16  0  0  0  0  0
	4096  sec/iter =    0.300  ROE[min,max] = [0.250000000, 0.312500000]  radices =  16 16 16 16 32  0  0  0  0  0

Note that as of Jun 2014 the per-iteration timing data written to the mlucas.cfg file have been changed from seconds to milliseconds, but that change in scaling is immaterial with respect to the notes below.

You are free to modify or append data to the right of the # signs in the .cfg file and to add or delete comment lines beginning with a # as desired. For instance, one useful thing is to add information about the specific build and platform at the top of the file. Any text to the right of the 0-terminated radices list for each FFT length is similarly ignored, whether it is preceded by a # or // or not. (But there must be a whitespace separator between the list-ending 0 and any following text).
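To make the line format concrete, here is a minimal Python sketch of a reader for the data lines shown above. It is not the parser Mlucas itself uses; the function names and dictionary keys are hypothetical, and the pattern only matches the sec/iter layout of the example file (it does not attempt to handle the later milliseconds format). It also checks the rule stated in the file comments, namely that the product of the radices divided by 512 equals the FFT length in Kdoubles.

```python
import math
import re

# Pattern for one data line in the format shown in the example file above.
# (Assumption: the older "sec/iter" layout only.)
_CFG_LINE = re.compile(
    r"(\d+)\s+sec/iter\s*=\s*([\d.]+)\s+"
    r"ROE\[min,max\]\s*=\s*\[\s*([\d.]+)\s*,\s*([\d.]+)\s*\]\s+"
    r"radices\s*=\s*(.*)"
)


def parse_cfg_line(line):
    """Return a dict for a data line, or None for comments and other lines."""
    line = line.strip()
    if not line or line.startswith("#") or line.startswith("//"):
        return None
    match = _CFG_LINE.match(line)
    if match is None:
        return None  # e.g. the program-version line
    radices = []
    for token in match.group(5).split():
        if token == "0":  # the list-ending 0; anything after it is ignored
            break
        radices.append(int(token))
    return {
        "fft_kdoubles": int(match.group(1)),
        "sec_per_iter": float(match.group(2)),
        "roe": (float(match.group(3)), float(match.group(4))),
        "radices": radices,
    }


def radices_consistent(record):
    """Check: product of complex-FFT radices / 512 == FFT length in Kdoubles."""
    return math.prod(record["radices"]) == 512 * record["fft_kdoubles"]
```

For the first data line of the example file, the radices 32 32 32 32 multiply to 1048576, and 1048576/512 = 2048, matching the listed FFT length.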

One important thing to look for in a .cfg file generated on your local system is non-monotone timing entries in the sec/iter (seconds per iteration at the particular FFT length) data. For instance, consider the following snippet from an example mlucas.cfg file (to which I've added some boldface highlighting):


	1536  sec/iter =    0.225
	1664  sec/iter =    0.244
	1792  sec/iter =    0.253
	1920  sec/iter =    0.299
	2048  sec/iter =    0.284

We see that the per-iteration time for runlength 1920K is actually greater than that for the next-larger vector length that follows it. If you encounter such occurrences in the mlucas.cfg file generated by the self-test run on your system, don't worry about it -- when parsing the cfg file the program always "looks one FFT length beyond" the default one for the exponent in question. If the timing for the next-larger-available runlength is less than that for the default FFT length, the program will use the larger runlength. The only genuinely problematic case with this scheme is if both the default and next-larger FFT lengths are slower than an even larger runlength further down in the file, but this scenario is exceedingly rare. (If you do encounter it, please notify the author and in the meantime just let the run proceed).
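The "look one FFT length beyond" rule described above can be sketched in a few lines of Python. This is only an illustration of the behavior as described in this README, not the actual Mlucas logic; the function name and the list-of-pairs input are hypothetical.

```python
def choose_fft_length(timings, default_k):
    """Pick the FFT length to use for an exponent whose default length is default_k.

    timings: list of (fft_length_in_K, sec_per_iter) pairs, sorted by FFT length,
             as read from the cfg file.
    Only the default length and the next-larger available one are compared.
    """
    lengths = [k for k, _ in timings]
    i = lengths.index(default_k)
    if i + 1 < len(timings) and timings[i + 1][1] < timings[i][1]:
        return timings[i + 1][0]  # next-larger runlength is faster: use it
    return default_k


# Timings from the snippet above.
example = [(1536, 0.225), (1664, 0.244), (1792, 0.253), (1920, 0.299), (2048, 0.284)]
```

With these data, an exponent whose default length is 1920K would be run at 2048K, while one whose default is 1792K stays at 1792K.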

Aside: This type of thing most often occurs for FFT lengths with non-power-of-2 leading radices (which are algorithmically less efficient than power-of-2 radices) just slightly less than a power-of-2 FFT length (e.g. 2048K = 2²¹), and for FFT lengths involving a radix which is an odd prime greater than 7. It can also happen if for some reason the compiler does a relatively poorer job of optimization on a particular FFT radix, or if some FFT radix combinations happen to give better or worse memory-access and cache behavior on the system in question. Such nonmonotonicities have gotten more rare with each recent Mlucas release, and especially so at larger (say, > 1024K) FFT lengths, but they do still crop up now and again.

STEP 3 - RESERVE EXPONENTS FROM PRIMENET

Assuming your self-tests ran successfully, reserve a range of exponents from the GIMPS PrimeNet server. Here's the procedure (for less-experienced users, I suggest toggling between the PrimeNet links and my explanatory comments):

A) CREATE AN ACCOUNT. If you do not already have a PrimeNet account, you must create one. PrimeNet will not check out test exponent assignments to you or accept results without an account.
B) CHECK OUT EXPONENTS. After logging in to your PrimeNet account, go to the IPS Manual Test Assignments Check Out page and enter the number of cores (e.g. a Core2Duo machine is 2 cores) and the desired assignment type. (Mlucas users should select "World record tests", "Smallest available first-time tests" or "Double-check tests").

Each PrimeNet work assignment output line is in the form

{assignment type}={Unique assignment ID},{Mersenne exponent},{known to have no factors with base-2 logarithm less than},{p-1 factoring has/has-not been tried}

A pair of typical assignments returned by the server follows:

Assignment	Explanation
Test=DDD21F2A0B252E499A9F9020E02FE232,48295213,69,0	M48295213 has not been previously LL-tested (otherwise the assignment would begin with "DoubleCheck=" instead of "Test="). The long hexadecimal string is a unique assignment ID generated by the PrimeNet v5 server as an anti-poaching measure. The ",69" indicates that M48295213 has been trial-factored to depth 2⁶⁹ with no factors found. The 0 following the 69 indicates that p-1 factoring has not yet been done, but Mlucas currently does not support p-1 factoring, so perform a first-time LL test of M48295213.
DoubleCheck=B83D23BF447184F586470457AD1E03AF,22831811,66,1	M22831811 has already had a first-time LL test performed, been trial-factored to a depth of 2⁶⁶, and has had p-1 factoring attempted with no small factors found, so perform a second LL test of M22831811 in order to validate the result of the initial test. (Or refute it - in case of mismatching residues for the first-time test and the double-check, a triple-check assignment would be generated by the server, whose format would however still read "DoubleCheck".)

Copy the Test=... or DoubleCheck=... lines returned by the server into the worktodo.ini file, which must be in the same directory as the Mlucas executable (or a symbolic link to it) and the mlucas.cfg file. If this file does not yet exist, create it. If this file already has some existing entries, append any new ones below them.

Note that Mlucas makes no distinction between first-time LL tests and double-checks - this distinction is only important to the PrimeNet server.

Most exponents handed out by the PrimeNet server have already been trial-factored to the recommended depth (i.e. will be of the 'Test' or 'DoubleCheck' assignment type), so in most cases, no additional factoring effort is necessary. If you have exponents that require additional trial factoring, you'll want to either return those assignments or, if you have a fast GPU installed on your system, download the appropriate GPU client from the GPU72 project to do the trial factoring, as those platforms are now much more efficient for such work than using Prime95's TF option on a PC. Mlucas does have trial factoring capability, but that functionality requires significant added work to make it suitable for general-public use. I plan to address that either in v15 or v16, depending on how that part of the code shapes up.

If you wish to test some non-server-assigned prime exponent, you can simply enter the raw exponent on a line by itself in the worktodo.ini file.
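As an illustration of the two kinds of worktodo.ini entries described above (a server assignment line, or a raw exponent on a line by itself), here is a small Python sketch that splits a line into its fields. It is not how Mlucas reads the file; the function name and dictionary keys are hypothetical, chosen to mirror the field descriptions given above.

```python
def parse_worktodo_line(line):
    """Split one worktodo.ini entry into its fields.

    Accepts either a server assignment
        {type}={assignment ID},{exponent},{TF depth in bits},{p-1 done flag}
    or a raw exponent on a line by itself.
    """
    line = line.strip()
    if line.isdigit():
        # Raw, non-server-assigned exponent: no assignment metadata available.
        return {"type": None, "assignment_id": None, "exponent": int(line),
                "tf_bits": None, "pm1_done": None}
    kind, _, rest = line.partition("=")
    assignment_id, exponent, tf_bits, pm1_flag = rest.split(",")
    return {
        "type": kind,                    # "Test" or "DoubleCheck"
        "assignment_id": assignment_id,  # 32-hex-digit ID from the server
        "exponent": int(exponent),
        "tf_bits": int(tf_bits),         # no factors below 2**tf_bits
        "pm1_done": pm1_flag == "1",
    }
```

For example, the first assignment in the table above yields exponent 48295213, tf_bits 69 and pm1_done False.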

STEP 4 - LUCAS-LEHMER TESTING

On a Unix or Linux system, cd to the directory containing the Mlucas executable (or a link to it), the worktodo.ini and the mlucas.cfg file and type

nice ./Mlucas &

The program will run silently in background, leaving you free to do other things or to log out. Every 10000 iterations (or every 100000 if more than 4 threads are used), the program writes a timing to the "p{exponent}.stat" file (which is automatically created for each exponent), and writes the current residue and all other data it needs to pick up at this point (in case of a crash or powerdown) to a pair of restart files, named "p{exponent}" and "q{exponent}". (The second is a backup, in the rare event the first is corrupt.) When the exponent finishes, the program writes the least significant 64 bits of the final residue (in hexadecimal form, just like Prime95) to the .stat and results.txt (master output) file. Any round-off or FFT convolution error warnings are written as they are detected both to the status and to the output file, thus preserving a record of them when the Lucas-Lehmer test of the current exponent is completed.

Dec 2014: The program also saves a persistent p-savefile every 10M iterations, with extensions .10M, .20M, ..., reflecting which iteration the file contains restart data for. This allows for a partial rerun - even parallel reruns of 10M-iteration subintervals, if desired - in case the final result proves suspect.

ADDING NEW EXPONENTS TO THE WORKTODO.INI FILE:
You may add or modify ALL BUT THE FIRST EXPONENT (i.e. the current one) in the worktodo.ini file while the program is running. When the current exponent finishes, the program opens the file, deletes the first entry and, if there is another exponent on what was line 2 (and now is line 1), starts work on that one.

STEP 5 - SEND YOUR RESULTS TO PRIMENET

To report results (either after finishing a range, or as they come in), log in to your PrimeNet account and then proceed to the Manual Test Results Check In page. Paste the results you wish to report, that is, one or more lines of the results.txt file (any results which were added to that file since your last check-in) into the large window immediately below.

If for some reason you need more time than the 180-day default to complete a particular assignment, go to the Manual Test Time Extension page and enter the assignment there.

TRACKING YOUR CONTRIBUTION

You can track your overall progress (for both automated and manual testing work) at the PrimeNet server's producer page. Note that this does not include pre-v5-server manual test results. (That includes most of my GIMPS work, in case you were feeling personally slighted ;).

ALGORITHMIC Q & A

1) What type of an algorithm does Mlucas use?
It uses the well-known Lucas-Lehmer test for Mersenne numbers, which involves selecting an initial seed number (typically 4, but other values, such as 10, also work), then repeatedly squaring and subtracting 2, with the result of each squaring being reduced modulo M(p), the number under test. This square/subtract-2 step is done exactly p-2 times. If the result (modulo M(p)) is zero, M(p) is prime. For an excellent and much more in-depth discussion of the Lucas-Lehmer test and many other prime-related topics, please visit Chris Caldwell's web page on Mersenne numbers.
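For readers who like to see the recurrence spelled out, here is a minimal Python sketch of the test as just described, using Python's built-in big integers. It is meant only to illustrate the algorithm for small odd prime exponents and bears no resemblance to the FFT-based arithmetic Mlucas actually uses; the function name is hypothetical.

```python
def lucas_lehmer(p):
    """Return True if M(p) = 2**p - 1 is prime, for an odd prime exponent p."""
    m = (1 << p) - 1          # the Mersenne number under test
    s = 4                     # the usual initial seed
    for _ in range(p - 2):    # exactly p-2 square/subtract-2 steps
        s = (s * s - 2) % m   # reduce modulo M(p) after each squaring
    return s == 0


# Odd primes up to 31 whose Mersenne numbers pass the test.
mersenne_exponents = [p for p in (3, 5, 7, 11, 13, 17, 19, 23, 29, 31) if lucas_lehmer(p)]
```

Here the exponents 11, 23 and 29 fail the test (for instance M(11) = 2047 = 23 x 89), while 3, 5, 7, 13, 17, 19 and 31 pass.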
2) Where does the code spend most of its time?
Since the numbers being tested (and hence the intermediate residues in the LL test) are so large that they typically must be distributed across millions of computer words, by far the most time-consuming part of the LL test is the modular squaring.
3) How does the code accomplish the squaring?
For a large integer occupying N computer words, a simple digit-by-digit ("grammar school") multiply operation (which needs on the order of N² machine operations) is much too slow to be practical. Rather, the code uses a multiply algorithm based on discrete convolution. (Math-geek joke: A discrete convolution is a convolution which doesn't kiss and tell.) The discrete convolution algorithm is best-known from the field of signal processing, but also proves to have a surprising and very nifty application to the task of multi-precision multiplication. Long story short, recasting the bignum multiply as a discrete convolution allows one to use highly efficient discrete-convolution-effecting algorithms, the best-known of which is the fast Fourier transform (FFT), which is described in many numerical analysis texts, such as the well-known Numerical Recipes (NR) books. (NR even has a set of so-called multiprecision integer routines, but I suggest staying away from them - they're awful.) For a back-of-the-envelope-style worked example illustrating the procedure, see here.
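The idea of multiplying via convolution can be shown with a toy Python example. This is a textbook recursive radix-2 complex FFT, chosen purely for brevity; it is not the Mlucas FFT, and the function names and the choice of 8-bit digits are assumptions made for the illustration. The inputs are split into small digits, the digit vectors are transformed, multiplied pointwise and transformed back, and the resulting convolution outputs are rounded to integers and carried.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of 2 (unnormalized)."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def to_digits(x, bits):
    """Little-endian digits of x in base 2**bits."""
    mask = (1 << bits) - 1
    digits = []
    while x:
        digits.append(x & mask)
        x >>= bits
    return digits or [0]


def fft_multiply(x, y, bits=8):
    """Multiply non-negative integers via an FFT-based digit convolution."""
    dx, dy = to_digits(x, bits), to_digits(y, bits)
    n = 1
    while n < len(dx) + len(dy):   # zero-pad so the cyclic convolution never wraps
        n *= 2
    fx = fft([complex(d) for d in dx] + [0j] * (n - len(dx)))
    fy = fft([complex(d) for d in dy] + [0j] * (n - len(dy)))
    conv = fft([a * b for a, b in zip(fx, fy)], inverse=True)
    result, carry = 0, 0
    for k, c in enumerate(conv):
        carry += round(c.real / n)  # round to nearest integer, then carry
        result |= (carry & ((1 << bits) - 1)) << (k * bits)
        carry >>= bits
    return result + (carry << (n * bits))
```

The rounding step is where floating-point roundoff error enters: if the digits are made too large for the vector length, the convolution outputs are no longer close enough to integers to round correctly.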

The code also uses the now-well-known Discrete Weighted Transform technique of Crandall and Fagin to implicitly do the modding. This permits one to "fill" the digits of the input vector to the FFT-based squaring, and thus to reduce the vector length by a factor of 2 or more relative to any pre-1994 codes. For a detailed reference, see the original 1994 Crandall/Fagin DWT paper, kindly posted online by Barry Fagin. For a plausible discrete-convolution roundoff error heuristic built atop this, see this research paper - in fact Mlucas uses said heuristic to automate the selection of FFT length based on the Mersenne exponent, in the given_N_get_maxP() function in get_fft_radices.c.
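The weighted-transform idea can be illustrated on a tiny scale. The Python sketch below follows the variable-base digit layout and weights of the Crandall/Fagin scheme as I would summarize them, but replaces the FFT with a direct O(N²) cyclic convolution so that only the weighting is on display; the function names are hypothetical, and the final carry step is delegated to Python's big integers rather than done digit by digit as a real program would.

```python
def dwt_layout(p, n):
    """Digit layout for n-digit residues mod 2**p - 1.

    offs[j] = ceil(p*j/n) is the bit position where digit j starts, so digits
    have either floor(p/n) or ceil(p/n) bits and together hold exactly p bits.
    """
    offs = [-((-p * j) // n) for j in range(n + 1)]
    bits = [offs[j + 1] - offs[j] for j in range(n)]
    weights = [2.0 ** (offs[j] - p * j / n) for j in range(n)]  # each in [1, 2)
    return offs, bits, weights


def dwt_square_mod_mersenne(x, p, n):
    """Return x*x mod (2**p - 1) via a length-n weighted cyclic convolution."""
    m = (1 << p) - 1
    offs, bits, weights = dwt_layout(p, n)
    digits, rest = [], x % m
    for b in bits:                       # split x into variable-width digits
        digits.append(rest & ((1 << b) - 1))
        rest >>= b
    a = [w * d for w, d in zip(weights, digits)]   # apply the weights
    conv = [0.0] * n
    for i in range(n):                   # cyclic convolution (FFT stand-in)
        for j in range(n):
            conv[(i + j) % n] += a[i] * a[j]
    z = [round(c / w) for c, w in zip(conv, weights)]  # unweight and round
    # Recombine; Python's big integers take care of the carries here.
    return sum(zk << offs[k] for k, zk in enumerate(z)) % m
```

Note that the vector has only n digits holding p bits in total, with no zero-padding: the wraparound of the cyclic convolution, combined with the weights, performs the reduction modulo 2^p - 1 implicitly. For example, with p = 5 and n = 2 the input 10 has digits (2, 1) with widths (3, 2) bits, the unweighted outputs are (6, 4), and 6 + 4·8 = 38 ≡ 7 (mod 31), which is indeed 10² mod 31.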

The upshot is, to write the world's fastest Mersenne testing program, one must write (or make use of) the world's fastest FFT algorithm.
4) What type of FFT algorithm does Mlucas use?
Mlucas uses a custom FFT implementation written by me (EWM). I first started on this algorithmic journey in the late summer of 1996, and being a complete novice at transform-based arithmetic at the time, the first FFT routines I used were those from NR. Since then, the code has greatly evolved, and the FFT I currently use looks absolutely nothing like the original one, although it is doing basically the same thing (except for the non-power-of-2 vector length routines - NR has nothing along those lines.) In recent years I have also augmented the original generic high-level C-code FFT implementation with inline assembly code to take advantage of the more-recent x86 processors' SIMD vector processing capabilities. This more than doubles the program's per-cycle throughput on AMD64 and Intel CPUs supporting SSE2 (roughly, Opteron through Core2), and nearly doubles it again on Intel CPUs supporting the newer AVX instruction set with its 256-bit-wide registers, each of which can hold 4 doubles.
(Alas, at least as of this mid-2015 writing AMD's AVX-supporting offerings are woeful -- their total throughput for AVX code is typically *less* than for an SSE2 build running on the same chip. One hopes this will be remedied soon, but it's not up to me.)
5) How does the Mlucas FFT compare to other high-performance FFT implementations, such as the FFTW package?
I have not had time or desire to package the FFT core of Mlucas into a form suitable for inclusion in the FFTW benchmarks, but my own comparisons indicate that the Mlucas FFT is typically at least twice as fast as FFTW for the vector lengths of interest to Mersenne prime searchers (real vectors of length 128K and larger, where K=2¹⁰=1024) running on comparable hardware.

Happy hunting - perhaps you will be the person to discover the next Mersenne prime!

GNU Free Documentation License

Copyright (C) 2015 Ernst W. Mayer. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.

README.html -- Last Revised: 07 Feb 2016

Send mail to ewmayer@aol.com
