Please check http://hogranch.com/mayer/README.html#news for the latest news.

07 Feb 2016: Patched source file:

        Several users have reported a bug in the get_fft_radices.c file,
        which affects non-SIMD unthreaded builds (i.e. ones for which the
        preprocessor flag USE_ONLY_LARGE_LEAD_RADICES = False based on the logic
        at the top of the file); my pre-release testing apparently omitted this
        build mode. The error message looks like this:

                ../src/get_fft_radices.c: In function 'get_fft_radices':
                ../src/get_fft_radices.c:1446:3: error: duplicate case value
                ../src/get_fft_radices.c:1443:3: error: previously used here

        The fix is either to increment by 1 both of the two case values in the
        #ifndef USE_ONLY_LARGE_LEAD_RADICES-wrapped logic following the lines
        flagged above, or to use this patched version of the file
        (15KB, md5 checksum = db5d2504d58897229d0366f4749b4131) from my
        dev-branch, which fixes the duplicate case error and also adds some
        radix sets which should help with non-SIMD build performance.
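
        As a rough illustration of this class of error (this is not the
        actual get_fft_radices.c source; the function name, case values and
        returned radices below are invented), a preprocessor-guarded block can
        add case labels that collide with ones outside the guard. The sketch
        shows the corrected form, with the collision described in the comment:

```c
#include <assert.h>

/* Hypothetical stand-in for a radix-set lookup. The case values and the
 * returned radices are invented for illustration only. */
int lead_radix(int radix_set)
{
    assert(radix_set >= 0);
    switch (radix_set) {
    case 0: return 32;
    case 1: return 16;
#ifndef USE_ONLY_LARGE_LEAD_RADICES
    /* In the broken form these two labels read `case 1:` and `case 2:`,
     * so `case 1:` appeared twice whenever this block was compiled in,
     * giving "error: duplicate case value". Incrementing both labels
     * by 1 removes the collision. */
    case 2: return 8;
    case 3: return 4;
#endif
    default: return -1;  /* no such radix set */
    }
}
```

        Builds that define USE_ONLY_LARGE_LEAD_RADICES never compile the
        guarded labels, which would explain why such a collision can go
        unnoticed in those build modes.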

24 Aug 2015: Beta versions of automated build tools:

        Thanks to Alex Vong for these, which he created as part of a proposed
        Debian packaging of the Mlucas 14.1 release. The resulting tarballs
        also contain numerous bugfixes (mostly minor) which will also be in the
        upcoming v15 release.

        In addition to pristine tarballs with C source files only (and the
        manual build procedure documented below in
        "Download and Build the Source Code"), we provide experimental tarballs
        with autotools as well. For the current release these come as two
        "same package, different compression tools" tarballs:
        mlucas-14.1.tar.gz (3.4 MB) and mlucas-14.1.tar.xz (1.5 MB). If your
        Linux distro (or personal setup) has support for the Xz compression
        package, you will want the second, much smaller tarball. After
        downloading one of the tarballs, you can optionally verify the
        integrity of the downloaded file. First import the needed public key
        via

                $ gpg --keyserver pgp.mit.edu --recv-keys 93518580

        Then, depending on whether you downloaded the .gz or .xz-compressed
        tarball, download mlucas-14.1.tar.gz.sig or mlucas-14.1.tar.xz.sig,
        respectively, to obtain the corresponding detached package signature.
        (Note that despite the extensions these are both plain ASCII text
        files.)

        Finally, verify the package using

                $ gpg --verify mlucas-14.1.tar.gz.sig mlucas-14.1.tar.gz

        or

                $ gpg --verify mlucas-14.1.tar.xz.sig mlucas-14.1.tar.xz

        A successful signature verification returns
        "Good signature from Alex Vong <e-mail address>". Note that the
        ensuing

                WARNING: This key is not certified with a trusted signature!
                There is no indication that the signature belongs to the owner.

        is expected.

        The README file in the untarred package (a text file, not to be
        confused with this HTML page which you are reading) contains simple
        instructions on how to compile, run tests and install. If you do not
        understand the instructions, please consult the INSTALL file which
        contains detailed instructions and explanations. You can also search
        for the basic Linux make-package command sequence `./configure && make'
        on the Internet.

11 Dec 2014: v14.1 released. Again thanks to Mike Vang for doing
independent-build and QA work. This release features significant performance
and accuracy improvements for all recent x86 platforms, and especial gains for
Intel Haswell (and soon, Broadwell) users.

First, a note on version numbering: After many years frozen at 3.0x
(since my x86 code was until recently wildly uncompetitive with Prime95),
now that I'm < 2x slower (on Haswell+), I have resumed version numbering
according to the scheme:

        Major index = year - 2000
        Minor index = release # of that year, zero-indexed.

As before, a patch suffix of x, y, or z following the numeric index indicates
an [alpha,beta,gamma] (experimental, unstable) code. Since I consider this
release stable and it is the 2nd release of the year, it is numbered
14.1 = [20]14.[2 - 1][no xyz suffix].
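
The numbering rule can be written out as a small sketch. The struct and
function names here are hypothetical and are not part of Mlucas; only the
arithmetic (year minus 2000, zero-indexed release count) comes from the
scheme above.

```c
#include <assert.h>

/* Numeric part of a version such as 14.1 (any x/y/z suffix is omitted). */
struct version {
    int major;  /* year - 2000 */
    int minor;  /* zero-indexed release number within that year */
};

/* release_ordinal counts releases within the year starting from 1,
 * so the 2nd release of 2014 is (2014, 2). */
struct version version_for_release(int year, int release_ordinal)
{
    assert(year >= 2000 && release_ordinal >= 1);
    struct version v;
    v.major = year - 2000;
    v.minor = release_ordinal - 1;
    return v;
}
```

For example, version_for_release(2014, 2) yields major 14 and minor 1,
matching the 14.1 label of this release.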

What's new/improved:

        1. Self-test now has prestored residues for 10000 iterations (at least
        through FFT length 18432K), in addition to the previously-supported 100
        and 1000. As before, to use a non-default #iters (default is 100) for a
        given self-test range, add '-iters [1000 | 10000]' to the command line.

        2. One no longer needs any special flags like DFT_V* for any FMA-using
        routines - just -DUSE_THREADS to enable multithreading and
        -DUSE_[SSE2 | AVX | AVX2] to select an x86 SIMD-vector-instruction
        target.

        3. Propagated fused multiply-add (FMA) optimizations to all key
        discrete Fourier transform (DFT) and related arithmetic macros.
        "FMA everywhere"
        means Haswell users should see at least a 10% speedup for their AVX2
        builds, compared to plain AVX.

        4. Overall accuracy should be appreciably better, meaning users should
        see very few roundoff warnings, even for 10Kiter self-tests.

        5. The program now only reports roundoff warnings if it encounters a
        fractional part > 0.40625 (the previous threshold was >= 0.4) during
        the carry step. Some self-tests (meaning exponents right at the upper
        limit for the given FFT length, by definition) were emitting slews of
        0.40625 warnings, but as this error level is nearly always benign,
        I've silenced the warnings for it. Larger errors will still emit
        warnings as before.

        6. The accuracy-problematic radix-11 DFTs (used to build composite
        leading radices such as 44 and 176) have improved accuracy in SSE2 and
        AVX modes, but will still emit a few roundoff warnings in longer
        self-tests for certain radix combos in those build types. In AVX2 mode,
        however, the fact that "multiplies are free" (assuming we can fuse them
        with adds, which we can to a very large extent in this case) allowed me
        to implement an entirely different radix-11 algorithm which is much more
        multiply-heavy, but has significantly better roundoff properties. Thus
        AVX2 builders will see dramatically lower roundoff errors for FFT
        lengths using the aforementioned leading radices 44 and 176.

        7. I added large-stride prefetching to all the carry routines. The two
        DFT passes bookending the carry step (the final radix pass of the
        inverse FFT before the normalize/carry step, and the initial radix
        pass of the subsequent iteration's forward FFT after it) access data
        in large strides and are thus problematic for the kinds of default
        data prefetching done by most x86 hardware. That "manual assist"
        prefetch should provide a nice boost (5-10% for me at FFT lengths in
        the Mdoubles range) for all build modes.
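
        The threshold change in item 5 can be sketched as follows. This is
        not the Mlucas carry code; the helper names are hypothetical, and the
        sketch assumes the value being rounded fits in a long long. It only
        contrasts the new rule (> 0.40625) with the old one (>= 0.4).

```c
#include <assert.h>

/* Threshold quoted in item 5: 0.40625 = 13/32, which is exactly
 * representable in binary floating point. */
#define ROUNDOFF_WARN_THRESHOLD 0.40625

/* Distance from x to the nearest integer, in [0, 0.5].
 * Hypothetical helper; assumes |x| is small enough for a long long. */
double frac_error(double x)
{
    assert(x > -9.0e18 && x < 9.0e18);
    /* Round half away from zero via truncation of x +/- 0.5. */
    double nearest = (double)(long long)(x + (x >= 0.0 ? 0.5 : -0.5));
    double err = x - nearest;
    return err < 0.0 ? -err : err;
}

/* Returns 1 if a warning would be reported under the v14.1 rule. */
int should_warn(double x)
{
    return frac_error(x) > ROUNDOFF_WARN_THRESHOLD;
}

/* The earlier rule, for comparison. */
int should_warn_old(double x)
{
    return frac_error(x) >= 0.4;
}
```

        A value such as 5.40625 sits exactly on the new threshold, so it
        triggered a warning under the old rule but is silent under the new
        one, while 5.4375 (error 0.4375) still warns.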
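
        The idea behind item 7 can be illustrated with a minimal sketch of
        software prefetching in a strided loop. This is not the Mlucas
        implementation: the function name and the prefetch distance of 4
        strides are assumptions made for illustration, and
        __builtin_prefetch is a GCC/Clang extension rather than standard C.

```c
#include <assert.h>
#include <stddef.h>

/* How many strides ahead to prefetch; an arbitrary illustrative choice. */
#define PREFETCH_DISTANCE 4

/* Sum every stride-th element of a[0..n-1], issuing a prefetch hint for
 * the element that will be needed PREFETCH_DISTANCE iterations later. */
double strided_sum(const double *a, size_t n, size_t stride)
{
    assert(stride > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i += stride) {
        size_t ahead = i + PREFETCH_DISTANCE * stride;
        if (ahead < n) {
            /* rw = 0 (read), locality = 3 (keep in all cache levels). */
            __builtin_prefetch(&a[ahead], 0, 3);
        }
        sum += a[i];
    }
    return sum;
}
```

        The hint does not change the result; it only asks the hardware to
        start loading data that a stride-unaware hardware prefetcher may not
        fetch on its own. A suitable prefetch distance depends on the
        platform and has to be found by measurement.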

Copyright:
        Copyright (C) 2015 Ernst W. Mayer.
        Permission is granted to copy, distribute and/or modify
        this document under the terms of the GNU Free Documentation License,
        Version 1.3 or any later version published by
        the Free Software Foundation; with no Invariant Sections,
        no Front-Cover Texts, and no Back-Cover Texts.