123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231 |
- .TH ENT "1" "July 2007" "ent" "http://www.fourmilab.ch/random/"
- .SH NAME
- \fBent\fR \- pseudorandom number sequence test
- .PP
- This page describes a program, \fBent\fR, which applies various tests to
- sequences of bytes stored in files and reports the results of those tests.
- The program is useful for those evaluating pseudorandom number generators
- for encryption and statistical sampling applications, compression
- algorithms, and other applications where the information density of a file
- is of interest.
- .SH SYNOPSIS
- \fBent\fR [ \-bcftu ] [ \fIinfile\fR ]
- .SH DESCRIPTION
- \fBent\fR performs a variety of tests on the stream of bytes in \fIinfile\fR (or
- standard input if no \fIinfile\fR is specified) and produces output as follows
- on the standard output stream:
- .PP
- .nf
- Entropy = 7.980627 bits per character.
- Optimum compression would reduce the size
- of this 51768 character file by 0 percent.
- Chi square distribution for 51768 samples is 1542.26, and randomly
- would exceed this value 0.01 percent of the times.
- Arithmetic mean value of data bytes is 125.93 (127.5 = random).
- Monte Carlo value for Pi is 3.169834647 (error 0.90 percent).
- Serial correlation coefficient is 0.004249 (totally uncorrelated = 0.0).
- .fi
- .PP
- The values calculated are as follows:
- .PP
- Entropy
- .PP
- The information density of the contents of the file, expressed as
- a number of bits per character. The results above, which resulted
- from processing an image file compressed with JPEG, indicate that
- the file is extremely dense in information -- essentially random.
- Hence, compression of the file is unlikely to reduce its size. By
- contrast, the C source code of the program has entropy of about
- 4.9 bits per character, indicating that optimal compression of the
- file would reduce its size by 38%. \fB[Hamming, pp. 104-108]\fR
- .PP
- Chi-square Test
- .PP
- The chi-square test is the most commonly used test for the
- randomness of data, and is extremely sensitive to errors in
- pseudorandom sequence generators. The chi-square distribution is
- calculated for the stream of bytes in the file and expressed as an
- absolute number and a percentage which indicates how frequently a
- truly random sequence would exceed the value calculated. We
- interpret the percentage as the degree to which the sequence
- tested is suspected of being non-random. If the percentage is
- greater than 99% or less than 1%, the sequence is almost certainly
- not random. If the percentage is between 99% and 95% or between 1%
- and 5%, the sequence is suspect. Percentages between 90% and 95%
- and 5% and 10% indicate the sequence is "almost suspect". Note
- that our JPEG file, while very dense in information, is far from
- random as revealed by the chi-square test.
- .PP
- Applying this test to the output of various pseudorandom sequence
- generators is interesting. The low-order 8 bits returned by the
- standard Unix rand() function, for example, yields:
- .PP
- .nf
- Chi square distribution for 500000 samples is 0.01, and randomly
- would exceed this value 99.99 percent of the times.
- .fi
- .PP
- While an improved generator \fB[Park & Miller]\fR reports:
- .PP
- .nf
- Chi square distribution for 500000 samples is 212.53, and
- randomly would exceed this value 95.00 percent of the times.
- .fi
- .PP
- Thus, the standard Unix generator (or at least the low-order bytes
- it returns) is unacceptably non-random, while the improved
- generator is much better but still sufficiently non-random to
- cause concern for demanding applications. Contrast both of these
- software generators with the chi-square result of a genuine random
- sequence created by timing radioactive decay events.
- .PP
- .nf
- Chi square distribution for 32768 samples is 237.05, and
- randomly would exceed this value 75.00 percent of the times.
- .fi
- .PP
- See \fB[Knuth, pp. 35-40]\fR for more information on the chi-square
- test. An interactive chi-square calculator is available at this
- site.
- .PP
- Arithmetic Mean
- .PP
- This is simply the result of summing the all the bytes (bits if
- the \fB\-b\fR option is specified) in the file and dividing by the file
- length. If the data are close to random, this should be about
- 127.5 (0.5 for \fB\-b\fR option output). If the mean departs from this
- value, the values are consistently high or low.
- .PP
- Monte Carlo Value for Pi
- .PP
- Each successive sequence of six bytes is used as 24 bit X and Y
- co-ordinates within a square. If the distance of the
- randomly-generated point is less than the radius of a circle
- inscribed within the square, the six-byte sequence is considered a
- "hit". The percentage of hits can be used to calculate the value
- of Pi. For very large streams (this approximation converges very
- slowly), the value will approach the correct value of Pi if the
- sequence is close to random. A 32768 byte file created by
- radioactive decay yielded:
- .PP
- .nf
- Monte Carlo value for Pi is 3.139648438 (error 0.06 percent).
- .fi
- .PP
- Serial Correlation Coefficient
- .PP
- This quantity measures the extent to which each byte in the file
- depends upon the previous byte. For random sequences, this value
- (which can be positive or negative) will, of course, be close to
- zero. A non-random byte stream such as a C program will yield a
- serial correlation coefficient on the order of 0.5. Wildly
- predictable data such as uncompressed bitmaps will exhibit serial
- correlation coefficients approaching 1. See \fB[Knuth, pp. 64-65]\fR for
- more details.
- .SH OPTIONS
- .IP \fB\-b\fR
- The input is treated as a stream of bits rather than of 8-bit
- bytes. Statistics reported reflect the properties of the
- bitstream.
- .IP \fB\-c\fR
- Print a table of the number of occurrences of each possible byte
- (or bit, if the \fB\-b\fR option is also specified) value, and the
- fraction of the overall file made up by that value. Printable
- characters in the ISO 8859-1 Latin1 character set are shown along
- with their decimal byte values. In non-terse output mode, values
- with zero occurrences are not printed.
- .IP \fB\-f\fR
- Fold upper case letters to lower case before computing statistics.
- Folding is done based on the ISO 8859-1 Latin1 character set, with
- accented letters correctly processed.
- .IP \fB\-t\fR
- Terse mode: output is written in Comma Separated Value (CSV)
- format, suitable for loading into a spreadsheet and easily read by
- any programming language. See Terse Mode Output Format below for
- additional details.
- .IP \fB\-u\fR
- Print how-to-call information.
- .SH FILES
- If no \fIinfile\fR is specified, \fBent\fR obtains its input from standard input.
- Output is always written to standard output.
- .SH TERSE MODE OUTPUT FORMAT
- Terse mode is selected by specifying the \fB\-t\fR option on the command line.
- Terse mode output is written in Comma Separated Value (CSV) format, which
- can be directly loaded into most spreadsheet programs and is easily read
- by any programming language. Each record in the CSV file begins with a
- record type field, which identifies the content of the following fields.
- If the \fB\-c\fR option is not specified, the terse mode output will consist of
- two records, as follows:
- .PP
- .nf
- 0,File-bytes,Entropy,Chi-square,Mean,Monte-Carlo-Pi,Serial-Correlation
- 1,file_length,entropy,chi_square,mean,Pi_value,correlation
- .fi
- .PP
- where the italicised values in the type 1 record are the numerical values
- for the quantities named in the type 0 column title record. If the \fB\-b\fR
- option is specified, the second field of the type 0 record will be
- "File-bits", and the file_length field in type 1 record will be given in
- bits instead of bytes. If the \fB\-c\fR option is specified, additional records
- are appended to the terse mode output which contain the character counts:
- .PP
- .nf
- 2,Value,Occurrences,Fraction
- 3,v,count,fraction
- . . .
- .fi
- .PP
- If the \fB\-b\fR option is specified, only two type 3 records will appear for the
- two bit values v=0 and v=1. Otherwise, 256 type 3 records are included,
- one for each possible byte value. The second field of a type 3 record
- indicates how many bytes (or bits) of value v appear in the input, and
- fraction gives the decimal fraction of the file which has value v (which
- is equal to the count value of this record divided by the file_length
- field in the type 1 record).
- .SH BUGS
- Note that the "optimal compression" shown for the file is computed from
- the byte- or bit-stream entropy and thus reflects compressibility based on
- a reading frame of the chosen width (8-bit bytes or individual bits if the
- \fB\-b\fR option is specified). Algorithms which use a larger reading frame, such
- as the Lempel-Ziv \fB[Lempel & Ziv]\fR algorithm, may achieve greater
- compression if the file contains repeated sequences of multiple bytes.
- .SH SEE ALSO
- \fIIntroduction to Probability and Statistics\fR
- .br
- http://www.fourmilab.ch/rpkp/experiments/statistics.html
- .PP
- \fB[Hamming]\fR
- .br
- Hamming, Richard W. \fICoding and Information Theory.\fR Englewood
- Cliffs NJ: Prentice-Hall, 1980.
- .PP
- \fB[Knuth]\fR
- .br
- Knuth, Donald E. \fIThe Art of Computer Programming, Volume 2 /
- Seminumerical Algorithms\fR. Reading MA: Addison-Wesley, 1969. ISBN
- 0-201-89684-2.
- .PP
- \fB[Lempel & Ziv]\fR
- .br
- Ziv J. and A. Lempel. "A Universal Algorithm for Sequential Data
- Compression". \fIIEEE Transactions on Information Theory\fR \fB23\fR, 3,
- pp. 337-343.
- .PP
- \fB[Park & Miller]\fR
- .br
- Park, Stephen K. and Keith W. Miller. "Random Number Generators:
- Good Ones Are Hard to Find". \fICommunications of the ACM\fR, October
- 1988, p. 1192.
- .SH COPYING
- This software is in the public domain. Permission to use, copy, modify,
- and distribute this software and its documentation for any purpose and
- without fee is hereby granted, without any conditions or restrictions.
- This software is provided "as is" without express or implied warranty.
- .SH AUTHOR
- John Walker
- .br
- October 20th, 1998
|