.TH ENT "1" "July 2007" "ent" "http://www.fourmilab.ch/random/"
.SH NAME
\fBent\fR \- pseudorandom number sequence test
.PP
This page describes a program, \fBent\fR, which applies various tests to
sequences of bytes stored in files and reports the results of those tests.
The program is useful for those evaluating pseudorandom number generators
for encryption and statistical sampling applications, compression
algorithms, and other applications where the information density of a file
is of interest.
.SH SYNOPSIS
\fBent\fR [ \-bcftu ] [ \fIinfile\fR ]
.SH DESCRIPTION
\fBent\fR performs a variety of tests on the stream of bytes in \fIinfile\fR (or
standard input if no \fIinfile\fR is specified) and produces output as follows
on the standard output stream:
.PP
.nf
Entropy = 7.980627 bits per character.
Optimum compression would reduce the size
of this 51768 character file by 0 percent.
Chi square distribution for 51768 samples is 1542.26, and randomly
would exceed this value 0.01 percent of the times.
Arithmetic mean value of data bytes is 125.93 (127.5 = random).
Monte Carlo value for Pi is 3.169834647 (error 0.90 percent).
Serial correlation coefficient is 0.004249 (totally uncorrelated = 0.0).
.fi
.PP
The values calculated are as follows:
.PP
Entropy
.PP
The information density of the contents of the file, expressed as
a number of bits per character. The results above, which resulted
from processing an image file compressed with JPEG, indicate that
the file is extremely dense in information -- essentially random.
Hence, compression of the file is unlikely to reduce its size. By
contrast, the C source code of the program has entropy of about
4.9 bits per character, indicating that optimal compression of the
file would reduce its size by 38%. \fB[Hamming, pp. 104-108]\fR
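.PP
As an illustration (a simplified C sketch, not the program's actual
source), the entropy reported corresponds to the Shannon entropy of the
observed byte frequencies, and the "optimum compression" percentage
then follows from it as 100*(8 - entropy)/8:
.PP
.nf
#include <math.h>
#include <stddef.h>

/* Sketch: Shannon entropy, in bits per byte, of a buffer. */
double entropy_bits_per_byte(const unsigned char *buf, size_t len)
{
    long count[256] = {0};
    double h = 0.0;
    size_t i;

    for (i = 0; i < len; i++)
        count[buf[i]]++;
    for (i = 0; i < 256; i++) {
        if (count[i] > 0) {
            double p = (double) count[i] / (double) len;
            h -= p * log2(p);    /* -(sum of p * log2(p)) */
        }
    }
    return h;                    /* 8.0 for perfectly random bytes */
}
.fi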
.PP
Chi-square Test
.PP
The chi-square test is the most commonly used test for the
randomness of data, and is extremely sensitive to errors in
pseudorandom sequence generators. The chi-square distribution is
calculated for the stream of bytes in the file and expressed as an
absolute number and a percentage which indicates how frequently a
truly random sequence would exceed the value calculated. We
interpret the percentage as the degree to which the sequence
tested is suspected of being non-random. If the percentage is
greater than 99% or less than 1%, the sequence is almost certainly
not random. If the percentage is between 99% and 95% or between 1%
and 5%, the sequence is suspect. Percentages between 90% and 95%
or between 5% and 10% indicate the sequence is "almost suspect".
Note that our JPEG file, while very dense in information, is far
from random, as revealed by the chi-square test.
.PP
Applying this test to the output of various pseudorandom sequence
generators is interesting. The low-order 8 bits returned by the
standard Unix rand() function, for example, yield:
.PP
.nf
Chi square distribution for 500000 samples is 0.01, and randomly
would exceed this value 99.99 percent of the times.
.fi
.PP
An improved generator \fB[Park & Miller]\fR, by contrast, reports:
.PP
.nf
Chi square distribution for 500000 samples is 212.53, and
randomly would exceed this value 95.00 percent of the times.
.fi
.PP
Thus, the standard Unix generator (or at least the low-order bytes
it returns) is unacceptably non-random, while the improved
generator is much better but still sufficiently non-random to
cause concern for demanding applications. Contrast both of these
software generators with the chi-square result of a genuine random
sequence created by timing radioactive decay events:
.PP
.nf
Chi square distribution for 32768 samples is 237.05, and
randomly would exceed this value 75.00 percent of the times.
.fi
.PP
See \fB[Knuth, pp. 35-40]\fR for more information on the chi-square
test. An interactive chi-square calculator is available at the
Fourmilab Web site, http://www.fourmilab.ch/.
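.PP
As a sketch of the statistic itself (assuming 256 equally likely byte
values, hence 255 degrees of freedom; an illustration, not the
program's actual source), the chi-square value can be computed from the
byte counts as shown below. Converting it to the exceedance percentage
requires the chi-square cumulative distribution function, omitted here:
.PP
.nf
#include <stddef.h>

/* Sketch: chi-square statistic of observed byte counts against a
   uniform expectation of len/256 occurrences per value. */
double chi_square_bytes(const long count[256], size_t len)
{
    double expected = (double) len / 256.0;
    double chisq = 0.0;
    int i;

    if (len == 0)
        return 0.0;
    for (i = 0; i < 256; i++) {
        double d = (double) count[i] - expected;
        chisq += (d * d) / expected;    /* sum of (obs - exp)^2 / exp */
    }
    return chisq;
}
.fi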
.PP
Arithmetic Mean
.PP
This is simply the result of summing all the bytes (bits if
the \fB\-b\fR option is specified) in the file and dividing by the file
length. If the data are close to random, this should be about
127.5 (0.5 for \fB\-b\fR option output). If the mean departs from this
value, the values are consistently high or low.
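.PP
A minimal sketch of this calculation in C (byte mode; with \fB\-b\fR the
same average is taken over individual 0/1 bits instead):
.PP
.nf
#include <stddef.h>

/* Sketch: arithmetic mean of the data bytes; near 127.5 if random. */
double mean_bytes(const unsigned char *buf, size_t len)
{
    double sum = 0.0;
    size_t i;

    for (i = 0; i < len; i++)
        sum += buf[i];
    return len ? sum / (double) len : 0.0;
}
.fi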
.PP
Monte Carlo Value for Pi
.PP
Each successive sequence of six bytes is used as 24-bit X and Y
co-ordinates within a square. If the distance of the
randomly-generated point is less than the radius of a circle
inscribed within the square, the six-byte sequence is considered a
"hit". The percentage of hits can be used to calculate the value
of Pi. For very large streams (this approximation converges very
slowly), the value will approach the correct value of Pi if the
sequence is close to random. A 32768 byte file created by
radioactive decay yielded:
.PP
.nf
Monte Carlo value for Pi is 3.139648438 (error 0.06 percent).
.fi
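.PP
A sketch of the estimate under these assumptions (one point per six
input bytes, a quarter circle of radius 2^24-1, Pi estimated as
4*hits/points; illustrative only, not the program's actual source):
.PP
.nf
#include <stddef.h>

/* Sketch: Monte Carlo estimate of Pi.  Each six-byte group supplies
   24-bit X and Y co-ordinates; points inside the inscribed circle
   count as hits. */
double monte_carlo_pi(const unsigned char *buf, size_t len)
{
    const double r = 16777215.0;        /* 2^24 - 1 */
    long points = 0, hits = 0;
    size_t i;

    for (i = 0; i + 6 <= len; i += 6) {
        double x = buf[i] * 65536.0 + buf[i + 1] * 256.0 + buf[i + 2];
        double y = buf[i + 3] * 65536.0 + buf[i + 4] * 256.0 + buf[i + 5];
        points++;
        if (x * x + y * y <= r * r)     /* inside the quarter circle */
            hits++;
    }
    return points ? 4.0 * ((double) hits / (double) points) : 0.0;
}
.fi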
.PP
Serial Correlation Coefficient
.PP
This quantity measures the extent to which each byte in the file
depends upon the previous byte. For random sequences, this value
(which can be positive or negative) will, of course, be close to
zero. A non-random byte stream such as a C program will yield a
serial correlation coefficient on the order of 0.5. Wildly
predictable data such as uncompressed bitmaps will exhibit serial
correlation coefficients approaching 1. See \fB[Knuth, pp. 64-65]\fR for
more details.
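.PP
A sketch of the coefficient (treating the data as a circular sequence,
so the last byte is paired with the first; illustrative only, not the
program's actual source):
.PP
.nf
#include <stddef.h>

/* Sketch: serial correlation coefficient of successive bytes.
   Values near 0 indicate little dependence between neighbours. */
double serial_correlation(const unsigned char *buf, size_t len)
{
    double t1 = 0.0, t2 = 0.0, t3 = 0.0, denom;
    double n = (double) len;
    size_t i;

    if (len < 2)
        return 0.0;
    for (i = 0; i < len; i++) {
        double x = buf[i];
        double y = buf[(i + 1) % len];  /* successor, wrapping around */
        t1 += x * y;                    /* sum of products of pairs   */
        t2 += x;                        /* sum of values              */
        t3 += x * x;                    /* sum of squares             */
    }
    denom = n * t3 - t2 * t2;
    if (denom == 0.0)                   /* constant input: undefined  */
        return 0.0;
    return (n * t1 - t2 * t2) / denom;
}
.fi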
.SH OPTIONS
.IP \fB\-b\fR
The input is treated as a stream of bits rather than of 8-bit
bytes. Statistics reported reflect the properties of the
bitstream.
.IP \fB\-c\fR
Print a table of the number of occurrences of each possible byte
(or bit, if the \fB\-b\fR option is also specified) value, and the
fraction of the overall file made up by that value. Printable
characters in the ISO 8859-1 Latin1 character set are shown along
with their decimal byte values. In non-terse output mode, values
with zero occurrences are not printed.
.IP \fB\-f\fR
Fold upper case letters to lower case before computing statistics.
Folding is done based on the ISO 8859-1 Latin1 character set, with
accented letters correctly processed.
.IP \fB\-t\fR
Terse mode: output is written in Comma Separated Value (CSV)
format, suitable for loading into a spreadsheet and easily read by
any programming language. See Terse Mode Output Format below for
additional details.
.IP \fB\-u\fR
Print how-to-call information.
.SH FILES
If no \fIinfile\fR is specified, \fBent\fR obtains its input from standard input.
Output is always written to standard output.
.SH TERSE MODE OUTPUT FORMAT
Terse mode is selected by specifying the \fB\-t\fR option on the command line.
Terse mode output is written in Comma Separated Value (CSV) format, which
can be directly loaded into most spreadsheet programs and is easily read
by any programming language. Each record in the CSV file begins with a
record type field, which identifies the content of the following fields.
If the \fB\-c\fR option is not specified, the terse mode output will consist of
two records, as follows:
.PP
.nf
0,File-bytes,Entropy,Chi-square,Mean,Monte-Carlo-Pi,Serial-Correlation
1,\fIfile_length\fR,\fIentropy\fR,\fIchi_square\fR,\fImean\fR,\fIPi_value\fR,\fIcorrelation\fR
.fi
.PP
where the italicised values in the type 1 record are the numerical values
for the quantities named in the type 0 column title record. If the \fB\-b\fR
option is specified, the second field of the type 0 record will be
"File-bits", and the \fIfile_length\fR field in the type 1 record will be
given in bits instead of bytes. If the \fB\-c\fR option is specified,
additional records are appended to the terse mode output which contain the
character counts:
.PP
.nf
2,Value,Occurrences,Fraction
3,\fIv\fR,\fIcount\fR,\fIfraction\fR
\&. . .
.fi
.PP
If the \fB\-b\fR option is specified, only two type 3 records will appear, for
the two bit values \fIv\fR=0 and \fIv\fR=1. Otherwise, 256 type 3 records are
included, one for each possible byte value. The second field of a type 3
record indicates how many bytes (or bits) of value \fIv\fR appear in the
input, and \fIfraction\fR gives the decimal fraction of the file which has
value \fIv\fR (which is equal to the \fIcount\fR value of this record divided
by the \fIfile_length\fR field in the type 1 record).
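.PP
As an illustration of consuming this format (a hypothetical helper, not
part of \fBent\fR itself), a C program could extract the summary values
from the type 1 record like this:
.PP
.nf
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: scan terse-mode output and collect the six numeric fields
   of the type 1 record: file length, entropy, chi-square, mean,
   Monte Carlo Pi, and serial correlation.  Returns the number of
   fields parsed, or 0 if no type 1 record was found. */
int read_terse_summary(FILE *fp, double out[6])
{
    char line[1024];

    while (fgets(line, sizeof line, fp) != NULL) {
        if (strncmp(line, "1,", 2) == 0) {
            char *field = strtok(line + 2, ",");
            int n = 0;

            while (field != NULL && n < 6) {
                out[n++] = atof(field);
                field = strtok(NULL, ",");
            }
            return n;
        }
    }
    return 0;
}
.fi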
.SH BUGS
Note that the "optimal compression" shown for the file is computed from
the byte- or bit-stream entropy and thus reflects compressibility based on
a reading frame of the chosen width (8-bit bytes or individual bits if the
\fB\-b\fR option is specified). Algorithms which use a larger reading frame, such
as the Lempel-Ziv \fB[Lempel & Ziv]\fR algorithm, may achieve greater
compression if the file contains repeated sequences of multiple bytes.
.SH SEE ALSO
\fIIntroduction to Probability and Statistics\fR
.br
http://www.fourmilab.ch/rpkp/experiments/statistics.html
.PP
\fB[Hamming]\fR
.br
Hamming, Richard W. \fICoding and Information Theory.\fR Englewood
Cliffs NJ: Prentice-Hall, 1980.
.PP
\fB[Knuth]\fR
.br
Knuth, Donald E. \fIThe Art of Computer Programming, Volume 2 /
Seminumerical Algorithms\fR. Reading MA: Addison-Wesley, 1969. ISBN
0-201-89684-2.
.PP
\fB[Lempel & Ziv]\fR
.br
Ziv J. and A. Lempel. "A Universal Algorithm for Sequential Data
Compression". \fIIEEE Transactions on Information Theory\fR \fB23\fR, 3,
1977, pp. 337-343.
.PP
\fB[Park & Miller]\fR
.br
Park, Stephen K. and Keith W. Miller. "Random Number Generators:
Good Ones Are Hard to Find". \fICommunications of the ACM\fR, October
1988, p. 1192.
.SH COPYING
This software is in the public domain. Permission to use, copy, modify,
and distribute this software and its documentation for any purpose and
without fee is hereby granted, without any conditions or restrictions.
This software is provided "as is" without express or implied warranty.
.SH AUTHOR
John Walker
.br
October 20th, 1998