img2pdf

Lossless conversion of raster images to PDF. You should use img2pdf if
your priorities are (in this order):

1. always lossless: the image embedded in the PDF will always have the
   exact same color information for every pixel as the input
2. small: if possible, the difference in filesize between the input image
   and the output PDF will only be the overhead of the PDF container itself
3. fast: if possible, the input image is just pasted into the PDF document
   as-is without any CPU hungry re-encoding of the pixel data

Conventional conversion software (like ImageMagick) would either:

1. not be lossless because of lossy re-encoding to JPEG
2. not be small because of wasteful flate encoding of raw pixel data
3. not be fast because input data gets re-encoded

Another advantage of not having to re-encode the input (in most common
situations) is that img2pdf is able to handle much larger input than
other software, because the raw pixel data never has to be loaded into
memory.

For JPEG, JPEG2000, non-interlaced PNG and TIFF images with CCITT Group
4 encoded data, img2pdf directly embeds the image data into the PDF
without re-encoding it. It thus treats the PDF format merely as a
container format for the image data. In these cases, img2pdf only
increases the filesize by the size of the PDF container (typically
around 500 to 700 bytes). Since data is only copied and not re-encoded,
img2pdf is also typically faster than other solutions for these input
formats.

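
As a rough illustration of the rule above, the following sketch decides
whether an input could be copied into the PDF unchanged. The function name
and its parameters are hypothetical and are not part of img2pdf's API; the
real decision is made inside img2pdf based on the image metadata.

```python
def can_embed_directly(fmt, interlaced=False, tiff_compression=None):
    """Return True if, per the rule above, the data could be copied as-is.

    Hypothetical helper for illustration only; not img2pdf's actual code.
    fmt is a format name such as "JPEG", "JPEG2000", "PNG" or "TIFF".
    """
    fmt = fmt.upper()
    if fmt in ("JPEG", "JPEG2000"):
        return True
    if fmt == "PNG":
        # Only non-interlaced PNG is embedded without re-encoding.
        return not interlaced
    if fmt == "TIFF":
        # Only TIFF with CCITT Group 4 encoded data qualifies.
        return tiff_compression == "group4"
    # Everything else has to be transformed first.
    return False


for args in [("JPEG",), ("PNG", True), ("TIFF", False, "group4"), ("GIF",)]:
    print(args, can_embed_directly(*args))
```

Inputs for which this returns False fall into the "all other input types"
case, where the pixel data has to be transformed first.
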
For all other input types, img2pdf first has to transform the pixel data
to make it compatible with PDF. In most cases, the PNG Paeth filter is
applied to the pixel data. For monochrome input, CCITT Group 4 is used
instead. For CMYK input alone, no filter is applied before the final
flate compression.

Usage

The images must be provided as files because img2pdf needs to seek in
the file descriptor.

If no output file is specified with the -o/--output option, output will
be written to stdout. A typical invocation is:

    $ img2pdf img1.png img2.jpg -o out.pdf

The detailed documentation can be accessed by running:

    $ img2pdf --help

Bugs

If you find a JPEG, JPEG2000, PNG or CCITT Group 4 encoded TIFF file
that, when embedded into the PDF, cannot be read by Adobe Acrobat
Reader, please contact me.

I have not yet figured out how to determine the colorspace of
JPEG2000 files. Therefore JPEG2000 files use DeviceRGB by default. For
JPEG2000 files with other colorspaces, you must explicitly specify the
colorspace using the --colorspace option.

Input images with alpha channels are not allowed. PDF only supports
transparency using binary masks but is unable to store 8-bit
transparency information as part of the image itself. Because img2pdf
is always lossless, input images must not carry transparency
information.

img2pdf uses PIL (or Pillow) to obtain image metadata and to
convert the input if necessary. To prevent decompression bomb denial of
service attacks, Pillow limits the maximum number of pixels an input
image is allowed to have. If you are sure that you know what you are
doing, you can disable this safeguard by passing the
--pillow-limit-break option to img2pdf. This makes it possible to
process even very large input images.

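
To get a feel for when the safeguard matters, the sketch below compares an
image's pixel count against a limit. The default used here (89478485
pixels) is what Pillow computes for Image.MAX_IMAGE_PIXELS at the time of
writing; treat it as an assumption that may differ between Pillow
versions. The helper name is hypothetical.

```python
# Assumed Pillow default for Image.MAX_IMAGE_PIXELS; check your Pillow version.
ASSUMED_PILLOW_LIMIT = 1024 * 1024 * 1024 // 4 // 3


def exceeds_limit(width, height, limit=ASSUMED_PILLOW_LIMIT):
    """Return True if a width x height image has more pixels than the limit."""
    return width * height > limit


print(exceeds_limit(8000, 6000))    # 48,000,000 pixels
print(exceeds_limit(20000, 20000))  # 400,000,000 pixels
```

Inputs for which this returns True are the ones where Pillow's safeguard
comes into play and where --pillow-limit-break becomes relevant.
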
Installation

On Debian- and Ubuntu-based systems, img2pdf can be installed from the
official repositories:

    $ apt install img2pdf

If you want to install it using pip, you can run:

    $ pip3 install img2pdf

If you prefer to install from source code, use:

    $ cd img2pdf/
    $ pip3 install .

To test the console script without installing the package on your
system, use virtualenv:

    $ cd img2pdf/
    $ virtualenv ve
    $ ve/bin/pip3 install .

You can then test the converter using:

    $ ve/bin/img2pdf -o test.pdf src/tests/test.jpg

For Microsoft Windows users, PyInstaller-based .exe files are produced
by appveyor. If you don't want to install Python before using img2pdf,
you can head to appveyor and click on "Artifacts" to download the latest
version: https://ci.appveyor.com/project/josch/img2pdf

GUI

There is an experimental GUI with all settings currently disabled.
You can directly convert images to PDF but you cannot set any options
via the GUI yet. If you are interested in adding more features to the
GUI, please submit a merge request. The GUI is based on tkinter and
works on Linux, Windows and macOS.

Library

The package can also be used as a library:

    import glob
    import os

    import img2pdf

    # opening from filename
    with open("name.pdf", "wb") as f:
        f.write(img2pdf.convert("test.jpg"))

    # opening from file handle (binary mode, the image is read as raw bytes)
    with open("name.pdf", "wb") as f1, open("test.jpg", "rb") as f2:
        f1.write(img2pdf.convert(f2))

    # using in-memory image data ("..." stands for the rest of the image bytes)
    with open("name.pdf", "wb") as f:
        f.write(img2pdf.convert(b"\x89PNG..."))

    # multiple inputs (variant 1)
    with open("name.pdf", "wb") as f:
        f.write(img2pdf.convert("test1.jpg", "test2.png"))

    # multiple inputs (variant 2)
    with open("name.pdf", "wb") as f:
        f.write(img2pdf.convert(["test1.jpg", "test2.png"]))

    # convert all files ending in .jpg inside a directory
    dirname = "/path/to/images"
    with open("name.pdf", "wb") as f:
        imgs = []
        for fname in os.listdir(dirname):
            if not fname.endswith(".jpg"):
                continue
            path = os.path.join(dirname, fname)
            if os.path.isdir(path):
                continue
            imgs.append(path)
        f.write(img2pdf.convert(imgs))

    # convert all files ending in .jpg in a directory and its subdirectories
    dirname = "/path/to/images"
    with open("name.pdf", "wb") as f:
        imgs = []
        # "files" must not be called "f", or it would shadow the output file
        for r, _, files in os.walk(dirname):
            for fname in files:
                if not fname.endswith(".jpg"):
                    continue
                imgs.append(os.path.join(r, fname))
        f.write(img2pdf.convert(imgs))

    # convert all files matching a glob
    with open("name.pdf", "wb") as f:
        f.write(img2pdf.convert(glob.glob("/path/to/*.jpg")))

    # writing to file descriptor
    with open("name.pdf", "wb") as f1, open("test.jpg", "rb") as f2:
        img2pdf.convert(f2, outputstream=f1)

    # specify paper size (A4)
    a4inpt = (img2pdf.mm_to_pt(210), img2pdf.mm_to_pt(297))
    layout_fun = img2pdf.get_layout_fun(a4inpt)
    with open("name.pdf", "wb") as f:
        f.write(img2pdf.convert("test.jpg", layout_fun=layout_fun))

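
The paper size in the last example is given in PDF points (1/72 inch). As
a sanity check, the conversion can be reproduced by hand. The function
below is a stand-alone reimplementation for illustration, assumed to match
what img2pdf.mm_to_pt computes; it does not import img2pdf.

```python
def mm_to_pt(length_mm):
    """Convert millimetres to PDF points (72 points and 25.4 mm per inch)."""
    return 72.0 * length_mm / 25.4


a4_width_pt = mm_to_pt(210)
a4_height_pt = mm_to_pt(297)
print(round(a4_width_pt, 2), round(a4_height_pt, 2))  # 595.28 841.89
```
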
Comparison to ImageMagick

Create a large test image:

    $ convert logo: -resize 8000x original.jpg

Convert it into PDF using ImageMagick and img2pdf:

    $ time img2pdf original.jpg -o img2pdf.pdf
    $ time convert original.jpg imagemagick.pdf

Notice how ImageMagick took an order of magnitude longer to do the
conversion than img2pdf. It also used twice the memory.

Now extract the image data from both PDF documents and compare it to the
original:

    $ pdfimages -all img2pdf.pdf tmp
    $ compare -metric AE original.jpg tmp-000.jpg null:
    0
    $ pdfimages -all imagemagick.pdf tmp
    $ compare -metric AE original.jpg tmp-000.jpg null:
    118716

To get lossless output with ImageMagick we can use Zip compression but
that unnecessarily increases the size of the output:

    $ convert original.jpg -compress Zip imagemagick.pdf
    $ pdfimages -all imagemagick.pdf tmp
    $ compare -metric AE original.jpg tmp-000.png null:
    0
    $ stat --format="%s %n" original.jpg img2pdf.pdf imagemagick.pdf
    1535837 original.jpg
    1536683 img2pdf.pdf
    9397809 imagemagick.pdf

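
Using the sizes reported by stat above, the container overhead and the
size increase can be worked out directly. This is plain arithmetic on the
numbers shown; the variable names are only for illustration.

```python
original = 1535837         # bytes, original.jpg
img2pdf_pdf = 1536683      # bytes, img2pdf.pdf
imagemagick_pdf = 9397809  # bytes, imagemagick.pdf (Zip compression)

# Bytes added by the PDF container when the JPEG is embedded as-is.
overhead = img2pdf_pdf - original
# How many times larger the Zip-compressed PDF is than the input.
ratio = imagemagick_pdf / original

print(overhead)         # 846
print(round(ratio, 2))  # 6.12
```

In this run the img2pdf output is 846 bytes larger than the input, while
the Zip-compressed ImageMagick output is roughly six times the input size.
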
Comparison to pdfLaTeX

pdfLaTeX performs a lossless conversion from included images to PDF by
default. If the input is a JPEG, then it simply embeds the JPEG into the
PDF in the same way as img2pdf does. But for other image formats it
uses flate compression of the plain pixel data and thus needlessly
increases the output file size:

    $ convert logo: -resize 8000x original.png
    $ cat << END > pdflatex.tex
    \documentclass{article}
    \usepackage{graphicx}
    \begin{document}
    \includegraphics{original.png}
    \end{document}
    END
    $ pdflatex pdflatex.tex
    $ stat --format="%s %n" original.png pdflatex.pdf
    4500182 original.png
    9318120 pdflatex.pdf

Comparison to podofoimg2pdf

Like pdfLaTeX, podofoimg2pdf is able to perform a lossless conversion
from JPEG to PDF by plainly embedding the JPEG data into the PDF
container. But just like pdfLaTeX it uses flate compression for all
other file formats, thus sometimes resulting in larger files than
necessary.

    $ convert logo: -resize 8000x original.png
    $ podofoimg2pdf out.pdf original.png
    $ stat --format="%s %n" original.png out.pdf
    4500181 original.png
    9335629 out.pdf

It also only supports JPEG, PNG and TIFF as input and lacks many of the
convenience features of img2pdf like page sizes, borders, rotation and
metadata.

Comparison to Tesseract OCR

Tesseract OCR comes closest to the functionality img2pdf provides. It is
able to convert JPEG and PNG input to PDF losslessly and without
needlessly increasing the filesize. So if your input consists of JPEG
and PNG images, you should be able to safely use Tesseract instead
of img2pdf. For other input, Tesseract might not do a lossless
conversion. For example, it converts CMYK input to RGB and removes the
alpha channel from images with transparency. For multipage TIFF or
animated GIF, it will only convert the first frame.
  159. conversion. For example it converts CMYK input to RGB and removes the
  160. alpha channel from images with transparency. For multipage TIFF or
  161. animated GIF, it will only convert the first frame.