PDF2ARCHIVE

A simple Ghostscript-based PDF to PDF/A-1B converter.

Requirements

Installation

Just download the latest release (or a zip archive of this repo if you want the bleeding-edge version) and unzip it somewhere. If necessary, open up a terminal in the folder that you've extracted and run chmod +x pdf2archive to make the script executable. You're done.

Usage

Take the PDF file you want to convert and put it in the folder where you've extracted the script. Open up a terminal in the same folder and run

  ./pdf2archive [options] input.pdf [output.pdf]

where input.pdf is the file you want to convert. The output file name is optional; if you don't specify it, the output file will be automatically called input-PDFA.pdf. You also have a number of optional arguments:

  -h, --help          Show the help
  --quality=<value>   Set  the  quality of  the  output  when  downsampling. The
                      possible values  are  'high',  'medium'  and  'low', where
                      'high' gives the highest output  quality. By specifying no
                      option, no additional downsampling is done.
  --title=<value>     Title of the resulting PDF/A file
  --author=<value>    Author of the resulting PDF/A file
  --subject=<value>   Subject of the resulting PDF/A file
  --keywords=<value>  Comma-separated keywords of the resulting PDF/A file
  --cleanmetadata     Clean  all the standard  metadata  fields, except the ones
                      specified via the command line options.
  --validate          Validate  the resulting file. The  validation is done with
                      VeraPDF, you need a working Java installation.
  --validate-only     Perform only the validation on the input file, again using
                      VeraPDF
  --debug             Write additional debug information on screen
  -v, --version       Show the program version

Examples

Convert 'input.pdf' to PDF/A-1B format; the output is 'input-PDFA.pdf':

  ./pdf2archive input.pdf

Convert 'input.pdf' to PDF/A-1B format; the output is 'output.pdf':

  ./pdf2archive input.pdf output.pdf

Convert 'input.pdf' to PDF/A-1B format and downsample it at high quality:

  ./pdf2archive --quality=high input.pdf

Convert 'input.pdf' to PDF/A-1B format and specify the document title:

  ./pdf2archive --title="Title of your nice document" input.pdf

Convert 'input.pdf' to PDF/A-1B format and validate the result:

  ./pdf2archive --validate input.pdf
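
If you have a whole folder of PDFs, a small wrapper can call the script once per file. What follows is only a sketch: the function names default_output_name and convert_all and the PDF2ARCHIVE variable are hypothetical and not part of pdf2archive. Only the --validate option and the input-PDFA.pdf naming come from this README.

```shell
#!/bin/sh
# Hypothetical helper (not part of pdf2archive): mirrors the documented
# default naming, i.e. input.pdf -> input-PDFA.pdf.
default_output_name() {
    base=${1%.pdf}
    printf '%s-PDFA.pdf\n' "$base"
}

# Hypothetical batch wrapper: converts and validates every PDF in a folder.
# PDF2ARCHIVE is an assumed variable pointing at the script; it defaults to
# ./pdf2archive in the current folder.
convert_all() {
    dir=$1
    for f in "$dir"/*.pdf; do
        # Skip the literal pattern when the folder contains no PDF at all.
        [ -e "$f" ] || continue
        # Skip files that look like the output of a previous run.
        case $f in *-PDFA.pdf) continue ;; esac
        "${PDF2ARCHIVE:-./pdf2archive}" --validate "$f" "$(default_output_name "$f")"
    done
}
```

After sourcing the snippet, something like convert_all ./theses would process every PDF in the (made-up) folder ./theses. The output name is passed explicitly, so the wrapper does not depend on how the script builds its default name for files that live in other folders.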

History & Motivation

This script was born out of necessity, when I had to convert the LaTeX-produced PDF of my MSc Thesis into a PDF/A-1B.

Once upon a time, the delivery of the Thesis had to be done manually, by burning a CD-ROM with the Thesis PDF on it. I don't need to say that it was extremely old-fashioned and inefficient, as you had to deliver the CD-ROM to the secretariat in person. Finally, in 2015, my university decided to activate the online submission of the PDF: you just had to upload your PDF and you were done, completely hassle-free.

Then one year ago, some enlightened mind in who knows which administrative office decided that a regular PDF was not easy enough; so, the university began to require the much more satanic PDF/A-1B. Of course, they had to provide a set of instructions for us mere mortals, so that we could produce valid PDF/A-1B files; and indeed they did, by uploading a fantastic document. If you took the (click)bait and read the PDF (not PDF/A-1B, eh!) instructions at the previously linked page, you might have noticed the absolute completeness of the information contained in it: there are instructions to transform a PDF into a PDF/A-1B by using either a Windows-only free program (yeah, I know), an obsolete OpenOffice plugin that doesn't work anymore, or paid, commercial programs that work at most on Windows and macOS. No free, cross-platform alternative because hey, everyone loves Windows!

Naturally, you can directly produce a PDF/A-1B version of your Thesis. The document lists some easy instructions to perform a direct export into PDF/A-1B from either Microsoft Word (or Excel, because of course there are people who write their thesis in Excel) or OpenOffice. Because everyone on Earth, especially people who do Physics or Maths, writes their thesis in Microsoft Word... they look sooo beautiful, in particular when you have to put in footnotes, citations and a table of contents, when Word spreads the text across the page zebra-style, and when you write those amazing equations in Comic Sans that get rendered as 10 DPI JPEGs. "And people who use LaTeX?" "Latex? What latex? I don't do that kind of dirty sex stuff!" would say the guy who wrote that document.

So you can imagine me and my friends, on the last available day for the Thesis delivery, still struggling to figure out how to convert our files. There is a nice site that converts PDFs into PDF/A-1B files, but there are some points:

By digging around on Google, you can find people saying that you can perform the conversion via Ghostscript by just turning on a couple of switches; unfortunately, this doesn't work (the online system, Esse3, keeps saying that the file is not valid), and the matter is slightly more complicated and poorly documented. The failure to produce a valid PDF/A-1B is connected to the complex set of requirements involved, especially font embedding, metadata and color space. This script is just a collection of all the things one should do in order to obtain (in most cases) a valid PDF/A-1B document from a regular PDF file, in the hope that it simplifies the whole process. It also contains a free, open source validator that can validate the resulting file (this validator was not included in the official instructions I've linked before, which instead point to paid, commercial products).
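
To give an idea of what is involved, here is a rough sketch of a Ghostscript call that touches the three requirements mentioned above (font embedding, metadata and color space). It is not the exact command that pdf2archive runs. The to_pdfa function name and the GS variable are hypothetical, and the definition file (here PDFA_def.ps, after the template shipped with Ghostscript) is assumed to have been edited beforehand so that it points to a valid ICC profile and carries the document metadata.

```shell
#!/bin/sh
# Sketch only, NOT the command used by pdf2archive.
# to_pdfa SRC DST [DEF]: hypothetical wrapper around Ghostscript.
# GS is an assumed variable naming the Ghostscript binary (default: gs).
to_pdfa() {
    src=$1
    dst=$2
    # PostScript definition file with ICC profile and metadata (assumed
    # to be prepared by hand from Ghostscript's PDFA_def.ps template).
    def=${3:-PDFA_def.ps}
    "${GS:-gs}" \
        -dPDFA=1 \
        -dBATCH -dNOPAUSE \
        -sDEVICE=pdfwrite \
        -sColorConversionStrategy=RGB \
        -dPDFACompatibilityPolicy=1 \
        -dEmbedAllFonts=true \
        -sOutputFile="$dst" \
        "$def" "$src"
}
```

Calling to_pdfa input.pdf output.pdf would then attempt the conversion. Even with all these switches the result may still fail validation, and recent Ghostscript versions may also need extra permissions to read the ICC profile, which is exactly the kind of poorly documented detail that makes the plain "couple of switches" approach unreliable.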

There are still cases in which this script produces valid PDF/A-1B files which are rejected by the system because they are "too complex" and the validator used by Esse3 (our 80s-style online system programmed by Topo Gigio) times out; unfortunately there is no solution that we can implement, as it's a server problem. The suspicion is that they're using the commercial version of an online validator (which seems to be the only free one still working), which has the same timeout problem when validating "too complex" PDF files. This small script, instead, shows that

So, now the question is: what is the university still waiting for?

FAQs

Licensing