Metadata: spanking clean

In the wake of the current uproar over metadata in Spain, I have been reviewing several tools for removing metadata from PDF files. I tested them on GNU/Linux, but some of them may well work on other systems too.

I started with a PDF I created myself. As you can see in the following image, it contains metadata (the screenshot is in Spanish, but I guess you get the idea):

[Image: Metadatos (metadata)]

The first tool tested was exiftool:

$ exiftool -all= clean.pdf

As you can see, there is apparently no metadata left (“Ninguno” means “none”):

[Image: Metadatos (metadata)]

However, the exiftool website says: “Note: Changes to PDF files are reversible because the original metadata is never actually deleted from these files. See the PDF Tags documentation for details.” So the metadata is hidden, not deleted.
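One rough way to spot this kind of reversible edit is to look at the raw bytes of the file. The sketch below assumes the tool appends its changes as a new revision at the end of the file, which would leave more than one `%%EOF` marker behind. The function names are mine, and this is only a heuristic: linearized PDFs can also contain several markers without any edit having taken place.

```python
def count_eof_markers(data):
    """Count the '%%EOF' markers in the raw bytes of a PDF file."""
    return data.count(b'%%EOF')


def looks_incrementally_updated(path):
    """Heuristic only: more than one '%%EOF' suggests appended revisions."""
    with open(path, 'rb') as handle:
        return count_eof_markers(handle.read()) > 1
```

Calling `looks_incrementally_updated('exiftool.pdf')` would then hint at whether an older revision, with its metadata, may still be sitting inside the file.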

Interesting. The question is: how do we recover the metadata? Those who know me probably know the answer: Python. Using the pyPdf library, I ran some tests:

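#!/usr/bin/env python2
# Print the document information dictionary of each PDF.
from pyPdf import PdfFileReader

fnames = ['dirty.pdf', 'exiftool.pdf']

for fname in fnames:
    # Keep the file open while pyPdf reads from it.
    with open(fname, 'rb') as f:
        pdf = PdfFileReader(f)
        print fname
        print pdf.getDocumentInfo()
        print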

If we run it, we get the following information:

dirty.pdf
{'/CreationDate': u"D:20131114101846+01'00'", '/ModDate': u"D:20131114101846+01'00'",
'/Producer': u'blablabla', '/Creator': u'blablabla', '/Author': u'blablabla'} 

exiftool.pdf
{'/CreationDate': u"D:20131114101846+01'00'", '/ModDate': u"D:20131114101846+01'00'",
'/Producer': u'blablabla', '/Creator': u'blablabla', '/Author': u'blablabla'}
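The date fields in that output are worth a second look, because they record the local time and the time zone in which the file was produced. As a small illustration, here is a sketch that decodes a value of the form shown above (`D:YYYYMMDDHHmmSS` followed by a signed `HH'mm'` offset). The function name is hypothetical, and shorter or `Z`-suffixed variants of the format are deliberately not handled.

```python
import re
from datetime import datetime, timedelta

# Matches the full form seen above, e.g. D:20131114101846+01'00'
PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:([+-])(\d{2})'(\d{2})'?)?$"
)


def parse_pdf_date(value):
    """Return (local time, UTC offset) for a full-form PDF date string."""
    match = PDF_DATE.match(value)
    if match is None:
        raise ValueError("unsupported PDF date: %r" % (value,))
    groups = match.groups()
    local = datetime(*[int(g) for g in groups[:6]])
    sign, off_hours, off_minutes = groups[6:]
    offset = timedelta(0)
    if sign:
        offset = timedelta(hours=int(off_hours), minutes=int(off_minutes))
        if sign == '-':
            offset = -offset
    return local, offset
```

Applied to the `/CreationDate` above, this gives the local timestamp plus a one-hour offset from UTC, which is already a hint about where the document was written.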

So, what alternatives do we have? We could look for other programs or write one ourselves. As you may have guessed, I’ve done both.

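First, I wrote a small program (in Python, of course) that deletes PDF metadata. It copies the pages into a new document, which does not inherit the original metadata, and then removes the /Producer entry that pyPdf adds by default: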

#!/usr/bin/env python2
# Copy the pages of a PDF into a new file, leaving the metadata behind.

from pyPdf import PdfFileReader, PdfFileWriter
from pyPdf.generic import NameObject
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("input")
parser.add_argument("output")
args = parser.parse_args()

fin = open(args.input, 'rb')
pdfIn = PdfFileReader(fin)
pdfOut = PdfFileWriter()

# Copy only the pages; the input's document information is not carried over.
for page in range(pdfIn.getNumPages()):
    pdfOut.addPage(pdfIn.getPage(page))

# pyPdf adds its own /Producer entry to the new document; remove it too.
info = pdfOut._info.getObject()
del info[NameObject('/Producer')]

# The input must stay open until the output has been written.
fout = open(args.output, 'wb')
pdfOut.write(fout)
fin.close()
fout.close()

It couldn’t be easier:

$ ./limpia.py dirty.pdf clean.pdf

The result is a PDF file named “clean.pdf”, built from “dirty.pdf”, this time with the metadata actually deleted:

$ ./test.py
clean.pdf
{}
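Since the earlier test showed that a clean-looking result can be misleading, a second opinion that does not depend on pyPdf seems reasonable. A minimal sketch, using only the standard library, is to search the raw bytes for the usual key names. The helper names are hypothetical, and the scan has limits: it cannot see inside compressed streams, and a hit inside page content would be a false positive.

```python
# Key names seen in the document information dictionary printed earlier.
INFO_KEYS = (b'/Author', b'/Creator', b'/Producer',
             b'/CreationDate', b'/ModDate')


def find_info_keys(data):
    """Return the metadata key names that occur anywhere in the raw bytes."""
    return [key.decode('ascii') for key in INFO_KEYS if key in data]


def scan_file(path):
    """Read a file and report which metadata key names it still contains."""
    with open(path, 'rb') as handle:
        return find_info_keys(handle.read())
```

An empty list from `scan_file('clean.pdf')` would be consistent with the `{}` above; any key it does report deserves a closer look.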

Alternatively, we could use MAT, which supports PDF and other file types:

$ mat dirty.pdf
[+] Cleaning dirty.pdf
dirty.pdf cleaned !
$ ./test.py
dirty.pdf
{}

Remember: eat vegetables and clean metadata. See you next time.