One of the problems with metadata is that many people are not aware that metadata even exists or could pose a threat. Because of this, metadata often leaks onto public web servers despite being easy to remove.
One tool that you can use to manipulate metadata in PDF files is the PDF Toolkit, or pdftk. Pdftk is a command-line tool, which makes it a great choice for scripting. It is available for Linux, Windows, and Mac. This article will demonstrate how to use pdftk on Linux to remove metadata from PDF files. I am using Ubuntu Linux for this article, but I have also used pdftk on CentOS. These directions should work on Windows or Mac, but I have not tested those platforms.
I have divided this article into three sections:
Installing The PDF Toolkit
Getting Started With pdftk
Scripting pdftk
Installing The PDF Toolkit
To get started, you will need to install the PDF Toolkit (referred to as pdftk). On Ubuntu, simply go to a command prompt and enter:
sudo apt-get install pdftk
APT will take care of the dependencies and install pdftk. As of this writing, the Ubuntu package is a little outdated (1.41) but still works for my purposes. If you would like the newest version, you can get it from the PDF Toolkit website at http://www.pdflabs.com/docs/install-pdftk/.
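Before moving on, it can help to confirm that the binary is actually on your PATH. The following is a minimal sketch; the function name find_tool is my own invention, and it relies only on the shell builtin `command -v`.

```shell
#!/bin/bash
# find_tool is a hypothetical helper name, not part of pdftk.
# It prints the full path of a command, or reports that it is missing.
find_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        command -v "$1"
    else
        echo "$1 not found on PATH" >&2
        return 1
    fi
}

# Example: locate pdftk without aborting when it is absent.
find_tool pdftk || true
```

The path it prints is the value you would later use for the pdftk binary in a script.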
Once you have pdftk installed, you will need a PDF document to analyze. If you do not have a PDF file available, a quick Google search using the "filetype:pdf" keyword should help you get started. My file is named sample.pdf. You will need to substitute your file name in the examples.
Getting Started with pdftk
The first step is to see what metadata is in your file. The command to do this is:
pdftk sample.pdf dump_data
When you enter this command, you will get output similar to the screenshot below.
While this is useful to view the data, we need to do more. We will put the metadata into a file so it can be manipulated. Use the same command with a little modification.
pdftk sample.pdf dump_data output pdf-metadata
This command will not print any output to the screen. Instead, it creates a file, pdf-metadata, that contains a copy of the metadata from sample.pdf. You will need to open the pdf-metadata file with the editor of your choice and remove the values from each InfoValue line. Also remove any other entries, such as bookmarks, page labels, or IDs. The pdf-metadata file should look like the screenshot below.
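If you would rather not edit the dump by hand, the same edit can be approximated with a small filter. This is only a sketch under my own assumptions: the function name blank_info is hypothetical, and it assumes the dump uses the InfoKey and InfoValue line format described above. If your pdftk version also emits InfoBegin marker lines, they are passed through unchanged. Everything else, such as bookmarks, page labels, and IDs, is dropped.

```shell
#!/bin/bash
# blank_info is a hypothetical helper, not a pdftk feature.
# It reads a dump_data file and prints only the Info entries,
# each with an empty InfoValue. All other lines are dropped.
blank_info() {
    awk '
        /^InfoBegin/ { print; next }                      # keep record markers if present
        /^InfoKey:/  { print; print "InfoValue:"; next }  # keep the key, blank the value
    ' "$1"
}

# Usage (file names follow the examples in this article):
# blank_info pdf-metadata > pdf-metadata.blank
```

The resulting file could then be given to update_info in place of the hand-edited one.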
Save the pdf-metadata file. Now we are ready to use that data to wipe the metadata from our sample file. The command to do this is:
pdftk sample.pdf update_info pdf-metadata output sample-no-metadata.pdf
This command will also not produce any output. It takes the original sample.pdf file and creates a copy named sample-no-metadata.pdf. The (lack of) metadata from pdf-metadata is used to overwrite the existing metadata. You can test this by using the command from earlier.
pdftk sample-no-metadata.pdf dump_data
You should see much less metadata now. Pdftk adds the Producer and ModDate metadata, but all of the other metadata is now gone!
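To check the result without reading the whole dump, you can list only the keys that still carry a value. This is a rough sketch that again assumes the InfoKey and InfoValue line format; the function name nonempty_keys is my own.

```shell
#!/bin/bash
# nonempty_keys is a hypothetical helper. It reads dump_data text on
# standard input and prints each InfoKey whose InfoValue is not empty.
nonempty_keys() {
    awk '
        /^InfoKey:/ {
            key = $0
            sub(/^InfoKey:[ \t]*/, "", key)      # remember the key name
            next
        }
        /^InfoValue:/ {
            value = $0
            sub(/^InfoValue:[ \t]*/, "", value)  # isolate the value text
            if (value != "") print key
        }
    '
}

# Usage: pdftk sample-no-metadata.pdf dump_data | nonempty_keys
```

For a cleaned file, only the entries that pdftk itself adds should be listed.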
Keep reading for tips on using pdftk in scripts for bulk metadata manipulation.
Scripting pdftk
The above directions are useful but there are much simpler ways to remove metadata from a single PDF document. The value of pdftk is in scripting. Below is my simple script to remove metadata from PDF documents in the /var/www/html directory structure.
#!/bin/bash
# Split only on newlines so that paths containing spaces survive the loop.
SAVEIFS=$IFS
IFS=$(echo -en "\n\b")
pdftk_path="/usr/local/bin/pdftk" # full path to pdftk binary
pdf_infokeys="$HOME/pdf-infokeys" # full path to file containing new metadata
pdf_search_path="/var/www/html" # path to search for pdf files
pdf_temp_path="/tmp" # temporary directory
for i in $( find "$pdf_search_path" -type f -name "*.pdf" ); do
    # Copy the original aside, then write the cleaned version over it.
    cp "$i" "$pdf_temp_path/temp.pdf"
    "$pdftk_path" "$pdf_temp_path/temp.pdf" update_info "$pdf_infokeys" output "$i"
    rm "$pdf_temp_path/temp.pdf"
done
IFS=$SAVEIFS
You need to modify the four variable assignments near the top (pdftk_path and the three that begin with pdf_) to match your environment. If you are familiar with Bash scripting, then you should not have trouble following this script. However, there is one part that requires further explanation. You need to create the pdf-infokeys file. This is a list of all of the InfoKey data that is found in your PDF files.
Here is a simple way to get that data. Start by opening a terminal window and running this command (modified to search your directory structure):
find /var/www/html -type f -name "*.pdf" -exec pdftk {} dump_data \; | \
grep -i infokey | \
sort -u > ~/pdf-infokeys
If you have password-protected PDF documents, this may produce a few errors. These can be safely ignored. You will end up with a pdf-infokeys file that contains something similar to this:
InfoKey: Author
InfoKey: Company
InfoKey: CreationDate
InfoKey: Creator
InfoKey: ModDate
InfoKey: Producer
InfoKey: SourceModified
InfoKey: Title
Open pdf-infokeys with your favorite editor (like vi) and modify it so it looks similar to this:
InfoKey: Author
InfoValue:
InfoKey: Company
InfoValue:
InfoKey: CreationDate
InfoValue:
InfoKey: Creator
InfoValue:
InfoKey: ModDate
InfoValue:
InfoKey: Producer
InfoValue:
InfoKey: SourceModified
InfoValue:
InfoKey: Title
InfoValue:
Now, save the file and you can use the script above.
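The collecting and editing steps can also be combined in one filter. A minimal sketch, assuming the dump contains one InfoKey line per entry as in the listings above; build_infokeys is a hypothetical name, and the commented usage line reuses the example search path from this article.

```shell
#!/bin/bash
# build_infokeys is a hypothetical helper. It reads dump_data output for
# any number of files on standard input and prints each distinct InfoKey
# once, followed by an empty InfoValue line.
build_infokeys() {
    grep '^InfoKey:' | sort -u | awk '{ print; print "InfoValue:" }'
}

# Usage (errors from password-protected files are discarded):
# find /var/www/html -type f -name "*.pdf" -exec pdftk {} dump_data \; 2>/dev/null |
#     build_infokeys > ~/pdf-infokeys
```

It is still worth opening the generated file once to confirm that it lists the keys you expect.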
Hopefully you found this useful. Please feel free to leave comments or questions below. Thanks for visiting.
Note: For an explanation of the $IFS variable from the script above, check out http://www.cyberciti.biz/tips/handling-filenames-with-spaces-in-bash.html. This is a workaround for dealing with spaces found in the path or file name used in Bash script loops.
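An alternative to changing $IFS is to let find hand each path directly to a small inline shell script, so the names are never split on whitespace. The sketch below assumes that approach. The function name strip_pdf_tree, the PDFTK override variable, and the use of mktemp are my own choices, not part of the original script.

```shell
#!/bin/bash
# Hypothetical variant of the loop: find passes each path to an inline
# script as a separate argument, so spaces need no IFS changes.
# PDFTK is an assumed override variable; it defaults to "pdftk" on PATH.
# Arguments: $1 = directory tree to search, $2 = file with blank InfoValue entries.
strip_pdf_tree() {
    find "$1" -type f -name "*.pdf" -exec sh -c '
        tool="$1"; keys="$2"; shift 2
        tmpdir=$(mktemp -d) || exit 1
        for pdf in "$@"; do
            # Work from a temporary copy, then write over the original.
            cp "$pdf" "$tmpdir/temp.pdf" &&
                "$tool" "$tmpdir/temp.pdf" update_info "$keys" output "$pdf"
        done
        rm -rf "$tmpdir"
    ' sh "${PDFTK:-pdftk}" "$2" {} +
}

# Usage (paths are examples):
# strip_pdf_tree /var/www/html "$HOME/pdf-infokeys"
```

As with the original script, this overwrites files in place, so try it on a copy of the directory first.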