FuzzyOCR inspired PDF scanning for SpamAssassin

October 16th, 2007

I’ve just stumbled across a PDF scanning engine for SpamAssassin. In light of the recent PDF spam making its way round the internet, I figured I’d give it a try.

This plugin scans both the body text of a PDF and its embedded images. Great, huh?

Here’s how I did it.

Download PDF.tgz from the @Mail blog.

Now install both pstotext and xpdf-utils (just to be sure!):

apt-get install xpdf-utils pstotext

(pstotext does a better job with password-protected PDFs.)
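
Before going any further, it may be worth confirming that the tools actually landed on your PATH. The following is a minimal sketch, not part of the plugin: `missing_tools` is a helper name I made up for illustration, and it simply prints any command it cannot find. I'm assuming here that the binary you want from xpdf-utils is `pdftotext`.

```shell
# Print each of the named commands that is not found on the PATH.
# Prints nothing when every command is available.
# (missing_tools is a hypothetical helper, not something shipped with the plugin.)
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Example:
#   missing_tools pdftotext pstotext gocr
```

If the example call prints nothing, all three are installed.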

I’ll assume you are already running gocr with a setup similar to the one in my FuzzyOCR for SpamAssassin on Ubuntu article.

Copy the Pdf.pm and pdf.cf files from the PDF.tgz to your SpamAssassin configuration directory. That’s it.
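
The copy step could be sketched roughly like this. The function name `install_pdf_plugin` is hypothetical, and both paths in the example are assumptions: I don't know how the tarball is laid out internally, and your SpamAssassin configuration directory may not be `/etc/spamassassin`, so adjust to taste.

```shell
# Copy the two plugin files from an unpacked PDF.tgz into a SpamAssassin
# configuration directory. Both directories are supplied by the caller.
# (install_pdf_plugin is a hypothetical helper name.)
install_pdf_plugin() {
    src_dir=$1
    conf_dir=$2
    # Refuse to copy anything unless both files are present.
    for f in Pdf.pm pdf.cf; do
        if [ ! -f "$src_dir/$f" ]; then
            echo "missing $src_dir/$f" >&2
            return 1
        fi
    done
    cp "$src_dir/Pdf.pm" "$src_dir/pdf.cf" "$conf_dir/"
}

# Example (paths are assumptions, run as root):
#   mkdir -p /tmp/pdf-plugin
#   tar -xzf PDF.tgz -C /tmp/pdf-plugin
#   install_pdf_plugin /tmp/pdf-plugin /etc/spamassassin
```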
If you have problems scanning the documents, change pdftotext to pstotext in the Pdf.pm file.
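
If you'd rather not edit the file by hand, a plain textual swap along these lines should do it. This is only a sketch: `use_pstotext` is a name I invented, the path in the example is an assumption, and whether the command's arguments also need adjusting depends on how Pdf.pm invokes the tool, which I haven't checked.

```shell
# Replace every occurrence of "pdftotext" with "pstotext" in the given file,
# keeping a copy of the original alongside it as <file>.bak.
# (use_pstotext is a hypothetical helper name.)
use_pstotext() {
    plugin_file=$1
    cp "$plugin_file" "$plugin_file.bak" &&
        sed 's/pdftotext/pstotext/g' "$plugin_file.bak" > "$plugin_file"
}

# Example (path is an assumption):
#   use_pstotext /etc/spamassassin/Pdf.pm
```

The `.bak` copy means you can put the original back if pstotext turns out to be no better for your mail.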
Finally, run:

spamassassin -D --lint
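
The debug output is long, so it can help to filter it down to the lines that mention the plugin. A minimal sketch, assuming that anything relevant will contain the string "pdf" somewhere (I can't promise what the plugin actually logs); `pdf_debug_lines` is a made-up helper name.

```shell
# Keep only the lines on standard input that mention "pdf" (any case).
# Succeeds even when nothing matches, so it is safe at the end of a pipeline.
# (pdf_debug_lines is a hypothetical helper name.)
pdf_debug_lines() {
    grep -i 'pdf' || true
}

# Example: SpamAssassin writes its debug output to stderr, hence the 2>&1.
#   spamassassin -D --lint 2>&1 | pdf_debug_lines
```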

There isn’t much documentation at the moment; all I can find is this @Mail blog article.
