kieranbarnes Independent PHP, WordPress & CubeCart Programmer

FuzzyOCR inspired PDF scanning for SpamAssassin

Posted on October 16, 2007

I've just stumbled across a PDF scanning engine for SpamAssassin. In light of the recent PDF spam making its way round the internet, I figured I'd give it a try.

This plugin scans the PDF body and any embedded images. Great, huh?

Here's how I did it.

Download PDF.tgz from the @Mail blog.

Now install both pstotext and xpdf-utils (just to be sure!).

apt-get install xpdf-utils pstotext

(pstotext does a better job with password-protected PDFs.)

I'll assume you are already running gocr with a setup similar to my FuzzyOCR for SpamAssassin on Ubuntu article.
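Before going further, it can help to confirm the command-line tools are actually on your PATH. This is only a minimal sketch: missing_tools is a helper name I made up, and I'm assuming the packages above provide the pdftotext, pstotext and gocr binaries.

```shell
#!/bin/sh
# Print the name of every tool given as an argument that is NOT on the PATH.
# Prints nothing if everything is installed.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only when the tool can be found
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

Run it as: missing_tools pdftotext pstotext gocr. Any name it prints still needs installing.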

Copy the Pdf.pm and pdf.cf files from PDF.tgz to your SpamAssassin configuration directory. That's it.
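The copy step could be scripted roughly like this. Treat it as a sketch under assumptions: install_pdf_plugin is a hypothetical helper, I don't know the directory layout inside PDF.tgz (so it searches for the two files), and /etc/spamassassin is only my guess at the usual configuration directory on Ubuntu.

```shell
#!/bin/sh
# Unpack the archive into a temporary directory and copy Pdf.pm and pdf.cf
# into the given SpamAssassin configuration directory.
install_pdf_plugin() {
    archive="$1"   # path to the downloaded PDF.tgz
    confdir="$2"   # e.g. /etc/spamassassin (assumption, check your own setup)
    workdir="$(mktemp -d)" || return 1
    tar -xzf "$archive" -C "$workdir" || { rm -rf "$workdir"; return 1; }
    for name in Pdf.pm pdf.cf; do
        # The layout inside the archive is unknown, so look for each file
        src="$(find "$workdir" -type f -name "$name" | head -n 1)"
        if [ -z "$src" ]; then
            echo "could not find $name in $archive" >&2
            rm -rf "$workdir"
            return 1
        fi
        cp "$src" "$confdir/" || { rm -rf "$workdir"; return 1; }
    done
    rm -rf "$workdir"
}
```

Something like install_pdf_plugin PDF.tgz /etc/spamassassin (run as root) should then leave both files in place.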
If you have problems scanning the documents, change pdftotext to pstotext in the Pdf.pm file.
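If you'd rather not edit the file by hand, a plain text substitution along these lines might do it. This is a sketch only: use_pstotext is a made-up helper name, and it assumes that replacing the command name everywhere in Pdf.pm is all that's needed, which I haven't verified against the plugin's code.

```shell
#!/bin/sh
# Replace every occurrence of pdftotext with pstotext in the given file.
# The original is kept alongside it with a .bak suffix.
use_pstotext() {
    plugin="$1"   # path to your copy of Pdf.pm
    sed -i.bak 's/pdftotext/pstotext/g' "$plugin"
}
```

For example: use_pstotext /etc/spamassassin/Pdf.pm (the path is an assumption). If it makes things worse, copy Pdf.pm.bak back over Pdf.pm.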
Finally, run:

spamassassin -D --lint
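The debug output is long, so filtering it for lines that mention the plugin makes it easier to see whether it was picked up. A small sketch: pdf_debug_lines is a hypothetical helper, and it simply matches "pdf" in any case because I'm not assuming the exact wording of the messages.

```shell
#!/bin/sh
# Read text on stdin and keep only the lines that mention "pdf" (any case).
pdf_debug_lines() {
    grep -i 'pdf'
}
```

SpamAssassin writes its debug messages to stderr, so redirect it first: spamassassin -D --lint 2>&1 | pdf_debug_lines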

There isn't much documentation at the moment; all I can find is this @Mail blog article.


Posted by Kieran

