kieranbarnes Independent PHP, WordPress & CubeCart Programmer

FuzzyOCR for SpamAssassin on Ubuntu

Posted on October 10, 2007

FuzzyOCR is a plugin for SpamAssassin that analyzes the content and properties of images to distinguish between normal mail and spam.

I've been running it on some mail servers for a few months now and I'm very happy with the results. As ever, the instructions are Ubuntu-centric.

Download the latest FuzzyOCR from http://fuzzyocr.own-hero.net/

You'll also need a reasonable list of prerequisites:

NetPBM Tools (apt-get install netpbm libnetpbm10 libnetpbm10-dev)
GifSicle (apt-get install gifsicle)

Next, GifLib or Libungif; it doesn't really matter which.
Download the latest version; it's a simple ./configure && make && make install

Now we need an OCR engine. I installed both Ocrad and Gocr from source, as the Ubuntu packages are a little old.

Finally, a whole bunch of Perl modules: String::Approx, Time::HiRes, MLDBM, MLDBM::Sync and Log::Agent.
Optionally, you can store the image hashes in a database. If you fancy it, install DBI (http://dbi.perl.org) and DBD::mysql.
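
Before moving on, it can save time to check that everything is actually installed. Here's a rough sketch of a checker. The function names are made up for illustration, and the list of commands is my assumption about which binaries the tools above provide, so trim it to suit your setup.

```shell
#!/bin/sh
# Report which FuzzyOCR prerequisites appear to be missing.
# The command list is an assumption; adjust it to the tools you installed.

# Succeeds if the named command is on the PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Succeeds if Perl can load the named module.
have_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# Print one line per missing prerequisite; print nothing if all are found.
check_focr_prereqs() {
    for cmd in giftopnm gifsicle giffix ocrad gocr; do
        have_cmd "$cmd" || echo "missing command: $cmd"
    done
    for mod in String::Approx Time::HiRes MLDBM MLDBM::Sync Log::Agent; do
        have_perl_module "$mod" || echo "missing Perl module: $mod"
    done
}
```

Call check_focr_prereqs after loading the functions into your shell. No output means everything on the list was found.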

Now we can configure FuzzyOCR. Put the FuzzyOcr.cf, FuzzyOcr.scansets, FuzzyOcr.preps and FuzzyOcr.pm files, as well as the FuzzyOcr/ folder, into /etc/mail/spamassassin.
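
If you do this on more than one server, the copy step can be wrapped up in a small function. This is only a sketch: install_focr is a hypothetical name, and it assumes the files sit directly in the directory you pass as the first argument.

```shell
#!/bin/sh
# Copy the FuzzyOCR plugin files from an unpacked source directory into the
# SpamAssassin config directory (second argument, defaulting to
# /etc/mail/spamassassin). install_focr is a hypothetical helper name.
install_focr() {
    src=$1
    dest=${2:-/etc/mail/spamassassin}
    for f in FuzzyOcr.cf FuzzyOcr.scansets FuzzyOcr.preps FuzzyOcr.pm; do
        cp "$src/$f" "$dest/" || return 1
    done
    # The FuzzyOcr/ folder holds the plugin's supporting modules.
    cp -R "$src/FuzzyOcr" "$dest/" || return 1
}
```

Run it as root with the path to the unpacked download as the only argument. It stops at the first file it can't copy.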

Have a read of FuzzyOcr.cf. I made a few changes, such as changing the log directory path.
I'd recommend changing
focr_enable_image_hashing
to
focr_enable_image_hashing 2
This stores the hashes in the MLDBM database.
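
If you'd rather script that change than edit the file by hand, something like the following should do it. set_cf_option is a hypothetical helper of my own, not part of FuzzyOCR. It only touches uncommented lines that start with the option name, and appends the setting if there isn't one.

```shell
#!/bin/sh
# Set "key value" in a SpamAssassin-style config file. An existing
# uncommented line for the key is replaced (duplicates are dropped);
# if no such line exists, the setting is appended at the end.
# set_cf_option is a hypothetical helper, not part of FuzzyOCR.
set_cf_option() {
    file=$1
    key=$2
    value=$3
    tmp=$(mktemp) || return 1
    awk -v key="$key" -v value="$value" '
        $1 == key { if (!done) { print key " " value; done = 1 } next }
        { print }
        END { if (!done) print key " " value }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}
```

For example: set_cf_option /etc/mail/spamassassin/FuzzyOcr.cf focr_enable_image_hashing 2. Writing back with cat keeps the file's existing owner and permissions.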

Create a word list. I just copied FuzzyOcr.words into /etc/mail/spamassassin.

Run spamassassin -D --lint to check that the configuration parses and the plugin loads without errors.

Now we can test. Download sample-mails.tar.gz from the FuzzyOCR page and extract it.
Finally, run

spamassassin --debug FuzzyOcr < ocr-gif.eml > /dev/null

And check for the FuzzyOCR entries in the log.
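
SpamAssassin also prints its debug output to stderr, so you can filter the FuzzyOcr lines straight out of the run instead of opening the log file. A small sketch, with made-up function names:

```shell
#!/bin/sh
# Print only the lines from stdin that mention FuzzyOcr (case-insensitive).
focr_lines() {
    grep -i 'fuzzyocr'
}

# Run one message through SpamAssassin and show only the FuzzyOcr debug lines.
# Debug output arrives on stderr, so stderr is sent into the pipe and the
# rewritten message on stdout is thrown away.
focr_debug() {
    spamassassin --debug FuzzyOcr < "$1" 2>&1 >/dev/null | focr_lines
}
```

Then focr_debug ocr-gif.eml shows just the plugin's lines. If nothing comes out at all, the plugin probably isn't being loaded, so go back to the lint step.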

Easy, eh? If you get stuck, check out the FuzzyOCR docs.


Posted by Kieran