FuzzyOCR for SpamAssassin on Ubuntu

October 10th, 2007

FuzzyOCR is a plugin for SpamAssassin that analyzes the content and properties of images to distinguish between normal mail and spam.

I’ve been running it on some mail servers for a few months now and I’m very happy with the results. As ever, the instructions are Ubuntu-centric.

Download the latest FuzzyOCR from http://fuzzyocr.own-hero.net/

Secondly, you need a reasonable list of prerequisites:

NetPBM Tools (apt-get install libnetpbm10 libnetpbm10-dev)
GifSicle (apt-get install gifsicle)

Next, GifLib or Libungif; it doesn’t really matter which.
Download the latest version; it’s a simple ./configure && make && make install

Now we need an OCR engine. I installed both Ocrad and Gocr from source, as the versions in the Ubuntu repositories are a little old.
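
Before going further, it can be worth confirming that the helper binaries actually ended up on the PATH. This is a minimal sketch and not part of the FuzzyOCR docs: the function name missing_tools is made up for illustration, and the command names (giftopnm for NetPBM, giffix for GifLib, plus gifsicle, ocrad and gocr) are my assumption about what each package installs, so edit the list to match your system.

```shell
# Print every command from the argument list that is not on the PATH.
# missing_tools is a hypothetical helper, not something FuzzyOCR ships.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumed binary names; adjust to whatever your packages provide.
missing_tools giftopnm gifsicle giffix ocrad gocr
```

No output means everything in the list was found. If giftopnm is reported missing even though the libnetpbm packages are installed, the command-line converters may live in a separate netpbm package on your release.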

Finally, a whole bunch of Perl modules: String::Approx, Time::HiRes, MLDBM, MLDBM::Sync and Log::Agent.
Optionally, you can store the image hashes in a database; if you fancy it, install DBI (http://dbi.perl.org) and DBD::mysql.
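
The same kind of check works for the Perl modules. The sketch below simply asks perl to load each module and prints the ones that fail; missing_perl_modules is a hypothetical helper name, and the module list is the one given above.

```shell
# Print every Perl module from the argument list that perl cannot load.
# missing_perl_modules is a hypothetical helper name.
missing_perl_modules() {
    for mod in "$@"; do
        perl -M"$mod" -e 1 >/dev/null 2>&1 || echo "$mod"
    done
}

# Required modules from the list above; append DBI and DBD::mysql
# if you plan to keep the hashes in a database.
missing_perl_modules String::Approx Time::HiRes MLDBM MLDBM::Sync Log::Agent
```

Anything it prints still needs installing, for example from CPAN.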

Now we can configure FuzzyOCR. Put the FuzzyOcr.cf, FuzzyOcr.scansets, FuzzyOcr.preps and FuzzyOcr.pm files, as well as the FuzzyOcr/ folder, into /etc/mail/spamassassin.
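
As a rough sketch of that copy step, assuming the tarball has been unpacked and the files sit at the top level of the source directory (the layout and the install_fuzzyocr name are assumptions for illustration):

```shell
# Copy the FuzzyOCR files from an unpacked source directory into the
# SpamAssassin config directory. The destination is an optional second
# argument so the function can be tried on a scratch directory first.
install_fuzzyocr() {
    src=$1
    dest=${2:-/etc/mail/spamassassin}
    mkdir -p "$dest" || return 1
    for f in FuzzyOcr.cf FuzzyOcr.scansets FuzzyOcr.preps FuzzyOcr.pm; do
        cp "$src/$f" "$dest/" || return 1
    done
    # The FuzzyOcr/ folder is copied recursively alongside the files.
    cp -R "$src/FuzzyOcr" "$dest/" || return 1
}

# Example (the default destination needs root; the path is a placeholder):
#   install_fuzzyocr ./fuzzyocr-source
```

The function stops at the first file it cannot copy, which makes a wrong source path obvious straight away.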

Have a read of FuzzyOcr.cf; I made a few changes, such as changing the log directory path.
I’d recommend changing
focr_enable_image_hashing
to
focr_enable_image_hashing 2
which stores the hashes in the MLDBM database.
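
If you would rather script that change than edit the file by hand, something like the following would do it. This is only a sketch: set_focr_option is a hypothetical helper, and it assumes FuzzyOcr.cf uses plain "option value" lines. It drops any active line for the option and appends the new setting, leaving commented-out lines alone.

```shell
# Set an option in a FuzzyOcr.cf style file: remove any active
# (uncommented) line for the key, then append "key value".
# set_focr_option is a hypothetical helper name.
set_focr_option() {
    cf=$1
    key=$2
    value=$3
    [ -f "$cf" ] || return 1
    tmp=$(mktemp) || return 1
    # grep exits non-zero when nothing is left to print; that is fine here.
    grep -Ev "^[[:space:]]*${key}([[:space:]]|\$)" "$cf" > "$tmp" || true
    printf '%s %s\n' "$key" "$value" >> "$tmp"
    # Write back through cat so the original file keeps its permissions.
    cat "$tmp" > "$cf"
    rm -f "$tmp"
}

# Example:
#   set_focr_option /etc/mail/spamassassin/FuzzyOcr.cf focr_enable_image_hashing 2
```

Take a backup of FuzzyOcr.cf first, since the file is rewritten in place.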

Create a word list; I just copied FuzzyOcr.words into /etc/mail/spamassassin.

Run spamassassin -D --lint

Now we can test. Download sample-mails.tar.gz from the FuzzyOCR page and extract it.
Finally, run

spamassassin --debug FuzzyOcr < ocr-gif.eml > /dev/null

And check for the FuzzyOCR entries in the log.
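
To run every sample in one go, a small loop helps. The sketch below assumes the extracted samples end in .eml (as ocr-gif.eml does) and that the plugin's debug lines contain the string FuzzyOcr; scan_samples and the SPAMASSASSIN override variable are names invented here, not part of SpamAssassin.

```shell
# Feed each .eml file in a directory to spamassassin and show only the
# debug lines that mention FuzzyOcr. Debug output goes to stderr, so
# stderr is sent down the pipe and the rewritten mail is discarded.
# SPAMASSASSIN is a hypothetical override for the command to run.
scan_samples() {
    dir=$1
    for mail in "$dir"/*.eml; do
        [ -f "$mail" ] || continue
        echo "== $mail"
        "${SPAMASSASSIN:-spamassassin}" --debug FuzzyOcr < "$mail" 2>&1 >/dev/null \
            | grep -i 'fuzzyocr' || echo "   (no FuzzyOcr lines)"
    done
}

# Example (the directory name is a placeholder for wherever you extracted):
#   scan_samples ./sample-mails
```

A sample that produces no FuzzyOcr lines at all usually means the plugin was not loaded, which is a good moment to look at the lint output again.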

Easy, eh? If you get stuck, check out the FuzzyOCR docs.

