FuzzyOCR for SpamAssassin on Ubuntu
October 10th, 2007
FuzzyOCR is a plugin for SpamAssassin that analyzes the content and properties of images to distinguish between normal mail and spam.
I’ve been running it on some mail servers for a few months now and I’m very happy with the results. As ever, the instructions are Ubuntu-centric.
First, download the latest FuzzyOCR from http://fuzzyocr.own-hero.net/
Secondly, you need a reasonable list of prerequisites:
NetPBM Tools (apt-get install libnetpbm10 libnetpbm10-dev)
GifSicle (apt-get install gifsicle)
Next, GifLib or Libungif; it doesn’t really matter which.
Download the latest version; it’s a simple ./configure && make && make install
Now we need an OCR engine. I installed both Ocrad and Gocr from source, as the versions in the Ubuntu repositories are a little old.
Finally, a whole bunch of Perl modules: String::Approx, Time::HiRes, MLDBM, MLDBM::Sync and Log::Agent.
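Before going further it can help to see which of those modules Perl can already load. The following is only a rough sketch: missing_perl_modules is a helper name I made up for illustration, not something FuzzyOCR ships.

```shell
#!/bin/sh
# missing_perl_modules is a hypothetical helper (not part of FuzzyOCR).
# It prints the name of every module given as an argument that Perl fails to load.
missing_perl_modules() {
    for mod in "$@"; do
        # perl -M<module> -e 1 exits non-zero when the module cannot be loaded
        if ! perl -M"$mod" -e 1 >/dev/null 2>&1; then
            echo "$mod"
        fi
    done
}

# Check the modules listed in this post; no output means all of them load.
missing_perl_modules String::Approx Time::HiRes MLDBM MLDBM::Sync Log::Agent
```

Anything it prints still needs installing, for example with cpan followed by the module name.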
Optionally, you can store the image hashes in a database; if you fancy it, install DBI (http://dbi.perl.org) and DBD::mysql.
Now we can configure FuzzyOCR. Put the FuzzyOcr.cf, FuzzyOcr.scansets, FuzzyOcr.preps and FuzzyOcr.pm files, as well as the FuzzyOcr/ folder, into /etc/mail/spamassassin.
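The copy step could be scripted roughly like this. It is a sketch only: install_focr_files is a hypothetical helper, and it assumes the files sit at the top level of the directory you unpacked the download into.

```shell
#!/bin/sh
# install_focr_files is a hypothetical helper (not part of FuzzyOCR).
# $1: directory holding the unpacked FuzzyOCR files (assumed layout)
# $2: SpamAssassin config directory, defaulting to the path used in this post
install_focr_files() {
    src=$1
    dest=${2:-/etc/mail/spamassassin}
    for f in FuzzyOcr.cf FuzzyOcr.scansets FuzzyOcr.preps FuzzyOcr.pm; do
        cp "$src/$f" "$dest/" || return 1
    done
    # -R copies the FuzzyOcr/ folder and everything below it
    cp -R "$src/FuzzyOcr" "$dest/" || return 1
}
```

Run it as root, passing the directory where you unpacked FuzzyOCR as the first argument.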
Have a read of FuzzyOcr.cf; I made a few changes, such as changing the log directory path.
I’d recommend changing
focr_enable_image_hashing
to
focr_enable_image_hashing 2
This stores the hashes in the MLDBM database.
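If you would rather script that change than edit the file by hand, something along these lines should do it. This is a sketch under assumptions: set_cf_option is a hypothetical helper, and it assumes the setting sits on a line of its own as the option name followed by its value.

```shell
#!/bin/sh
# set_cf_option is a hypothetical helper (not part of FuzzyOCR or SpamAssassin).
# It drops any active line that sets the option, then appends "key value".
# Commented-out lines are left alone.
set_cf_option() {
    file=$1
    key=$2
    value=$3
    tmp=$(mktemp) || return 1
    # Keep every line that does not start (after optional whitespace) with the key
    grep -Ev "^[[:space:]]*${key}([[:space:]]|\$)" "$file" > "$tmp"
    echo "$key $value" >> "$tmp"
    # Write back through cat so the original file keeps its owner and permissions
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

For the change above that would be: set_cf_option /etc/mail/spamassassin/FuzzyOcr.cf focr_enable_image_hashing 2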
Create a word list; I just copied FuzzyOcr.words into /etc/mail/spamassassin.
Run spamassassin -D --lint to check that the configuration parses cleanly.
Now we can test. Download sample-mails.tar.gz from the FuzzyOCR page and extract it.
Finally run
spamassassin --debug FuzzyOcr < ocr-gif.eml > /dev/null
And check for the FuzzyOCR entries in the log.
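To avoid scrolling through the debug output by eye, you can filter it. A small sketch, assuming the debug lines go to stderr and mention the plugin name; focr_debug_lines and the SA variable are names I made up for illustration.

```shell
#!/bin/sh
# SA is a hypothetical variable naming the spamassassin command to run
SA=${SA:-spamassassin}

# focr_debug_lines is a hypothetical helper: run one mail through SpamAssassin
# and print only the debug lines that mention FuzzyOcr.
focr_debug_lines() {
    # 2>&1 sends stderr into the pipe; >/dev/null then discards the rewritten mail
    "$SA" --debug FuzzyOcr < "$1" 2>&1 >/dev/null | grep -i 'fuzzyocr'
}
```

With that defined, focr_debug_lines ocr-gif.eml prints just the FuzzyOCR lines for that sample.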
Easy, eh? If you get stuck, check out the FuzzyOCR docs.