Image Spam Filtering
Date: 8/8/2006
Until something better comes along, I'm filtering spam email that contains images using this:

[Image: screenshot of the mail filter]

So if you want to send me an image, start with a "Hello, image on the way!" so that you get into my whitelist, THEN send the image. :)

The basic issue with "image spam" is that it bypasses the Bayesian filter by putting good "hammy" text in the body and the spam payload in an image, so purely textual methods are doomed to failure. I briefly thought about image comparison algorithms, and I believe I could write something to weed out all the duplicate image spam, but it would be a losing battle: the spammers would just end up varying their message more than they do now. So that's it: a blanket ban on email with images from senders I do not know.
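
For anyone who wants the same rule outside of a mail client, here is a minimal sketch in Python using only the standard `email` package. It is not the filter shown above; the names `has_image`, `should_reject` and `known_senders` are made up for illustration, and `known_senders` stands in for whatever whitelist or contact list you keep.

```python
from email import message_from_string
from email.message import Message
from email.utils import parseaddr


def has_image(msg: Message) -> bool:
    """True if any MIME part of the message is an image."""
    # walk() visits the message itself and every nested part.
    return any(part.get_content_maintype() == "image" for part in msg.walk())


def should_reject(msg: Message, known_senders: set) -> bool:
    """Blanket ban: reject mail that carries an image from an unknown sender."""
    # known_senders is a hypothetical whitelist of lowercase addresses.
    _, address = parseaddr(msg.get("From", ""))
    return has_image(msg) and address.lower() not in known_senders
```

Typical use would be something like `should_reject(message_from_string(raw_mail), {"friend@example.com"})`. Note that it never looks at the text, which is the whole point.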
Comments:
Josh
09/08/2006 8:35pm
Hmm, let's see: an addon that would run the attached file through OCR and then apply Bayesian filtering to the OCR text?
Josh
10/08/2006 6:23am
Just a question, what's the first condition in that filter? I can see "mail.From.Contact.Fol???" How does that end? I think I'd like to implement a similar filter since I don't get many emails with pictures from people I don't know. Thanks. (BTW, that last post was mostly a joke)
fret
10/08/2006 6:36am
Joke? I had already seriously entertained that idea (before you mentioned it).

The field is "mail.From.Contact.Folder", and it returns the folder name of the associated contact.
fret
10/08/2006 6:39am
Btw, the reason I didn't follow up the OCR idea is that even if it worked it would still be less than ideal: the hammy text would be added to the spam words database, slowly poisoning it with words that shouldn't be there and reducing the accuracy of the filter. By latching onto the fact that the message has an image, we can just delete it without passing it into the spam corpus, so it doesn't poison our word lists.

Ya?
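
A rough sketch of that idea in Python, purely as an illustration: the function name `file_spam` and the plain word-count `Counter` are assumptions for the example, not how the real spam database works.

```python
from collections import Counter


def file_spam(body_text: str, caught_by_image_rule: bool, spam_words: Counter) -> str:
    """Dispose of a spam message and report what was done with it."""
    if caught_by_image_rule:
        # Deleted outright: the hammy body text never reaches the word list.
        return "deleted"
    # Ordinary text spam still trains the filter.
    spam_words.update(body_text.lower().split())
    return "trained"
```

The only thing that matters here is the early return: image spam leaves `spam_words` untouched.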
fret
10/08/2006 6:40am
One other thing: I had to fix the "End With" operator for this to work correctly, so in the meantime use "Contains", which works just as well.
Josh
10/08/2006 6:38pm
I was thinking that the hammy text might mess up the corpus, but what if you indiscriminately removed the text in the email and replaced it with the text OCR'd from the picture (when placing it in the Spam folder)? Then, if an item comes in with a picture, the sender is first checked against the whitelist, then the main text is ignored and the text from the image is OCR'd. If no text comes from the OCR, then the message body is used for filtering. Would that work? What OCR program were you looking at, or did you even get that far?
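
Roughly, the flow I have in mind looks like this in Python. It is only a sketch: `text_for_filter` is a made-up name, and `ocr` is a placeholder for whatever OCR program ends up being used, passed in as a function that takes image bytes and returns text.

```python
from typing import Callable, Iterable, Optional


def text_for_filter(
    body: str,
    images: Iterable[bytes],
    sender: str,
    whitelist: set,
    ocr: Callable[[bytes], str],
) -> Optional[str]:
    """Pick the text the Bayesian filter should see, or None to skip filtering."""
    if sender in whitelist:
        return None  # Known sender: no filtering needed.
    # Ignore the body and OCR every attached image (ocr is a hypothetical hook).
    pieces = [ocr(image).strip() for image in images]
    ocr_text = " ".join(piece for piece in pieces if piece)
    # Fall back to the message body only when OCR produced nothing.
    return ocr_text if ocr_text else body
```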
fret
10/08/2006 11:53pm
Yes, I guess that's all possible, but it seems like a huge amount of work and it would also be somewhat fragile. OCR isn't known for being robust.

I hadn't picked out an OCR app yet.
 