On Poison Filtration
I recently found this in a piece of filtered-out spam awaiting its demise
in a purge queue. It was one of the ones peddling CDs full of harvested
email addresses. I won't go into the merits or truth of their claims, but
here's an excerpt:
    Remember those 200 million lines of addresses, here's what
    we did with them...
    1. Cleaned and eliminated all duplicates. This process,
    alone, reduced the list into a manageable number.
    2. Next, we brought in a filter list of 400+ words/phrases
    to clean even more. No addresses with inappropriate or
    profane wording survived!
    3. Then, a special filter file was used to eliminate the
    "Web Poisoned" e-mail addresses from the list. Our
    EXCLUSIVE system reduced these "poison" addresses to near
    4. Next we used our private database of thousands of known
    "extremists" and kicked off every one we could find. NOTE:
    We maintain the world's largest list of individuals and
    groups that are opposed to any kind of commercial
    e-marketing... they are gone, nuked!
On the one hand, most of this doesn't deserve the dignity of a response,
or indeed a repository anywhere outside /dev/null. However, they do cover,
approximately, the normal known methods for sanitizing spam databases. So
here are some brief correlated notes:
- Duplicate elimination: most of us who do legitimate programming work
call this "hashing," mostly because a hash table is one of the quicker ways to
render data unique. On a UNIX box it'd be quickest just to pipe the thing
through sort -u, but most spammers are subcompetents on Windoze machines with
no engineering experience. Be that as it may, it's irrelevant for poisoning.
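The hash-table approach amounts to something like this minimal sketch (Python
here for illustration; the addresses are made up):

```python
def dedupe(addresses):
    """Eliminate duplicates by hashing: a set is a hash table,
    so the membership test is O(1) on average."""
    seen = set()
    unique = []
    for addr in addresses:
        key = addr.strip().lower()  # normalize before hashing
        if key and key not in seen:
            seen.add(key)
            unique.append(addr.strip())
    return unique

# Equivalent in spirit to `sort -u`, minus the sorting.
print(dedupe(["a@example.com", "A@example.com ", "b@example.com"]))
# -> ['a@example.com', 'b@example.com']
```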
- There are two obvious counters to munged email addresses (e.g.
user.NOSPAM@example.com, to pick a hypothetical munge). One is simply to
drop any address that matches a possibly-munged word, of which the word
'spam' itself is almost certainly the most common. That works; it kills off
a fair number of legitimate addresses while also getting a significant
portion of the munged ones. The other is to try to reverse the munging.
Simply removing the word spam and adjusting nearby separators (@, .) is
easy enough. The number of permutations of munging is pretty large, though,
and the more of them one writes algorithms to counter, the more legitimate
data gets discarded. This ilk sells by quantity only -- there's no benefit
to the address-list marketers in lowering their numbers in the interests of
quality, which can't really be measured in a clandestine fashion anyway.
Somewhere in Usenet there's a poster who favored the posting address
firstname.lastname@example.org; this is roughly the sort
of thing that gets filtered out; so much the better. :)
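A de-munging pass of the sort described might look like this sketch (Python;
the particular munge tokens handled are assumptions, nowhere near the full
space of permutations):

```python
import re

def demunge(addr):
    """Try to reverse common address munging: strip 'spam'-style
    tokens, then tidy up the separators left behind."""
    # Remove the poison word itself, case-insensitively.
    addr = re.sub(r'(?i)(no[-_.]?spam|spam|remove[-_.]?this)', '', addr)
    # Collapse separator debris left where the token was removed.
    addr = re.sub(r'\.{2,}', '.', addr)           # "example..com" -> "example.com"
    addr = re.sub(r'\.(?=@)|(?<=@)\.', '', addr)  # "user.@" / "@.host"
    return addr.strip('.')

print(demunge("user.NOSPAM@example.com"))      # -> user@example.com
print(demunge("user@example.REMOVETHIS.com"))  # -> user@example.com
```

Note how quickly the legitimate-data problem shows up: any real address that
happens to contain "spam" as a substring gets mangled by the very first
substitution.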
- Poison filters. Ah, the important part relative to sugarplum.
Annoyingly, this paragraph doesn't drop many hints. The term "special
filter file" is meaningless -- "file full of regular expressions" is the
closest likely analog. The same principle applies here as with de-munging;
the number of permutations is pretty large. Some is easy -- addresses where
the username is a number (numbers aren't valid usernames on UNIX hosts
anyway); addresses with a number-to-letter ratio greater than, say, 0.4;
invalid TLDs; and so forth. Statistical filtration of addresses (e.g.
number-letter-punctuation ratios) will likely achieve some minimal success
against poison generated from byte-random output; this
will have a fairly high loss-rate of real data. Sugarplum's address
generation, as of 0.8.2 (as opposed to the fraction of its output that
consists of the addresses of known spammers) uses random dictionary words as
hostnames in the US TLDs, with randomly-generated usernames based on a
weighted sampling of letters and numbers in an RFC-valid fashion. The
weakest part of that tactic is that a DNS MX- and A-record lookup on each
address can generally remove most poisoned addresses whose names are made of
random words. A future sugarplum release may have the option to use
preselected known-good hostnames (especially those in culpable positions,
e.g. legislative bodies and ISP/NSPs who don't properly kill off their
spammers) for some fraction of the output. TLD selection is weighted in
.com's favor, and most every dictionary-word .com domain is taken, resulting
in a lot of valid MX returns especially for one-word hostnames, e.g.
word.com. A future version of sugarplum will also offer configuration of
how many dictionary words are joined to make the hostname -- weighted
strongly in favor of one, f'rinstance, to increase false-positives on MX
lookups.
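A filter of the statistical sort described above might be sketched like this
(Python; the 0.4 threshold and the TLD list are illustrative assumptions, and
the DNS MX/A lookup the text mentions is omitted since it needs network
access):

```python
VALID_TLDS = {"com", "net", "org", "edu", "gov", "mil", "us"}  # illustrative subset

def looks_poisoned(addr):
    """Heuristics from the text: all-numeric username, a high
    digit-to-letter ratio, or an unrecognized TLD."""
    if "@" not in addr:
        return True
    user, _, host = addr.partition("@")
    if user.isdigit():                          # numeric usernames: suspect
        return True
    letters = sum(c.isalpha() for c in user)
    digits = sum(c.isdigit() for c in user)
    if letters == 0 or digits / letters > 0.4:  # digit-heavy: suspect
        return True
    tld = host.rsplit(".", 1)[-1].lower()
    if tld not in VALID_TLDS:                   # unknown TLD: suspect
        return True
    return False

print(looks_poisoned("123456@example.com"))  # -> True
print(looks_poisoned("jsmith@example.com"))  # -> False
```

Against sugarplum's output this buys the spammer very little, which is the
point: dictionary-word hostnames in common TLDs with RFC-valid,
letter-weighted usernames pass every one of these checks.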