Detecting Spam with Regular Expressions was posted last week on the SANS site. It's a really interesting read on an approach to detecting spam, although generating the patterns takes a lot of CPU time.
I wonder what adding a new generation at set time intervals (or whenever a certain amount of new data has accumulated) and feeding it the new data each time would do to the reliability of the algorithm. I don't see it in the paper, but it seems like it should work, given the genetic approach. Just skimming the example code and ideas, without thinking about them too deeply, it looks possible to at least try. I also wonder whether it would suffer the same problem a lot of Bayesian implementations do, where the quality of filtering degrades over time. Filtering new input by similarity to the existing data set (or by similarity to the opposite set) might help there. Anyone want to experiment and post the results? Something like the sketch below is roughly what I have in mind.
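Here's a minimal Python sketch of the incremental-generation idea, not the paper's actual code: evolve a population of candidate regexes against spam/ham corpora, then, when a batch of new mail arrives, fold it into the corpora and run a few more generations on the same population instead of regenerating from scratch. The token list, scoring weights, and function names here are all my own assumptions for illustration.

```python
import random
import re

# Crude building blocks for candidate patterns (assumed, not from the paper)
TOKENS = ["viagra", "free", "click", "offer", r"\d{3,}", r"[A-Z]{4,}"]

def random_pattern():
    # Join a few tokens with .* to form a rough candidate regex
    return ".*".join(random.sample(TOKENS, random.randint(1, 3)))

def fitness(pattern, spam, ham):
    # Reward matching spam, penalize matching ham (false positives weighted heavier)
    try:
        rx = re.compile(pattern, re.IGNORECASE)
    except re.error:
        return -1.0
    hits = sum(1 for msg in spam if rx.search(msg))
    false_pos = sum(1 for msg in ham if rx.search(msg))
    return hits - 5 * false_pos

def mutate(pattern):
    # Either splice in another token or start over with a fresh pattern
    if random.random() < 0.5:
        return pattern + ".*" + random.choice(TOKENS)
    return random_pattern()

def evolve(population, spam, ham, generations=20):
    # Keep the top half each generation and fill back up with mutants
    for _ in range(generations):
        scored = sorted(population, key=lambda p: fitness(p, spam, ham), reverse=True)
        survivors = scored[: max(1, len(scored) // 2)]
        children = [mutate(random.choice(survivors)) for _ in range(len(population) - len(survivors))]
        population = survivors + children
    return population

def incremental_update(population, corpus_spam, corpus_ham, new_spam, new_ham):
    # The incremental part: extend the corpora with the new batch and
    # continue evolving the existing population for a few generations,
    # rather than regenerating all patterns from scratch.
    corpus_spam.extend(new_spam)
    corpus_ham.extend(new_ham)
    return evolve(population, corpus_spam, corpus_ham, generations=5)

if __name__ == "__main__":
    spam = ["FREE viagra click here 1000", "Special offer click now 555"]
    ham = ["Meeting moved to 3pm", "Here are the release notes"]
    pop = [random_pattern() for _ in range(20)]
    pop = evolve(pop, spam, ham)
    print(sorted(pop, key=lambda p: fitness(p, spam, ham), reverse=True)[:3])
```

The point of `incremental_update` is just to show where new data would be folded in; whether continuing an existing population actually avoids the Bayesian-style decay, or makes it worse, is exactly the part I'd want someone to test.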