Content Filtering

  from Diary and Other Rantings
Content filtering is doomed.

I started off a while back with a bad-word filter to try to keep some real foul-mouthed pages from appearing on some of my websites. It was based pretty much on George Carlin's 'seven words', with the addition of 'multipart/', because that string shows up disproportionately in comment spam.

Then a week or two ago, I noticed one of my 'urban' feeds was getting a bit raunchy, so I extended the word list with one I found on the net, which is a pretty good collection of raunchy innuendo, racist slurs, and hate speech terms.

I then found that one of my favorite 'tech' feeds was getting blocked for foul language. I had a look at the feed. It contained such bad terms as 'button', 'association', 'marketwatch', 'cockpit', 'documentation', etc. I'll let you figure out what parts of those words ended up getting red flags.
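Here is a rough sketch of how that kind of false hit happens, assuming the filter does plain substring matching. The function names and the two-entry word list are made up for illustration (mild stand-ins, not the actual list or the actual filter code). The second function shows one possible alternative, matching whole words only.

```python
import re

# Hypothetical stand-in blocklist; a real list would be much longer.
BLOCKLIST = ["ass", "butt"]

def substring_hits(text, blocklist=BLOCKLIST):
    """Naive check: flag any blocklist term found anywhere in the text."""
    lowered = text.lower()
    return [term for term in blocklist if term in lowered]

def whole_word_hits(text, blocklist=BLOCKLIST):
    """Stricter check: flag a term only when it stands alone as a word."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return [term for term in blocklist if term in words]

headline = "Press the button to join the association"
print(substring_hits(headline))   # ['ass', 'butt']
print(whole_word_hits(headline))  # []
```

Whole-word matching clears the innocent words, but it is no cure-all: it misses terms that are run together or deliberately misspelled, and it would not catch a marker like 'multipart/' that is not a plain word.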

So I've gone back to my seven words list. I also think that, rather than blocking flagged entries completely, a better idea would be to go ahead and import the articles and just mark them as censored. That way, articles that get flagged because of sub-words are still available to folks with a higher tolerance level, and the filter still does its intended job of keeping really foul language away from innocent visitors to the front page.
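The import-and-mark idea might look something like the sketch below. Everything in it is an assumption for illustration: the `Article` class, the `censored` flag, the `show_censored` option, and the placeholder word list (only 'multipart/' comes from the real list).

```python
from dataclasses import dataclass

# Placeholder terms; "multipart/" is the comment-spam marker mentioned earlier.
BLOCKLIST = ["badword", "multipart/"]

@dataclass
class Article:
    title: str
    body: str
    censored: bool = False

def import_article(title, body, blocklist=BLOCKLIST):
    """Always import the article, but mark it censored if any term matches."""
    text = f"{title} {body}".lower()
    flagged = any(term in text for term in blocklist)
    return Article(title, body, censored=flagged)

def front_page(articles, show_censored=False):
    """Hide censored articles unless the visitor has opted in to see them."""
    return [a for a in articles if show_censored or not a.censored]

articles = [
    import_article("Release notes", "New documentation is up."),
    import_article("Spam", "Content-Type: multipart/mixed"),
]
print([a.title for a in front_page(articles)])                      # ['Release notes']
print([a.title for a in front_page(articles, show_censored=True)])  # ['Release notes', 'Spam']
```

Nothing is thrown away at import time, so a false positive costs only a click for readers who opt in, and the default front page stays clean.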

It's an interesting challenge.