Mike Macgirvin
Diary and Other Rantings
Beyond Silicon Valley
   
Sunday, Jul 20 2008, 10:01 am
Oct 25, 2006
parental controls

The latest feature drop is a parental controls infrastructure. You can find out more towards the bottom of the help page. One of the key components is a tool that lets you turn any of the site features on or off - either for you or for your children. Combined with subscriptions, this lets you fine-tune exactly what you see when you visit. I put it to use immediately to provide different feature sets to the music portal vs. the community portal vs. the software engineering portals.

As always, this is just a hint at the types of things you can do. Any member can, for instance, turn off chat, the photo ratings, guitar chord charts, or anything else they don't care about.

I think I'm finally running out of website features to implement... 

 

 

Oct 20, 2006
Content Filtering

Content filtering is doomed.

I started off a while back with a bad word filter to try to keep some really foul-mouthed pages from appearing on some of my websites. This was based pretty much on George Carlin's 'seven words', with the addition of 'multipart/', because it tends to show up disproportionately in comment spam.

Then a week or two ago, I noticed one of my 'urban' feeds was getting a bit raunchy, so I extended the word list with one I found on the net - which is a pretty good collection of raunchy innuendo, racist slurs, and hate speech terms.

I then found that one of my favorite 'tech' feeds was getting blocked for foul language. I had a look at the feed. It contained such bad terms as 'button', 'association', 'marketwatch', 'cockpit', 'documentation', etc. I'll let you figure out what parts of those words ended up getting red flags.

So I've gone back to my seven words list. Rather than blocking flagged entries completely, I also think a better idea is to go ahead and import the articles and just mark them censored. That way, articles which get flagged because of sub-words are still available to folks with a higher tolerance level - but the filter still performs its intended task of keeping really foul language away from innocent visitors to the front page.
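
To make the sub-word problem concrete, here is a minimal Python sketch. The two-entry word list, the function names, and the 'censored' flag are all made up for illustration; this is not the filter that runs on this site. It shows the plain substring match that trips over 'button', a whole-word match as one possible alternative (my substitution, not something the post above settles on), and the import-and-mark idea.

```python
import re

# Hypothetical stand-in list; the real word list is not reproduced here.
BLOCKED = ["butt", "ass"]

def substring_hits(text):
    """Naive filter: flag a blocked term found anywhere, even inside a longer word."""
    lowered = text.lower()
    return [word for word in BLOCKED if word in lowered]

# \b anchors each term to word boundaries, so 'button' no longer matches 'butt'.
_WORD_RE = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in BLOCKED) + r")\b",
    re.IGNORECASE,
)

def whole_word_hits(text):
    """Stricter filter: flag a blocked term only when it stands alone as a word."""
    return [match.group(1).lower() for match in _WORD_RE.finditer(text)]

def import_article(text, matcher=substring_hits):
    """Always import the article; mark it censored instead of dropping it."""
    return {"text": text, "censored": bool(matcher(text))}
```

With the substring matcher, import_article("Click the button") comes back marked censored; with matcher=whole_word_hits it does not. Whole-word matching is no cure-all either, since it misses creative spellings, which is rather the point of this post.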

It's an interesting challenge.  

Oct 16, 2006
Fuzzy Tools

I've been watching the whole brouhaha at Seattle911.com with some interest. This is a homegrown website that takes live 911 feeds and puts them on a Google mashup. Cute and clever use of technology. The Seattle Fire Department responded by changing the logs to image format rather than text.

That's the background. Reading some of the articles pointed me to 'gocr', which is a free OCR package. Now this is useful - I wasn't previously aware of its existence. It basically takes an image, tries to pick out the text in it, and gives you that text. If you saw my article on comment spam, you'll realize that 'captcha' images to prevent spam are doomed. This is where you type into a box the letters you see in the picture. Most of these are annoying anyway, and it's pretty hard to get them past a clever command-line driven OCR program. If you make one so hard to read that gocr can't read it, chances are that none of your audience will be able to either.

But I have an even deeper interest in this stuff. Gocr is a framework for finding recognizable stuff in images. Something the world has needed for a while now is something that can filter porn. In theory there isn't much difference between distinguishing the letter 'b' in a picture (in any of 600 different fonts) and, say, a breast (in any of 600 different sizes/shapes). I'm being polite. Any anatomical feature.

Some folks worked on this problem back in the '80s, correlating the prevalence of what could be termed 'skin tones' in an image with the likelihood that it was porn.
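
A skin-tone measure of that sort can be sketched in a few lines of Python. Everything here is an assumption for illustration: the function names are invented, and the RGB thresholds are just a crude rule of thumb, not taken from any of the work mentioned above. It only measures how much of an image falls in a 'skin tone' range; deciding what fraction means anything is a separate problem.

```python
def looks_like_skin(r, g, b):
    """Crude RGB rule of thumb; the thresholds are illustrative assumptions."""
    return (
        r > 95 and g > 40 and b > 20
        and max(r, g, b) - min(r, g, b) > 15  # reject greys and near-white
        and abs(r - g) > 15
        and r > g and r > b                   # red channel dominates
    )

def skin_fraction(pixels):
    """Share of pixels, given as (r, g, b) tuples, that pass the rule above."""
    pixels = list(pixels)
    if not pixels:
        return 0.0
    return sum(looks_like_skin(*p) for p in pixels) / len(pixels)
```

For example, skin_fraction([(220, 170, 140), (0, 0, 255), (255, 255, 255), (200, 150, 120)]) gives 0.5. A beach photo and a close-up portrait would both score high, so this is a first-pass signal at best.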

The tools and concepts are out there. It shouldn't take much more than a man month or three to put them together into a porn filter. There's probably a market for such a thing.

OK, gocr is probably encumbered with the GNU General Public License. So maybe there isn't much of a market unless one just uses the general pattern recognition concepts (but not the code) and starts from scratch. I don't have anything against the GPL. It serves its purpose, but it does make it hard to re-use code in the workplace. It's a bummer to always have to start from scratch, when the software already exists and has been pretty much debugged. If I'm releasing code into the public domain, I always use either the Berkeley/Stanford license or no license at all. Free, no warranty, blah blah. The GPL is basically a self-replicating virus - which was written by lawyers instead of geeks. 

 

Comments:

mike
February 22, 2008 14:54

gocr is rubbish. it's been around forever and is still rubbish. it can't read clear text reliably, so how could it foil captcha? It's funny your article is about filtering porn too.. i just read today about a flop of a program australia spent what? 85 million on that a kid (who it was meant to protect) circumvented in what? minutes? stop wasting your time! ahah

 

oh my, i read the rest of the post. at first i thought you were only misinformed, but you're clearly an idiot. :( poor guy.


mike (Mike Macgirvin)
February 22, 2008 20:03

Ya' know I was going to agree with you that gocr is absolute junk, which I found out after fifteen minutes of evaluation back in 2006. It doesn't do any kind of fuzzy match and just looks at pixel comparisons in functions something like compare_a(), compare_b(), etc. Hardly general purpose pattern matching algorithms. I've since posted about doing this kind of pattern matching with heuristics, but it really doesn't matter because then you went into name calling and insult mode and I really don't owe you the time of day. 

Building software involves taking into account a great many factors - of which the license terms can be important if you're (as I was at the time) working for a corporation that has banned anything tainted with GPL. These employers leave you no choice but to build from scratch - which isn't a problem if you're any good at the job and want to continue receiving a paycheck. I do know a bit about the GPL - I've worked with it since the first draft of the license back in the late 80s. I choose not to use legal atrocities like this for my own projects. This makes them legally usable by more people than so-called 'open-source'. 

Bugger off.


Oct 16, 2006
Future of Online Gambling(?)

You may or may not be aware of the law which was signed by Bush a couple of days ago. It basically makes it illegal to take money from Americans in offshore gambling websites. This has resulted in chaos in the gaming industry (which is huge, 6 billion a year by industry estimates). Companies which were changing hands for hundreds of millions of dollars a couple of weeks ago suddenly went on the block and sold for a buck.

But it is interesting that the sites didn't actually close, and the companies actually did sell to somebody rather than close entirely (like my music store last year). This means that they will continue to operate.

The first big change is that many of them are now refusing to take money from Americans - and will just take their money from anywhere else in the world. This is temporary while the lawyers work on the problem. Turns out the big money in online poker comes from - you guessed it... Americans.

Maybe you remember all the assault weapons bans of ten to fifteen years ago. Assault weapons were likewise a huge money-making business that suddenly faced having no revenue stream. This is where the lawyers make their money. Every law has a loophole. The job of the lawyers is to find all the loopholes. You can buy pretty much any assault rifle today that you could've fifteen years ago. The only difference is that you might not get a flash suppressor or a pistol grip, which you would then buy from somebody else if you really wanted one.

In a few months we'll come back and see that online poker is thriving, just as it always was - and they'll be taking money from Americans, just as they always have. We've found time and time again that you can't outlaw easy money. If the laws are unbeatable, the money will just go underground. It will always flow. But relax, we're nowhere close to underground gambling. You might have to wire your money to an insurance fund in Madagascar, but rest assured that the lawyers will find a completely legal way to take all that easy money off your hands.  

Oct 09, 2006
mp3.com

In my quest for embeddable flash music players that aren't annoying, I somehow ended up at mp3.com. I have some history with them, as I used to host my Maxwell Silverthorn recordings there.

That is until they got bought out by C|Net a few years ago (and wasn't C|Net in turn bought by somebody else? I forget...). Then they stopped supporting 'free artists' and decided that making money was more important. This was, after all, during the death of dot-com version 1, when it was decided that making money was actually a good thing for a business to do.

So with little warning, I was dropped by mp3.com. And this little website called garageband.com ended up buying or taking all of the once free material from mp3.com and offering me a membership on their site instead. I took it. Mp3.com then became one of the folks competing against Napster and iTunes and a hundred other sites to try to get licenses with all the major music studios and make real bucks. They never quite made it into the top tiers of that battle, leaving them desperate for new business opportunities.

That's the history.

So today when I go to mp3.com, I see an ad for a brand new feature they're beta testing which allows independent/free music artists to upload their own material and have it hosted on mp3.com. And they can get their own artist pages with a blog and bio page. Wow! This is like deja vu all over again! I used to have such a thing at mp3.com until they threw it away.

 

Oct 08, 2006
Nasty little bug in mod_rewrite

Actually I think the bug is in Apache, but whatever.

Let's say you've got a site like this that uses clean URLs.

http://someplace.somewhere/something/more/stuff

Now let's say you want to insert a category name in the middle of this URL, and the category name contains a slash. Let's say 'more/less' instead of 'more'. But you can't change the number of slashes, because in the example 'stuff' isn't a category, it's something else.

So what you would normally do in PHP (and many other languages) is urlencode() the name. This gives you a category that looks like 'more%2Fless'. Now you can just urldecode() it and turn it back into a legal category name without messing up the URL.

But the problem is that if you use mod_rewrite to support clean URLs, it currently decodes the URL in the process of doing its work - before you ever see it. So there's no way of knowing if a category has a slash or not. Ditto for hash and several other characters. It turns out the bug is actually in Apache, which is decoding the URL before it hands it off to mod_rewrite, but that part doesn't matter. If you don't use mod_rewrite you won't see the bug. Some of us have to use it though.

This violates the primary law of encoding information - you must have one and only one decoder for every encoder. Apache/mod_rewrite is decoding something it has no right to touch.   

Fortunately, there's a way out of this mess, but it's very non-standard. You have to further encode the URL so that it can't get automatically decoded by the middle software layers. You could just double-encode it, which works today, but then if they ever fix the bug, you'll end up with a bad decoding. I'm currently turning %2F into ^2F, since ^ is one of the few characters which isn't normally used in a URL. This then gets encoded by the browser to be %5E2F. It doesn't matter if the %5E part is turned back into a ^, or even decoded more than once (the second and subsequent decodes will essentially be a no-op). All that matters is that the slash remains encoded all the way through this hostile communications channel.
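
Here's a rough sketch of the trick in Python rather than the PHP running on this site, using the standard library's quote() and unquote() in place of urlencode() and urldecode(). The function names are made up, and swapping every '%' for '^' (not only the one in %2F) is my generalization for the example, so treat it as an assumption rather than the exact code in use here.

```python
from urllib.parse import quote, unquote

def encode_segment(name):
    """Percent-encode one path segment, then swap '%' for '^' so the
    middle layers find nothing they recognise as an escape to decode."""
    # safe="" makes quote() encode '/' as well; a literal '^' becomes %5E first,
    # so every '^' in the result is one of our own markers.
    return quote(name, safe="").replace("%", "^")

def decode_segment(segment):
    """Reverse encode_segment(), whether '^' arrives as '^' or as %5E."""
    # Undo any percent-encoding the browser added to '^' on the way in.
    segment = unquote(segment)
    # Turn the markers back into real escapes and decode them exactly once.
    return unquote(segment.replace("^", "%"))
```

So encode_segment("more/less") produces 'more^2Fless', the URL keeps its original number of slashes, and decode_segment() returns 'more/less' whether it receives 'more^2Fless' or 'more%5E2Fless'.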

What an ugly hack.

I'd go into Apache and fix it, but that won't help me in the short term. I'd have to wait for the patch to get rolled into a future release, and then wait for my service provider to pick up the later release. That could be a year or more unless some urgent security bug pops up. So I guess I'd better get used to living with this hack.

 

Oct 03, 2006
First day of work

Started the new job today. I'm working for a large PC security firm doing web stuff.

That's about all I can tell you right now. 
