cover photo

Mike Macgirvin

mike@macgirvin.com

fortune_to_html in PHP

Mike Macgirvin
  last edited: Wed, 26 Nov 2008 22:04:48 +1100  from Diary and Other Rantings
One of the problems with using the Unix/Linux fortune (aka fortune-mod) command in web pages is making it readable in HTML. One can provide something that mostly works by substituting any HTML entities (&,<,>, and double quote) and converting linefeed to <br />.

However you're still going to get a lot of fortunes with unprintable characters where the original intent was lost - as many of these used 'backspace hacks' to provide character underlines, accent marks, and on really old fortune databases, using backspace to strike out text and replace it with something more amusing.

Here is a function that should make 99.999% of the fortunes you may encounter that use weird ASCII tricks display in web pages mostly as originally intended.

<?php

function fortune_to_html($s) {

  // First pass - escape all the HTML entities, and while we're at it
  // get rid of any MS-DOS end-of-line characters and expand tabs to
  // 8 non-breaking spaces, and translate linefeeds to <br />.
  // We also get rid of ^G which used to sound the terminal beep or bell
  // on ASCII terminals and were humorous in some fortunes.
  // We could map these to autoplay a short sound file but browser support
  // is still sketchy and then there's the issue of where to locate the
  // URL, and a lot of people find autoplay sounds downright annoying.
  // So for now, just remove them.

  $s = str_replace(
    array("&",
          "<",
          ">",
          '"',
          "\007",
          "\t",
          "\r",
          "\n"),

    array("&",
          "<",
          ">",
          """,
          "",
          "        ",
          "",
          "<br />"),
    $s);

  // Replace pseudo diacritics
  // These were used to produce accented characters. For instance an accented
  // e would have been encoded by '^He - the backspace moving the cursor
  // backward so both the single quote and the e would appear in the same
  // character position. Umlauts were quite clever - they used a double quote
  // as the accent mark over a normal character.

  $s = preg_replace("/'\010([a-zA-Z])/","&\\1acute;",$s);
  $s = preg_replace("/\"\010([a-zA-Z])/","&\\1uml;",$s);
  $s = preg_replace("/\`\010([a-zA-Z])/","&\\1grave;",$s);
  $s = preg_replace("/\^\010([a-zA-Z])/","&\\1circ;",$s);
  $s = preg_replace("/\~\010([a-zA-Z])/","&\\1tilde;",$s);

  // Ignore multiple underlines for the same character. These were
  // most useful when sent to a line printer back in the day as it
  // would type over the same character a number of times making it
  // much darker (e.g. bold). I think there are only one or two
  // instances of this in the current (2008) fortune cookie database.

  $s = preg_replace("/(_\010)+/","_\010",$s);

  // Map the characters which sit underneath a backspace.
  // If you can come up with a regex to do all of the following
  // madness  - be my guest.
  // It's not as simple as you think. We need to take something
  // that has been backspaced over an arbitrary number of times
  // and wrap a forward looking matching number of characters in
  // HTML, whilst deciding if it's intended as an underline or
  // strikeout sequence.

  // Essentially we produce a string of '1' and '0' characters
  // the same length as the source text.
  // Any position which is marked '1' has been backspaced over.

  $cursor = 0;
  $dst = $s;
  $bs_found = false;
  for($x = 0; $x < strlen($s); $x ++) {
    if($s[$x] == "\010" && $cursor) {
      $bs_found = true;
      $cursor --;
      $dst[$cursor] = '1';
      $dst[$x] = '0';
      $continue;
    }
    else {
      if($bs_found) {
        $bs_found = false;
        $cursor = $x;
      }
      $dst[$cursor] = '0';
      $cursor ++;
    }

  }

  $out = '';
  $strike = false;
  $bold = false;

  // Underline sequence, convert to bold to avoid confusion with links.
  // These were generally used for emphasis so it's a reasonable choice.
  // Please note that this logic will fail if there is an underline sequence
  // and also a strikeout sequence in the same fortune.

  if(strstr($s,"_\010")) {
    $len = 0;
    for($x = 0; $x < strlen($s); $x ++) {
      if($dst[$x] == '1') {
        $len ++;
        $bold = true;
      }
      else {
        if($bold) {
          $out .= '<strong>';
          while($s[$x] == "\010")
             $x ++;
          $out .= substr($s,$x,$len);
          $out .= '</strong>';
          $x = $x + $len - 1;
          $len = 0;
          $bold = false;
        }
        else
          $out .= $s[$x];
      }
    }
  }

  // These aren't seen very often these days - simulation of
  // backspace/replace. You could occasionally see the original text
  // on slower terminals before it got replaced. Once modems reached
  // 4800/9600 baud in the late 70's and early 80's the effect was
  // mostly lost - but if you find a really old fortune file you might
  // encounter a few of these.

  else {
    for($x = 0; $x < strlen($s); $x ++) {
      if($dst[$x] == '1') {
        if($strike)
          $out .= $s[$x];
        else
          $out .= '<strike>'.$s[$x];
        $strike = true;
      }
      else {
        if($strike)
          $out .= '</strike>';
        $strike = false;
        $out .= $s[$x];
      }
    }
  }

  // Many of the underline sequences are also wrapped in asterisks,
  // which was yet another way of marking ASCII as 'bold'.
  // So if it's an underline sequence, and there are asterisks
  // on both ends, strip the asterisks as we've already emboldened the text.

  $out = preg_replace('/\*(<strong>[^<]*<\/strong>)\*/',"\\1",$out);

  // Finally, remove the backspace characters which we don't need anymore.

  return str_replace("\010","",$out);
}
NameThingy

Mike Macgirvin
  from Diary and Other Rantings
Since I stopped actively updating this site several months back, it appears the bulk of the incoming traffic has been visiting my various random name generators.

I decided to clean these up a bit and spin them off onto a dedicated site. You can visit it at NameThingy.com. It was quite a fun exercise, as I've managed to reduce the random name generation and all the potential options to a single HTML page that is dynamically refreshed using Ajax. Check it out.
Stoppping XSS forever - and better web authentication

Mike Macgirvin
  last edited: Thu, 21 Aug 2008 11:46:11 +1000  from Diary and Other Rantings
I've been working on all kinds of different ways to completely stop XSS (and potentially the related CSRF) and provide a much better authentication framework for web applications.

The problem:

The HTTP protocol is completely stateless. On the server side each and very page access starts with zero knowledge of who is at the other end of the connection. In order to provide what were once considered 'sessions' in the pre-web computing days, the client is able to store a 'cookie' which is sent from the server, which is sent to every page within that domain. The server can look at this cookie and use it to bind a particular person who has presumably passed authentication so they don't have to re-authenticate.

But cookie storage has some serious flaws. If somebody who isn't the specified logged-in person can read the cookie, they can become that person. IP address checks can help to provide extra verification but in a world containing proxies this information can be spoofed.

Cross Site Scripting is a method whereby a malicious person who is allowed to post HTML on a page can inject javascript code which is then executed on a registered user's session and the cookie is leaked or sent elsewhere - allowing the malicious person to impersonate the registered person.

A possible solution:

I'm still working out the details so please let me know if this is flawed, but I think I've got a way to prevent XSS and still allow registered members to post full HTML, CSS, whatever - including javascript.  It relies on the fact that cookies are stored and used per-domain. Different domains are unable to see cookies from another domain.

We'll also assume SSL connections since anything else can leak everything (cookies, passwords, everything) to a port sniffer.

We'll start with a normal website at https://example.com - which we'll assume is a multi-user website where XSS could be a problem. If somebody on this site can inject javascript onto a page, they can steal the cookies of a logged-in user. There are hundreds of ways to do this that are beyond the scope of this discussion.

But we'll also create another domain - say https://private.example.com - which processes logins and does not serve content. This will have a different cookie than example.com. Perhaps we'll let it serve the website banner image just so that it is accessed on every page of the site. Since there is no active content allowed, it is immune to XSS eploits.

It is allowed to process login requests and send cookies, and one image. That's it.  

What this means from an attacker's viewpoint is that he/she now needs to steal two cookies to impersonate somebody else.  It may be easy to steal the cookie on the main site, but there's no way to get at the cookies for the private.example.com site since it isn't allowed to host active content.

The main site uses out-of-band methods (not involving HTTP) to communicate between the two domains and establish that the session is valid and authenticated. They're both hosted in the same place after all. It can check a file or database to see that the logged in session was authenticated by the other site. Both keys (cookies) have to match or the authentication is denied.

Anybody see a flaw in this? Granted I still haven't thought it through completely and haven't yet tested it, but I don't see any glaring problems on the surface. Some variation of this concept will probably work and both prevent XSS as well as provide a better way of doing web authentication that is much more resistant to intrusion.    

Again assuming https to prevent snooping, the only way I can see to steal both cookies and impersonate a logged-in user is to have access to the target person's desktop and browser.

It also allows a site to completely separate the authentication mechanism from the content server allowing the authentication code to be small, simple, self-contained, and verifiable.
Mike Macgirvin
  
An obvious flaw which quickly became apparent was using an image/entity on the main page to link to the auth server - as the page would then need to be rendered before authentication can succeed. This is backward because you usually want to know the authentication state before you provide content.

So the best way to work this is to use a redirect out front to ensure both domains are accessed before the page is rendered. This in fact matches what many larger sites do for authentication, a separate auth server which passes through to the request server. Using a second session key in another domain to neutralize any effect of stealing the primary session key I believe is relatively rare in practice, although it may be implemented on these larger sites. The basic concept can be applied to small hosted sites very easily without requiring multiple machines and a data cloud architecture. This is what makes it attractive - it can be easily added into any existing hosted community software.  

Also, there are many other reasons why you would want to limit the ability to use javascript on community pages - but these should be to reduce potential annoyance and disruptive behaviour rather than to protect the integrity of your authentication. There are just way too many ways to get javascript into a page to try and protect them all from sessionid theft. But if sessionid theft has no gain, such script restrictions are a matter of choice rather than an absolute neccessity.
Reflection CMS update

Mike Macgirvin
  last edited: Wed, 23 Jul 2008 13:44:25 +1000  from Diary and Other Rantings
At this time, I've managed to pull together a working kernel and prototype of the Reflection CMS. It is not yet ready for public release, but I've been pleased with the progress. Here's a bit of a white paper I've been putting together to explain the rationale and provide a high level overview.

                 Reflection Content Management System

Purpose:

Web content management systems and frameworks that exist today are clunky, overly-complicated, and often insecure. While many of the open source projects are developer friendly and openly encourage derivation, there is often a group that jealously protects the 'core' from feature creep. This makes it difficult to realise many web designs; as it is often the core that is insufficient to the task at hand. Being developer friendly does not mean that an application provides a workable development environment. Add-on modules often cannot be trusted - as they often reflect the work of novice software designers who have had to overcome the limitations of the core product.

In an effort to appeal to the most people, data abstraction is taken to new levels of absurdity and inefficiency. This is not limited to content management systems, as it is a software problem in general.

What I have attempted in taking on this gargantuan task of creating yet another content management system is to solve many of these problems, and to create a system that is extensible and encourages development at all levels - including the so-called core. To that end - most every function can be over-ridden without introducing serious versioning and update issues/incompatibilities. Nothing is sacred.  

The more that I mulled this task, the more it became apparent that what I was looking for in a content management framework is no less than an operating system for web pages. This involves user management, security, and the ability to execute arbitrary 'applications'. It also involves a notion of a file system hierarchy which can be represented entirely by URLs.

Many other content systems abstract data types, and this is a good idea; though it often makes for messy designs. At the heart is a generic nucleus of a content - who owns it, what the permissions are, various timestamps, etc. Data fields that are unique to a particular content item are stored elsewhere and joined on demand.

Implementation of this level of abstraction is a challenging problem. Due to design limitations of most database systems, it involves some tradeoffs - primarily in the ability to perform searches on extended data of multiple extensible data types. For a single type, it can be done with one query. However when multiple data types are involved, a second pass needs to be run to return the extended data for each item. For this reason, it is prudent to store as much 'searchable' information as practical within the nucleus.

There is also general agreement over using themes and templates at the presentation end, so that different renderings are possible without hacking code. Here I'd like to take it one step further and modularise the entire presentation layer. As well as a 'theme', once can choose a particular layout or representation of objects, such as a choice between list view and iconic view, and/or XML feed elements. By making this extensible and arbitrary, entirely new renderings can be accomplished without touching the object code or business logic.

Permissions System

Permissions are the core of any multi-user system. This needs to be well defined, and implemented close to the kernel or core and far away from the presentation layer. In a development environment, the developers should mostly be free of managing permissions. I've implemented a permissions concept similar to Unix/Linux - although modified for better adaptability to web applications. It uses the familiar rwx concept, but I've split the 'x' permission into 'x' and 'u'. 'x' is simply a list permission. 'u' is an ability to use or extend an item. For an article, the 'u' bit allows comment rights. For a vocabulary, it allows the ability to tag something using that vocabulary. I've also introduced higher level permissions. There are six levels:  
  • rwxu admin  
  • rwxu moderators  
  • rwxu owner  
  • rwxu group  
  • rwxu members  
  • rwxu other (aka visitors)
Members is for logged in members. Group is a group association to a unique group identifier, moderators are site moderator accounts. Admin privileges are included in the permissions flags for completeness; though it isn't obvious what value this serves and in most cases these will be masked to prevent locking out the system admin from managing the system.

The Directory Object

The directory or folder object is the primary means of implementing complex data structures and representations. It is an object like any other object on the system, but when navigated to, presents a listing of those items which are attached to it as siblings. It implements a general purpose search and list/enumerate operation. It also contains a path/filename to distinguish it in the URL hierarchy and provide file system semantics to database objects. However, the important items that it contains are a umask (permissions mask) which is applied to any child items, and it can also be configured only to hold items of certain types. This is what distinguishes a photo album from a weblog or forum list. One holds photos and the others hold articles. By allowing a directory to hold any type of content, it can be made to resemble a traditional filesystem; and indeed a multi-user website can be implemented which provides member sub-sites that they manage completely.  

The directory also has complete control over the presentation layer, via themes, renderings, and menu selection. This implies that directory is not simply a 'list', but the complete embodiment of the controls, settings, and the look of that list. These can be inherited and passed on to sub-directories. A limitless range of site policy and structure can be implemented by controlling the settings of the appropriate directory entries.

Applications

Applications or executable code lives outside the virtual directory tree. In order to address the need for an extensible application space and recognising the confines of URL management, applications are denoted by the first URL path parameter. For instance http://example.com/edit invokes the object edit/post application. Additional URL path components are passed to the application as arguments an a manner similar to Unix/Linux 'argv/argc' mechanisms. Application URLs take precedence over path URLs, such that creating a directory or document called 'edit' at the root level will be unavailable at that URL if the 'edit' application exists. An external path alias mechanism exists to redirect to another URL in the case of conflict with the application space.

An application framework exists that supplies plugin methods for handling initialisation, form posts, main page content, and menu callbacks. Arguments are parsed and passed in as argv/argc elements, although meta-arguments dealing with pagination (such as 'page=4') are dealt with by the kernel or core to minimise extra argument parsing at the application level. To provide pagination, an application only needs to obtain a count the total number of items and invoke a 'paginate' function.

Licensing

Reflection will be available under the generic Berkeley license. Free for all uses but with no implied warranty.

Platform

Recent/modern flavours of LAMP. Apache/mod_rewrite is required. PHP5.2+ is required for timezone support. Language: English.
The Reflection CMS Project

Mike Macgirvin
  from Diary and Other Rantings
Just wanted to update y'all on current happenings since I terminated my daily rants a while back...

I've been working under the covers on a new web project; which takes all that I've learned building this here website and social spaces in general and pushes it into a new realm.

The thing about CMS software is that they all suck. Some suck worse than others, but they're all really, really bad. Most of them try to be all things to all people - and as a consequence fail miserably at being anything to anybody. I guess I've been guilty of that myself.

I'll be putting up a serious contender over the next several months to show that the situation doesn't need to be so abysmally abysmal. Oh yeah, and it will be open source, extensible, yada, yada. While basically working securely and outperforming any of the competition - without resorting to caching to make up for the sucky performance; like everybody else does.

In order to accomplish this, I'm not even going to try to create something that is all things to all people. Apache2.x+, php5.x+, mysql5.x+ and English only. I've re-written my existing website engine to be leaner and meaner and am currently adding some core functionality back in, whilst tossing 90% of the code that nobody (but me) ever used.

I've boosted performance by a factor of 4 at least, and will be reducing the number of database queries per page to under 10 on average (from a current average of 20-35); still way under the market leaders which hammer the database several hundred times for each and every page - and hit the file system an equal number of times. That's piss poor engineering and an embarrassment to any serious software developer.  

Security on each object has been radically simplified - however is extremely robust and verifiable.

Stay tuned...
Q=`(parlez vous|hablas|sprechen ze) $geek?`

Mike Macgirvin
  from Diary and Other Rantings
It just occurred to me that in the last 4-5 days I've written 'code' in Visual Basic, SmallTalk, lisp, C, bash, awk, PHP, perl, and python. Thousands of lines of code total. And there are probably a few dialects I forgot here. Not to mention 30-40 different flavours of config files, sed, Oracle-SQL, mySQL, and LDAP and some other stuff that don't quite qualify as 'code' but still involve intimately knowing a strange computer dialect. Oh yeah, HTML and JavaScript (of course).
Installing Oracle (oci8) into pre-built Debian php5

Mike Macgirvin
  last edited: Wed, 21 May 2008 11:02:14 +1000  from Diary and Other Rantings
Some notes to save somebody some grief:

Installing the Oracle libraries and access module into an existing PHP5 installation on Debian...

First grab the Linux instantclient from oracle.com - you'll also need the client SDK kit. Here I'm using instantclient 11.1

create a directory for these such as /home/oracle and unpack both of them to that directory.

Go into the oracle directory (and into the instantclient_11_1 directory) and create a symlink:

$ ln -s libclntsh.so.11.1 libclntsh.so

Grab the oci8 PECL package and unpack it somewhere (~/oci).

Make sure you have the following packages (in addition to php5, php5-cli, apache2, etc).

  php5-dev

  libaio1

  php-pear
  

Go to the oci8 directory (~/oci/).

You need to run 'pecl build' once to create the configure script.

$pecl build

But the problem is that pecl build will claim the files are installed and they are not. I wasted half a day on this one. Now go into the oci8-1.3.0 directory and build again by hand:

$ cd oci8-1.3.0

$ ./configure --with-oci8=instantclient,/home/oracle/instantclient_11_1

$ make

Fix any errors/warnings before continuing

Don't make install, which won't work.

$ cp ./modules/oci8.so /usr/lib/php5/20060613+lfs

Replace 20060613+lfs with whatever module directory has been setup for you in /usr/lib/php5

Create /etc/php5/conf.d/oci8.ini:

----

extension=oci8.so

----

Now run the php cmdline in verbose mode (php -v) and see if everything loaded. Fix it if it didn't.

You may need some env variables setup in your /etc/init.d/apache2 file to make everything work and actually execute queries, but a phpinfo() at this point should show your oci8 extension. See the php.net Oracle pages if you need help with the env variables.

Restart the web server

$ /etc/init.d/apache2 restart
Gail
 
Many thanks! You helped me install oci8 in a few minutes... :-)

Greetings from Tirana, Albania.
SPHINX
 
Thanks man. It still took a while (had some include issues), but it would have been a marathon without this post. Lifesaver.

SPHINX www.battleforces.com
Sun acquires MySQL

Mike Macgirvin
  last edited: Fri, 18 Jan 2008 09:30:46 +1100  from Diary and Other Rantings
Unless you've been watching closely, this announcement was easy to miss. Sun Microsystems is acquiring MySQL. This has ramifications both good and bad.

This will likely affect a huge number of people who are currently using open source web applications; a majority of which are being stored on MySQL databases. Their future viability is now questionable. It all depends on the license and revenue models Sun chooses to adopt.

I would also try to steer clear of the pending 6.0 release as it is likely to involve significant re-structuring of the code to suit Sun's business requirements. It may be a year or three before it stabilises again. Sun is legendary for introducing layers of bureaucracy into development projects.  

While Sun may make public announcements of their intent to continue to provide the product for free [and it should be noted that there was no such announcement in the press release], it is difficult to imagine the corporate bean counters not making a recommendation to derive as much revenue stream as possible from the acquisition.

You can read the announcement here.

Also of potential interest is this (dated) history of MySQL
xml playground revisited

Mike Macgirvin
  last edited: Thu, 10 Jan 2008 16:01:49 +1100  from Diary and Other Rantings
Looks like I got sidetracked from my original mission to use this website as an xml playground to explore and develop new communications technologies, and instead wrote a social portal that hardly anybody cares about. That was a few years ago now. Well, I haven't given up. It just took a while to reach the state where I can get beyond the user-interface plumbing and get back to the machine interfaces which is where the fun is.  

Feeds have improved a lot. I'm using Atom paging now. Still holding off on atom-thread for comments since I can do it so much easier embedding into the articles - though I note that the latest Firefox parses atom-thread just fine. No use forcing it on the public until a few more feedreaders have jumped on board. The code has been working for a year or two, but I'm just waiting for the rest of the world to catch up before I turn it back on.

I've been playing with a weblog export tool that's basically an Atom feed, but replaces images and attachments with inline data: URL's. Have had a few glitches - including a PHP bug in the regular expression library that I need to report. But this in theory can let you take an entire weblog and move it elsewhere as one gigantic XML file. Everything. Images, attachments, comments, categories, the whole nine yards.

I've also got Atom Publishing Protocol support in a very primitive state (but not yet ready for prime time). This is a big effort and I don't expect to be finished for a few months. I've got a suitable framework, but this site works a bit differently than the model used by the atom publishing spec. It will take a while to resolve all the differences so it plays nicely. This would for instance allow you to import your entire weblog from elsewhere in the world - especially one that used data: URLs to bring in images and attachments. Otherwise if I use the default model, I've got to package everything into a workspace for export, and this takes more than one file to represent all the structures completely. But that's the big picture - on a smaller scale, you should soon be able to publish weblog posts from your cell phone, or sync new articles with another weblog you may have. I 'm also not bothering with the xml-rpc remote mechanisms for publishing. They're primitive now, the api's too fragmented, and pretty much dead.

Oh yeah, and we've got trackbacks now - for any weblog that allows non-member comments. This is a flavor of xml-rpc. It isn't a big deal, but a few folks have requested it. You can find the trackback URL in the 'more actions' menu of articles - that is for any member weblogs that allow them. Mine does.  

Oh, and photo albums can now be exported as zip files. That has nothing to do with XML...
John
 
Cool. Don't data: URLs involve a 33% expansion in size and come with some potential size limits (in libraries if nothing else)?

A general format for Atom that allowed cross-references in URLs (like cid:) would be useful both for images and other attachments and for related feeds like comments, trackback snippets, etc.
Mike Macgirvin
  
Yeah, there are issues. I'm just trying to figure out how to get there from here. Right now data: URLs are the only way I can come up with to encapsulate everything. If a few people adopted it, it might be viable. At least everything to export an entire blog in a single file would be standards compliant. I'd be glad to see something better...

Hey congrats - I hear you're at Google now....
OS madness, chapter #7936

Mike Macgirvin
  last edited: Thu, 08 Nov 2007 14:54:42 +1100  from Diary and Other Rantings
Still struggling with device drivers on Windows Vista. The sound card drivers have an update, but I'm skeptical. Several folks reported BSOD when they installed it.

And I've lost any good feelings I had for Debian. Recently I moved my old RedHat installation to a newer PC - one that was only 8 years old rather than 10. All went extremely smooth. On bootup, it found the new motherboard, new network card, mouse, monitor, etc. - and configured all of them.  Everything worked fine.

Then I upgraded to Debian. The RedHat was a couple of years old, and I didn't want to mess with building PHP, MySQL, and Apache upgrades as well. Just boot up a newer Linux. Debian is currently one of the more popular Linux flavors - and I especially like the APT package management utility. Need PHP? 'apt-get install php'. You don't need to build and configure it and mess with library dependencies. These are all taken care of. If it needs new libraries, these are installed as well as any libraries that they depend on.  

Anyway, now (a couple of weeks later) I put in another newer PC - this time only 4 years old. I was expecting everything to go smoothly like it did last time. But it didn't. Debian doesn't have very good hardware (re-)detection, and they also don't load any other drivers than what is absolutely necessary. So I'm faced with an incomplete operating system that doesn't recognize the monitor or ethernet card. And I can't load in the modules for these devices over the net, because it doesn't recognize the network card. It's a Catch-22.

The only solution now is a re-install. Spend a few weeks getting everything configured and then start over. Right. I've been here before. Way too often...

But if you're one of those folks considering moving away from RedHat/Fedora, beware. It's nice to be able to plop your disks into another box if the one you've got goes bad - and keep running. Debian won't do this.
Daylight Savings

Mike Macgirvin
  last edited: Sat, 27 Oct 2007 14:20:00 +1000  from Diary and Other Rantings
Tomorrow is Daylight Savings. Remember 'Spring forward, fall back'? That's right - tomorrow we move it forward, no matter how odd that may seem. It's October, but it's spring.

At least the Australian government hasn't been mucking with and tweaking DST as it did before the 2000 Olympics. The software engineers need time to code in the changes - I think that a lot of the world now has Sydney time right.

Well that would be anybody using the Olsen timezone databases. I know personally about thirty web services which just give you a choice of 'GMT+10' - and these are all going to be wrong tomorrow. On the bright side, I really don't care if they get it wrong. I'm not using any of them for anything globally time sensitive. It always makes my head hurt trying to figure out how many hours I'm going to be away from GMT with all the conversions and tweaks in effect. I suppose it'll probably be GMT+11. One hour forward. But wait, we'll then be one hour closer to Greenwhich, England as the earth spins. Not further from it. So maybe it's GMT+9. Silicon Valley will be... Uh, I give up. It's in negative GMT and the time is going back. So is it forward or backward? I'll have to figure it out on paper to work out the difference between LA (where this server is) and Sydney (close enough to where I am).  

But this will also give a good test of my own daylight savings and timezone functions (which use Olsen tables). The U.S. is going one hour back and we're going one hour forward. I might be poring over the code tomorrow if something gets askew.
Mike Macgirvin
  
In fact the time changed here - I was just a bit premature on when it changes there. They used to try and change the whole world on the same day, but you're right. Last year's energy act messed up that part of it.

No worries. Everything seems to be working. It just means I'll have to go through all of this again when you folks change over. I won't bother calculating the delta right now, since it's in a temporary state. It's nice to know the delta before I make a phone call overseas. Nothing worse than 'Hello? Who is this? It's 3 in the morning!'
peonyden
 
Hi Mike Thanks for posting about Daylight Saving Changes. We have now changed, but on the Sunday morning, so you might have been early to all events on Saturday, if you changed on 27 October. But that's better than being completely out. It is one thing to be unaware. You end up 1 hour late on the first day of the change. I went the wrong way, once, (the "autumn" change) and put my clocks forward, at the end of the season. I was two hours early for everything, and thought I had missed a series of important events. but they had not yet happened. I was in a panic - unnecessarily. The little adage "Spring Forward, Fall Back" which you quoted is now permanently engraved in my brain as a result of the scare I had on that occasion. Denis
peonyden
 
Mike

Now that I am properly awake, I see that your comment: "that the time had changed here" was written on Sunday morning - sorry if I implied you were a day ahead.

That's the other problem with Daylight Saving changeover. Body awake, but brain not yet awake.

Denis
Multiviews is good, unless it's bad

Mike Macgirvin
  last edited: Tue, 11 Sep 2007 09:01:51 +1000  from Diary and Other Rantings
This had me pulling my hair out yesterday, so I thought I'd share the experience with enough key terms that the next person pulling their hair out will find it.

I was installing my CMS software on a work machine. I'll likely be doing additional development on it, and the university is the best place to do this. But that's neither here nor there. My software is designed around 'clean URLs'; which means what you see in the URL bar isn't (usually) littered with code and operating system artifacts. So for instance to post to my weblog, I go to the URL /post/weblog, not something like post.php?op=weblog.

To accomplish this, I use an Apache webserver module called 'mod_rewrite', which takes care of the nitty details of this process. Mod_rewrite is not without its faults, but that's the subject of another article. It does the job. The biggest thing it does is let you leave out the '.php', except I'm letting it do a whole lot more.

Anyway I'll cut to the chase. My software was horribly broken after installing it yesterday. It took hours to figure out why. Something else was trying to provide clean URLs and strip '.php' from places where I actually needed to have it in order for things to work. Well that's not technically correct either. It was actually executing PHP files by URL without the extension. Except that these were 'include files'. They weren't meant to be executed directly. They were meant to be included in something else, and the something else was managed by mod_rewrite. The something else was never getting called.  

This was inconceivable. Nothing in any of the system release notes said anything about some magic new clean URL ability. This was on Debian (edge). Apache, PHP, MySQL. I tried all the sites. I googled for everything I could think of. Clean URL debian. Clean url apache. '.php not needed'. mod_rewrite. strip file extension. ForceType. (ForceType also lets you execute files without providing a filename extension). I scoured the last several months of Apache release notes, to no avail.

Finally after several hours I happened upon a little gem of a snippet on an obscure website. 'Turn on Multiviews instead of ForceType'. Debian has multiviews turned on by default, but this is the first I'd heard of it. I had assumed (never do this of course) that it was yet another fancy mod_dir option or something I didn't care about.

No. Multiviews is a slick trick for Apache that takes any pathname, and if it thinks it can find a page to return, it returns it. It uses the basename of the file in the URL and if there's no file, it looks for the filename with an extension. Any extension. Then it sends the file back. So it gives you a clean URL. You type in 'index' and it will send back 'index.htm' or 'index.html' or 'index.php' or 'index.pl' or 'index.shtml'. You get the idea. You can test this on any site that has multiviews turned on by asking for 'index' and see if you actually get a page. Normally you wouldn't.  

If the URL is 'post' and there's a file in the directory called 'post.php' it will send that file back even if you don't want it to. So I'll let you research it further if multiviews is what you want. It's actually pretty cool. In my case I had to disable it.

Options -Multiviews in the .htaccess did the trick and made everything work again.
peonyden
 
Hi Mike Glad you got all that off your chest. That's what blogs are for. Better than kicking the dog, anyway. Someone out there working with Multiviews and Apache (whatever they are) will appreciate your explanation, and advice. I have no idea what the problem you encountered is all about, but I feel your pain, anyway. Cheers Denis
Another new chunk of code

Mike Macgirvin
  from Diary and Other Rantings
I've been thinking about how to do this for ages. Finally did it. You'll find some new stuff in the menus called 'Sharing'. No, it isn't about that half of a ham sandwich you're having for lunch.

Over the past few months you've gained the ability to subscribe to various information sources and recently to configure what features you desire from the website. I mentioned this ability months ago in my grand vision - kinda' like myspace on acid. Well I finally finished the thing which takes that vision and turns it into reality. When you create these private views of the website, it's like having your own private Content Management System (CMS). In fact that's the way I like to describe it to people. There are lots of folks peddling multi-user community CMS's. Yada, yada. But nobody that I'm aware of is selling or even working on a 'personal CMS as part of a multi-user community'.

So what's this 'Sharing' all about? It's simple. You spend the effort to configure your personal content system whatever way you want. Then you can turn around and let other people use it.  

See with MySpace, you get a page that's all yours to mess with. Some of the other community sites give you a few more things. But this let's you build your own complete community site from within a broader community platform. You can make this look like an auto racing site and when people visit your shared page, and then any other page on the site, that's what they'll see. You're in total control until they go elsewhere or turn it off. You control the menus, you control the skin, and you control all of the information sources for your space. They can be your info sources, they can be my info sources, or pull in whatever feeds you want from anywhere. Want your audience to have mail? Chat? And discuss the Confederate War? Two minutes. Photo dating website? Two minutes.
parental controls

Mike Macgirvin
  from Diary and Other Rantings
The latest feature drop is a parental controls infrastructure. You can find out more on (towards the bottom of) the help page. One of the key components is a tool to allow you to turn any of the site features on or off - either for you or for your children. This combined with subscriptions lets you fine tune exactly what you wish to see when you visit. I've been using this immediately to provide different feature sets to the music portal vs. the community portal vs. software engineering portals.

As always, this is just a hint at the types of things you can do. Any member for instance can turn off chat, the photo ratings, guitar chord charts, or anything else they don't care about.

I think I'm finally running out of website features to implement...
Nasty little bug in mod_rewrite

Mike Macgirvin
  last edited: Sun, 08 Oct 2006 06:03:47 +1000  from Diary and Other Rantings
Actually I think the bug is in Apache, but whatever.

Let's say you've got a site like this that uses clean URL's.

http://someplace.somewhere/something/more/stuff

Now let's say you wanted  to insert a category name in the middle of this URL, and the category name contains a slash. Let's say 'more/less' instead of 'more'. But you can't change the number of slashes, because in the example 'stuff' isn't a category, it's something else.

So what you would normally do in PHP (and many other languages) is urlencode() the name. This gives you a category that looks like 'more%2Fless'. Now you can just urldecode() it and turn it back into a legal category name without messing up the URL.

But the problem is that if you use mod_rewrite to support clean URL's, it currently decodes the URL in the process of doing its work - before you ever see it. So there's no way of knowing if a category has a slash or not. Ditto for hash and several other characters. It turns out the bug is actually in Apache, which is decoding the URL before it hands it off to mod_rewrite, but that part doesn't matter. If you don't use mod_rewrite you won't see the bug. Some of us have to use it though.

This violates the primary law of encoding information - you must have one and only one decoder for every encoder. Apache/mod_rewrite is decoding something it has no right to touch.  

Fortunately, there's a way out of this mess, but it's very non-standard. You have to further encode the URL so that it can't get automatically decoded by the middle software layers. You could just double encode it which works today, but then if they ever fix the bug, you'll end up with a bad decoding. I'm currently turning %2F into ^2F, since ^ is one of the few characters which isn't normally used in a URL. This then gets encoded by the browser to be %5E2F. It doesn't matter if the %5E part is turned back into a ^, or even decoded more than once (the second and subsequent decodes will essentially be a no-op). All that matters is that the slash remains encoded all the way through this hostile communications channel.

What an ugly hack.

I'd go into apache and fix it, but that won't help me in the short term. I'd have to wait for the patch to get rolled into a future release, and then wait for my service providor to pick up the later release. ...That could be a year or more unless  some urgent security bug pops up. So I guess I'd better get used to living with this hack.
XSS

Mike Macgirvin
  last edited: Sun, 25 Jun 2006 04:32:42 +1000  from Diary and Other Rantings
I can write most any kind of software and usually do it pretty well. But there are times when it's better to let somebody else do the dirty work. In this case it's so-called Cross Site Scripting or XSS. For a community site such as this it's a nightmare - but one which refuses to go away.

In simple terms, it's Javascript injection. If you can get code onto a page, somebody will execute it by visiting that page, and one can exploit the fact that somebody is running their code. These exploits can range from minor infractions to serious felonies, and you can stick the code most anywhere that you can type something and have it show on a web page.

I had several regex's setup to stop XSS and still allow HTML authoring, but it turns out that the browsers have too many holes to plug with a few regex's. The XSS hack which took down myspace.com was instigated by putting javascript code into a stylesheet and breaking up the word jav a sc r ipt. Internet Explorer gladly packed it back together and ran the code. IE will also execute the same code written in hexadecimal. You can't keep writing regex's to stop all this stuff. Regex's aren't the correct tool for the job (they are part of the solution, but not the total solution). At some point it requires an HTML parser to take the 'bad' HTML one character at a time, and rebuild it into good HTML.

There are four possible solutions:
  • Ignore the problem and hope it goes away. It won't.
  • Do away with HTML authoring completely and either force everybody to learn another tag system or just force everybody to use plain text.
  • Write an HTML language parser to rebuild the code based on every historical variation of HTML which might be encountered.  
  • Let somebody else write this parser.
I hate writing parsers, and in this case the task is to write a parser which duplicates the code flow of the most horrifically buggy web browsers.  

So I went with number 4...

PS> I found one developer website which seriously recommends using 'strip_tags' in PHP to make your site safe from XSS attacks. It won't, because strip_tags doesn't recurse. One can embed tags within tags and blow right through it. They should be shot. If you'd like to have a look at the number of ways that hackers can blow through your security, visit http://ha.ckers.org/xss.html
Fortune in mySQL

Mike Macgirvin
  last edited: Thu, 27 Apr 2006 08:59:21 +1000  from Diary and Other Rantings
So would anybody be interested in obtaining a mySQL import file containing the complete contents of fortune-mod? If you have need of such a thing, you'll know what to do with it. I whipped this up after finding that the latest fortune-mod package now takes about 7 seconds on a fast machine to produce a cookie. This is totally unacceptable. The reason for this is that some recently added command line switches let you specify the probability that any particular category will be chosen. In order to calculate this number, the program by default loads in all the fortunes from all the category files so it can count them all.

This collection belongs in a database. Randomly selecting one element from within a collection of elements and then from within a collection of flat files (after counting each entry and then assigning probability weighting to each file) is ludicrous.  

So I looked all over the web for a mySQL table containing the fortune database but couldn't find a single one. Weird. In my travels I found a tool to add fortunes to mySQL but it was pretty badly written. So I started over and did all the dirty work for you. Use these tables as you wish.  

First file is fortune-mod.1.99 in a table named 'fortune' with two elements, 'id' and 'fortune'. 'id' auto-increments.

Second file is the 'offensive' tree in a table named 'fortune2' which is otherwise identical to the first. You can keep them separate, merge them, whatever.  

You can pull out a fortune cookie with the following SQL - just point $table at either fortune or fortune2.  

SELECT `fortune` FROM `$table`ORDER BY RAND() LIMIT 1

To display on the web, you'll need to escape the HTML entities and translate linefeeds (or it will look like mush). I use something like this... where $cookie is the raw fortune from the SQL query:
str_replace(array("&","",'"',"\n"), array("&","<",">",""","<br />"), $cookie);
The third file (fortune-mod.sql.gz) is the complete fortune-mod.1.99 in a single table. This table contains a couple of extra fields. The 'category' field contains the original fortune-mod category name, such as 'quotations' or 'zippy'. There is also an 'offensive' field which is an integer 0 for normal fortunes, 1 for offensive fortunes. Making use of these extra fields is left as an exercise for the reader. One more thing... the third file also has been stripped of all the fortunes containing backspace characters. These might be OK for a text terminal, but they look pretty ugly on the web.

Enjoy.