ModRewrite Woes (Solved!)

Problems with ModRewrite, relative URLs, base paths, things executing without extensions being specified, and using MultiViews — read on.

While working on a project, I stumbled into some of the weirdest Apache2 mod_rewrite problems that I’d ever seen.

The goal was to make a URL like http://www.nowhere.com/item/1234 turn into http://www.nowhere.com/item.php?id=1234. Trivial, and I’ve done it all the time.

RewriteEngine on
RewriteRule ^item/(.+)$ item.php?id=$1 [L]

This time it wasn’t working the way I expected. When I used the human-readable version, my page got delivered but I had no images, no CSS, no JavaScript. Yet, if I used the computer-friendly long form with parameters, it worked just fine.

A little examination with Safari’s activity window showed me that in the initial case the browsers were looking at all relative URLs as if they were prefixed with /item/. This makes sense, because the rewrite rules know how to map the pretty URL to my page, but the relative links on that page, to CSS, graphics, and JS, had no clue this was a fake base URL.
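To see why the browser goes looking under /item/, here is a small sketch that uses the standard URL class (available in modern browsers and in Node.js) to resolve a relative link against each form of the address. The css/site.css path and the resolveLink name are made up for illustration; they are not from the project.

```javascript
// Resolve a relative link the way a browser would, given the page's address.
function resolveLink(relative, pageUrl) {
  return new URL(relative, pageUrl).href;
}

// "css/site.css" is a hypothetical stylesheet path used for illustration.
var pretty = resolveLink("css/site.css", "http://www.nowhere.com/item/1234");
var plain  = resolveLink("css/site.css", "http://www.nowhere.com/item.php?id=1234");

console.log(pretty); // http://www.nowhere.com/item/css/site.css
console.log(plain);  // http://www.nowhere.com/css/site.css
```

The browser only ever sees the address it asked for, so the last path segment is dropped and everything before it is treated as a directory.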

Many thanks to richardk who pointed out multiple solutions back in 2005.

  • Don’t use /, and there isn’t a problem.
  • Use absolute paths, though you have to edit all the links on your page; if using PHP, consider a variable for the base path.
  • Use a RewriteRule to hack off the offensive directory that doesn’t exist.
  • Or, use the <BASE …> tag.

Well, that rendered the page prettier, but I realized my argument wasn’t being passed in. Yet, the re-write rule was correct.

So I tried http://www.nowhere.com/item, which should not have matched and should not have brought up a page. Yet it did.

A little experimentation showed that any page that had a known extension was getting delivered.

What this meant was that the moment the server saw /item it found the item.php page and delivered it without the request ever matching my rewrite rule, and hence no parameters.

Luckily, I’ve encountered this symptom before in a different context. The offender: MultiViews. This is the bugger that deals with multiple language support; you know, where you have a zillion internationalized instances based on filename extensions….

Turning that off instantly solved the problem of delivering a file without an extension:
# Options Indexes FollowSymLinks MultiViews
Options Indexes FollowSymLinks

That also meant that the mod_rewrite rules worked. And that meant the parameters were passed correctly. And that meant I was happy, because the code was working.

Firefox Slow Page Load – Solved

Firefox 3 slow? 20 second page load times? Figured out why. And how to fix it.

A co-worker showed me an interesting problem with Firefox today. He loaded a page from our application (running on localhost) and the page content loaded instantly, but the page load itself didn’t end until a time out 20 seconds later. Literally.

Everything we saw and measured from the browser or from the sending application showed that the content was sent in milliseconds, and the page load was just sitting there doing nothing. We were even using the latest Firefox beta.

Other browsers had no such problem.

We figured out what was going on using the Tamper Data add-on.

Turns out there was a Connection: keep-alive in the header. When we changed it from keep-alive to close, the browser behaved as expected. That is, it loaded the page instantly.

A little web investigation showed that when you use keep-alive, you must also send a Content-Length: header (or use chunked transfer encoding) so the browser can tell where the response ends, which the sending application wasn’t doing.

A quick application tweak to send the content length, and everything ran super spiffy.
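As a rough sketch of that tweak, assuming the sending application were written in JavaScript on Node.js (the original application's language isn't stated), the header set might be built like this. The buildHeaders name is hypothetical and not part of any framework.

```javascript
// Build response headers for a body sent over a kept-alive connection.
// buildHeaders is a hypothetical helper, not part of any framework.
function buildHeaders(body) {
  return {
    "Content-Type": "text/html; charset=utf-8",
    "Connection": "keep-alive",
    // Content-Length counts bytes, not characters, so measure the encoded body.
    "Content-Length": Buffer.byteLength(body, "utf8")
  };
}

var headers = buildHeaders("<p>héllo</p>");
console.log(headers["Content-Length"]); // 13
```

Note the byte count: the accented character takes two bytes in UTF-8, so using the string's length would under-report the size.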

Now, if you don’t have access to the application that’s sending you web pages, you can twiddle with the about:config and change the network.http.keep-alive setting to false.

Using </SCRIPT> In A JavaScript Literal

Today I got bit by a very interesting bug involving the </SCRIPT> tag. If you’re writing code that generates code, you want to know about this.

I’m currently working on an application that takes content from various web resources, munges the content, stores it in a database, and on demand generates interactive web pages, which includes the ability to annotate content in a web editor. Things were humming along great for weeks until we got a stream of data which made the browser burp with a JavaScript syntax error.

Problem was, when I examined the automatically generated JavaScript, it looked perfectly good to my eyes.

So, I reduced the problem down to a very trivial case.

What would you suppose the following code block does in a browser?

<HTML>
<BODY>
  start
  <SCRIPT>
    alert( "</SCRIPT>" );
  </SCRIPT>
  finish
</BODY>
</HTML>

Try it and see.

To my eyes, this should produce an alert box with the simple text </SCRIPT> inside it. Nothing special.

However, in all browsers (IE 7, Firefox, Opera, and Safari) on all platforms (XP/Vista/OS X) it didn’t. The close tag inside the quoted literal terminated the scripting block, printing the closing punctuation.

Change </SCRIPT> to just <SCRIPT>, and you get the alert box as expected.

So, I did more reading and more testing. I looked at the hex dump of the file to see if perhaps there was something strange going on. Nope, plain ASCII.

I looked at the JavaScript documentation online, and the only things they suggest escaping are the single and double quotes, as well as the backslash which does the escaping. (Note we’re using forward slashes, which require no escapes in a JavaScript string.)

I even got the 5th Edition of JavaScript: The Definitive Guide from O’Reilly, and on page 27, which lists the comprehensive escape sequences, there is nothing magical about the forward slash, nor this magic string.

In fact, if you start playing with other strings, you get these results:
  <SCRIPT> …works
  <A/B> …works
  </STRONG> …works
  <\/SCRIPT> …displays </SCRIPT>, and while I suppose you can escape a forward slash, there should be no need to. Ever. See prior example.
  </SCRIPT> …breaks
  </SCRIPTX> …works (note the extra character, an X)

With JavaScript, what’s in quotes is supposed to be flat, literal, uninterpreted, meaningless text.

It was after this I turned to ask for help from several security and web experts.

Security Concerns


Why security experts?

The primary concern is obviously cross site scripting. We’re taking untrusted sites and displaying portions of the data stream. Should an attacker be able to insert </SCRIPT> into the stream, a few comment characters, and shortly reopen a new <SCRIPT> block, he’d be able to mess with cookies, twiddle the DOM, dink with AJAX, and do things that compromise the trust of the server.

The Explanation


The explanation came from Phil Wherry.

As he puts it, the <SCRIPT> tag is content-agnostic. Which means the HTML Parser doesn’t know we’re in the middle of a JavaScript string.

What the HTML parser saw was this:

<HTML>
<BODY>
  start
  <SCRIPT>alert( "</SCRIPT>
  " );
  </SCRIPT>
  finish
</BODY>
</HTML>

And there you have it, not only is the syntax error obvious now, but the HTML is malformed.

The processing of JavaScript doesn’t happen until after the browser has understood which parts are JavaScript. Until it sees that close </SCRIPT> tag, it doesn’t care what’s inside – quoted or not.
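The following is a deliberately simplified model of that behavior, not a real HTML parser: it assumes the parser ends a script block at the first "</script" (in any case) that is followed by whitespace, a slash, or a closing angle bracket. The function name is invented for this sketch, and real parsers have further special cases that it ignores.

```javascript
// Very rough model of how an HTML parser finds the end of a script block:
// it looks for "</script" (any case) followed by whitespace, "/" or ">",
// with no regard for JavaScript quoting.
var SCRIPT_END = /<\/script[\s\/>]/i;

function scriptBodySeenByParser(textAfterOpenTag) {
  var match = SCRIPT_END.exec(textAfterOpenTag);
  return match ? textAfterOpenTag.slice(0, match.index) : textAfterOpenTag;
}

console.log(scriptBodySeenByParser('alert( "</SCRIPT>" );</SCRIPT>'));   // alert( "
console.log(scriptBodySeenByParser('alert( "</SCRIPTX>" );</SCRIPT>'));  // alert( "</SCRIPTX>" );
console.log(scriptBodySeenByParser('alert( "<\\/SCRIPT>" );</SCRIPT>')); // alert( "<\/SCRIPT>" );
```

This matches the earlier experiments: the extra X and the escaped slash both stop the text from looking like a closing tag, so the whole statement reaches the JavaScript engine intact.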

Turns out, we all have seen this problem in traditional programming languages before. Ever run across hard-to-read code where the indentation conveys a block that doesn’t logically exist? Same thing. In this case instead of curly braces or begin/end pairs, it was the start and end tags of the JavaScript.

Upstream Processing


Remember, this wasn’t hand-rolled JavaScript. It was produced by an upstream piece of code that generated the actual JavaScript block, which is much more complex than the example shown.

It is getting an untrusted string which, to be shoved inside a JavaScript string, has to be not only sanitized but also escaped in such a way that the HTML parser cannot accidentally treat the string’s contents as a legal (or illegal!) tag.

To do this we need to build a helper function to scrub data that will directly be emitted as a raw JavaScript string.


  1. Escape all backslashes, replacing \ with \\, since backslash is the JavaScript escape character. This has to be done first so as not to escape the other escapes we’re about to add.
  2. Escape all quotes, replacing ' with \', and " with \" — this stops the string from getting terminated.
  3. Escape all angle brackets, replacing < with \<, and > with \> — this stops the tags from getting recognized.

private String safeJavaScriptStringLiteral(String str) {

  str = str.replace("\\", "\\\\"); // escape single backslashes
  str = str.replace("'", "\\'");   // escape single quotes
  str = str.replace("\"", "\\\""); // escape double quotes
  str = str.replace("<", "\\<");   // escape open angle bracket
  str = str.replace(">", "\\>");   // escape close angle bracket
  return str;
}
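For anyone generating the page from JavaScript instead of Java, here is a sketch of the same three steps. It keeps the helper's name for easy comparison; it is a direct translation of the steps listed above, not a vetted sanitizer.

```javascript
// JavaScript rendering of the three escaping steps; order matters.
function safeJavaScriptStringLiteral(str) {
  return str
    .replace(/\\/g, "\\\\")  // 1. backslashes first
    .replace(/'/g, "\\'")    // 2. single quotes
    .replace(/"/g, '\\"')    //    and double quotes
    .replace(/</g, "\\<")    // 3. open angle bracket
    .replace(/>/g, "\\>");   //    close angle bracket
}

var escaped = safeJavaScriptStringLiteral('</SCRIPT> said "hi"');
console.log(escaped); // \</SCRIPT\> said \"hi\"
```

When the JavaScript engine reads the escaped text inside quotes, the backslashes drop away and the original string comes back. Like the Java helper, this leaves newlines and other control characters alone; those would need their own escapes.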

At this point we should have generated a JavaScript string which never has anything that looks like a tag in it. All that’s needed next, for pages parsed as XML (XHTML), is to emit the JavaScript surrounded by a <![CDATA[]]> block, so the parser doesn’t get confused over the embedded angle brackets.

From a security perspective, I think this also goes to show that lone JavaScript fragment validation isn’t enough; one has to take it in the full context of the containing HTML parser. Pragmatically speaking, the JavaScript alone was valid, but once inside HTML, became problematic.

An Advanced Crash Course in AJAX

So you know a little bit of JavaScript and you’re aware in general what AJAX is, but now you actually have to do it and things aren’t quite as smooth or as easy as you thought. Here’s a quick guide through some trouble spots.

Let’s say that you’ve got a basic understanding of JavaScript, you roughly know what AJAX is, and you can twiddle the DOM, but now it’s time for the rubber to meet the road and you want to get up to speed, know about the quirks, and learn hidden tidbits that come from head-bludgeoning-against-the-wall experience.

This guide is a quick romp through AJAX, stopping at all the little pieces that you might not know about.

Making a Request / Response


It turns out that basically every browser on the planet does XMLHttpRequest the same way, with the exception of the evil Internet Explorer, which uses ActiveX objects distributed with the operating system, and even then it does so inconsistently. In theory, the new IE7 is supposed to conform to the “right” way, but given there are still so many 5.0, 5.5, and 6.0 IE browsers out there, this has made a rat’s nest out of what should have been simple code to start with. Here’s the fundamental code that returns a browser-neutral object:

function createRequest() {
  var request = null;
  try {
    request = new XMLHttpRequest(); // Everyone but IE
  }
  catch (trymicrosoft) {
    try {
      request = new ActiveXObject("Msxml2.XMLHTTP");
    }
    catch (othermicrosoft) {
      try {
        request = new ActiveXObject("Microsoft.XMLHTTP");
      }
      catch (failed) {
        request = null;  // Always check for NULL!
      }
    }
  }

  if ( request == null ) { // Might as well check here
    alert("Error creating request object!");
  }
  return request; // null if every attempt failed
}

An aside: It turns out that Internet Explorer 5.x on the Mac (an old, broken, and discontinued product from Microsoft) doesn’t work — AJAX can’t be done, as there is no ActiveX control, and they don’t do it the “standard” way with the browser.

To use such a function in your pages, you’d do something like this:

function getSomething() {
  var request = createRequest();
  var url = "http://www.yourhost.com/serverside";
  request.open("GET", url, true );  // true = asynchronous
  // Hand the request to the callback; the browser itself passes no request argument
  request.onreadystatechange = function () { callback(request); };
  request.send(null);  // or whatever data
}

Note: if you use POST, then you also have to set the request header with .setRequestHeader(). Usually this will be “Content-Type” with a value of “application/x-www-form-urlencoded” when just sending form data. Otherwise the server has no idea what is being sent in the POST.
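Here is a minimal sketch of building such a form-encoded body by hand. The encodeFormData name and the sample field names are hypothetical. It encodes spaces as %20 rather than +, which servers generally accept.

```javascript
// Encode a plain object as an application/x-www-form-urlencoded body.
// encodeFormData is a hypothetical helper name.
function encodeFormData(fields) {
  var pairs = [];
  for (var name in fields) {
    if (Object.prototype.hasOwnProperty.call(fields, name)) {
      pairs.push(encodeURIComponent(name) + "=" + encodeURIComponent(fields[name]));
    }
  }
  return pairs.join("&");
}

var body = encodeFormData({ id: 1234, title: "rock & roll" });
console.log(body); // id=1234&title=rock%20%26%20roll

// In the browser, with a request object from createRequest():
//   request.open("POST", url, true);
//   request.setRequestHeader("Content-Type", "application/x-www-form-urlencoded");
//   request.send(body);
```

Encoding each name and value separately is what keeps a stray ampersand or equals sign in the data from being mistaken for a field separator.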

The callback function, which can be different for each request, needs to check a ready state and a status, as the callback gets called four times during the process:

function callback(request) {
  if ( request.readyState == 4 ) { // 4 = response downloaded
    if ( request.status == 200 ||  // 200 = success
         request.status == 304 ) { // 304 = not modified
      // Do something
      var response = request.responseText;
    }
  }
}

Note: readyState is a read-only property – you can’t set it.

Another Note: you can use responseXML instead of responseText for XML! You manipulate it just like the DOM, making use of getElementsByTagName(). This requires a Content-Type of “text/xml” from the server to work.

Libraries, like Prototype and jQuery, abstract all this away so that you simply provide the URL, GET/POST type, Async/Sync type, and a callback, with special forms to do common tasks, like filling in a DIV with pulled content with one call.

Note: if you find yourself calling getElementById(), you need to look at the $() function of these libraries. And, yes, a dollar sign is a legal identifier character, making a lone dollar sign a valid variable or function name.

Keep in mind, if you start adding other third-party Prototype libraries, they may fail if you use the jQuery enhancements. Fret not, because you can have your cake and eat it too. jQuery lets you unhook itself from the standard $() shortcut convention by calling jQuery.noConflict().

It’s tempting to skim through this, assuming that “you get it” — but the devil is in the details.

Here’s where things get extra tricky.


Browsers Cache Dynamic Responses
Internet Explorer and Opera actually cache the response to a given URL request. That means if you do a GET, the first one will work, but subsequent ones will not. The browser will go “oh, I remember sending this before, here’s the response I got.” As such, you either need to use POSTs or attach a dummy variable, with something like new Date().getTime() as part of the parameters to force it to a unique URL each time.
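A minimal sketch of the dummy-variable trick follows. The helper name and the "_" parameter name are arbitrary choices; the timestamp argument exists only so the example output is repeatable.

```javascript
// Append a throwaway parameter so every GET has a unique URL.
// The parameter name "_" and the helper name are arbitrary choices.
function addCacheBuster(url, timestamp) {
  var stamp = (timestamp === undefined) ? new Date().getTime() : timestamp;
  var separator = (url.indexOf("?") === -1) ? "?" : "&";
  return url + separator + "_=" + stamp;
}

console.log(addCacheBuster("/serverside", 1200000000000));      // /serverside?_=1200000000000
console.log(addCacheBuster("/serverside?id=7", 1200000000000)); // /serverside?id=7&_=1200000000000
```

In real use you would leave the second argument off and let the current time fill it in.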

Script elements need an end tag!
It’s often useful to put JavaScript into its own .js file. Note however that the SCRIPT tag, for historical reasons, expects content. That content can be empty, but there must be a separate closing tag.

Illegal: <SCRIPT type="text/javascript" src="yourlib.js" />
Legal: <SCRIPT type="text/javascript" src="yourlib.js"></SCRIPT>

Never Use innerHTML
Additionally, you’ll find that a lot of examples use innerHTML to set the content of an element, like a DIV. Avoid it. It began as a proprietary Internet Explorer extension rather than part of the W3C DOM specification, and not every browser supports it in every context. Use DOM code; it works on any platform. Plus, libraries like Prototype and jQuery have special shortcuts making it possible to access elements by id, element, CSS class, XPath, and even as a collection. Seriously, the examples in your books are dated; look for library methods like .text() and .html() instead.

Other useful values: document.documentElement (the root node), .parentNode, .childNodes, .firstChild, .lastChild, .nodeType*, .nodeName, .nodeValue, .getAttribute(), .setAttribute().

* Once again, IE has problems, this time with the Node type.

Set Behaviors Elsewhere, If You Can
It’s also tempting to sprinkle code in onClick handlers, but that can make modification difficult for mass changes, not to mention making the HTML uglier. A library called Behaviour solves this problem elegantly. You write regular, clean HTML and it will use JavaScript to add behaviors to the tags you specify after the fact. Simply define a set of rules and apply them.

Note: the onclick property of a DOM object is all lowercase, not camel-cased.

DOMs Reorganize, They Don’t Copy
There’s also some other DOM magic that isn’t obvious. If you have a DOM tree and you get a reference to an element, and then you do an otherElement.appendChild(firstElement) to some other node element, since a DOM node can only have one parent, it actually gets moved. That is, you don’t have to delete anything.

Stuff About AJAX Libraries You Wanna Know


Drag’n’Drop …uh, no… Sortables
Drag’n’Drop in the browser world of AJAX means dragging DIVs and such to different locations on the screen, of which some of those locations can themselves be containers. If you’re looking to rearrange the elements within a container, that is called Sortables. These are container elements (like OL’s and DIV’s) which contain things (like LI’s, DIV’s, and IMG’s), and maintain the order of them. By far the best sortables example I’ve seen was done by Greg Neustaetter and he explains how he did it.

The HTML ID Does Matter!
Turns out many of the AJAX libraries do trickery based upon the ID of the elements. As you’re aware, every ID on a page must be unique in order to pass valid HTML. When the AJAX libraries go looking for elements, this must be true. Additionally, the IDs often have special meanings. For instance, in order to report sequences, the IDs had to be in a form of string underscore integer. ( e.g., Item_10). You can also use a dash instead. AJAX will let you serialize the numerical parts into a string. So, if your id happens to contain additional dashes, underscores, or forgets the numerics, bad things can happen.
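To make the naming convention concrete, here is a sketch of pulling the numeric parts out of IDs and joining them into a string. It only illustrates the convention; the function names are invented and this is not any particular library's serializer.

```javascript
// Pull the numeric part out of IDs shaped like "Item_10" or "Item-10".
// Returns null for an ID that doesn't fit the pattern, so the caller can notice.
function numericPart(id) {
  var match = /^[A-Za-z]+[_-](\d+)$/.exec(id);
  return match ? parseInt(match[1], 10) : null;
}

// Serialize the current order of a container's children into a string.
function serializeOrder(ids) {
  var numbers = [];
  for (var i = 0; i < ids.length; i++) {
    numbers.push(numericPart(ids[i]));
  }
  return numbers.join(",");
}

console.log(serializeOrder(["Item_10", "Item_3", "Item_7"])); // 10,3,7
console.log(numericPart("Item_10_extra"));                    // null
console.log(numericPart("Item"));                             // null
```

The last two calls show the failure mode: an extra underscore or a missing number means there is nothing sensible to report for that element.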

Note: In HTML 4, an ID must start with a letter. After that you can use letters, digits, dashes, underscores, colons, and periods. IDs are case sensitive, and though their technical size limit is 64K (wow!), don’t count on your browser to honor that. Long IDs can make things slow and chew up memory.

Be Careful With Arrays
There’s a lot of clever overloading going on. Sometimes a parameter is an element, sometimes it’s a class name, and sometimes it’s an array. When it comes to sortables (and drag’n’drop), often you need to provide a list of valid containers. This is done by creating an array of strings with the appropriate names and passing that to the AJAX call. As such, it isn’t mandatory to have an array to make things sortable, but only when you’re crossing containers.

Metadata
It is actually possible to pass collections of metadata inside of a class attribute! This can be very handy.

<P ID="thing" class="foo bar { xyzzy: 'plugh', abc: 123 }" />

jQuery has a plug-in called metadata (the documentation is in the JavaScript code) that lets you access this.

$("#thing").data().xyzzy returns "plugh"
$("#thing").data().abc returns 123
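To give a feel for what has to happen behind the scenes, here is a sketch that splits a class string into ordinary class names and the embedded object. The parseClassMetadata helper is hypothetical and written only for illustration; the real plug-in may work quite differently.

```javascript
// Split a class attribute into ordinary class names and an embedded metadata object.
// Hypothetical helper for illustration; the real plug-in may work differently.
function parseClassMetadata(className) {
  var match = /\{.*\}/.exec(className);
  var meta = {};
  if (match) {
    // The embedded text is a JavaScript object literal, not strict JSON,
    // so it is evaluated. Only do this with markup you trust.
    meta = new Function("return (" + match[0] + ");")();
  }
  var classes = className.replace(/\{.*\}/, "").split(/\s+/).filter(function (c) {
    return c.length > 0;
  });
  return { classes: classes, meta: meta };
}

var parsed = parseClassMetadata("foo bar { xyzzy: 'plugh', abc: 123 }");
console.log(parsed.classes);    // [ 'foo', 'bar' ]
console.log(parsed.meta.xyzzy); // plugh
console.log(parsed.meta.abc);   // 123
```

Because the metadata is evaluated as code, this approach is only appropriate for markup you generate yourself, never for content taken from untrusted sources.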

AJAX Responses
If an AJAX response returns text, you can access it with .responseText. If an AJAX response returns XML, you can access it with .responseXML and read it just like you would the DOM. And, if the AJAX response sends straight HTML, you can always inject it directly into an element with Prototype’s Ajax.Updater.

Currently, the hard part of the problem is taking an XML response from the server and transforming that fragment into HTML using XSLT on the client side.

Normally, a full XML document is transformed into HTML and loaded into the DOM, at which point AJAX takes over. The problem is, while AJAX allows for modifying the content of an element, the phase of XML to HTML is already past. Just as XMLHttpRequest() has many different historical quirks, XSLT support and implementation is even worse.

Supposedly, however, there is a library called zXml, and it has a transformToText() function which, in theory, provides cross browser support.

XSLT and AJAX

See the benefits of XSLT. Download, unzip, and drag the .XML file into your browser.


XSLT Example
XSLT_Example.zip

This example separates content, structure, and presentation.

But let’s discuss XSLT in the context of an entire page.

The magic of XSLT allows the transformation of any arbitrary XML to well-formatted HTML by rules that you define. And, what’s really spiffy is that you can use XSLT to automatically generate AJAX code as well. However, there are a few tricks to know and a few kinks to watch out for.

XSLT position()
This function returns the node’s position within the list of nodes currently being processed. The thing to look out for? Whitespace counts too: it shows up as text nodes! As such, if you’re using an <xsl:apply-templates />, you want to make sure the select attribute specifically lists the kind of node you want, and not just some parent element.

XSLT replace() is XSLT 2.0
Turns out some browsers have a problem with XSLT v2.0. Evil. Just evil. And, along this line, so are variables. Some of the really nice features of XSLT might not be possible.

Firefox Hangs When XSLT Generates Scriptaculous
The Scriptaculous effects library is very clever by being very modular. When you include it, dependencies allow just the pieces you want to load. This has the advantage of making the pages very lightweight. It appears to do this feat of magic by injecting content into the DOM at the point you include the <SCRIPT>…</SCRIPT> tag. Only problem is, if the DOM is being generated on the fly by XSLT, bad things can happen. Surprisingly, this seems to be a Firefox-only problem, and I’ve reported the problem to the authors of Scriptaculous. If I get no response, I’m going to the Mozilla people next. IE does not appear to be affected, nor is Safari.

Containers Get Instances
I had some XSLT code which was building my arrays, and deferring initialization of sortable containers until page load completion time. The problem was, for some reason, the containers were not getting initialized at completion. The result was that certain elements weren’t functioning. When I tried initializing as I went, each container got an instance of the array. Follow that again slowly. Sortable containers don’t get a reference to an array, they get a copy of the array, and if the array (which contains all containers you’re allowed to interact with) isn’t fully initialized, your page is broken. Admittedly, a lot of this problem happened because the load order of things between libraries wasn’t clear. Each AJAX library usually hooks into the OnLoad() call, so you better not have one, but you’ll need to see if it put itself first, or last, in the chain.
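The copy-versus-reference behavior can be reproduced in a few lines of plain JavaScript. In this sketch makeContainer is a made-up stand-in for a library's sortable setup call, assumed to copy the array it is handed.

```javascript
// A container that keeps its own copy of the list it is given.
// makeContainer is a made-up stand-in for a library's sortable setup call.
function makeContainer(name, allowedContainers) {
  return { name: name, allowed: allowedContainers.slice() }; // slice() copies
}

var containers = [];

// Initializing as we go: each container snapshots whatever exists so far.
containers.push("left");
var left = makeContainer("left", containers);
containers.push("right");
var right = makeContainer("right", containers);

console.log(left.allowed);  // [ 'left' ]
console.log(right.allowed); // [ 'left', 'right' ]

// Initializing after the array is complete gives every container the full list.
left = makeContainer("left", containers);
console.log(left.allowed);  // [ 'left', 'right' ]
```

The first container never learns about the second one, which is the kind of half-working page described above.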

That sums it up…


That sums up the mental core dump. If you happen to have any tidbits, trivia, or embarrassing corrections, I’d love to hear from you.

Understanding jQuery

jQuery – it’s an AJAX library that uses a very terse notation to do an awful lot in JavaScript. Here’s a good explanation for developers who are unfamiliar with the library: it explains a mental picture of what’s going on, enough that you can pick up the library and start using it for more than just trivial tasks.

I’ve been playing a lot with AJAX recently, and have discovered a library that I’m quickly falling in love with: jQuery.

To provide you with context, I’m a software engineer and have been developing commercial applications for well over twenty years. I played with JavaScript when it was young, and I was unimpressed. I played with JavaScript when it was a little more mature, but because of Microsoft’s horrific incompatibilities with Internet Explorer versus the way the rest of the world worked, I gave up. Perhaps prematurely. But, nonetheless, I didn’t pay any more attention to the world of scripting on the web than whatever problem I had demanded.

Then along came Ruby on Rails. I was surprised to learn that someone had actually written a library to abstract away JavaScript differences — what a clever solution! To that end, I started looking at Prototype and became impressed at the cleverness of the helper functions. That got me to look at Scriptaculous, and suddenly the world of JavaScript didn’t seem so bleak.

But jQuery. Wow. This library resonated with me very quickly, and I started thinking in and doing more functional programming than I had ever done before (as opposed to procedural and object oriented). The library was so easy to use that I was able to do quite a lot with it without understanding it. That was months ago, but today something clicked. I started to see in my mind’s eye exactly how jQuery does its magic, and in such a way as to describe it to someone who’s never used one of these AJAX libraries before.

Javascript, she ain’t that bad


JavaScript allows one to define classes. Those classes can be extended; don’t think in terms of derived subclasses, but rather of actually plastering additional methods onto a pre-established class. Additionally, those methods can have overloaded signatures. And for the sake of brevity, identifiers we’d never use in other languages are perfectly acceptable short names. We’ve been taught that although identifiers can start with things like dollar signs and underscores, we should stay away, since these are for library writers and operating systems people. Even though a single dollar sign might be syntactically legal, one should never do it; though in a world where network speed and space matter, such short names are encouraged. Finally, blocks of code, the very stuff you would call functions, can exist all on their own: all you need is a reference to them, they don’t need a name.
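Here is a short sketch of those language features side by side. Counter and the stand-alone $ function are invented for this example and have nothing to do with jQuery itself.

```javascript
// A "class" is just a constructor function.
function Counter() {
  this.count = 0;
}

// Plaster a method onto the existing class after the fact.
Counter.prototype.add = function (amount) {
  // "Overloading" is done by inspecting the arguments at run time.
  this.count += (amount === undefined) ? 1 : amount;
  return this;
};

// A dollar sign is a legal identifier, and a function needs no name of its own.
var $ = function (value) {
  return "[" + value + "]";
};

var c = new Counter();
c.add();
c.add(5);
console.log($(c.count)); // [6]
```

Nothing here is special syntax: the method is added by plain assignment, and the anonymous function is just a value stored in a variable.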

Accept all the above as a given, and a tribute to what’s become of JavaScript while you’ve been playing in other languages.

Grokking jQuery


Now at this point, I’ll express my conceptual view of jQuery, and while it may not be technically correct or even how it’s implemented, the mental model will give you gross insights as to how you ought to use the library.

Imagine if you will a class called jQuery. Rather than having to type out jQuery each time, we use an alias, a simple dollar sign (the shortest legal identifier that’s not alphanumeric). Its sole job, internally, is to maintain a collection of references to pre-existing elements in your DOM. This collection may be empty, contain one, or more elements. As the developer using jQuery, you need never see this list; you only deal with the jQuery object itself. Ever.

jQuery has many overloaded constructors, which is how it learns what elements to keep in its internal list. You can provide it a reference to an element, a kind of element, an id of an element, a CSS class used by elements, an XPath to one or more elements, straight blocks of HTML, etc. It can even use another jQuery object (which contains a list). jQuery has exotic syntax for picking very specific elements based on conditions and attributes; it even has filters for removing elements from the list.

The actual list isn’t important, because after it’s done with the constructor, all you have is a jQuery object. And the only thing you can do at that point is call jQuery methods. But, oh how clever is jQuery!

Anytime you call a method of jQuery, it does an internal for-each across its internal list, applying your method to every DOM element in its internal collection. What’s more, when it’s done, it returns the very jQuery object that was just used.

Object oriented developers know what this means: you can chain methods, creating long strings of behaviors!
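The mental model can be written down as a toy class. To be clear, MiniQuery is invented for this sketch and is not how jQuery is implemented; plain objects stand in for DOM elements so it runs outside a browser.

```javascript
// Toy model of the idea: wrap a list, loop over it in every method, return this.
// MiniQuery is invented for illustration and is not how jQuery is implemented.
function MiniQuery(elements) {
  this.elements = elements;
}

MiniQuery.prototype.each = function (fn) {
  for (var i = 0; i < this.elements.length; i++) {
    fn(this.elements[i], i);
  }
  return this; // returning the wrapper is what makes chaining possible
};

MiniQuery.prototype.attr = function (name, value) {
  return this.each(function (element) {
    element[name] = value;
  });
};

MiniQuery.prototype.addClass = function (className) {
  return this.each(function (element) {
    element.className = element.className ? element.className + " " + className : className;
  });
};

// Plain objects stand in for DOM elements so the sketch runs anywhere.
var fakeElements = [{ className: "" }, { className: "note" }];
new MiniQuery(fakeElements).attr("title", "hello").addClass("active");

console.log(fakeElements[0]); // { className: 'active', title: 'hello' }
console.log(fakeElements[1]); // { className: 'note active', title: 'hello' }
```

Each method touches every element in the list and hands back the same wrapper, so the two calls in the chain read as one sentence.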

jQuery directly manipulates the DOM in a browser-specific manner under the hood, so that you get one, simple, transparent, elegant way of expressing what you want. The actual implementation details are no concern; if a method exists, it works the same way everywhere, regardless of browser.

And, because jQuery operates on numerous elements by twiddling the DOM, it’s possible to write a small piece of code but hook it in all over the place… a process that used to be quite tedious, but can now be done after the fact, meaning your raw HTML is uncluttered.

The Simple Example


Let’s look at a simple tutorial-like example.


$(".xyzzy a").click(function(){
  alert("Magic!");
  return false;
});

Quite literally, this says create a jQuery object that is a collection of every anchor in containers with a class of ‘xyzzy’, then assign its onClick event handler to reference a function (that has no name!) that displays an alert message.

In Conclusion


That’s pretty much it. The two things to learn are the various constructs and types that can be passed to the constructors and filters, and the various methods that affect those elements. That’s the meat of it.

jQuery has other helper functions and such, but those are easily mastered. And, once you’ve got those under your belt, check out the plug-ins that are additional methods bolted on to the jQuery object.