So I reduced the problem to a trivial case.
What would you suppose the following code block does in a browser?
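A minimal reconstruction of that page, held in a JavaScript string so the exact characters are unambiguous. The quote style and line layout are assumptions, and `trivialPage` is just an illustrative name:

```javascript
// Reconstruction of the trivial test page; quoting and layout are assumed.
const trivialPage = [
  '<SCRIPT>',
  'alert("</SCRIPT>");',
  '</SCRIPT>',
].join('\n');

// Print the page exactly as it would be saved to disk.
console.log(trivialPage);
```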
To my eyes, this should produce an alert box with the simple text </SCRIPT> inside it. Nothing special.
However, in every browser I tried (IE 7, Firefox, Opera, and Safari) on every platform (XP, Vista, OS X), it didn’t. The close tag inside the quoted literal terminated the script block, and the closing punctuation left over from the statement was printed on the page.
Change </SCRIPT> to just <SCRIPT>, and you get the alert box as expected.
So, I did more reading and more testing. I looked at the hex dump of the file to see if perhaps there was something strange going on. Nope, plain ASCII.
In fact, if you start playing with other strings, you get these results:
- <\/SCRIPT> …displays </SCRIPT>. I suppose you can escape a forward slash, but there should be no need to. Ever. See the prior example.
- </SCRIPTX> …works (note the extra character, an X).
It was after this that I turned to several security and web experts for help.
Why security experts?
The primary concern is obviously cross-site scripting. We’re taking untrusted sites and displaying portions of the data stream. Should an attacker be able to insert </SCRIPT> into the stream, followed by a few comment characters and a freshly opened <SCRIPT> block, he’d be able to mess with cookies, twiddle the DOM, dink with AJAX, and do things that compromise the trust of the server.
The explanation came from Phil Wherry.
What the HTML parser saw was this:
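The same view can be approximated in a few lines of JavaScript. This is a sketch under stated assumptions, not a browser's actual algorithm: `splitAtScriptEnd` is a hypothetical helper, the input is a reconstruction of the trivial case, and the regular expression only roughly mirrors the rule that a script block ends at `</script` followed by whitespace, `/`, or `>`.

```javascript
// Rough approximation of how the HTML parser finds the end of a script
// block: "</script" in any case, followed by whitespace, "/" or ">".
const SCRIPT_END = /<\/script[\t\n\f\r \/>]/i;

// Hypothetical helper: split the text after an opening <SCRIPT> tag into
// what the JavaScript engine receives and what goes back to the HTML parser.
function splitAtScriptEnd(afterOpenTag) {
  const end = afterOpenTag.search(SCRIPT_END);
  if (end === -1) {
    return { script: afterOpenTag, rest: "" };
  }
  return {
    script: afterOpenTag.slice(0, end),
    rest: afterOpenTag.slice(end),
  };
}

// Reconstruction of the trivial case (quote style is an assumption).
const afterOpenTag = 'alert("</SCRIPT>");\n</SCRIPT>';
const seen = splitAtScriptEnd(afterOpenTag);

console.log(JSON.stringify(seen.script)); // fragment handed to JavaScript
console.log(JSON.stringify(seen.rest));   // what the HTML parser carries on with

// The other strings from the experiments, checked against the same rule.
const variants = ["</SCRIPT>", "<SCRIPT>", "<\\/SCRIPT>", "</SCRIPTX>"];
const results = variants.map((text) => SCRIPT_END.test(text));
variants.forEach((text, i) => console.log(text, results[i]));
```

Only the first variant matches the end-tag pattern, which lines up with the results listed earlier: the script fragment stops at an unterminated string, and everything after the first close tag is treated as ordinary HTML.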
And there you have it: not only is the syntax error now obvious, but the HTML is malformed as well.
The fix is to escape untrusted text before it goes into a script block:
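- Escape all quotes, replacing ' with \' and " with \". This stops the string from being terminated.
- Escape all angle brackets, replacing < with \< and > with \>. This stops the tags from being recognized. Note that \< still leaves the characters </ next to each other in the page, so the hex escapes \x3C and \x3E are the safer choice.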
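The two steps in the list can be sketched as a small function. `escapeForInlineScript` is a hypothetical name, not part of any library. The sketch makes two choices beyond the list: it escapes backslashes first so the input cannot undo the later escapes, and it writes angle brackets as hex escapes (`\x3C`, `\x3E`) rather than `\<` and `\>`, because a backslash in front of `<` still leaves `</SCRIPT` in the page.

```javascript
// Hypothetical helper: make untrusted text safe to place inside a quoted
// string literal in an inline <SCRIPT> block.
function escapeForInlineScript(untrusted) {
  return untrusted
    // Assumption beyond the list: escape backslashes first, so the input
    // cannot cancel the escapes added below.
    .replace(/\\/g, "\\\\")
    // Quotes: keeps the string literal from being terminated early.
    .replace(/'/g, "\\'")
    .replace(/"/g, '\\"')
    // Angle brackets: hex escapes keep raw "<" and ">" out of the page.
    .replace(/</g, "\\x3C")
    .replace(/>/g, "\\x3E");
}

// A hostile value of the kind described above.
const hostile = '</SCRIPT><SCRIPT>alert("xss")</SCRIPT>';

// The literal that would be written between the script tags.
const literal = '"' + escapeForInlineScript(hostile) + '"';
console.log(literal);
```

The resulting literal contains no raw angle brackets, yet evaluates back to the original text when the script runs. It is not a complete encoder: line breaks and other control characters in the input would need the same treatment.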