Introduction to Web Application Security

A Prelude in Three Parts

Edward Z. Yang

MIT 2012

An Introduction


Part 1: String is not a type

  1. What is XSS?
  2. Why is XSS bad?
  3. "How to stop XSS"
  4. String is not a type: Format and Context
  5. The Wrong Way, a Better Way, and the Right Way
  6. Practical considerations
  7. The Encoding Story

What is XSS?

Why is XSS bad?

How to stop XSS?

SQL injection is like XSS

(courtesy XKCD)

The anatomy of an SQL injection vulnerability is exactly the same as that of XSS, except that instead of HTML we're dealing with SQL. The effects differ (with SQL injection an attacker can steal user data or destroy your database), but that difference comes from what SQL can do compared to HTML, not from the shape of the bug. The two are manifestations of a more general phenomenon.
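
To make the anatomy concrete, here is a minimal Python sketch using the standard sqlite3 module and an in-memory database. The table, the function name and the inputs are all made up for illustration; the point is only that plaintext pasted into a query gets read as SQL.

```python
import sqlite3

# Hypothetical schema, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT)")
conn.execute("INSERT INTO students VALUES ('Alice')")
conn.execute("INSERT INTO students VALUES ('Bob')")


def find_student_unsafe(name):
    # VULNERABLE: the plaintext name is concatenated into the query,
    # so any quote character in it is interpreted as SQL syntax.
    query = "SELECT name FROM students WHERE name = '" + name + "'"
    return [row[0] for row in conn.execute(query)]


intended = find_student_unsafe("Alice")
# The attacker's "name" closes the string literal and adds a condition
# that is always true, so every row matches.
hijacked = find_student_unsafe("' OR '1'='1")
print(intended)
print(hijacked)
```

Swap the quote for an angle bracket and the database for a browser, and you have XSS.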

String is not a type

What other information is needed to interpret strings in a web application?

A string is not just a string. Where the string is going, and what format its contents are in, are vital to understanding what can safely be done with it.

What does your string contain?

Some possibilities:

Plaintext is the simplest story; it's benign and universal. We can then structure the text, adding extra information to it in the form of markup, so that we can do interesting things like make a word bold or initialize an array with four members (JSON).

Escaping is a format change
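
One way to read this slide title, sketched in Python with the standard html module (the sample text is made up): escaping converts a value from the plaintext format to the HTML format, unescaping converts it back, and escaping twice means you treated HTML as if it were plaintext.

```python
import html

plaintext = "Tom & Jerry <3"

as_html = html.escape(plaintext)     # plaintext -> HTML
round_trip = html.unescape(as_html)  # HTML -> plaintext

# Escaping something that is already HTML is a format mix-up:
# the result no longer decodes to the original text.
escaped_twice = html.escape(as_html)

print(as_html)
print(escaped_twice)
```

Nothing about the Python type changed along the way; all three values are plain str. That is the sense in which string is not a type.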

Format is context sensitive

<img src="" alt="Double-quoted text" />
<a title='Single-quoted text'>Regular text</a>

There are four distinct plaintext snippets in this block of code:

Format is context sensitive (II)

Consider the following text snippets:

Example in SQL

Mixing up context in other languages is usually more obvious.

Quiz time!

Secure or not?

We're going to use a few PHP-specific examples. For those unfamiliar, htmlspecialchars() is an HTML escaping function that is safe for double-quoted attributes and regular text, and mysql_real_escape_string() is an SQL escaping function.
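
Why "safe for double-quoted attributes and regular text" is a narrower promise than it sounds can be sketched in Python. The escape_compat function below is a hypothetical stand-in that escapes &, <, > and the double quote but leaves the single quote alone; it is not PHP's implementation, and the payload is made up. The standard html.parser module is used only to show how a parser reads the result.

```python
from html.parser import HTMLParser


def escape_compat(text):
    # Hypothetical stand-in: handles &, <, > and " but not '.
    return (text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace('"', "&quot;"))


class AttrCollector(HTMLParser):
    """Records the attributes of every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)


def attributes_of(markup):
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return dict(parser.attrs)


payload = "x' onmouseover='alert(1)"

# Same escaping function, two different contexts.
double_quoted = '<a title="' + escape_compat(payload) + '">'
single_quoted = "<a title='" + escape_compat(payload) + "'>"

print(attributes_of(double_quoted))
print(attributes_of(single_quoted))
```

In the double-quoted context the payload stays inside the title attribute. In the single-quoted context it breaks out and becomes an event handler attribute of its own.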

Question 1

<?php echo htmlspecialchars($foo); ?>

Question 2

<a href="<?php
  echo htmlspecialchars($foo); ?>">

Question 3

<?php echo mysql_real_escape_string($foo) ?>

Question 4

In a JavaScript file:
var variable = "<?php
    echo htmlspecialchars($data); ?>";

Side note: The above-mentioned use of JSON is not conforming, and there's some talk of changing the function in PHP. Buyer beware!
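
Assuming the JSON use mentioned above means serializing the value as a JSON string literal, the difference between the two formats can be sketched in Python with the standard html and json modules. The sample data is made up, and the sketch covers a standalone JavaScript file only, not script embedded in an HTML page.

```python
import html
import json

data = 'Tom & Jerry\nsay "hi"'

# HTML escaping targets the wrong format: the raw newline survives and
# breaks the JavaScript string literal, and JavaScript never decodes
# the &amp; and &quot; entities.
wrong = 'var variable = "' + html.escape(data) + '";'

# A JSON string literal is also a valid JavaScript string literal:
# quotes, backslashes and newlines are all backslash-escaped.
right = "var variable = " + json.dumps(data) + ";"

print(wrong)
print(right)
```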

String is not a type: A summary

The wrong way

A setup like this makes it really difficult to answer the two questions we posed earlier. Did we escape it already? Do I have to de-escape and re-escape? What escaping function should I use? You shouldn't need to ask these questions.

A better way

The right way

How do we, on a programmatic level, enforce the design principles in the previous slide?

For SQL: Bound Parameters
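
A minimal sketch of bound parameters, using Python's standard sqlite3 module and an in-memory database. The table and function names are made up for illustration. The value travels next to the query rather than inside it, so it is never parsed as SQL.

```python
import sqlite3

# Hypothetical schema, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT)")

hostile_name = "Robert'); DROP TABLE students;--"

# Each ? is a placeholder; the values are supplied separately as a tuple.
conn.executemany(
    "INSERT INTO students VALUES (?)",
    [("Alice",), ("Bob",), (hostile_name,)],
)


def find_student(name):
    rows = conn.execute("SELECT name FROM students WHERE name = ?", (name,))
    return [row[0] for row in rows]


print(find_student("Alice"))
print(find_student("' OR '1'='1"))  # just an odd name that nobody has
print(find_student(hostile_name))   # stored and retrieved verbatim
```

Note that there is no escaping function anywhere in this code, which is the point.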

For HTML: DOM Builder

An added benefit of DOM builders, beyond security, is the fact that their output is guaranteed to be well-formed.
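
Both properties can be sketched in Python with the standard xml.etree.ElementTree module standing in for a DOM builder (it produces XML, and the username and href values are made up). Text and attribute values go in as plaintext, and the serializer picks the right escaping for each context.

```python
import xml.etree.ElementTree as ET

username = '<script>alert("hi")</script> & co'
target = 'page?a=1&b="2"'

p = ET.Element("p")
p.text = "Welcome "
em = ET.SubElement(p, "em")
em.text = username  # regular text context
em.tail = ". Here's a "
a = ET.SubElement(p, "a", {"href": target})  # attribute context
a.text = "link"

markup = ET.tostring(p, encoding="unicode")
print(markup)

# Well-formed by construction: the output parses back into the same
# tree, and the original plaintext values come out unchanged.
reparsed = ET.fromstring(markup)
```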

For Shell Code: Multiarg Exec
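
A minimal sketch of a multi-argument exec, using Python's standard subprocess module. To stay portable the child process is the Python interpreter itself, which simply prints the one argument it receives; the hostile file name is made up.

```python
import subprocess
import sys

filename = "notes.txt; echo INJECTED"

# Each list element becomes exactly one argument of the child process.
# No shell ever parses the string, so ";" is just a character.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", filename],
    capture_output=True,
    text=True,
    check=True,
)
received = result.stdout.rstrip("\r\n")
print(received)
```

The child sees the whole string as a single argument, semicolon included.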

For URLs: URL builder

This is, admittedly, not a great example. You'll appreciate URL builders more when you are handed a URL that already has a query string and need to add another parameter to it. Given a string URL, you would have to append an ampersand and the new key-value pair, unless the parameter already exists (then you must replace it) or there is no query string yet (then you need a question mark instead). With a builder, you simply set the value in the hash and go on your merry way.
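
The three cases can be sketched in Python with the standard urllib.parse module. The set_query_param helper and the example.com URLs are made up for illustration; the query string is handled as a dictionary and only turned back into text at the very end.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def set_query_param(url, key, value):
    """Return url with the query parameter key set to value."""
    parts = urlsplit(url)
    # Note: a plain dict keeps only the last value of a repeated key.
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params[key] = value  # adds or replaces; no "?" or "&" bookkeeping
    return urlunsplit(parts._replace(query=urlencode(params)))


no_query = set_query_param("http://example.com/search", "q", "a&b")
has_query = set_query_param("http://example.com/search?page=2", "q", "x")
replaced = set_query_param("http://example.com/search?q=old&page=2", "q", "new")
print(no_query)
print(has_query)
print(replaced)
```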

Practical considerations

So if Safe APIs are so good, why hasn't everyone switched to using DOM builders yet?

Verbose and difficult to use

Here is an example of writing HTML with concatenation:

  Welcome <em><?php echo htmlspecialchars($username) ?></em>.
  Here's a <a href="">link</a>

And a corresponding example with a DOM interface:

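$doc = new DOMDocument();
$p = $doc->createElement('p');
$p->appendChild($doc->createTextNode('Welcome '));
$em = $doc->createElement('em');
// createTextNode() escapes its argument; createElement()'s value argument does not escape "&"
$em->appendChild($doc->createTextNode($username));
$p->appendChild($em);
$p->appendChild($doc->createTextNode(". Here's a "));
$a = $doc->createElement('a', 'link');
$a->setAttribute('href', '');
$p->appendChild($a);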

The DOM version is substantially longer and more difficult to understand.

Verbose and difficult to use (II)

Consider another function commonly used to format text entries:

<?php echo nl2br(htmlspecialchars($text)); ?>

The DOM equivalent is:

foreach (explode("\n", $text) as $i => $part) {
    if ($i !== 0) $b->appendChild($doc->createElement('br'));
    $b->appendChild($doc->createTextNode($part));
}

One last thing: DOM tools are XML-oriented, so expect a bit of post-processing, especially with XSLT.

Verbose? Maybe you can fix it!

Not native

Performance and memory usage

The Encoding Story

What is text?

Numbers given form (ASCII)


Multibyte encodings (UTF-8)


I'm not here to evangelize UTF-8/Unicode, but use it!

Why should you care?

Checking for Well-Formedness
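
A minimal sketch of a well-formedness check, assuming the input is supposed to be UTF-8: let a strict decoder do the work. The function name and sample byte strings are made up; Python's built-in decoder rejects truncated sequences, overlong forms and encoded surrogates.

```python
def is_well_formed_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(is_well_formed_utf8("héllo".encode("utf-8")))
print(is_well_formed_utf8(b"\xc3\x28"))      # lead byte without a continuation byte
print(is_well_formed_utf8(b"\xc0\xaf"))      # overlong encoding of "/"
print(is_well_formed_utf8(b"\xed\xa0\x80"))  # encoded surrogate
```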

Checking for invalid codepoints

Bold codepoints are not permitted in XML; italicized codepoints are not permitted in XML but are permitted in HTML (such is the strange story of the form feed). If you want to be safe, nuke all of these codepoints.

A quick note on implementing a function that removes these invalid codepoints: your regular expressions library probably has native support for Unicode, so use expressions like \x{FFFE} to match noncharacters and strip them out.
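
A minimal sketch of such a function in Python, using the standard re module. The exact codepoint list from the slide is not reproduced here; this version assumes "invalid" means anything outside the XML 1.0 Char production, plus the Unicode noncharacters, which is one conservative reading of "nuke all of these". The function and pattern names are made up.

```python
import re

# Complement of the XML 1.0 "Char" production: tab, LF, CR,
# U+0020..U+D7FF, U+E000..U+FFFD and U+10000..U+10FFFF are allowed.
_NOT_XML_CHAR = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

# Noncharacters that the production above still lets through:
# U+FDD0..U+FDEF and the last two codepoints of planes 1 to 16.
_PLANE_ENDS = "".join(
    chr(plane * 0x10000 + offset)
    for plane in range(1, 17)
    for offset in (0xFFFE, 0xFFFF)
)
_NONCHARACTERS = re.compile("[\uFDD0-\uFDEF" + _PLANE_ENDS + "]")


def strip_invalid_codepoints(text):
    """Remove codepoints that are unsafe to emit in XML or HTML."""
    return _NONCHARACTERS.sub("", _NOT_XML_CHAR.sub("", text))


dirty = "a\x00b\x0cc\ufffed\ufdd0e\U0001fffff"
print(strip_invalid_codepoints(dirty))
```

The form feed (U+000C) is removed along with the rest, matching the "if you want to be safe" advice.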

Part 2: String filtering is crypto

Expanding the scope

How to do it?



It's complicated.

What is not safe?

What is safe?


How to do it (for HTML → HTML)

Some numbers

Shopping for a filter library

  1. Does it use a whitelist?
  2. Does it parse the HTML?
  3. Does it check attributes?
  4. Does it pass the XSS cheatsheet?
  5. Is it well known and widely used?

First off, a disclaimer: I wrote a filter library, so I'm a little partial in this domain. Still, I think this is a pretty good smoke test for evaluating a filter library. Items 1-3 deal with fundamental architectural decisions, 4 is a practical test, and 5 is good to have because it means that updates will be released more frequently and that the library has had more eyes on it. Missing one of these is not grounds for excluding a library, but be more cautious if it does.


And remember...

Update, update, update.

This is especially important if the filter is blacklist-based, since new attacks will be discovered.

Format shifting

HTML isn't exactly the most user-friendly format, so you'll often want to offer another language like BBCode, Markdown, Textile or Wikitext. In such cases:

Part 3: Request Forgery

The assumption

How to forge a GET request

<img src="" />

Okay... pretty easy you say...

How to forge a POST request

<script type="text/javascript">
    var xhr = new XMLHttpRequest();
    xhr.open("POST", "", true);
    xhr.send();
</script>

Also doable with an autosubmitting form. And yes, I know that code isn't portable. This isn't a class on how to haxor websites.

CSRF protection

<form method="post" action="logout.php">
    <input type="submit" name="logout" value="Log out">
    <input type="hidden" name="token" value="RANDOM">
</form>
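
The server side of the hidden token field can be sketched in Python with the standard secrets and hmac modules. The session here is a plain dictionary standing in for whatever per-user session storage the application has, and the function names and the csrf_token key are made up.

```python
import hmac
import secrets


def issue_token(session):
    """Create a random token, remember it in the session, and return it
    so it can be written into the hidden form field."""
    token = secrets.token_urlsafe(32)
    session["csrf_token"] = token
    return token


def is_valid_token(session, submitted):
    """Check the token that came back with the POST request."""
    expected = session.get("csrf_token")
    if not expected or not submitted:
        return False
    # Constant-time comparison, so timing does not leak the token.
    return hmac.compare_digest(expected.encode("utf-8"),
                               submitted.encode("utf-8"))


session = {}
token = issue_token(session)
print(is_valid_token(session, token))     # the genuine form submission
print(is_valid_token(session, "RANDOM"))  # a forged request has to guess
```

The forging page can make the browser send the request, but it cannot read the token out of the victim's form, so it has nothing valid to submit.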

Protection in practice

Protection in practice (II)


And now, something new

The elegance of this attack is that it bypasses all of the protections we may have put up against CSRF: the user is physically clicking on the link or submit button, and there is no way to tell whether the click was intentional. It is slightly like social engineering, except that the actions the user takes are completely reasonable.

ClickJacking protection
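
The recommendations from this slide are not reproduced here. One commonly used defense, sketched below as an assumption rather than as the slide's content, is to tell the browser never to render the page inside a frame, using the X-Frame-Options and Content-Security-Policy response headers. The WSGI middleware and demo application are made up for illustration.

```python
def forbid_framing(app):
    """WSGI middleware that asks browsers not to render responses in frames."""

    def wrapped(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            headers = list(headers) + [
                ("X-Frame-Options", "DENY"),
                ("Content-Security-Policy", "frame-ancestors 'none'"),
            ]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return wrapped


def demo_app(environ, start_response):
    # Hypothetical application, used only to exercise the middleware.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


captured = {}


def record_start_response(status, headers, exc_info=None):
    captured["status"] = status
    captured["headers"] = dict(headers)


body = forbid_framing(demo_app)({}, record_start_response)
print(captured["headers"])
```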