Sanitize Your Inputs?

I'm often accused of being particularly fussy with regard to language and word choice, especially in technical discussions. It's true, but I'll wear that badge with pride. In software engineering, there are many instances where clear communication is so critical that the success or downfall of an entire organization may rest upon it.

There's one particularly slippery term that wreaks havoc in the pursuit of application security.

Sanitize.

I say it's slippery because there is simply no industry-wide agreement on its meaning, and therefore when used, the speaker and his or her audience cannot be entirely sure they understand each other. Its appearance in any discussion should immediately prompt the question, "What do you mean by that?"

Does it mean removing undesirable data while letting the good stuff through? Or converting potentially harmful data into a harmless form? Or flat-out rejecting a request when any invalid data is detected? Or perhaps it even means using prepared statements to protect the database from malicious input. I've seen "sanitize" used to mean any (and even all) of these things.

That's worrisome because these techniques are not interchangeable, especially when it comes to preventing SQL injection. In that case, using prepared statements is the only way to reliably protect your database from SQL injection attacks without the risk of mangling incoming data.

Perhaps the author of the famous Bobby Tables comic actually intended the mom's snarky response to mean "use prepared statements" instead of filtering the input, but that would be entirely lost on the beginner developer who reads the comic and Googles "sanitize database inputs" to find scores of highly-ranked guides that confidently recommend modifying the input string. (Thank goodness the one guide that tends to top the search results makes it clear that "sanitizing" your inputs is prone to error and promotes prepared statements instead.)

Sanitize your inputs? I think not.

Not just because it's wrong. Because it's meaningless.

Let's look at a few fundamental principles for application integrity, and through them find the best language for clear communication.

Validate on Input

At every stage of input, ensure that the incoming data is valid according to the requirements of that part of the application. There are many layers in any application, and they all have a job to do. Each one expects to be given certain information that it needs to do its work, and it pays dividends to be as explicit as possible.

Does this PHP class method need a date and time to do its job? Type hint that you're expecting DateTimeImmutable in the method signature, and if the code calling that method doesn't provide the right information, PHP will throw a TypeError. This is validation using the built-in capabilities of the language right at the point where the method is being invoked.

class Foo
{
    public function bar(DateTimeImmutable $dateTime)
    {
        // Do something with $dateTime
    }
}
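
To see that check in action, here is a small sketch that calls a type-hinted method with the wrong kind of value. The Schedule class and its describe() method are hypothetical names used only for illustration; the behavior itself (a TypeError when the argument doesn't match an object type hint) is built into PHP 7 and later.

```php
<?php

// Hypothetical class for illustration; only the type hint matters here.
class Schedule
{
    public function describe(DateTimeImmutable $dateTime): string
    {
        return $dateTime->format('Y-m-d');
    }
}

$schedule = new Schedule();

// A real DateTimeImmutable passes the type check.
$formatted = $schedule->describe(new DateTimeImmutable('2020-01-15'));

// A plain string does not, even though it looks like a date.
$errorClass = null;
try {
    $schedule->describe('2020-01-15');
} catch (TypeError $e) {
    $errorClass = get_class($e);
}
```

The method body never runs for the second call, so nothing inside it has to guess whether it was handed a usable date.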

Need a positive integer? Declare the parameter type to be int, and consider using an assertion library like Webmozart Assert to require the incoming data to be greater than 0 before any other work is done in the method. This combines built-in validation features and a broadly-used, well-tested third-party solution to ensure you're working with meaningful data.

use Webmozart\Assert\Assert;

class Foo
{
    public function bar(int $eventId)
    {
        Assert::greaterThan($eventId, 0, 'The event ID must be a positive integer. Got: %s');
        // Do something with $eventId
    }
}

Expecting an argument to be a string with the value of either "month" or "year"? If the value of the incoming data doesn't match one of the two (and your business dictates that it's not possible to set a reasonable default), throw an InvalidArgumentException (or use Webmozart Assert's oneOf assertion). This is a technique called whitelisting.

class Foo
{
    public function bar(string $timeFrame)
    {
        // Normalize case so "Month" and "month" are treated the same.
        $timeFrame = mb_strtolower($timeFrame);
        // Strict comparison (third argument) avoids loose type juggling.
        if (! in_array($timeFrame, ['month', 'year'], true)) {
            throw new \InvalidArgumentException(
                "TimeFrame must be either 'month' or 'year'. Got: {$timeFrame}"
            );
        }
        // Do something with $timeFrame
    }
}

Even better, consider using enums when the value should always be one of a declared (i.e. enumerated) list of values.
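
As a rough sketch of the enum approach, assuming PHP 8.1 or later (where native enums are available): the TimeFrame enum and Report class below are hypothetical names for illustration. Raw input is converted once at the edge of the application, and everything past that point can rely on the type declaration.

```php
<?php

// Hypothetical enum for illustration; requires PHP 8.1 or later.
enum TimeFrame: string
{
    case Month = 'month';
    case Year = 'year';
}

class Report
{
    // The type declaration alone guarantees a valid time frame here.
    public function summarize(TimeFrame $timeFrame): string
    {
        return "Summary by {$timeFrame->value}";
    }
}

// At the edge of the application, convert raw input into the enum.
// tryFrom() returns null for anything that is not a declared value.
$valid = TimeFrame::tryFrom('year');
$invalid = TimeFrame::tryFrom('decade');

$summary = (new Report())->summarize($valid);
```

If rejecting the request outright is preferable to handling null, TimeFrame::from() throws a ValueError for an undeclared value instead.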

Input validation is stricter than what most developers imagine when they think of sanitizing inputs. Rather than merely "cleaning" the incoming data, we're ensuring it adheres to a very specifically-defined format or rejecting it entirely.

By declaring and enforcing these expectations, the application is a lot less likely to exhibit unexpected or undesirable behavior, the playground of nearly all security vulnerabilities. This approach is not sufficient to protect against every threat (no single technique is), but ensuring the integrity of the data moving around the application goes a long way in reducing an application's attack surface.

Beyond the improvement in security, your engineering team will enjoy working in a far more intelligible codebase, and the business will benefit from more reliable features delivered more quickly.

Read more on input validation in the excellent article The Basics of Web Application Security on Martin Fowler's website.

Send Query and Parameters Separately to the Database

SQL injection happens when an attacker sneaks additional database instructions into your existing query. As noted above, the most famous example is smuggling a "drop table" statement in with an existing query, designed to maliciously destroy an entire database table.

The technique to prevent this type of attack is fairly straightforward. Isolate data from the instructions designed to operate on it, then (and this is important) literally send them as separate messages to the database server.

This allows your application to query the database server like so: "Give me all the columns from the Students table for rows where the first_name is ___; I'll send a separate message with something to fill in the blank." Then a very short time later, your application sends another message, "Fill in that blank with Robert."

If the database server receives Robert'); DROP TABLE Students;-- instead, it won't execute the DROP TABLE statement. The database server knows that's a value, so it won't let it alter the original instructions it received. It will treat that value literally, search for a student named Robert'); DROP TABLE Students;--, and return nothing.

It's straightforward, fool-proof, and unlike "sanitizing" an input string, carries no risk of accidentally mangling the incoming data.
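
In PHP, PDO exposes this as prepare() followed by execute(). The following is a minimal sketch, assuming the pdo_sqlite extension is available; it uses an in-memory database and a hypothetical Students table so it can run on its own. With a networked server such as MySQL, PDO emulates prepared statements by default, so you would also set PDO::ATTR_EMULATE_PREPARES to false to have the query and its values sent separately.

```php
<?php

// In-memory SQLite database so the sketch is self-contained.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$pdo->exec('CREATE TABLE Students (id INTEGER PRIMARY KEY, first_name TEXT)');
$pdo->exec("INSERT INTO Students (first_name) VALUES ('Robert')");

// The query contains only a placeholder where the value belongs.
$statement = $pdo->prepare('SELECT * FROM Students WHERE first_name = :first_name');

// The value is supplied separately and is only ever treated as data.
$statement->execute(['first_name' => "Robert'); DROP TABLE Students;--"]);
$maliciousMatches = $statement->fetchAll(PDO::FETCH_ASSOC);

// The same prepared statement can be reused with a different value.
$statement->execute(['first_name' => 'Robert']);
$robertMatches = $statement->fetchAll(PDO::FETCH_ASSOC);

// The table is still there, with its one row intact.
$rowCount = (int) $pdo->query('SELECT COUNT(*) FROM Students')->fetchColumn();
```

Note that the input string is never modified at any point; it is simply never given the chance to be read as SQL.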

For more, read my post on using prepared statements to prevent SQL injection attacks.

Encode on Output

Encoding is simply the conversion of data into a format that can be understood by an external consumer. You're already doing it to some degree, even if you don't realize it.

At the end of most requests, your PHP application will have something to output. The classic Hello, world! example spits out a simple HTML page with the phrase "Hello, world!" That HTML page is one particular format, and when it's a browser making the request, it's a format the browser knows how to interpret.

In the context of a web application's output, encoding is a concept that encompasses several ways of preparing that outgoing data (advisably in this order):

1. Filter out anything that shouldn't be there (e.g. remove all but a few permitted HTML tags from user-generated content that came from a WYSIWYG field)
2. Escape strings that may contain harmful characters (e.g. replace single quotes in JavaScript with their escape sequence counterparts)
3. Package the information in a format that the client expects (e.g. HTML, JSON, XML, etc.)
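
Those three steps might look something like the following when rendering a user comment as HTML. This is only a sketch using PHP's built-in functions, and the sample strings are made up for illustration. In particular, strip_tags() does not remove attributes from the tags it permits, so a real application handling WYSIWYG content would more likely reach for a dedicated HTML filtering library.

```php
<?php

// Raw user-generated content, stored exactly as it was submitted.
$comment = '<p>Nice <strong>post</strong>!</p><script>alert("xss")</script>';
$author = 'Tom "T" O\'Neil <tom@example.com>';

// 1. Filter: keep only a short list of permitted tags.
//    The script tags are removed; their text content remains as plain text.
$filtered = strip_tags($comment, '<p><strong>');

// 2. Escape: neutralize characters that have special meaning in HTML.
$escapedAuthor = htmlspecialchars($author, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

// 3. Package: assemble the format the client expects, here an HTML fragment.
$html = "<article><h2>{$escapedAuthor}</h2>{$filtered}</article>";
```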

It's not uncommon to see applications filter and escape data coming into the application (this is what's often called "sanitizing"), but that should really be avoided.

There are security concerns at stake:

If you store sanitized data in a database, and then a SQL injection vulnerability is found elsewhere, the attacker can totally bypass your XSS protection by polluting the trusted-to-be-sanitized record with malware.

Paragon Initiative Enterprises, The 2018 Guide to Building Secure PHP Software

And encoding on input raises maintainability concerns as well:

Be warned: you might be tempted to take the raw user input, and do the encoding before storing it. This pattern will generally bite you later on. If you were to encode the text as HTML prior to storage, you can run into problems if you need to render the data in another format: it can force you to unencode the HTML, and re-encode into the new output format. This adds a great deal of complexity and encourages developers to write code in their application code to unescape the content, making all the tricky upstream output encoding effectively useless. You are much better off storing the data in its most raw form, then handling encoding at rendering time.

Cade Cairns and Daniel Somerfield, The Basics of Web Application Security
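
To make that concrete, here is a small sketch of one raw stored value being encoded three different ways at rendering time. The sample value and the example.com URL are placeholders for illustration.

```php
<?php

// The value as stored: raw, exactly as the user entered it.
$bandName = 'Tom & "The <b>Bold</b>"';

// Encoded for an HTML page at render time.
$forHtml = htmlspecialchars($bandName, ENT_QUOTES, 'UTF-8');

// Encoded for a JSON API response at render time.
$forJson = json_encode(['name' => $bandName], JSON_THROW_ON_ERROR);

// Encoded for use in a URL query string (placeholder URL).
$forUrl = 'https://example.com/search?q=' . rawurlencode($bandName);
```

Had the value been HTML-encoded before storage, the JSON and URL versions would each need to undo that encoding first, which is exactly the complexity the quote warns about.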

So encode where it makes the most sense: on output. And don't rely on these techniques to protect your database or ensure the validity of data flowing around your application. Escaping and filtering might sometimes provide those security benefits accidentally, but that's not what they were designed to do and they often come with a hidden cost.

The sources of the preceding quotes offer a wealth of information on the how and why of output encoding, and I highly recommend them both for further reading: the Encode HTML Output section of The Basics of Web Application Security; and the Cross-Site Scripting (XSS) section of The 2018 Guide to Building Secure PHP Software.

So the next time you're tempted to use "sanitize" to mean...

removing undesirable data while letting the good stuff through?

May I recommend "filtering" instead?

Or converting potentially harmful data into a harmless form?

Calling it "escaping" would ensure your point is made clearly.

Or flat-out rejecting a request when any invalid data is detected?

Opt for "validation" instead.

Or to protect the database from malicious input?

Remember that the only reliable solution is using prepared statements.

We don't build this stuff alone. Software engineering is a human endeavor, and building great software means working well with other people. I hope this overview helps encourage much clearer communication amongst your team.

Just please, whatever you do... stop sanitizing your inputs.

Update: A few people have suggested that many of the concepts here are covered by the adage "Filter Input, Escape Output". I have no desire to reinvent the wheel, so if there's an industry standard, I certainly want to stick to it.

However, I could only find very limited usage of this maxim. The majority of the references are at least a decade old, and all references are from PHP-related discussions that directly point back to Chris Shiflett's blog post and book, suggesting the maxim didn't gain traction outside the PHP community. Detailed discussions reveal that this maxim's use of "filter" and "escape" suffers from ambiguity as well. The words used to clarify what "filter" and "escape" mean tend to be the very language recommended here.

For the sake of clarity, I've opted to recommend OWASP's terminology for the input stage and to break down the several parts of preparing output (which involves more than just escaping strings) using modern, well-respected sources both inside and outside the PHP community.
