Validating and sanitizing data in PHP

One of the first things to do when receiving data from a request is to validate that it matches the expected format and to sanitize potentially undesired content. Since this is such a common task, PHP provides a default extension for this purpose: filter.

Using the filter extension

The filter PHP extension provides a few common filters to validate or sanitize data. You use it by calling filter_var() with the value to filter and the filter to apply. As an optional third parameter, you may pass either a bitmask of flags or an array with "options" and "flags" keys to change the behaviour of the selected filter. The function returns the filtered value on success, or false on failure.

$input = "14";
$result = filter_var($input, FILTER_VALIDATE_INT, FILTER_NULL_ON_FAILURE);
var_dump($result);

In this example, we picked FILTER_VALIDATE_INT to check if $input is a valid integer. Additionally, we passed the flag FILTER_NULL_ON_FAILURE, which makes filter_var() return NULL instead of false if the validation check fails. Since the string in $input is a valid integer, the example returns the integer 14. Changing $input to, for example, "b" makes it return NULL instead.
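Because a valid result such as 0 is falsy in PHP, it helps to compare the result strictly against NULL (or false) instead of using a loose truthiness check. The following is a minimal sketch of that idea; the helper name parseIntInRange() is made up for illustration, and it combines the FILTER_VALIDATE_INT options min_range and max_range with FILTER_NULL_ON_FAILURE.

```php
<?php
// Hypothetical helper (not part of the filter extension):
// returns the parsed integer, or null if $raw is not an integer within [$min, $max].
function parseIntInRange(string $raw, int $min, int $max): ?int
{
    return filter_var($raw, FILTER_VALIDATE_INT, [
        "options" => ["min_range" => $min, "max_range" => $max],
        "flags" => FILTER_NULL_ON_FAILURE,
    ]);
}

var_dump(parseIntInRange("14", 1, 100));  // int(14)
var_dump(parseIntInRange("0", 0, 100));   // int(0): valid, but falsy
var_dump(parseIntInRange("500", 1, 100)); // NULL: out of range
var_dump(parseIntInRange("b", 1, 100));   // NULL: not an integer

// Strict comparison keeps a legitimate 0 from being treated as a failure.
if (parseIntInRange("0", 0, 100) !== null) {
    echo "0 was accepted\n";
}
```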

Filters and flags are defined as global constants by the filter extension. Validation filters will start with FILTER_VALIDATE_ while sanitization filters will have the FILTER_SANITIZE_ prefix. The documentation contains detailed explanations of all validation filters, sanitization filters and their option flags.
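If you want to see which filters your PHP build offers, the extension can list them at runtime. This short sketch uses filter_list() and filter_id(), which map between filter names and the constants described above; the specific names checked here are only examples.

```php
<?php
// filter_list() returns the names of all supported filters,
// filter_id() translates such a name into the value of its constant.
var_dump(filter_id("int") === FILTER_VALIDATE_INT);
var_dump(filter_id("validate_email") === FILTER_VALIDATE_EMAIL);
var_dump(in_array("special_chars", filter_list(), true));

// Print every available filter name together with its numeric id.
foreach (filter_list() as $name) {
    printf("%-20s %d\n", $name, filter_id($name));
}
```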

Validating an email address

A common use case of validation is to check if an email is correctly formatted. Here is how that would look using filter_var():

$email = "bob@sample.com";
// pass the variable itself; a malformed address such as "sample@@somewhere.com" would yield false
$result = filter_var($email, FILTER_VALIDATE_EMAIL);
var_dump($result);

Email validation is not as trivial as you would expect. From just checking if the address contains @ and . characters to multiple lines of regular expressions, there are countless strategies for accomplishing this task, most of them either incorrect or resource-intensive. The advantage of using filter_var() is that it abstracts all of that away from you, so that validating email addresses against the address syntax of RFC 822 (with a few documented exceptions) is easy and efficient.

Note that this will only check if the email format is valid, not if the mailbox (or domain) actually exists.
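To get a feeling for what passes and what does not, you can run a handful of addresses through the filter. This is only an illustrative sketch: the helper isValidEmail() and the sample addresses are made up, and the last one is rejected because PHP's filter does not accept domain names without a dot.

```php
<?php
// Hypothetical helper: true if $candidate passes FILTER_VALIDATE_EMAIL.
function isValidEmail(string $candidate): bool
{
    // filter_var() returns the address itself on success and false on failure
    return filter_var($candidate, FILTER_VALIDATE_EMAIL) !== false;
}

$candidates = [
    "bob@sample.com",        // well-formed
    "sample@@somewhere.com", // two @ characters
    "no-at-sign",            // no @ at all
    "bob@localhost",         // dotless domain, not supported by the filter
];

foreach ($candidates as $candidate) {
    printf("%-24s %s\n", $candidate, isValidEmail($candidate) ? "valid" : "invalid");
}
```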

Sanitizing a username

While rarely given much attention, usernames can be tricky at times. Imagine you identify users uniquely by username. What if a user enters a username that looks the same as someone else's, except with some invisible characters in between? For the server they would be different, but for humans they are visually identical.

Stripping such undesirable characters from strings is a prime example of sanitization with filter_var():

$input = "user\nna\rme";
$result = filter_var($input, FILTER_SANITIZE_SPECIAL_CHARS, FILTER_FLAG_STRIP_LOW|FILTER_FLAG_STRIP_HIGH);
var_dump($result);

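As you can see, we sneaked a newline (\n) and a carriage return (\r) into the $input variable, but filter_var() simply returns "username" with all non-printable characters removed. Note that FILTER_FLAG_STRIP_HIGH removes every byte above 127, so non-ASCII characters are stripped as well.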

Be careful when assigning the results of sanitization filters directly, as even they can fail and return false or NULL (see the next section for an example of that).
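One way to guard against this is to wrap the sanitization in a small function that checks the result before handing it on. The sketch below is one possible approach, not the only one: sanitizeUsername() is a hypothetical helper, and treating an empty result as a failure is an assumption about what your application wants.

```php
<?php
// Hypothetical helper: returns the cleaned username, or null if it cannot be used.
function sanitizeUsername($raw): ?string
{
    $clean = filter_var(
        $raw,
        FILTER_SANITIZE_SPECIAL_CHARS,
        FILTER_FLAG_STRIP_LOW | FILTER_FLAG_STRIP_HIGH
    );

    // filter_var() returns false on failure, for example when it is given
    // an array instead of a scalar value. An empty string means nothing
    // usable was left after stripping (assumption: we reject that too).
    if ($clean === false || $clean === "") {
        return null;
    }

    return $clean;
}

var_dump(sanitizeUsername("user\nna\rme"));         // string(8) "username"
var_dump(sanitizeUsername(["not", "a", "string"])); // NULL
var_dump(sanitizeUsername("\n\r"));                 // NULL
```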

A closer look at filter_input()

Filtering is most commonly applied to data received in a request. For this purpose, the filter extension includes the filter_input() function, which takes a value directly from the supplied input source (given as a constant, for example INPUT_GET for $_GET or INPUT_SERVER for $_SERVER).

When reading the documentation, you may be confused why there are both filter_var() and filter_input(). At first glance, they seem to do the same thing. For example, these two lines appear to be equivalent:

$n = filter_var($_GET["number"], FILTER_SANITIZE_NUMBER_INT);
$n = filter_input(INPUT_GET, "number", FILTER_SANITIZE_NUMBER_INT);

However, filter_input() behaves slightly differently from filter_var() in two ways:

For starters, it returns the value on success, false on error and NULL if the variable was not present in the input source. This saves you from having to first check if the variable $_GET["number"] even exists before filtering it, as it is done for you implicitly.

The second difference is that filter_input() will operate on the data received in the request, not the current state of the variable. Assume this script was called without any GET parameters:

$_GET["number"] = "11";
$n = filter_var($_GET["number"], FILTER_SANITIZE_NUMBER_INT); // $n is now 11
$n = filter_input(INPUT_GET, "number", FILTER_SANITIZE_NUMBER_INT); // $n is now NULL

Even though we changed $_GET["number"] within our code, filter_input() read the value directly from the request input and not from our altered variable. This may prevent some attack vectors or safeguard against side effects of functions like extract().
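The three possible outcomes of filter_input() (a value, false, or NULL) are easiest to tell apart with strict comparisons. The following sketch shows one way to do that; describeFilterResult() is a hypothetical helper, and the "number" parameter is the same illustrative name used above. When run from the command line there is no query string, so the missing case applies.

```php
<?php
// Hypothetical helper: turns the outcome of filter_input() into a readable message.
function describeFilterResult($result): string
{
    if ($result === null) {
        return "parameter missing";
    }
    if ($result === false) {
        return "parameter invalid";
    }
    return "parameter ok: " . var_export($result, true);
}

// In a web request this reads ?number=... straight from the query string.
$n = filter_input(INPUT_GET, "number", FILTER_VALIDATE_INT);
echo describeFilterResult($n), "\n";
```

Keep in mind that passing FILTER_NULL_ON_FAILURE swaps the two failure values: filter_input() then returns false for a missing variable and NULL for a failed filter.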

Filtering multiple variables at once

When receiving data, you often need to validate multiple fields. To make this process easier for developers, the filter extension includes array variants: filter_var_array() and filter_input_array().

Assume you are validating data for a new user signing up:

$filters = [
   "username" => [
      "filter" => FILTER_SANITIZE_SPECIAL_CHARS,
      // flags are combined into a single bitmask with |, not passed as an array
      "flags" => FILTER_FLAG_STRIP_LOW | FILTER_FLAG_STRIP_HIGH,
   ],
   "age" => [
      "filter" => FILTER_VALIDATE_INT,
      "options" => ["min_range" => 18],
   ],
   "email" => FILTER_VALIDATE_EMAIL,
];
$userData = filter_input_array(INPUT_POST, $filters);
var_dump($userData);

This condensed syntax lets us specify and apply all filters at once. If you have trouble understanding the syntax, the documentation explains it in more detail.

The resulting $userData will be an associative array with each of the named fields set to the result of applying the filters to the input value of the same name. This means that, for example, $_POST["email"] will be filtered using FILTER_VALIDATE_EMAIL and the result (the passing email address or false) will be available at $userData["email"]. Fields that were not submitted at all are set to NULL. You can check each result individually or simply loop over the entire $userData to check if any of the fields is false or NULL.
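Looping over the result could look roughly like this. The sketch uses filter_var_array() with a hand-written array as a stand-in for $_POST so it can run outside a web request; the helper findProblems() and the sample values are assumptions made for illustration.

```php
<?php
$filters = [
    "username" => [
        "filter" => FILTER_SANITIZE_SPECIAL_CHARS,
        "flags" => FILTER_FLAG_STRIP_LOW | FILTER_FLAG_STRIP_HIGH,
    ],
    "age" => [
        "filter" => FILTER_VALIDATE_INT,
        "options" => ["min_range" => 18],
    ],
    "email" => FILTER_VALIDATE_EMAIL,
];

// Hypothetical helper: maps each failed field to "missing" or "invalid".
function findProblems(array $filtered): array
{
    $problems = [];
    foreach ($filtered as $field => $value) {
        if ($value === null) {
            $problems[$field] = "missing";
        } elseif ($value === false) {
            $problems[$field] = "invalid";
        }
    }
    return $problems;
}

// Stand-in for $_POST: no username was sent and the age is below min_range.
$submitted = ["age" => "17", "email" => "bob@sample.com"];

$userData = filter_var_array($submitted, $filters);
var_dump(findProblems($userData));
```

An empty array from findProblems() means every field was present and passed its filter.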
