When developing websites it is easy to overlook pieces of information that may appear innocent, but can be dangerous in the hands of experienced attackers. Understanding what information is exposed on a web application and what malicious actors could infer from it is the first step to protecting the software and its users from exploitation.
To illustrate how simple fingerprinting and reconnaissance are on a technical level, we will only use the curl and grep commands in this article, as well as tidy to clean up HTML content before processing. On a Debian system, you can install them with a single command:
sudo apt install curl tidy
HTTP headers
The headers of HTTP responses are a good place to start fingerprinting web applications. We only need curl for this:
curl -I https://example.com
A response may look like this:
HTTP/1.1 200 OK
Server: Apache/2.4.41 (Ubuntu)
X-Powered-By: PHP/7.4.3
Content-Type: text/html
Date: Wed, 26 Feb 2025 10:44:05 GMT
Although short, this response gives a bad actor a lot of information to work with: it exposes the operating system, the web server and its version, and the programming language and its version. This alone can be enough to exploit the system, for example if the web server or the PHP version has known vulnerabilities.
Making version information publicly accessible or being overly open with details about the backend system is bad practice, because these headers serve no practical purpose (other than advertising). Configuring the server to hide this information depends on the setup you are using, but is generally simple. For our example, adding ServerTokens Prod and ServerSignature Off to apache2.conf, and setting expose_php = off in php.ini, would resolve the issue.
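For reference, the relevant snippets look like this (exact file locations vary: on Debian/Ubuntu the Apache directives may also live in conf-enabled/security.conf, and the php.ini path depends on the PHP version and SAPI):
# apache2.conf
ServerTokens Prod
ServerSignature Off

; php.ini
expose_php = off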
With the configuration adjustment in place, the response headers are much shorter:
HTTP/1.1 200 OK
Server: Apache
Content-Type: text/html
Date: Wed, 26 Feb 2025 10:44:05 GMT
With the reduced amount of information, an attacker will have a less clear picture of what is happening on the backend side.
Detecting common backends
The next step is looking at the HTML of the response, specifically searching for traces of commonly used software. Since writing websites from scratch is often overkill, many developers rely on well-tested backend CMS like WordPress, Joomla or Drupal. These typically advertise themselves in the <meta name="generator"> HTML tag, which we can look for automatically:
curl https://example.com | grep -i '<meta[^>]*name=[^>]*generator[^>]*>'
Sample outputs may expose the CMS in use, sometimes even with a specific version number, like one of these:
<meta name="generator" content="WordPress 6.3">
<meta content="Joomla 4.2" name='generator'>
If that returns no results, the generator tag might be disabled. The CMS can still be identified through other means, for example path names.
Since the HTML may not be formatted nicely (e.g. due to optimizations like minification), we use tidy with the -q (quiet) and -i (indent) flags to turn it back into readable code with one HTML tag per line.
WordPress typically loads assets from paths under /wp-content/ or /wp-includes/:
curl https://example.com | tidy -qi | grep -iE '/wp-content/|/wp-includes/|wordpress'
Joomla often uses paths starting with /templates/ or /media/:
curl https://example.com | tidy -qi | grep -iE '/templates/|/media/|joomla'
Drupal often uses paths starting with /core/ or /modules/:
curl https://example.com | tidy -qi | grep -iE '/core/|/modules/|drupal'
The output of these commands does not guarantee that a CMS is actually in use and requires manual assessment, but finding multiple matches for one query is a pretty good indicator of which backend CMS is used, if any.
Knowing which CMS is running on the backend provides a clear picture of how backend directories are structured, and opens the door to known vulnerabilities if the underlying CMS or its plugins are outdated. Using common software is both good and bad for security: it is well-tested and issues are likely to be found, reported and fixed quickly, but a system that is not kept up to date becomes vulnerable to publicly known exploits and makes for an easy target.
Finding comments
Comments are a great way to document code or leave notes for team members, but they should not be present in production deployments, especially not in files served to visitors in plaintext (HTML, JS, CSS etc.).
For fingerprinting, we care about two types of comments. The first are HTML comments:
curl https://example.com | tidy -qi | grep -Pzo '<!--[\s\S]*?-->'
The grep portion uses a little trick to match multi-line comments as well. Normally, grep would process lines individually, so we use -z to change the line separator from newline \n to a null byte \0, effectively turning the entire input into "one line". Next, we use -o to strip anything from a matched "line" that wasn't part of the expression, and enable PCRE support with -P to make the match non-greedy with the ? symbol (so we don't match markup between comments). Since we want to match anything within the comment markers <!-- and -->, we can't use .*, because that wouldn't match newlines, thus breaking multi-line comments. Instead, we match all whitespace characters \s and non-whitespace characters \S.
The second type are JavaScript/CSS comments in scripts and styles embedded in the page directly (not loaded externally).
Single-line comments:
curl https://example.com | tidy -qi | grep -o '//.*'
And multi-line comments:
curl https://example.com | tidy -qi | grep -Pzo '/\*[\s\S]*?\*/'
Comments aren't immediately problematic, but they shouldn't be present in production deployments, and they open the door to forgotten or overlooked information not meant for external eyes. It is common to leave notes in code, especially in more complex sections, to help other developers or team members better understand the bigger picture or functionality, but exposing that knowledge to a malicious actor can be dangerous.
Use automated tools to minify the source code of web responses and strip all comments, to prevent accidental leaking of secrets or information through leftover comments.
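What such a build step looks like depends on the project; as a rough sketch, assuming a Node.js toolchain with the terser and clean-css-cli packages (a choice made for this example, with placeholder paths, not a requirement of the setup above), minifying assets before deployment could be as simple as:
# compress and mangle JavaScript; terser drops most comments by default
npx terser assets/app.js --compress --mangle -o public/app.min.js
# minify CSS; cleancss removes regular comments as well
npx cleancss -o public/style.min.css assets/style.css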
Asset paths and versions
Another often underestimated threat is the information that can be inferred from the assets present on a page. Using third-party libraries or components is common practice in software development, and web applications are no exception. But installing plugins or extensions in a common CMS also carries the risk of enumeration through frontend assets.
Imagine you install a plugin to handle file uploads, or perhaps an image gallery for your website. That plugin may now load JavaScript or CSS styles for the frontend of the upload form or gallery, potentially with version numbers. An observant attacker can infer that your backend is likely using the corresponding code to process the uploads or images, allowing them to check whether any known vulnerabilities exist for that version. Even if they don't know the exact version, or it has no known exploits at the moment, it gives them a far better idea of how your backend may be structured, or how to combine multiple points of interest to pull off an attack anyway.
Finding script and stylesheet links is rather simple:
curl https://example.com | tidy -qi | grep -Eo '(<link[^>]*href|<script[^>]*src)[^>]*>'
The expression finds lines containing either <link> elements with href attributes (stylesheets) or <script> tags with src attributes (external JS files).
An attacker may look at the names present in the path, looking for common or publicly known plugins or extensions. Let's look at an example you may find:
/wp-content/plugins/galleria/galleria.min.js?v=2.1.0
From the path alone, it is immediately obvious that the target system is running WordPress with the galleria plugin version 2.1.0 installed. Appending the version of the file as a GET parameter like ?v=2.1.0 in this example is fairly common to prevent issues with outdated cache contents (the path doesn't change between plugin versions, so updating may leave some clients with outdated dependencies). This trick helps reduce friction for less experienced users, but also introduces a security issue: attackers now know the exact version of the plugin's backend code, whether it is up to date, and what logic it executes.
Of course, the links alone don't tell the complete story. Checking the contents of existing <script> or <style> tags can be equally fruitful, although not as easily automated.
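A rough first pass, reusing the multi-line matching trick from the comments section, could look like this (it will also match <script> tags that only reference external files, so the output still needs manual review; swapping script for style does the same for embedded stylesheets):
curl https://example.com | tidy -qi | grep -Pzo '<script[^>]*>[\s\S]*?</script>'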
Further reconnaissance
The fingerprinting approaches discussed in this article are by no means exhaustive, but rather the starting point of a potential attack or audit. Even without leaving the reconnaissance phase, there is a lot more information to glean from a website that isn't as easily automated, for example secrets or credentials exposed in JavaScript code, or AJAX requests to backend endpoints or external APIs. Checking common resources like robots.txt or the sitemap may uncover more hints about backend structure and logic (see the example below). Securing a web application isn't a straightforward process; it requires developers to understand the value any piece of information has for an attacker, and to build even internal systems or endpoints that aren't expected to receive manual traffic with the possibility of bad actors in mind.
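For the common resources mentioned above, two quick requests are enough (these paths are conventions rather than guarantees, and the sitemap location is sometimes announced inside robots.txt instead):
curl https://example.com/robots.txt
curl https://example.com/sitemap.xml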
The impact of fingerprinting
Reconnaissance is only the first step in exploiting a system, meant to gain a better understanding of the target. Having more information allows an attacker to narrow down the choice of attacks to attempt, or to exclude a large chunk that aren't applicable (version mismatch, incorrect target OS/server etc.). Any hint can be helpful, especially when combined with other pieces of information gained from different pages.
Attackers are often emboldened by seemingly "easy" targets, also called "low-hanging fruit", so exposing this information combined with an outdated system drastically increases the chances of getting attacked and successfully exploited. Keeping publicly exposed hints about software and versions to a minimum where possible reduces the attack surface and discourages most attackers, because finding out more about the page would be considerably more work (and likely riskier, e.g. testing for plugins via common paths would generate errors in log files).