Headless bots and scrapers that crawl the internet and stuff spam into any form they find are almost as old as the internet itself. Left unprotected, everything from comment fields to contact forms is at risk of being abused for automated, unsolicited spam that often can't be opted out of. Defending against this kind of malicious use is important to keep employees free of useless busywork and your inbox clean.
The threat model
Before we can discuss how to defend against bots, we must first understand what types of automated bots exist and how they differ.
There are four primary types of bots we need to consider; let's call them "scrapers", "headless browsers", "replay bots" and "agentic browsers".
Scrapers download a website's source code, find <form> elements and try to submit them. These are extremely cheap to run, but can become a little more sophisticated, for example by using regexes or known name lists to identify text-based captcha fields, automatically solving maths puzzles and intelligently avoiding honeypot inputs (a minimal sketch of such a scraper follows below).
Headless browsers instead use real browsers in a headless (non-graphical) mode. They actually render the target website, including javascript files and events. These are among the most sophisticated bots, defeating many automated protections and often utilizing third-party services to solve graphical captchas. Defenses like hidden honeypot fields are useless against them.
Replay bots go a step further: they let a headless browser (or a human) fill out the target form once and record the resulting HTTP request sent over the network. They then reuse that same HTTP request, only swapping out the desired fields like emails and message contents. This sidesteps many defenses, such as scripts that intercept the form's submit event and adjust fields or contents, and it works well even against javascript-based frontend frameworks like Angular or React.
The newest addition are headless browsers driven by ai models that interpret screenshots of the current page state and interact with it dynamically, mimicking real human behavior. These are much harder to detect and easily bypass complex challenges like visual captchas, but they take far longer to complete a form submission than the other options.
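To illustrate how little work the cheapest bots have to do, here is a minimal scraper sketch in php; the URL and the lack of any error handling are purely illustrative, not taken from any real bot:
<?php
// Fetch the page with a single cheap HTTP request.
$html = file_get_contents("https://example.com/contact");

// Parse the markup and list every form a scraper would try to submit.
$dom = new DOMDocument();
@$dom->loadHTML($html); // real-world markup is rarely valid; ignore parser warnings

foreach ($dom->getElementsByTagName("form") as $form) {
    echo "form posting to: " . $form->getAttribute("action") . "\n";
    foreach ($form->getElementsByTagName("input") as $input) {
        // a bot would fill these names with spam and POST them to the action URL
        echo "  field: " . $input->getAttribute("name") . "\n";
    }
}
No rendering, no javascript, no assets: one request and a few milliseconds of parsing per page, which is why this type of bot scales so well.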
The cost of bots
The different kinds of bots differ widely in runtime cost: simple scrapers are extremely cheap and scale very well. A headless browser, in comparison, is almost comically expensive per request, for multiple reasons: it needs to actually render the page, execute scripts (requiring a memory-heavy javascript runtime and cpu time that depends on what scripts are on the page), and rendering also involves downloading assets like images, video and audio files, stylesheets and so on.
Replay bots strike a great balance in favor of the bot operator whenever the form is replayable: they pay the rendering cost once, then fall back to cheap raw network requests afterwards.
To keep that cost down, large-scale bot operators nowadays preprocess the html input and strip out any resources they don't deem necessary.
This includes visual elements like images, video and audio files, resource hints such as <link rel="dns-prefetch"> and <link rel="preconnect">, but also well-known scripts that do not contribute to the DOM structure, like google analytics.
Outdated defenses
A few years ago, it was possible to trick bots with very simple tactics, many of which have since been defeated. Spam companies aren't asleep at the wheel and are constantly looking to improve and adapt to changes (their income depends on it, after all!).
Let's look at outdated solutions that are no longer effective:
Honeypot fields were hidden inputs that looked like real fields to naive scrapers but were invisible to real users, so humans would never touch them. Consider this form:
<form>
<input type="text" name="name" required>
<input type="hidden" name="email">
</form>
A real user couldn't see the email input field, but a bot might fill it out. Over time, the hiding got more sophisticated, using css properties like display: none;, visibility: hidden; or even opacity: 0; instead of type="hidden". Modern bots now scan the css for these, flag affected input fields as suspicious, and try submitting the form both with and without the suspicious fields to bypass the honeypot.
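A later css-based variant might have looked like this (the class name is made up for the example); the server then simply discards any submission where the hidden field is non-empty:
<style>
/* visually removes the honeypot without resorting to type="hidden" */
.extra-field { display: none; }
</style>
<form>
<input type="text" name="name" required>
<!-- invisible to humans; autocomplete="off" keeps browser autofill from touching it -->
<input type="text" name="email" class="extra-field" autocomplete="off">
</form>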
Runtime form adjustments used javascript to intercept the submit event on forms and add fields on the fly to keep bots without javascript capabilities out:
<script type="text/javascript">
let form = document.querySelector("#myform");
form.addEventListener("submit", () => {
const hiddenInput = document.createElement('input');
hiddenInput.type = 'hidden';
hiddenInput.name = 'js_enabled';
hiddenInput.value = 'true';
form.appendChild(hiddenInput);
});
</script>
Headless browsers and replay bots remain unaffected by naive scripts like this, and even scrapers now use heuristics to guess form changes without running scripts at all.
Maths puzzles presented a simple problem like 5 + 4 and stored a hash of the solution in a hidden input field, assuming humans are capable of simple arithmetic but bots would struggle.
<form>
<input type="text" name="captcha" placeholder="5 + 4" required>
<input type="hidden" name="solution" value="checksum of solution">
</form>
Even simple scrapers added regular expressions and parsers to detect and solve these puzzles on the fly, making them near useless for modern pages.
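For reference, the server-side wiring behind such a field was usually no more than this; a sketch of the general pattern, not any particular library:
<?php
// Generating the puzzle: embed the question and a hash of the answer in the form.
$a = random_int(1, 9);
$b = random_int(1, 9);
$hash = hash("sha256", (string)($a + $b));
echo '<input type="text" name="captcha" placeholder="' . $a . ' + ' . $b . '" required>';
echo '<input type="hidden" name="solution" value="' . $hash . '">';

// Checking the submission: hash whatever the visitor typed and compare.
if (isset($_POST["captcha"], $_POST["solution"])
    && hash("sha256", trim($_POST["captcha"])) === $_POST["solution"]) {
    // puzzle solved - which a scraper with a regex and a tiny calculator can do just as well
}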
Visually distorted text in images simply printed a short captcha text, often just 4-6 letters or digits, onto a small image, then distorted it to make it hard to read. The idea was that humans could still read the text and type it into the captcha field, while bots could not. This kept bots out for a few years, but modern image recognition (OCR) is advanced and cheap enough to break this kind of captcha in seconds.
IP reputation databases recorded known bad actors and previously detected bots, but have become near worthless since the arrival of practically boundless, cheap IPv6 addresses and residential proxy services.
Rate limiting is often mentioned in this context, although that is misleading: it prevents a single client from submitting many requests in quick succession, which modern bots don't do anymore. They space out their requests and submit unique (often paid-for) advertisement messages. With enough clients, they can cycle one message per client instead of idling while waiting for a timeout, sidestepping the limit at no extra cost.
Defense #1: Increasing bot cost
Now that we understand how the different types of bots work, let's talk about the defense with the best value for effort. Simple scrapers are the most widely used because of their low runtime cost, so simply requiring the client to have javascript enabled gets rid of them.
Involving a media element like <img> in the form submission takes out most headless browser networks, which strip such elements during their cost-saving preprocessing.
Replay bots are a little trickier: defeating them requires a CSRF (cross-site request forgery) token, basically an arbitrary string that is sent along with the form and is invalidated as soon as the form is submitted for the first time. Think of it as handing out a "ticket" when loading the page; submitting the form requires sending a valid ticket, and once a ticket is used, it expires and cannot be used again.
These tokens are typically stored somewhere tied to the user but out of their reach (like server-side sessions or a key-value store, NOT cookies, which they could alter!).
Combining these approaches could look like this:
<form action="/contact.php" method="POST">
<input type="text" name="name" placeholder="Name" required>
<input type="email" name="email" placeholder="Email" required>
<input type="hidden" name="csrf_token">
</form>
<img
src="data:image/svg+xml,%3Csvg%20xmlns=%22http://www.w3.org/2000/svg%22/%3E"
onload='
document.querySelector("input[name=csrf_token]").value="__mytoken__"'
width="0"
height="0">Now the <img> element will likely be stripped by headless browser bot networks because it does not hide visibility using css, instead using image dimensions of 0x0 px to effectively hide. Using onload= forces the browser to have a javascript runtime, and as long as the __mytoken__ placeholder is unique for each request and invalidated once used, the page can't be pre-rendered for replay bots.
A php implementation may look like this:
<?php
session_start();
function generate_new_csrf_token(){
$_SESSION["csrf_token"] = bin2hex(random_bytes(32));
}
if(!isset($_SESSION["csrf_token"])){
generate_new_csrf_token();
}
if(isset($_POST["name"], $_POST["comment"], $_POST["csrf_token"])){
if($_POST["csrf_token"] != $_SESSION["csrf_token"]){
// form error: invalid token (replay attempt)
}else{
// good token, send mail and generate new token
// mail(...
generate_new_csrf_token();
}
}
?>
<html>
<body>
<form action="" method="POST">
<input type="text" name="name" placeholder="Your name" required>
<textarea name="comment" required>Your comment</textarea>
<input type="hidden" name="csrf_token">
<button type="submit">Post comment</button>
</form>
<img
src="data:image/svg+xml,%3Csvg%20xmlns=%22http://www.w3.org/2000/svg%22/%3E"
onload='document.querySelector("input[name=csrf_token]").value="<?=$_SESSION["csrf_token"];?>"'
width="0"
height="0">
</body>
</html>
While simple, this approach effectively defeats the vast majority of spam bots on the internet and adds substantial cost to targeting your website with automated tools.
Defense #2: Interactive visual captchas
The most well-known bot protections nowadays are interactive visual captchas, like Google's reCaptcha v2. They require the user to click on them and then solve a problem, like placing a puzzle piece at the correct location or selecting images based on their content ("Select all images containing stairs").
They are just as well known for the heavy user friction they cause: completing them takes several human actions and delays users from their actual goal of submitting a form or visiting a site. Frustration builds especially when the captcha is uncertain about its result and requires a second or third consecutive puzzle to build confidence that the request comes from a human.
While most of their functionality is visible, they also run internal checks, often inspecting the browser environment, IP address and solving speed for known bot patterns.
Their effectiveness is diminishing with the rise of automated solving services, which use a combination of real humans and image recognition to complete the challenges, while the captcha keeps annoying your actual visitors and customers.
Defense #3: Invisible captchas
The newest entries in automated bot defense are "invisible captchas", like Google's reCaptcha v3 or FriendlyCaptcha. They use pattern recognition and invisible background logic to try to detect bots, without any visible challenge or immediate friction for humans.
Detection mechanisms vary, but they can be roughly grouped into three categories:
Behavioral pattern analysis involves tracking user interactions: the time between page load and form submission, click and mouse movement precision and acceleration, browser environment and hardware fingerprinting. Google's reCaptcha falls into this category, running extensive background checks and combining the results in a global database to share risk scoring across pages. It offers high-precision results at telling humans and bots apart, but it is also the worst privacy option out there, sending large amounts of profiling data about real humans to centralized storage.
Proof of work, as used by FriendlyCaptcha, takes a radically different route, aiming to increase the cost of running bots enough to deter large-scale attacks. Clients have to solve memory- or cpu-heavy cryptographic problems before submitting a form (a minimal sketch follows below), which is fine for a single human on one device but quickly gets out of hand across hundreds or thousands of automated machines. It offers the best privacy of all invisible captchas, only running light browser environment checks, but it applies a real cost to innocent users, draining mobile batteries and slowing down the on-page experience on older devices.
Hybrid solutions like Cloudflare Turnstile combine moderate profiling with some proof of work, striking a balance between device load and privacy while still maintaining a decent detection rate.
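To make the proof-of-work idea concrete, here is a minimal sketch of a hash-based scheme in php. It illustrates the general principle only, not FriendlyCaptcha's actual protocol; the difficulty value and function names are invented for the example:
<?php
// Sketch of hash-based proof of work: the client must find a nonce whose hash
// meets a difficulty target before the server accepts the form submission.
const POW_DIFFICULTY = 4; // required number of leading hex zeros; higher = more client work

function meets_target(string $challenge, string $nonce): bool {
    $digest = hash("sha256", $challenge . $nonce);
    return substr($digest, 0, POW_DIFFICULTY) === str_repeat("0", POW_DIFFICULTY);
}

// Client side (in practice done in javascript in the browser): brute-force a nonce.
function solve(string $challenge): string {
    for ($nonce = 0; ; $nonce++) {
        if (meets_target($challenge, (string)$nonce)) {
            return (string)$nonce;
        }
    }
}

// Server side: hand out a random challenge with the form (stored like the csrf token),
// then verifying the submitted nonce costs a single hash.
$challenge = bin2hex(random_bytes(16));
$nonce = solve($challenge);                  // the expensive part, paid by every client
var_dump(meets_target($challenge, $nonce));  // bool(true)
The asymmetry is the whole point: finding the nonce takes tens of thousands of hashes on average, while checking it takes exactly one.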
Invisible captchas either come with heavy legal/privacy caveats or significant implementation complexity. However, as of now, they are the only solution that can effectively combat every type of spam bot, even ai-based agentic browsers.
Picking the right one
From an efficiency standpoint, defense #1 wins by a long shot: it keeps most bots out with practically no performance penalty, minimal storage cost, no legal privacy concerns to speak of, and zero friction for human visitors.
That said, it falls short against attackers willing to spend more resources on your page, and there you have to make a trade-off. Either pick a visual captcha and trade high user friction and frustration for balanced privacy, keeping all bots except ai-enabled browsers out.
Or use an invisible captcha, where group one is a nightmare for privacy (and thus, for your legal requirements in the EU) and group two applies real cost to the devices of innocent visitors instead.
The choice heavily depends on your specific requirements, so there is no real "one size fits all" solution. Understand the trade-offs, and pick the one that solves your problems with drawbacks you feel comfortable with.