Responsible web scraping considerations

When you need data from a remote service, the first thought is of course to look for an API. But what if there is none, or it does not contain the data you need? This is where you may consider scraping the website's user interface for the information instead. While this can solve your problem, it also raises a lot of questions: Am I allowed to scrape this website? Can I use this data legally? What should I watch out for? Today we look at these questions and more on an abstract level.

Scraping is a common practice

Despite its reputation for being "illegal", web scraping is actually a widely employed practice, used in plain sight by some of the largest companies on the planet. Here is a list of some common use cases:

  • Website indexing: When Google, Bing or DuckDuckGo visit your website, they scrape the content off it to index it for searching. This is web scraping in its truest form: they access your website and extract the displayed information without asking for permission beforehand (yes, there is robots.txt; we will get to that).
  • Link previews: Social networks like Facebook and Twitter, and messaging apps like WhatsApp and Discord, will generate a preview of a website when a link is sent. This happens by scraping metadata from the target page, typically HTML <meta> elements, but it may go beyond that: to embed an image directly, Discord will download that image and store it on Discord's servers (if caching rules allow it).
  • Market research: Marketing companies like Semrush and Ahrefs use web scraping to get a better view of competitors and provide actionable advice to customers on how to better compete in their market segment.
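
The link-preview case above is easy to reproduce yourself. The sketch below parses the `<meta>` name/property pairs a preview generator would look at, using only Python's standard library; the sample HTML and tag names like og:title follow the common Open Graph convention.

```python
from html.parser import HTMLParser

class MetaTagParser(HTMLParser):
    """Collects <meta> name/property -> content pairs: the same data
    link-preview generators scrape to build an embed."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # Open Graph tags use "property", plain metadata uses "name".
        key = attrs.get("property") or attrs.get("name")
        if key and "content" in attrs:
            self.meta[key] = attrs["content"]

def extract_preview(html):
    parser = MetaTagParser()
    parser.feed(html)
    return parser.meta

# Illustrative sample page, stands in for a fetched document.
page = """
<html><head>
  <meta property="og:title" content="Responsible web scraping">
  <meta property="og:image" content="https://example.com/cover.png">
  <meta name="description" content="Scraping etiquette and legal notes">
</head><body>...</body></html>
"""

print(extract_preview(page)["og:title"])  # Responsible web scraping
```

In a real preview generator, the HTML would of course come from an HTTP request rather than a string literal.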

Legal limitations of scraping data

While there is no law against scraping a website directly, the data itself may be protected by laws. Before even beginning to retrieve remote content, stop for a moment and consider if the data you are about to fetch falls into one of these categories:

  • Private content: The file robots.txt contains instructions for automated processing of the website, specifically which parts to exclude. Scraping targets explicitly forbidden for robots may be considered illegal, especially when further processing that data internally or even publishing it. Also check the terms of service and imprint, if available, for notes on scraping or other automated retrieval of information. If scraping is explicitly forbidden in one of these documents, then you cannot fetch or process any of the website's content that way.
  • Copyrighted material: Images, blog articles or videos are typically protected by copyright. Depending on how you use them, you may need the consent of the copyright holder (or a license) to process this kind of content without getting into legal conflicts.
  • Personal information: Privacy protection laws, such as the GDPR in the EU, protect information about real persons. Such data is anything that directly identifies a single person, such as names, unique IDs, email addresses etc. This kind of data requires either the consent of the person before fetching it, or a 'legitimate interest' of the company that outweighs the individual's right to privacy. While the last part sounds easy to claim, it is actually very difficult to fall into this category, because it is intended for edge cases like temporarily storing user IP addresses to prevent fraud and address security concerns, and even that only under restrictions.
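
Checking robots.txt before scraping is straightforward: Python ships a parser for exactly this format. The rules below are an illustrative example; in practice you would fetch the file from the target site's /robots.txt path first.

```python
from urllib import robotparser

# Example robots.txt content; in practice, fetch this from
# https://<target-site>/robots.txt before scraping anything.
robots_txt = """\
User-agent: *
Disallow: /private/
Disallow: /search
"""

parser = robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch() answers: may this user agent access this URL?
print(parser.can_fetch("MyScraper", "https://example.com/blog/article-1"))  # True
print(parser.can_fetch("MyScraper", "https://example.com/private/data"))    # False
```

Wiring a check like this into your scraper means disallowed paths are skipped automatically instead of relying on manual review.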

Web request etiquette

When making web scraping requests, remember that the server hosting that content does not exist solely for you. While its pages may be publicly available, careless scraping can still degrade their service or even cause outages. Try to follow roughly the same rules as a normal user's browser would:

  • Set a User-Agent HTTP header: Preferably one from a real browser. It may prevent you from being falsely flagged as a spam or attack bot.
  • Accept and use cookies: Websites often do some individual processing on first visit, such as generating and assigning session IDs, storing settings etc. By accepting cookies and sending them on subsequent requests like a normal browser, you lower the amount of traffic between you and the server and reduce the load you cause on it.
  • Set a delay between requests: Enforce a delay between the HTTP requests you make, to ensure the real visitors of the website aren't negatively impacted by your scraping. A sane lower bound is one request per second; a conservative value would be one every 3-5 seconds, depending on the target's infrastructure. An automated web scraper can run for hours or days while you do something else; there is no need to risk a lawsuit for causing a server outage just to get your data two hours earlier.
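
All three rules above fit into a small client class. This is a minimal sketch using only the standard library; the class name, the User-Agent string and the 3-second default delay are illustrative assumptions, not fixed requirements.

```python
import time
import urllib.request

class PoliteFetcher:
    """Browser-like scraping client: sends a User-Agent, keeps cookies
    between requests, and enforces a delay between requests."""

    def __init__(self, delay_seconds=3.0):
        self.delay_seconds = delay_seconds
        self._last_request = 0.0
        # HTTPCookieProcessor stores cookies from responses and resends
        # them on subsequent requests, like a normal browser would.
        self.opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor())
        self.opener.addheaders = [
            ("User-Agent", "Mozilla/5.0 (compatible; ExampleScraper/1.0)")]

    def _wait(self):
        # Sleep until at least delay_seconds have passed since the
        # previous request, so back-to-back calls are spaced out.
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.delay_seconds:
            time.sleep(self.delay_seconds - elapsed)
        self._last_request = time.monotonic()

    def fetch(self, url):
        self._wait()
        with self.opener.open(url) as response:
            return response.read()
```

Usage is then simply `PoliteFetcher().fetch("https://example.com/page")` in a loop over your target URLs, with the delay handled transparently.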

Website protections

Some websites employ active protective measures against bots. I cannot stress enough that circumventing these is a very poor choice legally, and defending that position in a courtroom would be close to impossible: you have knowingly encountered and intentionally circumvented a restriction on automated processing, in order to automatically process the content. Some common scenarios include:

  • Rate limits and IP bans: If a site deemed your requests temporarily or permanently undesirable for any reason, that's that. Getting an IP ban means waiting until the ban is lifted. When hitting a rate limit (often HTTP status code 429), wait for the duration specified, commonly through HTTP headers like Retry-After. If no duration was given, fall back to an exponential backoff algorithm or similar.
  • Captchas: The sole purpose of captchas is to prevent automated access to specific content. This fully applies to web scraping as well: While it is technically feasible to solve a captcha by hand and reuse the session cookie in your scraper, it is definitely not legal, as it actively circumvents a protective measure against exactly the type of processing you do.
  • Protection systems and Web Application Firewalls: A step above captchas are Web Application Firewalls (WAFs) and security proxies, such as Sucuri or Cloudflare. Even where a similar approach as with captchas might get you past them, the same rule applies: actively circumventing a security measure is far outside of legal boundaries.

As you can see, while web scraping may not be illegal in itself, there are a lot of conditions and edge cases to think about even before writing the first line of your scraper's code. By considering these points, you can minimize the risk of getting negative feedback on your scraping efforts (or worse, getting into legal trouble because of them). Scraping can be a wonderful and efficient tool for legitimate businesses when done right, and the negative stigma around it has no place in the modern web anymore. Happy scraping!
