While the internet has become a home for almost any kind of data you may be looking for, it is not always as easily processable as we would like. This has given traction to the practice of web scraping: automatically reading a website made for humans, extracting the information you need and making it available for processing by a computer. Let's see why programmers commonly choose Python for this task!
Before you scrape a website
If you want to learn web scraping, there are free pages made exactly for that, for example https://www.scrapethissite.com or https://toscrape.com. You can freely scrape these without worrying about any legal issues.
While scraping itself may not be illegal, it does copy text or images that you may not be allowed to reproduce due to copyright. Make sure to check the terms of service or imprint of a page before scraping it, and comply with the restrictions laid out in its robots.txt file. Also add ample delays between requests, so you won't accidentally hog the page's bandwidth or cause a service outage through too many resource-heavy requests.
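As an illustration of both habits, here is a minimal sketch using only the standard library's `urllib.robotparser` and `time` modules (the page list is just a placeholder):

import time
from urllib import robotparser

# Check robots.txt before fetching anything
parser = robotparser.RobotFileParser()
parser.set_url("https://www.scrapethissite.com/robots.txt")
parser.read()

pages = [
    "https://www.scrapethissite.com/pages/simple/",
    "https://www.scrapethissite.com/pages/forms/",
]
for url in pages:
    if parser.can_fetch("*", url):
        # ... fetch and process the page here, e.g. with requests.get(url) ...
        time.sleep(1)  # pause between requests to spare the server's bandwidth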
Setting up the environment
To get started with web scraping, we first need a new virtual environment:
python -m venv ./venv
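Activate it before installing anything, so the packages end up inside the environment (the exact command depends on your operating system and shell):

# Linux/macOS:
source ./venv/bin/activate
# Windows (cmd.exe):
venv\Scripts\activate.bat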
To make HTTP requests, we will use the requests module:
python -m pip install requests
To parse the HTML responses we get back, BeautifulSoup4 is our tool of choice:
python -m pip install beautifulsoup4
And just like that, we are ready to scrape a page!
Scraping a website
To get started, let's pick a simple task. The page https://www.scrapethissite.com/pages/simple/ contains a list of all countries in the world, complete with capitals, population and approximate area. A single country is displayed with this HTML markup:
<div class="col-md-4 country">
<h3 class="country-name">
<i class="flag-icon flag-icon-ad"></i>
Andorra
</h3>
<div class="country-info">
<strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br>
<strong>Population:</strong> <span class="country-population">84000</span><br>
<strong>Area (km<sup>2</sup>):</strong> <span class="country-area">468.0</span><br>
</div>
</div>
Some observations:
- The outermost element for each country has the `country` class.
- The name of a country is the only text (non-HTML content) of the `h3` element with the class `country-name`.
- The details (capital, population, area) are the contents of the elements with the classes `country-capital`, `country-population` and `country-area`, respectively.
Start with a simple request:
from bs4 import BeautifulSoup
import requests

# Fetch the page and bail out on anything but a 200 OK
response = requests.get("https://www.scrapethissite.com/pages/simple/")
if response.status_code != 200:
    raise Exception(f"Received unexpected http status {response.status_code}")

# Parse the HTML and print each country's name
soup = BeautifulSoup(response.text, "html.parser")
for country in soup.find_all(class_="country"):
    name = country.find("h3", class_="country-name").text
    print(name)
These few lines of code make an HTTP GET request to the page we want to scrape, check the result for errors, and then parse it with BeautifulSoup.
We can now start looking for the page elements we want using the `soup` variable, which holds the parsed HTML data of the website. The `for` loop iterates over all HTML elements that have a class called `country`. For each of them, it finds the `h3` element with the class `country-name` and prints the `text` contents of that element. Note that the function parameter to look for an HTML class is named `class_` (with an underscore at the end). This is intentional, because `class` is a reserved keyword in Python used to declare object classes.
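If the trailing underscore looks odd to you, BeautifulSoup also accepts a plain dictionary of attributes, which avoids the keyword clash entirely; this variant behaves the same as the loop above:

# Equivalent lookup using the attrs dictionary instead of the class_ keyword
for country in soup.find_all(attrs={"class": "country"}):
    name = country.find("h3", attrs={"class": "country-name"}).text
    print(name)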
Either way, the output of this code is close to what we want, except that it contains more newlines and spaces than we would like:
Andorra
United Arab Emirates
Afghanistan
...
Let's fix the `for` loop to call `.strip()` on the text contents to remove leading and trailing whitespace, and add the other country details to the output:
for country in soup.find_all(class_="country"):
    output = {}
    output["name"] = country.find("h3", class_="country-name").text.strip()
    output["capital"] = country.find(class_="country-capital").text.strip()
    output["population"] = country.find(class_="country-population").text.strip()
    output["area"] = country.find(class_="country-area").text.strip()
    print(output)
We have slightly modified the loop to store all country data in a dict and added capital, population and area values to it. The output looks much more parseable now:
{'name': 'Andorra', 'capital': 'Andorra la Vella', 'population': '84000', 'area': '468.0'}
{'name': 'United Arab Emirates', 'capital': 'Abu Dhabi', 'population': '4975593', 'area': '82880.0'}
{'name': 'Afghanistan', 'capital': 'Kabul', 'population': '29121286', 'area': '647500.0'}
{'name': 'Antigua and Barbuda', 'capital': "St. John's", 'population': '86754', 'area': '443.0'}
{'name': 'Anguilla', 'capital': 'The Valley', 'population': '13254', 'area': '102.0'}
...
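Note that every value is still a string at this point. If you want to do arithmetic with the numbers, convert them explicitly; a small sketch reusing the `output` dict from the loop above:

# Convert the numeric fields from strings before calculating with them
population = int(output["population"])   # e.g. 84000
area_km2 = float(output["area"])         # e.g. 468.0
density = population / area_km2          # people per square kilometre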
We could then collect all countries in a list and encode that as JSON, so other programs can easily process and work with our scraped data:
import json

import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.scrapethissite.com/pages/simple/")
if response.status_code != 200:
    raise Exception(f"Received unexpected http status {response.status_code}")

soup = BeautifulSoup(response.text, "html.parser")

# Collect every country as a dict, then dump the whole list as JSON
countries = []
for country in soup.find_all(class_="country"):
    country_data = {}
    country_data["name"] = country.find("h3", class_="country-name").text.strip()
    country_data["capital"] = country.find(class_="country-capital").text.strip()
    country_data["population"] = country.find(class_="country-population").text.strip()
    country_data["area"] = country.find(class_="country-area").text.strip()
    countries.append(country_data)

print(json.dumps(countries))
This first example illustrates why Python is one of the most popular choices when it comes to web scraping: in less than 20 lines of code, we made an HTTP request, checked for errors, parsed the page HTML, scraped 250 countries with four attributes each off the page, cleaned up superfluous whitespace and printed everything as a JSON array.
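If you would rather persist the result than print it, a small variation at the end of the script writes the list to disk (the filename countries.json is just an example):

# Write the scraped list to a JSON file instead of printing it
with open("countries.json", "w", encoding="utf-8") as f:
    json.dump(countries, f, indent=2)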
Advanced data extraction
The first example was very simple in terms of data extraction: every piece of information we wanted was conveniently placed in its own HTML element, with a unique class assigned to it. While this is great for scraping, real-world pages are rarely this convenient.
Let's look at an example page with some real-world scraping challenges: http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html
This page is a fictional book store meant to be scraped (you can't actually buy anything there). Let's say we are interested in two pieces of information:
- The rating of the book from 1 to 5
- The number of books currently in stock
The HTML markup for the star rating looks like this:
<p class="star-rating Five">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<!-- <small><a href="/catalogue/a-spys-devotion-the-regency-spies-of-london-1_3/reviews/">
0 customer reviews
</a></small>
-->
<!--
<a id="write_review" href="/catalogue/a-spys-devotion-the-regency-spies-of-london-1_3/reviews/add/#addreview" class="btn btn-success btn-sm">
Write a review
</a>
-->
</p>
Some commented-out features, but at least there is a unique class, `star-rating`, to find the element by. The rating itself is not quite as easy to get: the colors of the star icons are set through the CSS classes `One`, `Two`, `Three`, `Four` and `Five` on the `star-rating` element. There is no easy way to turn those into integers other than comparing strings:
import requests
from bs4 import BeautifulSoup

response = requests.get("http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html")
if response.status_code != 200:
    raise Exception(f"Received unexpected http status {response.status_code}")

soup = BeautifulSoup(response.text, "html.parser")

# The rating is encoded as one of the class names on the star-rating element
rating = 0
rating_element = soup.find(class_="star-rating")
if "One" in rating_element["class"]:
    rating = 1
if "Two" in rating_element["class"]:
    rating = 2
if "Three" in rating_element["class"]:
    rating = 3
if "Four" in rating_element["class"]:
    rating = 4
if "Five" in rating_element["class"]:
    rating = 5
print(rating)
This time we check whether a class name is contained in the `class` list of the BeautifulSoup HTML element and assign the rating integer according to which class name matched.
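If the chain of `if` statements feels repetitive, a dictionary lookup does the same job more compactly; just a sketch reusing `rating_element` from above, with identical behavior:

# Map the rating class names to integers instead of chaining if statements
RATING_NAMES = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

rating = 0
for class_name in rating_element["class"]:
    if class_name in RATING_NAMES:
        rating = RATING_NAMES[class_name]
        break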
The second task, retrieving the number of books in stock, turns out to be even harder to parse. Here is the HTML markup of the element:
<p class="instock availability">
<i class="icon-ok"></i>
In stock (20 available)
</p>
There is simply no way to retrieve the number alone with just HTML parsing. Issues like this are usually resolved using regular expressions (regex). The number we want is the first run of digits after the opening parenthesis `(`. To filter only that out of the element's text, a regex capture group comes in handy: `\((\d+).*\)`. In simple terms: `\(` looks for the literal character `(`, and `(\d+)` captures (`(...)`) the first one or more (`+`) digits (`\d`) after it.
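To see the pattern in action before wiring it into the scraper, you can test it against the raw text on its own:

import re

# Try the capture group on the text we expect to find on the page
match = re.search(r"\((\d+).*\)", "In stock (20 available)")
if match:
    print(match.group(1))  # prints: 20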
Adding that to the scraping script:
import re

import requests
from bs4 import BeautifulSoup

response = requests.get("http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html")
if response.status_code != 200:
    raise Exception(f"Received unexpected http status {response.status_code}")

soup = BeautifulSoup(response.text, "html.parser")

rating = 0
rating_element = soup.find(class_="star-rating")
if "One" in rating_element["class"]:
    rating = 1
if "Two" in rating_element["class"]:
    rating = 2
if "Three" in rating_element["class"]:
    rating = 3
if "Four" in rating_element["class"]:
    rating = 4
if "Five" in rating_element["class"]:
    rating = 5

# Use a raw string for the regex so Python doesn't mangle the backslashes
stock = re.search(r"\((\d+)", soup.find(class_="instock").text).group(1)
print(f"Rating: {rating}/5 ({stock} in stock)")
We simply used the built-in `re` module to search for our regex pattern in the HTML element with the class `instock` and used the first capture group's match as the stock amount. Even for more complex challenges like this, Python makes it easy to parse and filter the data we are looking for out of websites that were not designed with automated processing in mind.
With this introduction to web scraping in Python, you are now equipped with the basic knowledge of how to retrieve information from a website that does not offer an API endpoint, including cleaning up values and even more advanced regex pattern matching.