🍲 Beautiful Soup Speedrun
Your ultra-quick guide to web scraping and data cleaning in Python.
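🛠️ 1. Installation
Beautiful Soup sits on top of an HTML parser: Python's built-in html.parser works out of the box, and third-party parsers such as lxml or html5lib can be swapped in. It's commonly paired with the requests library, which fetches the web pages.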
# Install Beautiful Soup and the requests library
pip install beautifulsoup4 requests
2. Fetch & Parse a Web Page
First, you get the HTML content of a page, then you parse it into a BeautifulSoup object.
import requests
from bs4 import BeautifulSoup
# The URL of the page you want to scrape
URL = "https://realpython.github.io/fake-jobs/"
# Fetch the content of the page
page = requests.get(URL)
# Create a BeautifulSoup object to parse the HTML
soup = BeautifulSoup(page.content, "html.parser")
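If you want the fetch step to fail loudly instead of silently parsing an error page, a minimal sketch could look like the one below. The helper name fetch_soup and the 10-second timeout are assumptions for illustration, not part of the original guide. The last few lines parse a made-up inline HTML string, so you can try the parsing step without any network access.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url: str, timeout: float = 10.0) -> BeautifulSoup:
    """Hypothetical helper: fetch a page and parse it, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return BeautifulSoup(response.content, "html.parser")


# Offline check: BeautifulSoup accepts any HTML string, no request needed.
# The markup below is made up for this example.
SAMPLE_HTML = "<html><head><title>Fake Jobs</title></head><body></body></html>"
sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(sample_soup.title.string)
```

With a live connection, soup = fetch_soup(URL) would stand in for the two-step fetch and parse shown above.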
3. Find Elements
You can find elements by their HTML tag, ID, or CSS class.
Find a single element by ID
This is great for finding unique containers.
# Find the div with the id 'ResultsContainer'
results_container = soup.find(id="ResultsContainer")
# print(results_container.prettify())
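One thing worth knowing: find() returns None when nothing matches, so a quick check saves you from a confusing AttributeError later. Here is a small sketch on made-up inline HTML (the markup and the id NoSuchId are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the structure used in this guide.
SAMPLE_HTML = """
<div id="ResultsContainer">
  <div class="card-content"><h2 class="title">Python Developer</h2></div>
</div>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

results_container = soup.find(id="ResultsContainer")
missing = soup.find(id="NoSuchId")  # no element has this id, so this is None

if results_container is not None:
    print(results_container.name)  # tag name of the matched element

if missing is None:
    print("No element with id 'NoSuchId'")
```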
Find all elements by tag and class
This returns a list-like ResultSet of all matching elements, which is empty if nothing matches.
# Find all div elements with the class 'card-content'
job_cards = results_container.find_all("div", class_="card-content")
for job in job_cards:
    # ... do something with each job card
    pass
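To see how the class filter behaves, here is a short sketch on made-up inline HTML (the markup and the class names featured, card-footer, and no-such-class are assumptions for illustration). Note that class_ matches any element that carries the class, even alongside other classes.

```python
from bs4 import BeautifulSoup

# Made-up markup: two job cards (one with an extra class) and one unrelated div.
SAMPLE_HTML = """
<div id="ResultsContainer">
  <div class="card-content">Job A</div>
  <div class="card-content featured">Job B</div>
  <div class="card-footer">Not a job card</div>
</div>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
results_container = soup.find(id="ResultsContainer")

# Both cards match, including the one that also has the 'featured' class.
job_cards = results_container.find_all("div", class_="card-content")
print(len(job_cards))

# No match gives an empty result, so a for loop over it simply does nothing.
no_matches = results_container.find_all("div", class_="no-such-class")
print(len(no_matches))
```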
🧹 4. Extract Data
Once you’ve found your elements, you can extract the text or attributes from them.
Extracting Text
Use the .text attribute to get an element's text, then .strip() to remove leading and trailing whitespace.
for job in job_cards:
    title_element = job.find("h2", class_="title")
    company_element = job.find("h3", class_="company")
    location_element = job.find("p", class_="location")
    print(title_element.text.strip())
    print(company_element.text.strip())
    print(location_element.text.strip())
    print()  # Add a blank line
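Real pages are messy, and a card that lacks one of these elements would make .text fail on None. A minimal sketch of a guard follows. The helper text_or_default and the sample markup are assumptions for illustration, and get_text(strip=True) is used here as a one-call alternative to .text.strip().

```python
from bs4 import BeautifulSoup

# Made-up job card with no location element.
SAMPLE_HTML = """
<div class="card-content">
  <h2 class="title">  Python Developer  </h2>
  <h3 class="company">Example Corp</h3>
</div>
"""


def text_or_default(parent, name, class_name, default="N/A"):
    """Hypothetical helper: stripped text of the first match, or `default` if absent."""
    element = parent.find(name, class_=class_name)
    return element.get_text(strip=True) if element is not None else default


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
job = soup.find("div", class_="card-content")

title = text_or_default(job, "h2", "title")
company = text_or_default(job, "h3", "company")
location = text_or_default(job, "p", "location")  # missing in the sample card

print(title)
print(company)
print(location)
```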
Extracting Attributes (like links)
Access attributes like href using dictionary-style square brackets.
for job in job_cards:
    # The 'Apply' link is the second 'a' tag
    apply_link = job.find_all("a")[1]["href"]
    print(f"Apply here: {apply_link}")
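Square brackets raise a KeyError when the attribute is missing, while .get() returns None instead. The sketch below shows both the safe lookup and how urljoin from the standard library can turn a relative href into a full URL. The sample markup, including the relative link, is made up for illustration; the real page may use absolute URLs.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://realpython.github.io/fake-jobs/"

# Made-up job card: an absolute link, a relative link, and an anchor without href.
SAMPLE_HTML = """
<div class="card-content">
  <a href="https://example.com/learn">Learn</a>
  <a href="jobs/example-job-0.html">Apply</a>
  <a>No href here</a>
</div>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
job = soup.find("div", class_="card-content")
links = job.find_all("a")

apply_href = links[1].get("href")    # same value as links[1]["href"]
missing_href = links[2].get("href")  # None, where links[2]["href"] would raise KeyError

# Resolve the relative href against the page URL.
apply_url = urljoin(BASE_URL, apply_href)
print(f"Apply here: {apply_url}")
```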