🍲 Beautiful Soup Speedrun

Your ultra-quick guide to web scraping and data cleaning in Python.

🛠️ 1. Installation

Beautiful Soup works with several parsers, including Python's built-in html.parser and the optional lxml and html5lib packages. It's commonly paired with the requests library, which fetches the web pages.

# Install Beautiful Soup and the requests library
pip install beautifulsoup4 requests
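
Because Beautiful Soup hands the actual parsing to a backend, it can help to check which one is available. The following is a minimal sketch, assuming lxml may or may not be installed; `pick_parser` is a hypothetical helper name, not part of Beautiful Soup.

```python
import importlib.util

from bs4 import BeautifulSoup


def pick_parser() -> str:
    """Return "lxml" if that optional package is installed, else the built-in parser."""
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"


parser_name = pick_parser()

# Parse a tiny inline snippet with whichever parser was found
soup = BeautifulSoup("<p>Hello, soup!</p>", parser_name)
print(parser_name, "->", soup.p.text)
```

The rest of this guide uses "html.parser", which needs no extra install.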

🌐 2. Fetch & Parse a Web Page

First, you get the HTML content of a page, then you parse it into a BeautifulSoup object.

import requests
from bs4 import BeautifulSoup

# The URL of the page you want to scrape
URL = "https://realpython.github.io/fake-jobs/"

# Fetch the page; time out on a stalled connection and raise on HTTP errors
page = requests.get(URL, timeout=10)
page.raise_for_status()

# Create a BeautifulSoup object to parse the HTML
soup = BeautifulSoup(page.content, "html.parser")
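
To experiment without hitting the network, you can parse an inline string the same way. This sketch uses made-up HTML (`SAMPLE_HTML` is an assumption standing in for page.content) and shows two quick ways to confirm the parse worked.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for page.content; not the real page's markup
SAMPLE_HTML = """
<html>
  <head><title>Fake Jobs</title></head>
  <body><h1 class="heading">Open positions</h1></body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Tags are reachable as attributes (soup.title) or via find()
page_title = soup.title.string
first_heading = soup.find("h1").text

print(page_title)
print(first_heading)
```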

🔎 3. Find Elements

You can find elements by their HTML tag, ID, or CSS class.

Find a single element by ID

This is great for finding unique containers.

# Find the div with the id 'ResultsContainer'
results_container = soup.find(id="ResultsContainer")
# print(results_container.prettify())
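
Note that find() returns None when nothing matches, so a guard avoids an AttributeError later. A minimal sketch on invented HTML (the "NoSuchId" id and the paragraph text are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration
html = '<div id="ResultsContainer"><p>3 jobs found</p></div>'
soup = BeautifulSoup(html, "html.parser")

found = soup.find(id="ResultsContainer")
missing = soup.find(id="NoSuchId")  # no such element, so this is None

if missing is None:
    print("No element with id 'NoSuchId'")

# Only touch the element after confirming it exists
summary = found.p.text if found is not None else ""
print(summary)
```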

Find all elements by tag and class

This returns a list-like ResultSet of all matching elements; it is empty if nothing matches.

# Find all div elements with the class 'card-content'
job_cards = results_container.find_all("div", class_="card-content")

for job in job_cards:
    # ... do something with each job card
    pass
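
If you prefer CSS selectors, select() can express the same tag-and-class lookup. This is a sketch on invented markup, comparing the two calls; the card contents ("A", "B", "C") are placeholders.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration
html = """
<div id="ResultsContainer">
  <div class="card-content">A</div>
  <div class="card-content featured">B</div>
  <div class="other">C</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches any element whose class list contains "card-content"
by_find_all = soup.find_all("div", class_="card-content")

# The equivalent CSS selector
by_select = soup.select("div.card-content")

labels = [card.text for card in by_find_all]
print(labels)
```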

🧹 4. Extract Data

Once you've found your elements, you can extract the text or attributes from them.

Extracting Text

Use the .text attribute to get an element's text, and .strip() to trim the surrounding whitespace.

for job in job_cards:
    title_element = job.find("h2", class_="title")
    company_element = job.find("h3", class_="company")
    location_element = job.find("p", class_="location")
    
    print(title_element.text.strip())
    print(company_element.text.strip())
    print(location_element.text.strip())
    print() # Add a blank line
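
On real pages a field is sometimes absent, and calling .text on None raises an AttributeError. One way to handle that is sketched below, using get_text(strip=True) in place of .text.strip(). The helper `text_or_default`, the "N/A" fallback, and the sample card are all assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up job card that deliberately lacks a location element
html = """
<div class="card-content">
  <h2 class="title">
    Senior Python Developer
  </h2>
  <h3 class="company">Example Corp</h3>
</div>
"""


def text_or_default(parent, tag, css_class, default="N/A"):
    """Return the stripped text of the first match, or a default if absent."""
    element = parent.find(tag, class_=css_class)
    if element is None:
        return default
    return element.get_text(strip=True)


card = BeautifulSoup(html, "html.parser").find("div", class_="card-content")

title = text_or_default(card, "h2", "title")
company = text_or_default(card, "h3", "company")
location = text_or_default(card, "p", "location")

print(title, company, location, sep=" | ")
```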

Access attributes like href using dictionary-style square brackets.

for job in job_cards:
    # The 'Apply' link is the second 'a' tag
    apply_link = job.find_all("a")[1]["href"]
    print(f"Apply here: {apply_link}")
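
Square brackets raise a KeyError when the attribute is missing, while .get() returns None instead. The sketch below also resolves relative links against the page URL with urllib.parse.urljoin from the standard library. The hrefs in the sample markup are invented for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://realpython.github.io/fake-jobs/"

# Made-up links: one absolute, one relative, one without an href
html = """
<footer>
  <a href="https://www.realpython.com">Learn</a>
  <a href="jobs/example-job-0.html">Apply</a>
  <a>No link here</a>
</footer>
"""
soup = BeautifulSoup(html, "html.parser")

links = []
for anchor in soup.find_all("a"):
    href = anchor.get("href")  # None when the attribute is missing
    if href is None:
        continue
    # Relative paths are resolved against BASE_URL; absolute URLs pass through
    links.append(urljoin(BASE_URL, href))

for link in links:
    print(link)
```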