🍜 MD-Souping: Aggressive HTML to Markdown Conversion

This cheat sheet shows how to use the python-markdownify library to aggressively clean HTML and convert it into readable Markdown. This is particularly useful for web scraping and content processing, where the source HTML is often messy or contains unwanted elements.

🚀 Installation

First, install the library using pip:

pip install markdownify

Basic Usage

The simplest way to convert HTML to Markdown is to use the markdownify function:

from markdownify import markdownify as md

html = '<b>Yay</b> <a href="http://github.com">GitHub</a>'
markdown = md(html)
# Output: '**Yay** [GitHub](http://github.com)'

👑 Advanced Pre-processing with Beautiful Soup

For heavily bloated or poorly structured HTML, using Beautiful Soup to manually clean the document before converting is the most powerful method. This gives you precise control to remove entire sections, extract specific content, and modify the tree.

First, install Beautiful Soup and a parser:

pip install beautifulsoup4 lxml

Example: Extracting Main Content

Let's say you only want the content of the main <article> tag, with every div of class ad-banner removed.

from bs4 import BeautifulSoup
from markdownify import markdownify as md

html = """... your bloated html ...
<header>Site Nav</header>
<article>
  <h1>Main Title</h1>
  <p>Some good content.</p>
  <div class='ad-banner'>Ugly Ad</div>
</article>
<footer>Copyright</footer>
"""

soup = BeautifulSoup(html, 'lxml')

# 1. Decompose (completely remove) unwanted elements
for ad in soup.find_all('div', class_='ad-banner'):
    ad.decompose()

# 2. Extract only the main article content
article = soup.find('article')

# 3. Convert the cleaned, extracted HTML to Markdown
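if article:
    # heading_style="ATX" gives '#' headings; the default underlines h1/h2
    markdown = md(str(article), heading_style="ATX")
    print(markdown)
    # Output (surrounding whitespace may vary by version):
    # '# Main Title\n\nSome good content.'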

This two-step process (clean with Beautiful Soup, then convert with markdownify) is the key to handling complex, real-world HTML.

🧹 Aggressive Cleaning with markdownify Options

markdownify provides several options to control the conversion process, allowing for aggressive cleaning of the HTML.

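Stripping Unwanted Tags

The strip option lists tags that markdownify should not convert: their markup is dropped, but the text inside them is kept. This is useful for flattening wrappers or discarding formatting such as links. To delete an element together with its content (ads, navigation bars), remove it with Beautiful Soup before converting.

from markdownify import markdownify as md

html = '<h1>Title</h1><p>Some text.</p><div class="ad">An ad</div>'

# Strip the 'div' tag: the wrapper disappears, its text does not
markdown = md(html, strip=['div'], heading_style='ATX')
# The result contains '# Title' and 'Some text.', and still contains 'An ad'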

Converting Only Specific Tags

Conversely, the convert option lists the only tags that should be converted. All other tags are left unconverted: their markup is dropped, but their text is kept. strip and convert cannot be used together.

from markdownify import markdownify as md

html = '<h1>Title</h1><p>Some text.</p><b>And some bold text.</b>'

# Convert only 'h1' and 'p' tags
markdown = md(html, convert=['h1', 'p'], heading_style='ATX')
# Output (roughly): '# Title\n\nSome text.\n\nAnd some bold text.'
# The <b> text is kept, but without the ** markers.

Handling Code Blocks

markdownify automatically converts <pre> tags into fenced code blocks. You can also provide a callback function, which receives the <pre> element, to extract the language from the tag's attributes.

from markdownify import markdownify as md

def get_lang(el):
    return el.get('class')[0] if el.get('class') else None

html = '<pre class="python">def hello():\n    print("Hello, World!")</pre>'

markdown = md(html, code_language_callback=get_lang)
# Output: '```python\ndef hello():\n    print("Hello, World!")\n```'

Custom Converters

For ultimate control, you can create a custom converter by subclassing MarkdownConverter and overriding the conversion method for specific tags. This allows you to define exactly how each tag should be handled.

from markdownify import MarkdownConverter

class MyConverter(MarkdownConverter):
    def convert_p(self, el, text, parent_tags):
        # Add a custom prefix to all paragraphs
        return f"> {text}\n\n"

def md_custom(html, **options):
    return MyConverter(**options).convert(html)

html = '<p>This is a paragraph.</p>'

markdown = md_custom(html)
# Output: '> This is a paragraph.', possibly followed by '\n\n' depending on the markdownify version

Command-Line Interface

markdownify also provides a command-line interface for quick conversions:

# Convert a file
markdownify input.html > output.md

# Pipe from stdin
cat input.html | markdownify > output.md

# See all options
markdownify -h

By combining these techniques, you can create powerful and flexible scripts for converting even the most complex and messy HTML into clean, well-structured Markdown.