Using the Screaming Frog SEO Spider to Generate Markdown at Scale – Screaming Frog

Source: Screaming Frog by Mark Porter.

TL;DR Summary of Extracting Clean Markdown Content from Websites Using Screaming Frog SEO Spider

Markdown extraction streamlines SEO workflows by converting web pages into clean, structured text ideal for AI tools. The Screaming Frog SEO Spider leverages Custom JavaScript with libraries like Readability.js and Turndown to automate content extraction at scale. Two methods are offered: a quick, no-setup approach and a customizable selector-based approach for precise content targeting. A Python script then bulk exports markdown files from crawl data, enabling efficient downstream use in AI pipelines, audits, or migrations.

Optimixed’s Overview: Streamlining Website Content Extraction into Markdown for SEO and AI Applications

Introduction to Markdown Extraction in SEO Workflows

Modern SEO and AI-powered workflows benefit greatly from converting web pages into markdown, a lightweight format that retains structural elements like headings, lists, and emphasis while removing extraneous HTML code. This results in smaller, cleaner content that reduces token usage in language models and is widely compatible with embedding and fine-tuning frameworks.

Why Markdown is Ideal for AI and SEO

Efficiency: Markdown is compact compared to bloated HTML, saving API costs and improving context window usage.
Structure Preservation: Maintains key content elements, aiding LLM understanding.
Compatibility: Most modern LLMs are trained on data that includes markdown or similar formats.
Readability: Easily reviewed in text editors, facilitating quality checks before processing.

Two Proven Approaches to Extract Markdown Using Screaming Frog

1. Readability.js + Turndown (Quick and Automated)

This method combines Mozilla’s Readability.js, which automatically detects the main content area of a page, with Turndown, which converts HTML to markdown. It requires minimal setup and runs as a Custom JavaScript snippet during the crawl with JavaScript rendering enabled. It produces clean markdown with YAML frontmatter metadata, ideal for most sites and scalable for large crawls.
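
As a rough illustration of consuming this kind of output downstream, the sketch below splits a markdown document into its frontmatter and body using only the Python standard library. The helper name split_frontmatter and the field names (title, url) are assumptions for illustration; the fields actually emitted by the snippet are not listed here, and a real YAML parser would be needed for nested or multi-line values.

```python
def split_frontmatter(text):
    """Split a markdown document into (metadata dict, body).

    Handles only flat `key: value` lines between two `---` fences;
    this is a simplification, not a full YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter present

    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing fence is the markdown body.
            body = "\n".join(lines[i + 1:]).lstrip("\n")
            return meta, body
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    return {}, text  # opening fence never closed; treat as plain markdown


# Hypothetical crawl output; the field names are illustrative only.
sample = (
    '---\n'
    'title: "Example Page"\n'
    'url: https://example.com/page\n'
    '---\n'
    '\n'
    '# Example Page\n'
    '\n'
    'Body text.\n'
)
meta, body = split_frontmatter(sample)
```

Keeping metadata separate from the body in this way makes it easier to store the URL or title alongside an embedding while sending only the body text to a model.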

2. Visual Custom Extraction + Turndown (Precise and Configurable)

For sites where Readability struggles or when only specific page sections are needed, this method uses Screaming Frog’s Visual Custom Extraction to select content via a CSS selector. The snippet then strips unwanted elements and converts the selected HTML to markdown. This approach demands a bit more configuration but ensures precise control over extracted content, suited for consistent templates or complex layouts.
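
The stripping step can be pictured with a small Python sketch. This is not the article's JavaScript snippet: it filters by tag name rather than CSS selector, uses only the standard library, and the STRIP_TAGS list and the ContentFilter and strip_elements names are hypothetical. It simply shows the idea of discarding non-content elements before the remaining HTML is handed to a markdown converter.

```python
from html.parser import HTMLParser

# Hypothetical list; a real project would tailor this to the site's templates.
# Only non-void elements belong here, since depth is tracked via end tags.
STRIP_TAGS = {"nav", "footer", "aside", "script", "style", "form"}


class ContentFilter(HTMLParser):
    """Re-emit HTML while dropping everything inside STRIP_TAGS elements."""

    def __init__(self):
        # Keep character references as written so they can be re-emitted intact.
        super().__init__(convert_charrefs=False)
        self.skip_depth = 0  # greater than 0 while inside a stripped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in STRIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no separate end tag to emit.
        if self.skip_depth == 0 and tag not in STRIP_TAGS:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in STRIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.parts.append(f"&#{name};")


def strip_elements(html):
    """Return the HTML with all STRIP_TAGS elements (and their contents) removed."""
    parser = ContentFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Passing a fragment that wraps a heading and paragraph between a nav and a footer through strip_elements would leave only the heading and paragraph inside the original container, ready for conversion.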

Bulk Exporting Markdown Files with Python

After crawling, exporting the Custom JavaScript data as an Excel file allows batch conversion of markdown content into individual files. A provided Python script processes each URL and markdown pair, generating well-named .md files with source URL comments. This facilitates seamless integration with vector databases, fine-tuning datasets, static site generators, and other AI tools.
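
A minimal sketch of this export step is shown below. It is not the script from the original article: the function names, the filename scheme, and the use of an HTML comment for the source URL are all assumptions, and it takes (url, markdown) pairs as input rather than reading the Excel file directly.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def url_to_filename(url):
    """Derive a filesystem-safe .md filename from a URL (illustrative scheme)."""
    parsed = urlparse(url)
    path = parsed.path.strip("/") or "index"
    slug = re.sub(r"[^A-Za-z0-9]+", "-", f"{parsed.netloc}-{path}").strip("-")
    return f"{slug.lower()}.md"


def export_markdown(rows, out_dir):
    """Write one .md file per (url, markdown) pair; return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for url, markdown in rows:
        # Skip empty cells, including non-string placeholders for missing values.
        if not isinstance(markdown, str) or not markdown.strip():
            continue
        target = out / url_to_filename(url)
        # An HTML comment records the source URL without affecting rendering.
        target.write_text(
            f"<!-- Source: {url} -->\n\n{markdown}\n", encoding="utf-8"
        )
        written.append(target)
    return written
```

The rows could be read from the exported spreadsheet with a library such as pandas (read_excel), using whatever column names the export produces. Note that this naming scheme ignores query strings, so two URLs that differ only in their parameters would overwrite each other; a production script would need to handle such collisions.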

Best Practices and Additional Tips

Enable JavaScript rendering in Screaming Frog to allow content scripts to run correctly.
Use the Readability.js method first for broad extraction; fall back to visual extraction for precision.
Customize STRIP_SELECTORS in the visual extraction snippet to exclude non-content elements unique to your site.
Segment large crawls or focus on clean URL subsets to optimize performance.
Regularly review output markdown for accuracy, especially after site redesigns.
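
Part of that review can be automated. The sketch below flags markdown that looks suspicious; the function name, the 200-character threshold, and the specific checks are assumptions chosen for illustration, and the heading check assumes ATX-style (#) headings, which depends on how the conversion is configured.

```python
import re


def review_markdown(markdown, min_chars=200):
    """Return a list of human-readable warnings for one markdown document."""
    warnings = []
    text = markdown.strip()
    if len(text) < min_chars:
        warnings.append(f"short content ({len(text)} chars)")
    # Assumes ATX headings; adjust if the converter emits underlined headings.
    if not re.search(r"^#{1,6} \S", text, flags=re.MULTILINE):
        warnings.append("no headings found")
    # Raw layout tags usually mean the conversion or stripping step missed something.
    if re.search(r"<(div|span|script|style)\b", text, flags=re.IGNORECASE):
        warnings.append("leftover HTML tags")
    return warnings
```

Running a check like this over every exported file after a site redesign gives a quick shortlist of pages to inspect by hand, rather than a verdict on their quality.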

Conclusion

By leveraging Screaming Frog’s Custom JavaScript capabilities combined with open-source libraries and a simple Python export script, SEOs and AI practitioners can efficiently generate clean markdown datasets at scale. This workflow supports a variety of use cases including knowledge base creation, training data preparation, content audits, and migration planning. The modular approach offers a balance between automation and customization, adaptable to diverse website architectures and project needs.
