TL;DR Summary of Extracting Clean Markdown Content from Websites Using Screaming Frog SEO Spider
Optimixed’s Overview: Streamlining Website Content Extraction into Markdown for SEO and AI Applications
Introduction to Markdown Extraction in SEO Workflows
Modern SEO and AI-powered workflows benefit greatly from converting web pages into markdown, a lightweight format that retains structural elements like headings, lists, and emphasis while removing extraneous HTML code. This results in smaller, cleaner content that reduces token usage in language models and is widely compatible with embedding and fine-tuning frameworks.
Why Markdown is Ideal for AI and SEO
- Efficiency: Markdown is compact compared to bloated HTML, saving API costs and improving context window usage.
- Structure Preservation: Maintains key content elements, aiding LLM understanding.
- Compatibility: Markdown and similar formats are common in LLM training data, so most modern models handle them well.
- Readability: Easily reviewed in text editors, facilitating quality checks before processing.
Two Proven Approaches to Extract Markdown Using Screaming Frog
1. Readability.js + Turndown (Quick and Automated)
This method integrates Mozilla’s Readability.js to automatically detect the main content area of a page and Turndown to convert that HTML to markdown. It requires minimal setup and runs as a Custom JavaScript snippet during the crawl, with JavaScript rendering enabled. It produces clean markdown with YAML frontmatter metadata, which makes it a good fit for most sites and scalable to large crawls.
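To make the output format concrete, here is a minimal Python sketch of what "markdown with YAML frontmatter" looks like once assembled. It is not the crawl-time snippet itself, which runs as JavaScript inside Screaming Frog. The function name `with_frontmatter` and the metadata fields `title` and `url` are assumptions chosen for illustration; the actual snippet may emit different fields.

```python
import json


def with_frontmatter(metadata: dict, body: str) -> str:
    """Prefix a markdown body with a YAML frontmatter block.

    The metadata keys are hypothetical; use whatever the crawl provides.
    """
    lines = ["---"]
    for key, value in metadata.items():
        # json.dumps produces a double-quoted string, which YAML also accepts,
        # so colons or quotes inside a title do not break the frontmatter.
        lines.append(f"{key}: {json.dumps(str(value), ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + body.strip() + "\n"


doc = with_frontmatter(
    {"title": "Example page", "url": "https://example.com/guide"},
    "# Example page\n\nBody text.",
)
print(doc)
```

Quoting every value is a deliberately conservative choice: page titles often contain colons, which would otherwise be read as YAML syntax.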
2. Visual Custom Extraction + Turndown (Precise and Configurable)
For sites where Readability struggles or when only specific page sections are needed, this method uses Screaming Frog’s Visual Custom Extraction to select content via a CSS selector. The snippet then strips unwanted elements and converts the selected HTML to markdown. This approach demands a bit more configuration but ensures precise control over extracted content, suited for consistent templates or complex layouts.
Bulk Exporting Markdown Files with Python
After crawling, exporting the Custom JavaScript data as an Excel file allows batch conversion of markdown content into individual files. A provided Python script processes each URL and markdown pair, generating well-named .md files with source URL comments. This facilitates seamless integration with vector databases, fine-tuning datasets, static site generators, and other AI tools.
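The export step can be sketched roughly as follows. This is not the script referenced above, only a minimal stdlib illustration of the same idea: take URL and markdown pairs, derive a filename from each URL, and write one .md file per page with a source URL comment at the top. The helper names `url_to_filename` and `export_markdown` are hypothetical, and reading the Excel export itself is left out because its column names depend on the crawl configuration.

```python
import re
import tempfile
from pathlib import Path
from urllib.parse import urlparse


def url_to_filename(url: str) -> str:
    """Turn a URL into a filesystem-safe .md filename (hypothetical scheme)."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}".strip("/")
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-").lower()
    return (slug or "index") + ".md"


def export_markdown(rows, out_dir):
    """Write one markdown file per (url, markdown) pair; return written paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for url, markdown in rows:
        # Skip rows where extraction produced nothing usable.
        if not isinstance(markdown, str) or not markdown.strip():
            continue
        path = out / url_to_filename(url)
        content = f"<!-- Source: {url} -->\n\n{markdown.strip()}\n"
        path.write_text(content, encoding="utf-8")
        written.append(path)
    return written


# Small demonstration against a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    files = export_markdown(
        [
            ("https://example.com/blog/My-Post/", "# My Post\n\nHello."),
            ("https://example.com/empty", ""),
        ],
        tmp,
    )
    names = [f.name for f in files]
    first_text = files[0].read_text(encoding="utf-8")
```

In practice the rows would come from the exported spreadsheet, for example by loading it with a library such as pandas and iterating over the URL and markdown columns. Note that this naming scheme drops query strings, so two URLs that differ only in parameters would overwrite each other; a real script may need a collision check.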
Best Practices and Additional Tips
- Enable JavaScript rendering in Screaming Frog to allow content scripts to run correctly.
- Use the Readability.js method first for broad extraction; fall back to visual extraction when you need precision.
- Customize STRIP_SELECTORS in the visual extraction snippet to exclude non-content elements unique to your site.
- Segment large crawls or focus on clean URL subsets to optimize performance.
- Regularly review output markdown for accuracy, especially after site redesigns.
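The review step in the last tip can be partly automated. The sketch below is one possible sanity check, not part of the original workflow: it flags output that is empty, suspiciously short, missing headings, or still contains raw HTML tags. The function name `review_markdown`, the 200-character threshold, and the list of tags are all assumptions to tune per site.

```python
import re

# Tags that usually indicate the HTML-to-markdown conversion missed something.
# The list is an assumption; extend it for your own templates.
HTML_TAG = re.compile(r"</?(?:div|span|script|style|nav|iframe)\b[^>]*>", re.IGNORECASE)


def review_markdown(markdown: str, min_chars: int = 200) -> list[str]:
    """Return a list of human-readable issues found in one markdown document."""
    text = markdown.strip()
    if not text:
        return ["empty output"]
    issues = []
    if len(text) < min_chars:
        issues.append(f"short output ({len(text)} chars)")
    # A page with no ATX heading often means the wrong container was extracted.
    if not re.search(r"^#{1,6} ", text, re.MULTILINE):
        issues.append("no headings found")
    if HTML_TAG.search(text):
        issues.append("leftover HTML tags")
    return issues


print(review_markdown("<div>hi</div>"))
```

Running a check like this over every exported file after a crawl gives a short list of pages to inspect by hand, which is cheaper than reading the whole dataset after each site redesign.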
Conclusion
By leveraging Screaming Frog’s Custom JavaScript capabilities combined with open-source libraries and a simple Python export script, SEOs and AI practitioners can efficiently generate clean markdown datasets at scale. This workflow supports a variety of use cases including knowledge base creation, training data preparation, content audits, and migration planning. The modular approach offers a balance between automation and customization, adaptable to diverse website architectures and project needs.