Blog Content Scraper
Extract blog content from any website for LLM consumption. Zero API costs.
The Problem with LLM-Based Content Extraction
I needed to feed blog articles to LLMs for analysis and summarization, but LLM-based extraction services charge per request for something deterministic parsers handle better. Blog posts follow predictable HTML structures with article tags, metadata schemas, and content wrappers. Paying AI API costs to extract structured content felt wasteful when Mozilla Readability and Cheerio could parse the DOM reliably. I wanted zero external dependencies, stateless processing, and validation that separated quality content from scraped junk.
Three-Tier Filtering for Content Quality
I built a filtering pipeline with URL deny patterns (ads, trackers, social media embeds), content validation requiring a minimum of 200 characters, and quality scoring across five dimensions. The scorer weights content presence at 60%, publication date at 12%, author attribution at 8%, schema.org markup at 8%, and reading time estimation at 12%. This composite score separates legitimate articles from landing pages, navigation fragments, and promotional material. Only content scoring above the threshold enters the extraction pipeline.
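The composite score can be sketched as a weighted sum over those five signals. This is a minimal illustration, not the package's actual implementation: the interface fields, `MIN_CONTENT_CHARS`, and `qualityScore` are assumed names, while the weights match the percentages above.

```typescript
// Hypothetical sketch of the five-dimension quality scorer.
// Weights mirror the text: content 60%, date 12%, author 8%,
// schema.org 8%, reading time 12%.

interface ContentSignals {
  contentLength: number;        // characters of extracted body text
  hasPublicationDate: boolean;
  hasAuthor: boolean;
  hasSchemaOrgMarkup: boolean;
  readingTimeMinutes: number;   // estimated from word count
}

const MIN_CONTENT_CHARS = 200;  // validation floor from the second tier

function qualityScore(s: ContentSignals): number {
  // Content presence dominates at 60%; the remaining signals share 40%.
  const content = s.contentLength >= MIN_CONTENT_CHARS ? 1 : 0;
  const readingTime = s.readingTimeMinutes > 0 ? 1 : 0;
  return (
    0.6 * content +
    0.12 * (s.hasPublicationDate ? 1 : 0) +
    0.08 * (s.hasAuthor ? 1 : 0) +
    0.08 * (s.hasSchemaOrgMarkup ? 1 : 0) +
    0.12 * readingTime
  );
}
```

A landing page with thin text and no byline lands near zero, while a full article with date, author, and markup scores close to 1.0, which makes the threshold cut straightforward.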
Validation Against Dragnet Benchmark
I tested accuracy against the Dragnet dataset, which contains 414 articles with ground truth annotations. The scraper achieved a 92.2% F1 score, balancing precision and recall across diverse blog platforms. Mozilla Readability handles the heavy lifting for article extraction, while Cheerio manages metadata parsing and structural analysis. Playwright runs headless for JavaScript-heavy sites requiring rendering, but most extractions use lightweight HTTP requests. Zod schemas validate output structure and enforce type safety across the pipeline.
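For context on what that 92.2% means, here is a sketch of token-level F1 in the style of Dragnet evaluations: precision is the fraction of extracted tokens that appear in the ground truth, recall is the fraction of ground-truth tokens recovered, and F1 is their harmonic mean. The bag-of-tokens matching below is an assumption; the benchmark's exact tokenization and matching rules may differ.

```typescript
// Token-level F1 between extracted content and ground-truth annotations.
// Tokens are matched as multisets so repeated words count correctly.

function tokenF1(extracted: string[], groundTruth: string[]): number {
  const truth = new Map<string, number>();
  for (const t of groundTruth) truth.set(t, (truth.get(t) ?? 0) + 1);

  let matched = 0;
  for (const t of extracted) {
    const remaining = truth.get(t) ?? 0;
    if (remaining > 0) {
      matched++;
      truth.set(t, remaining - 1);
    }
  }
  if (matched === 0) return 0;
  const precision = matched / extracted.length;
  const recall = matched / groundTruth.length;
  return (2 * precision * recall) / (precision + recall);
}
```

Over-extraction (navigation, boilerplate) drags precision down; dropped paragraphs drag recall down, so a high F1 requires the filter tiers to err in neither direction.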
LLM-Ready Output with Token Counting and Chunking
Version 0.4.0 added LLM integration features. Token counting uses tiktoken to estimate costs before sending content to GPT models, preventing budget surprises. Chunking splits long articles at semantic boundaries (headings, paragraphs) to fit context windows while preserving narrative flow. Metadata includes source URL, publication date, author, reading time, and extraction timestamp, giving LLMs full context for analysis. Output formats support JSON, Markdown, and plain text depending on downstream consumption needs.
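The chunking logic above can be sketched as a greedy packer that splits on blank lines and headings, then fills each chunk up to a token budget. This is an illustration only: the function names are assumptions, and the four-characters-per-token estimate stands in for a real tokenizer like tiktoken.

```typescript
// Rough token estimate; a real pipeline would use tiktoken for accuracy.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Split at semantic boundaries (blank lines, markdown headings) and pack
// blocks greedily so no chunk exceeds the budget and no paragraph is cut.
function chunkBySemanticBoundaries(text: string, maxTokens: number): string[] {
  const blocks = text.split(/\n\s*\n/).map(b => b.trim()).filter(Boolean);
  const chunks: string[] = [];
  let current: string[] = [];
  let currentTokens = 0;

  for (const block of blocks) {
    const blockTokens = estimateTokens(block);
    const isHeading = /^#{1,6}\s/.test(block);
    // Flush when the budget would overflow, or at a heading once the
    // current chunk has content, so each section stays together.
    if (current.length > 0 && (currentTokens + blockTokens > maxTokens || isHeading)) {
      chunks.push(current.join("\n\n"));
      current = [];
      currentTokens = 0;
    }
    current.push(block);
    currentTokens += blockTokens;
  }
  if (current.length > 0) chunks.push(current.join("\n\n"));
  return chunks;
}
```

Splitting only at block boundaries is what preserves narrative flow: a chunk never ends mid-sentence, and each heading starts a fresh chunk so downstream LLM calls receive coherent sections.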
Caching, Batch Processing, and Framework Integration
The caching layer uses TTL-based invalidation to avoid re-scraping unchanged content. Batch processing handles multiple URLs with rate limiting to respect server resources and avoid IP blocks. I published React hooks (useScraper) for client-side integration and Express middleware for server-side routing. The npm package (@tyroneross/blog-scraper) exposes 8 export paths covering core extraction, LLM utilities, batch operations, caching, validation, testing helpers, debug tooling, and framework adapters. This modularity lets developers import only the functionality they need.
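A TTL cache of the kind described can be sketched in a few lines. The class and method names are assumptions, not the package's public API; the injectable clock exists purely to make expiry testable, and invalidation happens lazily on read rather than via a background sweep.

```typescript
// Minimal TTL cache: entries expire a fixed interval after being written.
class TtlCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();

  constructor(
    private ttlMs: number,
    private now: () => number = Date.now, // injectable clock for tests
  ) {}

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.store.delete(key); // lazy invalidation on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.store.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

Keyed by URL, a cache like this means a batch run that revisits the same post within the TTL window serves the parsed result from memory instead of issuing a second HTTP request.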
Stateless Architecture and Zero External Costs
The scraper runs entirely in-process with no external API dependencies. Stateless design means horizontal scaling works without coordination overhead. Deployment options include serverless functions, Docker containers, or traditional Node servers. No API keys, no rate limits from third parties, no variable costs per request. The trade-off is handling HTML parsing complexity and site-specific quirks, but for blog content following common patterns, deterministic parsing wins over probabilistic LLM extraction.
What I Built
Blog Content Scraper extracts article text, metadata, and structure from any blog using Mozilla Readability and Cheerio. Three-tier filtering ensures quality, validated at 92.2% F1 score against Dragnet benchmarks. Version 0.4.0 adds LLM-ready output with token counting, semantic chunking, caching, batch processing, and React/Express integrations. Published as an npm package with modular exports, it delivers zero-cost extraction for feeding blog content to language models.