Crawlee – Dotnet Indie Stack

Crawlee is a modern web scraping and browser automation library for Node.js. Built by the team behind Apify, it is designed to handle everything from simple scraping to large-scale crawling.

Features

Multiple Crawling Modes

Cheerio Crawler: Fast HTTP scraping with Cheerio parsing for static content
JSDOM Crawler: Scrape sites that need basic JavaScript execution
Playwright Crawler: Full browser automation for complex SPAs
Puppeteer Crawler: Alternative headless browser option
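
As a rough illustration of how the modes above differ, the sketch below maps what a page needs to a crawler class name. The helper `pickCrawler` and the `PageNeeds` type are hypothetical and exist only for this example; the returned strings are the names of the corresponding Crawlee classes.

```typescript
// Hypothetical helper, not part of Crawlee: maps page requirements
// to the name of a Crawlee crawler class.
type CrawlerName = 'CheerioCrawler' | 'JSDOMCrawler' | 'PlaywrightCrawler';

interface PageNeeds {
    // The page runs scripts that must execute before the data exists.
    needsJavaScript: boolean;
    // The page needs a real browser (rendering, clicks, SPA routing).
    needsRealBrowser: boolean;
}

function pickCrawler(needs: PageNeeds): CrawlerName {
    if (needs.needsRealBrowser) return 'PlaywrightCrawler';
    if (needs.needsJavaScript) return 'JSDOMCrawler';
    // Static HTML: plain HTTP requests plus Cheerio parsing are enough.
    return 'CheerioCrawler';
}

console.log(pickCrawler({ needsJavaScript: false, needsRealBrowser: false }));
```

The order of the checks reflects cost: the cheapest crawler that can still see the data is usually the better default.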

Built-in Intelligence

Auto-scaling: Automatically manages concurrency based on available resources
Request Queue: Persistent queue for URLs to be crawled
Dataset Storage: Store scraped data locally or in the cloud
Session Management: Handle cookies, authentication, and sessions
Proxy Rotation: Built-in proxy management and rotation
Retry Logic: Automatic retries with exponential backoff
Rate Limiting: Cap request rates to avoid overloading servers, and respect robots.txt
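
To make the retry idea concrete, here is a generic sketch of retrying with exponential backoff. It is not Crawlee's internal implementation; the names `backoffDelayMs` and `withRetries` and the default delays are assumptions chosen for illustration.

```typescript
// Generic exponential backoff sketch; names and defaults are illustrative.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 30000): number {
    // attempt 0 -> baseMs, attempt 1 -> 2 * baseMs, attempt 2 -> 4 * baseMs, capped at maxMs.
    return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Runs the task once, then retries up to maxRetries times, waiting longer each time.
async function withRetries<T>(
    task: () => Promise<T>,
    maxRetries = 3,
    baseMs = 500,
): Promise<T> {
    for (let attempt = 0; ; attempt++) {
        try {
            return await task();
        } catch (error) {
            // Give up once the retry budget is spent.
            if (attempt >= maxRetries) throw error;
            const delay = backoffDelayMs(attempt, baseMs);
            await new Promise<void>((resolve) => setTimeout(resolve, delay));
        }
    }
}

console.log([0, 1, 2, 3].map((attempt) => backoffDelayMs(attempt)));
```

With the defaults, the waits before the first four retries are 500, 1000, 2000, and 4000 ms. Capping the delay keeps a long failure streak from stalling the crawl indefinitely.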

Developer Experience

TypeScript First: Full TypeScript support with excellent types
Error Handling: Comprehensive error handling and logging
Testing Tools: Built-in utilities for testing scrapers
Monitoring: Track performance and progress in real-time
CLI Tools: Command-line tools for project generation

Quick Start

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request, enqueueLinks }) {
        // Extract data from the page
        const data = await page.evaluate(() => ({
            title: document.querySelector('h1')?.textContent,
            price: document.querySelector('.price')?.textContent,
        }));

        // Save the data
        await Dataset.pushData(data);

        // Find and enqueue more links
        await enqueueLinks({
            selector: 'a.product-link',
        });
    },
});

// Start the crawl (top-level await requires an ES module)
await crawler.run(['https://example.com']);

Use Cases

E-commerce

Product catalog scraping
Price monitoring and comparison
Review collection
Inventory tracking

Data Collection

Lead generation
Market research
Content aggregation
Competitive intelligence

Automation

Testing and QA
Screenshot generation
Form submission
PDF generation

Monitoring

Website change detection
SEO monitoring
Content validation
Availability checking

Key Advantages

vs Cheerio

Handles JavaScript-rendered content
Manages complex workflows
Built-in queue and storage

vs Puppeteer

Higher-level abstractions
Auto-scaling and resource management
Easier proxy and session handling
Better error recovery

vs Selenium

Better performance
Modern API design
Built for scraping workflows
Lower resource usage

Architecture

Request Queue → Crawler → Request Handler → Dataset
     ↓              ↓              ↓            ↓
  URLs to       Auto-scaling   Your code    Scraped
   process      & retries      extracts      data
                               data
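
The flow in the diagram can be sketched as a small in-memory loop. This is a simplified model, not Crawlee's code: `crawl`, `PageResult`, and the fake `site` map are hypothetical, the handler is synchronous, and there is no concurrency or persistence.

```typescript
// Illustrative model of the pipeline; every name here is hypothetical.
interface PageResult<T> {
    item: T;          // data extracted from the page
    links: string[];  // URLs discovered on the page
}

function crawl<T>(
    startUrls: string[],
    handler: (url: string) => PageResult<T>,
    maxRetries = 2,
): T[] {
    const queue: string[] = [];      // Request Queue: URLs to process
    const seen = new Set<string>();  // ensures each URL is enqueued once
    const dataset: T[] = [];         // Dataset: scraped data

    const enqueue = (url: string): void => {
        if (!seen.has(url)) {
            seen.add(url);
            queue.push(url);
        }
    };
    startUrls.forEach(enqueue);

    let url: string | undefined;
    while ((url = queue.shift()) !== undefined) {
        // Crawler: retry a failing URL, then drop it once attempts run out.
        for (let attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                const { item, links } = handler(url); // Request Handler: your code
                dataset.push(item);
                links.forEach(enqueue);
                break;
            } catch {
                // Fall through to the next attempt.
            }
        }
    }
    return dataset;
}

// A fake three-page site whose pages link to each other.
const site = new Map<string, string[]>([
    ['https://example.com/', ['https://example.com/a', 'https://example.com/b']],
    ['https://example.com/a', ['https://example.com/b']],
    ['https://example.com/b', ['https://example.com/']],
]);

const results = crawl(['https://example.com/'], (url) => ({
    item: { url },
    links: site.get(url) ?? [],
}));
console.log(results.length); // 3
```

Because the queue tracks every URL it has seen, the circular links in the fake site do not cause an endless crawl.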

Storage Options

Local Storage: Files on disk during development
Apify Platform: Cloud storage for production
Custom Storage: Implement your own storage backend
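
As a rough picture of what a custom backend involves, the sketch below defines a tiny storage contract and an in-memory implementation. The `DatasetBackend` interface and `MemoryDataset` class are hypothetical and much smaller than the storage client interface Crawlee actually expects; they only show the shape of the idea.

```typescript
// Hypothetical storage contract for illustration; not Crawlee's real interface.
interface DatasetBackend<T> {
    pushData(item: T): Promise<void>;
    getData(): Promise<T[]>;
}

// In-memory backend: useful in tests, loses everything when the process exits.
class MemoryDataset<T> implements DatasetBackend<T> {
    private readonly items: T[] = [];

    async pushData(item: T): Promise<void> {
        this.items.push(item);
    }

    async getData(): Promise<T[]> {
        // Return a copy so callers cannot mutate the stored items array.
        return [...this.items];
    }
}

const store = new MemoryDataset<{ title: string }>();
store
    .pushData({ title: 'Example product' })
    .then(() => store.getData())
    .then((rows) => console.log(rows.length));
```

A disk or cloud backend would implement the same two methods with file or network calls, which is why both are asynchronous here.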

Best Practices

Start with CheerioCrawler for simple sites
Use PlaywrightCrawler only when needed (a full browser is slower and uses more resources)
Implement proper rate limiting
Use session pools for authenticated scraping
Save data incrementally, not all at once
Monitor resource usage in production
Respect robots.txt and website ToS
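
The rate limiting advice can be sketched as a small options object. The `CrawlLimits` type is a local stand-in that mirrors a few option names from Crawlee's crawler options (`minConcurrency`, `maxConcurrency`, `maxRequestsPerMinute`, `maxRequestRetries`); the values and the `averageIntervalMs` helper are assumptions for illustration, not recommendations.

```typescript
// Local type mirroring a few Crawlee crawler option names, so the example
// runs without installing the library.
interface CrawlLimits {
    minConcurrency: number;
    maxConcurrency: number;
    maxRequestsPerMinute: number;
    maxRequestRetries: number;
}

// Placeholder values; tune them for the target site.
const politeLimits: CrawlLimits = {
    minConcurrency: 1,
    maxConcurrency: 10,
    maxRequestsPerMinute: 60,
    maxRequestRetries: 3,
};

// Hypothetical helper: average spacing between requests implied by the cap.
function averageIntervalMs(limits: CrawlLimits): number {
    return 60000 / limits.maxRequestsPerMinute;
}

console.log(averageIntervalMs(politeLimits)); // 1000

// With crawlee installed, the same object can be spread into a crawler:
// new CheerioCrawler({ ...politeLimits, requestHandler });
```

Keeping the limits in one object makes it easy to reuse the same settings when switching from one crawler class to another.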

Pricing

Open Source: Free to use locally
Apify Platform: Pay-as-you-go cloud execution
- Free tier available
- Scales based on usage

Best For

Node.js/TypeScript developers
Large-scale web scraping projects
Projects needing both HTTP and browser crawling
Teams wanting reliable, production-ready scrapers
Developers already using Apify
Projects requiring sophisticated proxy rotation

Crawlee is a modern, developer-friendly web scraping library for Node.js that abstracts away the complexity of large-scale crawling while giving you full control when needed.