
Crawlee

Crawlee is an open-source web scraping and browser automation library for Node.js. Built by the team behind Apify, it covers everything from simple single-page scraping to large-scale crawling.

Features

Multiple Crawling Modes

  • Cheerio Crawler: Fast HTTP scraping with Cheerio for static content
  • JSDOM Crawler: Scrape sites that need basic JavaScript execution
  • Playwright Crawler: Full browser automation for complex SPAs
  • Puppeteer Crawler: Alternative headless browser option
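
One way to read the list above is as a ladder: start with the cheapest crawler and move up only when the site requires it. The sketch below encodes that rule of thumb. The helper pickCrawler and the SiteNeeds shape are hypothetical and are not part of Crawlee; only the returned class names refer to Crawlee crawlers. PuppeteerCrawler would sit on the same rung as PlaywrightCrawler.

```typescript
// Hypothetical helper (not a Crawlee API): maps what a target site needs
// to the name of the cheapest Crawlee crawler class likely to handle it.
type CrawlerName = 'CheerioCrawler' | 'JSDOMCrawler' | 'PlaywrightCrawler';

interface SiteNeeds {
    // Content only appears after basic client-side scripts run.
    needsJavaScript: boolean;
    // Needs real browser behavior: clicks, scrolling, SPA routing.
    needsRealBrowser: boolean;
}

function pickCrawler(needs: SiteNeeds): CrawlerName {
    if (needs.needsRealBrowser) return 'PlaywrightCrawler';
    if (needs.needsJavaScript) return 'JSDOMCrawler';
    return 'CheerioCrawler';
}

console.log(pickCrawler({ needsJavaScript: false, needsRealBrowser: false }));
```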

Built-in Intelligence

  • Auto-scaling: Automatically manages concurrency based on available resources
  • Request Queue: Persistent queue for URLs to be crawled
  • Dataset Storage: Store scraped data locally or in the cloud
  • Session Management: Handle cookies, authentication, and sessions
  • Proxy Rotation: Built-in proxy management and rotation
  • Retry Logic: Automatic retries with exponential backoff
  • Rate Limiting: Respect robots.txt and avoid overloading servers
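
To make "retries with exponential backoff" concrete, here is a minimal, library-independent sketch of the general technique. It is not Crawlee's internal implementation. The names backoffDelay and withRetries, and the default values (500 ms base, 30 s cap, 3 retries), are assumptions chosen for illustration.

```typescript
// Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs.
function backoffDelay(attempt: number, baseMs = 500, maxMs = 30000): number {
    return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Runs `task`, retrying up to `maxRetries` times with growing delays.
// `sleep` is injectable so the waiting can be replaced in tests.
async function withRetries<T>(
    task: () => Promise<T>,
    maxRetries = 3,
    sleep: (ms: number) => Promise<void> = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
    for (let attempt = 0; ; attempt++) {
        try {
            return await task();
        } catch (err) {
            // Out of retries: surface the last error to the caller.
            if (attempt >= maxRetries) throw err;
            await sleep(backoffDelay(attempt));
        }
    }
}

console.log([0, 1, 2, 3].map((attempt) => backoffDelay(attempt)));
```

With these defaults, the waits between attempts double each time (500 ms, 1 s, 2 s, ...) until they reach the cap, which keeps a struggling server from being hit at a constant rate.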

Developer Experience

  • TypeScript First: Full TypeScript support with excellent types
  • Error Handling: Comprehensive error handling and logging
  • Testing Tools: Built-in utilities for testing scrapers
  • Monitoring: Track performance and progress in real-time
  • CLI Tools: Command-line tools for project generation

Quick Start

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request, enqueueLinks }) {
        // Extract data from the page
        const data = await page.evaluate(() => ({
            title: document.querySelector('h1')?.textContent,
            price: document.querySelector('.price')?.textContent,
        }));

        // Save the data
        await Dataset.pushData(data);

        // Find and enqueue more links
        await enqueueLinks({
            selector: 'a.product-link',
        });
    },
});

await crawler.run(['https://example.com']);
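
The textContent values extracted above are raw strings and can be null or undefined when a selector matches nothing. A small cleanup step before saving usually pays off. The helpers below, cleanText and parsePrice, are hypothetical and are not part of Crawlee; the sketch assumes prices formatted like "$1,299.00", with a comma as the thousands separator.

```typescript
// Collapses whitespace and trims; returns null for missing or empty text.
function cleanText(raw: string | null | undefined): string | null {
    const text = raw?.replace(/\s+/g, ' ').trim();
    return text ? text : null;
}

// Parses a price string into a number, or null if no number can be read.
// Assumes "1,299.00"-style formatting (comma = thousands, dot = decimals).
function parsePrice(raw: string | null | undefined): number | null {
    const text = cleanText(raw);
    if (text === null) return null;
    // Keep only digits and the decimal point.
    const numeric = text.replace(/[^0-9.]/g, '');
    if (numeric === '') return null;
    const value = Number.parseFloat(numeric);
    return Number.isNaN(value) ? null : value;
}

console.log(cleanText('  Blue \n Widget '), parsePrice(' $1,299.00 '));
```

In the handler above, the cleaned values would be what gets passed to Dataset.pushData instead of the raw strings.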

Use Cases

E-commerce

  • Product catalog scraping
  • Price monitoring and comparison
  • Review collection
  • Inventory tracking

Data Collection

  • Lead generation
  • Market research
  • Content aggregation
  • Competitive intelligence

Automation

  • Testing and QA
  • Screenshot generation
  • Form submission
  • PDF generation

Monitoring

  • Website change detection
  • SEO monitoring
  • Content validation
  • Availability checking

Key Advantages

vs Cheerio

  • Handles JavaScript-rendered content (via the browser-based crawlers)
  • Manages complex workflows
  • Built-in queue and storage

vs Puppeteer

  • Higher-level abstractions
  • Auto-scaling and resource management
  • Easier proxy and session handling
  • Better error recovery

vs Selenium

  • Better performance
  • Modern API design
  • Built for scraping workflows
  • Lower resource usage

Architecture

Request Queue → Crawler → Request Handler → Dataset
     ↓              ↓              ↓            ↓
  URLs to       Auto-scaling   Your code    Scraped
   process      & retries      extracts      data
                               data
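
The flow in the diagram can be modeled in a few lines. The sketch below is a deliberately simplified, in-memory toy and not Crawlee's implementation: it runs sequentially and leaves out auto-scaling, retries, and persistence. The function runPipeline, the Handler type, and the fake link graph are all hypothetical.

```typescript
type Item = Record<string, unknown>;
// Receives a URL and an enqueue function, returns one scraped item.
type Handler = (url: string, enqueue: (urls: string[]) => void) => Item;

function runPipeline(startUrls: string[], handler: Handler, maxRequests = 100): Item[] {
    const queue: string[] = [];      // request queue: URLs waiting to be processed
    const seen = new Set<string>();  // every URL ever enqueued, for deduplication
    const dataset: Item[] = [];      // dataset: one item per handled request

    // Deduplicating enqueue: each URL enters the queue at most once.
    const enqueue = (urls: string[]): void => {
        for (const url of urls) {
            if (!seen.has(url)) {
                seen.add(url);
                queue.push(url);
            }
        }
    };

    enqueue(startUrls);
    let processed = 0;
    while (queue.length > 0 && processed < maxRequests) {
        const url = queue.shift() as string;
        dataset.push(handler(url, enqueue));
        processed++;
    }
    return dataset;
}

// Fake site: which pages link to which (stands in for real HTTP fetches).
const links: Record<string, string[]> = {
    'https://example.com/': ['https://example.com/a', 'https://example.com/b'],
    'https://example.com/a': ['https://example.com/b', 'https://example.com/'],
    'https://example.com/b': [],
};

const results = runPipeline(['https://example.com/'], (url, enqueue) => {
    const outgoing = links[url] ?? [];
    enqueue(outgoing);
    return { url, outgoing: outgoing.length };
});

console.log(results.map((item) => item.url));
```

Even though the pages link back to each other, each URL is handled exactly once, which is the property the request queue provides in the diagram.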

Storage Options

  • Local Storage: Files on disk during development
  • Apify Platform: Cloud storage for production
  • Custom Storage: Implement your own storage backend

Best Practices

  • Start with CheerioCrawler for simple sites
  • Use PlaywrightCrawler only when needed (a full browser is slower and uses more memory than plain HTTP requests)
  • Implement proper rate limiting
  • Use session pools for authenticated scraping
  • Save data incrementally, not all at once
  • Monitor resource usage in production
  • Respect robots.txt and website ToS
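
As an illustration of the rate-limiting advice above, the sketch below spaces requests evenly to stay within a requests-per-minute budget. It is a generic technique, not Crawlee's internal mechanism; in Crawlee itself the crawler options maxRequestsPerMinute and maxConcurrency serve this purpose. The RateLimiter class and minIntervalMs function are hypothetical, and time is passed in explicitly so the behavior is easy to verify.

```typescript
// Minimum spacing between request starts for a requests-per-minute budget.
function minIntervalMs(maxRequestsPerMinute: number): number {
    if (maxRequestsPerMinute <= 0) {
        throw new RangeError('maxRequestsPerMinute must be positive');
    }
    return 60000 / maxRequestsPerMinute;
}

class RateLimiter {
    private nextFreeAt = 0; // earliest time (ms) at which the next request may start
    private readonly intervalMs: number;

    constructor(intervalMs: number) {
        this.intervalMs = intervalMs;
    }

    // Reserves the next slot and returns how long (ms) the caller should wait
    // from `nowMs` before sending its request.
    reserve(nowMs: number): number {
        const startAt = Math.max(nowMs, this.nextFreeAt);
        this.nextFreeAt = startAt + this.intervalMs;
        return startAt - nowMs;
    }
}

// 120 requests per minute means request starts are at least 500 ms apart.
const limiter = new RateLimiter(minIntervalMs(120));
const waits = [limiter.reserve(0), limiter.reserve(0), limiter.reserve(100), limiter.reserve(5000)];
console.log(waits);
```

A caller would sleep for the returned number of milliseconds before each request; after a quiet period the wait drops back to zero.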

Pricing

  • Open Source: Free to use locally
  • Apify Platform: Pay-as-you-go cloud execution
    • Free tier available
    • Scales based on usage

Best For

  • Node.js/TypeScript developers
  • Large-scale web scraping projects
  • Projects needing both HTTP and browser crawling
  • Teams wanting reliable, production-ready scrapers
  • Developers already using Apify
  • Projects requiring sophisticated proxy rotation

Crawlee is a modern, developer-friendly web scraping library for Node.js that abstracts away much of the complexity of large-scale crawling while leaving you full control when needed.

Ready to get started? Visit the official site to learn more.
