Crawlee
Modern web scraping and browser automation library for Node.js. Built by the team behind Apify, Crawlee is designed to handle everything from simple scraping to large-scale crawling with ease.
Features
Multiple Crawling Modes
- Cheerio Crawler: Fast HTTP scraping that parses static HTML with Cheerio, no browser required
- JSDOM Crawler: Scrape sites that need basic JavaScript execution
- Playwright Crawler: Full browser automation for complex SPAs
- Puppeteer Crawler: Alternative headless browser option
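For static pages, the Cheerio-based crawler is usually the fastest choice. A minimal sketch (the selector and start URL here are placeholders, not part of any real site):

```typescript
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $, request, enqueueLinks }) {
        // Cheerio parses the fetched HTML; no browser is launched.
        const title = $('title').text();
        await Dataset.pushData({ url: request.url, title });

        // Discover and enqueue further links found on the page.
        await enqueueLinks();
    },
});

await crawler.run(['https://example.com']);
```

Switching to JavaScript-capable crawling is largely a matter of swapping `CheerioCrawler` for `PlaywrightCrawler` and reading from `page` instead of `$`.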
Built-in Intelligence
- Auto-scaling: Automatically manages concurrency based on available resources
- Request Queue: Persistent queue for URLs to be crawled
- Dataset Storage: Store scraped data locally or in the cloud
- Session Management: Handle cookies, authentication, and sessions
- Proxy Rotation: Built-in proxy management and rotation
- Retry Logic: Automatic retries with exponential backoff
- Rate Limiting: Throttle concurrency and request rates to avoid overloading servers
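The retry behavior above follows the familiar exponential-backoff pattern: wait longer after each failed attempt, up to a cap. A simplified, self-contained sketch of the idea (hypothetical names, not Crawlee's internals):

```typescript
// Delay doubles with each attempt: 1s, 2s, 4s, ... capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 60_000): number {
    return Math.min(baseMs * 2 ** attempt, maxMs);
}

// Retry an async operation, backing off between attempts.
async function withRetries<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
    let lastError: unknown;
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
        try {
            return await fn();
        } catch (err) {
            lastError = err;
            if (attempt < maxRetries) {
                await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
            }
        }
    }
    throw lastError;
}
```

In Crawlee this is handled for you; the sketch only shows why a failing request does not hammer the target server.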
Developer Experience
- TypeScript First: Full TypeScript support with excellent types
- Error Handling: Comprehensive error handling and logging
- Testing Tools: Built-in utilities for testing scrapers
- Monitoring: Track performance and progress in real-time
- CLI Tools: Command-line tools for project generation
Quick Start
import { Dataset, PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request, enqueueLinks }) {
        // Extract data from the page
        const data = await page.evaluate(() => ({
            title: document.querySelector('h1')?.textContent,
            price: document.querySelector('.price')?.textContent,
        }));

        // Save the data
        await Dataset.pushData(data);

        // Find and enqueue more links
        await enqueueLinks({
            selector: 'a.product-link',
        });
    },
});

await crawler.run(['https://example.com']);
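Once a run finishes, the items pushed during the crawl can be read back from the default dataset. A minimal sketch, assuming the run above has completed:

```typescript
import { Dataset } from 'crawlee';

// Open the default dataset and read everything pushed during the crawl.
const dataset = await Dataset.open();
const { items } = await dataset.getData();
console.log(`Scraped ${items.length} items`);
```

During local development the same data also lands as files under the project's storage directory, so it can be inspected without any code.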
Use Cases
E-commerce
- Product catalog scraping
- Price monitoring and comparison
- Review collection
- Inventory tracking
Data Collection
- Lead generation
- Market research
- Content aggregation
- Competitive intelligence
Automation
- Testing and QA
- Screenshot generation
- Form submission
- PDF generation
Monitoring
- Website change detection
- SEO monitoring
- Content validation
- Availability checking
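The change-detection use case above typically boils down to hashing extracted content and comparing hashes between runs. A self-contained sketch (hypothetical helper names, not a Crawlee API):

```typescript
import { createHash } from 'node:crypto';

// Hash the *extracted* content, not the raw HTML, so markup noise
// (timestamps, ad slots) does not trigger false positives.
function contentHash(content: string): string {
    return createHash('sha256').update(content).digest('hex');
}

// Compare against the hash stored from the previous crawl run.
function hasChanged(previousHash: string | undefined, content: string): boolean {
    // A page with no stored hash counts as changed (first observation).
    return previousHash !== contentHash(content);
}
```

Storing the previous hash per URL (for example in a key-value store) is all the state such a monitor needs.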
Key Advantages
vs Cheerio
- Handles JavaScript-rendered content
- Manages complex workflows
- Built-in queue and storage
vs Puppeteer
- Higher-level abstractions
- Auto-scaling and resource management
- Easier proxy and session handling
- Better error recovery
vs Selenium
- Better performance
- Modern API design
- Built for scraping workflows
- Lower resource usage
Architecture
Request Queue  →  Crawler         →  Request Handler  →  Dataset
      ↓               ↓                    ↓                ↓
   URLs to        Auto-scaling        Your code          Scraped
   process        & retries           extracts data      data
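The pipeline above can be sketched as a loop over a deduplicated queue. This is an illustrative in-memory toy (hypothetical names); Crawlee's real RequestQueue is persistent, deduplicates by unique key, and runs handlers concurrently:

```typescript
// A handler processes one URL and returns any newly discovered URLs.
type Handler = (url: string) => Promise<string[]>;

async function crawl(startUrls: string[], handler: Handler): Promise<string[]> {
    const seen = new Set<string>(startUrls); // dedupe, like the Request Queue
    const queue = [...startUrls];
    const processed: string[] = [];

    while (queue.length > 0) {
        const url = queue.shift()!;
        const found = await handler(url);    // the Request Handler stage
        processed.push(url);                 // results would go to the Dataset
        for (const next of found) {
            if (!seen.has(next)) {
                seen.add(next);
                queue.push(next);
            }
        }
    }
    return processed;
}
```

Each URL is handled exactly once, even when pages link back to each other.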
Storage Options
- Local Storage: Files on disk during development
- Apify Platform: Cloud storage for production
- Custom Storage: Implement your own storage backend
Best Practices
- Start with CheerioCrawler for simple sites
- Use PlaywrightCrawler only when needed (it's slower)
- Implement proper rate limiting
- Use session pools for authenticated scraping
- Save data incrementally, not all at once
- Monitor resource usage in production
- Respect robots.txt and website ToS
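Several of these practices map directly onto crawler options. A hedged sketch of a configuration along these lines (option values are illustrative, and exact option availability can vary between Crawlee versions):

```typescript
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    // Session pool: reuse cookies/identities for authenticated scraping.
    useSessionPool: true,
    persistCookiesPerSession: true,

    // Rate limiting: bound concurrency and retries.
    maxConcurrency: 10,
    maxRequestRetries: 3,

    async requestHandler({ $, request }) {
        // Save incrementally: push each item as it is scraped,
        // rather than accumulating everything in memory.
        await Dataset.pushData({ url: request.url, title: $('title').text() });
    },
});
```

Incremental saving in particular means a crashed or interrupted run keeps everything scraped so far.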
Pricing
- Open Source: Free to use locally
- Apify Platform: Pay-as-you-go cloud execution
- Free tier available
- Scales based on usage
Best For
- Node.js/TypeScript developers
- Large-scale web scraping projects
- Projects needing both HTTP and browser crawling
- Teams wanting reliable, production-ready scrapers
- Developers already using Apify
- Projects requiring sophisticated proxy rotation
Crawlee is the most modern and developer-friendly web scraping library for Node.js, abstracting away the complexity of large-scale crawling while giving you full control when needed.
Ready to get started? Visit the official site to learn more.