With the fundamentals of both static and dynamic content scraping under our belt, it's time to tackle a more comprehensive challenge: multi-page crawling. This section focuses on efficiently navigating and extracting data from websites with multiple interconnected pages.
There are two main approaches to crawling multi-page websites:
- Link-based crawling - Following links between pages
- Sitemap-based crawling - Using the sitemap.xml file
For sitemap crawling, most websites provide a sitemap.xml file that lists all important URLs. This structured XML file includes:
- Page URLs
- Last modified dates
- Change frequency
- Priority values
Using the sitemap can be more efficient than link crawling since it:
- Provides a complete list of pages upfront
- Includes metadata about page importance and freshness
- Avoids crawling unnecessary pages
- Reduces server load
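Although this chapter's crawler won't rely on the sitemap, here is a minimal sketch of how one could be read, assuming Node 18+ for the global `fetch` and the `cheerio` package for XML parsing (the sitemap URL is a placeholder):

```typescript
// Minimal sketch: fetch a sitemap and list its URLs.
// 'https://example.com/sitemap.xml' is a placeholder, not a real target.
import * as cheerio from 'cheerio';

async function listSitemapUrls(sitemapUrl: string): Promise<string[]> {
  const response = await fetch(sitemapUrl);
  const xml = await response.text();

  // Parse as XML and collect every <loc> inside a <url> entry.
  const $ = cheerio.load(xml, { xmlMode: true });
  return $('url > loc')
    .map((_, el) => $(el).text().trim())
    .get();
}

// Usage:
// const urls = await listSitemapUrls('https://example.com/sitemap.xml');
```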
But for this chapter, we'll focus on link-based crawling using Crawlee to build a crawler for a multi-page e-commerce site. Crawlee handles many of the complexities of web crawling for us, including:
- Automatic queue management and URL deduplication
- Built-in rate limiting and retry logic
- Configurable request handling and routing
- Data storage and export
The site structure we'll be crawling looks like this:
```
Homepage
├── Category: Electronics
│   ├── Phones
│   ├── Laptops
│   └── Accessories
├── Category: Clothing
│   ├── Men's
│   └── Women's
└── Featured Products
```
Each product page has a different layout depending on its category, but we need to extract consistent information:
```typescript
// Example data structure we want to build
interface ProductData {
  name: string;
  price: number;
  rating: { score: number; count: number };
  features: string[];
  status: string; // In Stock, Out of Stock, etc.
}

interface ResultData {
  categories: {
    electronics: {
      phones: ProductData[];
      laptops: ProductData[];
      accessories: ProductData[];
    };
    clothing: {
      mens: {
        shirts: ProductData[];
        pants: ProductData[];
      };
      womens: {
        dresses: ProductData[];
        tops: ProductData[];
      };
    };
  };
  featured_products: FeaturedProduct[];
}
```
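Because the crawler sees one product page at a time, it produces flat records that need to be folded into this nested shape afterwards. A minimal sketch of that post-processing step, assuming each record carries a hypothetical `categoryPath` array (e.g. `['electronics', 'phones']`) noted while crawling:

```typescript
// Hypothetical flat record produced per product page;
// the categoryPath field is an assumption, not part of the chapter's spec.
interface ScrapedItem {
  product: ProductData;
  categoryPath: string[]; // e.g. ['electronics', 'phones']
}

function groupByCategory(items: ScrapedItem[]): Record<string, any> {
  const categories: Record<string, any> = {};
  for (const { product, categoryPath } of items) {
    // Walk (and create) the nested objects, then push into the leaf array.
    let node: any = categories;
    for (const key of categoryPath.slice(0, -1)) {
      node[key] ??= {};
      node = node[key];
    }
    const leaf = categoryPath[categoryPath.length - 1];
    (node[leaf] ??= []).push(product);
  }
  return categories;
}
```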
Key Crawling Concepts with Crawlee
- Request Queue Management
Crawlee handles the queue automatically, but here's how we configure it:
```typescript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
  // Handles each request
  async requestHandler({ $, request, enqueueLinks }) {
    // Process the page
    const data = extractPageData($);

    // Automatically queue new URLs found on the page
    await enqueueLinks({
      selector: 'a',
      baseUrl: request.loadedUrl,
    });
  },
  // Limit concurrent requests
  maxConcurrency: 10,
});
```
- URL Handling
Crawlee provides built-in URL handling and normalization:
```typescript
await crawler.run([startUrl]);

// Or with more configuration:
await crawler.addRequests([{
  url: startUrl,
  userData: {
    label: 'start',
  },
}]);
```
- Route Handling
Route different URLs to specific handlers:
```typescript
const crawler = new CheerioCrawler({
  async requestHandler({ $, request }) {
    const { label } = request.userData;

    switch (label) {
      case 'category':
        return handleCategory($);
      case 'product':
        return handleProduct($);
      default:
        return handleHomepage($);
    }
  },
});
```
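For this switch to have anything to route on, the `label` has to be attached when the links are queued. `enqueueLinks()` accepts a `label` option that sets `userData.label` on every request it enqueues; a sketch of the tagging half, with placeholder URL patterns:

```typescript
import { CheerioCrawler } from 'crawlee';

// The glob patterns are placeholders for the demo site's real URL structure.
const crawler = new CheerioCrawler({
  async requestHandler({ request, enqueueLinks }) {
    if (!request.userData.label) {
      // Homepage: tag links as they are queued so the switch above can route them.
      await enqueueLinks({ globs: ['https://example.com/category/**'], label: 'category' });
      await enqueueLinks({ globs: ['https://example.com/product/**'], label: 'product' });
    }
  },
});
```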
- Data Collection
Crawlee provides built-in storage for collected data:
```typescript
const crawler = new CheerioCrawler({
  async requestHandler({ $, pushData }) {
    const productData = extractProduct($);
    await pushData(productData);
  },
});
```
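`pushData()` appends each item to the crawl's default dataset (stored under `storage/datasets/default` when running locally). After the run, the collected items can be read back for post-processing, for example:

```typescript
import { Dataset } from 'crawlee';

// After crawler.run() finishes, read the collected items back out.
const dataset = await Dataset.open();      // opens the default dataset
const { items } = await dataset.getData(); // all pushed records
console.log(`Collected ${items.length} products`);
```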
Web Crawling Best Practices
While Crawlee handles many low-level concerns, you should still consider the following (a combined configuration sketch appears after the list):
- Configuration
  - Set appropriate rate limits
  - Configure retry strategies
  - Set meaningful user-agent strings
- Error Handling
  - Use Crawlee's built-in error handling
  - Implement custom error callbacks
  - Log meaningful diagnostic information
- Data Organization
  - Structure your data consistently
  - Use request labels for routing
  - Leverage Crawlee's dataset features
- Resource Management
  - Configure maxConcurrency appropriately
  - Use maxRequestsPerCrawl when needed
  - Monitor memory usage
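Most of these knobs are plain constructor options on the crawler. A sketch of how they might be combined, with arbitrary example values rather than recommendations:

```typescript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
  async requestHandler({ $, pushData }) {
    // ... extract product data with $ and store it with pushData() ...
  },

  // Resource management and rate limiting (example values).
  maxConcurrency: 5,
  maxRequestsPerMinute: 120,
  maxRequestsPerCrawl: 500,

  // Retry failed requests a few times before giving up.
  maxRequestRetries: 3,

  // Called once retries are exhausted: log something meaningful.
  failedRequestHandler({ request }, error) {
    console.error(`Request ${request.url} failed: ${error.message}`);
  },
});
```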
The Challenge
Your task is to build a Crawlee-based crawler that:
- Starts at the homepage and discovers all product categories
- Visits each category and subcategory page
- Extracts product information from each listing
- Organizes data into a structured format
- Handles products that appear in multiple places (e.g., featured and category)
The site contains approximately 25-30 products across different categories, with varying layouts and information structures. Your crawler should produce a comprehensive dataset that maintains the hierarchical relationship between categories and products.
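For the last point, it helps to decide early how duplicates should be treated. If you choose to collapse repeats rather than list a product in several places, keying records by a stable identifier makes that easy; a sketch assuming the product page URL serves as that key:

```typescript
// Deduplicate products that were reached via more than one path.
// Assumes each scraped record carries the product page URL.
interface ScrapedProduct {
  url: string;
  name: string;
}

function dedupeByUrl<T extends ScrapedProduct>(products: T[]): T[] {
  const seen = new Map<string, T>();
  for (const product of products) {
    // First occurrence wins; later duplicates are dropped.
    if (!seen.has(product.url)) {
      seen.set(product.url, product);
    }
  }
  return [...seen.values()];
}
```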
Testing Your Solution
Test for:
- Completeness: Did you find all products?
- Accuracy: Is the extracted data correct?
- Structure: Is the data organized properly?
- Efficiency: How many requests did you make?
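One way to sanity-check the first three points is to read the dataset back and assert a few properties; a rough sketch (the 25-30 figure comes from the product count mentioned above):

```typescript
import { Dataset } from 'crawlee';

// Rough post-run checks: completeness and field accuracy.
const { items } = await (await Dataset.open()).getData();

console.log(`Total products scraped: ${items.length}`); // expect roughly 25-30

// Every record should at least have a name and a numeric price.
const malformed = items.filter(
  (item) => typeof item.name !== 'string' || typeof item.price !== 'number',
);
console.log(`Records with missing or invalid fields: ${malformed.length}`);
```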
The solved example in _solved/chapter6/ provides a reference implementation using Crawlee. Study it to understand how to leverage the library's features for efficient multi-page crawling and data organization.
Happy crawling!