
Trait Spider 

pub trait Spider:
    Send
    + Sync
    + 'static {
    type Item: ScrapedItem;
    type State: Default + Send + Sync;

    // Required method
    fn parse<'life0, 'life1, 'async_trait>(
        &'life0 self,
        cx: ParseContext<'life1, Self>,
    ) -> Pin<Box<dyn Future<Output = Result<(), SpiderError>> + Send + 'async_trait>>
       where 'life0: 'async_trait,
             'life1: 'async_trait,
             Self: 'async_trait;

    // Provided methods
    fn start_urls(&self) -> Vec<&'static str> { ... }
    fn start_requests(&self) -> Result<StartRequests<'_>, SpiderError> { ... }
}

Defines the contract for a spider: the core runtime trait implemented to define and run a crawl.

§Type Parameters

  • Item: The type of scraped data structure (must implement ScrapedItem)
  • State: The type of shared state (must implement Default + Send + Sync)

§Design Notes

The trait uses &self (immutable reference) instead of &mut self for the parse method. This design enables efficient concurrent crawling by eliminating the need for mutex locks when accessing the spider from multiple async tasks. State that needs mutation should be stored in the associated State type using thread-safe primitives like Arc<AtomicUsize> or DashMap.

A typical crawl lifecycle looks like this:

  1. start_requests produces the initial requests
  2. the runtime schedules and downloads them
  3. parse receives a ParseContext for each response
  4. emitted items go to pipelines and emitted requests go back to the scheduler

Required Associated Types§

type Item: ScrapedItem

The type of item that the spider scrapes.

This associated type must implement the ScrapedItem trait, which provides methods for type erasure, cloning, and JSON serialization. Use the #[scraped_item] procedural macro to automatically implement all required traits for your data structures.

type State: Default + Send + Sync

The type of state that the spider uses.

The state type must implement Default so it can be instantiated automatically by the crawler. It should also be Send + Sync to enable safe concurrent access from multiple async tasks.

§Example
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};
use dashmap::DashMap;

#[derive(Clone, Default)]
struct MySpiderState {
    page_count: Arc<AtomicUsize>,
    visited_urls: Arc<DashMap<String, bool>>,
}

Required Methods§

fn parse<'life0, 'life1, 'async_trait>( &'life0 self, cx: ParseContext<'life1, Self>, ) -> Pin<Box<dyn Future<Output = Result<(), SpiderError>> + Send + 'async_trait>>
where 'life0: 'async_trait, 'life1: 'async_trait, Self: 'async_trait,

Parses a response and extracts scraped items and new requests.

This is the primary method where scraping logic is implemented. It receives a ParseContext wrapping the current Response and should extract structured data (items) and/or discover new URLs to crawl (requests).

§Parameters
  • cx: A parse context containing the current response, shared spider state, and async output sink
§Returns

Ok(()) on success. Scraped items and new requests are not returned directly; the ParseContext lets the spider stream them as it goes:

  • Scraped items of type Self::Item
  • New Request objects to be enqueued

The usual pattern is to select data from the response, emit zero or more items, and enqueue any follow-up requests; see the example below.

§Design Notes

This method takes an immutable reference to self (&self) instead of mutable (&mut self), eliminating the need for mutex locks when accessing the spider in concurrent environments. State that needs to be modified should be stored in the State type using thread-safe primitives.

§Errors

Returns a SpiderError if parsing fails or if an unrecoverable error occurs during processing.

§Example
async fn parse(&self, cx: ParseContext<'_, Self>) -> Result<(), SpiderError> {
    // Parse HTML and extract data.
    let _heading = cx.css("h1::text")?.get().unwrap_or_default();

    // Emit scraped items and enqueue follow-up requests through `cx` here.
    Ok(())
}

Provided Methods§

fn start_urls(&self) -> Vec<&'static str>

Returns static seed URLs.

This method is optional and useful for simple spiders. The default start_requests implementation converts these URLs into a request iterator.

Prefer this method when plain URL strings are enough. Override start_requests instead when you need custom headers, methods, request metadata, seed-file loading, or dynamic seed generation.

fn start_requests(&self) -> Result<StartRequests<'_>, SpiderError>

Returns the initial request source used to start crawling.

The default implementation converts start_urls into an iterator.

To load from a seed file, return StartRequests::file(path). To use a fixed list of URL strings, return StartRequests::Urls(...). To use custom generation logic, return StartRequests::iter(...).

This method is the better override point whenever initial requests need more than a URL string, such as per-request metadata, POST bodies, or custom headers.

§Example
fn start_requests(&self) -> Result<StartRequests<'_>, SpiderError> {
    Ok(StartRequests::file("seeds/start_urls.txt"))
}

Dyn Compatibility§

This trait is not dyn compatible.

In older versions of Rust, dyn compatibility was called "object safety", so this trait is not object safe.

Implementors§