Trait Spider
pub trait Spider: Send + Sync + 'static {
type Item: ScrapedItem;
type State: Default + Send + Sync;
// Required method
fn parse<'life0, 'life1, 'async_trait>(
    &'life0 self,
    cx: ParseContext<'life1, Self>,
) -> Pin<Box<dyn Future<Output = Result<(), SpiderError>> + Send + 'async_trait>>
where
    'life0: 'async_trait,
    'life1: 'async_trait,
    Self: 'async_trait;
// Provided methods
fn start_urls(&self) -> Vec<&'static str> { ... }
fn start_requests(&self) -> Result<StartRequests<'_>, SpiderError> { ... }
}
Defines the contract for a spider: the core runtime trait used to define and run a crawl.
§Type Parameters
- Item: the type of scraped data structure (must implement ScrapedItem)
- State: the type of shared state (must implement Default)
§Design Notes
The trait uses &self (immutable reference) instead of &mut self for the
parse method. This design enables efficient concurrent crawling
by eliminating the need for mutex locks when accessing the spider from multiple
async tasks. State that needs mutation should be stored in the associated
State type using thread-safe primitives like Arc<AtomicUsize> or DashMap.
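The effect of this design can be shown with plain std types, independent of any crawler API: a field wrapped in Arc<AtomicUsize> can be updated through a shared reference, which is exactly what the State type relies on. CrawlState and record_page below are illustrative names, not part of this crate.

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};

// Stand-in for a spider's State type: Default + Send + Sync.
#[derive(Clone, Default)]
struct CrawlState {
    page_count: Arc<AtomicUsize>,
}

// Takes &CrawlState, not &mut CrawlState: the mutation goes through the
// atomic, so concurrent parse tasks need no mutex around the spider.
fn record_page(state: &CrawlState) {
    state.page_count.fetch_add(1, Ordering::Relaxed);
}

fn main() {
    let state = CrawlState::default();
    let shared = state.clone(); // clones the Arc handle, not the counter
    record_page(&state);
    record_page(&shared);
    println!("{}", state.page_count.load(Ordering::Relaxed)); // prints 2
}
```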
A typical crawl lifecycle looks like this:
- start_requests produces the initial requests
- the runtime schedules and downloads them
- parse receives a ParseContext for each response
- emitted items go to pipelines and emitted requests go back to the scheduler
Required Associated Types§
type Item: ScrapedItem
The type of item that the spider scrapes.
This associated type must implement the ScrapedItem trait, which
provides methods for type erasure, cloning, and JSON serialization.
Use the #[scraped_item] procedural macro to automatically implement
all required traits for your data structures.
type State: Default + Send + Sync
The type of state that the spider uses.
The state type must implement Default so it can be instantiated
automatically by the crawler. It should also be Send + Sync to
enable safe concurrent access from multiple async tasks.
§Example
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};
use dashmap::DashMap;
#[derive(Clone, Default)]
struct MySpiderState {
    page_count: Arc<AtomicUsize>,
    visited_urls: Arc<DashMap<String, bool>>,
}
Required Methods§
fn parse<'life0, 'life1, 'async_trait>(
    &'life0 self,
    cx: ParseContext<'life1, Self>,
) -> Pin<Box<dyn Future<Output = Result<(), SpiderError>> + Send + 'async_trait>>
where
    'life0: 'async_trait,
    'life1: 'async_trait,
    Self: 'async_trait,
Parses a response and extracts scraped items and new requests.
This is the primary method where scraping logic is implemented. It receives
a Response object and should extract structured data (items) and/or
discover new URLs to crawl (requests).
§Parameters
cx: A parse context containing the current response, shared spider state, and async output sink
§Returns
The provided ParseContext lets the spider stream:
- Scraped items of type Self::Item
- New Request objects to be enqueued
The usual pattern is:
- read the response through the context directly, for example cx.css(...) via Deref
- read shared state with ParseContext::state
- call ParseContext::add_item or add_items for scraped items
- call ParseContext::add_request or add_requests for follow-up requests
- return Ok(()) when parsing is done
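The steps above can be sketched as a single parse body. This is illustrative only: the Page item type, Request::get, the .all() selector helper, and the .await points are assumptions about the surrounding API, not signatures documented on this page.

```rust
async fn parse(&self, cx: ParseContext<'_, Self>) -> Result<(), SpiderError> {
    // 1. Read the response through the context directly (via Deref).
    let title = cx.css("h1::text")?.get().unwrap_or_default();

    // 2. Read or update shared state (thread-safe primitives only).
    cx.state().page_count.fetch_add(1, Ordering::Relaxed);

    // 3. Emit a scraped item to the pipelines.
    //    `Page` is a hypothetical Self::Item type.
    cx.add_item(Page { title }).await?;

    // 4. Enqueue follow-up requests back to the scheduler.
    //    `.all()` and `Request::get` are hypothetical helpers.
    for href in cx.css("a::attr(href)")?.all() {
        cx.add_request(Request::get(&href)?).await?;
    }

    // 5. Signal that parsing finished successfully.
    Ok(())
}
```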
§Design Notes
This method takes an immutable reference to self (&self) instead of
mutable (&mut self), eliminating the need for mutex locks when accessing
the spider in concurrent environments. State that needs to be modified
should be stored in the State type using thread-safe primitives.
§Errors
Returns a SpiderError if parsing fails or if an unrecoverable error
occurs during processing.
§Example
async fn parse(&self, cx: ParseContext<'_, Self>) -> Result<(), SpiderError> {
    // Parse HTML and extract data
    let heading = cx.css("h1::text")?.get().unwrap_or_default();
    Ok(())
}
Provided Methods§
fn start_urls(&self) -> Vec<&'static str>
Returns static seed URLs.
This method is optional and useful for simple spiders. The default
start_requests implementation converts these
URLs into a request iterator.
Prefer this method when plain URL strings are enough. Override
start_requests instead when you need custom
headers, methods, request metadata, seed-file loading, or dynamic seed
generation.
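For the simple case, overriding start_urls alone is enough. A sketch, with ExampleSpider and the URLs as placeholders and the other trait members elided:

```rust
impl Spider for ExampleSpider {
    // type Item, type State, and fn parse omitted for brevity.

    fn start_urls(&self) -> Vec<&'static str> {
        vec![
            "https://example.com/products?page=1",
            "https://example.com/products?page=2",
        ]
    }
}
```

The default start_requests implementation then converts these strings into the initial request iterator.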
fn start_requests(&self) -> Result<StartRequests<'_>, SpiderError>
Returns the initial request source used to start crawling.
The default implementation converts start_urls
into an iterator.
To load from a seed file, return StartRequests::file(path).
To use a fixed list of URL strings, return StartRequests::Urls(...).
To use custom generation logic, return StartRequests::iter(...).
This method is the better override point whenever initial requests need more than a URL string, such as per-request metadata, POST bodies, or custom headers.
§Example
fn start_requests(&self) -> Result<StartRequests<'_>, SpiderError> {
    Ok(StartRequests::file("seeds/start_urls.txt"))
}
Dyn Compatibility§
This trait is not dyn compatible.
In older versions of Rust, dyn compatibility was called "object safety", so this trait is not object safe.