§spider-lib
spider-lib is the easiest way to use this workspace as an application
framework. It re-exports the crawler runtime, common middleware and
pipelines, shared request and response types, and the #[scraped_item]
macro behind one crate.
If you want the lower-level pieces individually, the workspace also exposes
spider-core, spider-middleware, spider-pipeline, spider-downloader,
spider-macro, and spider-util. Most users should start here.
§What you get from the facade crate
The root crate is optimized for application authors:
- prelude re-exports the common types needed to define and run a spider
- Spider describes crawl behavior
- CrawlerBuilder assembles the runtime
- Request, Response, ParseContext, and ParseOutput are the core runtime types
- Response::css provides Scrapy-like built-in selectors
- middleware and pipelines can be enabled with feature flags and then added through the builder
§Installation
[dependencies]
spider-lib = "4.0.0"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"

serde and serde_json are required when you use scraped_item.
§Quick start
use spider_lib::prelude::*;

#[scraped_item]
struct Quote {
    text: String,
    author: String,
    source_url: String,
}

struct QuotesSpider;

#[async_trait]
impl Spider for QuotesSpider {
    type Item = Quote;
    type State = ();

    fn start_requests(&self) -> Result<StartRequests<'_>, SpiderError> {
        Ok(StartRequests::Urls(vec!["https://quotes.toscrape.com/"]))
    }

    async fn parse(&self, cx: ParseContext<'_, Self>) -> Result<(), SpiderError> {
        for quote in cx.css(".quote")? {
            let text = quote.css(".text::text")?.get().unwrap_or_default();
            let author = quote.css(".author::text")?.get().unwrap_or_default();
            cx.add_item(Quote {
                text,
                author,
                source_url: cx.url.to_string(),
            })
            .await?;
        }
        if let Some(next_href) = cx.css("li.next a::attr(href)")?.get() {
            let next_url = cx.url.join(&next_href)?;
            cx.add_request(Request::new(next_url)).await?;
        }
        Ok(())
    }
}

#[tokio::main]
async fn main() -> Result<(), SpiderError> {
    let crawler = CrawlerBuilder::new(QuotesSpider)
        .log_level(LevelFilter::Info)
        .build()
        .await?;
    crawler.start_crawl().await
}

The built-in selector API is the recommended path for HTML extraction:
cx.css(".card")?, node.css("a::attr(href)")?.get(), and
node.css(".title::text")?.get().
Spider::parse takes &self and a single ParseContext parameter.
That design keeps the spider itself immutable while still giving parse logic
access to the current response, shared state, and async output methods.
§Typical next steps
After the minimal spider works, the next additions are usually:
- add one or more middleware with CrawlerBuilder::add_middleware
- add one or more pipelines with CrawlerBuilder::add_pipeline
- move repeated parse-time state into Spider::State
- enable optional features such as live-stats, pipeline-csv, or middleware-robots
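Optional features are enabled through Cargo feature flags on the dependency. A sketch of a Cargo.toml entry enabling the features named above (feature names taken from this page; the version matches the installation section):

```toml
[dependencies]
spider-lib = { version = "4.0.0", features = ["live-stats", "pipeline-csv", "middleware-robots"] }
```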
If you find yourself needing transport-level customization, custom
middleware contracts, or lower-level runtime control, move down to the
crate-specific APIs in spider-core, spider-downloader,
spider-middleware, or spider-pipeline.
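As a sketch of those next steps, assuming the builder methods described above: only CrawlerBuilder::add_middleware and CrawlerBuilder::add_pipeline come from this page, while the RobotsMiddleware and CsvPipeline names below are hypothetical stand-ins for whatever types the middleware-robots and pipeline-csv features actually expose.

```rust
// Hypothetical sketch: RobotsMiddleware and CsvPipeline are assumed names;
// check the spider-middleware and spider-pipeline docs for the real types.
let crawler = CrawlerBuilder::new(QuotesSpider)
    .add_middleware(RobotsMiddleware::default())
    .add_pipeline(CsvPipeline::new("quotes.csv"))
    .build()
    .await?;
crawler.start_crawl().await?;
```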
§Re-exports
§Modules
- prelude - Convenient re-exports for spider-lib applications.
§Macros
- route_by_rule - Routes parse logic based on the discovery rule name attached to a response.
§Attribute Macros
- scraped_item - Attribute macro for defining a scraped item type.