spider_lib/
lib.rs

//! # spider-lib
//!
//! `spider-lib` is the easiest way to use this workspace as an application
//! framework. It re-exports the crawler runtime, common middleware and
//! pipelines, shared request and response types, and the `#[scraped_item]`
//! macro behind one crate.
//!
//! If you want the lower-level pieces individually, the workspace also exposes
//! `spider-core`, `spider-middleware`, `spider-pipeline`, `spider-downloader`,
//! `spider-macro`, and `spider-util`. Most users should start here.
//!
//! ## What you get from the facade crate
//!
//! The root crate is optimized for application authors:
//!
//! - [`prelude`] re-exports the common types needed to define and run a spider
//! - [`Spider`] describes crawl behavior
//! - [`CrawlerBuilder`] assembles the runtime
//! - [`Request`], [`Response`], [`ParseContext`], and [`ParseOutput`] are the core runtime types
//! - [`Response::css`](spider_util::response::Response::css) provides Scrapy-like built-in selectors
//! - middleware and pipelines can be enabled with feature flags and then added
//!   through the builder
//!
//! ## Installation
//!
//! ```toml
//! [dependencies]
//! spider-lib = "4.0.0"
//! serde = { version = "1.0", features = ["derive"] }
//! serde_json = "1.0"
//! ```
//!
//! `serde` and `serde_json` are required when you use [`scraped_item`].
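//!
//! The optional middleware and pipelines mentioned below are gated behind Cargo
//! features. As a sketch, assuming the feature names listed under "Typical next
//! steps" (`live-stats`, `pipeline-csv`, `middleware-robots`), enabling some of
//! them looks like:
//!
//! ```toml
//! [dependencies]
//! spider-lib = { version = "4.0.0", features = ["pipeline-csv", "middleware-robots"] }
//! ```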
//!
//! ## Quick start
//!
//! ```rust,ignore
//! use spider_lib::prelude::*;
//!
//! #[scraped_item]
//! struct Quote {
//!     text: String,
//!     author: String,
//!     source_url: String,
//! }
//!
//! struct QuotesSpider;
//!
//! #[async_trait]
//! impl Spider for QuotesSpider {
//!     type Item = Quote;
//!     type State = ();
//!
//!     fn start_requests(&self) -> Result<StartRequests<'_>, SpiderError> {
//!         Ok(StartRequests::Urls(vec!["https://quotes.toscrape.com/"]))
//!     }
//!
//!     async fn parse(&self, cx: ParseContext<'_, Self>) -> Result<(), SpiderError> {
//!         for quote in cx.css(".quote")? {
//!             let text = quote
//!                 .css(".text::text")?
//!                 .get()
//!                 .unwrap_or_default();
//!
//!             let author = quote
//!                 .css(".author::text")?
//!                 .get()
//!                 .unwrap_or_default();
//!
//!             cx.add_item(Quote {
//!                 text,
//!                 author,
//!                 source_url: cx.url.to_string(),
//!             })
//!             .await?;
//!         }
//!
//!         if let Some(next_href) = cx.css("li.next a::attr(href)")?.get() {
//!             let next_url = cx.url.join(&next_href)?;
//!             cx.add_request(Request::new(next_url)).await?;
//!         }
//!
//!         Ok(())
//!     }
//! }
//!
//! #[tokio::main]
//! async fn main() -> Result<(), SpiderError> {
//!     let crawler = CrawlerBuilder::new(QuotesSpider)
//!         .log_level(LevelFilter::Info)
//!         .build()
//!         .await?;
//!     crawler.start_crawl().await
//! }
//! ```
//!
//! The built-in selector API is the recommended path for HTML extraction:
//! `cx.css(".card")?`, `node.css("a::attr(href)")?.get()`, and
//! `node.css(".title::text")?.get()`.
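//!
//! The same calls compose for link extraction on their own. This is a minimal
//! sketch using only the methods shown in the quick start (`cx.css`,
//! `node.css`, `.get()`, `cx.url.join`, `cx.add_request`); the `.card`
//! selector is an illustrative placeholder:
//!
//! ```rust,ignore
//! // Follow every link found inside the page's ".card" elements, resolving
//! // relative hrefs against the current response URL first.
//! for card in cx.css(".card")? {
//!     if let Some(href) = card.css("a::attr(href)")?.get() {
//!         let absolute = cx.url.join(&href)?;
//!         cx.add_request(Request::new(absolute)).await?;
//!     }
//! }
//! ```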
//!
//! [`Spider::parse`] takes `&self` and a single [`ParseContext`] parameter.
//! That design keeps the spider itself immutable while still giving parse logic
//! access to the current response, shared state, and async output methods.
//!
//! ## Typical next steps
//!
//! After the minimal spider works, the next additions are usually:
//!
//! 1. add one or more middleware with [`CrawlerBuilder::add_middleware`]
//! 2. add one or more pipelines with [`CrawlerBuilder::add_pipeline`]
//! 3. move repeated parse-time state into [`Spider::State`]
//! 4. enable optional features such as `live-stats`, `pipeline-csv`, or
//!    `middleware-robots`
//!
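//! Steps 1 and 2 can be sketched against the builder API named above. Only
//! [`CrawlerBuilder::add_middleware`] and [`CrawlerBuilder::add_pipeline`] come
//! from this crate's documented surface; `RobotsMiddleware` and `CsvPipeline`
//! are hypothetical placeholder names for feature-gated components:
//!
//! ```rust,ignore
//! let crawler = CrawlerBuilder::new(QuotesSpider)
//!     .add_middleware(RobotsMiddleware::default()) // hypothetical type
//!     .add_pipeline(CsvPipeline::new("quotes.csv")) // hypothetical type
//!     .build()
//!     .await?;
//! crawler.start_crawl().await?;
//! ```
//!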
//! If you find yourself needing transport-level customization, custom
//! middleware contracts, or lower-level runtime control, move down to the
//! crate-specific APIs in `spider-core`, `spider-downloader`,
//! `spider-middleware`, or `spider-pipeline`.

extern crate self as spider_lib;

pub mod prelude;
/// Re-export the application-facing prelude.
///
/// Most examples and first integrations start with:
///
/// ```rust
/// use spider_lib::prelude::*;
/// ```
pub use prelude::*;
pub use log;
pub use spider_core::route_by_rule;

// Re-export procedural macros.
pub use spider_macro::scraped_item;