// spider_lib/lib.rs
//! # spider-lib
//!
//! `spider-lib` is the easiest way to use this workspace as an application
//! framework. It re-exports the crawler runtime, common middleware and
//! pipelines, shared request and response types, and the `#[scraped_item]`
//! macro behind one crate.
//!
//! If you want the lower-level pieces individually, the workspace also exposes
//! `spider-core`, `spider-middleware`, `spider-pipeline`, `spider-downloader`,
//! `spider-macro`, and `spider-util`. Most users should start here.
//!
//! ## What you get from the facade crate
//!
//! The root crate is optimized for application authors:
//!
//! - [`prelude`] re-exports the common types needed to define and run a spider
//! - [`Spider`] describes crawl behavior
//! - [`CrawlerBuilder`] assembles the runtime
//! - [`Request`], [`Response`], [`ParseContext`], and [`ParseOutput`] are the core runtime types
//! - [`Response::css`](spider_util::response::Response::css) provides Scrapy-like built-in selectors
//! - middleware and pipelines can be enabled with feature flags and then added
//!   through the builder
//!
//! ## Installation
//!
//! ```toml
//! [dependencies]
//! spider-lib = "4.0.0"
//! serde = { version = "1.0", features = ["derive"] }
//! serde_json = "1.0"
//! ```
//!
//! `serde` and `serde_json` are required when you use [`scraped_item`].
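//!
//! Optional middleware and pipelines ship behind Cargo features. A sketch of a
//! dependency line that enables two of the features named further below
//! (standard Cargo `features` syntax; pick whichever features you need):
//!
//! ```toml
//! spider-lib = { version = "4.0.0", features = ["pipeline-csv", "middleware-robots"] }
//! ```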
//! ## Quick start
//!
//! ```rust,ignore
//! use spider_lib::prelude::*;
//!
//! #[scraped_item]
//! struct Quote {
//!     text: String,
//!     author: String,
//!     source_url: String,
//! }
//!
//! struct QuotesSpider;
//!
//! #[async_trait]
//! impl Spider for QuotesSpider {
//!     type Item = Quote;
//!     type State = ();
//!
//!     fn start_requests(&self) -> Result<StartRequests<'_>, SpiderError> {
//!         Ok(StartRequests::Urls(vec!["https://quotes.toscrape.com/"]))
//!     }
//!
//!     async fn parse(&self, cx: ParseContext<'_, Self>) -> Result<(), SpiderError> {
//!         for quote in cx.css(".quote")? {
//!             let text = quote
//!                 .css(".text::text")?
//!                 .get()
//!                 .unwrap_or_default();
//!
//!             let author = quote
//!                 .css(".author::text")?
//!                 .get()
//!                 .unwrap_or_default();
//!
//!             cx.add_item(Quote {
//!                 text,
//!                 author,
//!                 source_url: cx.url.to_string(),
//!             })
//!             .await?;
//!         }
//!
//!         if let Some(next_href) = cx.css("li.next a::attr(href)")?.get() {
//!             let next_url = cx.url.join(&next_href)?;
//!             cx.add_request(Request::new(next_url)).await?;
//!         }
//!
//!         Ok(())
//!     }
//! }
//!
//! #[tokio::main]
//! async fn main() -> Result<(), SpiderError> {
//!     let crawler = CrawlerBuilder::new(QuotesSpider)
//!         .log_level(LevelFilter::Info)
//!         .build()
//!         .await?;
//!     crawler.start_crawl().await
//! }
//! ```
//!
//! The built-in selector API is the recommended path for HTML extraction:
//! `cx.css(".card")?`, `node.css("a::attr(href)")?.get()`, and
//! `node.css(".title::text")?.get()`.
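//!
//! Put together, a parse body that walks a listing page might combine those
//! three call forms like this (a sketch; `.card`, `.title`, and the anchor
//! markup are placeholder selectors for your target site, not part of the API):
//!
//! ```rust,ignore
//! for card in cx.css(".card")? {
//!     // Inner text of the first `.title` match, or "" when absent.
//!     let title = card.css(".title::text")?.get().unwrap_or_default();
//!
//!     // Follow the card's link, resolved against the current page URL.
//!     if let Some(href) = card.css("a::attr(href)")?.get() {
//!         let url = cx.url.join(&href)?;
//!         cx.add_request(Request::new(url)).await?;
//!     }
//! }
//! ```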
//!
//! [`Spider::parse`] takes `&self` and a single [`ParseContext`] parameter.
//! That design keeps the spider itself immutable while still giving parse logic
//! access to the current response, shared state, and async output methods.
//!
//! ## Typical next steps
//!
//! After the minimal spider works, the next additions are usually:
//!
//! 1. add one or more middleware with [`CrawlerBuilder::add_middleware`]
//! 2. add one or more pipelines with [`CrawlerBuilder::add_pipeline`]
//! 3. move repeated parse-time state into [`Spider::State`]
//! 4. enable optional features such as `live-stats`, `pipeline-csv`, or
//!    `middleware-robots`
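//!
//! Steps 1 and 2 slot into the same builder chain as the quick start. A
//! sketch, where `RobotsMiddleware` and `CsvPipeline` stand in for whatever
//! the enabled features actually export:
//!
//! ```rust,ignore
//! let crawler = CrawlerBuilder::new(QuotesSpider)
//!     .add_middleware(RobotsMiddleware::default())
//!     .add_pipeline(CsvPipeline::new("quotes.csv"))
//!     .build()
//!     .await?;
//! ```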
114//!
115//! If you find yourself needing transport-level customization, custom
116//! middleware contracts, or lower-level runtime control, move down to the
117//! crate-specific APIs in `spider-core`, `spider-downloader`,
118//! `spider-middleware`, or `spider-pipeline`.

extern crate self as spider_lib;

pub mod prelude;

/// Re-export the application-facing prelude.
///
/// Most examples and first integrations start with:
///
/// ```rust
/// use spider_lib::prelude::*;
/// ```
pub use prelude::*;
pub use log;
pub use spider_core::route_by_rule;

// Re-export procedural macros
pub use spider_macro::scraped_item;