Spiders

A spider is a struct that implements the Spider trait. It tells kumo where to start, how to parse each page, and what items to emit.

The Spider Trait

#[async_trait::async_trait]
pub trait Spider: Send + Sync {
    type Item: serde::Serialize + Send;

    fn name(&self) -> &str;
    fn start_urls(&self) -> Vec<String>;

    async fn parse(
        &self,
        response: &Response,
    ) -> Result<Output<Self::Item>, KumoError>;

    // --- Optional hooks ---

    /// Called once before the crawl starts.
    async fn open(&self) -> Result<(), KumoError> { Ok(()) }

    /// Called once after the crawl finishes.
    async fn close(&self, _stats: &CrawlStats) -> Result<(), KumoError> { Ok(()) }

    /// Only crawl these domains (empty = no restriction).
    fn allowed_domains(&self) -> Vec<&str> { vec![] }

    /// Stop following links deeper than this.
    fn max_depth(&self) -> Option<usize> { None }

    /// How to handle a fetch/parse error for a URL.
    fn on_error(&self, _url: &str, _err: &KumoError) -> ErrorPolicy {
        ErrorPolicy::Skip
    }
}

Output

parse() returns Output<T>, a builder that collects the items to emit and the requests to follow:

Output::new()
    .item(my_item)                    // add one item
    .items(vec![a, b, c])             // add many items
    .follow("https://next-page")      // enqueue a GET request
    .follow_many(links)               // enqueue many GET requests
    .request(
        CrawlRequest::get("https://example.com/high-priority")
            .priority(10)
            .dont_filter(true),
    )

Items are serialized to JSON exactly once and passed to pipelines and the store. Use CrawlRequest when a follow-up request needs custom priority, headers, method/body, metadata, or duplicate filtering behavior.
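
To make the builder shape concrete, here is a std-only sketch of a consuming builder in the same spirit. MiniOutput and its fields (collected, to_follow) are hypothetical names made up for illustration; this is not kumo's implementation, and it omits CrawlRequest entirely.

```rust
/// Hypothetical stand-in for a parse result: items to emit plus URLs to follow.
#[derive(Debug)]
pub struct MiniOutput<T> {
    pub collected: Vec<T>,
    pub to_follow: Vec<String>,
}

impl<T> MiniOutput<T> {
    /// Start with nothing collected and nothing to follow.
    pub fn new() -> Self {
        MiniOutput { collected: Vec::new(), to_follow: Vec::new() }
    }

    /// Add one item. Takes `self` by value so calls can be chained.
    pub fn item(mut self, item: T) -> Self {
        self.collected.push(item);
        self
    }

    /// Add many items at once.
    pub fn items(mut self, items: impl IntoIterator<Item = T>) -> Self {
        self.collected.extend(items);
        self
    }

    /// Queue one URL to visit next.
    pub fn follow(mut self, url: impl Into<String>) -> Self {
        self.to_follow.push(url.into());
        self
    }

    /// Queue many URLs to visit next.
    pub fn follow_many<I, S>(mut self, urls: I) -> Self
    where
        I: IntoIterator<Item = S>,
        S: Into<String>,
    {
        self.to_follow.extend(urls.into_iter().map(Into::into));
        self
    }
}
```

Because every method consumes and returns the builder, a parse function can build its whole result in one expression and return it without intermediate mutable state.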

Lifecycle Hooks

#[async_trait::async_trait]
impl Spider for MySpider {
    // ...

    async fn open(&self) -> Result<(), KumoError> {
        // e.g. open a database connection, create a temp file
        println!("crawl starting");
        Ok(())
    }

    async fn close(&self, stats: &CrawlStats) -> Result<(), KumoError> {
        println!(
            "done: {} pages, {} items, {} errors",
            stats.pages_crawled, stats.items_scraped, stats.errors
        );
        Ok(())
    }
}

CrawlStats fields:

| Field | Type | Description |
| --- | --- | --- |
| pages_crawled | u64 | Responses processed |
| items_scraped | u64 | Items passed to the store |
| errors | u64 | Failed requests |
| duration | Duration | Wall-clock crawl time |
| bytes_downloaded | u64 | Total response body bytes |
| timings | CrawlTimingStats | Cumulative successful-request phase timings for middleware, fetch, parse, pipeline, and store work |
| interrupted | bool | true if stopped by Ctrl+C |
| error_kinds | BTreeMap<String, u64> | Permanent failures grouped by stable KumoErrorKind label |
| stop_reason | Option<StopReason> | Why the crawl stopped |
| scheduled | u64 | Requests accepted by the scheduler |
| deduped | u64 | Requests skipped because their fingerprint was already seen |
| retries | u64 | Retry attempts requeued by retry policy or ErrorPolicy::Retry |
| retry_exhausted | u64 | URLs that permanently failed after retry capacity was exhausted |
| robots_blocked | u64 | Requests skipped because robots.txt disallowed them |
| domains | BTreeMap<String, DomainStats> | Per-domain counters for scheduled, deduped, completed, failed, error kinds, retries, retry exhaustion, and robots-blocked requests |

errors counts permanent request failures, including exhausted retries, unhandled fetch/parse errors, and crawl task panics. Panics are attributed to the request's domain in domains[domain].failed so production reports do not silently lose failed work.

Use retry_exhausted when alerts need to distinguish "we retried and still failed" from one-off permanent failures. Use CrawlReport::retry_summary() when alerts need a compact production signal that separates retry pressure from retry exhaustion. Use error_kinds when alerts or dashboards need to separate parse failures, HTTP status failures, fetch failures, and other KumoErrorKind categories.

Use timings to identify the largest successful-request phase. Timing totals are cumulative across concurrent tasks, so they can be larger than duration.
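
As a rough illustration of reading cumulative timings, the sketch below picks the largest phase from a set of per-phase totals. PhaseTotals and largest_phase are hypothetical; the real field names of CrawlTimingStats are not shown in this page, so the fields here simply mirror the five phases listed in the table.

```rust
use std::time::Duration;

/// Hypothetical per-phase totals, summed over every successful request.
pub struct PhaseTotals {
    pub middleware: Duration,
    pub fetch: Duration,
    pub parse: Duration,
    pub pipeline: Duration,
    pub store: Duration,
}

/// Return the name and total of the phase with the largest cumulative time.
pub fn largest_phase(t: &PhaseTotals) -> (&'static str, Duration) {
    let phases = [
        ("middleware", t.middleware),
        ("fetch", t.fetch),
        ("parse", t.parse),
        ("pipeline", t.pipeline),
        ("store", t.store),
    ];
    phases
        .iter()
        .copied()
        .max_by_key(|&(_, total)| total)
        .expect("the phase list is never empty")
}
```

For example, eight concurrent tasks that each spend 10 seconds fetching add up to 80 seconds of fetch time, even if the whole crawl finished in 12 seconds of wall-clock time.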

When updating stats manually, use record_error(domain) to increment both the global error count and the matching per-domain failure count together. Use record_error_kind(domain, kind) when you also know the KumoErrorKind.
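
The idea of keeping the global and per-domain counters in step can be sketched with plain maps. MiniStats and MiniDomainStats are hypothetical types, kinds are plain strings here rather than KumoErrorKind, and the assumption that record_error_kind also bumps the error counts is inferred from the wording above rather than confirmed.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-domain counters.
#[derive(Debug, Default)]
pub struct MiniDomainStats {
    pub failed: u64,
    pub error_kinds: BTreeMap<String, u64>,
}

/// Hypothetical global counters with a per-domain breakdown.
#[derive(Debug, Default)]
pub struct MiniStats {
    pub errors: u64,
    pub error_kinds: BTreeMap<String, u64>,
    pub domains: BTreeMap<String, MiniDomainStats>,
}

impl MiniStats {
    /// Increment the global error count and the domain's failure count together.
    pub fn record_error(&mut self, domain: &str) {
        self.errors += 1;
        self.domains.entry(domain.to_string()).or_default().failed += 1;
    }

    /// Same as `record_error`, plus the per-kind counters (assumed behavior).
    pub fn record_error_kind(&mut self, domain: &str, kind: &str) {
        self.record_error(domain);
        *self.error_kinds.entry(kind.to_string()).or_default() += 1;
        *self
            .domains
            .entry(domain.to_string())
            .or_default()
            .error_kinds
            .entry(kind.to_string())
            .or_default() += 1;
    }
}
```

Routing every update through one method is what keeps the two views from drifting apart; incrementing errors directly would leave domains[domain].failed behind.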

stop_reason is set when the crawl ends:

| Reason | Meaning |
| --- | --- |
| FrontierExhausted | No scheduled or in-flight requests remain |
| Interrupted | The crawl received Ctrl+C or stream cancellation |
| MaxPages | max_pages() was reached |
| MaxItems | max_items() was reached after a response finished |
| MaxDuration | max_duration() was reached |
| MaxErrors | max_errors() was reached |

Use CrawlReport::from(stats) when you need a stable snapshot for logging or export. Reports can be exported directly with to_json_value(), to_json_string(), or to_json_string_pretty():

let stats = CrawlEngine::builder()
    .run(MySpider)
    .await?;

let report = CrawlReport::from(stats);
std::fs::write("crawl-report.json", report.to_json_string_pretty())?;

CrawlReport also exposes derived helpers for production dashboards and alerts:

| Helper | Meaning |
| --- | --- |
| pages_per_second() | Successful pages divided by crawl duration |
| items_per_second() | Scraped items divided by crawl duration |
| bytes_per_second() | Downloaded response bytes divided by crawl duration |
| error_rate() | Failed requests divided by completed and failed requests |
| success_rate() | Completed requests divided by completed and failed requests |
| retry_exhaustion_rate() | Retry-exhausted requests divided by retry attempts |
| retry_summary() | Retry attempts, exhausted retries, retry pressure, exhaustion, and exhausted-failure rates |
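
The ratios in the table can be worked through with a small std-only sketch. Counts, ratio, and per_second are hypothetical names, and returning 0.0 when the denominator is zero is an assumption made here to avoid NaN; the page does not say how CrawlReport handles an empty crawl.

```rust
use std::time::Duration;

/// Hypothetical counters needed for the rate helpers.
pub struct Counts {
    pub completed: u64,
    pub failed: u64,
    pub retries: u64,
    pub retry_exhausted: u64,
}

/// Divide two counters, treating a zero denominator as a rate of 0.0 (assumption).
fn ratio(numerator: u64, denominator: u64) -> f64 {
    if denominator == 0 {
        0.0
    } else {
        numerator as f64 / denominator as f64
    }
}

impl Counts {
    /// Failed requests divided by completed and failed requests.
    pub fn error_rate(&self) -> f64 {
        ratio(self.failed, self.completed + self.failed)
    }

    /// Completed requests divided by completed and failed requests.
    pub fn success_rate(&self) -> f64 {
        ratio(self.completed, self.completed + self.failed)
    }

    /// Retry-exhausted requests divided by retry attempts.
    pub fn retry_exhaustion_rate(&self) -> f64 {
        ratio(self.retry_exhausted, self.retries)
    }
}

/// A counter divided by the crawl duration in seconds.
pub fn per_second(count: u64, duration: Duration) -> f64 {
    let secs = duration.as_secs_f64();
    if secs == 0.0 {
        0.0
    } else {
        count as f64 / secs
    }
}
```

With 90 completed and 10 failed requests, the error rate is 10 / 100 and the success rate is 90 / 100, so the two always sum to 1 when at least one request finished.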

Report JSON uses stable snake_case field names. duration is exported as duration_ms and duration_secs. Derived helper values are exported as fields such as pages_per_second and error_rate. Timing breakdowns are exported under timings, and retry health under retry_summary. stop_reason is exported as a snake_case string such as "frontier_exhausted" or "max_pages".
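
The naming rule can be illustrated with a local mirror of the stop reasons. MiniStopReason, as_snake_case, and duration_fields are hypothetical; only "frontier_exhausted" and "max_pages" are confirmed above, and the remaining labels are assumed to follow the same snake_case rule.

```rust
use std::time::Duration;

/// Hypothetical mirror of the stop reasons listed earlier.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MiniStopReason {
    FrontierExhausted,
    Interrupted,
    MaxPages,
    MaxItems,
    MaxDuration,
    MaxErrors,
}

impl MiniStopReason {
    /// The snake_case label a JSON export would carry (partly assumed).
    pub fn as_snake_case(self) -> &'static str {
        match self {
            MiniStopReason::FrontierExhausted => "frontier_exhausted",
            MiniStopReason::Interrupted => "interrupted",
            MiniStopReason::MaxPages => "max_pages",
            MiniStopReason::MaxItems => "max_items",
            MiniStopReason::MaxDuration => "max_duration",
            MiniStopReason::MaxErrors => "max_errors",
        }
    }
}

/// Split one duration into the two exported forms: whole milliseconds and fractional seconds.
pub fn duration_fields(duration: Duration) -> (u128, f64) {
    (duration.as_millis(), duration.as_secs_f64())
}
```

Matching on the string label rather than on Debug output keeps dashboards stable if variant names are ever renamed in code.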

Error Handling

on_error lets each spider decide what to do with a failed URL:

fn on_error(&self, url: &str, err: &KumoError) -> ErrorPolicy {
    if matches!(err.kind(), kumo::error::KumoErrorKind::DomainNotAllowed)
        || url.contains("/optional/")
    {
        ErrorPolicy::Skip    // log and continue
    } else {
        ErrorPolicy::Abort   // stop the entire crawl
    }
}

Use err.kind() when you need stable error classification for metrics, logging, or custom retry decisions. This avoids matching on display text.
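
A kind-based policy can be prototyped and unit-tested without the crawler. In the sketch below, Kind and Policy are local, hypothetical mirrors of KumoErrorKind and ErrorPolicy: only DomainNotAllowed, Skip, Abort, and Retry are named on this page, and the other kinds are guesses based on the categories mentioned earlier.

```rust
/// Hypothetical mirror of a few error kinds.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Fetch,
    HttpStatus,
    Parse,
    DomainNotAllowed,
}

/// Hypothetical mirror of the error policies.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Policy {
    Skip,
    Retry,
    Abort,
}

/// Decide what to do with a failed URL based on the error kind, not its message.
pub fn decide(kind: Kind, url: &str) -> Policy {
    match kind {
        // Off-domain links are expected noise: skip them.
        Kind::DomainNotAllowed => Policy::Skip,
        // Network-level failures are often transient: try again.
        Kind::Fetch => Policy::Retry,
        // Optional pages may fail without stopping the crawl.
        Kind::HttpStatus | Kind::Parse if url.contains("/optional/") => Policy::Skip,
        // Anything else on a required page stops the crawl.
        Kind::HttpStatus | Kind::Parse => Policy::Abort,
    }
}
```

Keeping the decision in a pure function like this makes it easy to call from on_error and to cover with ordinary unit tests.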

Domain & Depth Filtering

fn allowed_domains(&self) -> Vec<&str> {
    vec!["example.com"]  // subdomains are included automatically
}

fn max_depth(&self) -> Option<usize> {
    Some(3)  // don't follow links more than 3 hops from start_urls
}
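
The two filters can be sketched as plain functions. host_allowed and depth_allowed are hypothetical helpers, not kumo's code; the sketch assumes subdomain matching means an exact match or a suffix match on a dot boundary, that comparison is case-insensitive, and that start_urls sit at depth 0.

```rust
/// True if `host` is one of the allowed domains or a subdomain of one.
/// An empty allow-list means no restriction.
pub fn host_allowed(host: &str, allowed: &[&str]) -> bool {
    if allowed.is_empty() {
        return true;
    }
    let host = host.to_ascii_lowercase();
    allowed.iter().any(|domain| {
        let domain = domain.to_ascii_lowercase();
        // Require a dot before the suffix so "notexample.com" does not match "example.com".
        host == domain || host.ends_with(&format!(".{}", domain))
    })
}

/// True if a request at `depth` hops from the start URLs may still be followed.
/// `None` means unlimited depth.
pub fn depth_allowed(depth: usize, max_depth: Option<usize>) -> bool {
    max_depth.map_or(true, |max| depth <= max)
}
```

The dot-boundary check is the detail worth noticing: a bare suffix comparison would let look-alike hosts through the filter.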

CrawlEngine Builder

CrawlEngine::builder() is a fluent builder that configures and launches the engine:

CrawlEngine::builder()
    .concurrency(8)                           // max parallel requests (default: 8)
    .crawl_delay(Duration::from_millis(500))  // fixed delay between requests
    .retry(3, Duration::from_millis(200))     // retry up to 3× with 200ms base delay
    .respect_robots_txt(true)                 // honours robots.txt (default: true)
    .max_urls(500_000)                        // Bloom filter size (default: 1_000_000)
    .max_pages(10_000)                        // stop after enough pages
    .max_items(100_000)                       // stop after enough items
    .max_duration(Duration::from_secs(3600))  // stop after elapsed wall-clock time
    .max_errors(100)                          // stop after permanent failures
    .metrics_interval(Duration::from_secs(30))
    .middleware(DefaultHeaders::new().user_agent("my-bot/1.0"))
    .store(JsonlStore::new("output.jsonl")?)
    .run(MySpider)
    .await?;
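
Since max_urls is described as a Bloom filter size, a minimal Bloom filter may help explain the trade-off. MiniBloom is a hypothetical std-only sketch, not kumo's deduplicator; whether max_urls counts bits or expected URLs is not stated here, so the sketch takes a bit count directly.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical fixed-size Bloom filter for "have we seen this URL?" checks.
pub struct MiniBloom {
    bits: Vec<bool>,
    hashes: u64,
}

impl MiniBloom {
    /// Create a filter with `num_bits` slots and `hashes` hash functions (both at least 1).
    pub fn new(num_bits: usize, hashes: u64) -> Self {
        MiniBloom {
            bits: vec![false; num_bits.max(1)],
            hashes: hashes.max(1),
        }
    }

    /// Derive the slot for `url` under the hash function numbered `seed`.
    fn index(&self, url: &str, seed: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        seed.hash(&mut hasher);
        url.hash(&mut hasher);
        (hasher.finish() % self.bits.len() as u64) as usize
    }

    /// Record `url` and return true if it was definitely not seen before.
    /// A false result means "probably seen": false positives are possible.
    pub fn insert_if_new(&mut self, url: &str) -> bool {
        let mut is_new = false;
        for seed in 0..self.hashes {
            let i = self.index(url, seed);
            if !self.bits[i] {
                is_new = true;
                self.bits[i] = true;
            }
        }
        is_new
    }
}
```

A filter like this never forgets a URL it has recorded, but a filter that is too small for the crawl starts reporting unseen URLs as duplicates, which is why the size is worth tuning for large crawls.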

Multi-Spider Engine

Run multiple independent spiders in one process. Each spider gets its own frontier:

CrawlEngine::builder()
    .concurrency(4)
    .add_spider(QuotesSpider)
    .add_spider(BooksSpider)
    .run_all()
    .await?;

Each spider's parse() is called only for URLs in its own frontier. Items from all spiders flow to the same store.
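
The separation between per-spider frontiers and a shared store can be sketched with standard collections. MiniEngine is hypothetical and runs sequentially in round-robin order, which is an assumption made for simplicity; it says nothing about how kumo actually schedules spiders.

```rust
use std::collections::{BTreeMap, VecDeque};

/// Hypothetical engine: one URL queue per spider, one store shared by all.
pub struct MiniEngine {
    frontiers: BTreeMap<String, VecDeque<String>>,
    store: Vec<(String, String)>, // (spider name, item)
}

impl MiniEngine {
    pub fn new() -> Self {
        MiniEngine { frontiers: BTreeMap::new(), store: Vec::new() }
    }

    /// Register a spider by name and seed its own frontier with its start URLs.
    pub fn add_spider(mut self, name: &str, start_urls: &[&str]) -> Self {
        let queue = self.frontiers.entry(name.to_string()).or_default();
        queue.extend(start_urls.iter().map(|url| url.to_string()));
        self
    }

    /// Drain every frontier in round-robin order and return the shared store.
    pub fn run_all(mut self) -> Vec<(String, String)> {
        loop {
            let mut progressed = false;
            for (name, frontier) in self.frontiers.iter_mut() {
                // A spider only ever sees URLs popped from its own frontier.
                if let Some(url) = frontier.pop_front() {
                    self.store.push((name.clone(), format!("item from {}", url)));
                    progressed = true;
                }
            }
            if !progressed {
                break;
            }
        }
        self.store
    }
}
```

Tagging each stored item with the spider's name, as done here, is one way to tell the sources apart afterwards when everything lands in the same store.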