Spiders

A spider is a struct that implements the Spider trait. It tells kumo where to start, how to parse each page, and what items to emit.

The Spider Trait

#[async_trait::async_trait]
pub trait Spider: Send + Sync {
    type Item: serde::Serialize + Send;

    fn name(&self) -> &str;
    fn start_urls(&self) -> Vec<String>;

    async fn parse(
        &self,
        response: &Response,
    ) -> Result<Output<Self::Item>, KumoError>;

    // --- Optional hooks ---

    /// Called once before the crawl starts.
    async fn open(&self) -> Result<(), KumoError> { Ok(()) }

    /// Called once after the crawl finishes.
    async fn close(&self, _stats: &CrawlStats) -> Result<(), KumoError> { Ok(()) }

    /// Only crawl these domains (empty = no restriction).
    fn allowed_domains(&self) -> Vec<&str> { vec![] }

    /// Stop following links deeper than this.
    fn max_depth(&self) -> Option<usize> { None }

    /// How to handle a fetch/parse error for a URL.
    fn on_error(&self, _url: &str, _err: &KumoError) -> ErrorPolicy {
        ErrorPolicy::Skip
    }
}

Output

parse() returns Output<T>, a builder that collects items and requests to follow:

Output::new()
    .item(my_item)                    // add one item
    .items(vec![a, b, c])             // add many items
    .follow("https://next-page")      // enqueue a GET request
    .follow_many(links)               // enqueue many GET requests
    .request(
        CrawlRequest::get("https://example.com/high-priority")
            .priority(10)
            .dont_filter(true),
    )

Items are serialized to JSON exactly once and passed to pipelines and the store. Use CrawlRequest when a follow-up request needs custom priority, headers, method/body, metadata, or duplicate-filtering behavior.
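
To make the chaining style concrete, here is a minimal std-only sketch of a consuming builder. SketchOutput and its fields are hypothetical stand-ins written for this page; they are not kumo's Output<T> and only mirror the method names shown above.

```rust
/// Hypothetical stand-in for a builder like `Output<T>`; not kumo's real type.
#[derive(Debug)]
pub struct SketchOutput<T> {
    pub items: Vec<T>,
    pub follow: Vec<String>,
}

impl<T> SketchOutput<T> {
    pub fn new() -> Self {
        Self { items: Vec::new(), follow: Vec::new() }
    }

    /// Add one item and hand the builder back for chaining.
    pub fn item(mut self, item: T) -> Self {
        self.items.push(item);
        self
    }

    /// Add many items at once.
    pub fn items(mut self, items: impl IntoIterator<Item = T>) -> Self {
        self.items.extend(items);
        self
    }

    /// Queue one URL to visit later.
    pub fn follow(mut self, url: impl Into<String>) -> Self {
        self.follow.push(url.into());
        self
    }

    /// Queue many URLs to visit later.
    pub fn follow_many<I, S>(mut self, urls: I) -> Self
    where
        I: IntoIterator<Item = S>,
        S: Into<String>,
    {
        self.follow.extend(urls.into_iter().map(Into::into));
        self
    }
}
```

Each method takes self by value and returns it, which is what allows a single expression to accumulate both items and follow-up URLs before it is returned from parse().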

Lifecycle Hooks

#[async_trait::async_trait]
impl Spider for MySpider {
    // ...

    async fn open(&self) -> Result<(), KumoError> {
        // e.g. open a database connection, create a temp file
        println!("crawl starting");
        Ok(())
    }

    async fn close(&self, stats: &CrawlStats) -> Result<(), KumoError> {
        println!(
            "done: {} pages, {} items, {} errors",
            stats.pages_crawled, stats.items_scraped, stats.errors
        );
        Ok(())
    }
}

CrawlStats fields:

Field	Type	Description
`pages_crawled`	`u64`	Responses processed
`items_scraped`	`u64`	Items passed to the store
`errors`	`u64`	Failed requests
`duration`	`Duration`	Wall-clock crawl time
`bytes_downloaded`	`u64`	Total response body bytes
`timings`	`CrawlTimingStats`	Cumulative successful-request phase timings for middleware, fetch, parse, pipeline, and store work
`interrupted`	`bool`	`true` if stopped by Ctrl+C
`error_kinds`	`BTreeMap<String, u64>`	Permanent failures grouped by stable `KumoErrorKind` label
`stop_reason`	`Option<StopReason>`	Why the crawl stopped
`scheduled`	`u64`	Requests accepted by the scheduler
`deduped`	`u64`	Requests skipped because their fingerprint was already seen
`retries`	`u64`	Retry attempts requeued by retry policy or `ErrorPolicy::Retry`
`retry_exhausted`	`u64`	URLs that permanently failed after retry capacity was exhausted
`robots_blocked`	`u64`	Requests skipped because robots.txt disallowed them
`domains`	`BTreeMap<String, DomainStats>`	Per-domain counters for scheduled, deduped, completed, failed, error kinds, retries, retry exhaustion, and robots-blocked requests

errors counts permanent request failures, including exhausted retries, unhandled fetch/parse errors, and crawl task panics. Panics are attributed to the request's domain in domains[domain].failed so production reports do not silently lose failed work.

Use retry_exhausted when alerts need to distinguish "we retried and still failed" from one-off permanent failures. Use CrawlReport::retry_summary() when alerts need a compact production signal that separates retry pressure from retry exhaustion. Use error_kinds when alerts or dashboards need to separate parse failures, HTTP status failures, fetch failures, and other KumoErrorKind categories.

Use timings to identify the largest successful-request phase. Timing totals are cumulative across concurrent tasks, so they can be larger than duration.
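
The point about cumulative timings can be checked with plain arithmetic. The sketch below uses only std::time::Duration; the helper names cumulative and largest_phase are hypothetical, and the layout of CrawlTimingStats is not assumed.

```rust
use std::time::Duration;

/// Sum per-task phase time the way a cumulative counter would.
pub fn cumulative(per_task: &[Duration]) -> Duration {
    per_task.iter().sum()
}

/// Pick the phase with the largest cumulative time, if any phases are given.
pub fn largest_phase<'a>(phases: &[(&'a str, Duration)]) -> Option<&'a str> {
    phases
        .iter()
        .max_by_key(|(_, total)| *total)
        .map(|(name, _)| *name)
}
```

With four concurrent tasks that each spend 2 seconds fetching, the cumulative fetch time is 8 seconds even though the wall-clock duration may be close to 2 seconds.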

When updating stats manually, use record_error(domain) to increment both the global error count and the matching per-domain failure count together. Use record_error_kind(domain, kind) when you also know the KumoErrorKind.
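
The reason for a combined helper is that the global and per-domain counters must stay in step. A minimal std-only model of that invariant follows; SketchStats is hypothetical, it takes the kind as a string label rather than a KumoErrorKind, and it assumes the kind-aware variant also bumps both failure counters.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified model of crawl counters; not kumo's CrawlStats.
#[derive(Debug, Default)]
pub struct SketchStats {
    pub errors: u64,
    pub error_kinds: BTreeMap<String, u64>,
    pub domain_failed: BTreeMap<String, u64>,
}

impl SketchStats {
    /// Increment the global and the per-domain failure count together.
    pub fn record_error(&mut self, domain: &str) {
        self.errors += 1;
        *self.domain_failed.entry(domain.to_string()).or_insert(0) += 1;
    }

    /// Same as `record_error`, and also count the error under its kind label.
    pub fn record_error_kind(&mut self, domain: &str, kind: &str) {
        self.record_error(domain);
        *self.error_kinds.entry(kind.to_string()).or_insert(0) += 1;
    }

    /// The invariant the helpers protect: global errors equal the per-domain sum.
    pub fn is_consistent(&self) -> bool {
        self.errors == self.domain_failed.values().sum::<u64>()
    }
}
```

Incrementing the fields by hand in two places is where such counters tend to drift apart, which is the case the combined helpers are meant to prevent.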

stop_reason is set when the crawl ends:

Reason	Meaning
`FrontierExhausted`	No scheduled or in-flight requests remain
`Interrupted`	The crawl received Ctrl+C or stream cancellation
`MaxPages`	`max_pages()` was reached
`MaxItems`	`max_items()` was reached after a response finished
`MaxDuration`	`max_duration()` was reached
`MaxErrors`	`max_errors()` was reached

Use CrawlReport::from(stats) when you need a stable snapshot for logging or export. Reports can be exported directly with to_json_value(), to_json_string(), or to_json_string_pretty():

let stats = CrawlEngine::builder()
    .run(MySpider)
    .await?;

let report = CrawlReport::from(stats);
std::fs::write("crawl-report.json", report.to_json_string_pretty())?;

CrawlReport also exposes derived helpers for production dashboards and alerts:

Helper	Meaning
`pages_per_second()`	Successful pages divided by crawl duration
`items_per_second()`	Scraped items divided by crawl duration
`bytes_per_second()`	Downloaded response bytes divided by crawl duration
`error_rate()`	Failed requests divided by completed and failed requests
`success_rate()`	Completed requests divided by completed and failed requests
`retry_exhaustion_rate()`	Retry-exhausted requests divided by retry attempts
`retry_summary()`	Retry attempts, exhausted retries, retry pressure, exhaustion, and exhausted-failure rates
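
The ratios in the table can be written out as plain arithmetic. This is a sketch under stated assumptions: Counters and its field names are hypothetical, and returning 0.0 for a zero denominator is a choice made here, not documented kumo behavior.

```rust
use std::time::Duration;

/// Hypothetical counter snapshot used only for this illustration.
pub struct Counters {
    pub pages_crawled: u64,
    pub completed: u64,
    pub failed: u64,
    pub retries: u64,
    pub retry_exhausted: u64,
    pub duration: Duration,
}

/// Divide two counters, returning 0.0 when the denominator is zero (assumption).
fn ratio(numerator: u64, denominator: u64) -> f64 {
    if denominator == 0 {
        0.0
    } else {
        numerator as f64 / denominator as f64
    }
}

impl Counters {
    /// Pages divided by crawl duration in seconds.
    pub fn pages_per_second(&self) -> f64 {
        let secs = self.duration.as_secs_f64();
        if secs == 0.0 { 0.0 } else { self.pages_crawled as f64 / secs }
    }

    /// Failed requests divided by completed and failed requests.
    pub fn error_rate(&self) -> f64 {
        ratio(self.failed, self.completed + self.failed)
    }

    /// Completed requests divided by completed and failed requests.
    pub fn success_rate(&self) -> f64 {
        ratio(self.completed, self.completed + self.failed)
    }

    /// Retry-exhausted requests divided by retry attempts.
    pub fn retry_exhaustion_rate(&self) -> f64 {
        ratio(self.retry_exhausted, self.retries)
    }
}
```

Because error_rate and success_rate share a denominator, they sum to 1 whenever at least one request has completed or failed.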

Report JSON uses stable snake_case field names. duration is exported as duration_ms and duration_secs, derived helper values are exported as fields such as pages_per_second and error_rate, timing breakdowns are exported under timings, retry health is exported under retry_summary, and stop_reason is exported as a snake_case string such as "frontier_exhausted" or "max_pages".
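
The snake_case naming rule can be illustrated with a small converter. to_snake_case is a hypothetical helper written for this page, not part of kumo; only "frontier_exhausted" and "max_pages" are confirmed export values, and the strings for other variants are what the rule would produce.

```rust
/// Convert a CamelCase identifier such as a variant name to snake_case.
/// Handles ASCII identifiers only, which is enough for this illustration.
pub fn to_snake_case(name: &str) -> String {
    let mut out = String::with_capacity(name.len() + 4);
    for (i, ch) in name.chars().enumerate() {
        if ch.is_ascii_uppercase() {
            // Start a new word at every uppercase letter except the first.
            if i > 0 {
                out.push('_');
            }
            out.push(ch.to_ascii_lowercase());
        } else {
            out.push(ch);
        }
    }
    out
}
```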

Error Handling

on_error lets each spider decide what to do with a failed URL:

fn on_error(&self, url: &str, err: &KumoError) -> ErrorPolicy {
    if matches!(err.kind(), kumo::error::KumoErrorKind::DomainNotAllowed)
        || url.contains("/optional/")
    {
        ErrorPolicy::Skip    // log and continue
    } else {
        ErrorPolicy::Abort   // stop the entire crawl
    }
}

Use err.kind() when you need stable error classification for metrics, logging, or custom retry decisions. This avoids matching on display text.
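
The following std-only sketch shows why classifying by kind is more robust than inspecting the message. SketchError, SketchKind, and SketchPolicy are hypothetical local types that only imitate the shape of the example above; the variant list is illustrative and is not kumo's full KumoErrorKind.

```rust
use std::fmt;

/// Hypothetical error categories for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SketchKind {
    DomainNotAllowed,
    HttpStatus,
    Fetch,
    Parse,
}

/// Hypothetical error carrying a stable kind and a free-form message.
#[derive(Debug)]
pub struct SketchError {
    kind: SketchKind,
    message: String,
}

impl SketchError {
    pub fn new(kind: SketchKind, message: impl Into<String>) -> Self {
        Self { kind, message: message.into() }
    }

    pub fn kind(&self) -> SketchKind {
        self.kind
    }
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.message)
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum SketchPolicy {
    Skip,
    Abort,
}

/// Decide from the kind and the URL, never from the display text.
pub fn decide(url: &str, err: &SketchError) -> SketchPolicy {
    if matches!(err.kind(), SketchKind::DomainNotAllowed) || url.contains("/optional/") {
        SketchPolicy::Skip
    } else {
        SketchPolicy::Abort
    }
}
```

Two errors with the same kind but differently worded messages produce the same decision, so rewording a message cannot silently change crawl behavior.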

Domain & Depth Filtering

fn allowed_domains(&self) -> Vec<&str> {
    vec!["example.com"]  // subdomains are included automatically
}

fn max_depth(&self) -> Option<usize> {
    Some(3)  // don't follow links more than 3 hops from start_urls
}
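
As a rough model of what these two filters check, here is a std-only sketch. host_allowed and depth_allowed are hypothetical helpers, not kumo functions; the sketch assumes start URLs are at depth 0 and that a subdomain matches when the host ends with a dot followed by the allowed domain.

```rust
/// Return true if `host` is allowed by the list; an empty list allows everything.
pub fn host_allowed(host: &str, allowed: &[&str]) -> bool {
    if allowed.is_empty() {
        return true;
    }
    let host = host.to_ascii_lowercase();
    allowed.iter().any(|domain| {
        let domain = domain.to_ascii_lowercase();
        // Exact match, or a subdomain such as "blog.example.com".
        host == domain || host.ends_with(&format!(".{}", domain))
    })
}

/// Return true if a request at `depth` hops from a start URL may be followed.
pub fn depth_allowed(depth: usize, max_depth: Option<usize>) -> bool {
    max_depth.map_or(true, |max| depth <= max)
}
```

Requiring the leading dot matters: a plain suffix check would also accept an unrelated host such as badexample.com.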

CrawlEngine Builder

CrawlEngine::builder() is a fluent builder that configures and launches the engine:

CrawlEngine::builder()
    .concurrency(8)                           // max parallel requests (default: 8)
    .crawl_delay(Duration::from_millis(500))  // fixed delay between requests
    .retry(3, Duration::from_millis(200))     // retry up to 3× with 200ms base delay
    .respect_robots_txt(true)                 // honours robots.txt (default: true)
    .max_urls(500_000)                        // Bloom filter size (default: 1_000_000)
    .max_pages(10_000)                        // stop after enough pages
    .max_items(100_000)                       // stop after enough items
    .max_duration(Duration::from_secs(3600))  // stop after elapsed wall-clock time
    .max_errors(100)                          // stop after permanent failures
    .metrics_interval(Duration::from_secs(30))
    .middleware(DefaultHeaders::new().user_agent("my-bot/1.0"))
    .store(JsonlStore::new("output.jsonl")?)
    .run(MySpider)
    .await?;

Multi-Spider Engine

Run multiple independent spiders in one process. Each gets its own frontier:

CrawlEngine::builder()
    .concurrency(4)
    .add_spider(QuotesSpider)
    .add_spider(BooksSpider)
    .run_all()
    .await?;

Each spider's parse() is called only for URLs in its own frontier. Items from all spiders flow to the same store.
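
The separation between per-spider frontiers and a shared store can be modeled in a few lines. SketchEngine is a hypothetical, sequential toy, not kumo's engine: it has no concurrency or networking, and "parsing" a URL just tags it with the spider's name.

```rust
use std::collections::{BTreeMap, VecDeque};

/// Toy engine: one URL queue per spider, one shared item store.
pub struct SketchEngine {
    frontiers: BTreeMap<String, VecDeque<String>>,
    pub store: Vec<String>,
}

impl SketchEngine {
    pub fn new() -> Self {
        Self { frontiers: BTreeMap::new(), store: Vec::new() }
    }

    /// Register a spider by name with its own frontier seeded from start URLs.
    pub fn add_spider(mut self, name: &str, start_urls: &[&str]) -> Self {
        let frontier: VecDeque<String> =
            start_urls.iter().map(|url| url.to_string()).collect();
        self.frontiers.insert(name.to_string(), frontier);
        self
    }

    /// Drain every frontier; each URL is handled only by the spider that owns it.
    pub fn run_all(mut self) -> Vec<String> {
        for (name, frontier) in self.frontiers.iter_mut() {
            while let Some(url) = frontier.pop_front() {
                // Stand-in for parse(): emit one item tagged with the spider name.
                self.store.push(format!("{}:{}", name, url));
            }
        }
        self.store
    }
}
```

Because all items land in one store, tagging each item with its spider, as this toy does, is one way to tell the sources apart afterwards.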