Getting Started

Prerequisites

Rust 1.94+ (stable toolchain)
tokio runtime
async-trait crate

Installation

Add kumo to your Cargo.toml:

[dependencies]
kumo = "0.2"
async-trait = "0.1"
serde = { version = "1", features = ["derive"] }
tokio = { version = "1", features = ["full"] }

For optional features (database stores, browser mode, LLM extraction) see Feature Flags.

Your First Spider

A spider has four required parts:

An item type - a Serialize struct representing what you scrape
name() - a unique identifier for this spider
start_urls() - where the crawl begins
parse() - how to extract items and follow links from a response

use kumo::prelude::*;
use serde::Serialize;

#[derive(Debug, Serialize)]
struct Quote {
    text: String,
    author: String,
}

struct QuotesSpider;

#[async_trait::async_trait]
impl Spider for QuotesSpider {
    type Item = Quote;

    fn name(&self) -> &str { "quotes" }

    fn start_urls(&self) -> Vec<String> {
        vec!["https://quotes.toscrape.com".into()]
    }

    async fn parse(&self, res: &Response) -> Result<Output<Self::Item>, KumoError> {
        let quotes: Vec<Quote> = res.css(".quote").iter().map(|el| Quote {
            text:   el.css(".text").first().map(|e| e.text()).unwrap_or_default(),
            author: el.css(".author").first().map(|e| e.text()).unwrap_or_default(),
        }).collect();

        // Follow pagination
        let next = res.css("li.next a").first()
            .and_then(|el| el.attr("href"))
            .map(|href| res.urljoin(&href));

        let mut output = Output::new().items(quotes);
        if let Some(url) = next { output = output.follow(url); }
        Ok(output)
    }
}

Running the Crawl

Use CrawlEngine::builder() to configure and launch:

#[tokio::main]
async fn main() -> Result<(), KumoError> {
    CrawlEngine::builder()
        .concurrency(5)                                            // parallel requests
        .middleware(DefaultHeaders::new().user_agent("kumo/0.2")) // set User-Agent
        .store(JsonlStore::new("quotes.jsonl")?)                  // write to JSONL
        .run(QuotesSpider)
        .await?;
    Ok(())
}

This crawls every page reachable through the pagination links, writes each Quote as one JSON line in quotes.jsonl, and exits when the frontier (the queue of URLs still to fetch) is empty.
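
To make "exits when the frontier is empty" concrete, here is a minimal standard-library sketch of a crawl loop. It is not Kumo's scheduler: the crawl function and the in-memory site map (URL to outgoing links) are hypothetical stand-ins for real fetching and parsing.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Crawl a hypothetical in-memory "site" (URL -> outgoing links) breadth-first.
/// Returns the URLs in the order they were visited.
fn crawl(site: &HashMap<&str, Vec<&str>>, start: &str) -> Vec<String> {
    let mut frontier: VecDeque<String> = VecDeque::from([start.to_string()]);
    let mut seen: HashSet<String> = HashSet::from([start.to_string()]);
    let mut visited = Vec::new();

    // The crawl ends when no queued URLs remain.
    while let Some(url) = frontier.pop_front() {
        // "Fetch" the page: look up its outgoing links (none if unknown).
        let links = site.get(url.as_str()).cloned().unwrap_or_default();
        visited.push(url);
        for link in links {
            // insert() returns false for URLs already queued or visited.
            if seen.insert(link.to_string()) {
                frontier.push_back(link.to_string());
            }
        }
    }
    visited
}

fn main() {
    let site = HashMap::from([
        ("/page/1", vec!["/page/2"]),
        ("/page/2", vec!["/page/3", "/page/1"]),
        ("/page/3", vec![]),
    ]);
    let visited = crawl(&site, "/page/1");
    println!("{visited:?}");
}
```

Page 2 links back to page 1, but the seen set stops that link from being queued again, so the loop visits three pages and then stops.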

Polite Crawling

For production crawls, configure per-domain limits so that Kumo throttles each site separately instead of treating every URL as part of one global queue:

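use std::time::Duration;
use kumo::prelude::*;

// QuotesSpider is the spider defined in "Your First Spider".
#[tokio::main]
async fn main() -> Result<(), KumoError> {
    CrawlEngine::builder()
        .concurrency(16)                             // parallel requests across all domains
        .max_pages(10_000)                           // page budget
        .max_duration(Duration::from_secs(60 * 60))  // time budget: one hour
        .politeness(
            PolitenessPolicy::new()
                .per_domain_concurrency(2)
                .per_domain_delay(Duration::from_millis(500)),
        )
        .fingerprint_policy(FingerprintPolicy::default().strip_tracking_params(true))
        .run(QuotesSpider)
        .await?;
    Ok(())
}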
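
The effect of a per-domain delay can be pictured with a small standard-library sketch. This is an illustration under assumptions, not Kumo's implementation: DomainGate and reserve are hypothetical names, and the sketch only computes how long each request would have to wait.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical per-domain gate: remembers the earliest time each domain
/// may be requested again.
struct DomainGate {
    delay: Duration,
    next_allowed: HashMap<String, Instant>,
}

impl DomainGate {
    fn new(delay: Duration) -> Self {
        Self { delay, next_allowed: HashMap::new() }
    }

    /// Returns how long a request to `domain` must wait at time `now`,
    /// and reserves the following slot for that domain.
    fn reserve(&mut self, domain: &str, now: Instant) -> Duration {
        // The next free slot is never in the past.
        let slot = self.next_allowed.get(domain).copied().unwrap_or(now).max(now);
        self.next_allowed.insert(domain.to_string(), slot + self.delay);
        slot - now
    }
}

fn main() {
    let mut gate = DomainGate::new(Duration::from_millis(500));
    let now = Instant::now();
    // Four requests arrive at the same instant, three of them for one domain.
    for domain in ["a.example", "a.example", "b.example", "a.example"] {
        let wait = gate.reserve(domain, now);
        println!("{domain}: wait {} ms", wait.as_millis());
    }
}
```

Requests to a.example are spaced 500 ms apart (waits of 0, 500, and 1000 ms), while the request to b.example is not held back by them.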

The scheduler handles request priority, per-domain delay, delayed retries, fingerprint-based deduplication, crawl budgets, and crawl stats. Inspect stats.stop_reason after run() or run_all() to see whether a crawl ended because the frontier was exhausted, the crawl was interrupted, or a configured budget was reached. When production jobs need to save a crawl summary, convert stats into CrawlReport and call to_json_string_pretty().
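
Fingerprint-based deduplication with tracking parameters stripped can be sketched as follows. This is a simplified illustration using only the standard library, not Kumo's FingerprintPolicy: the fingerprint function, its canonical form (sorted query parameters, no fragment), and the list of tracking keys are all assumptions.

```rust
use std::collections::HashSet;

/// Hypothetical fingerprint: the URL without its fragment, with tracking
/// parameters removed and the remaining query parameters sorted.
fn fingerprint(url: &str) -> String {
    // Drop the fragment; it is never sent to the server.
    let url = url.split('#').next().unwrap_or(url);
    let (base, query) = match url.split_once('?') {
        Some((base, query)) => (base, query),
        None => return url.to_string(),
    };
    let mut params: Vec<&str> = query
        .split('&')
        .filter(|p| !p.is_empty())
        .filter(|p| {
            let key = p.split('=').next().unwrap_or("");
            // Assumed tracking-parameter list, for illustration only.
            !(key.starts_with("utm_") || key == "fbclid" || key == "gclid")
        })
        .collect();
    // Sort so that parameter order does not change the fingerprint.
    params.sort_unstable();
    if params.is_empty() {
        base.to_string()
    } else {
        format!("{base}?{}", params.join("&"))
    }
}

fn main() {
    let mut seen = HashSet::new();
    let urls = [
        "https://example.com/a?id=1&utm_source=news",
        "https://example.com/a?utm_campaign=x&id=1",
        "https://example.com/a?id=2",
    ];
    for url in urls {
        // insert() returns false when the fingerprint was already seen.
        let is_new = seen.insert(fingerprint(url));
        println!("{url} -> {}", if is_new { "fetch" } else { "skip (duplicate)" });
    }
}
```

The first two URLs differ only in tracking parameters, so they share a fingerprint and the second is skipped; the third has a different id and is fetched.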

For a fuller production-style setup with Retry-After-aware retries, FileFrontier resume state, metrics, robots.txt handling, and JSONL storage, see production_crawler.rs.

What's Next?

Spiders - full Spider trait API, lifecycle hooks, error handling
Extractors - CSS, XPath, Regex, JSONPath, #[derive(Extract)], LLM
Stores - JSONL, JSON, CSV, PostgreSQL, SQLite, MySQL
Middleware - rate limiting, auto-throttle, retry, proxy rotation