Examples
All examples live in the examples/ folder. Run any of them with cargo run --example <name>.
Basic Spiders
quotes.rs - minimal spider
Scrapes all quotes from quotes.toscrape.com, following pagination. The simplest possible kumo spider: it uses only CSS selectors and JsonlStore.
books.rs - rate limiting + retry
Scrapes all 1000 books from books.toscrape.com across 50 pages. Demonstrates RateLimiter, retry with exponential backoff, allowed_domains, max_depth, and JsonStore.
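To make the retry behaviour concrete, here is a minimal std-only sketch of how exponentially growing retry delays are commonly computed. The function name `backoff_delay` and its parameters are hypothetical and are not part of kumo's API; books.rs configures retries through kumo itself.

```rust
use std::time::Duration;

/// Hypothetical helper (not kumo API): delay before retry number `attempt`
/// (0-based), computed as base * 2^attempt and capped at `max`.
fn backoff_delay(base: Duration, attempt: u32, max: Duration) -> Duration {
    // checked_pow / checked_mul guard against overflow on large attempt counts.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(max).min(max)
}

fn main() {
    let base = Duration::from_millis(500);
    let max = Duration::from_secs(30);
    for attempt in 0..8 {
        println!("retry {attempt}: wait {:?}", backoff_delay(base, attempt, max));
    }
}
```

The cap keeps a long run of failures from producing unbounded waits. Real crawlers often add random jitter on top of this value.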
books_derive.rs - #[derive(Extract)]
Same as books.rs but uses #[derive(Extract)] with field annotations instead of manual CSS selectors.
multi_spider.rs - multiple spiders
Runs two independent spiders (quotes + books) concurrently in a single engine using .add_spider() / .run_all().
Selectors
selectors.rs - CSS, regex, JSONPath
Demonstrates CSS, regex, and JSONPath selectors against local HTML and JSON - no network required.
# CSS + regex
cargo run --example selectors
# CSS + regex + JSONPath
cargo run --example selectors --features jsonpath
xpath.rs - XPath selectors
Demonstrates XPath selectors on an HTML response using the xpath feature.
Middleware
autothrottle.rs - adaptive throttling
Shows AutoThrottle middleware adapting request delay based on server latency and 429/503 responses.
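The idea behind adaptive throttling can be sketched without any crawler at all. The `Throttle` struct below is hypothetical and its update rule is an assumption chosen for illustration (move halfway toward latency divided by a target concurrency, double the delay on 429/503); kumo's AutoThrottle may use a different rule.

```rust
use std::time::Duration;

/// Hypothetical throttle state; not kumo's AutoThrottle type.
struct Throttle {
    delay: Duration,
    min: Duration, // assumed to be <= max
    max: Duration,
    target_concurrency: u32,
}

impl Throttle {
    /// Update the delay after one response.
    fn observe(&mut self, latency: Duration, status: u16) {
        let next = if status == 429 || status == 503 {
            // The server is pushing back: double the delay.
            self.delay.saturating_mul(2)
        } else {
            // Move halfway toward the delay that matches the observed latency.
            let target = latency / self.target_concurrency.max(1);
            (self.delay + target) / 2
        };
        self.delay = next.clamp(self.min, self.max);
    }
}

fn main() {
    let mut t = Throttle {
        delay: Duration::from_secs(1),
        min: Duration::from_millis(100),
        max: Duration::from_secs(10),
        target_concurrency: 2,
    };
    for (latency_ms, status) in [(400u64, 200u16), (400, 200), (50, 503), (300, 200)] {
        t.observe(Duration::from_millis(latency_ms), status);
        println!("status {status}: next delay {:?}", t.delay);
    }
}
```

Averaging with the previous delay smooths out single slow responses, while the min/max bounds keep the crawler from stalling or hammering the server.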
proxy_rotation.rs - proxy rotation
Demonstrates ProxyRotator middleware cycling through a list of proxy URLs.
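Cycling through a proxy list usually means round-robin selection. The sketch below shows that pattern with the standard library only; `RoundRobin` and the proxy URLs are hypothetical, and whether ProxyRotator uses strict round-robin is an assumption.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical round-robin picker; not kumo's ProxyRotator.
struct RoundRobin {
    proxies: Vec<String>,
    next: AtomicUsize,
}

impl RoundRobin {
    fn new(proxies: Vec<String>) -> Self {
        Self { proxies, next: AtomicUsize::new(0) }
    }

    /// Returns the next proxy in order, wrapping around at the end.
    /// Takes &self so it can be shared across concurrent tasks.
    fn pick(&self) -> Option<&str> {
        if self.proxies.is_empty() {
            return None;
        }
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.proxies.len();
        Some(self.proxies[i].as_str())
    }
}

fn main() {
    // Placeholder proxy URLs for illustration only.
    let rotator = RoundRobin::new(vec![
        "http://proxy-a.example:8080".to_string(),
        "http://proxy-b.example:8080".to_string(),
    ]);
    for _ in 0..4 {
        println!("{:?}", rotator.pick());
    }
}
```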
polite_crawling.rs - polite crawl scheduling
Shows PolitenessPolicy, per-domain concurrency, per-domain delay, request priority, metadata, fingerprint-based deduplication, and crawl stats.
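Fingerprint-based deduplication boils down to hashing a normalized form of each request and skipping any hash already seen. The sketch below is an illustration under stated assumptions: `fingerprint` and `SeenSet` are hypothetical names, and the normalization (uppercase method, URL fragment dropped) is a guess at a reasonable scheme, not kumo's actual fingerprint.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical fingerprint: hash of the uppercased method plus the URL
/// with any #fragment removed (fragments are never sent to the server).
fn fingerprint(method: &str, url: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    method.to_ascii_uppercase().hash(&mut hasher);
    let without_fragment = url.split('#').next().unwrap_or(url);
    without_fragment.hash(&mut hasher);
    hasher.finish()
}

/// Set of fingerprints already scheduled.
struct SeenSet(HashSet<u64>);

impl SeenSet {
    /// Returns true if the request is new and should be crawled.
    fn insert_if_new(&mut self, method: &str, url: &str) -> bool {
        self.0.insert(fingerprint(method, url))
    }
}

fn main() {
    let mut seen = SeenSet(HashSet::new());
    for url in [
        "https://quotes.toscrape.com/page/2/",
        "https://quotes.toscrape.com/page/2/#top",
        "https://quotes.toscrape.com/page/3/",
    ] {
        println!("{url}: new = {}", seen.insert_if_new("GET", url));
    }
}
```

Storing 64-bit fingerprints instead of full URLs keeps memory use flat on large crawls, at the cost of a very small chance of hash collisions.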
Stores
sqlite.rs - SQLite store
Stores scraped items into a local SQLite file.
postgres.rs - PostgreSQL store
Stores scraped items into PostgreSQL. Requires a running Postgres instance.
cloud.rs - Cloud storage (S3 / GCS / Azure / local)
Stores scraped items as JSONL via the backend-agnostic CloudStore. The example uses LocalFileSystem - no cloud credentials needed. Swap the backend for AmazonS3, GoogleCloudStorage, or MicrosoftAzure with no other code changes.
LLM Extraction
llm_extract.rs - LLM extraction
Scrapes quotes.toscrape.com without any CSS selectors - the LLM reads the HTML and fills in the struct automatically.
Swap the feature flag and client to use a different provider:
| Provider | Feature flag | Client |
|---|---|---|
| Anthropic Claude | claude | AnthropicClient |
| OpenAI | openai | OpenAiClient |
| Google Gemini | gemini | GeminiClient |
| Ollama (local) | ollama | OllamaClient |
llm_fallback.rs - CSS + LLM fallback
Uses #[extract(llm_fallback = "hint")] - tries CSS first and falls back to the LLM only when the selector returns nothing.
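The control flow of "cheap extractor first, expensive extractor only on a miss" can be sketched independently of kumo. Everything below is hypothetical: `extract_with_fallback` and `between` are illustrative helpers, the closures stand in for a CSS selector and an LLM call, and treating a blank string as "nothing" is an assumption.

```rust
/// Hypothetical stand-in for a selector: text between two markers.
fn between(html: &str, open: &str, close: &str) -> Option<String> {
    let start = html.find(open)? + open.len();
    let end = html[start..].find(close)? + start;
    Some(html[start..end].to_string())
}

/// Try `primary`; call `fallback` only if it yields nothing (None or blank).
fn extract_with_fallback<P, F>(html: &str, primary: P, fallback: F) -> Option<String>
where
    P: Fn(&str) -> Option<String>,
    F: Fn(&str) -> Option<String>,
{
    match primary(html) {
        Some(value) if !value.trim().is_empty() => Some(value),
        _ => fallback(html),
    }
}

fn main() {
    let selector = |h: &str| between(h, "<span class=\"text\">", "</span>");
    // Stand-in for an LLM call; a real one would be slow and cost money.
    let llm = |_: &str| Some("value guessed by the model".to_string());

    let with_markup = "<span class=\"text\">Hello</span>";
    let without_markup = "<div>Hello</div>";
    println!("{:?}", extract_with_fallback(with_markup, selector, llm));
    println!("{:?}", extract_with_fallback(without_markup, selector, llm));
}
```

Because the fallback runs only on a miss, pages whose markup matches the selector never incur the cost of a model call.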
Advanced
production_crawler.rs - production crawl controls
Combines the production defaults most crawlers need: robots.txt, per-domain concurrency, per-domain delay, jitter, Retry-After aware retries, StatusRetry, persistent FileFrontier recovery state, metrics, and JSONL storage.
crawl_events.rs - typed lifecycle events
Subscribes to typed crawl lifecycle events with .event_channel() and prints request completion plus final crawl totals. Uses MockFetcher, so it runs without network access.
crawl_hooks.rs - crawl lifecycle hooks
Registers an async CrawlHook that counts completed requests and scraped items. Uses MockFetcher, so it runs without network access.
http_cache.rs - HTTP response cache
Demonstrates disk-backed response caching. Run it once to populate the cache, then run it again to see responses served from disk instead of the network.
link_extractor.rs - link extraction with filtering
Demonstrates LinkExtractor with allow_domains, allow, deny, restrict_css, and canonicalize.
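As a rough illustration of what domain and deny filtering involve, the sketch below filters URLs with the standard library only. `LinkFilter` and `host_of` are hypothetical, the host parsing is deliberately naive (no IPv6 literals), and plain substrings stand in for whatever pattern syntax LinkExtractor's allow and deny options accept.

```rust
/// Naive host extraction: scheme://[user@]host[:port]/...
fn host_of(url: &str) -> Option<&str> {
    let rest = url.split_once("://")?.1;
    let authority = rest.split(|c: char| c == '/' || c == '?' || c == '#').next()?;
    let host = authority.rsplit('@').next()?; // drop userinfo
    let host = host.split(':').next()?; // drop port
    if host.is_empty() { None } else { Some(host) }
}

/// Hypothetical filter; not kumo's LinkExtractor.
struct LinkFilter<'a> {
    allow_domains: &'a [&'a str],
    deny: &'a [&'a str],
}

impl LinkFilter<'_> {
    /// Keep a link if its host is an allowed domain (or a subdomain of one)
    /// and the URL contains none of the deny substrings.
    fn keep(&self, url: &str) -> bool {
        let host = match host_of(url) {
            Some(h) => h,
            None => return false,
        };
        let domain_ok = self
            .allow_domains
            .iter()
            .any(|d| host == *d || host.ends_with(&format!(".{d}")));
        domain_ok && !self.deny.iter().any(|p| url.contains(*p))
    }
}

fn main() {
    let filter = LinkFilter { allow_domains: &["toscrape.com"], deny: &["/login"] };
    for url in [
        "https://books.toscrape.com/catalogue/page-2.html",
        "https://quotes.toscrape.com/login",
        "https://example.com/",
    ] {
        println!("{url}: keep = {}", filter.keep(url));
    }
}
```

Matching on ".domain" rather than a bare suffix prevents a host such as nottoscrape.com from slipping past a toscrape.com allow rule.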
request_scheduling.rs - request scheduling
Demonstrates CrawlRequest with custom method/body, headers, priority, and metadata.
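Request priority is typically implemented with a priority queue. The sketch below uses a `BinaryHeap` and a sequence counter so that equal-priority requests come out in insertion order. `Frontier` is a hypothetical type, and the convention that a larger number means higher priority is an assumption, not something stated for CrawlRequest.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical frontier; not kumo's scheduler.
struct Frontier {
    // (priority, Reverse(sequence), url): the max-heap pops the highest
    // priority first and, within a priority, the lowest sequence number.
    heap: BinaryHeap<(i32, Reverse<u64>, String)>,
    seq: u64,
}

impl Frontier {
    fn new() -> Self {
        Self { heap: BinaryHeap::new(), seq: 0 }
    }

    fn push(&mut self, priority: i32, url: &str) {
        self.heap.push((priority, Reverse(self.seq), url.to_string()));
        self.seq += 1;
    }

    fn pop(&mut self) -> Option<String> {
        self.heap.pop().map(|(_, _, url)| url)
    }
}

fn main() {
    let mut frontier = Frontier::new();
    frontier.push(0, "https://books.toscrape.com/catalogue/page-2.html");
    frontier.push(10, "https://books.toscrape.com/");
    frontier.push(0, "https://books.toscrape.com/catalogue/page-3.html");
    while let Some(url) = frontier.pop() {
        println!("{url}");
    }
}
```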
browser.rs - headless browser
Fetches a JS-rendered page using headless Chromium. Requires the browser feature.
stealth.rs - stealth mode
Sends requests with a Chrome 131 TLS fingerprint using the stealth feature. Requires cmake and nasm.