kumo
kumo is Japanese for both "spider" and "cloud". It is an async web-crawling framework for Rust - Scrapy for Rust.
It gives you a trait-based, async-first API for writing spiders that scrape, follow links, and store structured data - with batteries included for production crawls.
Why kumo?
| | kumo | Scrapy (Python) | Colly (Go) |
|---|---|---|---|
| Language | Rust | Python | Go |
| Type safety | Compile-time | Runtime | Partial |
| Async model | Tokio (true async) | Twisted (event loop) | goroutines |
| Memory safety | Guaranteed | GC | GC |
| CSS / XPath / Regex / JSONPath | Yes | Yes | CSS only |
| `#[derive(Extract)]` macro | Yes | No | No |
| LLM extraction (Claude / OpenAI / Gemini / Ollama) | Yes | No | No |
| Browser / JS rendering | Yes (chromiumoxide) | Yes (Playwright) | No |
| Stealth mode (TLS/HTTP2 fingerprint spoofing) | Yes | No | No |
| Distributed frontier (Redis) | Yes | Yes (scrapy-redis) | No |
| Item stream API | Yes | No | No |
| OpenTelemetry export | Yes | No | No |
| Pluggable stores (JSONL, CSV, Postgres, SQLite, MySQL) | Yes | Yes (pipelines) | No |
| Single binary deploy | Yes | No | Yes |
| Binary size / startup | Small / instant | Large / slow | Small / fast |
Benchmark snapshot - 1,000 books, concurrency 16, median of 3 runs:
| | kumo | Colly (Go) | Scrapy (Python) |
|---|---|---|---|
| Real site - Items/s | 76.7 | 73.5 | 53.3 |
| Local server - Items/s | 12,346 | 4,098 | 180 |
| Peak RSS | 12.5 MB | 31.4 MB | 77.2 MB |
On this local-server parsing workload, kumo measured 3.0x faster than Colly and 69x faster than Scrapy. Treat these as workload-specific results, not universal production guarantees. Full methodology and reproduction steps are in benchmark/.
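The speedup figures follow directly from the Items/s rows in the table. As a quick check, the sketch below recomputes them with a small helper (`speedup` is a name introduced here for illustration, not part of kumo). It also shows that the real-site gap between kumo and Colly is far smaller, roughly 1.04x.

```rust
/// Throughput ratio: how many times faster `a` is than `b`.
/// Hypothetical helper for illustration only; not part of kumo.
fn speedup(a_items_per_s: f64, b_items_per_s: f64) -> f64 {
    a_items_per_s / b_items_per_s
}

fn main() {
    // Local-server Items/s figures copied from the benchmark table.
    let (kumo, colly, scrapy) = (12_346.0, 4_098.0, 180.0);
    println!("local: kumo vs Colly  = {:.1}x", speedup(kumo, colly));
    println!("local: kumo vs Scrapy = {:.0}x", speedup(kumo, scrapy));

    // Real-site Items/s figures from the same table.
    println!("real:  kumo vs Colly  = {:.2}x", speedup(76.7, 73.5));
    println!("real:  kumo vs Scrapy = {:.2}x", speedup(76.7, 53.3));
}
```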
Quick Install
[dependencies]
kumo = "0.2"
async-trait = "0.1"
serde = { version = "1", features = ["derive"] }
tokio = { version = "1", features = ["full"] }
30-Second Example
use kumo::prelude::*;
use serde::Serialize;

// One scraped record; serialized as a single line of quotes.jsonl.
#[derive(Debug, Serialize)]
struct Quote {
    text: String,
    author: String,
}

struct QuotesSpider;

#[async_trait::async_trait]
impl Spider for QuotesSpider {
    type Item = Quote;

    fn name(&self) -> &str { "quotes" }

    fn start_urls(&self) -> Vec<String> {
        vec!["https://quotes.toscrape.com".into()]
    }

    async fn parse(&self, res: &Response) -> Result<Output<Self::Item>, KumoError> {
        // Build one Quote per `.quote` element on the page.
        let quotes: Vec<Quote> = res.css(".quote").iter().map(|el| Quote {
            text: el.css(".text").first().map(|e| e.text()).unwrap_or_default(),
            author: el.css(".author").first().map(|e| e.text()).unwrap_or_default(),
        }).collect();

        // Resolve the "next page" link, if any, against the current URL.
        let next = res.css("li.next a").first()
            .and_then(|el| el.attr("href"))
            .map(|href| res.urljoin(&href));

        // Emit the items and, when there is a next page, follow it.
        let mut output = Output::new().items(quotes);
        if let Some(url) = next { output = output.follow(url); }
        Ok(output)
    }
}

#[tokio::main]
async fn main() -> Result<(), KumoError> {
    CrawlEngine::builder()
        .concurrency(5)
        .middleware(DefaultHeaders::new().user_agent("kumo/0.2"))
        .store(JsonlStore::new("quotes.jsonl")?)
        .run(QuotesSpider)
        .await?;
    Ok(())
}
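To make the parse-then-follow flow in the example concrete, here is a minimal, std-only sketch of the loop a crawl engine conceptually runs: pop a URL, parse it, collect items, and enqueue any links not seen before. This is an illustration under stated assumptions, not kumo's implementation. `MiniSpider`, `PageOutput`, `crawl`, and the fake `Pages` site are all hypothetical names, and the loop is synchronous and single-threaded where kumo's engine is async and concurrent.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical result of parsing one page: scraped items plus links to follow.
struct PageOutput {
    items: Vec<String>,
    follow: Vec<String>,
}

/// Hypothetical, synchronous stand-in for a spider trait (not kumo's `Spider`).
trait MiniSpider {
    fn start_urls(&self) -> Vec<String>;
    fn parse(&self, url: &str) -> PageOutput;
}

/// Breadth-first crawl: pop a URL, parse it, keep its items,
/// and enqueue only the links that have not been seen yet.
fn crawl<S: MiniSpider>(spider: &S) -> Vec<String> {
    let mut frontier: VecDeque<String> = VecDeque::new();
    let mut seen: HashSet<String> = HashSet::new();

    for url in spider.start_urls() {
        if seen.insert(url.clone()) {
            frontier.push_back(url);
        }
    }

    let mut items = Vec::new();
    while let Some(url) = frontier.pop_front() {
        let out = spider.parse(&url);
        items.extend(out.items);
        for next in out.follow {
            // `insert` returns false for URLs already seen, so they are skipped.
            if seen.insert(next.clone()) {
                frontier.push_back(next);
            }
        }
    }
    items
}

/// Fake three-page site: each page yields one item, links back to page 1,
/// and (except the last page) links forward to the next page.
struct Pages;

impl MiniSpider for Pages {
    fn start_urls(&self) -> Vec<String> {
        vec!["/page/1".to_string()]
    }

    fn parse(&self, url: &str) -> PageOutput {
        let n: u32 = url
            .rsplit('/')
            .next()
            .and_then(|s| s.parse().ok())
            .unwrap_or(1);
        let mut follow = vec!["/page/1".to_string()];
        if n < 3 {
            follow.push(format!("/page/{}", n + 1));
        }
        PageOutput {
            items: vec![format!("item from {}", url)],
            follow,
        }
    }
}

fn main() {
    for item in crawl(&Pages) {
        println!("{}", item);
    }
}
```

The `seen` set is what keeps the back-link to `/page/1` from causing an endless loop, so the run yields exactly one item per page.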