Performance
Measure Extraction Work
Kumo's benchmark results record root CSS queries, nested CSS queries, text collections, and attribute reads. These operation counts verify that two performance runs completed the same extraction work before their throughput is compared.
The Criterion hot-path suite measures document parsing, cached root selection, nested selection, text collection, and attribute copying independently. Use those focused measurements to identify the next bottleneck; do not infer a specific extraction cost from the end-to-end crawl time alone.
Interpreting Benchmark Output
Treat Kumo benchmarks as workload-specific measurements, not universal speed guarantees. A benchmark result is useful only with its workload, target, feature flags, concurrency, store, machine, and validation criteria.
Before comparing two runs, confirm that both completed the same work:
| Check | Why it matters |
|---|---|
| Page and item counts match | Throughput is not meaningful if a crawler skipped work. |
| Duplicate and malformed-output checks pass | Fast incorrect output is not a valid result. |
| Retry, error, and HTTP status counters match expectations | Recovery behavior can change throughput and completeness. |
| Store output is included or intentionally isolated | JSONL, database, and no-store runs measure different systems. |
| Median and range are reported | One shared-runner sample can be dominated by runner noise. |
Use local-server modes to study framework overhead and extraction cost with network variance reduced. Use realistic modes to exercise retry behavior, variable latency, response size variance, and output validation. Use real-site numbers only as a snapshot of that site and run environment; target behavior, network path, and rate limiting can change outside Kumo.
Public claims should describe the exact benchmark mode and use medians and ranges from repeated, correctness-gated runs. Avoid generalizing a local-server or CI-runner result into expected production performance for unrelated sites.
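The median-and-range reporting recommended above can be sketched with the standard library alone. This is a minimal illustration, not part of Kumo: `RunSummary`, `summarize`, and the sample throughput figures are hypothetical names and values chosen for the example.

```rust
/// Summary of repeated benchmark runs: the median plus the observed range.
#[derive(Debug, PartialEq)]
struct RunSummary {
    median: f64,
    min: f64,
    max: f64,
}

/// Summarizes throughput samples; returns `None` when there are no samples.
fn summarize(samples: &[f64]) -> Option<RunSummary> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    // `total_cmp` gives a total order over f64, so sorting cannot panic.
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    let median = if sorted.len() % 2 == 1 {
        sorted[mid]
    } else {
        // Even sample count: average the two middle values.
        (sorted[mid - 1] + sorted[mid]) / 2.0
    };
    Some(RunSummary {
        median,
        min: sorted[0],
        max: sorted[sorted.len() - 1],
    })
}

fn main() {
    // Hypothetical pages-per-second figures from three correctness-gated runs.
    let runs = [412.0, 388.5, 405.2];
    let summary = summarize(&runs).expect("at least one run");
    println!(
        "median {:.1} pages/s, range {:.1}..{:.1}",
        summary.median, summary.min, summary.max
    );
}
```

Reporting the range next to the median makes it visible when one noisy run sits far from the others.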
Reuse Compiled Selectors
The regular css(&str) API uses a global compiled-selector cache, which avoids reparsing selector syntax. Very hot loops still pay for a cache lock and hash lookup on every call. Use CssSelector to compile once and bypass that lookup:
let products = CssSelector::parse("article.product_pod")?;
let titles = CssSelector::parse("h3 a")?;
for product in response.css_with(&products).iter() {
    let title = product.css_with(&titles).first().map(|element| element.text());
}
Keep selectors on the spider or another long-lived configuration value rather than compiling them inside Spider::parse.
Reuse Parsed Responses
Kumo parses each text Response into an HTML document lazily on the first CSS query and reuses that document for later response and nested element queries. Cloned Element values keep the shared document alive, so they remain usable after the original Response is dropped.
Prefer several selectors against the same response or element instead of creating new Response values from serialized HTML. text(), attr(), and nested css() operate directly on the shared document. outer_html() performs serialization only when requested and caches the result for that element.
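The lazy, shared parsing described above can be modeled with `OnceLock` and `Arc` from the standard library. This is a simplified sketch and not Kumo's implementation: `Shared`, `Document`, `root`, and the tag-counting "parser" are all hypothetical stand-ins.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};

/// Stand-in for a parsed HTML tree; a real document would hold DOM nodes.
struct Document {
    tag_count: usize,
}

/// State shared by a response and every element derived from it.
struct Shared {
    body: String,
    document: OnceLock<Document>,
    parses: AtomicUsize,
}

impl Shared {
    /// Parses on first use, then returns the cached document.
    fn document(&self) -> &Document {
        self.document.get_or_init(|| {
            self.parses.fetch_add(1, Ordering::Relaxed);
            // Toy "parser": counts '<' characters instead of building a tree.
            Document { tag_count: self.body.matches('<').count() }
        })
    }
}

struct Response {
    shared: Arc<Shared>,
}

struct Element {
    shared: Arc<Shared>,
}

impl Response {
    fn new(body: &str) -> Self {
        Response {
            shared: Arc::new(Shared {
                body: body.to_owned(),
                document: OnceLock::new(),
                parses: AtomicUsize::new(0),
            }),
        }
    }

    /// The first query triggers the parse; later queries reuse the result.
    fn root(&self) -> Element {
        self.shared.document();
        Element { shared: Arc::clone(&self.shared) }
    }

    fn parse_count(&self) -> usize {
        self.shared.parses.load(Ordering::Relaxed)
    }
}

impl Element {
    fn tag_count(&self) -> usize {
        self.shared.document().tag_count
    }

    fn parse_count(&self) -> usize {
        self.shared.parses.load(Ordering::Relaxed)
    }
}

fn main() {
    let response = Response::new("<ul><li>a</li></ul>");
    let first = response.root();
    let second = response.root();
    drop(response); // elements keep the shared document alive
    println!("tags: {}, parses: {}", first.tag_count(), second.parse_count());
}
```

Because each element holds its own `Arc`, dropping the response does not invalidate it, and the parse counter shows that repeated queries do not reparse.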
Request URL Metadata
Kumo parses each immutable CrawlRequest URL lazily and shares the resulting URL and normalized domain metadata across cloned requests. Fingerprinting, politeness scheduling, robots handling, allowed-domain checks, statistics, and events reuse that metadata automatically.
No crawler configuration is required. Requests restored from a persistent frontier rebuild their metadata lazily, so persisted state remains compatible and does not contain derived cache fields.
Request Task Ownership
Kumo's engine keeps one frontier-record copy for task panic recovery while the spawned task borrows its own record during fetching and parsing. This avoids an additional request clone for every dispatched page without weakening scheduler cleanup after a task panic.
Lifecycle event URL and domain strings are also created lazily. Crawls that do not configure an event receiver or hook avoid those event payload allocations automatically; enabling events or hooks preserves the same owned event values.
Robots-blocked, retry, and permanent-failure engine paths borrow request URL and domain metadata for stats, middleware, and tracing, then allocate owned event payload strings only when events or hooks are enabled. Per-domain stats updates also use a single map-entry lookup, keeping large single-domain crawls from paying extra bookkeeping work on every counter update.
Request tasks also cache whether events or hooks are enabled before entering the item loop. When observability is disabled, item-scraped and item-dropped hot paths bypass event dispatch checks and avoid owned event payload allocation entirely.
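The single map-entry lookup mentioned above corresponds to the `HashMap` entry API in the Rust standard library. The sketch below is illustrative only: `DomainStats` and `record` are hypothetical names, and Kumo's real counters and key handling may differ.

```rust
use std::collections::HashMap;

/// Hypothetical per-domain counters for the illustration.
#[derive(Debug, Default, PartialEq)]
struct DomainStats {
    requests: u64,
    errors: u64,
}

/// Updates both counters through one entry lookup, instead of a
/// `contains_key` + `insert` + `get_mut` sequence that hashes the key
/// several times.
fn record(stats: &mut HashMap<String, DomainStats>, domain: &str, failed: bool) {
    // Note: `to_owned` allocates a key on every call in this simple sketch;
    // a production implementation might avoid that allocation.
    let entry = stats.entry(domain.to_owned()).or_default();
    entry.requests += 1;
    if failed {
        entry.errors += 1;
    }
}

fn main() {
    let mut stats = HashMap::new();
    record(&mut stats, "example.com", false);
    record(&mut stats, "example.com", true);
    println!("{:?}", stats.get("example.com"));
}
```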
Allocator: jemalloc
For long-running crawls (minutes or longer), replacing the system allocator with jemalloc can improve throughput by reducing allocator fragmentation and contention under concurrent workloads.
// main.rs
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
Note
jemalloc pre-allocates arena space, so peak RSS will appear higher than with the system allocator. This is expected and is not a memory leak. The benefit shows up as reduced fragmentation and better multi-threaded allocation throughput over time.
Concurrency Tuning
The right concurrency value depends on your target site's capacity:
| Scenario | Recommended |
|---|---|
| Polite crawl (public site) | 8–16 |
| Internal / scraping-allowed site | 32–64 |
| Local mock / benchmarking | 64–128 |
Use the manually dispatched Benchmark workflow in scale mode to measure Kumo at concurrency 1, 4, 8, 16, 32, and 64. Scale mode uses 64 independent pagination chains so the frontier contains enough runnable work to exercise concurrency. It runs each level three times and reports median throughput, elapsed time, peak RSS, fetch/parse time, and peak requests in flight.
Validate Large Crawls
Use the manual benchmark workflow before making production-scale claims:
| Mode | Pages | Items |
|---|---|---|
| soak | 500 | 10,000 |
| large | 5,000 | 100,000 |
These workloads use 100 independent pagination chains and validate the engine, frontier, extraction, and JSONL store together. A run fails on an incorrect page count, incorrect item count, duplicate item, malformed JSONL output, or unsuccessful crawler process.
The report includes peak RSS and RSS per 1,000 items. It also compares throughput for the first 10,000 items with the remaining crawl. Do not treat one shared-runner result as a stable memory limit. Establish three consecutive successful 100k runs before publishing a large-crawl claim, then use those runs to define a bounded RSS envelope.
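The two memory figures discussed above are simple to compute by hand when checking a report. The sketch below assumes hypothetical helper names (`rss_per_thousand_items`, `rss_envelope`) and made-up RSS values; it only illustrates the arithmetic and the three-run requirement.

```rust
/// Peak RSS normalized per 1,000 items; `None` when no items were produced.
fn rss_per_thousand_items(peak_rss_bytes: u64, items: u64) -> Option<f64> {
    if items == 0 {
        return None;
    }
    Some(peak_rss_bytes as f64 * 1000.0 / items as f64)
}

/// Returns the (min, max) peak RSS across runs, but only once enough
/// successful runs exist to call it an envelope.
fn rss_envelope(peaks: &[u64], required_runs: usize) -> Option<(u64, u64)> {
    if peaks.len() < required_runs {
        return None;
    }
    let min = *peaks.iter().min()?;
    let max = *peaks.iter().max()?;
    Some((min, max))
}

fn main() {
    // Hypothetical peak RSS values (bytes) from three successful 100k-item runs.
    let peaks = [198_000_000_u64, 204_000_000, 201_500_000];
    for peak in peaks {
        let per_k = rss_per_thousand_items(peak, 100_000).expect("items > 0");
        println!("peak {peak} B, {per_k:.0} B per 1,000 items");
    }
    println!("envelope: {:?}", rss_envelope(&peaks, 3));
}
```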
Large Pending Queues
MemoryFrontier uses a binary heap for request priority scheduling. Push and pop operations are O(log n), including when many requests are waiting at once. Equal-priority requests still preserve FIFO order.
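One common way to get FIFO order among equal priorities from a binary heap is to pair each priority with a monotonically increasing sequence number. The sketch below shows that technique with `std::collections::BinaryHeap`; it is not MemoryFrontier's actual code, and the `Frontier` type and the "higher value dispatches first" convention are assumptions made for the example.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Toy priority queue: higher priority first, FIFO within a priority.
struct Frontier {
    // Tuples compare field by field, so the sequence number breaks ties.
    heap: BinaryHeap<(i32, Reverse<u64>, String)>,
    next_seq: u64,
}

impl Frontier {
    fn new() -> Self {
        Frontier { heap: BinaryHeap::new(), next_seq: 0 }
    }

    /// O(log n) push. `Reverse` makes earlier sequence numbers compare as
    /// larger, so the max-heap pops them first among equal priorities.
    fn push(&mut self, priority: i32, url: &str) {
        self.heap.push((priority, Reverse(self.next_seq), url.to_owned()));
        self.next_seq += 1;
    }

    /// O(log n) pop of the highest-priority, oldest request.
    fn pop(&mut self) -> Option<String> {
        self.heap.pop().map(|(_, _, url)| url)
    }
}

fn main() {
    let mut frontier = Frontier::new();
    frontier.push(0, "https://example.com/a");
    frontier.push(5, "https://example.com/b");
    frontier.push(0, "https://example.com/c");
    while let Some(url) = frontier.pop() {
        println!("{url}");
    }
}
```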
Use the manual Benchmark workflow's frontier mode to measure isolated 100-request and 10,000-request push/drain batches. This benchmark bypasses deduplication so Bloom-filter false positives cannot affect the measured queue size.
Use scheduler mode to measure complete push, dispatch, and finish lifecycles at the same queue sizes.
Validate Retry Resilience
The manual Benchmark workflow's realistic mode exercises Kumo against a deterministic server with variable 20-120 ms latency, 1-128 KiB response payloads, and first-attempt HTTP 429/503 failures.
The workload contains 200 pages and 4,000 unique items over 20 independent pagination chains. Kumo must recover exactly 24 transient failures with no exhausted retries or final crawl errors. The report cross-checks crawler output with server-side request and status counters, so a green result proves both correct retry behavior and complete extraction.
Use this mode for production-behavior regressions. Use the nginx local mode for raw framework-overhead comparisons. Shared GitHub runner timing remains informational; correctness and retry counters are deterministic release gates.
Use realistic-compare when comparing Kumo with Scrapy and Colly. The server state is reset before every framework. The default three-run schedule rotates a seeded framework permutation so every framework runs once in each position. The report is rejected unless every framework in every run independently satisfies the same item, page, duplicate, retry, error, and HTTP status counters. Public comparisons should use the reported medians and ranges rather than one shared-runner sample.
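The property that every framework runs once in each position can be obtained by cyclically rotating a starting order, one step per run. The sketch below illustrates that idea only: `rotation_schedule` is a hypothetical helper, the starting order is passed in directly, and the seeding the workflow uses to pick that order is not modeled.

```rust
/// Builds one run order per framework by rotating the starting order left.
/// With n frameworks and n runs, each framework occupies each position once.
fn rotation_schedule<'a>(order: &[&'a str]) -> Vec<Vec<&'a str>> {
    (0..order.len())
        .map(|run| {
            let mut row = order.to_vec();
            row.rotate_left(run);
            row
        })
        .collect()
}

fn main() {
    // Assumed starting permutation; the real one comes from a seed.
    let schedule = rotation_schedule(&["kumo", "scrapy", "colly"]);
    for (run, row) in schedule.iter().enumerate() {
        println!("run {}: {}", run + 1, row.join(" -> "));
    }
}
```

Rotating positions spreads warm-up and runner-state effects across frameworks instead of always penalizing or favoring the one that runs first.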
Connection Pool
Kumo automatically sets pool_max_idle_per_host to match the crawl's concurrency level, keeping connections warm across the full request window. Each URL selected by ProxyRotator receives its own cached client, cookie jar, and connection pool. Proxy clients inherit Kumo's concurrency, request timeout, User-Agent, and TCP keepalive settings.
You can tune the default reqwest::Client further via .http_client_builder():
CrawlEngine::builder()
    .concurrency(32)
    .http_client_builder(|b| {
        b.pool_max_idle_per_host(32)
            .tcp_keepalive(std::time::Duration::from_secs(60))
    })
    .run(MySpider)
    .await?;
The callback is applied once to the default reqwest client. It is not replayed for dynamically created proxy clients and does not configure the wreq-backed stealth client.
Request Timeout
Hanging connections can stall the crawl engine. Set a per-request timeout to bound worst-case latency.
TLS and HTTP/2
Kumo uses rustls (pure-Rust TLS) and HTTP/2 by default. No additional configuration is needed: sites that support HTTP/2 automatically benefit from request multiplexing over fewer connections.
Disable robots.txt for Internal Crawls
By default, Kumo fetches robots.txt for every new domain, which costs one extra HTTP round trip per domain. For internal or authorized targets where you control the server, you can disable it.
Bloom Filter Sizing
Kumo uses a Bloom filter for URL deduplication. The default is sized for 1 million unique URLs. For small crawls, reduce it to save memory; for very large crawls, increase it to reduce false-positive skips:
// Small crawl: save ~1 MB of memory
CrawlEngine::builder()
    .max_urls(10_000)
    .run(MySpider)
    .await?;
// Large crawl: 10M URLs with a low false-positive rate
CrawlEngine::builder()
    .max_urls(10_000_000)
    .run(MySpider)
    .await?;
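To see why the expected URL count drives memory use, the textbook Bloom filter sizing formulas can be evaluated directly: for n items and target false-positive rate p, the optimal bit count is m = -n ln(p) / (ln 2)^2 and the optimal hash count is k = (m / n) ln 2. The sketch below applies them with a 1% false-positive rate, which is an assumption for illustration; the rate Kumo actually targets is not stated here, and `bloom_bits` and `bloom_hashes` are hypothetical helpers.

```rust
use std::f64::consts::LN_2;

/// Optimal number of bits for `expected_items` at `false_positive_rate`,
/// using m = -n * ln(p) / (ln 2)^2.
fn bloom_bits(expected_items: u64, false_positive_rate: f64) -> u64 {
    let n = expected_items as f64;
    (-(n * false_positive_rate.ln()) / (LN_2 * LN_2)).ceil() as u64
}

/// Optimal number of hash functions, k = (m / n) * ln 2, at least 1.
fn bloom_hashes(bits: u64, expected_items: u64) -> u32 {
    ((bits as f64 / expected_items as f64) * LN_2).round().max(1.0) as u32
}

fn main() {
    // Assumed 1% false-positive target; not a documented Kumo setting.
    let p = 0.01;
    for n in [10_000_u64, 1_000_000, 10_000_000] {
        let bits = bloom_bits(n, p);
        let hashes = bloom_hashes(bits, n);
        println!("{n} URLs: {} KiB, {hashes} hash functions", bits / 8 / 1024);
    }
}
```

Under that assumed rate, filter size grows linearly with the expected URL count, so an undersized filter trades memory for a higher chance of skipping URLs that were never visited.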
Store Choice
JsonlStore is the fastest store because it is append-only and never blocks on index lookups or transactions. For maximum throughput, write to JSONL and bulk-load into a database afterwards:
// Fast: append-only writes
CrawlEngine::builder()
    .store(JsonlStore::new("items.jsonl")?)
    .run(MySpider)
    .await?;
If you need a database store, prefer SqliteStore for single-process crawls and PostgresStore for distributed ones. In a high-concurrency crawl, avoid letting a database store become the primary bottleneck.
Store Backpressure
Direct store writes are simplest and remain the default. If item output is slower than fetching and parsing, direct writes make request tasks wait inside the store. That is correct, but it can hide whether the bottleneck is parsing, the store queue, or the backend itself.
Use .store_buffer(queue_capacity, batch_size) when you want a bounded producer/consumer boundary between request tasks and item persistence:
CrawlEngine::builder()
    .concurrency(64)
    .store(JsonlStore::new("items.jsonl")?)
    .store_buffer(10_000, 250)
    .run(MySpider)
    .await?;
Tune queue_capacity for the amount of burst you are willing to hold in memory. Tune batch_size for the downstream writer. Append-oriented stores such as JsonlStore, JsonStore, CsvStore, and StdoutStore implement batched store_many() paths. SQL stores use one transaction and one multi-row insert statement per flushed batch, which reduces transaction overhead and database round trips. Other custom stores keep working through the default per-item implementation.
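The interaction between queue capacity and batch size can be modeled with a bounded channel from the standard library. This is a conceptual sketch rather than Kumo's store buffer: `run_buffered` is a hypothetical function, a plain thread stands in for the background writer, and recording the batch length stands in for a batched store write.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Sends `items` through a bounded queue to a writer thread that flushes
/// in batches. Returns the size of each flushed batch.
fn run_buffered(items: Vec<String>, queue_capacity: usize, batch_size: usize) -> Vec<usize> {
    assert!(batch_size > 0, "batch_size must be positive");
    let (tx, rx) = sync_channel::<String>(queue_capacity);

    let writer = thread::spawn(move || {
        let mut flushed = Vec::new();
        let mut batch = Vec::with_capacity(batch_size);
        // The loop ends when every sender has been dropped.
        for item in rx {
            batch.push(item);
            if batch.len() == batch_size {
                flushed.push(batch.len()); // stand-in for a batched write
                batch.clear();
            }
        }
        // Flush the final partial batch.
        if !batch.is_empty() {
            flushed.push(batch.len());
        }
        flushed
    });

    for item in items {
        // Blocks while the queue is full: this is the backpressure point.
        tx.send(item).expect("writer thread is alive");
    }
    drop(tx); // signal end of input

    writer.join().expect("writer thread panicked")
}

fn main() {
    let items: Vec<String> = (0..10).map(|i| format!("item-{i}")).collect();
    println!("{:?}", run_buffered(items, 4, 3));
}
```

In this model, a larger queue absorbs longer bursts before producers block, while a larger batch reduces the number of flushes the writer performs.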
Watch report.store.queue_full_waits and report.store.queue_wait. If they grow, the store writer is the limiting stage. Reduce crawl concurrency, increase batch size, switch to JSONL and bulk-load later, or improve the downstream store.
Use report.store.average_queue_wait_per_item(), report.store.average_write_per_batch(), report.store.queue_wait_max, and report.store.write_max to distinguish steady pressure from short spikes. A high average queue wait means most request tasks are backpressured by item persistence. A high max with low averages usually points to bursty downstream latency.
Buffered writes default to StoreFailurePolicy::Abort. If the background writer sees a store error, Kumo records failed_writes and failed_batches, stops the writer, and returns a store error instead of silently dropping later items. Continue/drop behavior is intentionally not exposed yet because it needs durable retry or explicit loss accounting before it is safe for production crawls.
Don't Stack AutoThrottle and RateLimiter
AutoThrottle and RateLimiter both add delays. When both are enabled, each applies its own delay independently, so the delays stack and throughput drops significantly. Pick one:
- Use RateLimiter when you want a fixed maximum request rate.
- Use AutoThrottle when you want the engine to adapt automatically based on server response times.
// ✅ Pick one
CrawlEngine::builder()
    .middleware(AutoThrottle::new()) // OR RateLimiter, not both
    .run(MySpider)
    .await?;
Stream Buffer Tuning
When using CrawlEngine::stream(), the default channel buffer is 100 items. If your consumer is slow (for example, writing to a database row by row), the buffer fills up and backpressure stalls the crawl. Increase the buffer size to decouple producer and consumer.
Scheduler Politeness
PolitenessPolicy enforces per-domain concurrency and delay before requests are dispatched. Use it instead of sleeping inside spiders or middleware:
CrawlEngine::builder()
    .concurrency(64)
    .politeness(
        PolitenessPolicy::new()
            .per_domain_concurrency(4)
            .per_domain_delay(std::time::Duration::from_millis(250)),
    )
    .run(MySpider)
    .await?;
High global concurrency is useful only when the crawl spans enough domains or the target allows that load. For single-domain public crawls, the per-domain limits are usually the real throughput cap.
HTTP Cache for Development
Use .http_cache() during spider development to avoid re-fetching pages on every run. Cached responses are served from disk instantly, making iteration fast. Remove it before deploying to production:
CrawlEngine::builder()
    .http_cache("./dev-cache")?
    .cache_ttl(std::time::Duration::from_secs(3600)) // optional: expire after 1h
    .run(MySpider)
    .await?;
Depth and Domain Filtering
Without limits, a spider following <a> tags can crawl the entire internet. Always set allowed_domains() and consider max_depth() on your spider to keep crawls focused.