Production Reports
A successful Kumo crawl returns CrawlStats. Convert it into CrawlReport when a production job needs a stable summary for logs, dashboards, alerting, or files:
let stats = CrawlEngine::builder()
    .run(MySpider)
    .await?;
let report = CrawlReport::from(stats);
std::fs::write("crawl-report.json", report.to_json_string_pretty())?;
Reports keep the raw counters and derived health signals together. The raw counters tell you what happened; the derived helpers make common alerts easy to write without duplicating rate math in every crawler.
For long-running crawls, enable checkpointed reports so Kumo periodically writes the current report snapshot and overwrites it again when the crawl finishes:
let stats = CrawlEngine::builder()
    .stats_checkpoint("crawl-report.checkpoint.json")
    .run(MySpider)
    .await?;
Use a custom interval when the default 30-second cadence is too frequent or too infrequent for the job:
let stats = CrawlEngine::builder()
    .stats_checkpoint_interval(
        "crawl-report.checkpoint.json",
        std::time::Duration::from_secs(300),
    )
    .run(MySpider)
    .await?;
The checkpoint file uses the same JSON shape as CrawlReport::to_json_value(). For run_all(), Kumo writes a JSON array with one report per spider.
Health Signals
| Field or helper | Use it for |
|---|---|
| pages_crawled | Completed response count |
| items_scraped | Extracted item volume |
| errors | Permanent request, parse, store, or task failures |
| error_kinds | Grouping failures by stable KumoErrorKind labels |
| scheduled | Requests accepted by the scheduler |
| deduped | Requests dropped by fingerprint deduplication |
| retries | Retry attempts scheduled by retry policy or spider error policy |
| retry_exhausted | Requests that still failed after retry capacity was used |
| retry_summary() | Retry attempts, exhausted retries, retry pressure, exhaustion, and exhausted-failure rates |
| browser_fallbacks | Requests retried through the browser fallback path |
| browser_fallback_successes | Browser fallback attempts that returned a rendered response |
| browser_fallback_failures | Browser fallback attempts that failed and kept the original HTTP response |
| robots_blocked | Requests blocked by robots.txt |
| stop_reason | Why the crawl stopped |
| pages_per_second() | Crawl throughput |
| items_per_second() | Extraction throughput |
| bytes_per_second() | Download throughput |
| error_rate() | Permanent failures divided by completed and failed requests |
| success_rate() | Completed requests divided by completed and failed requests |
| retry_exhaustion_rate() | Retry-exhausted requests divided by retry attempts |
| timings | Cumulative successful-request phase timings for bottleneck diagnosis |
| store | Store-buffer queue/write counters when store_buffer() is enabled |
The JSON export uses the same names in snake_case, including derived fields such as pages_per_second, error_rate, retry_exhaustion_rate, retry_summary, timings, and store.
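The rate helpers in the table above reduce to simple ratios over the raw counters. The following is a minimal sketch of that arithmetic, not Kumo's implementation: the `Counters` struct, its field names, the zero-denominator behavior, and the definition of pages per second as completed requests over elapsed time are all assumptions made for illustration.

```rust
use std::time::Duration;

/// Hypothetical stand-in for the counters behind the derived helpers.
/// This is not Kumo's `CrawlReport`; field names follow the table's wording.
#[derive(Debug, Clone, Copy)]
struct Counters {
    completed: u64,    // completed requests
    failed: u64,       // permanently failed requests
    elapsed: Duration, // wall-clock crawl duration
}

/// Divide two counters as floats. Returning 0.0 for an empty denominator
/// is an assumption made for this sketch.
fn ratio(numerator: u64, denominator: u64) -> f64 {
    if denominator == 0 {
        0.0
    } else {
        numerator as f64 / denominator as f64
    }
}

impl Counters {
    /// Permanent failures divided by completed and failed requests.
    fn error_rate(&self) -> f64 {
        ratio(self.failed, self.completed + self.failed)
    }

    /// Completed requests divided by completed and failed requests.
    fn success_rate(&self) -> f64 {
        ratio(self.completed, self.completed + self.failed)
    }

    /// Completed requests per second of wall-clock time (assumed definition).
    fn pages_per_second(&self) -> f64 {
        let secs = self.elapsed.as_secs_f64();
        if secs == 0.0 {
            0.0
        } else {
            self.completed as f64 / secs
        }
    }
}

fn main() {
    let c = Counters {
        completed: 900,
        failed: 100,
        elapsed: Duration::from_secs(60),
    };
    println!("error_rate = {:.3}", c.error_rate());
    println!("success_rate = {:.3}", c.success_rate());
    println!("pages_per_second = {:.1}", c.pages_per_second());
}
```

Because both rates share one denominator, they sum to 1 whenever at least one request completed or failed, so an alert needs to watch only one of them.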
When browser_fallback(...) is enabled, the browser fallback counters help show how much of the crawl required JavaScript rendering. A high fallback rate usually means browser rendering is part of the target's normal path and crawl concurrency should be sized for browser cost, not HTTP cost.
Store Summary
CrawlReport::store is always present. When store_buffer() is not enabled, its counters are zero and buffered is false. When buffering is enabled, it shows how much item output moved through the bounded writer:
| Field | Meaning |
|---|---|
| buffered | Whether the bounded store buffer was enabled |
| queued | Items accepted into the store queue |
| written | Items written by the background writer |
| batches | Non-empty batch write calls |
| failed_writes | Items in batches that returned a store error |
| failed_batches | Non-empty batch write calls that returned a store error |
| queue_full_waits | Times request tasks observed a full store queue |
| queue_wait | Total time request tasks spent waiting for queue capacity |
| queue_wait_max | Longest queue wait for one accepted item |
| average_queue_wait_per_item() | Average queue wait per accepted item |
| write | Total time the writer spent inside store write attempts |
| write_max | Longest single batch write attempt |
| average_write_per_batch() | Average write latency per batch attempt |
| average_write_per_item() | Average write latency per written or failed item |
If queue_full_waits or queue_wait is high, the item store is backpressuring the crawl. That is often better than unbounded memory growth, but it means throughput is now limited by item persistence.
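To make "high" concrete, it can help to normalize the queue counters by the number of accepted items. The sketch below assumes a hypothetical `StoreCounters` struct that mirrors three fields from the table; it is not Kumo's type, and `full_wait_ratio` is an illustrative indicator rather than a helper the library is known to provide.

```rust
use std::time::Duration;

/// Hypothetical subset of the store summary counters; not Kumo's type.
struct StoreCounters {
    queued: u64,           // items accepted into the store queue
    queue_full_waits: u64, // times a request task observed a full queue
    queue_wait: Duration,  // total time spent waiting for queue capacity
}

impl StoreCounters {
    /// Total queue wait divided by accepted items (zero when nothing was queued).
    fn average_queue_wait_per_item(&self) -> Duration {
        if self.queued == 0 {
            Duration::ZERO
        } else {
            Duration::from_secs_f64(self.queue_wait.as_secs_f64() / self.queued as f64)
        }
    }

    /// Full-queue observations per accepted item. This is an illustrative
    /// indicator; pick the alert threshold from your own baseline.
    fn full_wait_ratio(&self) -> f64 {
        if self.queued == 0 {
            0.0
        } else {
            self.queue_full_waits as f64 / self.queued as f64
        }
    }
}

fn main() {
    let store = StoreCounters {
        queued: 2_000,
        queue_full_waits: 500,
        queue_wait: Duration::from_secs(100),
    };
    println!("average wait per item: {:?}", store.average_queue_wait_per_item());
    println!("full-wait ratio: {:.2}", store.full_wait_ratio());
}
```

In this example a quarter of the accepted items met a full queue and each item waited about 50 ms on average, which would suggest that persistence, not fetching, is setting the crawl's pace.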
The JSON export includes millisecond and second fields for totals, averages, and maxes:
| JSON field | Meaning |
|---|---|
| queue_wait_ms, queue_wait_secs | Total enqueue wait |
| queue_wait_avg_ms, queue_wait_avg_secs | Average enqueue wait per item |
| queue_wait_max_ms, queue_wait_max_secs | Maximum enqueue wait for one item |
| write_ms, write_secs | Total backend write-attempt time |
| write_avg_batch_ms, write_avg_batch_secs | Average backend write time per batch attempt |
| write_avg_item_ms, write_avg_item_secs | Average backend write time per written or failed item |
| write_max_ms, write_max_secs | Maximum backend write time for one batch attempt |
Buffered stores use StoreFailurePolicy::Abort by default. On the first store error, Kumo stops the buffered writer and run() returns that error instead of continuing with possible item loss. The final flush happens before buffered counters are copied into CrawlStats, so a failed flush produces no final CrawlReport; failed_writes and failed_batches cannot be used to diagnose that failed run from a final report. Capture and log the returned error. An error observed while writing a background batch is also logged as store.buffer_error; if the error first surfaces during final flush, use the returned error and the backend's logs or telemetry.
Retry Summary
CrawlReport::retry_summary() groups retry health into one small report:
| Field | Meaning |
|---|---|
| attempts | Total retry attempts that were scheduled |
| exhausted | Requests that still failed after retry capacity was used |
| pressure_rate | Retry attempts divided by scheduled requests |
| exhaustion_rate | Exhausted retries divided by retry attempts |
| exhausted_failure_rate | Exhausted retries divided by permanent errors |
Use the fields together:
- high pressure_rate means the target or network is making many requests retry, even if the crawl eventually recovers;
- high exhaustion_rate means retries are often not helping;
- high exhausted_failure_rate means permanent failures are mostly retry exhaustion, which often points to rate limits, blocking, or sustained upstream instability.
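The three rates are plain ratios over counters already in the report. As a rough sketch of the definitions in the table, assuming a hypothetical `RetrySummary` struct and `summarize` function (neither is Kumo's code, and the zero-denominator handling is a guess):

```rust
/// Hypothetical mirror of the retry summary fields; not Kumo's type.
#[derive(Debug, Clone, Copy, PartialEq)]
struct RetrySummary {
    attempts: u64,
    exhausted: u64,
    pressure_rate: f64,
    exhaustion_rate: f64,
    exhausted_failure_rate: f64,
}

/// Divide two counters, returning 0.0 when the denominator is zero
/// (an assumption made for this sketch).
fn ratio(numerator: u64, denominator: u64) -> f64 {
    if denominator == 0 {
        0.0
    } else {
        numerator as f64 / denominator as f64
    }
}

/// Build the summary from the raw counters named in the Health Signals table.
fn summarize(scheduled: u64, retries: u64, retry_exhausted: u64, errors: u64) -> RetrySummary {
    RetrySummary {
        attempts: retries,
        exhausted: retry_exhausted,
        // Retry attempts divided by scheduled requests.
        pressure_rate: ratio(retries, scheduled),
        // Exhausted retries divided by retry attempts.
        exhaustion_rate: ratio(retry_exhausted, retries),
        // Exhausted retries divided by permanent errors.
        exhausted_failure_rate: ratio(retry_exhausted, errors),
    }
}

fn main() {
    let summary = summarize(1_000, 200, 50, 80);
    println!("attempts = {}, exhausted = {}", summary.attempts, summary.exhausted);
    println!("pressure_rate = {:.3}", summary.pressure_rate);
    println!("exhaustion_rate = {:.3}", summary.exhaustion_rate);
    println!("exhausted_failure_rate = {:.3}", summary.exhausted_failure_rate);
}
```

With these sample counters, a fifth of scheduled requests needed a retry, a quarter of retry attempts ended in exhaustion, and most permanent failures (50 of 80) came from exhaustion rather than from parse or store errors.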
The JSON report includes the same values under retry_summary, while keeping the top-level retries, retry_exhausted, and retry_exhaustion_rate fields for compatibility.
Timing Breakdown
CrawlReport::timings splits successful request work into broad phases:
| Field | Measures |
|---|---|
| middleware_request | Time spent in Middleware::before_request |
| fetch | Time spent inside the configured fetcher's fetch call |
| middleware_response | Time spent in Middleware::after_response |
| parse | Time spent in the spider parse method |
| pipeline | Time spent in item pipelines |
| store | Time spent writing accepted items to the item store |
These are cumulative task timings, not exclusive wall-clock percentages. In a concurrent crawl, the sum can be higher than duration because many requests run at the same time. Use the largest phase as a direction signal: high fetch usually points to target, network, proxy, browser-fallback, connection-pool, or other fetcher latency; high parse points to selector/extraction work; and high store points to output backpressure. Scheduler eligibility and politeness waiting happen before the request task calls the fetcher, so they are not included in timings.fetch. Diagnose those waits from configured limits, throughput, and scheduler logs rather than fetch timing.
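A small worked example shows why the phase sum can exceed the crawl duration and how to pick out the dominant phase. The helper names and the sample durations below are hypothetical; the sketch works on plain `(name, Duration)` pairs rather than on Kumo's timings type.

```rust
use std::time::Duration;

/// Return the phase with the largest cumulative time, if any.
fn dominant_phase<'a>(phases: &[(&'a str, Duration)]) -> Option<(&'a str, Duration)> {
    phases.iter().copied().max_by_key(|&(_, spent)| spent)
}

/// Sum of all cumulative phase timings.
fn total_time(phases: &[(&str, Duration)]) -> Duration {
    phases.iter().map(|&(_, spent)| spent).sum()
}

/// Fraction of the cumulative total spent in one phase (0.0 for an empty total).
fn share_of_total(phase: Duration, phases: &[(&str, Duration)]) -> f64 {
    let total = total_time(phases);
    if total.is_zero() {
        0.0
    } else {
        phase.as_secs_f64() / total.as_secs_f64()
    }
}

fn main() {
    // Invented sample values for a crawl that ran many requests concurrently.
    let phases = [
        ("middleware_request", Duration::from_secs(2)),
        ("fetch", Duration::from_secs(120)),
        ("middleware_response", Duration::from_secs(2)),
        ("parse", Duration::from_secs(25)),
        ("pipeline", Duration::from_secs(3)),
        ("store", Duration::from_secs(8)),
    ];
    let wall_clock = Duration::from_secs(40);

    // The cumulative total exceeds wall-clock time because tasks overlap.
    println!("cumulative = {:?}, wall clock = {:?}", total_time(&phases), wall_clock);

    if let Some((name, spent)) = dominant_phase(&phases) {
        println!("{name} dominates with {:.0}% of task time", share_of_total(spent, &phases) * 100.0);
    }
}
```

Comparing each phase against the cumulative total, rather than against wall-clock duration, keeps the shares between 0 and 1 regardless of concurrency.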
Failure Diagnosis Checklist
Start with stop_reason, errors, error_kinds, and domains. Those fields tell you whether the crawl stopped because it finished normally, reached a configured budget, was interrupted, or failed because one stage could not make progress.
Use the signals together instead of treating one counter as the full diagnosis:
| Signal | Likely direction |
|---|---|
| High robots_blocked | The target's robots policy excluded many URLs. Check user-agent and scope before disabling robots on authorized/internal crawls. |
| High deduped | The spider is scheduling repeated URLs, or the fingerprint policy is intentionally collapsing variants. Check pagination and tracking parameters. |
| High retry_summary().pressure_rate with low exhaustion | The target or network is noisy, but retries are recovering. Watch latency and politeness. |
| High retry_summary().exhaustion_rate | Retry capacity is being spent without recovery. Check rate limits, blocking, credentials, target errors, or retry status filters. |
| High retry_summary().exhausted_failure_rate | Permanent failures are mostly retry exhaustion rather than parse or store errors. |
| High timings.fetch | Target, network, proxy, browser-fallback, connection-pool, or other fetcher latency may dominate the crawl. This does not measure scheduler or politeness waiting. |
| High timings.parse | Selector work, extraction logic, or item construction may dominate successful requests. |
| High timings.store | Direct item writes are slowing request tasks. Consider a store buffer or a faster output path. |
| High store.queue_full_waits or store.queue_wait | The bounded store writer is backpressuring the crawl. |
| run() returns a store error with buffering enabled | The buffered writer or final flush failed. No final report is returned; diagnose the returned error, store.buffer_error logs when present, and backend logs or telemetry. |
For one unhealthy domain in a multi-domain crawl, inspect report.domains[domain] before changing global settings. A global error rate can look acceptable while one domain has no successful pages.
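The arithmetic behind that masking effect is easy to check by hand. The sketch below uses a hypothetical `DomainCounters` struct and invented sample numbers; only the `completed` and `failed` field names are taken from this page, and the rest is an assumption for illustration.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-domain counters; not Kumo's per-domain stats type.
#[derive(Debug, Clone, Copy)]
struct DomainCounters {
    completed: u64,
    failed: u64,
}

/// Error rate across all domains: total failed over total completed plus failed.
fn global_error_rate(domains: &BTreeMap<&str, DomainCounters>) -> f64 {
    let failed: u64 = domains.values().map(|c| c.failed).sum();
    let completed: u64 = domains.values().map(|c| c.completed).sum();
    let total = failed + completed;
    if total == 0 {
        0.0
    } else {
        failed as f64 / total as f64
    }
}

/// Domains that recorded failures but no completed pages.
fn domains_without_success<'a>(domains: &BTreeMap<&'a str, DomainCounters>) -> Vec<&'a str> {
    domains
        .iter()
        .filter(|(_, c)| c.failed > 0 && c.completed == 0)
        .map(|(name, _)| *name)
        .collect()
}

fn main() {
    let mut domains = BTreeMap::new();
    domains.insert("docs.example", DomainCounters { completed: 950, failed: 10 });
    domains.insert("shop.example", DomainCounters { completed: 0, failed: 40 });

    // 50 failures out of 1000 requests stays under a 10% global threshold...
    println!("global error rate = {:.3}", global_error_rate(&domains));
    // ...while one domain produced nothing at all.
    println!("no successful pages: {:?}", domains_without_success(&domains));
}
```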
Use events or hooks when the final report is not enough. RequestFailed, RequestRetried, RequestSkipped, ItemDropped, and TaskPanicked include the crawl context needed to correlate a final counter with URLs, depths, attempts, and reasons observed during the run.
When a durable frontier is enabled, inspect the frontier state alongside the report. FileFrontier::state().await shows recovered queued and seen counts after opening a frontier. Its persisted files also separate pending requests (queue.json), in-flight leases (leases.json), terminal dead letters (dead_letters.json), and deduplication state (seen.json). For Redis-backed frontiers, inspect the Redis keys derived from the configured queue key. Pending or leased work after an interrupted crawl can explain why a resumed job starts with existing state instead of only the spider's seed URLs.
Alert Examples
Use error_rate() for broad crawl health. A nonzero error rate is normal on the open web, but a sudden increase usually means the target changed, credentials expired, the crawler is being blocked, or a store is failing.
if report.error_rate() > 0.10 {
    tracing::warn!(
        error_rate = report.error_rate(),
        errors = report.errors,
        pages = report.pages_crawled,
        "crawl error rate exceeded threshold"
    );
}
Use retry_summary() when retry attempts are happening. This separates retry pressure from retry exhaustion:
let retry = report.retry_summary();
if retry.attempts > 0 && retry.exhaustion_rate > 0.25 {
    tracing::warn!(
        retry_exhaustion_rate = retry.exhaustion_rate,
        retry_pressure_rate = retry.pressure_rate,
        retry_exhausted_failure_rate = retry.exhausted_failure_rate,
        retries = retry.attempts,
        retry_exhausted = retry.exhausted,
        "retry exhaustion exceeded threshold"
    );
}
Use pages_per_second() and items_per_second() for throughput alerts. Compare them against your own historical baseline instead of a universal threshold.
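One way to express a baseline comparison is to take the median of recent runs and alert when the current value falls more than a chosen fraction below it. This is a sketch under those assumptions: `median`, `below_baseline`, the 30% tolerance, and the sample history are all hypothetical and not part of Kumo.

```rust
/// Median of historical throughput samples, or None when there is no history.
fn median(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    })
}

/// True when `current` is more than `max_drop` (a fraction, e.g. 0.3 for 30%)
/// below `baseline`. A non-positive baseline never triggers.
fn below_baseline(current: f64, baseline: f64, max_drop: f64) -> bool {
    baseline > 0.0 && current < baseline * (1.0 - max_drop)
}

fn main() {
    // Pages per second from previous runs of the same job (invented values).
    let history = [12.0, 9.0, 10.0, 11.0, 10.0];
    // In a real job this would come from report.pages_per_second().
    let current = 6.0;

    if let Some(baseline) = median(&history) {
        if below_baseline(current, baseline, 0.3) {
            println!("throughput {current} is more than 30% below baseline {baseline}");
        }
    }
}
```

The median is used here because a single unusually fast or slow run shifts it less than it would shift a mean.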
Domain Breakdowns
domains contains per-domain scheduler and failure counters. Use it when one domain is unhealthy but the whole crawl still looks fine:
for (domain, stats) in &report.domains {
    if stats.failed > 0 && stats.completed == 0 {
        tracing::warn!(
            domain,
            failed = stats.failed,
            error_kinds = ?stats.error_kinds,
            "domain had failures and no completed pages"
        );
    }
}
Per-domain reports are especially useful for multi-domain crawls where one target can be blocked, slow, or unavailable without stopping the entire job.
Operational Pattern
For scheduled production crawls, write the report next to your scraped output and send the same values to logs or metrics:
let report = CrawlReport::from(stats);
let retry = report.retry_summary();
tracing::info!(
    pages = report.pages_crawled,
    items = report.items_scraped,
    errors = report.errors,
    error_rate = report.error_rate(),
    pages_per_second = report.pages_per_second(),
    retry_pressure_rate = retry.pressure_rate,
    retry_exhaustion_rate = retry.exhaustion_rate,
    retry_exhausted_failure_rate = retry.exhausted_failure_rate,
    store_buffered = report.store.buffered,
    store_queued = report.store.queued,
    store_written = report.store.written,
    store_failed_writes = report.store.failed_writes,
    store_queue_full_waits = report.store.queue_full_waits,
    store_queue_wait_avg_secs = report.store.average_queue_wait_per_item().as_secs_f64(),
    store_write_avg_batch_secs = report.store.average_write_per_batch().as_secs_f64(),
    fetch_secs = report.timings.fetch.as_secs_f64(),
    parse_secs = report.timings.parse.as_secs_f64(),
    store_secs = report.timings.store.as_secs_f64(),
    stop_reason = report.stop_reason.map(StopReason::as_str),
    "crawl report"
);
std::fs::write("crawl-report.json", report.to_json_string_pretty())?;
Keep the JSON report as the durable audit record. Use structured logs for live operations and OpenTelemetry when you need centralized metrics or traces.
With the otel feature enabled and kumo::otel::init() active, Kumo records the same production report counters as OTLP metrics at crawl completion. It also records successful request fetch latency as a histogram during the crawl. See OpenTelemetry for metric names and attributes.