HTTP Cache

kumo can cache HTTP responses on disk and skip the network request when the same URL is fetched again. This is useful during development, where it avoids hammering sites while you iterate on parse() logic.

Usage

CrawlEngine::builder()
    .http_cache("./cache")          // cache responses in the ./cache directory
    .run(MySpider)
    .await?;

Responses are stored by URL hash. On subsequent runs, a request for a cached URL is served from disk without touching the network.
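To make "stored by URL hash" concrete, here is a minimal sketch of how a URL could be mapped to a file inside the cache directory. It is an illustration only: the helper name cache_path, the use of DefaultHasher, and the one-file-per-URL layout are assumptions, not kumo's documented on-disk format.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::{Path, PathBuf};

/// Hypothetical helper (not part of kumo's API): derive a cache file path
/// from a URL by hashing the URL string.
///
/// DefaultHasher keeps the sketch dependency-free, but its algorithm is not
/// guaranteed to stay the same across Rust releases. A real on-disk cache
/// would more likely use a stable digest such as SHA-256.
fn cache_path(cache_dir: &Path, url: &str) -> PathBuf {
    let mut hasher = DefaultHasher::new();
    url.hash(&mut hasher);
    // Format the 64-bit hash as 16 zero-padded hex digits and use it as the file name.
    cache_dir.join(format!("{:016x}", hasher.finish()))
}
```

Calling cache_path(Path::new("./cache"), url) twice with the same URL yields the same path, which is what lets a later run find the stored response. Note that in this sketch two URLs differing only in their query string or fragment hash to different entries.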

TTL

Set a maximum cache age:

use std::time::Duration;

CrawlEngine::builder()
    .http_cache("./cache")
    .cache_ttl(Duration::from_secs(60 * 60))   // expire entries after 1 hour
    .run(MySpider)
    .await?;

Expired entries are refetched and the cache is updated.
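The expiry rule can be sketched as a comparison between an entry's age and the TTL. This is an assumption-laden illustration: is_fresh is a hypothetical helper, and whether kumo treats an entry whose age exactly equals the TTL as fresh, or how it records the storage time, is not stated here.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical helper (not part of kumo's API): report whether an entry
/// stored at `stored_at` is still within `ttl` at time `now`.
fn is_fresh(stored_at: SystemTime, now: SystemTime, ttl: Duration) -> bool {
    match now.duration_since(stored_at) {
        // Entry age is known: fresh while the age does not exceed the TTL.
        Ok(age) => age <= ttl,
        // `stored_at` is later than `now` (for example, the clock moved
        // backwards): treat the entry as stale so it gets refetched.
        Err(_) => false,
    }
}
```

With a one-hour TTL, an entry stored 60 seconds ago is fresh and one stored 3601 seconds ago is not. Treating clock anomalies as stale costs one extra request but avoids serving an entry of unknown age.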

When to Use

  • Development — iterate on selectors without network requests
  • Re-processing — re-run parse() logic on already-fetched pages
  • Rate-limited targets — reduce the number of live requests

Warning

Do not use the HTTP cache in production crawls that need fresh data: a cached response is returned as it was when first fetched. Cached responses also bypass your crawl delay and auto-throttle.