HTTP Cache

kumo can cache HTTP responses on disk and skip the network request when the same URL is fetched again. This is useful during development, where it avoids hammering sites while you iterate on parse() logic.

Usage

CrawlEngine::builder()
    .http_cache("./cache")          // cache responses in ./cache directory
    .run(MySpider)
    .await?;

Responses are stored under a hash of the URL. On subsequent runs, a cached response is read from disk instead of being fetched over the network.
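To make "stored by URL hash" concrete, here is a minimal sketch of how a URL could be mapped to a file inside the cache directory. It is an illustration only: the helper name cache_path, the .cache extension, and the use of the standard library's DefaultHasher are assumptions, and kumo's actual hash function and file layout may differ.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::{Path, PathBuf};

/// Hypothetical helper (not part of kumo's API): derive a cache file path
/// from a URL by hashing the URL string.
fn cache_path(cache_dir: &Path, url: &str) -> PathBuf {
    let mut hasher = DefaultHasher::new();
    url.hash(&mut hasher);
    // Format the 64-bit hash as 16 hex digits so every file name has the
    // same length and contains only filesystem-safe characters.
    cache_dir.join(format!("{:016x}.cache", hasher.finish()))
}
```

The same URL always maps to the same path, which is what lets a later run find the earlier response. DefaultHasher is not guaranteed to be stable across Rust releases, so a real on-disk cache would more likely use a fixed algorithm such as SHA-256.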

Only GET responses are cached. Requests with bodies or non-GET methods bypass the cache so one request variant cannot be accidentally replayed for another. Binary responses also bypass cache writes.
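The bypass rules above can be sketched as two small predicates. This is a simplified illustration, not kumo's implementation: the Method enum and both function names are hypothetical, and because the text does not say how kumo detects a binary response, the sketch uses a UTF-8 validity check as a stand-in.

```rust
/// Hypothetical, simplified request method (not kumo's type).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Method {
    Get,
    Head,
    Post,
}

/// A request may use the cache only if it is a GET without a body.
fn request_uses_cache(method: Method, body: Option<&[u8]>) -> bool {
    method == Method::Get && body.is_none()
}

/// Assumption: treat a response as "binary" when its body is not valid UTF-8.
/// Binary responses are returned to the caller but not written to the cache.
fn response_is_storable(body: &[u8]) -> bool {
    std::str::from_utf8(body).is_ok()
}
```

Keeping the two checks separate mirrors the text: the first decides whether the cache is consulted at all, the second only whether a fresh response is written back.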

When browser_fallback(...) is enabled, the cache wraps the full fetcher. If a URL falls back to the browser and returns rendered HTML, that rendered response is what later cache hits replay. Disable the cache for production crawls that must re-evaluate whether a page still needs browser rendering.

TTL

Set a maximum cache age:

use std::time::Duration;

CrawlEngine::builder()
    .http_cache("./cache")
    .cache_ttl(Duration::from_secs(60 * 60))   // expire entries after 1 hour
    .run(MySpider)
    .await?;

Expired entries are refetched and the cache is updated.
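The expiry decision can be pictured as a comparison between an entry's age and the configured TTL. The sketch below is an assumption about the logic, not kumo's code: is_expired is a hypothetical name, and the behavior with no TTL (entries never expire) and with a timestamp in the future (treated as fresh) are illustrative choices.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical helper: decide whether a cache entry written at `stored_at`
/// should be refetched at time `now`.
fn is_expired(stored_at: SystemTime, now: SystemTime, ttl: Option<Duration>) -> bool {
    match ttl {
        // Assumption: with no TTL configured, entries never expire.
        None => false,
        Some(ttl) => match now.duration_since(stored_at) {
            // The entry is expired once its age exceeds the TTL.
            Ok(age) => age > ttl,
            // `stored_at` is later than `now` (clock skew): treat as fresh.
            Err(_) => false,
        },
    }
}
```

With a one-hour TTL, an entry written 60 seconds ago is served from the cache, while one written two hours ago is refetched and overwritten.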

When to Use

Development — iterate on selectors without network requests
Re-processing — re-run parse() logic on already-fetched pages
Rate-limited targets — reduce the number of live requests

Warning

Do not use the HTTP cache in production crawls that need fresh data: cached responses can be stale, and cache hits are served without going through your crawl delay or auto-throttle.