HTTP Cache
kumo can cache HTTP responses on disk and skip the network request when the same URL is fetched again. This is useful during development, where it avoids hammering sites while you iterate on parse() logic.
Usage
CrawlEngine::builder()
    .http_cache("./cache") // cache responses in the ./cache directory
    .run(MySpider)
    .await?;
Responses are stored by URL hash. On subsequent runs, cached responses are served from disk without a network request.
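As a rough illustration of storing by URL hash, the sketch below maps a URL to a file name inside the cache directory. The helper name `cache_path`, the hex file-name layout, and the use of the standard library's `DefaultHasher` are all assumptions made for the example; this page does not say which hash function or directory layout kumo uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::{Path, PathBuf};

/// Hypothetical helper: map a URL to a file path inside the cache directory.
/// kumo's real hash function and on-disk layout may differ.
fn cache_path(cache_dir: &Path, url: &str) -> PathBuf {
    let mut hasher = DefaultHasher::new();
    url.hash(&mut hasher);
    // Hex-encode the 64-bit hash so the file name is fixed-length and
    // contains no characters that are awkward in a path.
    cache_dir.join(format!("{:016x}", hasher.finish()))
}
```

The same URL always maps to the same path within one build, which is what lets a later run find the earlier response. `DefaultHasher` output is not guaranteed to be stable across Rust releases, so a real on-disk cache would more likely use a fixed, documented hash.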
Only GET responses are cached. Requests with bodies or non-GET methods bypass the cache so one request variant cannot be accidentally replayed for another. Binary responses also bypass cache writes.
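The rule above can be read as a single predicate over a request/response pair. The following is a minimal sketch of that predicate, not kumo's implementation: `Method`, `Exchange`, `is_textual`, and `should_cache` are hypothetical names, and classifying "binary" by Content-Type is an assumption, since this page does not say how kumo detects binary responses.

```rust
/// Hypothetical, simplified request method.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Method {
    Get,
    Post,
}

/// Hypothetical view of one request/response pair.
struct Exchange<'a> {
    method: Method,
    request_body: Option<&'a [u8]>,
    /// Value of the response's Content-Type header.
    content_type: &'a str,
}

/// Assumption: a response counts as text if its Content-Type looks textual.
fn is_textual(content_type: &str) -> bool {
    let ct = content_type.to_ascii_lowercase();
    ct.starts_with("text/") || ct.contains("json") || ct.contains("xml")
}

/// Cache only body-less GET requests whose response is not binary.
fn should_cache(ex: &Exchange) -> bool {
    ex.method == Method::Get && ex.request_body.is_none() && is_textual(ex.content_type)
}
```

Under these assumptions a plain GET for an HTML page is cached, while a POST with a form body or a GET that returns image/png is not.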
When browser_fallback(...) is enabled, the cache wraps the full fetcher. If a URL falls back to the browser and returns rendered HTML, that rendered response is what later cache hits replay. Disable the cache for production crawls that must re-evaluate whether a page still needs browser rendering.
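To make the layering concrete, here is a toy model of a cache that wraps the full fetcher. Every name in it (`Fetcher`, `FallbackFetcher`, `Cached`) is hypothetical, an in-memory `HashMap` stands in for the on-disk store, and the inner fetcher simply pretends that every page needed the browser. It only illustrates why a rendered response gets replayed; it is not how kumo is structured internally.

```rust
use std::collections::HashMap;

/// Hypothetical fetcher interface; kumo's internal types are not shown here.
trait Fetcher {
    fn fetch(&mut self, url: &str) -> String;
}

/// Stand-in for "plain HTTP first, browser when rendering is needed".
/// For the sketch, every fetch is treated as a browser fallback.
struct FallbackFetcher {
    browser_calls: usize,
}

impl Fetcher for FallbackFetcher {
    fn fetch(&mut self, url: &str) -> String {
        self.browser_calls += 1;
        format!("<html><!-- rendered: {} --></html>", url)
    }
}

/// Cache layered around the whole fetcher, fallback included.
struct Cached<F: Fetcher> {
    inner: F,
    store: HashMap<String, String>, // in-memory stand-in for the disk cache
}

impl<F: Fetcher> Fetcher for Cached<F> {
    fn fetch(&mut self, url: &str) -> String {
        if let Some(hit) = self.store.get(url) {
            // Cache hit: the inner fetcher is never consulted, so the
            // "does this page need the browser?" decision is not re-run.
            return hit.clone();
        }
        let body = self.inner.fetch(url);
        self.store.insert(url.to_string(), body.clone());
        body
    }
}
```

Fetching the same URL twice through `Cached` calls the inner fetcher once. The second call returns the stored rendered HTML, which is why a cached crawl cannot notice that a page no longer needs rendering.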
TTL
Set a maximum cache age:
use std::time::Duration;

CrawlEngine::builder()
    .http_cache("./cache")
    .cache_ttl(Duration::from_secs(60 * 60)) // expire entries after 1 hour
    .run(MySpider)
    .await?;
Expired entries are refetched and the cache is updated.
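The expiry decision amounts to comparing an entry's age with the configured TTL. The sketch below shows one way to express that check with the standard library. The function name `is_fresh`, the use of `None` for "no TTL configured", and the choice to treat an entry with a future timestamp as stale are assumptions for illustration, not documented kumo behavior.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical freshness check: an entry is fresh while its age is below the TTL.
/// Assumption: `None` means no TTL was configured, so entries never expire.
fn is_fresh(stored_at: SystemTime, now: SystemTime, ttl: Option<Duration>) -> bool {
    match ttl {
        None => true,
        Some(ttl) => match now.duration_since(stored_at) {
            Ok(age) => age < ttl,
            // `stored_at` is later than `now` (for example, the clock moved
            // backwards). Treat the entry as stale so it is refetched.
            Err(_) => false,
        },
    }
}
```

With a one-hour TTL, an entry stored 59 minutes ago is served from the cache and one stored 61 minutes ago is refetched.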
When to Use
- Development: iterate on selectors without network requests
- Re-processing: re-run parse() logic on already-fetched pages
- Rate-limited targets: reduce the number of live requests
Warning
Do not use the HTTP cache in production crawls that need fresh data. Cache hits are served from disk without contacting the site, so they can return stale content, and they bypass your crawl delay and auto-throttle.