Changelog
Full release notes are on GitHub Releases.
kumo and kumo-derive are versioned independently - one may release without the other.
kumo
0.5.2 - 2026-06-19
- Improved production crawling, failure diagnosis, and benchmark interpretation documentation.
- Added a backward-compatible frontier lease/dead-letter API foundation with
FrontierLease,FrontierLeaseId,DeadLetterReason, and defaultFrontierlifecycle methods for future durable in-flight request recovery. - Added durable
FileFrontierlease storage withleases.json, restart recovery back to the pending queue, lease ack/release handling, lease expiry reclaim, anddead_letters.jsonpersistence. - Wired the scheduler and crawl engine to use durable frontier leases when a frontier opts in, including acking terminal requests and dead-lettering leased task panics.
- Added Redis-backed delayed requests, in-progress leases, lease expiry reclaim, and dead-letter handling for distributed
RedisFrontiercrawls. - Added priority-aware Redis ready queues while preserving FIFO order for equal priority requests and migrating older list-backed pending queues on poll.
- Added opt-in stats checkpoints that periodically write
CrawlReportJSON and refresh the file with the final crawl report at shutdown. - Durable frontiers now dead-letter leased requests when retry capacity is exhausted instead of only acking the terminal failure.
- Added proxy health tracking to
ProxyRotator, including per-proxy success and failure counters, configurable cooldown after consecutive failed request attempts, shared health snapshots from cloned rotators, and a no-cooldown mode that preserves the existing rotation API while retaining counters. - Added explicit per-proxy circuit-breaker reporting to
ProxyRotatorhealth snapshots withProxyCircuitState::{Healthy, Open, Recovering}so callers can distinguish closed, cooling-down, and trial-recovery proxies. - SQL stores now write
store_many()batches with one multi-row insert statement per transaction for SQLite, Postgres, and MySQL.
0.5.1 - 2026-06-04
- Added
otelproduction metrics for final crawl report counters, per-request successful fetch latency histograms, and bounded store queue/write signals. - Added opt-in HTTP-first browser fallback with
browser_fallback(...),browser_fallback_on(...), JavaScript-gated response detection, fallback failure recovery, and report JSON counters for fallback attempts, successes, and failures. - Applied Kumo's connection-pool size, request timeout, User-Agent, cookie, and TCP keepalive policy consistently to lazily created per-proxy HTTP clients while preserving isolated proxy cookie jars and pools.
- Applied equivalent pool, timeout, and keepalive settings to default and per-proxy stealth HTTP clients.
- Reduced scheduler bookkeeping allocations by borrowing request URL/domain metadata in robots-blocked, retry, and permanent-failure paths unless owned lifecycle event payloads are actually emitted.
- Removed an extra per-domain stats map lookup from every scheduled, completed, retry, error, and robots-blocked counter update.
- Cached request-task observer enablement so item-scraped and item-dropped hot paths bypass event dispatch checks when events and hooks are disabled.
- Reduced feature-matrix CI disk pressure by freeing unused hosted-tool caches before the heavy LLM lane and avoiding
targetcache uploads for feature permutations. - Updated GitHub Pages deployment actions to their Node 24-compatible major versions.
- Added
RetryReportandCrawlReport::retry_summary()so production reports expose retry pressure, retry exhaustion, and exhausted-failure rates in one typed summary and JSON object. - Added opt-in bounded store buffering with
CrawlEngine::store_buffer(...),StoreBufferConfig,StoreStats, report JSON fields, and batchedItemStore::store_many()writes for built-in file/stdout stores. - Added explicit store-buffer outage handling with
StoreFailurePolicy::Abort, first-failure reporting, failed write counters, and queue/write average and max latency fields in store stats and report JSON. - Hardened the stealth CI job with explicit timeouts and retry-safe package installation so runner package-manager hangs cannot block PRs indefinitely.
- Added transactional
store_many()implementations for SQLite, Postgres, and MySQL stores so buffered database writes commit one flushed batch per transaction. - Added
CrawlTimingStatsandCrawlReport::timingsfor cumulative successful-request phase timings across request middleware, fetch, response middleware, parse, pipeline, and store work. - Added timing breakdown fields to production report JSON exports and Kumo benchmark output.
- Documented how to interpret crawl timing totals under concurrent workloads.
- Reworked CSS extraction to parse each text response lazily once and share the parsed document across response selectors, nested element selectors, cloned elements, text extraction, and attribute access.
- Made
Element::outer_html()serialize lazily while preserving the existing owned, cloneable element API. - Added reusable
CssSelectorvalues plusResponse::css_with(...)andElement::css_with(...)for hot loops that need to bypass repeated global selector-cache lookups while preserving the existingcss(&str)API. - Reworked the scaling benchmark to use independent crawl chains, three-run medians, peak in-flight request measurement, and automatic throughput/memory saturation reporting across concurrency 1 through 64.
- Added manual 10,000-item soak and 100,000-item large-crawl benchmark modes with exact page/item validation, duplicate detection, peak RSS reporting, memory per 1,000 items, and first-10k throughput comparison.
- Added a deterministic realistic resilience benchmark with variable latency, variable response sizes, recoverable HTTP 429/503 responses, and typed cross-validation of crawl output, retry stats, and server counters.
- Added a correctness-gated realistic comparison for Kumo, Scrapy, and Colly, with reset failure schedules, equivalent retry limits, multi-start crawling, and asynchronous Colly execution.
- Hardened the realistic framework comparison with seeded, position-balanced repeated runs, per-sample correctness gates, and median throughput and memory reports with min-max ranges.
- Added a correctness-gated, balanced CI benchmark that isolates JSONL output overhead from fetching, parsing, scheduling, and other crawl work.
- Reused lazy request URL and normalized domain metadata across fingerprinting, scheduling, robots handling, allowed-domain filtering, statistics, and lifecycle events.
- Removed a redundant frontier-request clone from every dispatched task and deferred event-only request string allocations when events and hooks are disabled.
- Replaced
MemoryFrontier's linear priority scan with a stable binary heap, preserving priority and FIFO behavior while making queue push/pop operationsO(log n). - Added a CI-only frontier benchmark for 100-request and 10,000-request push/drain workloads.
- Added a CI-only scheduler benchmark for complete 100-request and 10,000-request push, dispatch, and finish lifecycles.
0.5.0 - 2026-06-03
- Added async crawl lifecycle hooks through the
CrawlHooktrait. - Added
HookErrorPolicyfor choosing whether hook failures are logged or abort the crawl. - Added
CrawlEngine::hook(...),CrawlEngine::hooks(...), andCrawlEngine::hook_error_policy(...). - Added
KumoErrorKind::HookandKumoError::hook(...)for hook failures. - Added crawl hooks documentation, tests, and a runnable
crawl_hooksexample.
0.4.1 - 2026-06-03
- Added
CrawlEvent::name()with stable snake_case event labels for dashboards, metrics, and event logs. - Added crawl event coverage for stream cancellation and task panic paths.
- Documented event labels and ordering expectations for event consumers.
0.4.0 - 2026-06-03
- Added typed crawl lifecycle events through
CrawlEvent. - Added
CrawlEngine::events(...)for caller-owned broadcast channels. - Added
CrawlEngine::event_channel(...)for convenient event subscription. - Added event coverage for crawl start/finish, scheduling, skips, request start/completion, retries, failures, task panics, scraped items, and dropped items.
- Added crawl events docs and a runnable
crawl_eventsexample.
0.3.16 - 2026-06-03
- Automated GitHub Release creation in the tag-driven publish workflow after crates.io publish steps complete.
- Added maintainer release process notes covering version bumps, changelog updates, tag formats, publish verification, and recovery rules.
0.3.15 - 2026-06-02
- Added a stealth contract test for
StealthProfiletowreq-utilemulation mapping. - Added stealth migration notes documenting the
rquesttowreqbackend change while keeping Kumo's public stealth API stable.
0.3.14 - 2026-06-01
- Replaced the optional
rqueststealth HTTP dependency withwreq, keeping Kumo's publicStealthHttpFetcherandStealthProfileAPI stable while removing the yanked upstreamrquestdependency from the planned stealth path. - Documented the
wreq-utillicense metadata risk for stealth releases.
0.3.13 - 2026-06-01
- Added dedicated CI coverage for compiling default and feature-gated examples.
- Synced the repository examples README with the live examples documentation and documented the
polite_crawlingexample in the docs site.
0.3.12 - 2026-06-01
- Added integration coverage for the structured logging contract so core crawl, request, and item-drop events keep stable
eventfields and production context.
0.3.11 - 2026-06-01
- Standardized structured tracing events with stable
eventfields across crawl, request, item, cache, and stream logs. - Added more consistent production log context such as spider name, domain, depth, attempt, retry limits, error kind, and multi-spider index.
0.3.10 - 2026-06-01
- Refreshed direct dependencies including Rig, Reqwest, Tokio, scraper, Redis, OpenTelemetry, object_store, SQLx, and Criterion.
- Raised the minimum supported Rust version to 1.94 for SQLx 0.9.
0.3.9 - 2026-06-01
- Added a typed crawl events design document covering goals, candidate event variants, subscription API shape, and the relationship between events, logging, OpenTelemetry, and crawl reports.
0.3.8 - 2026-06-01
- Hardened duration-budget integration tests so loaded CI runners do not expire the crawl before the first mock response can complete.
0.3.7 - 2026-06-01
- Added production report docs covering
CrawlReporthealth signals, alert examples, domain breakdowns, and operational report export patterns.
0.3.6 - 2026-05-31
- Added
CrawlReporthelper methods and JSON fields for pages/sec, items/sec, bytes/sec, error rate, success rate, and retry exhaustion rate.
0.3.5 - 2026-05-24
- Hardened logging with stable
tracingevent targets and event names for crawl lifecycle, request retry/skip, item drops, and HTTP cache activity. - Added production logging docs covering
RUST_LOG, structured targets, and JSON log setup.
0.3.4 - 2026-05-24
- Added
CrawlStats::error_kinds, per-domain error-kind counters, stableKumoErrorKind::as_str()labels, and JSON report export for failure breakdowns.
0.3.3 - 2026-05-24
- Hardened
FileFrontierflushes by syncing temporary state files before replacingqueue.jsonandseen.json, with best-effort directory sync on Unix platforms. - Added
CrawlStats::retry_exhausted, per-domain retry exhaustion counters, and JSON report export for exhausted retries.
0.3.2 - 2026-05-23
- Added stable JSON export helpers for
CrawlReport, including compact and pretty-printed report strings.
0.3.1 - 2026-05-23
- Made
max_duration()wake crawls promptly even when the scheduler is waiting on longer politeness or retry delays.
0.3.0 - 2026-05-23
- Expanded
FileFrontierresume coverage fordont_filterrequests and scheduler-normalized dedup fingerprints. - Made
CachingFetcherbypass non-GET requests and expanded HTTP cache coverage for TTL refetching, cached statuses, and binary-response bypass behavior. - Added crawl budget controls:
max_pages(),max_items(),max_duration(),max_errors(), andCrawlStats::stop_reason.
0.2.12 - 2026-05-23
- Added
CrawlStats::record_error()so global error counts and per-domain failure counts can be updated together. - Kept single-spider live metrics snapshots current after robots-blocked, retry, permanent failure, and task-panic events, not only successful pages.
- Documented the current
stealthdependency risk: the optional upstreamrquest 5.1.0dependency is yanked on crates.io, sostealthremains experimental until upstream publishes a healthy replacement. - Added a CI warning annotation for the locked yanked
rquestdependency so future releases keep thestealthrisk visible without breaking unrelated checks.
0.2.11 - 2026-05-20
- Published the main
kumocrate with thederivefeature pointing atkumo-derive 0.1.3, so users get the latest derive diagnostics andattr+reextraction behavior throughkumo. - Hardened the publish workflow so crates.io "already exists" races are treated as successful publishes instead of leaving a red release run after the crate is already available.
0.2.10 - 2026-05-16
- Polished release-facing README, crate metadata, and docs text for the
0.2.xline. - Aligned getting-started requirements with the crate MSRV (
rust-version = "1.88"). - Added
production_crawler.rs, a runnable production crawler template covering robots.txt, per-domain politeness, jitter, retry status filtering,Retry-After,StatusRetry, persistentFileFrontierrecovery state, metrics, and JSONL storage. - Updated release checklist notes for the current
0.2.xpublish flow.
0.2.9 - 2026-05-16
- Added
FileFrontier::state()andFileFrontierStatefor inspecting recovered queue, seen, and flush configuration after opening persisted state. - Made
FileFrontier::flush_every(0)disable automatic flushes instead of risking invalid flush interval behavior; explicit and engine shutdown flushes still persist state. - Expanded
FileFrontierrecovery tests and documented exact resume guarantees and current in-flight crash recovery limits.
0.2.8 - 2026-05-16
- Added
Retry-Aftersupport forStatusRetrywithout changingKumoError::HttpStatus. - Retry scheduling now prefers valid
Retry-Afterhints and caps them withRetryPolicy::max_delay. - Added public retry delay helper coverage for delta-seconds, HTTP-date, invalid header fallback, and capped delay hints.
0.2.7 - 2026-05-15
- Count crawl task panics as request failures in both single-spider and multi-spider runs.
- Attribute task-panic failures to the correct domain in
CrawlStats. - Strengthened stats coverage for scheduled, completed, deduped, failed, and panic paths.
0.2.6 - 2026-05-15
- Added public
RetryPolicyintrospection helpers for retry count, delay bounds, jitter state, status filters, and retryable error classification. - Added
KumoError::http_status(),KumoError::status_code(), andKumoError::url()helpers without changing existing error variants. - Fixed retry middleware documentation so
StatusRetryandRetryPolicyexamples match the current API.
0.2.5 - 2026-05-15
- Added
KumoErrorKindandKumoError::kind()for stable error classification without string matching. - Added ergonomic constructors for invalid URL, LLM, and browser errors.
- Cleaned parse/store error display text and strengthened error source tests.
0.2.4 - 2026-05-15
- Added robots.txt
Crawl-delayparsing for matching user-agent groups. - Wired cached robots crawl-delay values into the scheduler when
PolitenessPolicy::respect_robots_crawl_delay(true)is enabled. - Added robots parser and scheduler coverage for crawl-delay behavior.
0.2.3 - 2026-05-15
- Applied
PolitenessPolicy::jitterto per-domain scheduler delays so completed requests can spread follow-up traffic by a random extra delay. - Added unit and scheduler coverage for jitter delay behavior.
0.2.2 - 2026-05-15
- Hardened
FileFrontierflushes by writing temporary files before replacing persisted queue and seen state. - Documented
FileFrontierresume guarantees and the current in-flight crash recovery limitation.
0.2.1 - 2026-05-15
- Fixed scheduler fairness when delayed or domain-blocked requests share a frontier with ready requests from other domains.
- Reduced idle polling by letting the engine sleep until the scheduler's next known ready time when all queued requests are delayed.
- Added regression coverage for delayed high-priority requests.
0.2.0 — 2026-05-10
- Added production crawl scheduler controls with
PolitenessPolicy. - Added scheduler-level delayed retry timing so retry waits do not occupy worker tasks.
- Added
FingerprintPolicyfor canonical request deduplication. - Added
CrawlStats,CrawlReport, and per-domain scheduler counters. - Added frontier flushing on engine shutdown so persistent frontiers can resume queued request state more reliably.
0.1.1 — 2026-05-10
- Updated crate metadata so the crates.io documentation link points to
https://kumo.wihlarkop.com.
0.1.0 — 2026-05-10
- Fixed
CrawlEngine::stream()cancellation so dropping the item stream stops the background crawl after the next attempted send instead of continuing to drain the frontier. - Updated
rustls-webpkiinCargo.lockto addressRUSTSEC-2026-0104. - Added security policy and audit configuration for the optional MySQL dependency advisory that has no upstream fix yet.
- Hardened release CI with broad feature checks, docs build checks, package dry-runs, and separate publish paths for
kumoandkumo-derive. - Added
CrawlRequestfor follow-up request scheduling with priority, custom headers, method/body, metadata, and per-request duplicate-filter bypass. -
Renamed the middleware request context to
FetchRequestbefore the 0.1.0 release to avoid confusion with spider-scheduled crawl requests. -
CloudStore— backend-agnostic cloud storage viaobject_store; supports S3, GCS, Azure Blob, local filesystem, and in-memory backends through a unifiedArc<dyn ObjectStore>interface - New feature flags:
cloud,cloud-s3,cloud-gcs,cloud-azure CloudFormat::Jsonl(default) andCloudFormat::Jsonoutput formats- Auto-generated timestamped filenames; configurable via
.filename()and.prefix()
Initial feature set
- Async-first crawl engine via Tokio (
CrawlEngine::builder()) - CSS, regex, XPath, JSONPath selectors
- LLM extraction via Claude, OpenAI, Gemini, Ollama
- Rate limiting, auto-throttle, retry with backoff
JsonlStore,JsonStore,CsvStore,StdoutStore- PostgreSQL, SQLite, MySQL stores
- Item pipelines (
DropDuplicates,FilterPipeline,RequireFields) MemoryFrontier,FileFrontier,RedisFrontierLinkExtractorwith allow/deny filtering- HTTP response cache, Bloom filter dedup, robots.txt
- Headless browser fetcher, stealth mode
- Multi-spider engine
CrawlEngine::stream()— async item stream with backpressureSitemapSpider- OpenTelemetry OTLP/gRPC export (
otelfeature)
kumo-derive
Unreleased
- Added backward-compatible
#[derive(Extract)]support for numeric scalar fields,bool,Option<T>, andVec<T>whereTisString,bool, or a supported numeric scalar. - Improved derived scalar parse failures so
KumoErrormessages include the field name, target type, raw value, and parse error.
0.1.3 - 2026-05-16
- Added clear compile-time diagnostics for unsupported field types. The derive macro now accepts only
StringandOption<String>fields instead of producing later Rust type errors. - Added a compile-time diagnostic for fields with multiple
#[extract(...)]attributes. - Made
attrandrecompose: when both are present, the macro reads the attribute first and then applies the regex to that attribute value. - Updated
kumo-deriveREADME and crate metadata for thekumo 0.2line.
0.1.2 — 2026-04-25
- Added
default = "value"— fallback string forStringfields - Added
transform = "trim|lowercase|uppercase"— post-extraction transform
0.1.1 — 2026-04-25
- Added crate metadata:
authors,rust-version,documentation,exclude
0.1.0 — 2026-04-21
#[derive(Extract)]proc-macro for structs with named fieldscss,attr,re,textfield optionsllm_fallback— CSS-first with LLM fallbackStringfields default to"",Option<String>toNone