Skip to content

Changelog

Full release notes are on GitHub Releases.

kumo and kumo-derive are versioned independently - one may release without the other.


kumo

0.5.2 - 2026-06-19

  • Improved production crawling, failure diagnosis, and benchmark interpretation documentation.
  • Added a backward-compatible frontier lease/dead-letter API foundation with FrontierLease, FrontierLeaseId, DeadLetterReason, and default Frontier lifecycle methods for future durable in-flight request recovery.
  • Added durable FileFrontier lease storage with leases.json, restart recovery back to the pending queue, lease ack/release handling, lease expiry reclaim, and dead_letters.json persistence.
  • Wired the scheduler and crawl engine to use durable frontier leases when a frontier opts in, including acking terminal requests and dead-lettering leased task panics.
  • Added Redis-backed delayed requests, in-progress leases, lease expiry reclaim, and dead-letter handling for distributed RedisFrontier crawls.
  • Added priority-aware Redis ready queues while preserving FIFO order for equal priority requests and migrating older list-backed pending queues on poll.
  • Added opt-in stats checkpoints that periodically write CrawlReport JSON and refresh the file with the final crawl report at shutdown.
  • Durable frontiers now dead-letter leased requests when retry capacity is exhausted instead of only acking the terminal failure.
  • Added proxy health tracking to ProxyRotator, including per-proxy success and failure counters, configurable cooldown after consecutive failed request attempts, shared health snapshots from cloned rotators, and a no-cooldown mode that preserves the existing rotation API while retaining counters.
  • Added explicit per-proxy circuit-breaker reporting to ProxyRotator health snapshots with ProxyCircuitState::{Healthy, Open, Recovering} so callers can distinguish closed, cooling-down, and trial-recovery proxies.
  • SQL stores now write store_many() batches with one multi-row insert statement per transaction for SQLite, Postgres, and MySQL.

0.5.1 - 2026-06-04

  • Added otel production metrics for final crawl report counters, per-request successful fetch latency histograms, and bounded store queue/write signals.
  • Added opt-in HTTP-first browser fallback with browser_fallback(...), browser_fallback_on(...), JavaScript-gated response detection, fallback failure recovery, and report JSON counters for fallback attempts, successes, and failures.
  • Applied Kumo's connection-pool size, request timeout, User-Agent, cookie, and TCP keepalive policy consistently to lazily created per-proxy HTTP clients while preserving isolated proxy cookie jars and pools.
  • Applied equivalent pool, timeout, and keepalive settings to default and per-proxy stealth HTTP clients.
  • Reduced scheduler bookkeeping allocations by borrowing request URL/domain metadata in robots-blocked, retry, and permanent-failure paths unless owned lifecycle event payloads are actually emitted.
  • Removed an extra per-domain stats map lookup from every scheduled, completed, retry, error, and robots-blocked counter update.
  • Cached request-task observer enablement so item-scraped and item-dropped hot paths bypass event dispatch checks when events and hooks are disabled.
  • Reduced feature-matrix CI disk pressure by freeing unused hosted-tool caches before the heavy LLM lane and avoiding target cache uploads for feature permutations.
  • Updated GitHub Pages deployment actions to their Node 24-compatible major versions.
  • Added RetryReport and CrawlReport::retry_summary() so production reports expose retry pressure, retry exhaustion, and exhausted-failure rates in one typed summary and JSON object.
  • Added opt-in bounded store buffering with CrawlEngine::store_buffer(...), StoreBufferConfig, StoreStats, report JSON fields, and batched ItemStore::store_many() writes for built-in file/stdout stores.
  • Added explicit store-buffer outage handling with StoreFailurePolicy::Abort, first-failure reporting, failed write counters, and queue/write average and max latency fields in store stats and report JSON.
  • Hardened the stealth CI job with explicit timeouts and retry-safe package installation so runner package-manager hangs cannot block PRs indefinitely.
  • Added transactional store_many() implementations for SQLite, Postgres, and MySQL stores so buffered database writes commit one flushed batch per transaction.
  • Added CrawlTimingStats and CrawlReport::timings for cumulative successful-request phase timings across request middleware, fetch, response middleware, parse, pipeline, and store work.
  • Added timing breakdown fields to production report JSON exports and Kumo benchmark output.
  • Documented how to interpret crawl timing totals under concurrent workloads.
  • Reworked CSS extraction to parse each text response lazily once and share the parsed document across response selectors, nested element selectors, cloned elements, text extraction, and attribute access.
  • Made Element::outer_html() serialize lazily while preserving the existing owned, cloneable element API.
  • Added reusable CssSelector values plus Response::css_with(...) and Element::css_with(...) for hot loops that need to bypass repeated global selector-cache lookups while preserving the existing css(&str) API.
  • Reworked the scaling benchmark to use independent crawl chains, three-run medians, peak in-flight request measurement, and automatic throughput/memory saturation reporting across concurrency 1 through 64.
  • Added manual 10,000-item soak and 100,000-item large-crawl benchmark modes with exact page/item validation, duplicate detection, peak RSS reporting, memory per 1,000 items, and first-10k throughput comparison.
  • Added a deterministic realistic resilience benchmark with variable latency, variable response sizes, recoverable HTTP 429/503 responses, and typed cross-validation of crawl output, retry stats, and server counters.
  • Added a correctness-gated realistic comparison for Kumo, Scrapy, and Colly, with reset failure schedules, equivalent retry limits, multi-start crawling, and asynchronous Colly execution.
  • Hardened the realistic framework comparison with seeded, position-balanced repeated runs, per-sample correctness gates, and median throughput and memory reports with min-max ranges.
  • Added a correctness-gated, balanced CI benchmark that isolates JSONL output overhead from fetching, parsing, scheduling, and other crawl work.
  • Reused lazy request URL and normalized domain metadata across fingerprinting, scheduling, robots handling, allowed-domain filtering, statistics, and lifecycle events.
  • Removed a redundant frontier-request clone from every dispatched task and deferred event-only request string allocations when events and hooks are disabled.
  • Replaced MemoryFrontier's linear priority scan with a stable binary heap, preserving priority and FIFO behavior while making queue push/pop operations O(log n).
  • Added a CI-only frontier benchmark for 100-request and 10,000-request push/drain workloads.
  • Added a CI-only scheduler benchmark for complete 100-request and 10,000-request push, dispatch, and finish lifecycles.

0.5.0 - 2026-06-03

  • Added async crawl lifecycle hooks through the CrawlHook trait.
  • Added HookErrorPolicy for choosing whether hook failures are logged or abort the crawl.
  • Added CrawlEngine::hook(...), CrawlEngine::hooks(...), and CrawlEngine::hook_error_policy(...).
  • Added KumoErrorKind::Hook and KumoError::hook(...) for hook failures.
  • Added crawl hooks documentation, tests, and a runnable crawl_hooks example.

0.4.1 - 2026-06-03

  • Added CrawlEvent::name() with stable snake_case event labels for dashboards, metrics, and event logs.
  • Added crawl event coverage for stream cancellation and task panic paths.
  • Documented event labels and ordering expectations for event consumers.

0.4.0 - 2026-06-03

  • Added typed crawl lifecycle events through CrawlEvent.
  • Added CrawlEngine::events(...) for caller-owned broadcast channels.
  • Added CrawlEngine::event_channel(...) for convenient event subscription.
  • Added event coverage for crawl start/finish, scheduling, skips, request start/completion, retries, failures, task panics, scraped items, and dropped items.
  • Added crawl events docs and a runnable crawl_events example.

0.3.16 - 2026-06-03

  • Automated GitHub Release creation in the tag-driven publish workflow after crates.io publish steps complete.
  • Added maintainer release process notes covering version bumps, changelog updates, tag formats, publish verification, and recovery rules.

0.3.15 - 2026-06-02

  • Added a stealth contract test for StealthProfile to wreq-util emulation mapping.
  • Added stealth migration notes documenting the rquest to wreq backend change while keeping Kumo's public stealth API stable.

0.3.14 - 2026-06-01

  • Replaced the optional rquest stealth HTTP dependency with wreq, keeping Kumo's public StealthHttpFetcher and StealthProfile API stable while removing the yanked upstream rquest dependency from the planned stealth path.
  • Documented the wreq-util license metadata risk for stealth releases.

0.3.13 - 2026-06-01

  • Added dedicated CI coverage for compiling default and feature-gated examples.
  • Synced the repository examples README with the live examples documentation and documented the polite_crawling example in the docs site.

0.3.12 - 2026-06-01

  • Added integration coverage for the structured logging contract so core crawl, request, and item-drop events keep stable event fields and production context.

0.3.11 - 2026-06-01

  • Standardized structured tracing events with stable event fields across crawl, request, item, cache, and stream logs.
  • Added more consistent production log context such as spider name, domain, depth, attempt, retry limits, error kind, and multi-spider index.

0.3.10 - 2026-06-01

  • Refreshed direct dependencies including Rig, Reqwest, Tokio, scraper, Redis, OpenTelemetry, object_store, SQLx, and Criterion.
  • Raised the minimum supported Rust version to 1.94 for SQLx 0.9.

0.3.9 - 2026-06-01

  • Added a typed crawl events design document covering goals, candidate event variants, subscription API shape, and the relationship between events, logging, OpenTelemetry, and crawl reports.

0.3.8 - 2026-06-01

  • Hardened duration-budget integration tests so loaded CI runners do not expire the crawl before the first mock response can complete.

0.3.7 - 2026-06-01

  • Added production report docs covering CrawlReport health signals, alert examples, domain breakdowns, and operational report export patterns.

0.3.6 - 2026-05-31

  • Added CrawlReport helper methods and JSON fields for pages/sec, items/sec, bytes/sec, error rate, success rate, and retry exhaustion rate.

0.3.5 - 2026-05-24

  • Hardened logging with stable tracing event targets and event names for crawl lifecycle, request retry/skip, item drops, and HTTP cache activity.
  • Added production logging docs covering RUST_LOG, structured targets, and JSON log setup.

0.3.4 - 2026-05-24

  • Added CrawlStats::error_kinds, per-domain error-kind counters, stable KumoErrorKind::as_str() labels, and JSON report export for failure breakdowns.

0.3.3 - 2026-05-24

  • Hardened FileFrontier flushes by syncing temporary state files before replacing queue.json and seen.json, with best-effort directory sync on Unix platforms.
  • Added CrawlStats::retry_exhausted, per-domain retry exhaustion counters, and JSON report export for exhausted retries.

0.3.2 - 2026-05-23

  • Added stable JSON export helpers for CrawlReport, including compact and pretty-printed report strings.

0.3.1 - 2026-05-23

  • Made max_duration() wake crawls promptly even when the scheduler is waiting on longer politeness or retry delays.

0.3.0 - 2026-05-23

  • Expanded FileFrontier resume coverage for dont_filter requests and scheduler-normalized dedup fingerprints.
  • Made CachingFetcher bypass non-GET requests and expanded HTTP cache coverage for TTL refetching, cached statuses, and binary-response bypass behavior.
  • Added crawl budget controls: max_pages(), max_items(), max_duration(), max_errors(), and CrawlStats::stop_reason.

0.2.12 - 2026-05-23

  • Added CrawlStats::record_error() so global error counts and per-domain failure counts can be updated together.
  • Kept single-spider live metrics snapshots current after robots-blocked, retry, permanent failure, and task-panic events, not only successful pages.
  • Documented the current stealth dependency risk: the optional upstream rquest 5.1.0 dependency is yanked on crates.io, so stealth remains experimental until upstream publishes a healthy replacement.
  • Added a CI warning annotation for the locked yanked rquest dependency so future releases keep the stealth risk visible without breaking unrelated checks.

0.2.11 - 2026-05-20

  • Published the main kumo crate with the derive feature pointing at kumo-derive 0.1.3, so users get the latest derive diagnostics and attr + re extraction behavior through kumo.
  • Hardened the publish workflow so crates.io "already exists" races are treated as successful publishes instead of leaving a red release run after the crate is already available.

0.2.10 - 2026-05-16

  • Polished release-facing README, crate metadata, and docs text for the 0.2.x line.
  • Aligned getting-started requirements with the crate MSRV (rust-version = "1.88").
  • Added production_crawler.rs, a runnable production crawler template covering robots.txt, per-domain politeness, jitter, retry status filtering, Retry-After, StatusRetry, persistent FileFrontier recovery state, metrics, and JSONL storage.
  • Updated release checklist notes for the current 0.2.x publish flow.

0.2.9 - 2026-05-16

  • Added FileFrontier::state() and FileFrontierState for inspecting recovered queue, seen, and flush configuration after opening persisted state.
  • Made FileFrontier::flush_every(0) disable automatic flushes instead of risking invalid flush interval behavior; explicit and engine shutdown flushes still persist state.
  • Expanded FileFrontier recovery tests and documented exact resume guarantees and current in-flight crash recovery limits.

0.2.8 - 2026-05-16

  • Added Retry-After support for StatusRetry without changing KumoError::HttpStatus.
  • Retry scheduling now prefers valid Retry-After hints and caps them with RetryPolicy::max_delay.
  • Added public retry delay helper coverage for delta-seconds, HTTP-date, invalid header fallback, and capped delay hints.

0.2.7 - 2026-05-15

  • Count crawl task panics as request failures in both single-spider and multi-spider runs.
  • Attribute task-panic failures to the correct domain in CrawlStats.
  • Strengthened stats coverage for scheduled, completed, deduped, failed, and panic paths.

0.2.6 - 2026-05-15

  • Added public RetryPolicy introspection helpers for retry count, delay bounds, jitter state, status filters, and retryable error classification.
  • Added KumoError::http_status(), KumoError::status_code(), and KumoError::url() helpers without changing existing error variants.
  • Fixed retry middleware documentation so StatusRetry and RetryPolicy examples match the current API.

0.2.5 - 2026-05-15

  • Added KumoErrorKind and KumoError::kind() for stable error classification without string matching.
  • Added ergonomic constructors for invalid URL, LLM, and browser errors.
  • Cleaned parse/store error display text and strengthened error source tests.

0.2.4 - 2026-05-15

  • Added robots.txt Crawl-delay parsing for matching user-agent groups.
  • Wired cached robots crawl-delay values into the scheduler when PolitenessPolicy::respect_robots_crawl_delay(true) is enabled.
  • Added robots parser and scheduler coverage for crawl-delay behavior.

0.2.3 - 2026-05-15

  • Applied PolitenessPolicy::jitter to per-domain scheduler delays so completed requests can spread follow-up traffic by a random extra delay.
  • Added unit and scheduler coverage for jitter delay behavior.

0.2.2 - 2026-05-15

  • Hardened FileFrontier flushes by writing temporary files before replacing persisted queue and seen state.
  • Documented FileFrontier resume guarantees and the current in-flight crash recovery limitation.

0.2.1 - 2026-05-15

  • Fixed scheduler fairness when delayed or domain-blocked requests share a frontier with ready requests from other domains.
  • Reduced idle polling by letting the engine sleep until the scheduler's next known ready time when all queued requests are delayed.
  • Added regression coverage for delayed high-priority requests.

0.2.0 — 2026-05-10

  • Added production crawl scheduler controls with PolitenessPolicy.
  • Added scheduler-level delayed retry timing so retry waits do not occupy worker tasks.
  • Added FingerprintPolicy for canonical request deduplication.
  • Added CrawlStats, CrawlReport, and per-domain scheduler counters.
  • Added frontier flushing on engine shutdown so persistent frontiers can resume queued request state more reliably.

0.1.1 — 2026-05-10

  • Updated crate metadata so the crates.io documentation link points to https://kumo.wihlarkop.com.

0.1.0 — 2026-05-10

  • Fixed CrawlEngine::stream() cancellation so dropping the item stream stops the background crawl after the next attempted send instead of continuing to drain the frontier.
  • Updated rustls-webpki in Cargo.lock to address RUSTSEC-2026-0104.
  • Added security policy and audit configuration for the optional MySQL dependency advisory that has no upstream fix yet.
  • Hardened release CI with broad feature checks, docs build checks, package dry-runs, and separate publish paths for kumo and kumo-derive.
  • Added CrawlRequest for follow-up request scheduling with priority, custom headers, method/body, metadata, and per-request duplicate-filter bypass.
  • Renamed the middleware request context to FetchRequest before the 0.1.0 release to avoid confusion with spider-scheduled crawl requests.

  • CloudStore — backend-agnostic cloud storage via object_store; supports S3, GCS, Azure Blob, local filesystem, and in-memory backends through a unified Arc<dyn ObjectStore> interface

  • New feature flags: cloud, cloud-s3, cloud-gcs, cloud-azure
  • CloudFormat::Jsonl (default) and CloudFormat::Json output formats
  • Auto-generated timestamped filenames; configurable via .filename() and .prefix()

Initial feature set

  • Async-first crawl engine via Tokio (CrawlEngine::builder())
  • CSS, regex, XPath, JSONPath selectors
  • LLM extraction via Claude, OpenAI, Gemini, Ollama
  • Rate limiting, auto-throttle, retry with backoff
  • JsonlStore, JsonStore, CsvStore, StdoutStore
  • PostgreSQL, SQLite, MySQL stores
  • Item pipelines (DropDuplicates, FilterPipeline, RequireFields)
  • MemoryFrontier, FileFrontier, RedisFrontier
  • LinkExtractor with allow/deny filtering
  • HTTP response cache, Bloom filter dedup, robots.txt
  • Headless browser fetcher, stealth mode
  • Multi-spider engine
  • CrawlEngine::stream() — async item stream with backpressure
  • SitemapSpider
  • OpenTelemetry OTLP/gRPC export (otel feature)

kumo-derive

Unreleased

  • Added backward-compatible #[derive(Extract)] support for numeric scalar fields, bool, Option<T>, and Vec<T> where T is String, bool, or a supported numeric scalar.
  • Improved derived scalar parse failures so KumoError messages include the field name, target type, raw value, and parse error.

0.1.3 - 2026-05-16

  • Added clear compile-time diagnostics for unsupported field types. The derive macro now accepts only String and Option<String> fields instead of producing later Rust type errors.
  • Added a compile-time diagnostic for fields with multiple #[extract(...)] attributes.
  • Made attr and re compose: when both are present, the macro reads the attribute first and then applies the regex to that attribute value.
  • Updated kumo-derive README and crate metadata for the kumo 0.2 line.

0.1.2 — 2026-04-25

  • Added default = "value" — fallback string for String fields
  • Added transform = "trim|lowercase|uppercase" — post-extraction transform

0.1.1 — 2026-04-25

  • Added crate metadata: authors, rust-version, documentation, exclude

0.1.0 — 2026-04-21

  • #[derive(Extract)] proc-macro for structs with named fields
  • css, attr, re, text field options
  • llm_fallback — CSS-first with LLM fallback
  • String fields default to "", Option<String> to None