#[derive(Extract)]
kumo-derive is a companion proc-macro crate that generates an Extract implementation for your item structs. The main kumo prelude exports the derive macro as Extract to avoid colliding with the Extract trait it implements.
Installation
The derive feature automatically pulls in kumo-derive.
Basic Usage
use kumo::prelude::*;
use serde::Serialize;
#[derive(Debug, Serialize, Extract)]
struct Book {
    #[extract(css = "h3 a", attr = "title")]
    title: String,
    #[extract(css = ".price_color", re = r"[\d.]+")]
    price: f64,
    #[extract(css = ".availability")]
    availability: String,
}

// In parse():
async fn parse(&self, res: &Response) -> Result<Output<Self::Item>, KumoError> {
    let mut books = Vec::new();
    for el in res.css("article.product_pod").iter() {
        books.push(Book::extract_from(el, None).await?);
    }
    Ok(Output::new().items(books))
}
Field Options
All options are set inside #[extract(...)] on each field.
css (required)
The CSS selector to find the element. Must be present on every field.
attr
Extract an HTML attribute instead of the text content.
#[extract(css = "a.product-link", attr = "href")]
url: String,
#[extract(css = "img.thumbnail", attr = "src")]
image_url: String,
re
Apply a regex to the extracted text and return the first match or capture group.
// Extract digits from "£12.99"
#[extract(css = ".price_color", re = r"[\d.]+")]
price_value: String,
// First capture group
#[extract(css = ".rating", re = r"star-rating (\w+)")]
rating_word: String,
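To make the effect of the first pattern concrete, the following is a minimal std-only sketch of what [\d.]+ selects from a string such as "£12.99". It is not kumo's code and does not use a regex engine: the helper first_number_like is a hypothetical, hand-rolled stand-in that only handles ASCII digits and dots.

```rust
/// Hypothetical helper (not part of kumo): returns the first run of ASCII
/// digits and dots in `text`, which is what the pattern `[\d.]+` would
/// select as its first match on ASCII input.
fn first_number_like(text: &str) -> Option<&str> {
    fn is_part(c: char) -> bool {
        c.is_ascii_digit() || c == '.'
    }

    // Byte offset of the first digit or dot; None if there is no match.
    let start = text.find(is_part)?;
    let rest = &text[start..];
    // The match ends at the first character that is neither digit nor dot.
    let end = rest.find(|c: char| !is_part(c)).unwrap_or(rest.len());
    Some(&rest[..end])
}
```

Calling first_number_like("£12.99") yields Some("12.99"), which then parses cleanly as an f64. A field with no digits at all yields None, which corresponds to the "selector finds nothing" case described under Field Types.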
text
Explicit text extraction — this is the default and can be omitted.
default
Fallback value for required scalar fields when the selector finds nothing.
Without default, missing String fields fall back to an empty string. Missing required numeric or boolean fields return a KumoError. Option<T> fields always use None, and Vec<T> fields use an empty vector.
transform
Apply a string transformation after extraction. Valid values: "trim", "lowercase", "uppercase".
#[extract(css = ".category", transform = "lowercase")]
category: String,
#[extract(css = "h1", transform = "trim")]
title: String,
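The three transform values map naturally onto standard library string methods. The sketch below shows that mapping under the assumption that the transforms behave like their std counterparts; apply_transform is a hypothetical name, and reporting an unknown value at run time is a choice made only for this illustration.

```rust
/// Hypothetical stand-in for the documented `transform` values.
/// Assumes each transform behaves like the matching `str` method.
fn apply_transform(value: &str, transform: &str) -> Result<String, String> {
    match transform {
        // Strip leading and trailing whitespace.
        "trim" => Ok(value.trim().to_string()),
        // Unicode-aware case conversion from the standard library.
        "lowercase" => Ok(value.to_lowercase()),
        "uppercase" => Ok(value.to_uppercase()),
        // Anything else is not one of the three documented values.
        other => Err(format!("unknown transform: {other}")),
    }
}
```

For example, apply_transform("  Fiction ", "trim") gives "Fiction", and apply_transform("Fiction", "lowercase") gives "fiction".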
llm_fallback
Fall back to LLM extraction when the CSS selector returns empty. Two forms:
// Use a custom hint
#[extract(css = ".price", llm_fallback = "the product price including currency symbol")]
price: String,
// Use the field name as the hint
#[extract(css = ".author-name", llm_fallback)]
author: String,
When any llm_fallback field is empty after CSS extraction, kumo calls the LLM with a generated JSON schema and fills in the missing fields. This requires an LLM client to be passed to extract_from:
let client = AnthropicClient::new(std::env::var("ANTHROPIC_API_KEY")?);
let book = Book::extract_from(&el, Some(&client)).await?;
llm_fallback only supports single-value fields. It is forbidden on Vec<T> fields and cannot be combined with default; fallback chains are not supported.
Field Types
| Type | Behaviour when selector finds nothing |
|---|---|
| String | Returns "" (or default value if set) |
| numeric scalars (u32, f64, etc.) | Returns an extraction error unless default is set |
| bool | Returns an extraction error unless default is set |
| Option<T> | Returns None |
| Vec<T> | Returns an empty vector |
Supported numeric scalars are i8, i16, i32, i64, i128, isize, u8, u16, u32, u64, u128, usize, f32, and f64. For Option<T> and Vec<T>, T can be String, bool, or one of those numeric scalar types. Invalid scalar parses return KumoError messages that include the field name, target type, raw value, and parse failure.
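The rules for required fields can be sketched in plain Rust. This is an illustration of the documented behaviour, not the code the macro generates: resolve_string and resolve_scalar are hypothetical names, errors are plain strings rather than KumoError, and parsing the default from a string is an assumption.

```rust
use std::any::type_name;
use std::fmt::Display;
use std::str::FromStr;

/// Required String field: extracted text, else the default, else "".
fn resolve_string(raw: Option<&str>, default: Option<&str>) -> String {
    raw.or(default).unwrap_or("").to_string()
}

/// Required numeric or bool field: a missing value without a default is
/// an error, and a failed parse reports the field name, target type,
/// raw value, and the underlying parse failure.
fn resolve_scalar<T>(field: &str, raw: Option<&str>, default: Option<&str>) -> Result<T, String>
where
    T: FromStr,
    T::Err: Display,
{
    // Assumption: the default is supplied as text and parsed like any value.
    let text = raw.or(default).ok_or_else(|| {
        format!("field `{field}`: selector matched nothing and no default is set")
    })?;
    text.parse::<T>().map_err(|err| {
        format!("field `{field}`: cannot parse {text:?} as {}: {err}", type_name::<T>())
    })
}
```

With these helpers, resolve_scalar::<f64>("price", Some("12.99"), None) succeeds, resolve_scalar::<bool>("in_stock", None, None) fails because nothing was extracted and no default exists, and resolve_scalar::<u32>("stock", Some("many"), None) fails with a message that names the field, the type u32, and the raw value "many".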
Types may use their Rust prelude spelling or canonical std, core, and alloc paths, such as String, std::string::String, u32, core::primitive::u32, and std::option::Option<T>. Custom paths and nested containers such as Option<Vec<T>> are not supported.
Combining Options
Options can be combined on a single field:
#[derive(Debug, Serialize, Extract)]
struct Product {
    // attribute + regex + transform
    #[extract(css = "span.price", attr = "data-raw", re = r"[\d.]+", transform = "trim")]
    price: f64,

    // optional field with CSS fallback to LLM
    #[extract(css = "div.description", llm_fallback = "product description")]
    description: Option<String>,

    // attribute with default
    #[extract(css = "a.detail-link", attr = "href", default = "#")]
    detail_url: String,

    // collects every matching label
    #[extract(css = ".tag", transform = "lowercase")]
    tags: Vec<String>,
}
Struct Requirements
- Only structs with named fields are supported; tuple structs and enums will produce a compile error.
- Every field must have an #[extract(css = "...")] annotation; fields without it won't compile.
- The struct must also derive serde::Serialize to work as a kumo item.