CostBot: Crawling Construction Prices on Cloudflare Without Letting Costs Run Away

CostBot's Core Principle

The goal is the cheapest crawl that still improves price coverage — not the most frequent one.

Price intelligence only works if it stays current. For construction that means crawling supplier pages, public catalogs, procurement portals, and cost files on a schedule — and doing it cheaply enough that a data business can scale before revenue catches up. We call our crawler CostBot, and cost discipline is baked into its design.

Why run CostBot on Cloudflare?

The workload is naturally distributed, so we run it on Cloudflare primitives that each do one job well:

Workers orchestrate the pipeline.
Queues absorb crawl and normalization jobs so traffic spikes don't overload downstream stages.
Cron triggers dispatch freshness checks on a schedule.
KV holds guardrail and kill-switch state that has to be read cheaply and often.
D1 stores the structured catalog and observations.
R2 holds downloaded artifacts (pages, PDFs, BC3 files).
Workers AI handles selective matching and enrichment — only when cheaper methods fail.
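
As a rough illustration of how the cron trigger and the queue could fit together, here is a minimal TypeScript sketch. It is not CostBot's actual code: `SourceRef`, `CrawlJob`, `QueueLike`, `dueSources`, and `dispatchFreshnessChecks` are hypothetical names, `QueueLike` mirrors only the `sendBatch` method of a Queues producer binding, and the batch size of 100 is an assumption to check against current platform limits.

```typescript
// Hypothetical shape of a crawl source; the real schema is not shown in this post.
interface SourceRef {
  id: string;
  url: string;
  nextCheckAt: number; // epoch milliseconds
}

interface CrawlJob {
  sourceId: string;
  url: string;
}

// Minimal stand-in for a Queues producer binding (only the method used here).
interface QueueLike<T> {
  sendBatch(messages: Array<{ body: T }>): Promise<void>;
}

// Pure selection step: which sources are due for a freshness check?
function dueSources(sources: SourceRef[], nowMs: number): SourceRef[] {
  return sources.filter((s) => s.nextCheckAt <= nowMs);
}

// What a cron-triggered handler might do: enqueue the due sources and return,
// leaving the actual crawling to queue consumers.
async function dispatchFreshnessChecks(
  queue: QueueLike<CrawlJob>,
  sources: SourceRef[],
  nowMs: number,
  batchSize = 100, // assumed batch limit
): Promise<number> {
  const jobs = dueSources(sources, nowMs).map((s) => ({
    body: { sourceId: s.id, url: s.url },
  }));
  for (let i = 0; i < jobs.length; i += batchSize) {
    await queue.sendBatch(jobs.slice(i, i + batchSize));
  }
  return jobs.length;
}
```

Keeping the scheduled step this thin means a slow supplier site delays a queue consumer, not the dispatcher.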

Why is cost the real constraint, not coverage?

The instinct with crawling is "fetch everything, often." That's how a data pipeline quietly turns into a five-figure monthly bill. CostBot inverts the default: the goal is the cheapest crawl that still improves price coverage.

Three rules do most of the work:

Not every page deserves a browser render. Static HTML and JSON endpoints are parsed directly; expensive rendered crawls are reserved for sources that actually need them.
Not every observation needs an AI call. Deterministic parsing and matching run first; the model is the last resort, not the first.
Not every source needs the same freshness. Sources are ranked by value, crawl budget, trust, and staleness, so a volatile, high-value supplier is checked far more often than a stable, marginal one.
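
The first and third rules can be sketched as a small planning function. This is an illustration under stated assumptions, not CostBot's scoring: the `RankedSource` fields, the 0 to 1 scales, the relative costs in `FETCH_COST`, the cap of 3 on staleness, and the greedy "priority per unit of cost" ordering are all hypothetical choices made for the example.

```typescript
type FetchMode = "direct" | "render";

// Hypothetical source record; field names and scales are illustrative.
interface RankedSource {
  id: string;
  value: number; // 0..1, how much this source improves price coverage
  trust: number; // 0..1, how reliable its prices have been
  targetIntervalHours: number; // desired freshness
  lastCrawledAt: number; // epoch milliseconds
  needsRender: boolean; // true only when direct parsing is known to fail
}

// Assumed relative costs: a rendered crawl is far pricier than a plain fetch.
const FETCH_COST: Record<FetchMode, number> = { direct: 1, render: 20 };

function fetchMode(s: RankedSource): FetchMode {
  return s.needsRender ? "render" : "direct";
}

// Staleness as a ratio: 1 means exactly due, 2 means twice overdue.
function stalenessRatio(s: RankedSource, nowMs: number): number {
  const ageHours = (nowMs - s.lastCrawledAt) / 3600000; // ms per hour
  return ageHours / s.targetIntervalHours;
}

// Greedy plan: take due sources in order of priority per unit of cost
// until the crawl budget runs out.
function planCrawl(
  sources: RankedSource[],
  nowMs: number,
  budget: number,
): Array<{ id: string; mode: FetchMode }> {
  const candidates = sources
    .filter((s) => stalenessRatio(s, nowMs) >= 1)
    .map((s) => {
      const mode = fetchMode(s);
      const priority = s.value * s.trust * Math.min(stalenessRatio(s, nowMs), 3);
      return { id: s.id, mode, cost: FETCH_COST[mode], score: priority / FETCH_COST[mode] };
    })
    .sort((a, b) => b.score - a.score);

  const plan: Array<{ id: string; mode: FetchMode }> = [];
  let remaining = budget;
  for (const c of candidates) {
    if (c.cost > remaining) continue; // skip what does not fit; cheaper sources may still fit
    plan.push({ id: c.id, mode: c.mode });
    remaining -= c.cost;
  }
  return plan;
}
```

In this sketch a volatile, high-value supplier would simply get a short `targetIntervalHours`, so it becomes due more often, while a source that needs rendering has to earn its place against the much higher assumed cost.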

How does a staged pipeline keep crawling cheap?

Work flows through discrete stages so that each one stays cheap, can be retried on its own, and is easy to reason about:

Source discovery finds promising suppliers and BC3 files.
Crawlers collect raw observations.
Parsers extract prices from the raw bytes.
Normalizers convert messy rows into structured observations (units, currency, region).
Matchers map observations to canonical items — deterministic rules first, AI only when the cheaper path can't decide.
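
The matcher stage's "deterministic first, AI last" ordering might look like the following sketch. The types, the two rules (exact SKU, then normalized name plus unit), and the injected `aiMatch` callback are assumptions for illustration; the post does not describe CostBot's real rules or how it calls Workers AI.

```typescript
interface PriceObservation {
  sku?: string;
  description: string;
  unit: string;
}

interface CanonicalItem {
  id: string;
  sku?: string;
  name: string;
  unit: string;
}

interface MatchResult {
  itemId: string | null;
  method: "sku" | "name" | "ai" | "unmatched";
}

// ASCII-only normalization for brevity; real catalogs likely need accent folding too.
function normalizeText(text: string): string {
  return text.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// Deterministic rules: exact SKU first, then normalized name plus unit.
function matchDeterministic(
  obs: PriceObservation,
  catalog: CanonicalItem[],
): MatchResult | null {
  if (obs.sku) {
    const bySku = catalog.find((item) => item.sku === obs.sku);
    if (bySku) return { itemId: bySku.id, method: "sku" };
  }
  const wanted = normalizeText(obs.description);
  const byName = catalog.filter(
    (item) => normalizeText(item.name) === wanted && item.unit === obs.unit,
  );
  // Accept only an unambiguous hit; ties are left for the fallback.
  if (byName.length === 1) return { itemId: byName[0].id, method: "name" };
  return null;
}

// The AI matcher is injected and only called when the rules cannot decide.
async function matchObservation(
  obs: PriceObservation,
  catalog: CanonicalItem[],
  aiMatch?: (obs: PriceObservation) => Promise<string | null>,
): Promise<MatchResult> {
  const ruled = matchDeterministic(obs, catalog);
  if (ruled) return ruled;
  if (!aiMatch) return { itemId: null, method: "unmatched" };
  const itemId = await aiMatch(obs);
  if (itemId) return { itemId, method: "ai" };
  return { itemId: null, method: "unmatched" };
}
```

Because the fallback is optional, the same function still works when enrichment is switched off: undecided observations come back as `unmatched` instead of triggering a model call.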

How do guardrails keep AI spend from spiraling?

Every AI call sits behind kill switches, per-job budgets, and usage ledgers. If a model is rate-limited or a daily budget is exhausted, the system degrades gracefully — it skips enrichment and keeps the deterministic pipeline running — instead of retrying into a runaway bill. A spend cap that actually stops work is worth more than any clever model, because the failure mode you can't afford is the silent one.
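
A guard of this kind could be sketched as below. Everything here is an assumption made for illustration: the `GuardState`, `JobBudget`, and `LedgerEntry` shapes, the abstract cost units, and the choice to charge the budget on attempt. State is passed in as plain objects to keep the example self-contained; if the spend counter lived in KV, which is eventually consistent, it would be approximate, so a strict cap would need a strongly consistent store.

```typescript
interface GuardState {
  killSwitch: boolean; // for example, a flag read from KV
  dailyBudget: number; // cost units allowed per day
  spentToday: number;
}

interface JobBudget {
  jobId: string;
  limit: number;
  spent: number;
}

interface LedgerEntry {
  jobId: string;
  cost: number;
  outcome: "ok" | "failed";
}

type SkipReason = "kill_switch" | "daily_budget" | "job_budget" | "ai_error";

// Returns null when the call may proceed, otherwise the reason it must be skipped.
function checkGuards(state: GuardState, job: JobBudget, cost: number): SkipReason | null {
  if (state.killSwitch) return "kill_switch";
  if (state.spentToday + cost > state.dailyBudget) return "daily_budget";
  if (job.spent + cost > job.limit) return "job_budget";
  return null;
}

// Runs the enrichment only if every guard allows it; never retries on failure.
async function enrichOrSkip<T>(
  state: GuardState,
  job: JobBudget,
  ledger: LedgerEntry[],
  cost: number,
  input: T,
  enrich: (input: T) => Promise<T>,
): Promise<{ value: T; skipped: SkipReason | null }> {
  const blocked = checkGuards(state, job, cost);
  if (blocked) return { value: input, skipped: blocked };

  // Charge on attempt (an assumption): failed calls still consume budget,
  // so an erroring model cannot loop for free.
  state.spentToday += cost;
  job.spent += cost;
  try {
    const value = await enrich(input);
    ledger.push({ jobId: job.jobId, cost, outcome: "ok" });
    return { value, skipped: null };
  } catch {
    // Degrade gracefully: return the un-enriched input and keep the pipeline moving.
    ledger.push({ jobId: job.jobId, cost, outcome: "failed" });
    return { value: input, skipped: "ai_error" };
  }
}
```

The caller always gets a usable value back, plus a `skipped` reason it can log, so a blocked or failed enrichment is visible in the ledger and never silent.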

Why is cost control a product feature?

These constraints let Omnicost build a live catalog without taking on the costs of an enterprise data platform on day one. When the product itself depends on continuous data collection, tight cost discipline isn't ops hygiene; it's a product feature. It's what makes "always current" financially possible.

See how CostBot's cost discipline translates into predictable pricing for your own construction data pipeline.
