TopPrix.re is the deal-discovery platform we built with the TopPrix team
for Réunion Island. Bilingual, retailer-managed, growing weekly. From the consumer side it looks
like a clean search-and-browse surface over discounted products. From the engineering side, the
real product is the retailer-onboarding pipeline that keeps the consumer side fresh, with a human
in the loop only for the items the parser is unsure about.
This post is about that pipeline.
The framing: ingestion is the product
We've audited a lot of deal aggregators. The consumer-facing search is always the polished
surface. The supply side is always the soft underbelly — a scraping cron, an OCR script someone
wrote during the MVP, a Slack channel where the founder pastes screenshots of PDFs at 11pm.
This works at 20 retailers. It collapses at 200.
So the first architectural decision we made on TopPrix was that the retailer portal is the
product, and the ingestion pipeline is a first-class system with its own SLOs, observability and
human-review surface.
1. Two ingestion paths, depending on retailer maturity
Some retailers (the national chains) ship a weekly structured deals feed — CSV, JSON, or a feed
that resembles one if you squint. We integrate those directly.
Most retailers don't. They ship a PDF flyer or a JPEG of the printed weekly newsletter, often via
WhatsApp. For those, the ingestion pipeline does the work.
Path A — structured feed. Validate against schema → normalise units (FR locale, EUR) →
deduplicate against last week's items → write to canonical deal table → emit deal.ingested
event.
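The Path A steps can be sketched roughly as follows. This is a minimal illustration, not the production code: `FeedRow`, `NormalisedItem`, `parsePriceFr`, `normaliseRow` and `ingestFeed` are hypothetical names, the feed is assumed to carry French-formatted price strings such as "3,99 €", and the dedup key (retailer, SKU, price, end date) is an assumption about what counts as "the same item as last week".

```typescript
// Hypothetical shape of one row in a retailer's structured feed.
type FeedRow = { sku: string; name: string; price: string; validFrom: string; validTo: string };

// Hypothetical normalised item, ready for the canonical deal table.
type NormalisedItem = {
  key: string; retailerId: string; name: string;
  priceEur: number; validFrom: string; validTo: string;
};

type IngestedEvent = { type: "deal.ingested"; key: string };

// Parse a French-locale price such as "1 299,50 €" into euros; null if malformed.
function parsePriceFr(raw: string): number | null {
  const cleaned = raw.replace(/[\s€]/g, "").replace(",", ".");
  return /^\d+(\.\d{1,2})?$/.test(cleaned) ? Number(cleaned) : null;
}

// Validate one row against the (assumed) schema and normalise its price.
function normaliseRow(row: FeedRow): { priceEur: number } | { errors: string[] } {
  const errors: string[] = [];
  if (row.sku.trim() === "") errors.push("missing sku");
  if (row.name.trim() === "") errors.push("missing name");
  const priceEur = parsePriceFr(row.price);
  if (priceEur === null) errors.push(`unparseable price: ${row.price}`);
  if (Number.isNaN(Date.parse(row.validFrom)) || Number.isNaN(Date.parse(row.validTo))) {
    errors.push("invalid validity dates");
  }
  return errors.length === 0 && priceEur !== null ? { priceEur } : { errors };
}

// Validate -> normalise -> deduplicate -> write -> emit, for one retailer's feed.
function ingestFeed(
  retailerId: string,
  rows: FeedRow[],
  lastWeekKeys: Set<string>,
  emit: (event: IngestedEvent) => void,
) {
  const written: NormalisedItem[] = [];
  const rejected: { row: FeedRow; errors: string[] }[] = [];
  const seen = new Set<string>(lastWeekKeys);
  let duplicates = 0;

  for (const row of rows) {
    const result = normaliseRow(row);
    if ("errors" in result) {
      rejected.push({ row, errors: result.errors });
      continue;
    }
    // Assumed dedup key: same retailer, SKU, price and end date means same deal.
    const key = [retailerId, row.sku.trim(), result.priceEur.toFixed(2), row.validTo].join(":");
    if (seen.has(key)) {
      duplicates += 1;
      continue;
    }
    seen.add(key);
    written.push({
      key, retailerId, name: row.name.trim(),
      priceEur: result.priceEur, validFrom: row.validFrom, validTo: row.validTo,
    });
    emit({ type: "deal.ingested", key });
  }
  return { written, rejected, duplicates };
}
```

Returning rejected rows alongside written ones, instead of throwing on the first bad row, means a single malformed line does not block the rest of a retailer's feed.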
Path B — flyer upload. OCR (Tesseract + a custom layout parser) → spatial parsing to
group price/product/discount blocks → LLM-assisted extraction with a structured output schema
→ confidence scoring per item → high-confidence items go straight to the canonical table;
low-confidence items queue for human review in the retailer portal.
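The last step of Path B, routing by confidence, might look something like the sketch below. The `ExtractedItem` shape, the `routeByConfidence` name and the 0.85 cutoff are all assumptions for illustration; the post does not state the actual threshold. The extra price sanity checks are also an assumption about what a reasonable pipeline would send to review regardless of score.

```typescript
// Hypothetical output of the extraction step for one flyer item.
type ExtractedItem = {
  name: string;
  priceCurrent: number;
  priceOriginal?: number;
  confidence: number; // 0..1, as reported by the extractor
};

type Routing = { publish: ExtractedItem[]; review: ExtractedItem[] };

// Assumed cutoff; the real value would be tuned against review outcomes.
const REVIEW_THRESHOLD = 0.85;

// An item needs review if its score is low or its prices look implausible.
function needsReview(item: ExtractedItem, threshold: number): boolean {
  if (!(item.confidence >= threshold)) return true; // also catches NaN scores
  if (!(item.priceCurrent > 0)) return true;
  if (item.priceOriginal !== undefined && item.priceOriginal < item.priceCurrent) return true;
  return false;
}

// Split extracted items into straight-to-publish and review-queue buckets.
function routeByConfidence(items: ExtractedItem[], threshold = REVIEW_THRESHOLD): Routing {
  const routing: Routing = { publish: [], review: [] };
  for (const item of items) {
    (needsReview(item, threshold) ? routing.review : routing.publish).push(item);
  }
  return routing;
}
```

Sending implausible prices to review even when the score is high keeps an overconfident extractor from publishing a "discount" where the promotional price exceeds the original.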
The pipeline reports its own per-step accuracy in a dashboard the team checks every Monday.
2. The canonical deal model
Every successfully ingested item lands as a Deal:
type Deal = {
  id: string;
  retailer_id: string;
  product: { name: string; brand?: string; category: CategoryId; image?: string };
  price: { current: number; original?: number; currency: "EUR" };
  discount: { kind: "percent" | "flat" | "bogo" | "bundle"; value: number };
  validity: { from: ISODate; to: ISODate };
  flyer_id?: string; // origin flyer for provenance
  source_confidence: number; // 0..1 from the extractor
  locales: { fr: Localised; en: Localised };
};
That locales field is the source of TopPrix's bilingual-by-default behaviour — every
deal carries both languages from the moment it's ingested. EN is generated from FR (the source
locale) via a small translation pass with brand and product-name dictionary lookups to avoid
the kind of bad translations that make a deals site look amateur.
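One way to combine a translation pass with dictionary lookups is sketched below. It is an illustration under assumptions, not the TopPrix implementation: the `Localised` shape is guessed (the post does not define it), `localiseName` is a hypothetical helper, and the translator is injected as a plain function so any FR to EN service could stand behind it. Brand names are masked with placeholder tokens before translation and restored afterwards, so the translator never gets the chance to rewrite them.

```typescript
// Assumed shape; the post does not spell out what Localised contains.
type Localised = { name: string };

// Stand-in for the translation pass; any FR -> EN function can be plugged in.
type Translator = (frText: string) => string;

function localiseName(
  frName: string,
  brands: string[],
  productDict: Map<string, string>, // assumed: lowercased FR name -> curated EN name
  translate: Translator,
): { fr: Localised; en: Localised } {
  const fr: Localised = { name: frName };

  // 1. A curated product-name entry wins over machine translation.
  const known = productDict.get(frName.trim().toLowerCase());
  if (known !== undefined) return { fr, en: { name: known } };

  // 2. Mask brand names so the translator cannot alter them.
  let masked = frName;
  const protectedBrands: string[] = [];
  for (const brand of brands) {
    if (brand !== "" && masked.includes(brand)) {
      masked = masked.split(brand).join(`__B${protectedBrands.length}__`);
      protectedBrands.push(brand);
    }
  }

  // 3. Translate the masked text, then put the brands back.
  let enName = translate(masked);
  protectedBrands.forEach((brand, i) => {
    enName = enName.split(`__B${i}__`).join(brand);
  });
  return { fr, en: { name: enName } };
}
```

The sketch assumes the translator passes the `__B0__`-style tokens through unchanged; a real translation service may need a different masking scheme.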
3. The retailer portal is the front door
The portal is the part most aggregators don't ship. Retailers can:
- Upload this week's flyer (PDF or image).
- See the parser's output, with confidence flags.
- Correct anything wrong, in a single screen, before publishing.
- Schedule a promotion to start at a specific time.
- See their items' engagement — views, click-throughs, top performers.
It's deliberately not pretty. It's deliberately fast. A retailer in Saint-Pierre on a Sunday night
typically gets their flyer up in under seven minutes.
We measured this: internal time-to-publish across all retailers averages 6m 22s, down from 35m+
on the first iteration, which used a generic admin tool.
4. Observability across the pipeline
Every step of the ingestion pipeline emits a trace, and the team watches a single Grafana board that shows:
- Retailers active this week.
- Items ingested per retailer, per day.
- High-confidence vs review-required item ratio.
- End-to-end latency from upload to live.
- Per-retailer error rate, with the actual error.
A retailer's flyer that suddenly drops in parse quality (logo changed, layout updated, fonts
swapped) is the team's signal to update the layout parser for that retailer — not the consumer's
problem.
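A drop in parse quality for one retailer can be flagged with a simple comparison against that retailer's own recent history. The sketch below assumes a hypothetical `WeeklyStats` record per retailer and a 15-point drop in the high-confidence ratio as the alert condition; neither the record shape nor the threshold comes from the post.

```typescript
// Hypothetical weekly roll-up of extraction outcomes for one retailer.
type WeeklyStats = {
  retailerId: string;
  week: string; // e.g. "2024-W03"
  highConfidence: number; // items published without review
  reviewRequired: number; // items queued for human review
};

// Share of items that cleared the confidence bar; 0 when nothing was ingested.
function highConfidenceRatio(s: WeeklyStats): number {
  const total = s.highConfidence + s.reviewRequired;
  return total === 0 ? 0 : s.highConfidence / total;
}

// True when this week's ratio falls more than maxDrop below the retailer's
// average over the supplied history. maxDrop = 0.15 is an assumed default.
function detectQualityDrop(history: WeeklyStats[], current: WeeklyStats, maxDrop = 0.15): boolean {
  if (current.highConfidence + current.reviewRequired === 0) return false; // nothing uploaded
  const past = history.filter(
    (h) => h.retailerId === current.retailerId && h.highConfidence + h.reviewRequired > 0,
  );
  if (past.length === 0) return false; // no baseline yet for a new retailer
  const baseline = past.reduce((sum, h) => sum + highConfidenceRatio(h), 0) / past.length;
  return baseline - highConfidenceRatio(current) > maxDrop;
}
```

Comparing each retailer against its own baseline rather than a global average matters because flyer quality varies widely: a ratio that is normal for a photographed newsletter would be alarming for a clean PDF.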
5. The numbers
- 240+ retailers indexed.
- ~6m 22s average retailer time-to-publish.
- 91% of retailer-uploaded items reach live without human review.
- 0 weekly ingestion failures since the third week of production.
The consumer search experience is what the user sees. The retailer portal is what the business
runs on.
Read the TopPrix engagement for the full case study.