Inside TopPrix: a structured ingestion pipeline for weekly retailer flyers

TopPrix.re is the deal-discovery platform we built with the TopPrix team for Réunion Island. Bilingual, retailer-managed, growing weekly. From the consumer side it looks like a clean search-and-browse surface over discounted products. From the engineering side, the real product is the retailer-onboarding pipeline that keeps the consumer side fresh with as little manual work as possible: most items publish automatically, and only low-confidence ones wait for human review.

This post is about that pipeline.

The framing: ingestion is the product

We've audited a lot of deal aggregators. The consumer-facing search is always the polished surface. The supply side is always the soft underbelly — a scraping cron, an OCR script someone wrote during the MVP, a Slack channel where the founder pastes screenshots of PDFs at 11pm.

This works at 20 retailers. It collapses at 200.

So the first architectural decision we made on TopPrix was that the retailer portal is the product, and the ingestion pipeline is a first-class system with its own SLOs, observability and human-review surface.

1. Two ingestion paths, depending on retailer maturity

Some retailers (the national chains) ship a weekly structured deals feed — CSV, JSON, or a feed that resembles one if you squint. We integrate those directly.

Most retailers don't. They ship a PDF flyer or a JPEG of the printed weekly newsletter, often via WhatsApp. For those, the ingestion pipeline does the work.

Path A — structured feed. Validate against schema → normalise units (FR locale, EUR) → deduplicate against last week's items → write to canonical deal table → emit deal.ingested event.
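To make Path A concrete, here is a minimal sketch of the validate, normalise and deduplicate steps. It is not TopPrix's code: the FeedRow shape, the function names and the dedup rule (same SKU, price and validity window) are assumptions chosen for illustration, and the write and event-emit steps are left out.

```typescript
// Hypothetical raw feed row. Field names are assumptions, not TopPrix's schema.
type FeedRow = { sku: string; name: string; price: string; validFrom: string; validTo: string };
type NormalisedRow = { sku: string; name: string; priceEur: number; validFrom: string; validTo: string };

// Parse an FR-locale price such as "1 299,90 €" into euros as a number.
function parseFrPrice(raw: string): number {
  const cleaned = raw
    .replace(/\s/g, "")      // \s also matches no-break spaces in JavaScript
    .replace(/€|EUR/gi, "")  // drop the currency marker
    .replace(",", ".");      // FR decimal comma -> decimal point
  if (!/^\d+(\.\d{1,2})?$/.test(cleaned)) {
    throw new Error(`Unparseable price: ${raw}`);
  }
  return Number(cleaned);
}

// Validate one row and normalise it; throws with a readable reason on failure.
function normaliseRow(row: FeedRow): NormalisedRow {
  const name = row.name.trim();
  if (row.sku === "" || name === "") throw new Error("Missing sku or name");
  // Assumes ISO dates (YYYY-MM-DD), which sort correctly as plain strings.
  if (row.validFrom > row.validTo) throw new Error("validFrom is after validTo");
  return { sku: row.sku, name, priceEur: parseFrPrice(row.price), validFrom: row.validFrom, validTo: row.validTo };
}

// Assumed dedup rule: same SKU, price and validity window means "already ingested".
function dedupKey(r: NormalisedRow): string {
  return [r.sku, r.priceEur.toFixed(2), r.validFrom, r.validTo].join("|");
}

function ingestFeed(rows: FeedRow[], alreadySeen: Set<string>) {
  const accepted: NormalisedRow[] = [];
  const rejected: { row: FeedRow; reason: string }[] = [];
  let duplicates = 0;
  const seen = new Set(alreadySeen); // copy, so the caller's set is not mutated
  for (const row of rows) {
    try {
      const normalised = normaliseRow(row);
      const key = dedupKey(normalised);
      if (seen.has(key)) {
        duplicates += 1;
        continue;
      }
      seen.add(key);
      accepted.push(normalised);
    } catch (err) {
      rejected.push({ row, reason: err instanceof Error ? err.message : String(err) });
    }
  }
  return { accepted, rejected, duplicates };
}
```

Rejected rows keep their reason, so a bad feed can be reported per row instead of failing the whole batch.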

Path B — flyer upload. OCR (Tesseract + a custom layout parser) → spatial parsing to group price/product/discount blocks → LLM-assisted extraction with a structured output schema → confidence scoring per item → high-confidence items go straight to the canonical table; low-confidence items queue for human review in the retailer portal.
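The last step of Path B, routing by confidence, can be sketched as follows. The ExtractedItem shape, the weakest-link scoring rule and the 0.9 threshold are all assumptions for illustration; the real scoring may combine signals differently.

```typescript
// Hypothetical extractor output; field names and scores are assumptions.
type ExtractedItem = {
  name: string;
  price: number | null;
  fieldConfidence: { name: number; price: number; discount: number }; // each 0..1
};

// Assumed value; in practice this would be tuned against review outcomes.
const AUTO_PUBLISH_THRESHOLD = 0.9;

// Weakest-link scoring: an item is only as trustworthy as its least certain field.
function itemConfidence(item: ExtractedItem): number {
  // A missing core field always sends the item to review.
  if (item.price === null || item.name.trim() === "") return 0;
  const { name, price, discount } = item.fieldConfidence;
  return Math.min(name, price, discount);
}

// Split extracted items into those that publish directly and those queued for review.
function routeItems(items: ExtractedItem[], threshold: number = AUTO_PUBLISH_THRESHOLD) {
  const publish: ExtractedItem[] = [];
  const review: ExtractedItem[] = [];
  for (const item of items) {
    if (itemConfidence(item) >= threshold) {
      publish.push(item);
    } else {
      review.push(item);
    }
  }
  return { publish, review };
}
```

Taking the minimum rather than the average means one badly read price cannot hide behind a well-read product name.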

The pipeline reports its own per-step accuracy in a dashboard the team checks every Monday.

2. The canonical deal model

Every successfully ingested item lands as a Deal:

type Deal = {
  id: string;
  retailer_id: string;
  product: { name: string; brand?: string; category: CategoryId; image?: string };
  price: { current: number; original?: number; currency: "EUR" };
  discount: { kind: "percent" | "flat" | "bogo" | "bundle"; value: number };
  validity: { from: ISODate; to: ISODate };
  flyer_id?: string;           // origin flyer for provenance
  source_confidence: number;   // 0..1 from the extractor
  locales: { fr: Localised; en: Localised };
};

That locales field is the source of TopPrix's bilingual-by-default behaviour: every deal carries both languages from the moment it's ingested. FR is the source locale. EN is generated from it by a small translation pass that looks up brand and product names in a dictionary, so they are not mistranslated in the way that makes a deals site look amateur.
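One common way to keep brand names out of a translator's reach is to mask them before translation and restore them afterwards. The sketch below shows that technique; it is an assumption about how such a pass could work, not a description of TopPrix's implementation. The translate parameter stands in for whatever translation backend is used, and the placeholder format is made up.

```typescript
// Stand-in for any FR -> EN translation backend (hypothetical).
type Translate = (frText: string) => string;

// Escape a literal string so it can be used inside a RegExp.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Mask protected terms with placeholders, translate, then restore the terms.
function translateWithProtectedTerms(fr: string, protectedTerms: string[], translate: Translate): string {
  const found: string[] = [];
  // Longest first, so "Coca-Cola Zero" is masked before "Coca-Cola".
  const sorted = [...protectedTerms].sort((a, b) => b.length - a.length);
  let masked = fr;
  for (const term of sorted) {
    const pattern = new RegExp(escapeRegExp(term), "gi");
    masked = masked.replace(pattern, (match: string) => {
      found.push(match); // keep the text exactly as the retailer wrote it
      return `__T${found.length - 1}__`;
    });
  }
  const translated = translate(masked);
  // Assumes the translator leaves the __T<n>__ placeholders untouched.
  return translated.replace(/__T(\d+)__/g, (_whole: string, index: string) => found[Number(index)] ?? "");
}
```

The approach depends on the translator passing placeholders through unchanged, which is worth verifying for any real backend before relying on it.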

3. The retailer portal is the front door

The portal is the part most aggregators don't ship. Retailers can:

Upload this week's flyer (PDF or image).
See the parser's output, with confidence flags.
Correct anything wrong, in a single screen, before publishing.
Schedule a promotion to start at a specific time.
See their items' engagement — views, click-throughs, top performers.

It's deliberately not pretty. It's deliberately fast. A retailer in Saint-Pierre on a Sunday night gets their flyer up in about seven minutes.

We measured this: time-to-publish across all retailers averages 6m 22s, down from 35m+ on the first iteration, which used a generic admin tool.

4. Observability across the pipeline

Every step of the ingestion pipeline emits a trace, and the team has one Grafana board:

Retailers active this week.
Items ingested per retailer, per day.
High-confidence vs review-required item ratio.
End-to-end latency from upload to live.
Per-retailer error rate, with the actual error.

When a retailer's flyer suddenly drops in parse quality (logo changed, layout updated, fonts swapped), that is the team's signal to update the layout parser for that retailer. It should not become the consumer's problem.
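A drop in parse quality can be caught mechanically by comparing this week's review-required ratio with the retailer's recent baseline. The sketch below is one simple way to do that; the WeeklyStats shape and the 0.2 jump threshold are assumptions, not values from the TopPrix dashboard.

```typescript
// Hypothetical per-retailer weekly counters, as a dashboard query might return them.
type WeeklyStats = { retailerId: string; itemsTotal: number; itemsNeedingReview: number };

// Share of items that needed human review; 0 when nothing was ingested.
function reviewRatio(stats: WeeklyStats): number {
  return stats.itemsTotal === 0 ? 0 : stats.itemsNeedingReview / stats.itemsTotal;
}

// Flag a retailer whose review ratio jumps well above its trailing average.
function hasParseRegression(history: WeeklyStats[], current: WeeklyStats, maxIncrease: number = 0.2): boolean {
  if (history.length === 0) return false; // no baseline yet, nothing to compare against
  const baseline = history.map(reviewRatio).reduce((sum, r) => sum + r, 0) / history.length;
  return reviewRatio(current) - baseline > maxIncrease;
}
```

A check like this can run after each upload and alert the team before the retailer has to report anything.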

5. The numbers

240+ retailers indexed.
~6m 22s average retailer time-to-publish.
91% of retailer-uploaded items reach live without human review.
0 weekly ingestion failures since the third week of production.

The consumer search experience is what the user sees. The retailer portal is what the business runs on.

Read the TopPrix engagement for the full case study.