Inside Lexly: jurisdiction-aware contract generation with structured outputs

Lexly is live in production at lexlyai.com, generating legally binding contracts across six jurisdictions in under two minutes. The product question is simple — type a brief, get a contract. The engineering question, less so: how do you build an LLM pipeline a lawyer will actually put their name next to, across legal frameworks that disagree about which clauses are enforceable in which order?

This post is the architecture answer.

The framing: don't ask the model to invent the law

The mistake we kept seeing in audits before we started this build was treating contract drafting as a pure text-generation problem: type intent, generate document. That produces plausible English. It does not produce contracts a regulator will respect.

We framed it as a retrieval-and-assembly problem instead. The pipeline routes the user's brief to a jurisdiction-specific clause library, retrieves the right clauses, and asks the model only to map the user's intent onto clauses the legal team has already vetted. The model is never asked to invent a clause. The model is asked to choose, parameterise and fluentise.

1. Jurisdiction routing happens before the LLM sees anything

The first system in the pipeline is a deterministic router. Inputs: the user's jurisdiction, the contract type, the parties' residencies, and any governing-law clauses they explicitly opted into. Output: a library namespace (e.g. eIDAS/employment/freelance/v2.3.1) that pins the pipeline to a specific clause set.

Six jurisdictions live today: ESIGN Act (US), eIDAS (EU), UK e-signature regs, IT Act 2000 (India), ICA 1872 (India), ETA (Singapore). Each has its own clause library, its own validation rules, its own e-signature compliance constraints.

Adding a new jurisdiction is a content change, not a code change. We ship YAML.
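
To make the routing step concrete, here is a minimal sketch of a deterministic router. It is an illustration under assumptions, not Lexly's implementation: the RoutingInput shape, the ROUTES table and routeToNamespace are hypothetical names, the table is hard-coded where the real system would presumably load it from the YAML content, and the residency and opt-in inputs are left out for brevity.

```typescript
// Hypothetical sketch: the same inputs always yield the same library namespace.
type RoutingInput = {
  jurisdiction: string; // e.g. "eIDAS"
  contractType: string; // e.g. "employment/freelance"
};

// Illustrative routing table (assumption); in practice this would come from content files.
const ROUTES: Record<string, Record<string, string>> = {
  eIDAS: { "employment/freelance": "v2.3.1" },
};

function routeToNamespace(input: RoutingInput): string {
  const byType = ROUTES[input.jurisdiction];
  if (byType === undefined) {
    throw new Error(`Unsupported jurisdiction: ${input.jurisdiction}`);
  }
  const version = byType[input.contractType];
  if (version === undefined) {
    throw new Error(`No clause library for ${input.contractType} under ${input.jurisdiction}`);
  }
  // Pin the rest of the pipeline to one versioned clause set.
  return `${input.jurisdiction}/${input.contractType}/${version}`;
}
```

Because the router is a plain lookup that throws on anything unknown, an unsupported jurisdiction fails before any model call is made.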

2. Retrieval is hybrid, and the index is structured

Each clause library is structured — not free-text. Every clause is an object:

type Clause = {
  id: string;          // e.g. "noncompete/employee/standard/v1.4"
  title: string;
  body: string;        // legal text
  jurisdictionTags: string[];
  applicableTo: ClauseContext[];
  variables: VariableDecl[];   // ${employee_name}, ${term_months}, ...
  conflicts: string[];          // other clause ids that conflict
  requires: string[];           // other clause ids that must be included
};

The retrieval layer is hybrid: dense vectors (bge-large embeddings) for the "what is this about" question, BM25 for exact regulatory references, and a structured filter on jurisdiction tags. A re-ranker scores the top 30 candidates against the brief.

Dense-only retrieval missed the long tail of regulatory edge cases, where an exact regulatory reference matters more than semantic similarity. Hybrid retrieval recovered them.
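
The post does not say how the dense and BM25 results are merged. One common option is reciprocal rank fusion (RRF), sketched below under that assumption. The names fuseRankings and IndexedClause, and the smoothing constant k = 60, are illustrative choices rather than details of the production system. Both retrievers are assumed to return clause ids ordered best first.

```typescript
// Hypothetical sketch: jurisdiction filter plus reciprocal rank fusion of two ranked lists.
type IndexedClause = { id: string; jurisdictionTags: string[] };

function fuseRankings(
  clauses: IndexedClause[],
  denseRanked: string[], // clause ids, best first, from the vector index
  bm25Ranked: string[],  // clause ids, best first, from BM25
  jurisdiction: string,
  topN: number,
  k = 60,                // RRF smoothing constant (an assumed, conventional value)
): string[] {
  // Structured filter: only clauses tagged for this jurisdiction are eligible.
  const eligible = clauses
    .filter((c) => c.jurisdictionTags.indexOf(jurisdiction) !== -1)
    .map((c) => c.id);

  const scores: Record<string, number> = {};
  for (const ranking of [denseRanked, bm25Ranked]) {
    ranking.forEach((id, rank) => {
      if (eligible.indexOf(id) === -1) return;
      // Each list contributes 1 / (k + position), with positions starting at 1.
      scores[id] = (scores[id] || 0) + 1 / (k + rank + 1);
    });
  }

  // Highest fused score first; ties are broken by id so the output is deterministic.
  return Object.keys(scores)
    .sort((a, b) => scores[b] - scores[a] || (a < b ? -1 : a > b ? 1 : 0))
    .slice(0, topN);
}
```

In a setup like the one described, the fused list (with topN set to 30) would be what the re-ranker scores against the brief.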

3. The LLM call is structured output, not free text

The model never returns prose. It returns a structured ContractDraft with selected_clauses, variable_assignments, an argument field explaining the choice, and open_questions for anything the user should confirm. The draft is schema-validated with Zod before it reaches the next stage.

import { z } from "zod";

const ContractDraft = z.object({
  selected_clauses: z.array(z.string()),                   // clause ids, must exist in the library
  variable_assignments: z.record(z.string(), z.string()),  // ${employee_name} → "Jane Singh"
  argument: z.string().min(40).max(800),                   // human-readable reasoning
  open_questions: z.array(z.string()),                     // anything the model wants the user to confirm
});

If validation fails, the model gets one retry with the schema error injected as feedback. A second failure escalates the brief to a human review surface. We track the escalation rate; it currently sits below 1.4%.
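
The retry policy can be sketched as a small loop. This is a simplified illustration: draftWithRetry, callModel and validate are hypothetical names, the validator is passed in as a function rather than tied to Zod, and the loop is synchronous for brevity where a real model call would be asynchronous.

```typescript
// Hypothetical sketch of validate-then-retry: one retry, then escalate to human review.
type ValidationResult<T> =
  | { kind: "valid"; value: T }
  | { kind: "invalid"; error: string };

type Outcome<T> =
  | { status: "accepted"; draft: T; attempts: number }
  | { status: "escalated"; errors: string[] };

function draftWithRetry<T>(
  callModel: (feedback: string | null) => unknown, // stand-in for the LLM call
  validate: (raw: unknown) => ValidationResult<T>, // stand-in for schema validation
  maxAttempts = 2,                                 // first attempt plus one retry
): Outcome<T> {
  const errors: string[] = [];
  let feedback: string | null = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = validate(callModel(feedback));
    if (result.kind === "valid") {
      return { status: "accepted", draft: result.value, attempts: attempt };
    }
    errors.push(result.error);
    feedback = result.error; // inject the schema error into the next attempt
  }
  // Every attempt failed validation: hand the case to a human reviewer.
  return { status: "escalated", errors };
}
```

Returning the collected errors on escalation gives the reviewer the same feedback the model saw, and counting escalated outcomes is one way to derive the tracked rate.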

4. Assembly is deterministic

Once the structured output validates, the assembly layer takes over. It looks up each selected_clauses id in the library, substitutes variables, runs the conflict checker, and emits the final document. No prose is generated outside the clause library.

This step is what lets a lawyer audit the output. Every clause in the final contract has a provenance: a library id, a version, the human author who vetted it. The risk score is derived from the clause metadata, not generated.
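
A minimal sketch of what deterministic assembly could look like, assuming clauses carry the conflicts and requires fields from the Clause type above. The assemble function and the LibraryClause shape are hypothetical, and the order of checks here (relationships first, then substitution) is an arbitrary choice for the example.

```typescript
// Hypothetical sketch of deterministic assembly over a vetted clause library.
type LibraryClause = {
  id: string;
  body: string;        // legal text, may contain ${variable} placeholders
  conflicts: string[]; // clause ids that must not appear alongside this one
  requires: string[];  // clause ids that must appear alongside this one
};

function assemble(
  library: Record<string, LibraryClause>,
  selected: string[],
  variables: Record<string, string>,
): string {
  // Look up every selected id; an unknown id is an error, never a generated clause.
  const clauses = selected.map((id) => {
    const clause = library[id];
    if (clause === undefined) throw new Error(`Unknown clause id: ${id}`);
    return clause;
  });

  for (const clause of clauses) {
    for (const other of clause.conflicts) {
      if (selected.indexOf(other) !== -1) {
        throw new Error(`${clause.id} conflicts with ${other}`);
      }
    }
    for (const needed of clause.requires) {
      if (selected.indexOf(needed) === -1) {
        throw new Error(`${clause.id} requires ${needed}`);
      }
    }
  }

  // Substitute placeholders; an unassigned variable fails loudly instead of being guessed.
  return clauses
    .map((clause) =>
      clause.body.replace(/\$\{(\w+)\}/g, (_match: string, name: string) => {
        const value = variables[name];
        if (value === undefined) throw new Error(`Unassigned variable: ${name}`);
        return value;
      }),
    )
    .join("\n\n");
}
```

Every string in the output comes from either a library clause body or a variable assignment, which is what keeps the result traceable clause by clause.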

5. The eval harness runs against historical contracts

Every PR runs the eval. The dataset is 220 historical contracts spanning all six jurisdictions, each labelled by an external lawyer with the correct clauses and risk flags. The harness scores three dimensions:

Clause selection — did we select the right ids?
Variable accuracy — did we fill the right fields with the right values?
Risk surface — did we surface every flag a human reviewer would?

A regression on any of these is a red build. New jurisdictions can't ship until they hit 92%+ on clause selection and 100% on the regulatory must-includes.
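
The release gate can be sketched as follows. The post gives the thresholds but not the scoring formula, so this example assumes clause selection is scored as the mean Jaccard overlap between predicted and labelled clause ids. EvalCase, jaccard and gate are hypothetical names, and the variable-accuracy and risk-surface dimensions are omitted.

```typescript
// Hypothetical sketch of a release gate; the scoring formula is an assumption.
type EvalCase = {
  expected: string[];    // clause ids labelled by the reviewing lawyer
  predicted: string[];   // clause ids the pipeline selected
  mustInclude: string[]; // regulatory must-include clause ids
};

// Jaccard overlap between two id lists (assumes no duplicate ids within a list).
function jaccard(a: string[], b: string[]): number {
  const shared = a.filter((id) => b.indexOf(id) !== -1).length;
  const union = a.length + b.length - shared;
  return union === 0 ? 1 : shared / union;
}

function gate(cases: EvalCase[], minSelection = 0.92) {
  if (cases.length === 0) throw new Error("Eval set is empty");
  const selection =
    cases.reduce((sum, c) => sum + jaccard(c.expected, c.predicted), 0) / cases.length;
  // Must-includes are all-or-nothing: one missing id anywhere fails the gate.
  const mustIncludeOk = cases.every((c) =>
    c.mustInclude.every((id) => c.predicted.indexOf(id) !== -1),
  );
  return { selection, mustIncludeOk, pass: selection >= minSelection && mustIncludeOk };
}
```

Keeping the must-include check separate from the averaged score means a high selection score cannot mask a single missing regulatory clause.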

The combined effect: a system where the engineering team can ship a new jurisdiction in a sprint, a lawyer can audit the output in minutes, and the user gets a contract that holds up under real scrutiny. The retrieval-and-assembly framing is what makes all three true at once.
