Lexly is live in production at lexlyai.com, generating legally binding
contracts across six jurisdictions in under two minutes. The product question is simple — type a
brief, get a contract. The engineering question, less so: how do you build an LLM pipeline a lawyer
will actually put their name next to, across legal frameworks that disagree about which clauses are
enforceable in which order?
This post is the architecture answer.
The framing: don't ask the model to invent the law
The mistake we kept seeing in audits before we started this build was treating contract generation
as a generation problem. Type intent, generate document. It produces plausible English. It does
not produce contracts a regulator will respect.
We framed it as a retrieval + assembly problem instead. The pipeline routes the user's brief
to a jurisdiction-specific clause library, retrieves the right clauses, and asks the model only to
map the user's intent onto a library the legal team has already vetted. The model is
never asked to invent a clause. The model is asked to choose, parameterise and fluentise.
1. Jurisdiction routing happens before the LLM sees anything
The first system in the pipeline is a deterministic router. Inputs: the user's jurisdiction, the
contract type, the parties' residencies, and any governing-law clauses they explicitly opted into.
Output: a library namespace (e.g. eIDAS/employment/freelance/v2.3.1) that pins the pipeline to a
specific clause set.
Six jurisdictions live today: ESIGN Act (US), eIDAS (EU), UK e-signature regs, IT Act 2000 (India),
ICA 1872 (India), ETA (Singapore). Each has its own clause library, its own validation rules, its
own e-signature compliance constraints.
Adding a new jurisdiction is a content change, not a code change. We ship YAML.
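To make the routing step concrete, here is a minimal TypeScript sketch of a deterministic router. The names RoutingInput, LIBRARY_VERSIONS and routeToNamespace are hypothetical, not Lexly's code. The inlined table stands in for the YAML content, and the sketch leaves out party residencies and opted-in clauses for brevity.

```typescript
// Hypothetical input shape: the post lists the inputs but not their types.
type RoutingInput = {
  jurisdiction: string; // e.g. "eIDAS"
  contractType: string; // e.g. "employment/freelance"
};

// Assumption: in production this table would be loaded from YAML.
// It is inlined here so the sketch runs on its own.
const LIBRARY_VERSIONS: Record<string, Record<string, string>> = {
  eIDAS: { "employment/freelance": "v2.3.1" },
};

// Pure lookup: the same input always yields the same namespace, and
// unsupported combinations fail loudly before any model call is made.
function routeToNamespace(input: RoutingInput): string {
  const versions = LIBRARY_VERSIONS[input.jurisdiction];
  if (!versions) {
    throw new Error(`Unsupported jurisdiction: ${input.jurisdiction}`);
  }
  const version = versions[input.contractType];
  if (!version) {
    throw new Error(`No library for contract type: ${input.contractType}`);
  }
  return `${input.jurisdiction}/${input.contractType}/${version}`;
}
```

Under these assumptions, adding a jurisdiction means adding a table entry; the routing function itself does not change.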
2. Retrieval is hybrid, and the index is structured
Each clause library is structured — not free-text. Every clause is an object:
type Clause = {
  id: string;                // e.g. "noncompete/employee/standard/v1.4"
  title: string;
  body: string;              // legal text
  jurisdictionTags: string[];
  applicableTo: ClauseContext[];
  variables: VariableDecl[]; // ${employee_name}, ${term_months}, ...
  conflicts: string[];       // other clause ids that conflict
  requires: string[];        // other clause ids that must be included
};
The retrieval layer is hybrid: dense vectors (bge-large embeddings) for the "what is this about"
question, BM25 for exact regulatory references, and a structured filter on jurisdiction tags. A
re-ranker scores the top 30 candidates against the brief.
Dense-only retrieval failed the long tail of regulatory edge cases. Hybrid recovered them.
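The post does not say how the dense and BM25 rankings are combined. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as Lexly's method. The names Candidate, fuseRankings and hybridCandidates are hypothetical, and the two input rankings are assumed to come from an existing vector index and BM25 index.

```typescript
// Hypothetical minimal view of a clause for filtering purposes.
type Candidate = { id: string; jurisdictionTags: string[] };

// Reciprocal rank fusion: each ranking contributes 1 / (k + rank) per id.
// k = 60 is a conventional default, not a value taken from the post.
function fuseRankings(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Apply the structured jurisdiction filter first, then fuse the two rankings
// and keep the top N ids for the re-ranker.
function hybridCandidates(
  clauses: Candidate[],
  denseRanking: string[],
  bm25Ranking: string[],
  jurisdiction: string,
  topN = 30,
): string[] {
  const allowed = new Set(
    clauses.filter((c) => c.jurisdictionTags.includes(jurisdiction)).map((c) => c.id),
  );
  const keep = (ids: string[]) => ids.filter((id) => allowed.has(id));
  return fuseRankings([keep(denseRanking), keep(bm25Ranking)]).slice(0, topN);
}
```

Rank-based fusion avoids having to calibrate cosine similarities against BM25 scores, which live on different scales.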
3. The LLM call is structured output, not free text
The model never returns prose. It returns a structured ContractDraft with selected_clauses,
variable_assignments, and an argument field explaining the choice. Schema-validated with Zod
before it hits the next stage.
import { z } from "zod";

const ContractDraft = z.object({
  selected_clauses: z.array(z.string()),                  // clause ids, must exist in the library
  variable_assignments: z.record(z.string(), z.string()), // ${employee_name} → "Jane Singh"
  argument: z.string().min(40).max(800),                  // human-readable reasoning
  open_questions: z.array(z.string()),                    // anything the model wants the user to confirm
});
If validation fails, the model gets one retry with the schema error injected as feedback. Two
consecutive failures escalate to a human review surface. We track the escalation rate; it
currently sits below 1.4%.
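The retry-then-escalate control flow can be sketched roughly as follows. This is an illustration under assumptions, not the production code: draftWithRetry, Validation and Outcome are hypothetical names, and the model call and validator are passed in as functions so the sketch stays independent of any particular LLM client or schema library.

```typescript
// Result of validating one raw model response.
type Validation<T> =
  | { kind: "ok"; value: T }
  | { kind: "error"; error: string };

// Final outcome: either an accepted draft or an escalation to human review.
type Outcome<T> =
  | { status: "accepted"; draft: T; attempts: number }
  | { status: "escalated"; errors: string[] };

async function draftWithRetry<T>(
  callModel: (feedback?: string) => Promise<unknown>,
  validate: (raw: unknown) => Validation<T>,
  maxAttempts = 2, // first attempt plus one retry
): Promise<Outcome<T>> {
  const errors: string[] = [];
  let feedback: string | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await callModel(feedback);
    const result = validate(raw);
    if (result.kind === "ok") {
      return { status: "accepted", draft: result.value, attempts: attempt };
    }
    errors.push(result.error);
    feedback = result.error; // the schema error is fed into the next attempt
  }
  return { status: "escalated", errors };
}
```

With Zod, validate could wrap the schema's safeParse method and pass the resulting error message back as feedback. The escalation rate is then the share of calls that end with status "escalated".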
4. Assembly is deterministic
Once the structured output validates, the assembly layer takes over. It looks up each selected_clauses
id in the library, substitutes variables, runs the conflict checker, and emits the final document.
No prose is generated outside the clause library.
This is the step that lets a lawyer audit the output. Every clause in the final contract has a
provenance: a library id, a version, the human author who vetted it. The risk score is derived from
the clause metadata, not generated by the model.
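A minimal sketch of the assembly step, assuming clauses carry the id, body, conflicts and requires fields from the Clause type above. The names LibraryClause and assemble are hypothetical, and failing hard on any inconsistency is an illustrative choice, not a description of Lexly's actual error handling.

```typescript
// Hypothetical subset of the clause object that assembly needs.
type LibraryClause = {
  id: string;
  body: string;        // legal text with ${variable} placeholders
  conflicts: string[]; // clause ids that must not be selected alongside this one
  requires: string[];  // clause ids that must be selected alongside this one
};

function assemble(
  library: Map<string, LibraryClause>,
  selected: string[],
  variables: Record<string, string>,
): string {
  const chosen = new Set(selected);
  const parts: string[] = [];
  for (const id of selected) {
    const clause = library.get(id);
    if (!clause) throw new Error(`Unknown clause id: ${id}`);
    for (const other of clause.conflicts) {
      if (chosen.has(other)) throw new Error(`${id} conflicts with ${other}`);
    }
    for (const needed of clause.requires) {
      if (!chosen.has(needed)) throw new Error(`${id} requires ${needed}`);
    }
    // Substitute ${name} placeholders; an unassigned variable is an error,
    // so no placeholder can leak into the final document.
    const text = clause.body.replace(/\$\{(\w+)\}/g, (_match, name: string) => {
      const value = variables[name];
      if (value === undefined) throw new Error(`Unassigned variable: ${name}`);
      return value;
    });
    parts.push(text);
  }
  return parts.join("\n\n");
}
```

Because every output string comes from a library body plus substituted values, each paragraph of the result can be traced back to a clause id.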
5. The eval harness runs against historical contracts
Every PR runs the eval. The dataset is 220 historical contracts spanning all six jurisdictions, each
labelled by an external lawyer with the correct clauses and risk flags. The harness scores three
dimensions:
- Clause selection — did we select the right ids?
- Variable accuracy — did we fill the right fields with the right values?
- Risk surface — did we surface every flag a human reviewer would?
A regression on any of these is a red build. New jurisdictions can't ship until they hit 92%+ on
clause selection and 100% on the regulatory must-includes.
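The shipping gate above can be sketched as a small scoring function. The post does not define how the clause-selection percentage is computed, so this sketch assumes it is the share of contracts whose selected ids exactly match the lawyer's labels. LabelledCase, scoreRun and canShip are hypothetical names, and variable accuracy and risk flags are omitted.

```typescript
// Hypothetical label format for one historical contract.
type LabelledCase = {
  id: string;
  expectedClauses: string[]; // ids the external lawyer marked as correct
  mustInclude: string[];     // regulatory must-include ids
};

type Scores = { clauseSelection: number; mustInclude: number };

function sameSet(a: string[], b: string[]): boolean {
  const sa = new Set(a);
  const sb = new Set(b);
  return sa.size === sb.size && Array.from(sa).every((id) => sb.has(id));
}

// selected maps a case id to the clause ids the pipeline chose for it.
function scoreRun(cases: LabelledCase[], selected: Record<string, string[]>): Scores {
  if (cases.length === 0) throw new Error("Empty eval set");
  let exactMatches = 0;
  let mustIncludeOk = 0;
  for (const c of cases) {
    const picked = selected[c.id] ?? [];
    if (sameSet(picked, c.expectedClauses)) exactMatches++;
    if (c.mustInclude.every((id) => picked.includes(id))) mustIncludeOk++;
  }
  return {
    clauseSelection: exactMatches / cases.length,
    mustInclude: mustIncludeOk / cases.length,
  };
}

// Thresholds from the post: 92%+ on clause selection, 100% on must-includes.
function canShip(scores: Scores): boolean {
  return scores.clauseSelection >= 0.92 && scores.mustInclude === 1;
}
```

A per-clause precision and recall metric would be a reasonable alternative to exact-match; the gate logic stays the same either way.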
The combined effect: a system where the engineering team can ship a new jurisdiction in a sprint, a
lawyer can audit the output in minutes, and the user gets a contract that holds up under real
scrutiny. The retrieval-and-assembly framing is what makes all three true at once.