The first LLM feature every team ships is a thin wrapper: user text in, model text out, a little prompt engineering in between. It demos beautifully. Then real users arrive with their typos, their adversarial creativity, and their unerring instinct for the input you never tested — and the wrapper collapses. What replaces it, on every system we've hardened, is the same four-layer architecture.
Layer one: retrieval
The model's parametric knowledge is stale the day it ships and was never grounded in your data to begin with. Retrieval fixes both — but treat it as an information-retrieval engineering problem, not a vector-database purchase. Chunking strategy, hybrid lexical-semantic search, reranking, and freshness pipelines determine answer quality far more than model choice does. Every hallucination complaint we've investigated traced back to retrieval serving the model weak context, not to the model inventing freely.
Layer two: guardrails
Guardrails run on both sides of the model. Inbound: injection detection, topic boundaries, PII scrubbing. Outbound: schema validation, claim-checking against retrieved sources, brand and safety filters. The critical property is that guardrails are deterministic code, not more prompting — "please don't reveal the system prompt" is a suggestion; a regex over the output is a control.
Layer three: evals
You cannot unit-test a probabilistic system with three golden examples. Production LLM systems need an eval suite the way conventional software needs CI: hundreds of cases spanning the happy path, edge cases, and known attacks, scored automatically on every prompt change, retrieval tweak, and model upgrade. Teams that skip this layer discover regressions the way you never want to — from users, after the deploy that "only changed the prompt a little."
Layer four: fallbacks
The model will time out, the provider will have an incident, and confidence will sometimes be too low to act on. Every LLM call in a serious system carries a fallback chain: a cheaper model, a cached response, a template, or — honourably — a handoff to a human with full context. Systems fail; the architecture decides whether users notice.
Four layers, none optional. The wrapper was never the product — the system around it is.