state.stage = "production-ready"
I design AI systems that don't break.
The STATE framework is how.
Your GenAI system is not failing because of the model.
It is failing because of the plumbing around it.
Same input. Different outputs. No reproducible bugs.
Post-mortems end with "the model did something weird." No root cause. No stack trace.
No traces. No per-user state. Flying blind.
Problems go unnoticed until users complain. Logging prompts and hoping is not observability.
"Can we log why the agent did this?" You cannot answer that.
Risk and legal need documentation. Law 25 requires it. Your system cannot answer in 30 minutes.
"Our GenAI stuff is basically a clever prototype duct-taped into production — non-deterministic, we cannot reproduce failures, and risk is breathing down our neck. I need a proper architecture for stateful, observable, auditable LLM systems so I stop betting my job on vibes."
— LLMOps Lead, Financial Services, Quebec
State Beats Intelligence.
A mid-tier model with proper state management beats a frontier model running stateless — every time.
Production-ready threshold
8–10
out of 10 STATE score
Structured
Explicit state schemas, not implicit context
Every operation initializes a typed state object. The stage field always reflects current execution position. If your agent crashed right now, could you look at the last saved state and know exactly where it stopped?
Traceable
Every step observable, every decision logged
Log every LLM call, external API call, and meaningful stage transition. You must be able to reconstruct exactly what the agent did, what it was given, and what it produced — for any execution, after the fact.
Auditable
Governance-ready, explainable under Law 25
For any automated decision affecting an individual, write a decision record. Quebec Law 25 requires it. So does OSFI. "Can we log why the agent did this?" must have a 30-minute answer.
Tolerant
Fault-tolerant and resumable after failure
When the workflow fails at step 6, it resumes from step 6 — not step 1. Lock before expensive operations. Clear lock on failure. If it only works moving forward, it's a demo.
Explicit
Deterministic boundaries, no magic
Every LLM output passes through a validation gate before any write or action. Invalid output routes to the error path — never silently continues. The seam between reasoning and action is always named.
You've been handed a GenAI platform.
Now you're accountable for reliability.
7–15 years in backend, data engineering, or SRE. Got pulled into GenAI platform ownership 1–2 years ago with an ambiguous mandate. Not an ML researcher. Came up through systems, not models.
Non-determinism in production
Same input, different outputs. Bugs cannot be reproduced.
No observability
No traces. No per-user state. Flying blind until users complain.
The compliance gap
Risk asks "can we log why?" Law 25 requires the answer.
Leadership pressure
100% feel pressure to ship GenAI. 90% say expectations are unrealistic.

Practitioner, not guru.
I came up through backend and systems engineering, got pulled into GenAI platform work, and spent too long debugging failures that had nothing to do with the model.
The STATE framework is how I stopped guessing and started shipping reliably. It's not a research paper — it's what I use on real systems in regulated environments.
Quebec City–based. Background in C#/.NET and distributed systems. Bilingual. Teaching what I learned the hard way.
Start with the quiz.
Build from there.
Three entry points, one destination: a GenAI system that does not break under production conditions.
STATE Readiness Quiz
Score your GenAI system against 5 production-readiness pillars. 10 questions. Concrete gaps, not vague advice. Know exactly where your system will fail before it does.
- 5-pillar diagnostic
- Instant scorecard with pillar breakdown
- Personalized fix plan by email
No Stack Trace
90 minutes. Live teardown of a real RAG architecture scored against the STATE framework. You'll leave with a reproducible debugging methodology, not just theory.
- Live architecture teardown
- STATE scoring exercise
- Reproducibility methodology
LLMOps Cohort
4 Weeks. Your System. Real Fixes.
A small cohort (10–12 engineers) working through the STATE framework on their actual production systems. You bring the system; we fix what's broken.
- 4 weekly live sessions
- Work on your real system
- Law 25 compliance module
Questions practitioners ask.
Most teams blame the model. The actual failure mode is almost always architectural: no state contract between steps so the agent loses context mid-workflow, no validation gate so hallucinated output flows downstream unchecked, no checkpointing so a crash at step 7 means starting over from step 1. The model isn't the weak link. The plumbing around it is. The STATE framework exists to diagnose exactly this.
LLMOps handles the deployment and evaluation layer — model hosting, versioning, prompt tracking, evals. AI Reliability Engineering is the architectural layer below that: the state contracts, full execution traceability, fault tolerance that resumes from where it crashed, and explicit validation gates that catch bad LLM output before it becomes a real-world action. You can run excellent LLMOps tooling on top of a system that still fails silently. That's the gap.
Observability isn't a dashboard. It's the ability to reconstruct, in full, what your agent did on any specific past execution — which prompt ran, which model version, what the output was at each step, where in the workflow it was when it failed or succeeded. Most teams log inputs and outputs. That tells you what went in and what came out. It doesn't tell you why the agent made the decision it did at step 4. If you can't answer that question for a run from last Tuesday, you don't have observability yet.
When a decision affecting an individual is made exclusively through automated processing, organizations must notify that person. On request, they must disclose the personal data used, the principal factors that influenced the decision, and the individual's right to human review. In practice this means your system needs structured decision records — not log files, but explicit records tying each outcome to the data and model version that produced it. Penalties reach C$10M or 2% of global revenue. This isn't a compliance checkbox. It's an architecture requirement your system either satisfies or it doesn't.
Only the Auditable pillar connects directly to regulatory requirements like Law 25 or OSFI. The other four — Structured, Traceable, Tolerant, Explicit — address engineering problems that affect every production AI system. Non-determinism, context rot, full restarts after partial failures, hallucinations flowing through unvalidated: these aren't compliance problems. They're production reliability problems. Regulation is one reason to build stateful, observable systems. The more common reason is that systems without these properties are expensive to debug and dangerous to trust.
Free · 5 minutes
Is your GenAI pilot
production-ready?
Score it against the STATE framework. Concrete gaps, not vague advice. Know exactly where your system will fail before your users do.