Reliability Engineering for AI Systems
Published June 2023
AI reliability is more than model evaluation.
A model can pass an evaluation and still fail in an operating environment. It can be accurate enough in isolation yet unreliable once connected to retrieval, tools, permissions, downstream systems, human review, and changing source material.
Reliability engineering for AI systems has to treat the model as one component in a larger service.
Google's SRE material is useful because it defines reliability around user-visible service behavior and asks teams to make reliability explicit through objectives, monitoring, risk tolerance, and operational practice. AI systems need the same discipline, with additional attention to output quality and decision boundaries. Google SRE: Service Level Objectives
Define The Boundary First
The first reliability question is not "Which model is best?"
The first question is: where does the system become operationally consequential?
Examples:
- drafting text a human edits before use
- summarizing documents for internal review
- routing cases between teams
- retrieving policy context for a decision
- recommending an operational action
- triggering a downstream workflow
Each boundary has a different tolerance for error. A draft can be wrong in ways that a routed case cannot. A search assistant can be incomplete in ways that a system-triggered action cannot.
The reliability model has to match the boundary.
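One way to make that match explicit is to write the boundaries down as data and attach a tolerance to each. The sketch below is illustrative only: the `Boundary` names mirror the examples above, while `ReliabilityPolicy`, its fields, and every number in `POLICIES` are hypothetical placeholders that a real team would set through its own risk review.

```python
from dataclasses import dataclass
from enum import Enum


class Boundary(Enum):
    """Points where the system becomes operationally consequential."""
    DRAFT_FOR_HUMAN_EDIT = "draft_for_human_edit"
    INTERNAL_SUMMARY = "internal_summary"
    CASE_ROUTING = "case_routing"
    POLICY_RETRIEVAL = "policy_retrieval"
    ACTION_RECOMMENDATION = "action_recommendation"
    WORKFLOW_TRIGGER = "workflow_trigger"


@dataclass(frozen=True)
class ReliabilityPolicy:
    max_error_rate: float         # hypothetical tolerance, not a recommended value
    requires_human_review: bool   # whether a person must see output before use


# Assumed values for illustration; the only point is that they differ by boundary.
POLICIES = {
    Boundary.DRAFT_FOR_HUMAN_EDIT: ReliabilityPolicy(0.20, True),
    Boundary.INTERNAL_SUMMARY: ReliabilityPolicy(0.10, True),
    Boundary.CASE_ROUTING: ReliabilityPolicy(0.02, False),
    Boundary.POLICY_RETRIEVAL: ReliabilityPolicy(0.02, True),
    Boundary.ACTION_RECOMMENDATION: ReliabilityPolicy(0.01, True),
    Boundary.WORKFLOW_TRIGGER: ReliabilityPolicy(0.001, False),
}


def policy_for(boundary: Boundary) -> ReliabilityPolicy:
    """Look up the reliability policy that applies at a given boundary."""
    return POLICIES[boundary]
```

Keeping the table in one place means a new use of the same model has to declare its boundary, and therefore its tolerance, before it ships.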
Measure The Right Failure Modes
AI systems need conventional service monitoring: availability, latency, capacity, and error rates. They also need workflow-level signals:
- output rejection rate
- human correction rate
- retrieval failure rate
- escalation frequency
- source conflict frequency
- unauthorized tool-use attempts
- automation rollback events
- incidents where the service was available but the answer was unsafe or unusable
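Several of these signals can be derived from a plain event log. The following is a minimal sketch, assuming each request is recorded as a dictionary with a hypothetical `outcome` field and a `service_ok` flag; the field names and outcome labels are assumptions for illustration, not a standard schema.

```python
from collections import Counter

# Hypothetical outcome labels a review workflow might record per request.
TRACKED_OUTCOMES = ("rejected", "corrected", "retrieval_failed", "escalated")


def workflow_signal_rates(events):
    """Return per-outcome rates over all recorded events.

    Each event is a dict with:
      - "outcome": one of the tracked labels, or "accepted"
      - "service_ok": True if the service responded without an error
    """
    events = list(events)
    total = len(events)
    if total == 0:
        return {}

    counts = Counter(e["outcome"] for e in events)
    rates = {f"{name}_rate": counts[name] / total for name in TRACKED_OUTCOMES}

    # The service answered, but the answer was rejected as unusable.
    available_but_unusable = sum(
        1 for e in events if e["service_ok"] and e["outcome"] == "rejected"
    )
    rates["available_but_unusable_rate"] = available_but_unusable / total
    return rates


sample_events = (
    [{"outcome": "accepted", "service_ok": True}] * 6
    + [{"outcome": "corrected", "service_ok": True}] * 2
    + [{"outcome": "rejected", "service_ok": True}]
    + [{"outcome": "escalated", "service_ok": True}]
)
rates = workflow_signal_rates(sample_events)
```

In this sample every request counted as available from an uptime perspective, yet one in ten was rejected outright. A conventional availability dashboard would not show that.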
Google's guidance on monitoring distributed systems distinguishes symptoms from causes. That distinction matters for AI systems. A symptom may be "operators no longer trust summaries." The cause may be stale retrieval sources, prompt drift, missing review criteria, or a model change. Google SRE: Monitoring Distributed Systems
Build For Controlled Degradation
Reliable AI systems should fail down, not outward: toward a narrower, more supervised mode rather than into downstream systems.
If confidence is low, route to review. If retrieval conflicts, show the conflict rather than hiding it. If a tool call is outside scope, block it. If a model update changes behavior, roll back. If logging fails, pause high-impact automation. If a source is stale, remove it from the retrieval path.
This is ordinary reliability thinking applied to probabilistic behavior.
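The per-request rules above can be sketched as a single decision function. This is one possible arrangement under stated assumptions: `RequestState`, `Action`, the `CONFIDENCE_FLOOR` value, and the order of the checks are all hypothetical. Model rollback and removal of stale sources are deployment-level actions and are left out.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"
    ROUTE_TO_REVIEW = "route_to_review"
    SHOW_CONFLICT = "show_conflict"
    BLOCK = "block"
    PAUSE_AUTOMATION = "pause_automation"


@dataclass
class RequestState:
    confidence: float       # assumed score in [0, 1] from the system
    sources_conflict: bool  # retrieved sources disagree
    tool_in_scope: bool     # requested tool call is within permitted scope
    logging_healthy: bool   # audit logging is currently working
    high_impact: bool       # request would trigger consequential automation


CONFIDENCE_FLOOR = 0.8  # hypothetical threshold, to be tuned per boundary


def degrade(state: RequestState) -> Action:
    """Pick the most restrictive applicable response, checked in order."""
    if not state.tool_in_scope:
        return Action.BLOCK
    if state.high_impact and not state.logging_healthy:
        return Action.PAUSE_AUTOMATION
    if state.sources_conflict:
        return Action.SHOW_CONFLICT
    if state.confidence < CONFIDENCE_FLOOR:
        return Action.ROUTE_TO_REVIEW
    return Action.PROCEED
```

Every branch except the last narrows what the system does. The ordering here puts scope violations first on the assumption that they are the least recoverable; a different system might rank the checks differently.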
NIST SP 800-61 Rev. 3 is relevant because incident response requires preparation, response, recovery, and learning. AI incidents need the same loop. The organization should know what constitutes an AI operational incident before one occurs. NIST SP 800-61 Rev. 3
Reliability Requires Governance
AI reliability cannot be delegated entirely to the engineering team.
Operators define what bad output looks like. Security defines unacceptable access. Legal and compliance define review-sensitive boundaries. Leadership defines acceptable risk. Engineering designs the controls. The reliability model has to join those inputs.
OWASP's LLM application guidance is useful because it locates many AI risks in the application and integration layer. Prompt injection, excessive agency, insecure output handling, and overreliance all become reliability concerns once the system is deployed. OWASP Top 10 for LLM Applications
Reliability engineering for AI systems is not a dashboard. It is a control model around a service that can be available and wrong at the same time.