Reliability Engineering for AI Systems

Published June 2023

AI reliability is not only model evaluation.

A model can pass an evaluation and still fail in an operating environment. It can be accurate enough in isolation and unreliable once connected to retrieval, tools, permissions, downstream systems, human review, and changing source material.

Reliability engineering for AI systems has to treat the model as one component in a larger service.

Google's SRE material is useful because it defines reliability around user-visible service behavior and asks teams to make reliability explicit through objectives, monitoring, risk tolerance, and operational practice. AI systems need the same discipline, with additional attention to output quality and decision boundaries. (Google SRE: Service Level Objectives)
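One way to make "additional attention to output quality" concrete is to count an unusable answer against the same error budget as a failed request. The sketch below is a minimal illustration under stated assumptions: the 99% target, the field names, and the counts are hypothetical and do not come from the SRE material.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Counts for one measurement window (all values are illustrative)."""
    total_requests: int
    failed_requests: int    # conventional errors: timeouts, server errors
    rejected_outputs: int   # served successfully, but judged unusable


def error_budget_remaining(counts: WindowCounts, slo_target: float) -> float:
    """Return the fraction of the error budget still unspent.

    A request counts as "good" only if it was served AND its output was
    usable. A negative result means the budget is overspent.
    """
    if counts.total_requests == 0:
        return 1.0  # nothing served, nothing spent
    allowed_bad = (1.0 - slo_target) * counts.total_requests
    actual_bad = counts.failed_requests + counts.rejected_outputs
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


# Hypothetical window: 20 hard failures plus 55 answers rejected in review.
window = WindowCounts(total_requests=10_000, failed_requests=20, rejected_outputs=55)
remaining = error_budget_remaining(window, slo_target=0.99)
print(f"{remaining:.0%} of the error budget remains")
```

With these illustrative counts, 75 of the 100 allowed bad events are spent, so roughly a quarter of the budget remains. Note that most of the spend comes from answers that were delivered but unusable, which availability monitoring alone would not have counted.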

Define The Boundary First

The first reliability question is not "Which model is best?"

The first question is: where does the system become operationally consequential?

Examples:

drafting text a human edits before use
summarizing documents for internal review
routing cases between teams
retrieving policy context for a decision
recommending an operational action
triggering a downstream workflow

Each boundary has a different tolerance for error. A draft can be wrong in ways that a routed case cannot. A search assistant can be incomplete in ways that a system-triggered action cannot.

The reliability model has to match the boundary.
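As a rough sketch of matching controls to the boundary, the example boundaries listed above can be encoded as an ordered type, with stricter controls attached as the consequence level rises. The ordering, the control names, and the thresholds are all assumptions made for illustration; a real deployment would set them with operators, security, and compliance.

```python
from enum import IntEnum


class Boundary(IntEnum):
    """Example boundaries, in an assumed order of rising consequence."""
    DRAFT_FOR_HUMAN_EDIT = 1
    SUMMARIZE_FOR_REVIEW = 2
    ROUTE_CASES = 3
    RETRIEVE_POLICY_CONTEXT = 4
    RECOMMEND_ACTION = 5
    TRIGGER_WORKFLOW = 6


def required_controls(boundary: Boundary) -> set[str]:
    """Return the (hypothetical) controls a system at this boundary needs."""
    controls = {"output_logging"}  # assumed baseline for every boundary
    if boundary >= Boundary.ROUTE_CASES:
        controls.add("human_review_sampling")
    if boundary >= Boundary.RETRIEVE_POLICY_CONTEXT:
        controls.add("source_citation")
    if boundary >= Boundary.RECOMMEND_ACTION:
        controls.add("human_approval")
    if boundary >= Boundary.TRIGGER_WORKFLOW:
        controls.add("rollback_path")
    return controls


for b in Boundary:
    print(b.name, sorted(required_controls(b)))
```

The point of the sketch is the shape, not the specific thresholds: the control set is derived from the boundary, so moving a system from drafting to triggering workflows forces an explicit change in its reliability model.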

Measure The Right Failure Modes

AI systems need conventional service monitoring: availability, latency, capacity, and error rates. They also need workflow-level signals:

output rejection rate
human correction rate
retrieval failure rate
escalation frequency
source conflict frequency
unauthorized tool-use attempts
automation rollback events
incidents where the service was available but the answer was unsafe or unusable
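Several of the workflow-level signals above can be derived from per-request review records. The following is a minimal sketch, assuming each record carries a hypothetical `outcome` field (one of `accepted`, `corrected`, `rejected`, `escalated`) and a hypothetical `retrieval_ok` flag; real logging schemas will differ.

```python
from collections import Counter
from typing import Iterable, Mapping


def workflow_signals(records: Iterable[Mapping]) -> dict:
    """Compute per-window rates from reviewed request records."""
    records = list(records)
    total = len(records)
    if total == 0:
        return {}  # no traffic in the window, so no rates to report
    outcomes = Counter(r["outcome"] for r in records)
    retrieval_failures = sum(1 for r in records if not r["retrieval_ok"])
    return {
        "output_rejection_rate": outcomes["rejected"] / total,
        "human_correction_rate": outcomes["corrected"] / total,
        "escalation_rate": outcomes["escalated"] / total,
        "retrieval_failure_rate": retrieval_failures / total,
    }


# Illustrative sample: eight requests, every one served successfully.
sample = [
    {"outcome": "accepted", "retrieval_ok": True},
    {"outcome": "accepted", "retrieval_ok": True},
    {"outcome": "accepted", "retrieval_ok": True},
    {"outcome": "accepted", "retrieval_ok": True},
    {"outcome": "accepted", "retrieval_ok": False},
    {"outcome": "corrected", "retrieval_ok": True},
    {"outcome": "rejected", "retrieval_ok": False},
    {"outcome": "escalated", "retrieval_ok": True},
]
signals = workflow_signals(sample)
for name, value in signals.items():
    print(f"{name}: {value:.3f}")
```

In this sample the service was fully available, yet one request in eight was rejected and one in four had a retrieval failure. These are the signals that conventional availability and latency dashboards do not surface.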

Google's guidance on monitoring distributed systems distinguishes symptoms from causes. That distinction matters for AI systems. A symptom may be "operators no longer trust summaries." The cause may be stale retrieval sources, prompt drift, missing review criteria, or a model change. (Google SRE: Monitoring Distributed Systems)

Build For Controlled Degradation

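Reliable AI systems should fail down, not outward: when something goes wrong, they should drop to a more supervised, lower-impact mode rather than pass the failure on to downstream systems and users.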

If confidence is low, route to review. If retrieved sources conflict, show the conflict rather than hiding it. If a tool call is outside scope, block it. If a model update changes behavior in ways that were not expected, roll back. If logging fails, pause high-impact automation. If a source is stale, remove it from the retrieval path.
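The request-time rules above can be sketched as a single decision function. This is an illustration under assumptions, not a prescribed design: the `RequestState` fields, the 0.8 confidence threshold, and the order in which checks are applied are all hypothetical. Model rollback and stale-source removal are left out because they happen at deployment and indexing time rather than per request.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"
    ROUTE_TO_REVIEW = "route_to_review"
    SHOW_CONFLICT = "show_conflict"
    BLOCK = "block"
    PAUSE_AUTOMATION = "pause_automation"


@dataclass
class RequestState:
    """Hypothetical per-request signals available to the control layer."""
    confidence: float       # assumed score in the range [0, 1]
    sources_conflict: bool  # retrieved sources disagree
    tool_in_scope: bool     # requested tool call is on the allowlist
    logging_healthy: bool   # audit logging is currently working
    high_impact: bool       # request would trigger downstream automation


def decide(state: RequestState, min_confidence: float = 0.8) -> Action:
    """Pick the most restrictive applicable action (assumed priority order)."""
    if not state.tool_in_scope:
        return Action.BLOCK
    if state.high_impact and not state.logging_healthy:
        return Action.PAUSE_AUTOMATION
    if state.sources_conflict:
        return Action.SHOW_CONFLICT
    if state.confidence < min_confidence:
        return Action.ROUTE_TO_REVIEW
    return Action.PROCEED


# Example: a confident answer built on conflicting sources is not passed through.
state = RequestState(
    confidence=0.95,
    sources_conflict=True,
    tool_in_scope=True,
    logging_healthy=True,
    high_impact=False,
)
print(decide(state).value)
```

Every branch other than `PROCEED` moves the request toward more supervision or less impact, which is the "fail down" property in code form. The default path is the only one that lets output through unreviewed.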

This is ordinary reliability thinking applied to probabilistic behavior.

NIST SP 800-61 Rev. 3 is relevant because incident response requires preparation, response, recovery, and learning. AI incidents need the same loop. The organization should know what constitutes an AI operational incident before one occurs. (NIST SP 800-61 Rev. 3)

Reliability Requires Governance

AI reliability cannot be delegated entirely to the engineering team.

Operators define what bad output looks like. Security defines unacceptable access. Legal and compliance define review-sensitive boundaries. Leadership defines acceptable risk. Engineering designs the controls. The reliability model has to join those inputs.

OWASP's LLM application guidance is useful because it locates many AI risks in the application and integration layer: prompt injection, excessive agency, insecure output handling, and overreliance all become reliability concerns once the system is deployed. (OWASP Top 10 for LLM Applications)

Reliability engineering for AI systems is not a dashboard. It is a control model around a service that can be available and wrong at the same time.
