AIOps spent years as a slideware category — anomaly detection that paged a human slightly earlier than a threshold alert would have. Useful, but hardly transformative. What has changed recently is the closing of the loop: systems that don't just detect and diagnose, but act.
Three incidents, zero pages
A retail client's payment service began leaking connections at 2:41 on a Sunday morning — a slow leak that would have breached the pool around 5 a.m., mid-peak for their overseas market. The platform correlated the leak with a deployment nine hours earlier, matched the signature to a known failure mode, rolled the service back to the previous build, and opened a ticket with the full evidence chain attached. The on-call engineer read about it over breakfast.
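A projection like "breaches the pool around 5 a.m." can come from something as simple as a straight-line fit over recent samples. The sketch below is one way to do that, not a description of the platform's actual method. The function name `projected_exhaustion`, the sample values, the calendar date and the pool size of 200 are all invented for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence, Tuple

Sample = Tuple[datetime, int]  # (timestamp, connections in use)


def projected_exhaustion(samples: Sequence[Sample], pool_size: int) -> Optional[datetime]:
    """Fit a least-squares line to the samples and return when it reaches pool_size.

    Returns None when there are too few samples or usage is not growing.
    """
    if len(samples) < 2:
        return None
    t0 = samples[0][0]
    xs = [(t - t0).total_seconds() for t, _ in samples]
    ys = [float(n) for _, n in samples]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        return None  # all samples share one timestamp
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    if slope <= 0:
        return None  # flat or recovering: no projected breach
    intercept = mean_y - slope * mean_x
    return t0 + timedelta(seconds=(pool_size - intercept) / slope)


# Hypothetical data: one extra connection held per minute, starting at 02:41.
start = datetime(2024, 1, 7, 2, 41)
samples = [(start + timedelta(minutes=m), 61 + m) for m in range(0, 30, 5)]
print(projected_exhaustion(samples, pool_size=200))
```

With these made-up numbers the fitted line reaches 200 connections at roughly 05:00. A real leak is rarely this linear, so a production system would likely want a confidence band around the estimate before acting on it.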
At a logistics company, a Kafka consumer group fell behind after a malformed message batch triggered repeated deserialisation retries. The system quarantined the poison messages to a dead-letter topic, scaled the consumer group, and restored lag to baseline in eleven minutes — a sequence that had previously taken a human operator most of an hour, assuming they were awake.
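The quarantine step follows a common dead-letter pattern: bound the retries, then set the bad message aside so the rest of the batch can proceed. The sketch below shows the idea with in-memory lists standing in for topics, so it uses no Kafka client at all. `consume_batch`, the JSON payload format and the `max_attempts` default are assumptions made for the example.

```python
import json
from typing import Callable, Dict, List, Tuple


def consume_batch(
    raw_messages: List[bytes],
    handle: Callable[[dict], None],
    max_attempts: int = 3,
) -> Tuple[int, List[Dict[str, object]]]:
    """Process a batch, diverting undecodable messages instead of retrying forever.

    Returns (processed_count, dead_letters). A real consumer would publish
    the dead letters to a dead-letter topic; here they are collected in a
    list so the sketch stays self-contained.
    """
    processed = 0
    dead_letters: List[Dict[str, object]] = []
    for offset, raw in enumerate(raw_messages):
        last_error = ""
        for _attempt in range(max_attempts):
            try:
                payload = json.loads(raw)
            except (json.JSONDecodeError, UnicodeDecodeError) as exc:
                # A malformed payload fails the same way every time,
                # so the retry bound is what keeps the consumer moving.
                last_error = str(exc)
                continue
            handle(payload)
            processed += 1
            break
        else:
            # Every attempt failed: quarantine with enough context to debug later.
            dead_letters.append(
                {"offset": offset, "raw": raw, "error": last_error, "attempts": max_attempts}
            )
    return processed, dead_letters


# Hypothetical batch with one poison message in the middle.
seen: List[dict] = []
batch = [b'{"id": 1}', b"{not json", b'{"id": 2}']
processed, dead = consume_batch(batch, seen.append)
print(processed, [d["offset"] for d in dead])
```

Keeping the raw bytes and the decoder error alongside each quarantined message is what lets a human reconstruct the failure afterwards without replaying the topic.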
The third case was cost rather than availability: an autoscaling misconfiguration left a GPU fleet running at 4% utilisation overnight. The platform flagged it, computed the burn, and — because the fleet was tagged non-production — descheduled it automatically, saving roughly $40,000 before anyone logged in.
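The decision in that case has two parts: estimate the money being wasted, then check whether policy allows acting without a human. A minimal sketch of both, assuming a tag key named `environment`, a 10% idle threshold and a flat hourly GPU rate. None of these come from the incident itself, and the figures below are not meant to reproduce the $40,000 saving.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Fleet:
    name: str
    gpu_count: int
    hourly_cost_per_gpu: float  # USD, hypothetical rate
    utilisation: float          # 0.0 to 1.0, averaged over the window
    tags: Dict[str, str]


def idle_burn(fleet: Fleet, hours: float) -> float:
    """Estimated spend on unused capacity over `hours`."""
    return fleet.gpu_count * fleet.hourly_cost_per_gpu * hours * (1.0 - fleet.utilisation)


def decide(fleet: Fleet, idle_threshold: float = 0.10) -> str:
    """Return 'none', 'flag' or 'deschedule' for a fleet."""
    if fleet.utilisation >= idle_threshold:
        return "none"
    # Only fleets explicitly tagged non-production may be acted on automatically.
    if fleet.tags.get("environment") == "non-production":
        return "deschedule"
    return "flag"  # idle, but a human has to make the call


# Hypothetical fleet: 64 GPUs at $3/hour, 4% utilised for 8 hours.
fleet = Fleet("train-dev", 64, 3.0, 0.04, {"environment": "non-production"})
print(decide(fleet), round(idle_burn(fleet, 8), 2))
```

The guard defaults to the cautious outcome: a fleet with a missing or unexpected tag is flagged, never descheduled.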
What makes it work
Three ingredients separate these outcomes from AIOps theatre. First, a real dependency graph — the system has to know what talks to what, or diagnosis is guesswork. Second, a library of reversible actions with explicit blast-radius limits; the platform earns wider authority the way a junior engineer does, by being right repeatedly under supervision. Third, full auditability — every automated action carries the evidence that justified it, which is what turns "the machine did something at 3 a.m." from terrifying to trustworthy.
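The three ingredients fit together in a small amount of code. The sketch below is illustrative only: a dependency graph is used to size the blast radius, each action carries its own revert and a limit on how many services it may touch, and every attempt, whether executed or refused, is logged with its evidence. All names (`downstream_of`, `Action`, `AuditEntry`, `run`) and the example services are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Dict, List, Set


def downstream_of(graph: Dict[str, List[str]], service: str) -> Set[str]:
    """Services that transitively depend on `service`.

    `graph` maps each service to the services that call it.
    """
    seen: Set[str] = set()
    stack = [service]
    while stack:
        for dependent in graph.get(stack.pop(), []):
            if dependent not in seen:
                seen.add(dependent)
                stack.append(dependent)
    return seen


@dataclass
class Action:
    name: str
    apply: Callable[[], None]
    revert: Callable[[], None]  # every action in the library must be undoable
    max_blast_radius: int       # most services this action is allowed to affect


@dataclass
class AuditEntry:
    action: str
    target: str
    affected: List[str]
    evidence: List[str]
    executed: bool
    at: str


def run(action: Action, target: str, graph: Dict[str, List[str]],
        evidence: List[str], log: List[AuditEntry]) -> bool:
    """Apply the action only if it has evidence and stays within its blast radius."""
    affected = sorted(downstream_of(graph, target) | {target})
    allowed = bool(evidence) and len(affected) <= action.max_blast_radius
    if allowed:
        action.apply()
    # Refusals are logged too, so the audit trail shows what was considered.
    log.append(AuditEntry(action.name, target, affected, list(evidence),
                          allowed, datetime.now(timezone.utc).isoformat()))
    return allowed


# Hypothetical example: roll back a payment service that two others depend on.
graph = {"payments": ["checkout"], "checkout": ["storefront"]}
state = {"build": "v42"}
rollback = Action("rollback",
                  apply=lambda: state.update(build="v41"),
                  revert=lambda: state.update(build="v42"),
                  max_blast_radius=3)
log: List[AuditEntry] = []
run(rollback, "payments", graph, ["connection leak began after deploy v42"], log)
print(state["build"], log[0].affected)
```

Widening the platform's authority then becomes a reviewable change: raising `max_blast_radius` for an action that has a track record, one step at a time.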
Self-healing infrastructure isn't the removal of humans from operations. It's the removal of humans from the 80% of incidents that are pattern-matched repetitions — so the humans are fresh for the 20% that are genuinely new.