How AIOps helps simplify observability and makes data intuitive to use

In partnership with Elastic

In today’s digital economy, observability is no longer a “nice-to-have” monitoring function; it is the architectural mandate for business resilience.

From e-commerce sites to ride-hailing services, smooth digital transactions depend on a vast, unseen web of complex infrastructure. When any part of this infrastructure fails, the consequences are immediate. A minor mistake can harm user experience and drive significant business costs.

By gathering and analysing logs, metrics, and traces, observability provides a comprehensive view of application behaviour on a unified dashboard, enabling IT teams to quickly identify and resolve issues.

Putting agentic AI in observability for AIOps

In Asia-Pacific, businesses are leveraging a combination of agentic AI and observability capabilities. Artificial Intelligence for IT Operations (AIOps) automates IT processes, including anomaly detection, event correlation, ingestion, and processing of operational data by leveraging big data and machine learning.

Over 77 per cent of Asia-Pacific organisations recognise digital infrastructure as critical, driving the adoption of modern cloud architectures with AIOps to optimise infrastructure, applications, networks, and costs, according to research firm IDC.

As pressure piles on businesses to deliver optimal user experiences without hiccups, the need for observability has become a top priority.

Every new application that is shipped needs to be observable to ensure its uptime, and between the mass of IT systems and the multiple tools used to monitor them, there’s a need to simplify.

AIOps enables a proactive approach to a practice that has until now been reactive. Instead of scrambling to find the root cause of a problem after it happens, IT teams can use AI to compare system performance across periods of time, correlating the signals and identifying issues before they occur.

For example, an AI agent augmenting observability workloads could compare how an application performs under the traffic it sees on an average day with how it performs during a sales event.

It can then identify any potential issues and recommend solutions, like provisioning more resources on the cloud to support anticipated demand spikes before they cause disruption.
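
As a rough illustration of that comparison, the sketch below sizes capacity for a sales event against a normal-day baseline. The function name, the traffic samples, the per-instance capacity and the 20 per cent headroom are all hypothetical, chosen only to show the idea. A real agent would draw on far richer telemetry.

```python
import math
from statistics import mean


def recommend_instances(baseline_rps, event_rps, rps_per_instance, headroom=0.2):
    """Suggest instance counts for a normal day and for a sales event.

    All inputs are illustrative assumptions: requests-per-second samples
    for each period, an assumed capacity per instance, and a safety headroom.
    """
    normal_load = mean(baseline_rps)
    # Size for the busiest moment seen in an earlier event, plus headroom.
    event_load = max(event_rps) * (1 + headroom)
    return {
        "normal": math.ceil(normal_load / rps_per_instance),
        "event": math.ceil(event_load / rps_per_instance),
    }


# Hypothetical traffic samples (requests per second).
average_day = [100, 120, 110, 130]
last_sale = [400, 650, 900, 700]

plan = recommend_instances(average_day, last_sale, rps_per_instance=50)
print(plan)
```

The gap between the two figures is the extra capacity an agent might recommend provisioning ahead of the event.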

“The complexity of IT infrastructure is growing with today’s pace of technological innovation,” said Chris Walker, vice-president for solutions architecture for Asia-Pacific and Japan at Elastic.

“If you keep monitoring things on your own, with human operators alone, it’s going to be a nightmare trying to keep up,” he added. “That’s why AI and automation are so important, they can gather insights from terabytes of data, and explain those insights in natural language.”

AIOps also helps to lighten the workload on IT teams, reduce alert fatigue and enable issues to be resolved pre-emptively, before they cause a major disruption. Under the reactive approach, by contrast, customers are already affected by the time the root cause of an issue is found.

The barriers to successfully implementing AIOps

Successful implementation of AIOps requires more than new tooling. It demands a bird's-eye view of system logs, metrics and traces, the right platform foundation, and stakeholder buy-in for a seamless rollout.

A team often uses a separate tool for every purpose: one to monitor logs from an app, another to check on cloud server performance. Such tools multiply quickly across the teams in an organisation, resulting in tool sprawl.

An example of tool sprawl is when an enterprise uses multiple security information and event management (SIEM) tools to keep track of different parts of its infrastructure. Problems arise when these tools don’t ‘talk’ to one another to give a coherent picture, said Walker.

“There are businesses with as many as 25 tools monitoring different parts of the infrastructure,” he noted. “How do you correlate their findings, to detect an issue, then diagnose it and find a resolution?”

At the same time, organisations tend to work with data stored in siloes, with a sprawl of tools used by different teams to monitor performance based on different sets of data.

For accurate, relevant output from AI agents, context is key. An agent needs access to historical performance data across the organisation to distinguish “normal” from “anomalous” behaviour, and siloed data undermines that entirely.
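
To make the point about history concrete, here is a minimal sketch of one common way to separate "normal" from "anomalous": measuring how far a new reading sits from the historical average, in standard deviations. The function name, the threshold of three deviations and the sample response times are assumptions for illustration, not a description of any vendor's method.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, threshold=3.0):
    """Return True when `latest` sits far outside the historical norm.

    `history` needs at least two readings; the less history there is,
    the less trustworthy the baseline becomes.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all stands out.
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Hypothetical response times in milliseconds.
history_ms = [200, 210, 190, 205, 195, 200, 198, 202]

print(is_anomalous(history_ms, 204))  # within the usual range
print(is_anomalous(history_ms, 320))  # far above the usual range
```

An agent that sees only one team's slice of the data would compute its baseline from a partial history, which is why siloed data weakens this kind of check.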

AIOps – consolidating on a single platform

For starters, enterprises should consider unifying the tools they use, starting with how they store and access data, advised Walker.

As each tool in use nears renewal, it’s a chance to consolidate. Organisations need to shift away from legacy “data hoarding” in isolated siloes and toward a unified data platform capable of ingesting and retrieving an exponentially growing amount of performance data across an entire organisation for AI analysis, said Walker.

Depending on how complex the infrastructure is, the integration of AIOps in an enterprise could take months or even more than a year, he added.

“You don’t switch everything off at once, but you plan the migration to a certain platform and start small to get key wins,” he noted. “The eventual goal is to set up a Centre of Excellence for that platform governing processes, procedures, people, and their skillsets so organisations can fully leverage the potential of their data.”

The Elasticsearch Platform, for example, can ingest and understand both structured and unstructured data, such as documents and pictures, in addition to system logs. The platform also provides capabilities that address security and observability needs.

Elastic also offers the Elastic AI Assistant, which integrates with large language models (LLMs) to improve accuracy. Data can be processed and ranked for relevance within Elastic before being passed to the model, reducing token usage and cost while keeping data confidential.
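
The general idea of ranking data before it reaches a model can be sketched in a few lines. This is not Elastic's API: the function name is hypothetical, a simple keyword-overlap score stands in for a real relevance ranker, and a word count stands in for a real token count.

```python
def select_context(query, documents, token_budget):
    """Pick the most relevant documents that fit within a token budget."""
    query_terms = set(query.lower().split())

    def score(doc):
        # Count how many distinct query terms appear in the document.
        return len(query_terms & set(doc.lower().split()))

    selected, used = [], 0
    for doc in sorted(documents, key=score, reverse=True):
        cost = len(doc.split())  # crude stand-in for a real token count
        if score(doc) == 0 or used + cost > token_budget:
            continue  # skip irrelevant entries and ones that would not fit
        selected.append(doc)
        used += cost
    return selected


# Hypothetical log lines.
logs = [
    "checkout service latency spike after deploy",
    "nightly backup completed without errors",
    "checkout service error rate rising on payment gateway",
]

context = select_context("checkout service error", logs, token_budget=10)
print(context)
```

Only the entries that survive ranking and the budget are sent to the model, which is where the saving in tokens and cost comes from.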

Trust is central to AI implementation, especially as attention shifts to AI agents. The successful implementation of AI agents hinges on the data an organisation can provide that is pertinent to a specific task.

Reactive, human-led troubleshooting gives way to proactive, pre-approved automated playbooks that can identify common incidents before they affect the end user. Even so, a human can still vet the output and execute or approve the actual remediation.

With frontier models like Claude Mythos on the horizon, Walker believes that such models will bring a net benefit for resilience.

“Frontier models will be better able to connect and correlate data from across the organisation to effectively reduce siloes, and the site reliability engineer (SRE) role will evolve again,” he said.

The resilience playbook needs to be rewritten from having a human respond, to having the human in the loop to approve AI agents’ actions, then to having the human ‘on’ the loop as AI automates tasks.
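
The difference between a human "in" the loop and "on" the loop can be sketched as a simple approval gate. The playbook names, mode labels and function below are hypothetical, intended only to show how pre-approved actions might run automatically while everything else waits for sign-off.

```python
# Hypothetical playbooks that have been reviewed and approved in advance.
PRE_APPROVED = {"restart_service", "scale_out"}


def decide(action, mode, human_approves=None):
    """Return 'execute', 'await_approval' or 'rejected' for a proposed action.

    mode is 'in_the_loop' (every action needs human sign-off) or
    'on_the_loop' (pre-approved playbooks run without waiting, while
    anything else is still escalated to a human).
    """
    if mode == "on_the_loop" and action in PRE_APPROVED:
        return "execute"
    if human_approves is None:
        return "await_approval"
    return "execute" if human_approves else "rejected"


print(decide("scale_out", "in_the_loop"))                       # waits for a human
print(decide("scale_out", "in_the_loop", human_approves=True))  # runs after sign-off
print(decide("scale_out", "on_the_loop"))                       # runs automatically
print(decide("drop_database", "on_the_loop"))                   # still escalated
```

In this sketch the human's role in the second mode moves from approving each action to curating the pre-approved list and reviewing what was escalated.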

Rapidly increasing digital footprints and IT infrastructure mean that developments happen at machine speed and at massive scale, and operations playbooks written with the human in mind are simply too slow.

The new role of SREs will shift to a strategic one, building out systems and architecture that enable complete visibility across an organisation, and continuously improving playbooks that guide AI agents in observing, identifying, and detecting problems.

“Nirvana might be when we get a fully automated solution, from root cause analysis to remediation without a human touch,” said Walker. “AI excels at extracting insights from the right data, but the human in the loop is still a critical factor to ensure that actions taken are sensible, responsible, and within a set of guardrails.”

With the immense amount of data at stake, security and compliance are crucial and need to be baked in with robust testing, he stressed.

Elastic has experience helping financial institutions and government agencies around the world incorporate AIOps into their IT operations while navigating the demands of highly regulated industries.

With regulated sectors, there may be more rigour needed in testing new systems to ensure everything works fine before making the transition, he noted.

As with the first wave of digital transformation and the move to the cloud, there are always risks involved in innovating, but risk is something an organisation needs to measure and take. Otherwise, it will stagnate.

“Organisations will need to fully leverage technology and make sense of the huge amount of data it produces daily,” said Walker.

“One realistic use case is ensuring maximum system uptime via observability,” he stressed. “AIOps will be the tie-breaker between a smooth customer experience that builds loyalty, and one that drives customers away.”

Find out how Elastic’s AIOps solutions can propel your enterprise forward by proactively finding and resolving issues here.

Techgoondu.com is published by Goondu Media Pte Ltd, a company registered and based in Singapore.