It's a familiar scenario. As IT infrastructures grow more complex, the systems monitoring them must report more precisely and more frequently, so the volume of data they generate balloons. For this data to be useful in flagging up potential problems, it needs to be processed and analysed in a timely manner, a job that becomes more challenging as volumes and complexity rise.
This is the situation that predictive analytics software vendor Netuitive addresses: pulling together the torrents of data emanating from application performance management (APM), diagnostic and network monitoring systems, and combining them with data from customer experience monitoring to create an ongoing 360-degree picture of "normal" operations.
If the system departs from this normal picture, the alarm needs to be raised so that action can be taken quickly. The difficulty comes with the sheer number of signals that must be monitored, especially in large organisations such as banks and telcos.
"One customer is managing over a billion metrics a day," said Netuitive's UK technical director, Neil MacGowan.
"Another example is a telco where just one of their applications produces 4,000 performance metrics from a single instance. This explosion in data is being driven by the APM vendors [like CA]. We have to be able to work with all those solutions.
"Out of thousands of metrics, people want a 'yes' or a 'no', to understand whether things are normal or abnormal and from that to determine whether they are good or bad, and from that perhaps represent an entire business service in a single green or red icon. But we need to do that in an intelligent way rather than the traditional approach of thresholds, rules and scripts, which just won't scale to the size of the problem that people are having to deal with today."
Although this is nominally a big data-type problem, MacGowan contends that the company is not using big data methodology, not yet at least. Instead it employs a more traditional incremental sampling technique. Data is sampled and analysed every few minutes and the results are compared with those from previous samples to check for abnormalities, which might be due to a systems failure or possibly a DDoS attack.
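To make the idea of comparing each new sample with earlier ones concrete, here is a minimal sketch in Python. It is not Netuitive's algorithm, which the article does not describe. It assumes a single metric, a running mean and standard deviation as the picture of "normal", and a hypothetical cut-off of three standard deviations. The names `RunningBaseline` and `check_samples` are invented for illustration.

```python
import math


class RunningBaseline:
    """Running mean and variance of one metric, updated one sample at a time
    (Welford's online method), so no history of raw samples is stored."""

    def __init__(self, threshold=3.0, warmup=5):
        self.n = 0                  # samples absorbed into the baseline
        self.mean = 0.0
        self.m2 = 0.0               # sum of squared deviations from the mean
        self.threshold = threshold  # assumed cut-off, in standard deviations
        self.warmup = warmup        # samples needed before anything is flagged

    def is_abnormal(self, value):
        """Compare a new sample with the baseline built from previous samples."""
        if self.n < self.warmup:
            return False
        std = math.sqrt(self.m2 / (self.n - 1))
        if std == 0.0:
            return value != self.mean
        return abs(value - self.mean) / std > self.threshold

    def update(self, value):
        """Fold a sample into the baseline."""
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)


def check_samples(samples, threshold=3.0, warmup=5):
    """Return one flag per sample: True where it departs from 'normal'."""
    baseline = RunningBaseline(threshold, warmup)
    flags = []
    for value in samples:
        abnormal = baseline.is_abnormal(value)
        flags.append(abnormal)
        if not abnormal:
            # Only normal samples shape the baseline, so a spike
            # does not redefine what counts as normal.
            baseline.update(value)
    return flags


if __name__ == "__main__":
    # Hypothetical readings taken every few minutes, with one spike.
    readings = [100, 102, 98, 101, 99, 100, 500, 101]
    print(check_samples(readings))
```

Because the baseline is learned from the data itself, no fixed per-metric threshold has to be written by hand, which is the scaling problem MacGowan describes. A production system would also need to handle seasonality and correlations between metrics, which this sketch ignores.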