Observability is the key to better, faster, more efficient organisations

By creating a big-picture view of all operations, APIs, systems and server health, organisations will be better equipped to manage issues as they occur and even predict them in the future

Most IT professionals will recognise the scene. Something's broken, but you don't know where, or what the problem is. As organisations adopt increasingly distributed systems across wider geographies, these situations become even more painful and can lead to major headaches if left to fester.

Almost half of data-driven companies are operating 50 different data silos, according to 451 Research, each with its own accompanying services and systems. It's little wonder, then, that in recent years 'observability' has gained traction as a must-have, extending visibility into a more all-encompassing and unified way to track and trace the health of your IT estate. Where visibility ends, observability goes one step further, flagging potential issues before they occur and allowing teams to take action based on those insights - a necessity in a world of distributed systems.

But many businesses wait until there's an issue before making observability a priority. This presents challenges for organisations large and small: for large enterprises, systems can grow so complex that finding the cause of a problem becomes a real challenge.

For smaller organisations, it's important to set up observability from the start. Needing to scale quickly because of growth can be a nice problem to have, but it can also leave them with a much larger, more complex system than the one they started with - and when issues happen, they will struggle to identify the root cause.

And without observability in place, time to resolution can become unmanageably long, impacting systems, degrading vital APIs and ultimately threatening application performance and the bottom line. Staff sentiment may suffer too, and in an employee's market with skills in short supply, organisations need to do their best to hold onto, rather than frustrate, their talent.

With excellent, active, free and open source tools now widely available, the main roadblock to observability is an organisation's willingness to apply it. A lot has changed over the past decade to make observability feasible. Once, the only solutions available were paid-for commercial tools: metric-gathering systems that were as expensive as they were unwieldy.

Since Prometheus exploded onto the scene in 2012 - originating at SoundCloud, free, open source, backed by a highly active community and now the subject of a fascinating documentary - it has over time become the de facto standard time series database for storing metrics.

With the shift to more complex, cloud-native systems since then, including the adoption of microservices and the ubiquity of container orchestration platforms like Kubernetes, it has become imperative to know exactly what is going on and where. Thanks to tools like Prometheus and others in the open source ecosystem, observability has become not only feasible but affordable - indeed, it's unaffordable not to work towards it. We tell customers they must have observability set up because you can't go into production with confidence without visibility into what's going on in your system.

What's more, the wider macro shift is towards adopting open standards, as seen in areas like open banking today and, in future, with nascent projects like OpenTelemetry and OpenTracing - with heavyweights like Kong already adopting OpenTracing.

This push towards standardisation will greatly assist in setting up observability because it opens the way to a degree of technology agnosticism: if there's an interface that every system understands, all an organisation needs to do is push metrics from its systems towards those open standards.

Luckily, getting started with observability is relatively simple. It rests on three pillars: logs, metrics and tracing. Logging alone offers only one piece of the puzzle; all three are needed to observe distributed systems and understand where issues lie, and together they can make the difference between getting to the root cause of a problem in minutes or hours rather than weeks.

Logging means getting visibility into the servers and platforms within your organisation and creating a centralised location where logs from all systems are stored for easier investigation. For example, any given business may be running different servers on different platforms, or across various cloud providers - but by setting up log aggregation or a centralised logging system, issues can easily be viewed from one location.
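As an illustrative sketch, the snippet below shows how structured, JSON-formatted logs make that central store easy to search. It assumes logs written to stdout are shipped by a collector such as Fluent Bit or Logstash, and the "payments-api" service name is hypothetical:

```python
# A minimal sketch of structured logging: each record is emitted as one JSON
# line on stdout, where a log collector (e.g. Fluent Bit or Logstash) is
# assumed to pick it up and forward it to the central store.
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line the aggregator can parse."""

    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": "payments-api",  # hypothetical service name
            "message": record.getMessage(),
        })


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payments-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order accepted")
logger.error("downstream timeout calling inventory service")
```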

Next is organising your metrics. This begins with monitoring all your servers: if an organisation is running bare metal servers, the CPU, RAM, disk and network in/out metrics should all be collated and viewable in one place. Prometheus exporters can be used to extract these metrics easily: for example, the Prometheus node exporter will expose all the statistics of a server, and kube-state-metrics does the same for Kubernetes clusters.
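Those exporters cover host and cluster metrics out of the box; for application-level metrics, a minimal sketch using the official Python client might look like the following, exposing a /metrics endpoint for Prometheus to scrape (the metric names and port are illustrative):

```python
# A minimal sketch of exposing application metrics for Prometheus to scrape,
# using the official prometheus_client library. Host metrics would come from
# node exporter and cluster metrics from kube-state-metrics.
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

REQUESTS_TOTAL = Counter("app_requests_total", "Total requests handled")
QUEUE_DEPTH = Gauge("app_queue_depth", "Jobs currently waiting in the queue")

if __name__ == "__main__":
    start_http_server(8000)  # serves /metrics on port 8000
    while True:
        REQUESTS_TOTAL.inc()                     # count work as it happens
        QUEUE_DEPTH.set(random.randint(0, 50))   # stand-in for a real reading
        time.sleep(5)
```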

Having metrics, though, is only part of what's needed; setting up alerts with sensible thresholds on those metrics turns them from numbers on dashboards into something more useful. Alerts can (and should) be sent to an instant messaging platform like Microsoft Teams or Slack, or, if you want to manage support cases and SLAs, to an incident management system like ServiceNow or PagerDuty for that next layer of support management. This matters because when you're running 100 servers, manually scrolling through dashboards to keep an eye on individual server health becomes unmanageable.
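To make the idea concrete, here is a simplified sketch of threshold-based alerting - in practice Prometheus alerting rules and Alertmanager would handle this - which queries the Prometheus HTTP API and posts to a Slack incoming webhook when CPU usage crosses a threshold. The Prometheus address, webhook URL and 90% threshold are placeholders:

```python
# A minimal sketch of threshold alerting: query Prometheus' HTTP API for CPU
# usage per instance and post a message to a Slack incoming webhook when a
# threshold is breached. Real deployments would use alerting rules plus
# Alertmanager instead of a script like this.
import requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"          # hypothetical
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # hypothetical
CPU_QUERY = 'avg by (instance) (1 - rate(node_cpu_seconds_total{mode="idle"}[5m]))'


def check_cpu_and_alert(threshold: float = 0.9) -> None:
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": CPU_QUERY})
    resp.raise_for_status()
    for sample in resp.json()["data"]["result"]:
        instance = sample["metric"]["instance"]
        cpu = float(sample["value"][1])
        if cpu > threshold:
            requests.post(SLACK_WEBHOOK_URL, json={
                "text": f"High CPU on {instance}: {cpu:.0%} (threshold {threshold:.0%})",
            })


if __name__ == "__main__":
    check_cpu_and_alert()
```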

Finally, distributed tracing can help organisations debug issues that occur across multiple distributed systems. When a request works its way through several systems and something goes wrong, pinpointing precisely where it happened can prove difficult, but distributed tracing attaches a correlation ID that follows the request through every system. Huge latency spikes or other issues can then be identified and debugged.
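Stripped back to its essence, the correlation ID mechanism looks something like the sketch below: reuse an incoming ID if one exists, otherwise mint one, and forward it on every outbound call. Full tracing systems such as OpenTelemetry with Jaeger automate this and add per-span timings; the header name and URL here are illustrative:

```python
# A minimal sketch of correlation-ID propagation, the idea behind distributed
# tracing: pass one ID along the whole request chain so logs and traces from
# every service can be stitched back together.
import uuid

import requests

CORRELATION_HEADER = "X-Correlation-ID"  # hypothetical header name


def call_downstream(url: str, incoming_headers: dict) -> requests.Response:
    # Reuse the caller's ID if present, otherwise start a new trace.
    correlation_id = incoming_headers.get(CORRELATION_HEADER, str(uuid.uuid4()))
    # Log the ID with the request so this hop shows up in centralised logs.
    print(f"correlation_id={correlation_id} calling {url}")
    return requests.get(url, headers={CORRELATION_HEADER: correlation_id})


if __name__ == "__main__":
    call_downstream("http://inventory.example.internal/stock", incoming_headers={})
```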

Of course, as with anything, businesses should be mindful of storage and cloud costs, and fostering a culture around observability can be easier said than done. At a high level, it's about organisations communicating that they are working towards transparency and making life easier for staff. Employees should be introduced to observability practices little by little, and the value will demonstrate itself. But leaders should be mindful to foster a blame-free culture so issues can be addressed with the purpose of resolving them rather than punishing individuals; this should be made clear to all reporting lines and senior managers.

Observability will enable far greater visibility, ultimately making life easier for staff and operations leaner for the organisation. What's more, prioritising a transparent, observable organisation will help future-proof companies against issues they may run into down the line - there's really no downside to getting started today.

Andrew Kew is an API and integration ecosystem specialist at QuadCorps