In this blog post we'll explore some of the concepts related to modern observability and some best practices that developers can use in their everyday work. We’ll also share how our team at Infura has implemented some of these practices to make more informed decisions on our products and services and to respond to production incidents more effectively.
What is Observability?
Observability in cloud products and services is not at all a new concept. Since at least the early 2000s, and even more strongly in the last ten years, there have been major efforts to give observability a more prominent role in the minds of software developers at all levels.
The first occurrence of this concept is found in Rudolf Emil Kálmán’s definition of observability from control theory; generally, this relates to the process of inferring the unmeasurable portions of the state of a system, given the portions that are measurable (typically its outputs). That early definition still applies to the context of cloud systems, but the meaning that is usually attached to observability is nowadays broader, and tied more explicitly to the main goal of understanding correlation and patterns, and from those drawing meaningful conclusions on the state of a system. Throughout this post, we'll anchor ourselves to these concepts.
In practical terms, why do we care about this? The first thing that people usually associate observability with is monitoring, as in actively monitoring for some abnormal condition that can be measured, and triggering automated remediation or alerting someone when that condition occurs, so that corrective action can presumably be taken. This is the most familiar aspect, but it is only part of the picture, the part that comes into play when we already know what we want to look for. Developers are acutely aware that, in software, many things happen concurrently. There are often unexpected corner cases, unanticipated interactions between states or code -- and monitoring systems can end up saying that everything is fine when it actually isn't, simply because there's no way to predict all the possible ways that something can go wrong.
This is where the broader sense of observability comes in. It is the ability to look at characteristics of a system that individually may offer only partial information, but when taken together in context may help explain why something is happening. Even when considering only monitoring, ultimately that “why” is still the question we want to answer. It's easy to see how this definition now starts to move away from active monitoring and toward the realm of debugging, of catching failures before they occur by seeing that something is trending in a bad direction, and generally of answering all those questions that we don't yet know we should be asking, but that we will be urgently asking when something goes wrong.
The more meaningful information we have, the easier it will be to form correlations. Conversely, if we don't allow visibility into the affected parts of a system ahead of time, we can be in a lot of trouble.
From here, observability can branch out into other domains, such as cost tracking: we may want to know how much running certain systems costs, how much making a service available costs, and what kind of load a service can withstand and how it should scale in anticipation of increased usage.
Beyond cost tracking, observability comes into play when we start thinking about visibility into processes. For example, when doing a deployment, we may want to know how long the deployment takes. That information becomes even more crucial if the time to deployment starts trending up, i.e., if suddenly we find ourselves with a deployment that should take five minutes, and instead takes twenty minutes. The fundamental insight in that case is not so much that the deployment time has gone over an arbitrary threshold, but that the same deployment is now taking four times longer than it used to, and we'll want to know why.
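As a minimal sketch of that comparison, assuming deployment durations are already being recorded somewhere as a list of seconds, we can express the latest run as a multiple of the usual one. The helper name `slowdown_factor` and the choice of the median as a baseline are illustrative assumptions, not a prescribed method.

```python
from statistics import median


def slowdown_factor(history_s, latest_s):
    """Return how many times longer the latest deployment took compared
    with the median of previous deployments (hypothetical helper)."""
    if not history_s:
        raise ValueError("need at least one previous deployment")
    baseline = median(history_s)
    return latest_s / baseline


# Made-up sample data: earlier deployments took about five minutes,
# the latest one took twenty.
previous = [290, 300, 310, 305, 295]
factor = slowdown_factor(previous, 20 * 60)
print(f"latest deployment took {factor:.1f}x the usual time")
```

Reporting a ratio against a baseline, rather than checking a fixed threshold, keeps the focus on the change itself, which is the part that calls for an explanation.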
The same approach can be applied to tracking incident responses, and measuring time to outage detection, time to response, and time to resolution, to see how long issues effectively impact customers. This information can be correlated with information about the systems themselves, to determine where both internal processes and system topologies could be more efficient.
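To make those measurements concrete, here is a small sketch that derives them from four timestamps per incident. The `Incident` record, its field names, and the sample dates are all made up for illustration; a real incident tracker would supply its own schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime    # when the outage actually began
    detected: datetime   # when monitoring or a user report flagged it
    responded: datetime  # when an engineer began working on it
    resolved: datetime   # when service was restored

    def time_to_detect(self) -> timedelta:
        return self.detected - self.started

    def time_to_respond(self) -> timedelta:
        return self.responded - self.detected

    def time_to_resolve(self) -> timedelta:
        # Measured from the start: the full window of customer impact.
        return self.resolved - self.started


def mean_duration(durations):
    """Average a non-empty iterable of timedeltas."""
    durations = list(durations)
    return sum(durations, timedelta()) / len(durations)


# Illustrative data only.
incidents = [
    Incident(datetime(2021, 3, 1, 10, 0), datetime(2021, 3, 1, 10, 4),
             datetime(2021, 3, 1, 10, 9), datetime(2021, 3, 1, 10, 40)),
    Incident(datetime(2021, 3, 8, 22, 0), datetime(2021, 3, 8, 22, 2),
             datetime(2021, 3, 8, 22, 5), datetime(2021, 3, 8, 22, 20)),
]
print("mean time to detect: ", mean_duration(i.time_to_detect() for i in incidents))
print("mean time to resolve:", mean_duration(i.time_to_resolve() for i in incidents))
```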
Focusing on Observability
So if these are the goals, where does one focus on observability? The most important area is of course user-facing products. These are by definition what users interact with, so it's essential to be able to answer questions about why things are happening in the systems that power them. This doesn’t mean one can ignore internal components that are not user-facing, because with a few exceptions, they still are what users interact with, just indirectly. With the complex dependencies of microservices in modern architectures, forgetting to focus on the observability of internal components can be a recipe for disaster.
In general, a Production environment is where we'll want to have as much observability as we can manage. Being able to inspect systems during testing is important, but it's in Production that real problems are going to manifest (e.g., under load, under odd traffic patterns that we didn't think possible, when interacting with third-party systems that we have no control over, or when new releases introduce subtle changes). Thus, Production is where we have to keep a constant eye on performance and also where the real operating costs are.
Observability has to start with data. A variety of signals can be leveraged in current distributed systems. The most common are:
- metrics, which represent lightweight measurements taken and reported by a system. Metrics can relate to performance, health, or can be merely informational. They are typically numeric in nature (e.g., number of processes running on a server) but can have labels associated with them (e.g., server type), and they can be used to compute statistics or visualize trends. For example, the image below shows CPU usage measurements reported over time on an individual server:
- traces can represent the flow of a request, a method call, or any action that can be initiated by a user or a software system, and that can potentially traverse several systems. Traces can tie metrics (e.g., duration) to portions (called spans) of the operations required to complete the overall action, whether those spans represent internal function calls, API calls to other systems, requests to a database, or anything of the sort. For example, the image below shows a portion of the function call flame graph during normal operation of the Geth client, produced with the Go profiling tool pprof:
- logs and events (terms often conflated or used interchangeably) can represent discrete steps or state changes happening during system operation; they are composed of text strings or structured objects with associated metadata, and as such they can be both human-readable and machine-readable. Metrics can also be extracted from logs and events instead of being reported directly. For example, the image below shows a log record from the Teku client:
These concepts are often categorized as whitebox observability (or often interchangeably whitebox monitoring). To make this type of data available, systems must be instrumented. In other words, there has to be some code that publishes metrics, creates logs and events, or enables and supports traces.
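To illustrate what instrumentation can look like at its simplest, the sketch below emits all three signal types using only the Python standard library. Everything in it is a hypothetical stand-in: `METRICS` and `RECORDS` play the role of a real metrics backend and log sink, and the helper names (`incr`, `log_event`, `span`) are our own, not the API of any particular library.

```python
import json
import time
from collections import Counter
from contextlib import contextmanager

# In-memory stand-ins for a metrics backend and a log/trace sink.
METRICS = Counter()
RECORDS = []


def incr(name, **labels):
    """Metric: bump a counter identified by its name and labels."""
    METRICS[(name, tuple(sorted(labels.items())))] += 1


def log_event(message, **fields):
    """Log/event: emit one structured, machine-readable record."""
    record = {"ts": time.time(), "msg": message, **fields}
    RECORDS.append(record)
    print(json.dumps(record))


@contextmanager
def span(name, trace_id):
    """Trace: time one portion (span) of a larger operation."""
    start = time.perf_counter()
    try:
        yield
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        log_event("span finished", span=name, trace_id=trace_id,
                  duration_ms=round(duration_ms, 3))


def handle_request(trace_id):
    with span("handle_request", trace_id):
        incr("requests_total", method="eth_blockNumber")
        with span("backend_call", trace_id):
            time.sleep(0.01)  # placeholder for real work
        log_event("request served", trace_id=trace_id, status=200)


handle_request("trace-1")
```

The shared `trace_id` is what lets the inner and outer spans, and the log record between them, be tied back to the same request later.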
To these we can add one more item that is categorized instead as blackbox observability (or monitoring), which is synthetics. Synthetic systems are typically used to send some input to a system purely from the outside, with the goal of observing how the system responds. This could mean sending an HTTP API request on a public endpoint, trying to log in to a website, or simply trying to open a WebSocket connection. Generally, it relates to any interaction with a system by an external user. Synthetic systems on their own don’t require systems to be instrumented and they can provide very valuable insight on the user experience.
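A synthetic check can be very small. The sketch below, under the assumption that a plain HTTP GET is a good enough stand-in for a user interaction, measures success and latency from the outside. The `probe` and `http_get` names are hypothetical, and the transport is injectable so the example runs without network access.

```python
import time
import urllib.request


def http_get(url, timeout_s):
    """Default transport: a plain HTTP GET that returns the status code.
    urllib raises an exception for 4xx/5xx responses and network errors."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.status


def probe(url, send=http_get, timeout_s=5.0):
    """Exercise an endpoint from the outside and report what a user would see."""
    start = time.perf_counter()
    try:
        status = send(url, timeout_s)
        ok = 200 <= status < 300
        error = None
    except Exception as exc:  # timeouts, DNS failures, refused connections, HTTP errors
        status, ok, error = None, False, repr(exc)
    latency_ms = (time.perf_counter() - start) * 1000
    return {"url": url, "ok": ok, "status": status,
            "latency_ms": latency_ms, "error": error}


# Offline demonstration with a fake transport and a placeholder URL.
def fake_send(url, timeout_s):
    return 200


result = probe("https://example.invalid/health", send=fake_send)
print(result["ok"], result["status"])
```

Note that nothing here requires the target system to be instrumented: the probe only sees what an external user would see.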
Data enables observability, but data alone is not the complete observability picture. What completes the picture is using the data to draw connections and dependencies among resources and components, and to understand how components interact. Metadata structures are essential at this stage to see the common context among all the moving parts. Metadata can come in the form of annotations, tags, or other types of objects. Regardless of what information one chooses to incorporate in the metadata, it should be standardized for consistency and to avoid confusion.
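One way to keep metadata standardized is to check it mechanically. The following sketch validates a resource's tags against a made-up standard; the required keys and allowed environment values are assumptions chosen for illustration, not an actual tagging scheme.

```python
# Hypothetical tagging standard, for illustration only.
REQUIRED_TAGS = {"service", "environment", "team"}
ALLOWED_ENVIRONMENTS = {"production", "staging", "development"}


def tag_problems(tags):
    """Return a list of human-readable violations of the tagging standard."""
    problems = [f"missing tag: {key}" for key in sorted(REQUIRED_TAGS - set(tags))]
    for key in tags:
        if key != key.lower():
            problems.append(f"tag key not lowercase: {key}")
    env = tags.get("environment")
    if env is not None and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"unknown environment: {env}")
    return problems


print(tag_problems({"service": "api-gateway", "Environment": "prod"}))
print(tag_problems({"service": "api-gateway", "environment": "production", "team": "core"}))
```

A check like this can run wherever resources are created, so that inconsistent metadata is caught before anything depends on it.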
Good observability has to be pervasive and allow us to connect all the dots. It enables us to find correlation among applications and data sets, and trends that can be justified with events that may be external to an individual system. These events can include anything from seemingly unrelated deployments, to underlying network issues, to third party failures.
Presumably, we'll want engineers to look at these correlations, so good dashboards can help achieve an at-a-glance understanding and recognition of patterns to answer all those why questions. For example, the image below shows a dashboard correlating different metrics for a service under test, allowing easy access to other types of related information:
Presumably, we'll also want engineers to not have to look at these correlations. The area of recognizing patterns and trends is one where Machine Learning can shine, because it's where a lot of research time has already been spent: clustering, forecasting, or anomaly detection are all problems that can be relatively complex for humans to solve but are often approached more easily with AI.
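As a deliberately simple illustration of automated anomaly detection, the sketch below flags points that deviate strongly from a rolling window of recent values. This is a basic statistical rule (a rolling z-score), far simpler than the Machine Learning techniques mentioned above; the window size, threshold, and sample latencies are arbitrary assumptions.

```python
from statistics import mean, stdev


def anomalies(values, window=5, threshold=3.0):
    """Yield (index, value) for points more than `threshold` standard
    deviations away from the mean of the preceding `window` points."""
    for i in range(window, len(values)):
        recent = values[i - window:i]
        mu, sigma = mean(recent), stdev(recent)
        if sigma > 0 and abs(values[i] - mu) > threshold * sigma:
            yield i, values[i]


# Made-up latency samples in milliseconds, with one obvious spike.
latencies_ms = [101, 99, 100, 102, 98, 100, 250, 101, 99]
print(list(anomalies(latencies_ms)))
```

The spike at index 6 is flagged. The points after it are not, partly because the spike itself inflates the window's standard deviation, which is one of the known weaknesses of such a naive rule.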
And finally, another item to mention is programmability. There isn't an individual, one-size-fits-all product that we can plug in and go on with our day. The details of a system can widely vary, so there must be some way of defining how the data is analyzed. Whether we're building an observability platform, or using a third-party product, the choices we make must have the flexibility of offering at least some degree of custom programmability.
The last few years at Infura
This brings us to what we’ve been doing over the past few years at Infura. We've always kept observability in mind, but over the last couple of years we have taken a more deliberate and focused approach to emphasizing some of the concepts above and formalizing a more solid foundation to expand upon. That, together with using third-party observability platforms, open source tools and other tools and dashboards that we built in house, has let us continue to make good progress in the direction of observability.
Core principles for monitoring and alerting
As a first step, we reviewed our internal processes and came up with revamped principles for monitoring and alerting, along with observability primitives that would help power them. We produced a number of design documents to clearly lay out what we wanted to do moving forward. That conversation is, by its nature, still ongoing, and continues to be revisited constantly.
One of the first actionable steps we took was revamping infrastructure metadata, with a tagging standard for both infrastructure and software components that brought us a lot closer to the consistency level we wanted to achieve. We've been able to leverage a variety of tools to create a context that lets us draw conclusions on our systems. This type of metadata is one of the earliest forms we've used, and our standardization efforts are continuously evolving. Our goal is to enable many of our internal processes from health checking via our SRE tools, to visualizing meaningful information on dashboards, to effective cost tracking. Thanks to standardized tags, we can quickly and easily look up how our storage costs are spread across different services or products; or what resources a specific iteration of one of our microservices is utilizing in terms of server instances, data transfer, metrics storage and so on.
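The kind of lookup that standardized tags make possible can be sketched in a few lines. The inventory records, tag names, and costs below are invented for illustration; in practice this data would come from a cloud provider's billing or inventory export.

```python
from collections import defaultdict


def cost_by_tag(resources, tag):
    """Sum monthly cost per value of `tag`. Resources lacking the tag are
    grouped under 'untagged' so gaps in the standard stay visible."""
    totals = defaultdict(float)
    for res in resources:
        totals[res["tags"].get(tag, "untagged")] += res["monthly_cost_usd"]
    return dict(totals)


# Made-up inventory records.
inventory = [
    {"id": "vol-1", "tags": {"service": "ethereum-api"}, "monthly_cost_usd": 120.0},
    {"id": "vol-2", "tags": {"service": "ipfs-gateway"}, "monthly_cost_usd": 80.0},
    {"id": "vol-3", "tags": {"service": "ethereum-api"}, "monthly_cost_usd": 40.0},
    {"id": "vol-4", "tags": {}, "monthly_cost_usd": 10.0},
]
print(cost_by_tag(inventory, "service"))
```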
We built a suite of synthetic, user-facing tests (part of that blackbox testing mentioned above), and designed a conceptual framework for metrics and events that these tests would produce so that we could better keep track of them. With this framework in mind, it shouldn't matter how these tests are implemented or where they run: as long as they represent their results consistently, or are connected to a common visualization, we can see how our services are performing from a user's perspective and make necessary changes before our users are impacted.
Not only do we use these user-facing tests for the purposes of alerting, but the tests also provide metrics for added transparency on our status page. This gives our users an idea of the near-realtime baseline latency of our services from geographically distributed locations around the world.
Continuing on the topic of alerting, we've revamped our alerting policies to give more prominence to the user-facing tests, and to prioritize symptom-based alerting: that is, alerting and explicitly paging an on-call engineer based on a symptom that a user can notice (e.g., high response times, high error rate, lack of service availability) rather than on an internal system condition (e.g., high CPU usage), unless we know that the condition will produce a noticeable symptom soon.
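The policy can be summarized as a small routing rule. The sketch below is a simplification under our own assumptions: the condition names, the `predicts_symptom` flag, and the "page" versus "ticket" outcomes are hypothetical labels, not the configuration of any alerting product.

```python
# Conditions a user can notice directly (hypothetical names).
USER_VISIBLE_SYMPTOMS = {"high_latency", "high_error_rate", "unavailable"}


def route_alert(condition, predicts_symptom=False):
    """Page the on-call engineer only for user-visible symptoms, or for
    internal conditions known to lead to one soon; otherwise just record it."""
    if condition in USER_VISIBLE_SYMPTOMS or predicts_symptom:
        return "page"
    return "ticket"


print(route_alert("high_error_rate"))                          # user-visible symptom
print(route_alert("high_cpu"))                                 # internal condition
print(route_alert("disk_almost_full", predicts_symptom=True))  # internal, but will hurt soon
```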
We still maintain the ability to observe internal components, which remains very important, but make it an explicit goal to prevent our on-call engineers from getting overwhelmed with internal alerts that may be harder to interpret, may be side effects of cascading failures, or, even worse, may not be clearly actionable. As part of this effort we have given runbooks more prominence, making sure they include both high-level architectural summaries as well as lists of common incidents with actionable resolution steps.
With incident response, we have taken steps and made plans in both directions of increased automation and improved training, to ensure that our handling of incidents is as effective as it should be. As our incident response processes and our teams have matured over the years, we've been able to switch to follow-the-sun on-call rotations that, in most cases, no longer require us to wake engineers up in the middle of the night when a production incident occurs.
Integrating with third-party observability tools
In the past few years we've used both New Relic and Datadog as third-party observability platforms, and both have helped us improve our observability efforts. These platforms offer solutions for APM, synthetics, logs, metrics, and other features that are often integrated with one another with the goal of offering a more complete observability context. We further integrate these platforms with our cloud infrastructure providers so that we're able to correlate infrastructure metrics with any event, and we integrate alerting features with other services like PagerDuty, which we use to manage our follow-the-sun on-call rotations.
Our microservices are instrumented so that they can publish traces and other data and we can drill down into the individual spans for every request they serve, see how long each takes to process, any errors returned and other context that we have our microservices provide. Each of these spans may be a call to another microservice or backend system, or an internal function call, and we get to see as precisely as we can where request traffic goes and how it performs. We’ve iterated over the past year to try to find places where we didn't have enough visibility, and bring that level of visibility up.
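To show the kind of question span data can answer, the sketch below computes how much time each span spent in its own work, excluding its children. The span records and their field names are invented for illustration, and the calculation assumes child spans run one after another rather than concurrently.

```python
def self_times(spans):
    """Map span id -> time spent in the span itself, excluding its children.
    Assumes children of a span do not overlap in time."""
    result = {s["id"]: s["duration_ms"] for s in spans}
    for s in spans:
        if s["parent"] is not None:
            result[s["parent"]] -= s["duration_ms"]
    return result


# Made-up spans for a single request.
request_spans = [
    {"id": "handle_request", "parent": None, "duration_ms": 120},
    {"id": "auth_check", "parent": "handle_request", "duration_ms": 10},
    {"id": "backend_call", "parent": "handle_request", "duration_ms": 95},
    {"id": "db_query", "parent": "backend_call", "duration_ms": 60},
]
times = self_times(request_spans)
slowest = max(times, key=times.get)
print(times)
print("most time spent in:", slowest)
```

Looking at self time rather than total duration points at the span where the time is actually spent, instead of at whichever span happens to wrap the most work.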
There are additional sets of metrics we collect from other parts of our architecture as well (for example Ethereum and IPFS clients). We don’t funnel most of these into a third-party platform or other cloud systems, but store them instead on local Prometheus servers. This cuts down the cost of having a cloud service manage metrics for us, while letting us retain the useful data. These metrics are typically plotted in dashboards built with Grafana. We started on this early on, and by now the open source tools Prometheus and Grafana have become de facto standards for this type of observability in many contexts, further validating our initial assumptions. In addition to metrics that are available out of the box with open source clients, we use the same approach for custom health or informational metrics that we publish ourselves via our own components, be they process sidecars, reverse proxies, or other microservices.
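For custom metrics like these, what a component ultimately has to produce is text in the Prometheus exposition format, which a Prometheus server then scrapes over HTTP. The sketch below renders one gauge by hand to show the shape of that output. In practice a client library would normally do this; the metric name and label values here are hypothetical, and label-value escaping is left out for brevity.

```python
def render_gauge(name, help_text, samples):
    """Render one gauge in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to a numeric reading.
    Label values are assumed not to need escaping.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples.items():
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


text = render_gauge(
    "node_sync_lag_blocks",  # hypothetical metric name
    "Blocks behind chain head (hypothetical metric)",
    {(("client", "geth"),): 2, (("client", "teku"),): 0},
)
print(text, end="")
```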
Our status page, provided by Statuspage.io, is very important to us for transparency and to provide our users with observability. We have recently revamped it and will continue to improve on it over time. Over the past year we have added improved system metrics and a better component layout that more closely mirrors our user-facing products, and we're in the process of adding more integrations and automation wherever possible; users can now subscribe to receive updates on incidents and scheduled maintenance for all components, or their choice of components, via email, SMS, RSS, Slack and webhooks. Furthermore, Infura users who also use Statuspage for their own products can now add Infura as a third-party component to their public or private status pages, and automatically receive and publish status updates that way.
This is what the component layout looks like at the time of writing, with the last 90 days of uptime shown for each main Infura product:
Users can drill down to show individual components they may be interested in:
The commitment to transparency continues with open source tools we have developed in house and made public. One of these is Versus, which lets users run a stream of requests against multiple Ethereum nodes and providers simultaneously, comparing the output and timing. Another example is Eth2-comply, a compliance testing tool for the Ethereum 2.0 API. We believe that these tools give our users visibility that is often unavailable, and ultimately help them make better decisions.
Observability is a core principle and not an ephemeral goal. As new products and services come online at Infura, we strive to maintain the same level of awareness for the needs of observability and let our years of experience, our eye for innovation as well as any past missteps guide us toward better ways to maintain the quality of service and reliability that our users expect.