Tuesday, January 7, 2025

Olly 2.0

You blink and suddenly there's Observability 2.0. It's a logical conclusion of where the more interesting things were going, and I'm a little disappointed in myself that I didn't see this coming. One of those "obvious in hindsight" things for me.

I do see the main challenge as adoption. I wonder if this can step out of being a niche pattern. Specifically, I think I disagree with the "critical mass" part of this:

a critical mass of developers have seen what observability 2.0 can do. Once you’ve tried developing with observability 2.0, you can’t go back. That was what drove Christine and me to start Honeycomb, after we experienced this at Facebook. It’s hard to describe the difference in words, but once you’ve built software with fast feedback loops and real-time, interactive visibility into what your code is doing, you simply won’t go back.


The market has been changing rapidly and the capabilities provided by the big providers like Datadog, Dynatrace, Grafana Labs etc. are ever expanding: more integrations to import ever more sources of data, more ways to visualise data (Datadog now even visualises step flow executions). Yet at the same time I feel less excited and more confused than when I first started playing with InfluxDB on a client project in 2014 (real-time claims processing data for an insurance company that the business didn't care about, because they already had BI systems in place that they were happy with).

It is hard to actually find relevant data. Nobody builds custom dashboards; instead you go back and forth between a variety of pre-built, generic service pages that show you a lot of little graphs but very little information. To make up for this, Datadog, Dynatrace, Wavefront etc. all have some clever processing that tries to do some form of root cause analysis and anomaly detection for you. Often useful, always distracting.

I appreciate that it's no longer the default to have no instrumentation at all and to only hear about outages from customer complaints while you frantically try to deduce what's going on from the few actually useful logs you have. I'm just not sure how to advance beyond that.

On the data analytics side it seems to have now become common to have Data Stewards or some such role to ensure that whatever gets dumped into the data lakes by different teams adheres to some shared understanding of the world. Maybe something like that would be useful to agree upon in the dev world.

I've been excited enough about observability to do a conference talk about it. As mentioned in the introduction of that talk, that's because I want to get more people excited about the topic, and because I think it's not easy to see the possibilities if you're starting from scratch.

Honeycomb's sandbox examples seem to be the only thing that comes close to offering a glimpse of that. And even those examples are pretty heavily focused on the operational side.

As an aside - maybe there is a space for "open source" software to showcase some higher-level things to do with observability. IIRC Emelia was working on getting some metrics out of hachyderm, and maybe things like that could be expanded on. (Hachyderm also seems large enough to have statistically relevant amounts of data.)

It's my impression that not a lot of developers care all that much about observability, and the pretty decent out-of-the-box support of OpenTelemetry-style agents already covers quite a lot of operational concerns. So I'm wondering what cultural change could drive developers to put some time into considering what might be interesting to record about whatever piece of code they're currently working on.
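To make that concrete, here's a minimal sketch (in Python, using the OpenTelemetry API; the claim-processing domain and the attribute names are made up for illustration) of the kind of deliberate, domain-level instrumentation I mean, on top of what an auto-instrumentation agent already records:

```python
# A minimal sketch using the OpenTelemetry Python API (opentelemetry-api).
# The "claim" domain and attribute names are invented for illustration.
from dataclasses import dataclass

from opentelemetry import trace

tracer = trace.get_tracer(__name__)


@dataclass
class Claim:
    claim_type: str
    amount: float


def process_claim(claim: Claim) -> None:
    # An auto-instrumentation agent already captures the HTTP request, DB calls etc.
    # The part only the developer can add is the domain context of this code path.
    with tracer.start_as_current_span("process_claim") as span:
        span.set_attribute("claim.type", claim.claim_type)
        span.set_attribute("claim.amount", claim.amount)
        span.set_attribute("claim.requires_manual_review", claim.amount > 10_000)
        # ... the actual processing would happen here ...
```

None of this is hard; the hard part is getting people to decide which attributes would actually be worth asking questions about later.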

I wonder whether something equivalent to code coverage checks might be useful, if it were straightforward to build. I've never been a fan of hard coverage targets, but as a soft constraint, automated coverage and linting checks can be a useful way of automating standards that a team has agreed upon. A rough sketch of what such a check might look like follows.
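This is purely hypothetical - I'm not aware of an existing tool that does this - but an "instrumentation coverage" report could be as simple as counting which functions create a span, and printing a ratio in CI rather than failing the build:

```python
# Hypothetical "instrumentation coverage" check: a sketch of the idea, not an
# existing tool. It counts which functions in a Python file create a span, so a
# team could report (not enforce) a soft instrumentation ratio in CI.
import ast
import sys

SPAN_CALLS = {"start_as_current_span", "start_span"}


def has_span(func: ast.AST) -> bool:
    # Look for any call like tracer.start_as_current_span(...) inside the function.
    for node in ast.walk(func):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            if node.func.attr in SPAN_CALLS:
                return True
    return False


def report(path: str) -> None:
    tree = ast.parse(open(path).read(), filename=path)
    funcs = [n for n in ast.walk(tree) if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
    instrumented = [f for f in funcs if has_span(f)]
    if funcs:
        ratio = len(instrumented) / len(funcs)
        print(f"{path}: {len(instrumented)}/{len(funcs)} functions create a span ({ratio:.0%})")


if __name__ == "__main__":
    for path in sys.argv[1:]:
        report(path)
```

Like coverage numbers, the value wouldn't be in the percentage itself but in prompting the conversation about which code paths deserve instrumentation at all.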

These changes might take time, I guess. As an industry we're still not particularly great at code quality. Taking TDD as an example, it's not like proper use of it is at all common. But at the same time, doing at least some form of automated testing does seem to be the default now. I remember the times when you had to justify the time spent on writing tests. We're clearly in a better place now.

Maybe over time it will become common to expect to be able to interrogate your running software for more interesting data than just how many requests per second some controller is handling.