• by RainyDayTmrw on 6/11/2025, 4:39:43 AM

    I think we are, collectively, greatly underestimating the value of determinism and, conversely, the cost of nondeterminism.

    I've been trialing a different product with the same sales pitch. It tries to RCA my incidents by correlating graphs. It ends up looking like this page[1], which is a bit hard to explain in words, but both obvious and hilarious when you see it for yourself.

    [1]: https://tylervigen.com/spurious-correlations
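
    To see the failure mode concretely: scan enough unrelated metrics and some pair will correlate strongly by pure chance. A minimal Python sketch (illustrative only, nothing to do with the product in question):

      import numpy as np

      # Generate many unrelated random-walk "metrics" and report the
      # strongest pairwise correlation found. With enough series, a
      # strong spurious correlation is almost guaranteed.
      rng = np.random.default_rng(0)
      metrics = rng.standard_normal((200, 50)).cumsum(axis=1)

      corr = np.corrcoef(metrics)      # all pairwise correlations
      np.fill_diagonal(corr, 0)        # ignore self-correlation
      i, j = np.unravel_index(np.abs(corr).argmax(), corr.shape)
      print(f"metric {i} vs metric {j}: r = {corr[i, j]:.2f}")  # often |r| > 0.9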

  • by zug_zug on 6/11/2025, 2:37:29 AM

    As somebody who's good at RCA, I'm worried all my embarrassed coworkers are going to take at face value a tool that's confidently incorrect 10% of the time, and screw stuff up even more rather than admit publicly that they don't know something.

    It'd be less bad if the tool came to a conclusion, then looked for data to disprove that interpretation, and then either made a more reliable argument or admitted its uncertainty.
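
    In code, that discipline might look something like this (a rough sketch; every function here is a hypothetical placeholder, not any vendor's API):

      def investigate(signals, propose, refute, max_rounds=3):
          """Propose a root cause, then actively try to disprove it."""
          for _ in range(max_rounds):
              hypothesis = propose(signals)
              counter = refute(hypothesis, signals)  # hunt for disconfirming data
              if counter is None:
                  return hypothesis                  # survived an attempt at disproof
              signals = signals + [counter]          # fold counter-evidence back in
          return None                                # admit uncertainty rather than guess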

  • by heinrichhartman on 6/11/2025, 7:44:40 AM

    > New Relic did this for the Rails revolution, Datadog did it for the rise of AWS, and Honeycomb led the way for OpenTelemetry.

    I find this reading of the history of OTel highly biased. OpenTelemetry was born as the merger of OpenCensus (initiated by Google) and OpenTracing (initiated by LightStep):

    https://opensource.googleblog.com/2019/05/opentelemetry-merg...

    > The seed governance committee is composed of representatives from Google, Lightstep, Microsoft, and Uber, and more organizations are getting involved every day.

    Honeycomb has for sure made valuable code & community contributions and championed the technology's adoption, but they are very far from "leading the way".

  • by stego-tech on 6/11/2025, 3:06:43 AM

    Again, sales pitch aside, this is one of the handful of valuable LLM applications out there. Monitoring and observability have long been the exclusive domain of SRE teams in large orgs while remaining out of reach for smaller orgs (speaking strictly from an IT perspective, NOT dev), because identifying valuable metrics and carving out heartbeats and baselines for them takes a lot of time, specialized tooling, extensive dev environments to validate changes, and change controls to ensure you don’t torch production.

    With LLMs trained on the most popular tools out there, this gives IT teams short on funds or expertise the ability to finally implement “big boy” observability and monitoring deployments built on more open frameworks or tools, rather than yet-another-expensive-subscription.

    For usable dashboards and straightforward observability setups, LLMs are a kind of godsend for IT folks who can troubleshoot and read documentation but lack the time for a “deep dive” on every product suite the CIO wants to shove down our throats. Add in an ability to at least give a suggested cause when sending a PagerDuty alert, and you’ve got a revolution in observability for SMBs and SMEs.
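
    That last part is just a field on the existing PagerDuty Events API v2 payload. A minimal sketch (the suggested_cause key is our own convention, not a PagerDuty field, and where the cause text comes from is left open):

      import requests

      def trigger_alert(routing_key: str, summary: str, suggested_cause: str):
          """Fire a PagerDuty alert with a model-suggested cause attached."""
          event = {
              "routing_key": routing_key,  # from your PagerDuty service integration
              "event_action": "trigger",
              "payload": {
                  "summary": summary,
                  "source": "observability-pipeline",
                  "severity": "critical",
                  # custom_details is free-form; "suggested_cause" is our convention
                  "custom_details": {"suggested_cause": suggested_cause},
              },
          }
          resp = requests.post("https://events.pagerduty.com/v2/enqueue", json=event)
          resp.raise_for_status()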

  • by techpineapple on 6/11/2025, 1:10:43 AM

    I feel like the alternate title of this could be “how to 10x your observability costs with this one easy trick”. It didn’t really show a way to get rid of all the graphs; the prompt was “show me why my latency spikes every four hours”. That’s really cool, but in order to generate that prompt you need alerts and graphs. How do you know your latency is spiking so you can write the prompt in the first place?

    The devil seems to be in the details, but you’re running a whole bunch more compute for anomaly detection and “sub-second query performance, unified data storage”, which again sounds like throwing enormous amounts of money at the problem. I can totally see why this is great for Honeycomb, though; they’re going to make bank.

  • by resonious on 6/11/2025, 3:03:01 AM

    The title is a bit overly dramatic. You still need all of your existing observability tools, so nothing is ending. You just might not need to spend quite as much time building and staring at graphs.

    It's the same effect LLMs are having on everything, it seems. They can help you get faster at something you already know how to do (and help you learn how to do something!), but they don't seem to outright replace any particular skill.

  • by geraneum on 6/11/2025, 5:05:36 AM

    > This isn’t a contrived example. I basically asked the agent the same question we’d ask you in a demo, and the agent figured it out with no additional prompts, training, or guidance. It effectively zero-shot a real-world scenario.

    As I understand it, this is a demo they already use, and the solution is available. Maybe it should’ve been a contrived example so that we can tell whether the solution was in the training data verbatim. Not that what the LLM did isn’t useful, but if you announce the death of observability as we know it, you need to show that the tool can generalize.

  • by nilkn on 6/11/2025, 2:06:28 PM

    It's not the end of observability as we know it. However, the article also isn't totally off-base.

    We're almost certain to see a new agentic layer emerge and become increasingly capable for various aspects of SRE, including observability tasks like RCA. However, for this to function, most or even all of the existing observability stack will still be needed. And as long as the hallucination / reliability / trust issues with LLMs remain, human deep dives will remain part of the overall SRE work structure.

  • by yellow_lead on 6/11/2025, 6:19:39 AM

    Did AI write this entire article?

    > In AI, I see the death of this paradigm. It’s already real, it’s already here, and it’s going to fundamentally change the way we approach systems design and operation in the future.

    How is AI analyzing some data the "end of observability as we know it"?

  • by ok_dad on 6/11/2025, 2:08:10 AM

    "Get AI to do stuff you can already do with a little work and some experts in the field."

    What a good business strategy!

    I could post this comment on 80% of the AI application companies today, sadly.

  • by gilbetron on 6/11/2025, 12:44:25 PM

    There's a bit of a flaw in the "don't need graphs and UIs to look at your data" premise behind this article: sure, LLMs will be great ... when they work great. When they fail, you need a human there to figure things out, and that human will still need the graphs.

    Furthermore, while graphing and visualization are definitely tough, complex parts of observability, gathering the data and storing it in forms that meet complex query demands is really difficult as well.

    Observability will "go away" once AI is capable of figuring nearly everything out flawlessly by itself, at which point AI will be capable of nearly anything, so the "end of observability" is the end of our culture as we know it (probably not extinction, but culture shifting profoundly, and probably painfully).

    AI will definitely change observability, and that's cool. It already is, but has a long way to go.

  • by kacesensitive on 6/11/2025, 4:23:11 AM

    LLMs won't replace observability, but they absolutely change the game. Asking "why is latency spiking" and getting a coherent root cause in seconds is powerful. You still need good telemetry, but this shifts the value from visualizing data to explaining it.

  • by Kiyo-Lynn on 6/11/2025, 7:46:57 AM

    I used to think that monitoring and alerting systems were just there to help you quickly and directly see problems. But as systems grew more complex, I found the dashboards and alerts became overwhelming, and I often couldn’t figure out the root cause of an issue. Recently I started using AI to help with analysis, and I found it can give me clues in a few seconds that I might have spent half a day searching for.

    While it's much more efficient, sometimes I worry that, even though AI makes problem-solving easier, we might be relying too much on these tools and losing our own ability to judge and analyze.

  • by satisfice on 6/11/2025, 5:04:31 AM

    So many engineers feel fine about a tool that they cannot rely upon.

    Without reliability, nothing else matters; an AI that can try hypotheses so much faster than me, but can't be relied upon, makes the speed moot.

  • by schwede on 6/11/2025, 6:26:37 AM

    Maybe I’m just a skeptic, but it seems like a software engineer or SRE familiar with the application should be able to arrive at the load-testing conclusion fairly easily. Certainly not as fast as 80 seconds, though, which is impressive. As noted, you still need an engineer to review the data and complete the proposed action items.

  • by physix on 6/11/2025, 2:04:55 AM

    I'd like to see the long list of companies that are in the process of being le cooked.

  • by vanschelven on 6/11/2025, 6:37:54 AM

    > New abstractions and techniques... hide complexity, and that complexity requires new ways to monitor and measure.

    If the abstractions hide complexity so well you need an LLM to untangle them later, maybe you were already on the wrong track.

    Hiding isn't abstracting, and if your system becomes observable only with AI help, maybe it's not well designed, just well obfuscated. I've written about this before here: https://www.bugsink.com/blog/you-dont-need-application-perfo...

  • by dgellow on 6/11/2025, 10:49:02 AM

    Sort of related: using Claude Code with the gcloud CLI, only allowing read-only commands (and of course no ssh), and with supervision, is such a superpower. I don’t think I can go back to debugging my infra manually. Like all use of Claude Code, it’s not fire-and-forget; you have to guide and correct it, but that’s so much faster and easier than dealing directly with the mess that the GCP APIs are.

  • by mediumsmart on 6/11/2025, 3:52:33 AM

    I thought the article was about the end of observability of the real world as we knew it and was puzzled why they felt fine.

  • by devmor on 6/11/2025, 6:27:00 AM

    As the AI growth cycle stagnates, valuations continue to fly wildly out of control, and more and more of the industry switches from a hopeful to a bearish sentiment, I’ve started to find this genre of article extremely funny, if not pitiable.

    Who are you trying to convince with this? It’s not going to work on investors much longer, it’s mostly stopped working on the generically tech-inclined, and it’s never really worked on anyone who understands AI. So who’s left to be suckered by this flowery, desperate prose? Are you just trying to convince yourselves?

  • by stlava on 6/11/2025, 2:00:42 AM

    I feel that if you need an LLM to help pivot between existing data, it just means the observability tool has gaps in its user functionality. This is by far my biggest gripe with Datadog today. All the data is there, but going from a database query to front-end traces should be easy and is not.

    Sure, we can use an LLM, but for now I can click around faster (if those breadcrumbs exist) than it can reason.

    Also, the LLM would only point in a direction, and I’m still going to have to use the UI to confirm.

  • by pmbauer on 6/11/2025, 3:41:53 PM

    It must burn a little blogging about an LLM-driven latency analysis _internal demo_ only to have Datadog launch a product in the same space a day later. https://www.datadoghq.com/blog/bits-ai-sre/

  • by neuroelectron on 6/11/2025, 6:50:28 AM

    This would have been really nice to have when I was in Ops. Running MapReduce on logs and looking at dozens of graphs made up most of my working hours. We did eventually get the infrastructure for live filtering but that was just before the entire sector was outsourced.

  • by akrauss on 6/11/2025, 6:19:13 AM

    I would be interested in reading what tools are made available to the LLM, and how everything is wired together to form an effective analysis loop. It seems like this is a key ingredient here.
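
    Presumably it's something like a tool-calling loop. A speculative sketch (every name below is a hypothetical placeholder, not Honeycomb's actual wiring):

      # Speculative sketch of an agentic analysis loop.
      def query_metrics(**kwargs): ...   # run a time-range aggregation
      def fetch_traces(**kwargs): ...    # pull exemplar traces for a window
      def search_logs(**kwargs): ...     # search structured log events

      TOOLS = {"query_metrics": query_metrics,
               "fetch_traces": fetch_traces,
               "search_logs": search_logs}

      def analysis_loop(llm, question, max_steps=10):
          """Let the model iterate: pick a tool, run it, feed evidence back."""
          history = [question]
          for _ in range(max_steps):
              step = llm(history, tools=list(TOOLS))  # model chooses the next action
              if step.final_answer is not None:
                  return step.final_answer
              result = TOOLS[step.tool](**step.args)  # execute against telemetry
              history.append((step.tool, step.args, result))
          return "inconclusive"                       # stop rather than loop forever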

  • by AdieuToLogic on 6/11/2025, 2:10:38 AM

    This post is a thinly veiled marketing promo. Here's why.

    Skip to the summary section titled "Fast feedback is the only feedback" and its first assertion:

      ... the only thing that really matters is fast, tight
      feedback loops at every stage of development and operations.
    
    This is industry dogma generally considered "best practice" and sets up the subsequent straw man:

      AI thrives on speed—it'll outrun you every time.
    
    False.

    "AI thrives" on many things, but "speed" is not one of them. Note the false consequence ("it'll outrun you every time") used to set up the the epitome of vacuous sales pitch drivel:

      To succeed, you need tools that move at the speed of AI as well.
    
    I hope there's a way I can possibly "move at the speed of AI"...

      Honeycomb's entire modus operandi is predicated on fast
      feedback loops, collaborative knowledge sharing, and
      treating everything as an experiment. We’re built for the
      future that’s here today, on a platform that allows us to
      be the best tool for tomorrow.
    
    This is as subtle as a sledgehammer to the forehead.

    What's even funnier is the lame attempt to appear objective after all of this:

      I’m also not really in the business of making predictions.
    
    Really? Did the author read anything they wrote before this point?

  • by captainbland on 6/11/2025, 11:55:12 AM

    Gripe: enshittification is tangential to performance concerns. It doesn't just mean software getting bad; it means software that may be technically accomplished doing things that are bad for users, in service of creating ROI for investors after a period of market-share building.

  • by catlifeonmars on 6/11/2025, 7:37:24 AM

    Was anyone else just curious about those odd spikes, and disappointed the article didn’t do a deeper dive to explain that unusual shape?

  • by favflam on 6/11/2025, 4:58:44 AM

    I find people relying way too much on AI tools. If I pay someone a salary, they need to actually understand the answer they give me, and their butt needs to be on the line if the answer is wrong. That is the purpose of a salary: not just to do the work, but to be responsible for the results. AI breaks this in a lot of the use cases I see crop up on ycombinator.

    If an AI tool outstrips a human's ability to stay in the decision loop, then that tool's usefulness is not so great.

  • by nektro on 6/11/2025, 9:23:14 PM

    lol