• by b0a04gl on 6/21/2025, 2:02:02 PM

    tbh that's not the flex. storing 100PB of logs just means we haven't figured out what's actually worth logging. metrics + structured events can usually tell 90% of the story. the rest? trace level chaos no one reads unless prod's on fire. what could've been done better: auto pruning logs that no alert ever looked at, or logs that never hit a search query in 3 months. call it attention weighted retention. until then this is just high end digital landfill with compression
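    A rough sketch of what that attention weighted retention could look like, assuming a hypothetical otel.logs table with a service column and the clickhouse-driver Python client; the "attention" signal here is just a naive scan of system.query_log, nothing from the article:

      # Hypothetical sketch: drop old log rows for services whose logs nobody
      # has queried in the last 90 days. Table and column names are made up.
      from clickhouse_driver import Client

      client = Client(host="localhost")

      # Every service we currently store logs for.
      services = [row[0] for row in client.execute(
          "SELECT DISTINCT service FROM otel.logs")]

      # Naive attention signal: any finished SELECT that touched otel.logs
      # in the last 90 days, pulled from ClickHouse's own query history.
      recent_queries = [row[0] for row in client.execute("""
          SELECT query FROM system.query_log
          WHERE event_time > now() - INTERVAL 90 DAY
            AND type = 'QueryFinish'
            AND query ILIKE '%otel.logs%'
      """)]

      for service in services:
          if any(service in q for q in recent_queries):
              continue  # someone looked at these logs recently; keep them
          # Nobody searched this service's logs in 90 days: prune the old rows.
          client.execute(
              "ALTER TABLE otel.logs DELETE WHERE service = %(svc)s "
              "AND timestamp < now() - INTERVAL 90 DAY",
              {"svc": service},
          )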

  • by jurgenkesker on 6/21/2025, 11:31:30 AM

    So yeah, this is only really relevant for collecting logs from ClickHouse itself, not for logs from anything else. Good for them, and I really love ClickHouse, but it's not really relevant to anyone else.

  • by mrbluecoat on 6/21/2025, 11:24:28 AM

    Noteworthy point:

    > If a service is crash-looping or down, SysEx is unable to scrape data because the necessary system tables are unavailable. OpenTelemetry, by contrast, operates in a passive fashion. It captures logs emitted to stdout and stderr, even when the service is in a failed state. This allows us to collect logs during incidents and perform root cause analysis even if the service never became fully healthy.
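    A toy contrast of those two collection models (not the actual SysEx or OpenTelemetry code; the host, path and table names are assumptions): the active scraper asks the possibly-dead server for its own system tables, while the passive tail just reads whatever the process already wrote to stdout/stderr.

      # Active model: ask the server for its own logs. Fails when it is down.
      # Passive model: tail what the container runtime captured from stdout/stderr.
      from pathlib import Path
      from clickhouse_driver import Client
      from clickhouse_driver import errors as ch_errors

      def scrape_system_logs(host: str = "localhost"):
          """Active scrape: only works while the server can answer queries."""
          try:
              return Client(host=host).execute(
                  "SELECT event_time, message FROM system.text_log "
                  "ORDER BY event_time DESC LIMIT 100")
          except (ch_errors.Error, OSError):
              return []  # crash-looping server: nothing to collect mid-incident

      def tail_stdout_logs(path: str = "/var/log/containers/clickhouse-server.log"):
          """Passive collection: the file is still readable after the crash."""
          log_file = Path(path)
          if not log_file.exists():
              return []
          return log_file.read_text(errors="replace").splitlines()[-100:]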

  • by jappgar on 6/21/2025, 2:09:54 PM

    Observability maximalism is a cult. A very rich one.

  • by iw7tdb2kqo9 on 6/21/2025, 11:48:28 AM

    I haven't worked at ClickHouse-level scale.

    Can you search log data at this volume? ElasticSearch has query capabilities for small-scale log data, I think.

    Why would I use ClickHouse instead of storing historical log data as JSON files?

  • by the_arun on 6/21/2025, 2:52:28 PM

    I didn’t see how long logs are kept - the retention time. After X months you may only need summary/aggregated data, but I'm not sure about keeping the raw data.
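    One way to get that split, sketched under assumptions not taken from the post (hypothetical otel.logs_raw / otel.logs_daily tables, illustrative intervals): a TTL expires the raw rows after 3 months, while a materialized view keeps per-day aggregates around much longer.

      # Raw rows expire after 3 months via TTL; a SummingMergeTree keeps
      # per-day counts indefinitely. All names and intervals are illustrative.
      from clickhouse_driver import Client

      client = Client(host="localhost")

      client.execute("""
          CREATE TABLE IF NOT EXISTS otel.logs_raw
          (
              timestamp DateTime,
              service   LowCardinality(String),
              severity  LowCardinality(String),
              body      String
          )
          ENGINE = MergeTree
          PARTITION BY toYYYYMM(timestamp)
          ORDER BY (service, timestamp)
          TTL timestamp + INTERVAL 3 MONTH DELETE
      """)

      client.execute("""
          CREATE TABLE IF NOT EXISTS otel.logs_daily
          (
              day      Date,
              service  LowCardinality(String),
              severity LowCardinality(String),
              events   UInt64
          )
          ENGINE = SummingMergeTree
          ORDER BY (service, severity, day)
      """)

      client.execute("""
          CREATE MATERIALIZED VIEW IF NOT EXISTS otel.logs_daily_mv
          TO otel.logs_daily AS
          SELECT toDate(timestamp) AS day, service, severity, count() AS events
          FROM otel.logs_raw
          GROUP BY day, service, severity
      """)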

  • by Thaxll on 6/21/2025, 1:39:56 PM

    I mean, if you don't get the logs when the service is down, the entire solution is useless.

  • by atemerev on 6/21/2025, 11:17:14 AM

    When I go back from ClickHouse to Postgres, I am always shocked. Like, what is it doing for several minutes importing this 20G dump? Shouldn't it take seconds?

  • by revskill on 6/21/2025, 12:30:52 PM

    This industry is mostly filled with half-baked or in-progress standards, which leads to fragmentation of the ecosystem. From GraphQL to OpenAPI to MCP... to everything, nothing is perfect, and that's fine.

    The problem is that the people who create these specs are just following a trial-and-error approach, which is insane.

  • by the_real_cher on 6/21/2025, 11:02:23 AM

    What is the trick that this and Dynamo use?

    Are they basically just large hash tables?

  • by ofrzeta on 6/21/2025, 10:47:08 AM

    Whenever I read things like this I think: you are doing it wrong. I guess it is an amazing engineering feat for ClickHouse, but I think we (as in IT, or all people) should really reduce the amount of data we create. It is wasteful.

  • by tjungblut on 6/21/2025, 10:51:47 AM

    tl;dr: they now pass through raw bytes with zero (?) copies instead of marshaling and unmarshaling JSON.
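    A toy Python illustration of the difference that tl;dr points at (not the actual exporter code): the old path pays a full JSON decode/encode round trip per record, the new path forwards the bytes it already has.

      import json

      raw_record = b'{"severity":"INFO","body":"query finished in 12ms"}'

      def forward_with_marshalling(record: bytes) -> bytes:
          """Old shape: bytes -> objects -> bytes, allocating on every hop."""
          parsed = json.loads(record)          # unmarshal
          return json.dumps(parsed).encode()   # marshal right back

      def forward_raw(record: bytes) -> bytes:
          """New shape: the insert payload is the bytes we already received."""
          return record                        # nothing parsed, nothing re-encoded

      assert json.loads(forward_with_marshalling(raw_record)) == json.loads(forward_raw(raw_record))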