by zzyzxd on 6/21/2025, 7:05:30 PM
The article is unnecessarily long for what amounts to bragging that "a service we didn't use went down, so it didn't affect us". And if I want to be picky, their architecture isn't perfect either:
- Their alerts were not durable. The outage took out the alert system, so humans were just eyeballing dashboards for the duration. What if your critical system went down along with that alert system, in the middle of the night?
- The cloud marketplace service was affected by the Cloudflare outage and there was nothing they could do about it.
- Tiered storage was down and disk usage climbed above normal levels, but there was no anomaly detection and no alerting; it survived only because t0 storage was massively over-provisioned (a sketch of what I mean follows this list).
- They take pride in using well-known industry designs like cell-based architecture, redundancy, multi-AZ... ChatGPT could give me a better list.
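To make the first and third bullets concrete, here's a minimal sketch in Python (the disk path, threshold, page() hook, and heartbeat endpoint are all my inventions, not anything from the article): a watchdog that checks disk usage directly on the host and pages through a path independent of the primary alerting vendor, plus a dead man's switch so that silence itself becomes an alert.

    import shutil, time, urllib.request

    DISK_PATH = "/var/lib/redpanda"   # hypothetical data directory
    USAGE_LIMIT = 0.80                # made-up threshold
    HEARTBEAT_URL = "https://example.invalid/heartbeat"  # assumed dead man's switch endpoint

    def page(msg):
        # Hypothetical escalation hook: in real life this would go out via a
        # second provider, deliberately independent of the primary alert vendor.
        print("PAGE:", msg)

    while True:
        total, used, _ = shutil.disk_usage(DISK_PATH)
        if used / total > USAGE_LIMIT:
            page("disk usage at %.0f%%, is tiered storage offload stalled?" % (100 * used / total))
        try:
            # Heartbeat to an external dead man's switch: if these stop
            # arriving, the absence of the signal itself triggers a page.
            urllib.request.urlopen(HEARTBEAT_URL, timeout=5)
        except OSError:
            page("heartbeat delivery failed, alerting path may be down")
        time.sleep(60)

The point isn't the specific checks, it's that the paging path shares no dependencies with the thing being monitored.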
And I don't get why they had to roast CrowdStrike at the end. I mean, the CrowdStrike incident was really amateur stuff, like, the absolute lowest bar I can think of.
by rybosome on 6/21/2025, 4:05:05 PM
Must be hell inside GCP right now. That was a big outage, and they were tired of big outages years ago. It was already extremely difficult to move quickly and get things done due to the reliability red tape, and I have to imagine this will make it even harder.
by diroussel on 6/21/2025, 8:12:20 PM
> Modern computer systems are complex systems — and complex systems are characterized by their non-linear nature, which means that observed changes in an output are not proportional to the change in the input. This concept is also known in chaos theory as the butterfly effect,
This isn't quite right. Linear systems can also be complex, and linear dynamic systems can also exhibit the butterfly effect.
That is why the butterfly effect is so interesting.
Of course non-linear systems can show a large change in output from a small change in input, because they allow step changes and many other non-linear behaviours.
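To make that concrete, a minimal worked example of my own (not from the article): take the one-dimensional linear system

    \frac{dx}{dt} = \lambda x, \qquad \lambda > 0, \qquad x(t) = x_0 e^{\lambda t}.

The difference between two trajectories obeys the same equation, so initial conditions epsilon apart diverge as \epsilon e^{\lambda t}: exponential sensitivity to initial conditions with nothing non-linear anywhere. (Stricter definitions of chaos also require bounded trajectories, which an unforced linear system can't provide, so whether this counts as the full butterfly effect depends on whose definition you use.)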
by Peterpanzeri on 6/21/2025, 4:46:45 PM
“We got lucky as the way we designed it happened not to use the part of the service that was degraded.” This is a stupid statement from them; I hope they'll be better prepared next time.
by raverbashing on 6/21/2025, 4:08:22 PM
Lol, I love how they call "not spreading your services needlessly across many different servers" an "Architectural Pattern" (cell-based arch).
They are right, of course, but that's the way things are; the obvious needs to be said sometimes.
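For anyone who hasn't seen the term, the heart of the pattern really is that simple: route each tenant deterministically to one isolated cell, so a failing cell only takes down its own tenants. A toy sketch in Python (cell names and count are made up):

    import hashlib

    CELLS = ["cell-a", "cell-b", "cell-c"]  # hypothetical isolated deployments

    def cell_for(tenant_id):
        # Stable hash so a tenant always lands in the same cell; an outage in
        # one cell leaves every other cell's tenants untouched.
        digest = hashlib.sha256(tenant_id.encode()).digest()
        return CELLS[digest[0] % len(CELLS)]

    print(cell_for("acme-corp"))  # deterministic, e.g. "cell-b"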
by bdavbdav on 6/21/2025, 3:50:04 PM
“We got lucky as the way we designed it happened not to use the part of the service that was degraded”
Hmm. Here's what I read from this article: Redpanda didn't happen to use any of the stuff in GCP that went down, so they were unaffected. They use a third party for alerting and dashboarding, and that third party went down, but Redpanda still had their own monitoring.
When I read "major outage for a large part of the internet was just another normal day for Redpanda Cloud customers", I expected a brave tale of Redpanda SREs valiantly fixing things, or some cool automatic failover tech. What I got instead was: Google told Redpanda there was an issue, Redpanda had a look, their service was unaffected, nothing needed failing over, and then someone at Redpanda wrote an article bragging about their triple-nine uptime & fault tolerance.
I get it, an SRE is doing well if you don't notice them, but the only real preventative measure I saw here that directly helped with this issue is that they over-provision disk space, which I'd be alarmed if they didn't do.
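For a rough sense of the arithmetic behind that headroom (my numbers, not theirs): when tiered-storage offload to object storage stalls, local disks have to absorb the full ingest rate for the duration of the stall, so the headroom you need is roughly ingest rate times expected outage length.

    # Made-up numbers: local disk consumed while object-storage offload is down.
    ingest_gib_per_hour = 200      # hypothetical cluster-wide ingest rate
    stall_hours = 3                # hypothetical offload outage duration
    headroom_gib = ingest_gib_per_hour * stall_hours
    print(headroom_gib)            # 600 GiB eaten before anything breaks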