• by robcxyz on 3/19/2024, 3:54:06 AM

    This might not be related, but I had an outage of a VKE cluster from Friday to Monday morning, and their customer support blamed it on dockerhub. That didn't seem right at all, though, since the issue only came up when I upgraded a cluster and didn't impact every node. So, as their customer support usually does, they found some way to deflect the problem (i.e. pointing at dockerhub despite its status page only showing some degradation) and ignored it. What really didn't inspire confidence is that their support clearly doesn't understand k8s well, giving me a response to the effect of "clearly it is dockerhub's fault" while pointing at a pod's status, without going into the pod's events or logs to check whether the containers were actually being pulled.
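
    For what it's worth, that check is only a couple of commands; a minimal sketch, with my-pod and my-namespace as placeholder names:

        # Placeholder pod/namespace names; look at pull events and container state
        kubectl describe pod my-pod -n my-namespace
        kubectl get events -n my-namespace --field-selector involvedObject.name=my-pod
        kubectl logs my-pod -n my-namespace --all-containers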

    Again, not sure if this is related, but I'm using this as an opportunity to share how bad my experience has been with Vultr's customer support over the last couple of years. Every time I have interacted with them over an issue, it's been some diagnosis that somehow makes things not their fault. When people have clusters down for multiple days because of control plane errors, I would think they would be somewhat concerned, or at least offer something to the effect of an apology, especially to customers spending thousands every month. I doubt I'll get any reimbursement.

    The worst situation in the past was when I complained about connectivity issues that I was sure were related to some firewall on their side, which was throwing alarms for my app, and kept trying to get them to look at it. After going absolutely crazy for a month trying to figure out what the hell was going on, I finally got my rep to look at it and, bam, they saw the issues and blamed them on a faulty cable. Faulty cables don't drop packets the way I was seeing, though, so now I honestly just don't know what to believe from them.

  • by LinuxBender on 3/18/2024, 2:06:29 PM

    Just anecdotally, and perhaps unrelated to your issue, I have a primary DNS server in Vultr, and at times IPv4 times out, then IPv6. It hasn't been persistent enough for me to start troubleshooting it or to set up 3rd-party monitoring, but I may do that today if others are seeing odd behavior now. Perhaps together we could create a list of service endpoints to monitor each other, using curl or dig, and find a pattern to it.

    Something to play around with

        # TCP AXFR.
        kdig @2001:19f0:b001:e83:5400:4ff:fe72:e740 +nocookie +padding=64 +retry=0 +all -t axfr example.net
        dig @216.128.176.142 +nocookie +padding=64 +retry=0 +all -t axfr example.net
    
        # UDP TXT or whatever
        kdig @2001:19f0:b001:e83:5400:4ff:fe72:e740 +nocookie +padding=64 +retry=0 +all -t txt example.net
        dig @216.128.176.142 +nocookie +padding=64 +retry=0 +all -t txt example.net
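
    And to actually find a pattern, something like this run from cron on a couple of boxes would log which address family fails and when. A rough sketch only: the nameserver addresses are the ones above, the zone is the example.net placeholder, and the log path is whatever suits you.

        #!/bin/sh
        # Probe each nameserver with a short timeout and log per-address results
        for ns in 216.128.176.142 2001:19f0:b001:e83:5400:4ff:fe72:e740; do
            if dig @"$ns" +retry=0 +time=3 -t soa example.net > /dev/null 2>&1; then
                echo "$(date -u +%FT%TZ) OK   $ns"
            else
                echo "$(date -u +%FT%TZ) FAIL $ns"
            fi
        done >> /var/log/vultr-dns-probe.log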

  • by inok6743 on 3/18/2024, 2:20:58 AM

    I am having the same issue now with the Cloud Compute servers in the Tokyo region.

    In terms of a workaround, a server created without an IPv6 address seems to work fine, and assigning an IPv6 network to the server brings the issue back for me.

    So I guess that something is going wrong with Vultr's network configuration at this point.
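
    If anyone wants to double-check that it's the IPv6 path, a quick comparison from an affected server (assuming the failures are on pulls from Docker Hub; registry-1.docker.io is its normal registry endpoint):

        # Force each address family separately; if only -6 fails, the IPv6 path is the problem
        curl -4 -sv https://registry-1.docker.io/v2/ -o /dev/null
        curl -6 -sv https://registry-1.docker.io/v2/ -o /dev/null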

  • by rjst01 on 3/18/2024, 10:56:12 AM

    We noticed this yesterday while trying to release a minor bug fix. As of this morning, it still appears to be broken.

    It's hard to see how it could be anything other than an issue on Docker's side, since we are seeing a 500 after all. I need to unblock development ASAP, so for now the workaround for us has been to migrate our container registry to Azure.
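
    For anyone weighing the same move, the change is roughly this shape with Azure Container Registry; a sketch only, with myregistry and myapp as placeholder names rather than our real ones:

        # Placeholder registry/image names; build an image and push it to ACR
        az acr login --name myregistry
        docker build -t myregistry.azurecr.io/myapp:1.2.3 .
        docker push myregistry.azurecr.io/myapp:1.2.3
        # then update deployments to pull myregistry.azurecr.io/myapp:1.2.3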