by teraflop on 3/21/2015, 5:07:42 PM
by endymi0n on 3/21/2015, 7:29:23 PM
Wonder if those guys checked out https://github.com/mesos/chronos - it was the best solution I could find when I recently wanted to solve distributed, reliable Cron for us.
by KaiserPro on 3/21/2015, 6:12:26 PM
We had a similar issue, although at a different scale.
However, Jenkins works well as a cron replacement, although I'm not sure whether there's a limit on the number of build slaves you can attach to it.
An interesting read, but it doesn't look like there's too much exciting or novel here. (In a fundamental sense, that is. I'm sure there's all kinds of interesting nuts-and-bolts engineering that outsiders aren't privy to.) TLDR: use a replicated state machine to make scheduling decisions, and make all operations on the datacenter idempotent.
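To make the idempotency half of that TLDR concrete, here is a minimal sketch, not the article's actual implementation. The `Datacenter` class and the `launch_id` helper are hypothetical names invented for illustration. The idea is only that every replica derives the same identifier for a given scheduled run, so a retry after a failover becomes a no-op.

```python
class Datacenter:
    """Toy stand-in (hypothetical) for a datacenter API that deduplicates launches."""

    def __init__(self):
        self.launched = {}

    def launch(self, run_id, command):
        # Re-issuing a launch with a known run_id does nothing, so a new
        # leader that repeats a half-finished launch cannot start it twice.
        if run_id in self.launched:
            return False
        self.launched[run_id] = command
        return True


def launch_id(job_name, scheduled_time):
    # Deterministic: depends only on the job and its scheduled time,
    # never on which scheduler replica issues the request.
    return f"{job_name}@{scheduled_time}"
```

With this shape, a replica that crashes after calling `launch` but before recording the result can be replaced by another replica that simply calls `launch` again with the same identifier.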
The hashing trick to mitigate spiky load distributions is cool, but that seems to be more about multi-tenancy than reliability.
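A rough sketch of how such a hashing trick could work, assuming the goal is to pick a stable start minute per job instead of having every job fire at minute 0. The function name `spread_minute` and the choice of SHA-256 are my assumptions, not details from the article.

```python
import hashlib


def spread_minute(job_name, period_minutes=60):
    """Map a job name to a stable minute offset within the period."""
    # A cryptographic hash is used only because it is stable across
    # processes and machines; Python's built-in hash() is salted per process.
    digest = hashlib.sha256(job_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % period_minutes
```

Because the offset depends only on the job name, every replica computes the same start time, and a large population of "hourly" jobs ends up spread roughly evenly across the hour.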
I'm disappointed to see this article perpetuating the misconception that Paxos is a leader election algorithm. It tries to elect a leader for its own purposes, but Paxos itself behaves safely even if the election process goes temporarily amok; other systems built on top of it might not. If you want to provide the guarantee that only one scheduler instance is running at a time, you need to add a lease mechanism and make assumptions about clock synchrony. I'm sure the authors know this, but not mentioning it at all seems pretty sloppy.
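For what a lease adds on top of the election, here is a minimal sketch under stated assumptions: a single grantor hands out the lease, and clocks drift apart by at most `max_drift` seconds over one lease period. The `Lease` class and `holder_may_act` function are hypothetical names, and a real system would replicate the grantor's state rather than keep it in one object.

```python
class Lease:
    """Grantor side: tracks who holds the lease, measured on the grantor's clock."""

    def __init__(self, duration):
        self.duration = duration
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, node, now):
        # Grant or renew only if the lease is free, already ours, or expired.
        if self.holder is None or self.holder == node or now >= self.expires_at:
            self.holder = node
            self.expires_at = now + self.duration
            return True
        return False


def holder_may_act(requested_at, local_now, duration, max_drift):
    # Holder side: count from when the request was SENT (local clock) and
    # stop max_drift early, so the holder gives up before the grantor can
    # hand the lease to someone else, provided the drift bound holds.
    return local_now < requested_at + duration - max_drift
```

The safety of this depends entirely on the drift bound being true, which is the clock synchrony assumption mentioned above. If a clock misbehaves beyond `max_drift`, two nodes can both believe they hold the lease.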