How the big boys run their infrastructure

Hello, great person,

Sometime during the week I heard the name Borg mentioned as the system Google uses to run its clusters. As it shares its name with my favorite backup tool, I was keen to learn more about it. This week's paper showcases the design and the thinking that went into building Borg. I wondered whether it is still in use (the paper is from 2015, which is the stone age in computer time), but according to an article by The Register (Q2 2020), it is still in use.

This paper also showed me that I should learn more about cgroups and chroot; some links on these two Linux features are in the links section below.

Software exists to create business value

I am Simon Frey, the author of the Weekly CS Paper Newsletter, and I have great news: you can work with me.

As CTO as a Service, I will help you choose the right technology for your company, build up your team and be a deeply technical sparring partner for your product development strategy.

Check out my website simon-frey.com to learn more or to contact me directly.

Let’s work together!

Abstract:

Google’s Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines. It achieves high utilization by combining admission control, efficient task-packing, over-commitment, and machine sharing with process-level performance isolation. It supports high-availability applications with runtime features that minimize fault-recovery time, and scheduling policies that reduce the probability of correlated failures. Borg simplifies life for its users by offering a declarative job specification language, name service integration, real-time job monitoring, and tools to analyze and simulate system behavior. We present a summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy decisions, and a qualitative examination of lessons learned from a decade of operational experience with it.

Download Link:

https://storage.googleapis.com/pub-tools-public-publication-data/pdf/43438.pdf

Additional Links:

chroot on Wikipedia
cgroups on Wikipedia
cgroups blog post by grant.pizza (what a great name for a blog :D)