Borg: Google's cluster management operating system

1. Introduction


Borg provides three main benefits: it

  1. hides the details of resource management and failure handling so its users can focus on application development instead;
  2. operates with very high reliability and availability, and supports applications that do the same; and
  3. lets us run workloads across tens of thousands of machines effectively.





We have long believed that 80% of operations issues originate in design and development, so this section on overall service design is the largest and most important.

When systems fail, there is a natural tendency to look first to operations since that is where the problem actually took place. Most operations issues, however, either have their genesis in design and development or are best solved there.


  1. Keep things simple and robust
  2. Design for failure