Borg: google 集群管理操作系统

1. Introduction

google服务器集群的管理系统,类似于百度的Matrix,阿里的fuxi,腾讯的台风平台等等,还有开源的mesos

Borg provides three main benefits: it

  1. hides the details of resource management and failure handling so its users can focus on application development instead;
  2. operates with very high reliability and availability, and supports applications that do the same; and
  3. lets us run workloads across tens of thousands of machines effectively.

大规模分布式系统的设计和部署实践

论文 http://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf

这篇论文主要从面向运维友好的角度,思考了大规模分布式系统的设计和部署相关的一些原则和最佳实践。

总体设计原则

We have long believed that 80% of operations issues originate in design and development, so this section on overall service design is the largest and most important.

When systems fail, there is a natural tendency to look first to operations since that is where the problem actually took place. Most operations issues, however, either have their genesis in design and development or are best solved there.

对服务整体设计影响最大的一些运维友好的基本原则如下:

  1. Keep things simple and robust
  2. Design for failure