Librería Portfolio


SITE RELIABILITY ENGINEERING. HOW GOOGLE RUNS PRODUCTION SYSTEMS
Title:
SITE RELIABILITY ENGINEERING. HOW GOOGLE RUNS PRODUCTION SYSTEMS
Subtitle:
Author:
MURPHY, N
Publisher:
O'REILLY
Year of publication:
2016
Subject:
INTERNET PROGRAMMING
ISBN:
978-1-4919-2912-4
Pages:
552
44,50 €

 

Synopsis

The overwhelming majority of a software system's lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems?

In this collection of essays and articles, key members of Google's Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You'll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient: lessons directly applicable to your organization.

This book is divided into four sections:

Introduction: Learn what site reliability engineering is and why it differs from conventional IT industry practices
Principles: Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE)
Practices: Understand the theory and practice of an SRE's day-to-day work: building and operating large distributed computing systems
Management: Explore Google's best practices for training, communication, and meetings that your organization can use



Chapter 1. Introduction
The Sysadmin Approach to Service Management
Google's Approach to Service Management: Site Reliability Engineering
Tenets of SRE
The End of the Beginning
Chapter 2. The Production Environment at Google, from the Viewpoint of an SRE
Hardware
System Software That "Organizes" the Hardware
Other System Software
Our Software Infrastructure
Our Development Environment
Shakespeare: A Sample Service
Principles
Chapter 3. Embracing Risk
Managing Risk
Measuring Service Risk
Risk Tolerance of Services
Motivation for Error Budgets
Chapter 4. Service Level Objectives
Service Level Terminology
Indicators in Practice
Objectives in Practice
Agreements in Practice
Chapter 5. Eliminating Toil
Toil Defined
Why Less Toil Is Better
What Qualifies as Engineering?
Is Toil Always Bad?
Conclusion
Chapter 6. Monitoring Distributed Systems
Definitions
Why Monitor?
Setting Reasonable Expectations for Monitoring
Symptoms Versus Causes
Black-Box Versus White-Box
The Four Golden Signals
Worrying About Your Tail (or, Instrumentation and Performance)
Choosing an Appropriate Resolution for Measurements
As Simple as Possible, No Simpler
Tying These Principles Together
Monitoring for the Long Term
Conclusion
Chapter 7. The Evolution of Automation at Google
The Value of Automation
The Value for Google SRE
The Use Cases for Automation
Automate Yourself Out of a Job: Automate ALL the Things!
Soothing the Pain: Applying Automation to Cluster Turnups
Borg: Birth of the Warehouse-Scale Computer
Reliability Is the Fundamental Feature
Recommendations
Chapter 8. Release Engineering
The Role of a Release Engineer
Philosophy
Continuous Build and Deployment
Configuration Management
Conclusions
Chapter 9. Simplicity
System Stability Versus Agility
The Virtue of Boring
I Won't Give Up My Code!
The "Negative Lines of Code" Metric
Minimal APIs
Modularity
Release Simplicity
A Simple Conclusion
Practices
Chapter 10. Practical Alerting from Time-Series Data
The Rise of Borgmon
Instrumentation of Applications
Collection of Exported Data
Storage in the Time-Series Arena
Rule Evaluation
Alerting
Sharding the Monitoring Topology
Black-Box Monitoring
Maintaining the Configuration
Ten Years On.
Chapter 11. Being On-Call
Introduction
Life of an On-Call Engineer
Balanced On-Call
Feeling Safe
Avoiding Inappropriate Operational Load
Conclusions
Chapter 12. Effective Troubleshooting
Theory
In Practice
Negative Results Are Magic
Case Study
Making Troubleshooting Easier
Conclusion
Chapter 13. Emergency Response
What to Do When Systems Break
Test-Induced Emergency
Change-Induced Emergency
Process-Induced Emergency
All Problems Have Solutions
Learn from the Past. Don't Repeat It.
Conclusion
Chapter 14. Managing Incidents
Unmanaged Incidents
The Anatomy of an Unmanaged Incident
Elements of Incident Management Process
A Managed Incident
When to Declare an Incident
In Summary
Chapter 15. Postmortem Culture: Learning from Failure
Google's Postmortem Philosophy
Collaborate and Share Knowledge
Introducing a Postmortem Culture
Conclusion and Ongoing Improvements
Chapter 16. Tracking Outages
Escalator
Outalator
Chapter 17. Testing for Reliability
Types of Software Testing
Creating a Test and Build Environment
Testing at Scale
Conclusion
Chapter 18. Software Engineering in SRE
Why Is Software Engineering Within SRE Important?
Auxon Case Study: Project Background and Problem Space
Intent-Based Capacity Planning
Fostering Software Engineering in SRE
Conclusions
Chapter 19. Load Balancing at the Frontend
Power Isn't the Answer
Load Balancing Using DNS
Load Balancing at the Virtual IP Address
Chapter 20. Load Balancing in the Datacenter
The Ideal Case
Identifying Bad Tasks: Flow Control and Lame Ducks
Limiting the Connections Pool with Subsetting
Load Balancing Policies
Chapter 21. Handling Overload
The Pitfalls of "Queries per Second"
Per-Customer Limits
Client-Side Throttling
Criticality
Utilization Signals
Handling Overload Errors
Load from Connections
Conclusions
Chapter 22. Addressing Cascading Failures
Causes of Cascading Failures and Designing to Avoid Them
Preventing Server Overload
Slow Startup and Cold Caching
Triggering Conditions for Cascading Failures
Testing for Cascading Failures
Immediate Steps to Address Cascading Failures
Closing Remarks
Chapter 23. Managing Critical State: Distributed Consensus for Reliability
Motivating the Use of Consensus: Distributed Systems Coordination Failure
How Distributed Consensus Works
System Architecture Patterns for Distributed Consensus
Distributed Consensus Performance
Deploying Distributed Consensus-Based Systems
Monitoring Distributed Consensus Systems
Conclusion
Chapter 24. Distributed Periodic Scheduling with Cron
Cron
Cron Jobs and Idempotency
Cron at Large Scale
Building Cron at Google
Summary
Chapter 25. Data Processing Pipelines
Origin of the Pipeline Design Pattern
Initial Effect of Big Data on the Simple Pipeline Pattern
Challenges with the Periodic Pipeline Pattern
Trouble Caused By Uneven Work Distribution
Drawbacks of Periodic Pipelines in Distributed Environments
Introduction to Google Workflow
Stages of Execution in Workflow
Ensuring Business Continuity
Summary and Concluding Remarks
Chapter 26. Data Integrity: What You Read Is What You Wrote
Data Integrity's Strict Requirements
Google SRE Objectives in Maintaining Data Integrity and Availability
How Google SRE Faces the Challenges of Data Integrity
Case Studies
General Principles of SRE as Applied to Data Integrity
Conclusion
Chapter 27. Reliable Product Launches at Scale
Launch Coordination Engineering
Setting Up a Launch Process
Developing a Launc