Google Cloud

  Moving fast with happy customers

As industries are being disrupted, organizations are forced to adapt to new operational requirements. This creates a need to move fast while still delivering reliable experiences to customers, which can be challenging when applications are built in complex environments.

Site reliability engineering (SRE) is a strategy that can help you overcome this challenge. It helps organizations create systems that are reliable both for customers and for the development and operations teams behind them. As a result, your teams and systems become more agile and reliable, and you free up valuable time for innovation without decreasing customer satisfaction. If you are curious, keep reading to find out more about SRE, how it enables Blameless to be more innovative, and how Netflix employs it to keep customers streaming and happy.

What is SRE?

The concept of SRE was first formulated at Google in 2003. It is a combination of people, practices, and products that helps to create reliable systems and services that prioritize customer experiences. We at Google do so by treating operations as if it were a software problem. The goal is to protect, provide for, and progress the software systems behind public services by continuously checking on availability, latency, performance, and capacity. This is also why we use blameless postmortems: we see faults as inevitable, so we stop blaming a system or employee and instead focus on process and technology to see what went wrong and how to fix it. Allow us to explain why it can be helpful to work in this manner.

  SRE to the rescue

The most important feature of SRE is reliability. We believe you need to work on reliability all the time, not just when outages occur. By constantly checking and improving the reliability of services together, we make sure they stay stable. This ultimately makes things work better for everyone: customers have a better experience, and your team no longer worries about being blamed for errors. You are also able to adjust a lot faster because big outages are prevented. Preventing outages not only helps you minimize toil and enhance customer experience, it can also save you a lot of money. Gartner estimated that the average cost of downtime for a company can reach $300,000 per hour. SRE is therefore crucial when managing a larger company that needs to move quickly and act upon unforeseen changes while also keeping a focus on customer experience.

Blameless

One of the companies that knows all about SRE is the Google Startup Blameless, an SRE platform that empowers teams to optimize the reliability of their systems without losing time for innovation. According to the CEO, Ashar Rizqi, SRE helped the company during the pandemic to keep up with growing customer demand for online reliability and the need to adjust quickly to changing circumstances. Rizqi also notes that implementing SRE can be especially helpful for quickly growing businesses, because it makes sure that the team has everything it needs to be successful and reliable. Blameless itself uses several Google Cloud tools, such as Kubernetes, to implement and act upon the SRE principles. If you are interested in learning more about their specific SRE practices, watch the video below.

  Keep on streaming

One company that you surely know and that employs SRE effectively is Netflix. It’s no secret we all started to spend a little more time watching our favorite series during lockdown, which meant that keeping Netflix up and running was important. To manage this, Netflix’s SRE team prioritized systemic risk identification, handling the lifecycle of incidents, and reliability consulting.

They also implemented the shared ownership model, meaning that the team operated what they built themselves. In this manner, they were able to identify issues before they impacted customers. According to Netflix, focusing on reliability through SRE “empowers us to reveal business-critical socio-technical risks, facilitate effective responses to those risks, and ensure Netflix continues to bring joy to customers”.

If you are motivated to learn more about SRE, you can read the books we wrote about it or watch one of the videos of our SRE series below.

Sources: