Perhaps you’ve now implemented some of the DevOps principles and processes. The next step is to make it work at scale by implementing SRE. Here are some methods that have proven effective in getting teams off the ground with SRE principles:
Introduce Service-Level Objectives (SLOs) and Error Budget
SRE teams negotiate service-level objectives (SLOs) with stakeholders. SLOs define the level of service customers can expect. Service levels are measured by service-level indicators (SLIs). SLIs evaluate the critical aspects of the service.
SLOs also define how much unavailability an application is allowed, an allowance that can be used to drive innovation and improve reliability. This allowable downtime is the “Error Budget”: the threshold for defects and outages that the development team can “spend” on shipping or repairing features until the budget is depleted. ZEN Software’s Agile Analytics provides a suite for managing and tracking Error Budgets across your organisation.
This ensures both availability and the flexibility to improve or update the existing system. Traditional uptime measures are not sufficient for complex systems, as they assume that all services are created equal. Instead, SRE measures availability as the proportion of customer interactions that succeed:
$$\text{Availability} = \frac{\text{number of successful requests}}{\text{total requests}}$$
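As a minimal sketch of the formula above, the ratio can be computed directly from request counts. The function name `availability` and the choice to treat a period with no traffic as fully available are assumptions made for illustration, not something the article or any particular tool prescribes.

```python
def availability(successful_requests: int, total_requests: int) -> float:
    """Return the share of requests that succeeded (the availability SLI)."""
    if successful_requests < 0 or total_requests < 0:
        raise ValueError("request counts must be non-negative")
    if successful_requests > total_requests:
        raise ValueError("successful requests cannot exceed total requests")
    if total_requests == 0:
        # Assumption: a period without traffic is counted as fully available.
        return 1.0
    return successful_requests / total_requests


if __name__ == "__main__":
    # Hypothetical counts chosen to match the 99.981% example in the text.
    sli = availability(999_810, 1_000_000)
    print(f"Availability: {sli:.3%}")
```

Counting requests rather than minutes of uptime means that a short outage during peak traffic weighs more heavily than a long one while nobody is using the service.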
The SLI is the measured value, expressed as a number (e.g., 99.981%), and the SLO is the target that the SLI is compared against. The error budget is what the SLO leaves over (100% minus the SLO target); subtracting the unreliability observed during the period from that budget gives the budget that remains. This figure can be converted into minutes of allowable downtime per month, which can be used for innovation and experimentation. In Figure 2, the availability Error Budget gradually decreases because of small but constant problems.
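The conversion from an SLO to an error budget and to minutes of downtime can be sketched as follows. This is an illustrative calculation under stated assumptions: a 30-day month, an SLO expressed as a fraction strictly between 0 and 1, and hypothetical function names that do not belong to any specific product.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes; assumes a 30-day month


def _check_slo(slo: float) -> None:
    if not 0.0 < slo < 1.0:
        raise ValueError("SLO must be a fraction strictly between 0 and 1")


def error_budget(slo: float) -> float:
    """Fraction of requests (or time) allowed to fail under the SLO."""
    _check_slo(slo)
    return 1.0 - slo


def allowed_downtime_minutes(
    slo: float, period_minutes: int = MINUTES_PER_30_DAY_MONTH
) -> float:
    """Convert the error budget into minutes of downtime for the period."""
    return error_budget(slo) * period_minutes


def budget_remaining_fraction(slo: float, sli: float) -> float:
    """Share of the error budget still unspent; negative when overspent."""
    budget = error_budget(slo)
    consumed = 1.0 - sli  # unreliability observed in the period
    return (budget - consumed) / budget


if __name__ == "__main__":
    slo = 0.999   # hypothetical target: 99.9% availability
    sli = 0.9995  # hypothetical measurement: 99.95% availability
    print(f"Allowed downtime: {allowed_downtime_minutes(slo):.1f} min per month")
    print(f"Budget remaining: {budget_remaining_fraction(slo, sli):.0%}")
```

A negative remaining fraction signals that the budget has been overspent, which is the point at which a team would typically pause feature work in favour of reliability work.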
ZEN Software’s Agile Analytics allows you to track and observe the performance (in real time) of your services and take corrective action when Error Budgets run out.
Figure 2: Latency is the problem here (yellow)
Understanding that improving reliability and performance is an organisational problem
Understanding how to improve availability requires putting developers and operations teams on the same page. SRE requires a fuller understanding of the interactions between the two, usually by assigning a site reliability engineer to the development team's activities as an orchestrator and by assessing which tools are appropriate for both teams. SRE teams also need to negotiate with the business customer to determine actual needs and capacity.
Ensuring teams have the organisational capacity to deliver SRE
SRE is a significant investment, and organisations need to plan its implementation around the mission and an adequate budget. Each sprint should include the organisational provisions needed to ensure that customers can reliably use the product. This requires both proper tooling, such as automated scripts, and organisational measures, such as having development teams on call in case products are incomplete in preproduction. To make this sustainable, SRE work spent on toil and other operational tasks is capped at half of all activities, so that the team keeps enough capacity for engineering work that reduces toil and technical debt.
Focusing on the problem, not the cause
SRE calls for blameless analysis: understanding how teams can learn from a problem rather than assigning blame. This means understanding what made the problem happen, not what could have stopped it from happening, and, where necessary, understanding every level of causation rather than a single root cause. Blameless postmortems should be carried out after every major incident, with an emphasis on drawing on all available data sources.