Infrastructure and SRE

The infrastructure/SRE team aims to design, implement and operate infrastructure, and to support applications in production in mission-critical environments.

Roles

  1. Team Leader: responsible for the customer and for the activities that are carried out.
  2. SRE Engineer: responsible for the tasks, tickets and events that occur within their area of responsibility.

Circle

The team is organized into circles. Each circle operates independently and takes care of the activities related to a specific customer/project.

Team Leader

  • Reports to the CTO and COO
  • Defines high-level activities
  • Defines activities and checks their progress
  • Organizes the team and measures its efficiency
  • Organizes on-call shifts according to services and SLAs

SRE Engineer

  • Communicates with customers about the status of activities
  • Handles incidents for the customers assigned to the circle
  • Writes and improves the documentation related to the circle following the standards
  • Raises problems and tensions, whether within the circle or reported by customers, in stand-up meetings
  • Develops CI/CD pipelines and migrations
  • Helps developers automate deployment
  • Verifies the status of the pipelines and ensures that they are working properly
  • Keeps infrastructure up to date and safe
  • Keeps services operational
  • Keeps backups operational
  • Interfaces with HW suppliers
  • Develops and maintains tools and software useful to the team, following the good development practices suggested by the Software Mentor (an external role)

The philosophy behind this organization is to increase the team's responsibility and to give its members a higher level of involvement and a stronger sense of ownership. Everyone is responsible for something, which helps every member of the team grow regardless of seniority. The team is designed to follow the basic principles and good practices of Site Reliability Engineering (SRE).

Team feedback loop

The team uses various tools to generate and receive feedback.

The feedback tools differ in purpose and effectiveness. As the SRE literature recommends, each is used according to the type of feedback to be given and the urgency of the event to be managed.

Examples of the tools used are email, GitLab/Jira (e.g. issue and task tracking), Teams and a ticket system (e.g. Jira Service Desk).

Engagement

External

Agile Lab formalizes the times and methods of engagement with customers in terms of SLAs/SLOs/SLIs and mode of engagement (e.g. ticket, email, call). The team adapts to customer needs and operates in accordance with these agreements, managing interventions and the related communication effectively and efficiently.

In highly critical cases, the engagement can, and where required must, be made by mobile phone to a member of the team, who cannot refuse to handle it even if busy with other activities.

Documentation

For each project/infrastructure, technical documentation is written in Markdown and versioned in a Git repository ({customer}.{project}.DevOps). This documentation contains:

  • an exhaustive infrastructure map of the solution (e.g. made with draw.io)
  • hosts list, connection matrix, network configuration
  • a detailed set of commands (command misc) to act on the various parts of the infrastructure
  • backup details: storage and procedures for recovering and restoring the backup during a critical event
  • HA test reports
  • Postmortems

A customer-agnostic project holds common knowledge about services, along with tips and tricks. Everybody in the team is asked to contribute to it and keep it updated.

Projects

Projects are defined as a sequence of activities aimed at reaching a particular goal (e.g. environment creation, tool development, performance optimization).
As with a software project, we adopt company workflows and guidelines.

Activities vs. Incidents

Activities

Activities can be independent tasks, or can be part of a project.

  • Activities are defined with the team leader
  • You can work on an activity only once the related GitLab issue has been defined
  • You should only work on one activity at a time
  • Each activity is defined as follows (use related template):
    • Title, Description
    • Tasks
    • Daily feedback comments (while the activity is in progress)
    • Deadline
    • Definition of done
  • All activities are defined day by day, following an Agile methodology
  • All activities/MRs should be reported in the customer's repository, while generic issues/MRs should be reported in the team's internal repository
  • All activities/MRs need to be labelled with standard tags/labels
  • When activities/MRs are ready, they are assigned to the Team Leader for review
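
The activity template and the rules above can be sketched as a small data structure. This is a minimal illustration, not the team's actual tooling: the `Activity` class, its field names and the `can_start` helper are hypothetical and only mirror the fields listed above, plus two of the rules (a GitLab issue must be defined, and only one activity is in progress at a time).

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Activity:
    """Hypothetical model of the activity template (all names are assumptions)."""
    title: str
    description: str
    tasks: List[str]
    deadline: date
    definition_of_done: str
    issue_id: Optional[int] = None  # GitLab issue number, once the issue exists
    in_progress: bool = False
    daily_comments: List[str] = field(default_factory=list)


def can_start(activity: Activity, current: List[Activity]) -> bool:
    """Return True if the activity may be started under the rules above."""
    # An activity needs a defined GitLab issue before work begins.
    has_issue = activity.issue_id is not None
    # Only one activity should be in progress at a time.
    nothing_in_progress = not any(a.in_progress for a in current)
    return has_issue and nothing_in_progress
```

In this sketch, an activity without an `issue_id`, or one requested while another is already in progress, is simply reported as not startable; what happens next is left to the Team Leader.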

Incidents

  • Incidents are opened and managed in a ticket system; tickets are created manually or generated automatically by proactive monitoring.
  • Incidents are assigned to the on-call team member or properly routed to the correct person by means of the defined escalation path.
  • [Definition of done] At the end of the incident, when deemed appropriate, the post-mortem and resolution documentation can be produced (see postmortem section).
  • [Definition of done] If there is a business impact, customers and stakeholders must be kept up to date during each phase of the incident. When it ends, a report must be produced explaining what happened, the actions taken and the root causes.
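
The assignment rule above (on-call member first, otherwise the escalation path) could look like the following sketch. The matching criterion is an assumption made for illustration: the handbook does not say how the "correct person" is identified, so this example uses a hypothetical mapping from people to the areas they cover. The function name and parameters are likewise hypothetical.

```python
from typing import Dict, List, Set


def route_incident(area: str,
                   on_call: str,
                   escalation_path: List[str],
                   coverage: Dict[str, Set[str]]) -> str:
    """Pick the assignee for an incident (illustrative rule, an assumption).

    The on-call member takes the incident when they cover the affected
    area; otherwise the first person on the escalation path who covers it
    is chosen. If nobody matches, the incident stays with the on-call
    member so that it always has an owner.
    """
    for person in [on_call, *escalation_path]:
        if area in coverage.get(person, set()):
            return person
    return on_call
```

Falling back to the on-call member is a design choice of this sketch, made so that an incident is never left unassigned.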

Postmortem

A postmortem goes beyond the documentation of an incident. It is a detailed and blameless analysis of the symptoms, root cause and timeline of the event, but above all it records the corrective actions that should prevent the same problem from happening again.

Postmortems are expected after any significant undesirable event. Writing one requires considerable time and effort, so we have to choose when to do it. Specific triggers help with this decision:

  • data loss
  • on-call team member intervention
  • impact on the business
  • monitoring failure
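
The triggers listed above can be expressed as a simple check. This is a minimal sketch under stated assumptions: the `IncidentSummary` record and the `postmortem_triggers` function are hypothetical names, and the result is only an aid to the decision, not a replacement for the team's judgment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IncidentSummary:
    """Hypothetical record of what happened during an incident."""
    data_loss: bool = False
    on_call_intervention: bool = False
    business_impact: bool = False
    monitoring_failure: bool = False


def postmortem_triggers(incident: IncidentSummary) -> List[str]:
    """Return the names of the triggers from the list above that fired."""
    checks = {
        "data loss": incident.data_loss,
        "on-call team member intervention": incident.on_call_intervention,
        "impact on the business": incident.business_impact,
        "monitoring failure": incident.monitoring_failure,
    }
    return [name for name, fired in checks.items() if fired]
```

A non-empty result suggests that a postmortem is worth considering, and the returned names can be quoted in the postmortem to explain why it was written.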

The analysis and technical details of the incident are produced by the on-call team member, who is the only one with a clear timeline of the events, but this is only half of the work.

For a postmortem to have value, it needs to be reviewed by the team, which has to define and take charge of the corrective actions.

@AgileLab https://www.agilelab.it/

Modified at: 2021-04-08 10:00:09
