Over halfway through the conference now. I've met some really interesting people and attended some great talks on Linux tech and IT operations.
Running Incident Retrospectives
Presenter: Courtney Eckhardt (Heroku)
A retrospective looks back at an incident after the fact: what happened, what we did, and what we can do to prevent it from happening again. The goal is to reduce customer impact and team burnout.
Remediation: an action that can be taken to prevent or reduce the impact of an incident, e.g.
- monitoring & observability
- increased resiliency
- procedural and cultural change
Critical to identify all impact clearly.
Try to find all the remediations possible, even
Consider having someone not involved in the incident facilitate. This will depend on their people skills (they need to be quite level-headed). The goal of the facilitator is to probe and ensure lots of questions are asked. A less-familiar facilitator will ensure 'simple' questions are raised, so important details aren't overlooked.
For me - this could be an opportunity to offer my time to another team (and learn about their world), whilst inviting other teams into my ops team - cross-pollination.
Look for conditions the engineer is operating under, which may be improved. e.g. sleep deprivation, distractions, etc.
Contributing Factor Discovery
Avoid "root cause analysis"
- There is always more than one root cause ('root cause' will prevent people from thinking about the other factors)
- Complex systems have complex failures
"Human Error" is not a root cause
- This is where to start the investigation, not where to conclude it.
- Question things like "why did the human take that action", "how long did it take for the human to fix it", "what steps did the human take to remediate", etc
"Try Harder" is not a remediation
- We can't depend on humans to avoid errors. We need to design things around ourselves to help.
Retrospectives must be blame-free
- People shouldn't be in fear for their job
- Retrospectives are a great learning opportunity
- Have empathy for the person who made a mistake (this person is "the second victim" of the incident). Avoid making the guilt worse.
Ask LOTS of questions
Blame-free questioning - look to understand everything
- When (was the incident identified, was support engaged, did escalations occur, was the incident resolved, were communications sent, etc)
- Why (and go many levels deep) - e.g. if someone 'overlooked' a code bug, understand whether they were distracted, were they under pressure, were they not experienced enough, etc
- Who (was involved - whether instigating the change, reviewing/approving the change, initial engineer response, incident manager, stakeholders etc) - this will help ensure you're asking questions to the right people
- Impact (direct technical impact, flow on effects, staff/engineering costs, customer attrition, opportunity cost, cultural costs, relationship/trust costs)
We will never get to 100% reliability (trying is a waste of effort)
Success means avoiding recurrence
You'll see new and more complicated incidents. This is OK (a really good thing) as this means you're making progress.
Long-term success measurement is hard. It is often unsustainable or ineffective to ask people to look at incidents more holistically. Project-management style reporting on incident success is very costly, but can work for larger teams. Having a management sponsor (someone senior who sees value in effective incident management) also helps. Often, less-tangible measures, such as how much pain operational teams are experiencing, can be a good indicator of success.
Consider - how effective are the retrospectives? Worth periodically revising your approach to retrospectives, tune to ensure they remain effective.
Who attends a Retrospective?
- Anyone involved in the incident (e.g. on call engineer, incident manager)
- Stakeholders (especially those impacted) - although external customers might not be engaged
Communicating (post retro) to customers
Try not to make the customer/consumer responsible for preventing the incident from occurring in the future. If you share too much of the detail outside of the internal Ops team, you may inadvertently put pressure on the customer's team to change how they do things to work around your issue.
You need to provide enough info to show that you've taken it seriously, but don't share your inner workings. You need to give them faith that you are in control of your environment (and that they won't dictate how it needs to work).
Talk - Running Retrospectives; Talking for Humans
Presenter: Courtney Eckhardt (Heroku)
Alt topic name: "Words mean things"
Goals of this talk
- Facilitating a retro
- Create a good emotional space for a retrospective.
Three jobs as a facilitator
- Running a productive meeting
- Not screwing things up by making bad jokes
In the retrospective - you may need to encourage people to rephrase what they say to facilitate a less threatening environment.
- Psychological safety
- Blame free
- Keep the meeting moving
It is easy to impart blame in a conversation.
The word "you" draws a line between the person speaking and the other person. It is quite aggressive, especially if a sentence starts with "you".
"Why" questions can also be aggressive, as they incorporate blame within the construction of the question.
When speaking of the incident, there are other strong words to avoid, as they can evoke strong defensive emotions in the other person:
- every time
Use more inclusive words (e.g. 'often' instead of 'always').
Note - some of these are OK for future tense (e.g. "we should always .....")
Example (of what not to say)
"Why didn't you just fix it the last time this happened?"
What to say instead
Ask open-ended questions which elicit creativity and complexity in people's responses. Use explorative phrasing. Imagine a better world.
"In order to understand what another person is saying, you must assume it is true and try to imagine what it could be true of"
This allows you to understand their perspective, rather than applying your own judgement
Why do people do things?
Nobody does things they think will blow up the world
People always have reasons for doing things that they do.
The reasons can range from the (uncommon) extreme of being in an altered mental state (psychosis etc) through to a carefully evaluated logical set of reasons. Whilst the reasons might not be just/fair/good from someone else's perspective, the reasons were valid enough for that person to perform the action.
Organisations which design systems are constrained to produce designs which are copies of the communication structures of these organisations.
Be careful that
Useful things for retro meetings
Select a specific note taker (ideally someone who wasn't involved in the incident)
If you're participating in a conversation and writing it down, you need to pause writing so you can listen and/or talk.
The note-taker is a separate person to the facilitator
Rotate the note-taker role (so a person isn't singled out as someone whose only contribution is taking notes) - everyone participates
Stay on time and on topic
- Set guidelines at the commencement (e.g. if it gets off topic, you may need to steer the conversation in the right direction)
- You won't miss anything
- Lets participants stay with the group
- Ensure people can engage collaboratively
- Digressions can inhibit participation. Write these on a whiteboard and/or in notes for later follow-up if further discussion is warranted
Also, practice interrupting (even outside of the retro).
- Look at the person speaking, you need to communicate that their contributions are valuable
- Don't interrupt them
- When they are finished speaking, you can summarise the main takeaway of what they are saying and what this means to you.
- If the other person is talking for a long time, use posture and voice to indicate you're listening. Then, when there's a gap in the conversation, use this opportunity to summarise.
- Assume positive intent - someone may come across as harsh, but their background (culture, language etc) may prevent them from understanding the harsh nature of the comment. As the facilitator, try "let's work on rephrasing that in a way that is less aggressive and removes blame".
Handling power dynamics
If there's a senior person or executive in the retrospective and you're not already familiar with working with them, invest time before the retrospective meeting to explain the retrospective framework, including the approach/goals of a retro. Some senior people are more comfortable with more aggressive discussion, so they may not think you're taking it seriously if you're calm.
Watch for people who have not participated. They might be afraid to interrupt. Either ask during the meeting ("Fred, you were involved in X during the incident, what were your observations...") or approach them after the meeting if they seem uncomfortable.
You may need to also calm their nerves (especially if they are perceived as a contributor to the root cause). Assure them that their perspective is valuable.
Anything which might make someone uncomfortable during a retrospective is unhelpful.
Also, anything which might be OK between two people (e.g. John and Steve) might make a 3rd party (listening to the joke) uncomfortable.
Instead, focus on positivity
- Be warm
- Be supportive
- Be inclusive
- Focus on what's important to the business, keep a good perspective
You don't need to be witty
If things go wrong in a retrospective
- If you make a mistake (or offend someone), apologise, correct yourself and move on.
- Don't dwell on shame or blame
If someone makes a bad joke in a retrospective, you can say "please don't make jokes like that in a retrospective". Also, don't bring it up again, even after the retrospective (although if they ask you later, you can explain why it was inappropriate).
Black Swan Events
Speaker: Laura Nolan
What is a Black Swan Event?
- Hard to predict
- Severe in impact
e.g. in trading, a black swan was the 2008 financial crash
In comparison, white swans are smaller events which are much easier to manage
Black swans can become routine non-incidents.
The class of incidents caused by change can be mostly defeated with canarying (rolling out to a small number of systems first).
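As a rough illustration of canarying, the sketch below deploys to a couple of hosts, checks their health, and only then continues to the rest. The `deploy` and `healthy` callables and the host names are hypothetical placeholders, not anything shown in the talk.

```python
def canary_rollout(hosts, deploy, healthy, canary_count=2):
    """Deploy to a few canary hosts first; continue only if they stay healthy.

    `deploy` and `healthy` are hypothetical callables supplied by the caller.
    Returns (hosts_deployed_to, completed).
    """
    canaries, rest = hosts[:canary_count], hosts[canary_count:]
    for host in canaries:
        deploy(host)
    if not all(healthy(host) for host in canaries):
        # Stop here: the blast radius is limited to the canaries.
        return canaries, False
    for host in rest:
        deploy(host)
    return canaries + rest, True
```

A real rollout would usually also wait for a soak period and compare canary metrics against the rest of the fleet before continuing.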
Classes of Black Swan
What is it?
Events which are often not predicted. The problems are caused by many different factors, including hitting platform maximums, exceeding tuning maximums, running out of transaction IDs, running out of capacity etc
- Load Testing
- Collaboration between infrastructure team AND Applications team
- Test startup with large data sets
What is it?
This can be malicious, but can also just be an overload in a different part of an ecosystem, which results in a different part of the environment
- Failing fast (this is better than failing slow)
- Deadlines / timeouts / limit retries / exponential backoffs / jitter (randomised retries, ensuring not all clients follow precisely the same pattern)
- Use dashboards & tools to quickly identify the performance bottlenecks
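To make the retry advice above concrete, here is a minimal sketch of limited retries with capped exponential backoff and jitter. The function names, default values and the injectable `rng` and `sleep` parameters are assumptions for illustration only.

```python
import random
import time


def backoff_delays(max_retries=5, base=0.1, cap=5.0, rng=random.random):
    """Yield one randomised delay (in seconds) per retry attempt."""
    for attempt in range(max_retries):
        # Exponential growth, capped so a long outage cannot produce huge waits.
        ceiling = min(cap, base * (2 ** attempt))
        # Jitter: scale by a random factor so clients do not retry in lockstep.
        yield rng() * ceiling


def call_with_retries(operation, max_retries=5, sleep=time.sleep):
    """Call `operation`, retrying a limited number of times before giving up."""
    delays = backoff_delays(max_retries)
    last_error = None
    for _ in range(max_retries + 1):
        try:
            return operation()
        except Exception as error:  # a real client would catch narrower errors
            last_error = error
            delay = next(delays, None)
            if delay is None:
                break  # retry budget exhausted: fail fast instead of looping
            sleep(delay)
    raise last_error
```

The retry limit matters as much as the backoff: unbounded retries from many clients can keep an overloaded service from recovering.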
What is it?
Coordinated demand. For instance,
- cron jobs at midnight
- Mobile clients all updating at a specific time
- 300 servers booting simultaneously
- Large pending / queued requests after an outage
- Plan and test - it is likely at some point this will happen, so test for it.
- Selectively drop unimportant workloads
- Asynchronous workloads which can be rescheduled for later
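The last two defences can be sketched as a simple admission step that serves the most important requests, reschedules asynchronous work, and drops the rest. The request fields (`priority`, `can_defer`) and the capacity number are hypothetical.

```python
def shed_load(requests, capacity):
    """Split requests into (serve_now, deferred, dropped) under a capacity limit."""
    # Lower number = more important. sorted() is stable, so arrival order
    # is preserved within each priority level.
    ordered = sorted(requests, key=lambda request: request["priority"])
    serve_now, deferred, dropped = [], [], []
    for request in ordered:
        if len(serve_now) < capacity:
            serve_now.append(request)
        elif request["can_defer"]:
            deferred.append(request)  # asynchronous work: reschedule for later
        else:
            dropped.append(request)   # unimportant and cannot be rescheduled
    return serve_now, deferred, dropped
```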
What is it?
External malicious activities designed to impact a system, e.g. Maersk 2017 cyber attack which halted their shipping operations
- Minimise blast radius (isolate production systems from management/desktop systems, separate dev, separate staging)
What is it?
Can you start up your entire service from scratch, with none of your infrastructure running?
- Layer your infrastructure
- Test the process of starting from scratch
- Beware of soft dependencies
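One way to check a layered design on paper is to write the dependencies down and ask for a start order; a loop means the service cannot be started from scratch. This is a small sketch using Python's standard `graphlib` module (Python 3.9+), with made-up layer names.

```python
from graphlib import CycleError, TopologicalSorter


def startup_order(dependencies):
    """Return an order in which layers can be started from nothing.

    `dependencies` maps each layer to the layers it needs first.
    Raises CycleError if the layers depend on each other in a loop.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical layers: each entry lists what must already be running.
layers = {
    "dns": [],
    "storage": ["dns"],
    "config": ["storage"],
    "app": ["config", "dns"],
}
print(startup_order(layers))
```

A check like this only covers the dependencies someone has written down, so soft dependencies that nobody recorded still need the real start-from-scratch test.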
Overall strategies to protect against black swan events
- Well-defined incident management process, escalations and paging
- Good communication & documentation
Managing OS Release transitions at Netflix Scale
Speaker: Edward Hunter (Netflix)
Consider End Of Life, Upgrades and Painless transitions
- 180,000 virtual machines across data centres all over the world
- Need to upgrade without affecting any customers
- 16 different AWS instance types, across all sorts of hardware configurations. There were around 160 different configurations of AWS instance, storage, CPU, memory etc.
- Running a polyglot environment with LOTS of different languages (Python, Ruby, C++ and many others)
- Over 4,000 deployments per day
- Don't break Netflix - people panic if they can't Netflix
Step back, understand the constraints and plan out how to perform the upgrades. They had three high level steps
Change only the things which need to be changed
Don't change code, drivers etc unless absolutely unavoidable
Get it working internally first
Ensure it works in dev etc first
Have a set of friends to work with
Pre-alpha users, people willing to work with you on upgrades to make sure things work as expected.
- Optional Apache front end
They then apply application code to this and build what they term an "application AMI" which is released into the cloud
Making developer lives easier
IT Ops taking control of EVERYTHING except for the app itself, allowing Developers to focus on the development.
IT Ops built tools such as automated configuration of Apache, Tomcat and Java, so the developers don't need to think too hard about how to configure these.
App release workflow seems pretty standard - features merged into unstable branch, then release branch, then promoted into candidate base and release base.
Debian package deployment workflow aligns to the unstable base branch and the candidate base branch and the release branch.
Everything they could think of was automated.
- Image builds
- Image tests
- Automated changelog generation
- Automated performance testing
The image tests are home-grown tests (Java tests, container tests etc)
The performance testing is only infrastructure-level, they don't test the applications. For this infrastructure perf testing, they spin up a very large number of instances and run various home-grown benchmark tests, collect the results, average them etc.
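The "collect the results, average them" step might look something like the sketch below. The instance IDs, benchmark names and numbers are invented for illustration; the talk did not show the actual tooling.

```python
from statistics import mean


def summarise(results):
    """Average each benchmark across all instances that reported it."""
    by_benchmark = {}
    for scores in results.values():
        for name, value in scores.items():
            by_benchmark.setdefault(name, []).append(value)
    return {name: mean(values) for name, values in by_benchmark.items()}


# Hypothetical results keyed by instance ID.
results = {
    "i-a": {"disk_mb_s": 100, "net_mb_s": 50},
    "i-b": {"disk_mb_s": 120, "net_mb_s": 70},
}
print(summarise(results))
```

Averaging over a very large number of instances smooths out noisy neighbours and one-off slow hosts, which is presumably why so many instances are spun up.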
- Quarterly meetings with other managers in Netflix
- Video training for all the engineers in Netflix
- Slack Channel during the migrations
- One on one communication with individual teams, offering to help with migrations
Unicorns - things they were surprised by
- Amazon account structure (a tree containing master, streaming accounts, prod / test accounts, billing, etc) - this caught them by surprise. Lots of different permissions etc putting testing integrity at risk (as they might not be able to test everything)
- Organisational structure. When people leave, they take knowledge with them. Operationally, someone who was responsible for a migration left before the migration was complete, and their teammates didn't know where the migration was up to.
Spent a lot of time with different teams to understand requirements and flesh out the design/approach
It took a year and a half to prepare for this first set of upgrades.
The subsequent upgrades (once the process was bedded in) took a month.
Solving all the systemd problems
Speaker: Alvaro Leiva Geisse (Facebook)
What is systemd?
A service manager:
This manages your services. It is not nginx, it is not MySQL etc. It starts, stops and manages the lifecycle of your applications, but it is not a service per se.
In System V, you wrote your own service manager (e.g. in /etc/init.d), lots of complex bash scripts.
Upstart vs systemd - two approaches which were devised to address these shortfalls. systemd ended up becoming popular and won the race.
Standardised config files. Instead of scripting a change of directory, you tell systemd which directory to run in. Instead of chroot, it can leverage cgroups.
Systemd can also provide important metadata (such as service runtime, dependent processes, PIDs etc), instead of needing to write scripts to do this for you.
A new feature which allows you to impose system resource limits on services, e.g. CPU restrictions, maximum memory, etc.
systemd keeps track of cgroups.
Even if a new process is spawned (not as a child group) it keeps track of cgroup members and applies the limits accordingly. systemd can also manage all processes which have been spawned, detached from parent etc.
Using "@" in the service name (a template unit), with instances named in the form name@instance.service
Allows you to spawn lots of instances with the same unit file (but variable instance names). Each instance could have its own config file etc as needed.
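To show how an instance name relates to its unit file, here is a small sketch that splits an instantiated unit name into the template file it comes from and the instance string (the value systemd exposes to the unit as `%i`). The helper name and the example unit name are hypothetical.

```python
def split_instance(unit_name):
    """Split 'prefix@instance.suffix' into (template_unit, instance)."""
    stem, dot, suffix = unit_name.rpartition(".")
    prefix, at, instance = stem.partition("@")
    if not dot or not at or not instance:
        raise ValueError(f"not an instantiated unit name: {unit_name!r}")
    # All instances share one template file named 'prefix@.suffix'.
    return f"{prefix}@.{suffix}", instance


print(split_instance("worker@eu-west.service"))
```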
Normally, your application will start and bind to a port. If the application fails, the port closes and the request is unsuccessful.
systemd instead listens on the port on your behalf. It then can spawn a service whenever a request is received on the port it is bound to. You only start your application when you require it.
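On the application side, a socket-activated service receives its listening sockets as already-open file descriptors instead of binding the port itself. The sketch below follows systemd's documented convention (the `LISTEN_PID` and `LISTEN_FDS` environment variables, with descriptors starting at 3); the helper names are my own.

```python
import os
import socket

SD_LISTEN_FDS_START = 3  # first inherited descriptor in systemd's convention


def inherited_fds(environ, pid):
    """Return the descriptor numbers passed to this process, if any."""
    if environ.get("LISTEN_PID") != str(pid):
        return []  # nothing was passed, or it was meant for another process
    count = int(environ.get("LISTEN_FDS", "0"))
    return list(range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + count))


def inherited_sockets():
    """Wrap each inherited descriptor in a socket object (already listening)."""
    return [socket.socket(fileno=fd)
            for fd in inherited_fds(os.environ, os.getpid())]
```

When the list is empty, the service was started directly rather than by socket activation, and it can fall back to binding the port itself.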
You can start processes easily as a specific user, rather than as root.
A / B Testing
Systemd can facilitate A/B deployments.
Python / systemd library
- You can start/stop services
- you can query status
- you can modify units
Alvaro has written this library
Other resources for learning systemd
Some great talks by Lennart Poettering
Operations Reform - Tom Sawyering your way to Operational Excellence
Speaker: Tom Limoncelli (Stack Overflow)
How to motivate other people to change, without it being "top down".
Tom Sawyer said "I bet you can't paint this fence as well as me".
Things people know they should do, but don't necessarily do them
- brushing your teeth
- source control
- Regular service response
- emergency response
Google - came up with a spreadsheet, scoring the important areas such as the above. This gave Google a lot of data they could use and assess their operational service health (with some good history) - capturing these stats once per month.
Tinypulse, but with a bit more info.
This isn't about assessing the team, it is measuring how effective the service is.
Why did this work?
- Simple (spreadsheet - didn't require anything complex, no code etc)
- Low barrier to entry
- Leverages pride as a motivator
Creates good culture:
- Blameless - assesses the service, not the people
- Transparency & responsibility - culture of fixing things, not hiding things
Non-monetary recognition of good work
- Encourages copying greatness
- Outstanding work recognised by peers
Helps direct cross-team impact
- Easy to identify larger needs across the organisation
Don't tie bonuses to this
- If bonuses are tied to the scores, people may artificially inflate figures
- Nobody would want to join a team with poor service performance
Instead, you want your best engineers joining the teams who are struggling
Seek perfection, don't require it
- If you require perfection, there's an incentive to lie.
- Perfection is a waste of money (the final 10% costs more than the first 90%)
- Focus your time on the first 90%
Explain it in terms of the cost of perfection if people demand it.
Stack Overflow example
They also put together an assessment.
- very small team (one SRE team)
- More granular definition of "service"
- Scaled the process down - one spreadsheet, scores pass/fail instead of a number
X axis: backups, upgradability, monitoring, failover/resiliency, build automation, separate dev/test, documentation, security, capacity planning
Y axis: list of services (IIS, MySQL Server, TLS Certs, Linux Operating Systems, etc)
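The pass/fail grid could equally be kept as a small data structure. This sketch uses a subset of the criteria and services listed above with invented pass/fail values, and shows how to pull out which services fail each criterion.

```python
CRITERIA = ["backups", "monitoring", "documentation"]

# Invented marks for illustration only: True = pass, False = fail.
scorecard = {
    "MySQL Server": {"backups": True, "monitoring": True, "documentation": False},
    "TLS Certs": {"backups": False, "monitoring": True, "documentation": False},
}


def failing_services(scorecard, criteria):
    """Map each criterion to the services that currently fail it."""
    return {
        criterion: sorted(
            service for service, marks in scorecard.items()
            if not marks.get(criterion, False)
        )
        for criterion in criteria
    }


def render(scorecard, criteria):
    """Render one line per service, assessing the service and not the people."""
    lines = []
    for service, marks in scorecard.items():
        cells = ", ".join(
            f"{c}={'pass' if marks.get(c, False) else 'FAIL'}" for c in criteria
        )
        lines.append(f"{service}: {cells}")
    return "\n".join(lines)


print(render(scorecard, CRITERIA))
```

Grouping failures by criterion is what makes cross-team needs visible: a criterion that fails for most services points at a wider gap.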
Why this worked
- Simple! (also used colours with the scale - red is bad, green is good)
- Blameless - Assess the service, not the people
- People motivated to expose their warts and fix them
- Invisible to management no longer (this gives people a voice)
How to fix these issues
This created a new problem - ignoring other projects to fix these problems
To address this:
- set a goal of 20% project hours for tech debt
- "Theme month" (e.g. September is "Fix Backups" month)
- The theory of constraints
Theory of constraints
This says that improving a stage upstream of the bottleneck only increases the strain on the bottleneck, while improving a stage downstream of the bottleneck achieves little because that stage is already starved of work. You must focus on improvements to the bottleneck itself.
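A toy model makes the point: in a serial pipeline, overall throughput is set by the slowest stage, so speeding up any other stage changes nothing. The stage names and rates below are hypothetical.

```python
def throughput(capacities):
    """Items per hour a serial pipeline can finish: set by its slowest stage."""
    return min(capacities.values())


def bottleneck(capacities):
    """Name of the stage that currently limits the pipeline."""
    return min(capacities, key=capacities.get)


# Hypothetical stage capacities in items per hour.
pipeline = {"build": 40, "test": 10, "deploy": 25}

faster_build = {**pipeline, "build": 80}    # upstream improvement
faster_deploy = {**pipeline, "deploy": 50}  # downstream improvement
faster_test = {**pipeline, "test": 30}      # improvement at the bottleneck

print(throughput(pipeline), throughput(faster_build),
      throughput(faster_deploy), throughput(faster_test))
```

Once the bottleneck is improved, it moves to another stage (here `deploy`), so the exercise is repeated from there.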
- Nobody likes being told their baby is ugly
- If you give someone an opportunity to fix their own problems, they quite often will