USENIX LISA 2018 - Day 1

I attended the USENIX LISA (Large Installation Systems Administration) conference in Nashville, Tennessee from 29th to 31st October 2018. Aside from heaps of awesome live country music, some delicious southern food and many hours at HeadQuarters Beercade for pinball & arcade games, below are some notes from the conference.

Red Hat Keynote

https://www.usenix.org/conference/lisa18/presentation/masters

Speaker: Jon Masters (Red Hat)

2018 - everyone has been "super excited" about patching their systems every 10 minutes. Meltdown, Spectre, Heartbleed, Bitlocker, etc.

How did we get to this security mess?

Hardware and software devs generally don't talk to each other a great deal.

The trend is now moving away from this separation - hardware and software are becoming more interlinked.

An architecture is a set of ISA specifications which defines how similar systems can talk to each other, e.g. PA-RISC, x86, SPARC. It governs the instruction set and allows code to be compiled against it.

Within an architecture there are various software layers:

  • Apps
  • Operating Systems
  • Hypervisors etc

Applications are targeting the ISA specification. Traditionally, software was written to manage physical memory mappings etc. Nowadays a memory manager handles this abstraction layer and the applications no longer care. Operating systems are designed to give each application the illusion that it has its own dedicated memory space, even though it is abstracted.

System On Chip (SoC)

Processor chips containing caches, cores etc., paired with DDR memory. Caching is all about bringing the data you need closer to where you need it. Data is constantly being brought into the cache and constantly being evicted while the core works on something else.

Microarchitecture refers to a specific implementation of an architecture. e.g. within x86 there is Skylake.

In-order microarchitectures execute instructions in strict sequential program order. They generally stall when waiting for something (e.g. a memory load), resulting in high latency.

Out-of-order microarchitectures can determine where there is no dependency between certain instructions and separate them. The application cannot tell the difference, as results are returned to it in order. The processor does this by taking the instructions, loading them into a re-order buffer and determining the dependencies. It can then execute instructions as soon as their dependencies are satisfied, instead of waiting sequentially. This allows for better throughput and lower latency.

Branch Prediction and Speculation.

In order to increase efficiency, some predictions are made as to what might be returned. Instead of sitting around waiting for a value, the program can run ahead. This requires keeping track of all the guesses/predictions made, then allowing an 'undo' if the prediction was incorrect. Speculative execution adds an extra column to the re-order buffer and uses spare processing capacity to guess a result and keep executing on that assumption (but can roll back if the assumption is wrong). Minimal loss, as the processor would otherwise be idle waiting for the dependency.

If correct - great, a massive performance increase.

If incorrect, there is a small performance hit due to flushing the incorrectly speculated results.

Branch predictor uses history buffer to get better at guessing which way a branch will go.

Indirect branch prediction (predicting the target of branches whose destination is computed at runtime) uses less logic to "just do stuff" with spare capacity.

Virtual Memory

Applications have a view that they have memory to themselves.

There is also a kernel memory space, which generally sits higher (at higher addresses, with higher privilege) than the app memory space.

Virtual Memory Manager maps memory addresses to physical memory. There's a cache (TLB) in the processor to make this faster.

Virtual memory lookups require even more lookup stages

In the case of a hypervisor, the steps are doubled (as the requests are passed from the guest OS to the hypervisor)

Cache Optimisation

Lots of mechanisms exist for making this more efficient.

Side-Channel Attacks

Any way we can monitor something and infer what it is doing.

E.g. taking a bank card, measuring the electrical voltages and inferring what the card is doing from those voltages.

For caches, I can time how long a fetch takes and determine whether it was in the cache or not.

There are ways we can pull things into caches intentionally too.
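
As a toy illustration of the timing-inference idea only (not a real cache attack - those use native code, explicit cache-line flushes and cycle-accurate timers, and Python's interpreter overhead swamps real cache effects), the sketch below simply times a fetch and compares it against a threshold:

    import time

    # Hypothetical threshold separating "fast" (cached) from "slow" (uncached)
    # accesses; a real attack would calibrate this with a cycle counter.
    THRESHOLD_NS = 200

    def timed_fetch(buffer, index):
        """Time a single element access and guess whether it was 'cached'."""
        start = time.perf_counter_ns()
        _ = buffer[index]
        elapsed = time.perf_counter_ns() - start
        return elapsed, elapsed < THRESHOLD_NS

    data = bytearray(64 * 1024 * 1024)  # large buffer
    elapsed, looks_cached = timed_fetch(data, 12345)
    print(f"access took {elapsed} ns, looks cached: {looks_cached}")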

How this translates to attacks

Vendor responses are on a specific timeline. There is limited time to create, test and prepare to deploy mitigations, and a lot of materials have to be prepared for this.

Mitigations use a combination of interfaces the processors provide: microcode, millicode and software.

Branch Predictors are shared between applications

Meltdown

When an out-of-order set of instructions is performed, because the processor is speculating and running ahead, I can briefly access data I shouldn't be able to (pulling it into the cache). By doing a cache timing analysis, I can then infer the content of that data.

Fix Meltdown:

  • Prevent the data from being there
  • Page Table Isolation - keep kernel mappings out of the userland page tables. A slowdown, but allows better security

Spectre

What is it?

  • Abuses speculation and allows reading further than I should be able to.
  • Mistrain the branch predictors (so the victim speculatively executes attacker-chosen paths, leaving the desired data observable via the shared cache)

Fix Spectre v1

  • Stop speculation
  • Safely force the speculation to a safe value

Lazy Floating Point

When switching from one app to another, a malicious application could infer the floating point register state of the previous application (because that state is restored lazily).

Other attacks

How to change moving forward

  • Need to change how the hardware is built
  • Need to change how the hardware/software communities engage with each other (better collaboration)

Keynote 2 - Past, Present and Future of Sys Admin

https://www.usenix.org/conference/lisa18/presentation/mon-keynote-3

Speaker: Tameika Reed (Founder of Women In Linux)

Overview

How do you get into Linux and stay in it, or pivot and do something else?

Sys Admin skills, even if the landscape has changed, are still valid in the past, present and future:

  • Problem solving / analytical thinking
  • Virtualisation
  • Cloud
  • Automation
  • Performance / Tuning
  • Testing
  • Security (software / network / physical / operational)
  • Scripting
  • Communication
  • Networking

Past Skills

What skillsets did I start with (on day 1, in 1998)? What were those skillsets 3 months later, a year later, 3 years later, 20 years later? It is important that these skills evolve. You don't necessarily need to know where to go, but you need to be open to learning new things ALWAYS.

Plenty of past skillsets which still apply - NFS, SCSI, TFTP, PXE, etc.

Side note - SELinux and firewalld seem pretty big/common here (lots of people using them).

Understanding your customers

  • Tier 1 / tier 2 admins (e.g. if you're an MSP)
  • Internal customers
  • External business customers

Present Skills

Are you doing everything (network, hardware, backup, Linux, Windows, DB) or are you specialised?

An Infrastructure and Automation Engineer moves away from touching these other platforms directly, instead embracing APIs to instruct them. This includes:

  • Virtualisation
  • Monitoring
  • Backup
  • Documentation
  • Automation
  • CI/CD
  • Security

Modern Sysadmins need to become familiar with System Architecture design (from a 30,000 foot view).

  • Planning / deployment
  • Agile / ITIL
  • Security, backup, tech roadmap etc
  • Vendor engagement
  • Onsite or offsite or cloud

Future skills

Site Reliability Engineering

  • Strong coding background
  • Understanding SLA/SLO/etc
  • Understand CAP Theorem
  • Incident Management
  • Postmortems
  • Distributed Systems (horizontal / vertical)

Chaos / Intuition Engineer

  • Simulating Workloads
  • Testing hardware conditions
  • Identify performance and availability issues
  • Collaboration with all stakeholders
  • Understanding microservices
  • Analytics, visualisations of data
  • Testing / CICD

The Netflix tech blog has some good content on chaos engineering.

Blockchain Engineer

Being able to have a digital ID attached to where something has been, who worked on it, the state before & after, and have an immutable record of it. This is logging / auditing.

The example given was car servicing: someone buys a used car, gets his mate to change the oil (who forgets to put a screw back on), then tries to blame the car dealer.
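
As a rough sketch of the "immutable record" idea (not any particular blockchain product), here is a Python hash chain where each service entry includes the hash of the previous entry, so rewriting earlier history is detectable:

    import hashlib
    import json

    def add_entry(chain, record):
        """Append a record whose hash covers the previous entry's hash."""
        prev_hash = chain[-1]["hash"] if chain else "0" * 64
        body = {"record": record, "prev_hash": prev_hash}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        chain.append({**body, "hash": digest})

    def verify(chain):
        """Recompute every hash; any edited entry breaks the chain."""
        prev_hash = "0" * 64
        for entry in chain:
            body = {"record": entry["record"], "prev_hash": prev_hash}
            digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
            if digest != entry["hash"] or entry["prev_hash"] != prev_hash:
                return False
            prev_hash = digest
        return True

    history = []
    add_entry(history, {"vehicle": "ABC-123", "work": "oil change", "by": "dealer"})
    add_entry(history, {"vehicle": "ABC-123", "work": "oil change", "by": "mate"})
    print(verify(history))  # True until someone rewrites an earlier entry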

Quantum Computing

  • HPC
  • Cryptography (e.g. using a single photon as a private key - detecting eavesdropping if the photon does not reach the destination)
  • Qubit (Quantum Entanglement)

Other thriving areas

  • HPC
  • DevSecOps
  • IoT
  • Gaming
  • Automotive Grade Linux (vehicle automation)

Summary

  • The skillsets from the past are still applicable but have evolved
  • Look at problems from 30,000 ft view
  • Read the documentation
  • Try the opensource version of a product
  • Don't need to work at a big company to get good experience
  • Keep an eye on market trends (webinars, conferences, blogs, magazines, tech news etc)

Talk - SLO Burn

https://www.usenix.org/conference/lisa18/presentation/wilkinson

Speaker: Jamie Wilkinson (Google Australia SRE)

Intro

There's a lot of anxiety around on-call. Lots of the same things repeatedly. Lots of interruptions, pager alerts etc.

We go on call to maintain reliability of services.

The brain should be used for doing things which haven't been done before (i.e. not solved before) rather than repeated simple faults.

Paging should be on what users care about, not on what is broken.

Rapid rate of change on a system means the on-call workload continually grows for a system, often not related to the size of a system.

We are trying to maintain the system being monitored AS WELL AS the monitoring platform.

At Google, they cap on-call/operational work at 50% of an engineer's time.

Alert Fatigue

Paging on-call is prone to generating too much noise. Paging should focus more on pre-defined risks (e.g. "replication has fallen behind by X amount" or "spare disks in the array have dropped below Y").

What is a symptom?

This is a matter of perspective. A Linux sysadmin symptom will probably look different to a user symptom.

For instance, if a front-end web node drops out, the user may observe higher latency, but the sysadmin will not see the latency - instead the sysadmin will look at technical symptoms (ping to a node, logs, etc.).

Engineering Tolerance - Error Budgets

The acceptable level of errors, availability loss, performance degradation etc.

Sometimes the budget is consumed by external factors (natural disasters, bugs, etc). Other times it is consumed by accidental impact (user/administrator error) or scheduled maintenance.

Measurements

SLI - Service Level Indicator

  • A measurable KPI (e.g. error rate, latency)

SLO - Service Level Objective

  • A goal/target for an SLI

SLA - Service Level Agreement

  • The expectation/agreement with users

How do we set an SLO?

  • Negotiate with users
  • Look at historical events
  • Design - look at risks
  • When in doubt - the SLO is the status-quo (i.e. if you don't know it yet, the SLO is your current service level)

A symptom is anything that can be measured by an SLO

A symptom based alert can be programmed against the SLO

SLOs should be defined in terms of requests (i.e. user-based) instead of time, i.e. instead of "x hours of uptime", it should be "y% of valid requests successfully fulfilled".
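
A minimal sketch of that request-based framing (the target and request counts below are illustrative, not from the talk):

    # Request-based SLI/SLO: measure the fraction of valid requests served
    # successfully, rather than hours of uptime.
    SLO_TARGET = 0.999          # assumed target: 99.9% of requests succeed

    def availability_sli(successful_requests, total_valid_requests):
        return successful_requests / total_valid_requests

    sli = availability_sli(successful_requests=9_987_000,
                           total_valid_requests=10_000_000)
    print(f"SLI = {sli:.4%}, meeting SLO: {sli >= SLO_TARGET}")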

Where to measure?

As close to the user as possible. A load balancer is a good place to measure a web service - that way you're not measuring transactions per server etc.

Burn Rate

Map out over time, scaling with size, whether the error budget is likely to be exceeded in the longer term. Look at data over longer timescales for long-term estimates.

Sometimes errors will ebb and flow, so you shouldn't necessarily alert for temporary spikes. Instead, determine whether the current error rate will significantly exceed the error budget in the long term.
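
A hedged sketch of the burn-rate idea - compare the observed error rate against the rate the error budget allows, and project when the budget would run out (the SLO, window and error rate are made-up numbers):

    SLO = 0.999                      # assumed availability objective
    BUDGET_WINDOW_DAYS = 30          # budget defined over a 30-day window here

    def burn_rate(observed_error_rate):
        """How many times faster than 'allowed' we are burning the budget."""
        allowed_error_rate = 1 - SLO
        return observed_error_rate / allowed_error_rate

    rate = burn_rate(observed_error_rate=0.004)   # 0.4% of requests failing
    days_to_exhaustion = BUDGET_WINDOW_DAYS / rate
    print(f"burn rate {rate:.1f}x; budget gone in ~{days_to_exhaustion:.1f} days")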

Paging vs SLO

Sometimes an SLO can be breached in the short term (e.g. in the last 10 seconds, 10% of queries failed) but be fine in the long term (in the past hour, only 0.0001% of queries failed).
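
One common way to encode that is a multi-window check, along the lines of the published SRE workbook approach (not necessarily what Google's internal tooling does): only page when both a short and a long window are burning budget fast.

    def should_page(short_window_error_rate, long_window_error_rate,
                    slo=0.999, burn_threshold=14.4):
        """Page only if both windows exceed the burn threshold, so a brief
        spike that won't dent the long-term budget stays silent.
        14.4 is an illustrative threshold (roughly: burning 2% of a 30-day
        budget within one hour)."""
        allowed = 1 - slo
        short_burn = short_window_error_rate / allowed
        long_burn = long_window_error_rate / allowed
        return short_burn >= burn_threshold and long_burn >= burn_threshold

    # 10% failures in the last 10 seconds, but almost nothing over the past hour:
    print(should_page(short_window_error_rate=0.10,
                      long_window_error_rate=0.000001))   # False - don't page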

Observability

It is great that we now have SLOs defined, but how do we actually know what's going on under the hood, with distributed systems, network boundaries, process boundaries etc. to digest and understand?

Pillars of observability

  • logs (pre-formatted events)
  • Metrics (prefiltered and preaggregated events)
  • Traces (events in a tree structure)
  • Exceptions / Stack Traces (mass extinction events)

Changes to system design

All new features and changes to a design should be done with alerting in mind (just as unit testing etc. is included, monitoring changes should also be considered).

Decommission alerts

As new ways of monitoring are devised, don't be afraid to clean up the less-useful ones.

SLOs of alerts per shift

Consider how long it takes to understand the root cause of an alert. In Google's case, they measured this at 6 hours for a particular team, meaning a 12 hour shift should only result in 2 pages (otherwise breaching this SLO)

Machine Learning?

Non-technical reason ML shouldn't be involved in configuring SLOs - people want to know why they are being disturbed with a page. If this is hidden behind machine learning, they will lose respect for it and stop trusting it.

Pager Impact

If a page is going to wake us up, it needs to either have an immediate impact on operations or present a significant threat to the next scheduled business operations (e.g. the following day).

Talk - Netflix Incident Management

https://www.usenix.org/conference/lisa18/presentation/hahn

Speaker: Dave Hahn (Netflix)

Some Netflix Statistics

  • Hundreds of billions of events per day
  • Tens of billions of requests per day
  • Hundreds of millions of hours of entertainment per day
  • Tens of millions of active devices connected to Netflix
  • Millions of containers
  • Hundreds of thousands of instances
  • Thousands of production environment changes per day
  • Tens of terabits of data per day

Goal

When someone has an opportunity to be entertained, the moment of truth is when they choose to connect to Netflix. When someone sees a "cannot connect" error, that moment of truth is lost and they do something else.

Chaos Monkey

When Netflix shifted from data centers into the cloud, they decided not to "lift and shift" but to completely re-architect their environment. Firstly, they assume that any instance can disappear at any time; therefore the inevitable and unexpected loss of one instance should not be noticeable to a customer. Chaos Monkey validates this.
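
A heavily simplified sketch of the idea (this is not Netflix's actual Chaos Monkey; terminate_instance here is a stand-in for whatever call your platform provides): pick a random instance from a group and terminate it, so the "any instance can disappear" assumption is exercised regularly.

    import random

    def pick_victim(instances):
        """Choose one instance at random from a group."""
        return random.choice(instances)

    def unleash_monkey(instances, terminate_instance, dry_run=True):
        victim = pick_victim(instances)
        if dry_run:
            print(f"[dry run] would terminate {victim}")
        else:
            terminate_instance(victim)   # platform-specific call, injected here
        return victim

    # Example: pretend terminate_instance just logs.
    unleash_monkey(["i-0a1", "i-0b2", "i-0c3"],
                   terminate_instance=lambda i: print(f"terminating {i}"),
                   dry_run=False)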

Designing for failure

Designing for 100% success is easy

Designing for 100% failure is easy

Designing for grey areas is difficult (i.e. occasional failures)

Latency Monkey

Introduce X ms artificial latency to y% of requests.
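
A minimal sketch of that idea (not Netflix's Latency Monkey itself): a decorator that adds a fixed delay to a fraction of calls to a request handler.

    import functools
    import random
    import time

    def inject_latency(delay_ms, fraction):
        """Delay roughly `fraction` of calls by `delay_ms` milliseconds."""
        def decorator(handler):
            @functools.wraps(handler)
            def wrapper(*args, **kwargs):
                if random.random() < fraction:
                    time.sleep(delay_ms / 1000.0)
                return handler(*args, **kwargs)
            return wrapper
        return decorator

    @inject_latency(delay_ms=500, fraction=0.05)   # 500 ms on ~5% of requests
    def handle_request(request):
        return {"status": 200, "body": f"hello {request}"}

    print(handle_request("world"))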

They tried increasing latency to 1 ms, 50 ms, 250 ms. It appeared they were resilient to these increases.

At 500 ms, customer requests dropped significantly - a huge impact.

Dropping the percentage of requests impacted back to 0% didn't fix it.

Dropping latency back to 0ms also didn't fix it.

It turns out the software had 'learnt' to cater for the increased latency, they were in the middle of changes, and they had affected an entire service (not just a small portion).

Learnings:

  • App behaviour
  • Blast Radius
  • Consistency

It took a while to regain customer engagement, much longer to recover than designed.

Failure at Velocity

The increased complexity of these environments has made it difficult to keep up with failure scenarios.

Reasonable Prevention:

Prevention is important, but don't overindex on past failures. Sometimes failures are OK - often there's already something in place (whether retries, etc) at a different layer. A specific failure might require hundreds of things to line up in a particular manner.

Don't overindex on future failures. Sometimes we over-engineer for future failures, but we don't actually understand what the problems will look like and we miss out on opportunities already in front of us.

Invest in resilience.

It needs to be a conscious choice.

New feature development needs to incorporate resilience in line with consumer requirements.

Codify good patterns - perhaps a shared library etc. for something one team finds works well. The learnings (and pain) one team went through should be usable by other teams.

More chaos!

Invest in further testing to break things intentionally.

Expect Failure

Build any system expecting it to fail. "when" not "if" a failure will occur.

Recovery vs Prevention

Sometimes planning quick recovery is a better use of time than designing a complex set of preventions.

Graceful degradation is also worth considering: are there ways less critical components can be disabled whilst the critical things remain operational?

Incident Management at Netflix Velocity

Goals:

  • Short incidents
  • Small number of consumers impacted
  • Unique failures (don't keep repeating the same ones) - although a repeat failure is sometimes easier to recover from. Also ensure you can identify uniqueness quickly
  • Ensure incidents are valuable. Outages come with significant costs, so we need to get as much value as possible from each incident.

Incident management:

  • There are well-defined experts in the smaller components. For incident management, create a team of failure experts (Core SREs who can provide advice on how to respond to incidents). The Core SREs aren't necessarily deep experts, but can engage the right experts as needed.

Before incidents:

Set expectations and provide training

  • Right equipment
  • Understand metrics, logs and dashboards
  • Know the common issues

Education & outreach:

  • Reach out to the rest of the organisation when designing the incident management workflows and educate them on how you manage incidents
  • Understand how different parts of the business are impacted (sales / legal / finance / developers / service desk / etc)

Coordination:

  • Separate engineering teams are involved, so it is important to have a central coordinator. The coordinator shouldn't be doing any in-depth engineering themselves.
  • Prepare early - train the coordinator how to be effective during an incident.

Communication:

  • Coordination of communication is important.
  • Get the right message out, and keep it consistent
  • Ensure not too many people are involved (mixed messages, noise etc)

Memorialisation:

  • Come back after the incident to understand why the fault occurred, what was effective during the incident, why you were successful in resolving it, and what could be improved.

Talk - ITOPS - detecting and fixing the "smells"

https://www.usenix.org/conference/lisa18/presentation/mangot

Speaker: Dave Mangot, Lead SRE at SolarWinds

Introduction

This talk looked at burnout in IT - ensuring people aren't burnt out by pages/escalations from the monitoring system.

"Crawl --> Walk --> Run"

It isn't possible to get to a perfect system immediately. Start with a Minimum Viable Product, then incrementally fix as you go.

"The developers need to care that operations people are being woken at 3 AM". Problems won't get fixed by someone responsible for the design/architecture of the system unless they are acutely aware of the operational impact. Sometimes this requires a bit of the pain to be pushed back their way to make them care.

Untested Infrastructure

We "should not be" deploying anything to production if it hasn't been tested.

Although I (personally) would argue there's a limit to how much time should be spent on planning & testing.

Ensure there's a production readiness checklist.

Configure with code

Ensure configuration is programmatic and repeatable, with APIs for ongoing config/management.
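
A hedged sketch of what "configure with code" can look like - desired state declared in code and pushed idempotently through an API, so re-running always converges to the same result (the endpoint, payload shape and token handling here are hypothetical):

    import requests

    # Desired state lives in version control, not in someone's head.
    DESIRED_POOL = {
        "name": "web-frontend",
        "min_instances": 3,
        "max_instances": 10,
        "health_check": "/healthz",
    }

    def apply_config(api_base, pool, token):
        """Idempotently PUT the desired state; safe to re-run any time."""
        resp = requests.put(
            f"{api_base}/v1/pools/{pool['name']}",    # hypothetical endpoint
            json=pool,
            headers={"Authorization": f"Bearer {token}"},
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json()

    # apply_config("https://infra.example.internal", DESIRED_POOL, token="...")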

Chaos Engineering

Chaos is not introduced to cause problems, it is done to reveal them

Stage is like Prod

Ensure staging servers are identical to production in every way possible.

Talk - Operations-heavy teams

https://www.usenix.org/conference/lisa18/presentation/kehoe

Presenter: Michael Kehoe (SRE at LinkedIn)

Presenter: Todd Palino (SRE at LinkedIn)

"Code Yellow"

ITOPS Problem:

  • Backlog of work
  • Staff shortage and turnover

They took some SREs out of BAU work so they could focus on identifying and fixing these problems - largely removing complexity, making infrastructure reliable and ensuring it was well documented.

Kafka Problem:

Exponentially growing messages per day:

  • 5 years to reach 1 trillion messages per day
  • 2 years to increase to 2 trillion
  • 1 year to 3 trillion
  • 6 months to 5 trillion

Problem:

  • Multi Tenant
  • no resource controls
  • unclear resource ownership
  • Ad-hoc capacity planning
  • Sudden 100% increase in traffic

Alert fatigue

  • Alerts every 3 minutes
  • No time for proactive work
  • Most alerts non-actionable

Solution:

Used "Code Yellow"

  • Security team helped
  • Dev team fixed some of the problems
  • SREs worked on the non-actionable alerts

A Code Yellow is when a team has a problem and needs help, aiming to spend up to 3 months working through a well-defined problem:

  • Problem statement
  • Criteria for getting out of code yellow
  • Resource acquisition
  • Planning
  • Communication

Problem Statement

  • Admit there is a problem
  • Measure and understand the problem
  • Determine the underlying causes which need to be fixed

Criteria for success

  • SMART Goals
  • Concrete success criteria
  • Keep the Code Yellow open until it is solved

Resource Acquisition

  • Ask other teams for help
  • Use a project manager
  • Set an exit date for resources

Planning

  • Plan short-term work
  • Plan longer-term projects
  • Prioritise anything which will reduce labour (toil) or address the root cause

Communication

  • Communicate problem statement and exit criteria
  • Send regular project updates (via Project Manager)
  • Ensure stakeholders are aware of delays as early as possible

Summary

  • Measure toil / overhead (costs)
  • Prioritisation (something actually needs to be de-prioritised)
  • Communicate with partners and teams

How do we prevent code yellows in the future?

Build a data feed (dashboards / metrics) allowing someone outside of the SRE team (i.e. not in the weeds) to look more holistically at the recurring problems and identify issues earlier, before burnout sets in and large amounts of time are wasted on the problem.

Talk - Perl

https://www.usenix.org/conference/lisa18/presentation/holloway

Speaker: Ruth Holloway (Works at cPanel)

Similarities to Python

  • Perl's use is roughly equivalent to Python's import
  • Both have a similar function structure (though the syntax is different)

Tutorial repo here:

https://github.com/GeekRuthie/Perl_sysadmin

This repo should end up being updated periodically - git pull every now and then for updated examples

https://www.cpan.org/scripts/UNIX/System_administration/index.html
Some further examples here.

Perl hints & tips

use strict;

  • Will ensure variables behave correctly

use warnings;

  • (will tell you if you're doing something silly)

chomp - strips the trailing newline (input record separator) off a string

Whitespace doesn't matter: even if you insert a line break, Perl treats the statement as continuing until you end it with a semicolon.

Postfix expressions (i.e. "unless" at the end of a statement) are best used only for single-line statements. For a block of code, use an "if not" at the beginning instead, otherwise it will be difficult to read.