I attended the USENIX LISA (Large Installation Systems Administration) conference in Nashville, Tennessee from 29th to 31st October 2018. Aside from heaps of awesome live country music, some delicious southern food and many hours at HeadQuarters Beercade for pinball & arcade games, below are some notes from the conference.
Speaker: Jon Masters (Red Hat)
2018 - everyone has been "super excited" about patching their systems every 10 minutes. Meltdown, Spectre, Heartbleed, Bitlocker, etc.
How did we get to this security mess?
Hardware and software devs generally don't talk to each other a great deal.
Trends are moving away from this separation - hardware and software are becoming more interlinked.
An architecture is a set of ISA (Instruction Set Architecture) specifications which defines how compatible systems behave, e.g. PA-RISC, x86, SPARC. It governs the instruction set and allows code to be compiled against it.
Within architecture, various software layers
- Operating Systems
- Hypervisors etc
Applications are targeting the ISA specification. Traditionally, software was written to manage physical memory mappings itself. Nowadays a Memory Manager handles this abstraction layer and applications no longer care. Operating Systems are designed to give software the illusion that it has its own dedicated memory space, even though it is abstracted.
System On Chip (SoC)
Processor chips containing cache, cores etc, aligned with DDR memory. Caching is all about bringing the data you need closer to where you need it. Data is constantly coming into the cache, and constantly being replaced when the core works on something else.
Microarchitecture refers to a specific implementation of an architecture. e.g. within x86 there is Skylake.
In-Order microarchitectures - designed with assumptions about sequential processing order to maximise performance. They generally stall when waiting for something, resulting in high latency.
Out-of-Order microarchitectures can determine where there is no dependency between certain instructions and separate them. The application cannot tell the difference, as results are returned to the application in-order. The processor does this by taking the instructions, loading them into a Re-Order Buffer and determining the dependencies. It then executes instructions as soon as their dependencies are satisfied, instead of waiting sequentially. This allows for better throughput and lower latency.
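As a rough illustration of why this helps, here's a toy Python model (my own sketch, not how any real Re-Order Buffer works) that issues each instruction as soon as its dependencies complete, assuming unlimited execution units:

```python
# Toy model of in-order vs out-of-order issue. Latencies and the tiny
# "program" below are invented for illustration.
instructions = {
    "load_a": {"deps": [], "latency": 4},                    # slow memory load
    "load_b": {"deps": [], "latency": 4},
    "add_ab": {"deps": ["load_a", "load_b"], "latency": 1},  # needs both loads
    "inc_c":  {"deps": [], "latency": 1},                    # independent work
}

def in_order_cycles(program):
    """Each instruction waits for the previous one to finish."""
    return sum(instr["latency"] for instr in program.values())

def out_of_order_cycles(program):
    """Issue each instruction as soon as its dependencies have completed."""
    finish = {}
    for name, instr in program.items():
        start = max((finish[d] for d in instr["deps"]), default=0)
        finish[name] = start + instr["latency"]
    return max(finish.values())
```

In this toy example the in-order version takes 10 cycles, while the out-of-order version overlaps the independent work with the loads and finishes in 5 - results are still handed back in program order, so the application can't tell.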
Branch Prediction and Speculation.
In order to increase efficiency, some predictions are made as to what might be returned. Instead of waiting around for a value, the program can run ahead. This requires keeping track of all the guesses/predictions made, then allowing an 'undo' if the prediction was incorrect. Speculative Execution adds an extra column to the Re-Order Buffer and uses spare processing capacity to guess a result and keep executing under that assumption (rolling back if the assumption is wrong). There's minimal loss, as the processor would otherwise be idle waiting for the dependency.
If correct - great, a massive performance increase.
If incorrect, there is a small performance hit due to flushing the incorrect resulting calculations.
Branch predictor uses history buffer to get better at guessing which way a branch will go.
Indirect branch prediction guesses the target address of indirect branches, using less logic to "just do stuff" with spare capacity.
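The history-buffer idea can be sketched with a classic 2-bit saturating counter (a textbook scheme, not a model of any specific CPU):

```python
# Toy 2-bit saturating counter branch predictor. Counter values 0-1
# predict "not taken", 2-3 predict "taken"; each real outcome nudges
# the counter, so history makes the guesses better.
class TwoBitPredictor:
    def __init__(self):
        self.counter = 2  # start in "weakly taken"

    def predict(self):
        return self.counter >= 2  # True means "predict taken"

    def update(self, taken):
        # Saturate at 0 and 3 so one surprise doesn't flip the prediction.
        self.counter = min(3, self.counter + 1) if taken else max(0, self.counter - 1)

# A loop branch: taken 9 times, then not taken once at loop exit.
predictor = TwoBitPredictor()
correct = 0
for actual in [True] * 9 + [False]:
    if predictor.predict() == actual:
        correct += 1
    predictor.update(actual)
# Only the final loop exit is mispredicted: 9 out of 10 correct.
```

The saturation is the point: one surprising outcome (the loop exit) doesn't flip the prediction, so the next run of the loop is still predicted correctly.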
Applications have a view that they have memory to themselves.
There is also a kernel memory space, which generally sits higher (at higher addresses, and more privileged) than the app memory space.
Virtual Memory Manager maps memory addresses to physical memory. There's a cache (TLB) in the processor to make this faster.
Virtual memory lookups require even more stages (walking the page tables).
In the case of a hypervisor, the steps are doubled (as the requests are passed from the guest OS to the hypervisor)
Lots of mechanisms exist for making this more efficient.
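The cost structure above can be sketched with a toy step-count model (the numbers are invented for illustration; real hardware is far more involved):

```python
# Toy cost model of virtual address translation. A TLB hit avoids the
# page walk; under a hypervisor the walk roughly doubles, per the
# talk's description. Step counts are illustrative, not real cycles.
PAGE_WALK_LEVELS = 4  # e.g. a 4-level page table

def translate_cost(tlb, page, hypervisor=False):
    if page in tlb:
        return 1  # TLB hit: the cached translation is used directly
    walk = PAGE_WALK_LEVELS
    if hypervisor:
        walk *= 2   # guest steps are passed through the hypervisor (simplified)
    tlb.add(page)   # cache the translation for next time
    return 1 + walk

tlb = set()
first = translate_cost(tlb, page=0x42)   # miss: 1 + 4 steps
second = translate_cost(tlb, page=0x42)  # hit: 1 step
```

This is why the TLB matters so much - the very first access to a page pays the full walk, and everything after that is a single cached lookup.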
Side channel Attack
Any way we can monitor something and infer what it is doing.
E.g. taking a bank card and measuring the electrical voltages, inferring what the card is doing from those voltages.
For caches, I can time how long a fetch takes and determine whether it was in the cache or not.
There are ways we can pull things into caches intentionally too.
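The same idea can be demonstrated without hardware. Below is a sketch (all names invented) that attacks an early-exit string comparison by counting loop iterations - the same signal a cache-timing attack extracts, just without the noise:

```python
# Side-channel sketch: an early-exit comparison leaks how far it got.
# We count loop iterations instead of measuring wall-clock time; it's
# the same signal a timing attack uses, minus the noise.
def compare(secret, guess, counter):
    for s, g in zip(secret, guess):
        if s != g:
            return False          # early exit leaks the match length
        counter[0] += 1           # one unit of "time" per matched char
    return secret == guess

def recover(secret, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Recover the secret one character at a time via the side channel."""
    known = ""
    for _ in range(len(secret)):
        best, best_time = None, -1
        for c in alphabet:
            counter = [0]
            padding = "?" * (len(secret) - len(known) - 1)
            compare(secret, known + c + padding, counter)
            if counter[0] > best_time:  # "slower" = more leading chars matched
                best, best_time = c, counter[0]
        known += best
    return known
```

The attacker never reads the secret directly - each guess only returns True/False, but the "time" taken reveals one character per round. Constant-time comparison functions exist precisely to close this channel.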
How this translates to attacks
Vendor responses are on a specific timeline. Limited time to create, test and prepare to deploy mitigations, and a lot of materials have to be produced for this.
Fixes use a combination of interfaces processors have provided - microcode, millicode and software.
Branch Predictors are shared between applications
When an out-of-order set of instructions is performed, the processor is speculating and running ahead, so I can briefly access data in the cache that I shouldn't be able to. By doing a cache analysis, I can infer the content of the cache.
- Prevent data from being there
- Page Table Isolation - unmap kernel memory from the userland address space. A slowdown, but allowing better security
Spectre - what is it?
- Abuses speculation and allows reading further than I should be able to.
- Mistrain the branch predictors (resulting in the desired results being made available in the shared speculative cache)
Fix Spectre v1
- Stop speculation
- Force the speculation to a safe value
Lazy Floating Point
When switching from one app to another, floating point state was only saved lazily - a malicious application could infer the previous application's floating point register values.
How to change moving forward
- Need to change how the hardware is built
- Need to change how the hardware/software communities engage with each other (better collaboration)
Keynote 2 - Past, Present and Future of Sys Admin
Speaker: Tameika Reed (Founder of Women In Linux)
How do you get into Linux and stay in it, or pivot & do something else?
Sys Admin skills, even if the landscape has changed, are still valid in the past, present and future:
- Problem solving / analytical thinking
- Performance / Tuning
- Security (software / network / physical / operational)
Past - skills.
What skillsets did I start with (on day 1, 1998)? What were those skillsets 3 months later, a year later, 3 years, 20 years on? It's important that these skills evolve. You don't necessarily need to know where to go, but you need to be open to learning new things ALWAYS.
Plenty of past skillsets which still apply - NFS, SCSI, TFTP, PXE, etc.
Side note - SELinux and firewalld seem pretty big / common here (lots of people using them).
Understanding your customers
A tier 1 / tier 2 admin (e.g. if you're an MSP)
Business external customers
Are you doing everything (network, hardware, backup, Linux, Windows, DB) or are you specialised?
An Infrastructure and Automation Engineer moves away from touching these other platforms directly, but embraces APIs to instruct them.
Modern Sysadmins need to become familiar with System Architecture design (from a 30,000 foot view).
- Planning / deployment
- Agile / ITIL
- Security, backup, tech roadmap etc
- Vendor engagement
- Onsite or offsite or cloud
Site Reliability Engineering
- Strong coding background
- Understanding SLA/SLO/etc
- Understand CAP Theorem
- Incident Management
- Distributed Systems (horizontal / vertical)
Chaos / Intuition Engineer
- Simulating Workloads
- Testing hardware conditions
- Identify performance and availability issues
- Collaboration with all stakeholders
- Understanding microservices
- Analytics, visualisations of data
- Testing / CICD
Netflix blog has some good blog content on chaos theory.
Being able to have a digital ID attached to where something has been, who worked on it, the state before & after, and have an immutable record of it. This is logging / auditing.
The example given was car servicing. Someone buys a used car, gets a mate to change the oil, the mate forgets to put a screw back on, and the buyer then tries to blame the car dealer.
- Cryptography (e.g. using a single photon as a private key - detecting eavesdropping if the photon does not reach the destination)
- Qubit (Quantum Entanglement)
Other thriving areas
- Automotive Grade Linux (vehicle automation)
- The skillsets from the past are still applicable but have evolved
- Look at problems from 30,000 ft view
- Read the documentation
- Try the opensource version of a product
- Don't need to work at a big company to get good experience
- Keep an eye on market trends (webinars, conferences, blogs, magazines, tech news etc)
Talk - SLO Burn
Speaker: Jamie Wilkinson (Google Australia SRE)
There's a lot of anxiety around on-call. Lots of the same things repeatedly. Lots of interruptions, pager alerts etc.
We go on call to maintain reliability of services.
The brain should be used for doing things which haven't been done before (i.e. not solved before) rather than repeated simple faults.
Paging should be on what users care about, not on what is broken.
Rapid rate of change on a system means the on-call workload continually grows for a system, often not related to the size of a system.
We are trying to maintain the system being monitored AS WELL AS the monitoring platform.
At google - they are capping on-call work at 50% of time.
Paging on call is prone to generating too much noise. Paging should focus more on pre-defined risks (e.g. "replication has fallen behind by X amount" or "spare disks in the array have dropped below Y").
What is a symptom?
This is a matter of perspective. A Linux sysadmin symptom will probably look different to a user symptom.
For instance, if a front-end web node drops out, the user may observe higher latency, but the sysadmin will not see the latency directly - instead they will look at tech symptoms (ping to a node, logs, etc).
Engineering Tolerance - Error Budgets
The acceptable level of errors, availability loss, performance degradation etc.
Sometimes the budget is consumed by external factors (natural disasters, bugs, etc). Other times it is consumed by accidental impact (user/administrator error) or scheduled maintenance.
SLI - Service Level Indicator
- Measurable KPI
SLO - Service Level Objective
- A goal
SLA - Service Level agreement
- User expectation/agreement
How do we set an SLO?
- Negotiate with users
- Look at historical events
- Design - look at risks
- When in doubt - the SLO is the status-quo (i.e. if you don't know it yet, the SLO is your current service level)
A symptom is anything that can be measured by an SLO
A symptom based alert can be programmed against the SLO
SLOs should be defined in terms of requests (i.e. user-based) instead of time. i.e. instead of x hours uptime, should be y% valid requests successfully fulfilled.
Where to measure?
As close to the user as possible. A load balancer is a good place for a web service measurement - that way you're not measuring transactions per server etc.
Map out over time, scaling with size, whether the error budget is likely to be exceeded in the longer term. Look at time-series data for long term estimates.
Sometimes errors will ebb and flow, so you shouldn't necessarily alert for temporary spikes. Instead, determine whether the current error rate will significantly exceed the error budget in the long term.
Paging vs SLO
Sometimes an SLO can be breached in short term (e.g. in the last 10 seconds, 10% of queries failed) but long term is OK (in the past hour, 0.0001% of queries failed).
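A sketch of the maths behind that distinction (the SLO value and window here are invented for illustration): with a 99.9% success SLO the error budget is 0.1% of requests, and the "burn rate" is how many times faster than that the current error rate is consuming it:

```python
# Burn-rate sketch for a request-based SLO (numbers are illustrative).
SLO = 0.999                 # target: 99.9% of requests succeed
ERROR_BUDGET = 1 - SLO      # so 0.1% of requests may fail

def burn_rate(failed, total):
    """How many times faster than 'exactly on budget' errors are arriving."""
    return (failed / total) / ERROR_BUDGET

def days_to_exhaustion(burn, window_days=30):
    """At this burn rate, when does a 30-day window's budget run out?"""
    return window_days / burn

# 10% of queries failing in a short spike burns the budget ~100x too fast...
spike = burn_rate(failed=100, total=1_000)
# ...but 0.0001% failing over the long term barely touches it.
steady = burn_rate(failed=1, total=1_000_000)
```

This is why alerting on burn rate over both a short and a long window works well: a brief spike need not page if the long-window rate shows the budget is in no real danger.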
It is great we now have SLOs defined, but how do we actually know what's going on under the hood. With distributed system, network boundaries, process boundaries etc to digest & understand.
Pillars of observability
- logs (pre-formatted events)
- Metrics (prefiltered and preaggregated events)
- Traces (events in a tree structure)
- Exceptions / Stack Traces (mass extinction events)
Changes to system design
All new features and changes to a design should be done with alerting in mind (just as unit testing is included, monitoring changes should also be considered).
As new ways of monitoring are devised, don't be afraid to clean up the less-useful ones.
SLOs of alerts per shift
Consider how long it takes to understand the root cause of an alert. In Google's case, they measured this at 6 hours for a particular team, meaning a 12 hour shift should only result in 2 pages (otherwise breaching this SLO)
Non-technical reason ML shouldn't be involved in configuring SLOs - people want to know why they are being disturbed with a page. If this is hidden behind machine learning, they will lose respect for it and stop trusting it.
If pagers are going to wake us up, the issue needs to either have an immediate impact on operations or present a significant threat to the next scheduled business operations (e.g. the following day).
Talk - Netflix Incident Management
Speaker: Dave Hahn (Netflix)
Some NetFlix Statistics
hundreds of billions of events per day
tens of billions of requests per day
hundreds of millions of hours of entertainment per day
10s of millions of active devices connected to netflix
millions of containers in Netflix
hundreds of thousands of instances
thousands of production environment changes per day
10s of terabits of data per day
The "moment of truth" is when someone with an opportunity to be entertained chooses to connect to Netflix. When someone sees the "cannot connect" error on Netflix, that moment of truth is lost and they do something else.
When Netflix shifted from data centers into the cloud, they decided not to "lift and shift", but to completely re-architect their environment. Firstly, they assume that every instance will disappear. Therefore, the inevitable and unexpected loss of one instance should not be noticeable to a customer. Chaos Monkey validates this.
Designing for failure
designing for 100% success is easy
Designing for 100% failure is easy
Designing for grey areas is difficult (i.e. occasional failures)
Introduce X ms artificial latency to y% of requests.
They tried increasing latency to 1ms, 50ms, 250ms. It appeared they were resilient to these increases.
At 500ms, customer requests dropped significantly - a huge impact.
Dropping the percentage of requests impacted back to 0% didn't fix it.
Dropping latency back to 0ms also didn't fix it.
It turns out the software had 'learnt' to cater for the increased latency; they were in the middle of changes and had affected an entire service (not just a small portion).
- App behaviour
- Blast Radius
It took a while to regain customer engagement, much longer to recover than designed.
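A minimal sketch of that kind of latency injection (a wrapper of my own invention - Netflix's actual tooling is far more sophisticated):

```python
import random
import time

# Chaos-style latency injection sketch: add latency_ms of artificial
# delay to roughly `percent`% of requests before calling the real handler.
def with_latency(handler, latency_ms, percent, rng=random.random):
    def wrapped(request):
        if rng() * 100 < percent:
            time.sleep(latency_ms / 1000.0)   # the injected fault
        return handler(request)
    return wrapped

# Usage: delay 25% of requests by 50ms. A stubbed RNG makes the example
# deterministic; in real use you would keep the default random.random.
rolls = iter([0.10, 0.90, 0.20, 0.50])        # 0.10 and 0.20 fall in the 25%
slow_handler = with_latency(lambda req: "ok:%s" % req,
                            latency_ms=50, percent=25,
                            rng=lambda: next(rolls))
responses = [slow_handler(i) for i in range(4)]  # two of these sleep 50ms
```

The two knobs mirror the talk: latency_ms controls severity, percent controls blast radius - and the Netflix story above shows why ramping both gradually matters.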
Failure at Velocity
The increased complexity of these environments has made it difficult to keep up with failure scenarios.
Prevention is important, but don't overindex on past failures. Sometimes failures are OK - often there's already something in place (whether retries, etc) at a different layer. A specific failure might require hundreds of things to line up in a particular manner.
Don't overindex on future failures. Sometimes we over-engineer for future failures, but we don't actually understand what the problems will look like and we miss out on opportunities already in front of us.
Invest in resilience.
It needs to be a conscious choice.
New feature development needs to incorporate resilience in line with consumer requirements.
Codify good patterns. Perhaps a shared library etc for something one team found works well. The learnings (and pain) one team went through should be usable by other teams.
Invest in further testing to break things intentionally.
Build any system expecting it to fail. "when" not "if" a failure will occur.
Recovery vs Prevention
Sometimes planning quick recovery is a better use of time than designing a complex set of preventions.
Graceful degradation is also worth considering - are there ways less critical components can be disabled whilst the critical things remain operational?
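One way to sketch that idea (the feature names and threshold are made up for illustration):

```python
# Graceful degradation sketch: above an error threshold, keep only the
# critical features serving and shed the rest.
FEATURES = {
    "playback":        {"critical": True},   # the thing customers came for
    "recommendations": {"critical": False},  # nice to have
    "watch_history":   {"critical": False},  # nice to have
}

def enabled_features(error_rate, degrade_above=0.05):
    """Above the error threshold, serve only critical features."""
    if error_rate > degrade_above:
        return {name for name, f in FEATURES.items() if f["critical"]}
    return set(FEATURES)

healthy = enabled_features(error_rate=0.01)   # everything on
degraded = enabled_features(error_rate=0.20)  # only critical paths survive
```

The design choice is deciding criticality up front, in peacetime - during an incident nobody wants to debate which features are safe to shed.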
Incident Management at Netflix Velocity
- Short incidents
- Small number of consumers impacted
- Unique failures (don't keep repeating the same ones) - although sometimes recovery is easier. Also ensure you can identify uniqueness quickly
- Ensure incidents are valuable. There are expensive costs associated with outages, we need to get as much value as possible from the incidents.
- There are well defined experts in smaller components. For incident management, create a team of failure experts (Core SREs who can provide advice on how to respond to incidents). The Core SREs aren't necessarily deep experts, but can engage the right experts as needed.
Set expectations and provide training
- right equipment
- understand metrics, logs and dashboards
- know common things
Education & outreach:
- reach out to the rest of the organisation, designing the incident management workflows and educating them on how you manage the incidents
- Understand how different parts of the business are impacted (sales / legal / finance / developers / service desk / etc)
- Separate engineering teams involved, therefore important to have a central coordinator. The coordinator shouldn't be doing any in-depth engineering.
- Prepare early - train the coordinator how to be effective during an incident.
- Coordination of communication is important.
- Get the right and the same message out there
- Ensure not too many people are involved (mixed messages, noise etc)
- Come back after the incident to understand why the fault occurred, what was effective during the incident, why were you successful in resolving the incident, what could be improved?
Talk - ITOPS - detecting and fixing the "smells"
Speaker: Dave Mangot, Lead SRE at SolarWinds
This talk looked at burnout in IT - ensuring people aren't burnt out by pages/escalations from the monitoring system.
"Crawl --> Walk --> Run"
It isn't possible to get to a perfect system immediately. Start with Minimum Viable Product then incrementally fix as you go.
"The developers need to care that operations people are being woken at 3 AM". Problems won't get fixed by someone responsible for the design/architecture of the system unless they are acutely aware of the operational impact. Sometimes this requires a bit of the pain to be pushed back their way to make them care.
We "should not be" deploying anything to production if it hasn't been tested.
Although I (personally) would argue there's a limit to how much time should be spent on planning & testing.
Ensure there's a production readiness checklist.
Configure with code
Ensure programmatic and repeatable configuration. APIs for ongoing config/management.
Chaos is not introduced to cause problems, it is done to reveal them
Stage is like Prod
Ensure staging servers are identical to production in every way possible.
Talk - Operations-heavy teams
Presenter: Michael Kehoe (SRE at LinkedIn)
Presenter: Todd Palino (SRE at LinkedIn)
- Backlog of work
- Staff shortage and turnover
Some SREs were taken out of BAU so they could focus on identifying and fixing these problems - largely removing complexity, making infrastructure reliable and ensuring it was well documented.
Exponentially growing messages per day.
5 years to get to 1 trillion messages per day
2 years to increase to 2 trillion
1 year to 3 trillion
6 months to 5 trillion
- Multi Tenant
- no resource controls
- unclear resource ownership
- Ad-hoc capacity planning
- Sudden 100% increase in traffic
- Alerts every 3 minutes
- No time for proactive work
- Most alerts non-actionable
Used "Code Yellow"
- Security team helped
- Dev team fixed some of the problems
- SREs worked on the non-actionable alerts
Code yellow is when a team has a problem and needs help, aiming for up to 3 months to work through a well-defined problem
- Problem statement
- Criteria for getting out of code yellow
- Resource acquisition
Admit there is a problem
Measure & Understand the problem
Determine the underlying causes which need to be fixed
Criteria for success
- SMART Goals
- Concrete success criteria
- Keep the Code Yellow open until it is solved
- Ask other teams for help
- Use a project manager
- Set an exit date for resources
- Plan short-term work
- Plan longer-term projects
- Prioritise anything which will reduce labour (toil) or will address root cause
- Communicate problem statement and exit criteria
- Send regular project updates (via Project Manager)
- Ensure stakeholders are aware of delays as early as possible
- Measure toil / overhead (costs)
- Prioritisation (something actually needs to be de-prioritised)
- Communicate with partners and teams
How do we prevent code yellows in the future?
Build a data feed (dashboard / metrics) allowing someone outside of the SRE team (i.e. not in the weeds) to look more holistically at the regular problems and identify issues earlier, before burnout and before large amounts of time are wasted on the problem.
Talk - Perl
Speaker: Ruth Holloway (Works at cPanel)
Similarities to Python
- Perl's "use" ~= Python's "import"
- Function structure in both (syntax is different though)
Tutorial repo here:
This repo should end up being updated periodically - git pull every now and then for updated examples
Some further examples here.
Perl hints & tips
use strict / use warnings
- Will ensure variables behave correctly
- Will tell you if you're doing something silly
chomp - will strip off trailing newlines
Whitespace doesn't matter - even if you add a line break, Perl will treat the statement as continuing until you use a semicolon.
Postfix expressions (i.e. "unless" at the end) are best used only in single line statements. Use an "if not" at the beginning of a block of code instead, otherwise it will be difficult to read.