I attended the USENIX LISA (Large Installation Systems Administration) conference in Nashville, Tennessee from 29th to 31st October 2018. Aside from heaps of awesome live country music, some delicious southern food and many hours at HeadQuarters Beercade playing pinball & arcade games, below are some notes from the conference.
https://www.usenix.org/conference/lisa18/presentation/masters
Speaker: Jon Masters (Red Hat)
2018 - everyone has been "super excited" about patching their systems every 10 minutes. Meltdown, Spectre, Heartbleed, Bitlocker, etc.
How did we get to this security mess?
Hardware and software devs generally don't talk to each other a great deal.
Trends are moving away from this separation - hardware and software are becoming more interlinked.
An architecture is a set of ISA specifications which define how similar systems can talk to each other, e.g. PA-RISC, x86, SPARC. It governs the instruction set and allows code to be compiled against it.
Within an architecture there are various software layers:
Apps
Operating Systems
Hypervisors etc
Applications are targeting the ISA specification. Traditionally, software was written to manage physical memory mappings etc. Nowadays a memory manager handles this abstraction layer and the applications no longer care. Operating systems are designed to provide software with the illusion that it has its own dedicated memory space, even though it is abstracted.
Processor chips contain caches, cores etc. and are paired with DDR memory. Caching is all about bringing the data you need closer to where you need it. Data comes into the cache all the time and gets replaced all the time when the core is working on something else.
Microarchitecture refers to a specific implementation of an architecture. e.g. within x86 there is Skylake.
In-order microarchitectures - designed with assumptions about sequential processing order to maximise performance. They generally stall when waiting for something, resulting in high latency.
Out-of-order microarchitectures can determine where there is no dependency between certain instructions and separate them. The application cannot tell the difference, as the results are returned to the application in order. It does this by taking the instructions, loading them into a re-order buffer and determining the dependencies. It then executes instructions as soon as their dependencies are satisfied, instead of waiting sequentially. This allows for better throughput and lower latency.
In order to increase efficiency, some predictions are made as to what might be returned. Instead of sitting around waiting for a value, the program can run ahead. This requires keeping track of all the guesses/predictions made, and allowing an 'undo' if a prediction was incorrect. Speculative execution adds an extra column to the re-order buffer and uses spare processing capacity to guess a result and keep executing under an assumption (but can roll back if the assumption is wrong). Minimal loss, as the processor would otherwise be idle waiting for the dependency.
If correct - great, a massive performance increase.
If incorrect, there is a small performance hit due to flushing the incorrect resulting calculations.
Branch predictor uses history buffer to get better at guessing which way a branch will go.
Indirect prediction uses less logic to "just do stuff" with spare capacity.
Applications have a view that they have memory to themselves.
There is also a kernel memory space which generally sits higher (more important) than app memory space.
Virtual Memory Manager maps memory addresses to physical memory. There's a cache (TLB) in the processor to make this faster.
Virtual memory lookups require even more lookup stages
In the case of a hypervisor, the steps are doubled (as the requests are passed from the guest OS to the hypervisor)
Lots of mechanisms exist for making this more efficient.
Side channels: any way we can monitor something and infer what it is doing.
E.g. taking a bank card and measuring the electrical voltages, then inferring what the card is doing from those voltages.
For caches, I can time how long a fetch takes and determine whether it was in the cache or not.
There are ways we can pull things into caches intentionally too.
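As a toy illustration of that timing trick (my sketch, not from the talk, and not a real CPU cache attack - a Perl hash stands in for the cache and a sleep stands in for a slow memory fetch), you can tell whether a value was already cached purely from how long the lookup takes:

    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday tv_interval usleep);

    my %cache;    # toy stand-in for a hardware cache

    sub fetch {
        my ($key) = @_;
        return $cache{$key} if exists $cache{$key};    # "cache hit": fast
        usleep(50_000);                                # "memory fetch": slow
        return $cache{$key} = "value-for-$key";
    }

    fetch('secret');    # some other code path has already pulled this in

    for my $key ('secret', 'unrelated') {
        my $t0 = [gettimeofday];
        fetch($key);
        my $elapsed = tv_interval($t0);
        printf "%-10s %s (%.3fs)\n", $key,
            $elapsed < 0.01 ? 'was already cached' : 'had to be fetched',
            $elapsed;
    }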
Vendor responses are on a specific timeline. Limited time to create, test and prepare to deploy mitigations. Have to make a lot of materials for this etc.
Mitigations use a combination of interfaces the processors provide: microcode, millicode and software.
Branch Predictors are shared between applications
When an out-of-order set of instructions is performed - because the processor is speculating and running ahead - I can briefly access the cache. By doing a cache analysis, I can infer the content of the cache.
Fix Meltdown:
prevent data from being there
Page Table Isolation - keep kernel memory unmapped (unreachable) while userland code is running. A slowdown, but better security
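As a practical aside (my addition, not from the talk): on Linux kernels from 4.15 onwards you can check which of these mitigations are active via sysfs. A minimal Perl sketch, assuming the standard /sys/devices/system/cpu/vulnerabilities directory is present:

    use strict;
    use warnings;

    # Print the kernel's reported status for each known CPU vulnerability.
    my $dir = '/sys/devices/system/cpu/vulnerabilities';
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    for my $vuln (sort grep { !/^\./ } readdir $dh) {
        open(my $fh, '<', "$dir/$vuln") or next;
        chomp(my $status = <$fh>);
        close $fh;
        print "$vuln: $status\n";    # e.g. "meltdown: Mitigation: PTI"
    }
    closedir $dh;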
Spectre - what is it?
Abuses speculation and allows reading further than I should be able to.
Mistrain the branch predictors (resulting in the desired results being made available in the shared speculative cache)
Fix Spectre v1
Stop speculation
Safely force the speculation to a safe value
When switching from one app to another, a malicious application could infer the previous application's cached floating point values.
How to change moving forward
Need to change how the hardware is built
Need to change how the hardware/software communities engage with each other (better collaboration)
https://www.usenix.org/conference/lisa18/presentation/mon-keynote-3
Speaker: Tameika Reed (Founder of Women In Linux)
How do you get into Linux and stay in it, or pivot and do something else?
Sysadmin skills - even if the landscape has changed - are still valid past, present and future:
Problem solving / analysis
Virtualisation
Cloud
Automation
Performance / Tuning
Testing
Security (software / network / physical / operational)
Scripting
Communication
Networking
What skillsets did I start with (on day 1, in 1998)? What were those skillsets 3 months later, a year later, 3 years later, 20 years later? It's important that these skills evolve. You don't necessarily need to know where to go, but you need to be open to learning new things ALWAYS.
Plenty of past skillsets which still apply - NFS, SCSI, TFTP, PXE, etc.
Side note - SELinux and firewalld seem pretty big/common here (lots of people using them).
Understanding your customers
A tier 1 / tier 2 admin (e.g. if you're an MSP)
Internal
Business external customers
Are you doing everything (network, hardware, backup, Linux, Windows, DB) or are you specialised?
An Infrastructure and Automation Engineer moves away from working on these other platforms directly, instead embracing APIs to instruct them. This includes:
Virtualisation
Monitoring
Backup
Documentation
Automation
CI/CD
Security
Modern Sysadmins need to become familiar with System Architecture design (from a 30,000 foot view).
Planning / deployment
Agile / ITIL
Security, backup, tech roadmap etc
Vendor engagement
Onsite or offsite or cloud
Strong coding background
Understanding SLA/SLO/etc
Understand CAP Theorem
Incident Management
Postmortems
Distributed Systems (horizontal / vertical)
Simulating Workloads
Testing hardware conditions
Identify performance and availability issues
Collaboration with all stakeholders
Understanding microservices
Analytics, visualisations of data
Testing / CICD
The Netflix blog has some good content on chaos engineering.
Being able to have a digital ID attached to where something has been, who worked on it, the state before & after, and have an immutable record of it. This is logging / auditing.
The example given was car servicing: someone buys a used car, gets a mate to change the oil, the mate forgets to put a screw back on, and the owner then tries to blame the car dealer.
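A minimal sketch of the immutable-record idea (my illustration, not something shown in the keynote): chain each service record to the hash of the previous one, so altering any earlier entry changes its hash and breaks every later link.

    use strict;
    use warnings;
    use Digest::SHA qw(sha256_hex);

    my $prev = '0' x 64;    # "genesis" hash for the first record
    my @ledger;

    sub append_record {
        my ($who, $what) = @_;
        my $entry = join('|', time(), $who, $what, $prev);
        my $hash  = sha256_hex($entry);
        push @ledger, { entry => $entry, hash => $hash };
        $prev = $hash;      # the next record will be chained to this one
        return $hash;
    }

    append_record('dealer', 'pre-sale service: oil change, sump plug torqued');
    append_record('owner',  'mate changed the oil');
    # Tampering with the first entry would invalidate the chain from then on.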
HPC
Cryptography (e.g. using a single photon as a private key - detecting eavesdropping if the photon does not reach the destination)
Qubit (Quantum Entanglement)
DevSecOps
IOT
Gaming
Automotive Grade Linux (vehicle automation)
The skillsets from the past are still applicable but have evolved
Look at problems from 30,000 ft view
Read the documentation
Try the opensource version of a product
Don't need to work at a big company to get good experience
Keep an eye on market trends (webinars, conferences, blogs, magazines, tech news etc)
https://www.usenix.org/conference/lisa18/presentation/wilkinson
Speaker: Jamie Wilkinson (Google Australia SRE)
There's a lot of anxiety around on-call. Lots of the same things repeatedly. Lots of interruptions, pager alerts etc.
We go on call to maintain reliability of services.
The brain should be used for doing things which haven't been done before (i.e. not solved before) rather than repeated simple faults.
Paging should be on what users care about, not on what is broken.
A rapid rate of change means the on-call workload for a system continually grows, often unrelated to the system's size.
We are trying to maintain the system being monitored AS WELL AS the monitoring platform.
At Google, they are capping on-call work at 50% of time.
Paging on-call is prone to generating too much noise. Paging should focus more on pre-defined risks (e.g. "replication has fallen behind by X amount" or "spare disks in the array have dropped below Y").
This is a matter of perspective. A Linux sysadmin symptom will probably look different to a user symptom.
For instance, if a front-end web node drops out, the user may observe higher latency, but the sysadmin will not see that latency - instead the sysadmin will look at tech symptoms (ping to a node, logs, etc.).
The acceptable level of errors, availability loss, performance degradation etc.
Sometimes the budget is used by external factors (natural disasters, bugs, etc). Other times it is used by accidental impact (user/administrator error) or scheduled maintenance.
SLI - Service Level Indicator
Measurable KPI
SLO - Service Level Objective
A goal
SLA - Service Level agreement
User expectation/agreement
Negotiate with users
Look at historical events
Design - look at risks
When in doubt - the SLO is the status-quo (i.e. if you don't know it yet, the SLO is your current service level)
A symptom is anything that can be measured by an SLO
A symptom based alert can be programmed against the SLO
SLOs should be defined in terms of requests (i.e. user-based) instead of time. i.e. instead of x hours uptime, should be y% valid requests successfully fulfilled.
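A sketch with made-up numbers (my example): once the SLO is expressed in request terms, the error budget becomes a concrete count of failures allowed per window.

    use strict;
    use warnings;

    my $slo    = 0.999;        # 99.9% of valid requests should succeed
    my $total  = 2_000_000;    # requests served this month (hypothetical)
    my $failed = 1_300;        # requests that failed

    my $budget = (1 - $slo) * $total;            # allowed failures
    my $sli    = ($total - $failed) / $total;    # measured success ratio

    printf "SLI: %.4f%% of requests succeeded\n", 100 * $sli;
    printf "Error budget: %d failures allowed, %d used (%.0f%%)\n",
        $budget, $failed, 100 * $failed / $budget;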
As close to the user as possible. A load balancer is a good place for a web service measurement; that way you're not measuring transactions per server etc.
Map out over time, scaling with size, whether the error budget is likely to be exceeded in the longer term. Look at long-timescale data for long-term estimates.
Sometimes errors will ebb and flow, so you shouldn't necessarily alert for temporary spikes. Instead, determine whether the current error rate will significantly exceed the error budget in the long term.
Sometimes an SLO can be breached in short term (e.g. in the last 10 seconds, 10% of queries failed) but long term is OK (in the past hour, 0.0001% of queries failed).
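A sketch of that logic (my illustration - real burn-rate alerting is more nuanced): only page when both the short and the long window are burning faster than the budget allows, so a brief spike on its own doesn't wake anyone.

    use strict;
    use warnings;

    my $budget_rate = 0.001;    # 99.9% SLO => 0.1% of requests may fail

    # Hypothetical measured error ratios over two windows
    my $short_window_errors = 0.10;        # last 10 seconds: 10% failed
    my $long_window_errors  = 0.000001;    # last hour: 0.0001% failed

    if (   $short_window_errors > $budget_rate
        && $long_window_errors  > $budget_rate) {
        print "PAGE: sustained burn, error budget at risk\n";
    }
    else {
        print "No page: short-term spike only, budget still healthy\n";
    }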
It is great we now have SLOs defined, but how do we actually know what's going on under the hood? With distributed systems there are network boundaries, process boundaries etc. to digest and understand.
Logs (pre-formatted events)
Metrics (prefiltered and preaggregated events)
Traces (events in a tree structure)
Exceptions / Stack Traces (mass extinction events)
All new features and design changes should be done with alerting in mind (just as unit testing etc. is included, monitoring changes should also be considered).
As new ways of monitoring are devised, don't be afraid to clean up the less-useful ones.
Consider how long it takes to understand the root cause of an alert. In Google's case, they measured this at 6 hours for a particular team, meaning a 12 hour shift should only result in 2 pages (otherwise breaching this SLO)
Non-technical reason ML shouldn't be involved in configuring SLOs - people want to know why they are being disturbed with a page. If this is hidden behind machine learning, they will lose respect for it and stop trusting it.
If pagers are going to wake us up, the issue needs to either have an immediate impact on operations or present a significant threat to the next scheduled business operations (e.g. the following day).
https://www.usenix.org/conference/lisa18/presentation/hahn
Speaker: Dave Hahn (Netflix)
hundreds of billions of events per day
tens of billions of requests per day
hundreds of millions of hours of entertainment per day
10s of millions of active devices connected to netflix
millions of containers in Netflix
hundreds of thousands of instances
thousands of production environment changes per day
10s of terabits of data per day
The moment of truth is when someone has an opportunity to be entertained and chooses to connect to Netflix. When someone sees the "cannot connect" error on Netflix, that moment of truth is lost and they do something else.
When netflix shifted from data centers into the cloud, they decided not to "lift and shift", but they decided to completely re-architect their environment. Firstly, they assume that every instance will disappear. Therefore, the inevitable and unexpected loss of one instance should not be noticeable to a customer. Chaos Monkey validates this.
designing for 100% success is easy
Designing for 100% failure is easy
Designing for grey areas is difficult (i.e. occasional failures)
Introduce X ms artificial latency to y% of requests.
They tried increasing latency to 1ms, 50ms, then 250ms. It appeared they were resilient to these increases.
At 500ms, customer requests dropped significantly - a huge impact.
Dropping the % of requests impacted back to 0% didn't fix it.
Dropping latency back to 0ms also didn't fix it.
It turned out the software had 'learnt' to cater for the increased latency; they were in the middle of changes and had infected an entire service (not just a small portion).
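A minimal sketch of the kind of latency injection described above (my illustration, not Netflix's actual tooling): delay a configurable percentage of requests by a fixed artificial amount.

    use strict;
    use warnings;
    use Time::HiRes qw(usleep);

    my $impacted_pct = 5;      # percentage of requests to slow down
    my $added_ms     = 250;    # artificial latency to inject

    sub maybe_inject_latency {
        # Delay a random subset of requests by the configured amount.
        usleep($added_ms * 1000) if rand(100) < $impacted_pct;
    }

    sub handle_request {
        maybe_inject_latency();
        # ... normal request processing would happen here ...
        return 'OK';
    }

    handle_request() for 1 .. 10;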
App behaviour
Blast Radius
Consistency
It took a while to regain customer engagement, much longer to recover than designed.
The increased complexity of these environments has made it difficult to keep up with failure scenarios.
Prevention is important, but don't overindex on past failures. Sometimes failures are OK - often there's already something in place (whether retries, etc) at a different layer. A specific failure might require hundreds of things to line up in a particular manner.
Don't overindex on future failures. Sometimes we over-engineer for future failures, but we don't actually understand what the problems will look like and we miss out on opportunities already in front of us.
It needs to be a conscious choice.
New feature development needs to incorporate resilience in line with consumer requirements.
Codify good patterns. Perhaps a shared library etc for something one team finds what works well. The learnings (and pain) one team went through should be usable for other teams.
Invest in further testing to break things intentionally.
Build any system expecting it to fail. "when" not "if" a failure will occur.
Sometimes planning quick recovery is a better use of time than designing a complex set of preventions.
Graceful degradation is also worth considering: are there ways less critical components can be disabled whilst the critical things remain operational?
Short incidents
Small number of consumers impacted
Unique failures (don't keep repeating the same ones) - although sometimes recovery is easier. Also ensure you can identify uniqueness quickly
Ensure incidents are valuable. There are significant costs associated with outages, so we need to get as much value as possible from the incidents.
There are well-defined experts in smaller components. For incident management, create a team of failure experts (Core SREs who can provide advice on how to respond to incidents). The Core SREs aren't necessarily deep experts, but can engage the right experts as needed.
Set expectations and provide training
right equipment
understand metrics, logs and dashboards
know common things
reach out to the rest of the organisation, designing the incident management workflows and educating them on how you manage the incidents
Understand how different parts of the business are impacted (sales / legal / finance / developers / service desk / etc)
Separate engineering teams are involved, so it's important to have a central coordinator. The coordinator shouldn't be doing any in-depth engineering.
Prepare early - train the coordinator how to be effective during an incident.
Coordination of communication is important.
Get the right message out there, and keep it consistent
Ensure not too many people are involved (mixed messages, noise etc)
Come back after the incident to understand why the fault occurred, what was effective during the incident, why were you successful in resolving the incident, what could be improved?
https://www.usenix.org/conference/lisa18/presentation/mangot
Speaker: Dave Mangot, Lead SRE at SolarWinds
This talk looked at burnout in IT - ensuring people aren't burnt out by pages/escalations from the monitoring system.
"Crawl --> Walk --> Run"
It isn't possible to get to a perfect system immediately. Start with Minimum Viable Product then incrementally fix as you go.
"The developers need to care that operations people are being woken at 3 AM". Problems won't get fixed by someone responsible for the design/architecture of the system unless they are acutely aware of the operational impact. Sometimes this requires a bit of the pain to be pushed back their way to make them care.
We "should not be" deploying anything to production if it hasn't been tested.
Although I (personally) would argue there's a limit to how much time should be spent on planning & testing.
Ensure there's a production readiness checklist.
Ensure programmatic and repeatable configuration. APIs for ongoing config/management.
Chaos is not introduced to cause problems, it is done to reveal them
Ensure staging servers are identical to production in every way possible.
https://www.usenix.org/conference/lisa18/presentation/kehoe
Presenter: Michael Kehoe (SRE at LinkedIn)
Presenter: Todd Palino (SRE at LinkedIn)
"Code Yellow"
Backlog of work
Staff shortage and turnover
Took some SREs out of BAU so they could focus on identifying and fixing these problems - largely removing complexity, making infrastructure reliable and ensuring it was well documented.
Exponentially growing messages per day.
5 years to get to 1 Trillion messages per day
2 years to increase to 2 Trillion
1 year to 3 Trillion
6 months to 5 trillion
Problem:
Multi Tenant
no resource controls
unclear resource ownership
Ad-hoc capacity planning
Sudden 100% increase in traffic
Alert fatigue
Alerts every 3 minutes
No time for proactive work
Most alerts non-actionable
Solution:
Used "Code Yellow"
Security team helped
Dev team fixed some of the problems
SREs worked on the non-actionable alerts
Code yellow is when a team has a problem and needs help, aiming for up to 3 months to work through a well-defined problem
Problem statement
Criteria for getting out of code yellow
Resource acquisition
Planning
Communication
Admit there is a problem
Measure & Understand the problem
Determine the underlying causes which need to be fixed
SMART Goals
Concrete success criteria
Keep the Code Yellow open until it is solved
Ask other teams for help
Use a project manager
Set an exit date for resources
Plan short-term work
Plan longer-term projects
Prioritise anything which will reduce labour (toil) or will address root cause
Communicate problem statement and exit criteria
Send regular project updates (via Project Manager)
Ensure stakeholders are aware of delays as early as possible
Measure toil / overhead (costs)
Prioritisation (something actually needs to be de-prioritised)
Communicate with partners and teams
Build a data feed (dashboard / metrics) allowing someone outside the SRE team (i.e. not in the weeds) to look more holistically at the regular problems and identify issues earlier, before burnout sets in and large amounts of time are wasted on the problem.
https://www.usenix.org/conference/lisa18/presentation/holloway
Speaker: Ruth Holloway (Works at cPanel)
use ~= import (Perl's "use" is roughly the equivalent of "import" in other languages)
Both have a similar function structure (though the syntax differs).
https://github.com/GeekRuthie/Perl_sysadmin
This repo should end up being updated periodically - git pull every now and then for updated examples
https://www.cpan.org/scripts/UNIX/System_administration/index.html
Some further examples here.
use strict;
Forces you to declare variables (with my), catching typos and scoping mistakes
use warnings;
(will tell you if you're doing something silly)
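A minimal boilerplate example (my illustration) showing both pragmas in use:

    #!/usr/bin/perl
    use strict;      # variables must be declared with my, so typos like
                     # $hostnmae become compile-time errors, not silent bugs
    use warnings;    # warns about suspicious constructs (undef in strings, etc.)

    my $hostname = 'web01';
    print "Checking $hostname\n";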
chomp - strips the trailing newline off a string (e.g. a line read from STDIN)
Whitespace doesn't matter: even if you insert a line break, Perl treats the statement as continuing until it reaches a semicolon.
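A small example of both points (my illustration) - chomp on a line read from STDIN, and one statement spread across several lines until its semicolon:

    use strict;
    use warnings;

    print "Which host should I check? ";
    my $host = <STDIN>;
    chomp $host;    # remove the trailing newline from the input

    # One statement spread over three lines; Perl reads on until the semicolon.
    my $message = 'Running checks against '
                . $host
                . " now...\n";
    print $message;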
Postfix expressions (i.e. "unless" at the end) are best used only for single-line statements. For a block of code, put an "if not" condition at the beginning instead, otherwise it becomes difficult to read.
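For example (my illustration, using a made-up disk-space check): the postfix form reads fine on a one-liner, but once the body grows, putting the condition up front as a block is clearer.

    use strict;
    use warnings;

    my $disk_free_pct = 12;

    # Postfix form: fine for a single short statement.
    warn "Low disk space!\n" unless $disk_free_pct > 20;

    # Block form: clearer once the body is more than one line.
    if (not $disk_free_pct > 20) {
        warn "Low disk space!\n";
        # ... trigger a cleanup job, notify on-call, etc.
    }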