This is the fourth day of the USENIX Annual Technical Conference, June 2016, held in Denver, Colorado, USA.
As each presentation was delivered by a single researcher, only the main presenter is referenced in the text below. The links on the Usenix conference page provide more information, as well as copies of the papers submitted for the conference.
Keynote - A Wardrobe For The Emperor
Bryan Cantrill, Joyent (acquired by Samsung)
The keynote was largely about providing practitioners with a platform for making their work known.
Previously (prior to 2004) the only way for a practitioner to publish was through paper submission to forums such as Usenix. Given the low acceptance rate (perhaps 25% of received papers), a perceived bias toward academic writing (i.e. work directly related to university research) and very stringent criteria to conform to, it was generally unlikely for a practitioner's work to be published through these forums.
Zero-cost blogging provides practitioners a platform to publish small things that may be interesting to only a very small number of people.
With the rise of YouTube, a conference becomes a studio audience: more people view the video outside the conference than attend the event itself.
GitHub happened - a profound change: a practitioner doesn't talk about any work they have done without a link to GitHub.
The pace of development, the quality of code, and the inclusion of contributions from far and wide have all significantly increased. Additional tools such as automated lint testing and CI integration have further accelerated the software release cycle.
Conference model flaws
Conferences used to be the key mechanism for publishing work. That model is now flawed - far too slow, among many other problems.
Journals are not the answer either. Publishing needs disruption, not just a slightly different forum.
How about adding a social media aspect (comments, reviews, badges etc) to arXiv?
Conferences should push towards incorporating both academics and practitioners, with papers submitted from each group - a mandate for practical bias in submitted research would ensure it isn't merely academic.
Elfen Scheduling: Fine-Grain Principled Borrowing from Latency-Critical Workloads Using Simultaneous Multithreading
Kathryn S. McKinley, Microsoft Research
(paper prepared by Xi Yang of Australian National University, Canberra)
Making the most of compute resources
Tail latency matters. Google, for instance, observed that a 400ms delay decreased searches per user by 0.59%.
Low utilisation achieves high responsiveness (under 100ms per request) - e.g. up to 140 requests per second on a single-core SMT system.
Given that the workload varies, what about using the spare resources for other tasks (e.g. batch jobs)?
- context switching is expensive
- It can take too long to de-couple the batch workload from the CPU when a request comes in for the latency sensitive workload
When running 1 core with 2 SMT lanes, without any special tuning, responsiveness for the latency-sensitive workload is still too slow.
They then tried round-robin SMT allocation.
Still poor performance.
Principled Borrowing - Nanonap
The batch process borrows idle hardware when the latency sensitive workload is idle.
Nanonap releases the SMT hardware without releasing the SMT context:
- the OS cannot schedule interrupts on the napping lane
- the OS cannot allocate its resources to other threads
They are using Elfen Scheduling to control nanonap.
The nanonap crossover time can be tuned to be less aggressive but more reliable.
The CPU is kept busy with batch workloads (100% utilisation), yet when the latency-sensitive workload is assigned to the SMT lane, the lane is handed back quickly enough that responsiveness remains fast enough under load. Because every cycle is busy, the core never enters a sleep state.
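The borrowing idea can be sketched in a few lines. This is an illustrative simulation (names are mine, not the paper's): the batch thread runs only while the latency-sensitive lane is idle and parks the moment a request arrives; a short sleep stands in for nanonap, which on real hardware releases the SMT lane without releasing the SMT context.

```python
import threading
import time

ls_busy = threading.Event()      # set while a latency-sensitive request runs
batch_iterations = 0

def batch_worker(stop):
    global batch_iterations
    while not stop.is_set():
        if ls_busy.is_set():
            time.sleep(0.0001)   # "nanonap": yield the lane to LS work
            continue
        batch_iterations += 1    # one unit of borrowed batch work

def serve_ls_request(duration):
    ls_busy.set()                # claim the lane for the LS request
    time.sleep(duration)         # handle the latency-sensitive request
    ls_busy.clear()              # lane idle again; batch may resume

stop = threading.Event()
t = threading.Thread(target=batch_worker, args=(stop,))
t.start()
serve_ls_request(0.01)
time.sleep(0.05)                 # batch borrows the idle lane here
stop.set()
t.join()
print(batch_iterations > 0)     # → True: batch work ran only in idle gaps
```

The real scheduler makes the check-and-nap decision in the batch thread itself at fine grain, which is what keeps the hand-back latency small.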
Coherence Stalls or Latency Tolerance: Informed CPU Scheduling for Socket and Core Sharing
Sharanyan Srikanthan, University of Rochester
When splitting two threads:
SMT (same core, same socket): problem - low-level cache sharing
Same socket (different core) level sharing: problem - Last level cache sharing
Inter-socket sharing: problem - other resources (ring bus, memory etc)
This presentation builds on last year's "SAM" presentation with a further set of metrics and analysis.
Design Guidelines for High Performance RDMA Systems
Anuj Kalia - Carnegie Mellon University
This talk looks at clustered environments using RDMA-capable 100Gbps PCIe cards for inter-node transport. RDMA is cheap (and fast), and therefore worth exploiting.
Performance depends on complex low-level factors
The extent to which these factors can affect RDMA performance
How to design an RDMA sequencer
One-sided RDMA does not use a remote CPU core when receiving data.
Whenever a client sends a request to the server, the server returns the next value in the sequence.
Although the sequencer is simple, it is applicable in other areas (e.g. a key value store)
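The sequencer's contract is tiny: each request atomically receives the next counter value, which is exactly the semantics fetch-and-add gives on a remote counter. A minimal sketch (illustrative only, not the paper's code):

```python
import threading

class Sequencer:
    """Each client request atomically receives the next counter value -
    the semantics RDMA fetch-and-add provides on a remote counter."""
    def __init__(self):
        self._lock = threading.Lock()
        self._next = 0

    def fetch_and_add(self, delta=1):
        with self._lock:
            value = self._next   # value handed back to the client
            self._next += delta  # counter advances atomically
            return value

seq = Sequencer()
print([seq.fetch_and_add() for _ in range(4)])  # → [0, 1, 2, 3]
```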
RDMA Tunable characteristics
Which RDMA ops to use:
- Remote CPU bypass (one-sided)
  - fetch and add
  - compare and swap
- Remote CPU involved
  - doorbell batching
  - WQE (work queue element) shrinking
Various combinations will provide desired performance for certain workloads/use-cases.
With fetch and add, the sequencer is very simple.
They measured this: it achieved only 2.2M operations per second, even though the RDMA hardware can go far faster.
How do we resolve this?
- NICs have multiple processing units
  - avoid contention
  - exploit parallelism
- PCI Express messages are expensive
  - reduce CPU-to-NIC messages (MMIO)
  - reduce NIC-to-CPU messages (DMA)
The NIC is roughly 500ns away from the L3 cache over PCI Express.
How to reduce contention?
Use CPU cores.
Instead of bypassing the CPU with one-sided atomics, they use RPCs implemented over RDMA writes.
Going from the original atomics approach to RPC increases throughput from 2.2M to 7M operations per second.
Next - reduce MMIOs with doorbell batching.
Push mode - the CPU pushes messages to the NIC, one MMIO per message, which is CPU intensive.
Pull mode (doorbell batching) - the CPU "rings a doorbell" on the NIC, and the NIC answers by retrieving the queued messages itself via DMA, avoiding the per-message PCI bus round trips between CPU and NIC.
Doorbell batching achieves 16M operations per second - an improvement, but there is still more to go.
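The MMIO saving behind doorbell batching can be captured with a toy cost model (my illustration, not from the paper): pushing N messages costs N MMIO writes, while one doorbell plus NIC-side descriptor pulls costs a single MMIO.

```python
def mmio_writes(n_msgs, doorbell_batching):
    """Count CPU-to-NIC MMIO writes needed to hand n_msgs to the NIC.
    Push mode costs one MMIO per message; with doorbell batching the
    CPU writes a single doorbell and the NIC DMAs the queued
    descriptors itself. (Toy model - real costs also include the DMA
    reads the NIC issues.)"""
    return 1 if doorbell_batching else n_msgs

print(mmio_writes(32, doorbell_batching=False))  # → 32 (push mode)
print(mmio_writes(32, doorbell_batching=True))   # → 1 (one doorbell)
```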
Exploit NIC parallelism with multiple queues.
Increasing from a single NIC queue to 3 parallel queues lifted throughput to 27.4M operations per second.
Increase to 6 CPU cores.
Scaling to 6 cores resulted in 97.2M operations per second - almost the desired 50x increase, but not quite there.
Reduce DMA size - header only.
Move the payload into the 64-bit header and drop the remainder of the datagram.
This increased throughput to 122M operations per second.
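"Header only" works because an 8-byte sequencer value fits inside the 64-bit header field itself, so the NIC never has to DMA a separate payload buffer. A sketch of the packing (the field layout here is illustrative, not the NIC's actual wire format):

```python
import struct

def pack_header(seq_value):
    # The 8-byte counter value travels in the message header itself,
    # eliminating the separate payload DMA.
    return struct.pack("<Q", seq_value)   # 64-bit little-endian field

def unpack_header(header):
    return struct.unpack("<Q", header)[0]

hdr = pack_header(12345)
print(len(hdr), unpack_header(hdr))  # → 8 12345
```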
Balancing CPU and Network in the Cell Distributed B-Tree Store
Chris Mitchell - New York University
The server maintains locality of data and computation
The server will receive a request, will process the request and return the desired data.
Problem: server is CPU-bound
Load spikes result in saturated CPUs
- over-provision servers (wasteful)
- spin up extra servers (slow)
Relaxed locality by processing requests at client
- clients fetch the required state first
- RDMA enables client-side processing.
We don't always want to use client side processing.
Problem: server is NIC bound
If NIC is bottleneck - use server-side processing
- if you have excess server CPU, best to use it.
Selectively relaxing locality improves load balancing: look at the load and decide whether to process on the client or the server side.
The researchers have called their solution "Cell" - a distributed B-Tree Store
Design choice 1
only send 'read' activities to clients
Client and server
- only when the server is under CPU load
server side only
- no balancing
- no distributed locks
Challenge - there's no easy way for client-side RDMA reads to be ordered against server-side updates.
Solution - make the reads/writes atomic via versioning (an atomic version number) on the data.
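The versioning trick can be sketched as an optimistic read: read the version, read the data, re-read the version, and retry if anything changed. This is a simplified single-node illustration of the idea, not Cell's actual B-tree node layout:

```python
class VersionedNode:
    """Writers bump the version before and after each update, so
    readers can detect - and retry around - any concurrent write."""
    def __init__(self, data):
        self.version = 0
        self.data = data

    def write(self, data):
        self.version += 1        # odd version: write in progress
        self.data = data
        self.version += 1        # even version: write complete

    def read(self):
        while True:
            v1 = self.version
            data = self.data     # stands in for the client's RDMA read
            if v1 % 2 == 0 and self.version == v1:
                return data      # version stable and even: consistent read

node = VersionedNode("old")
node.write("new")
print(node.read())  # → new
```

This is what makes the data structure "self-verifying": the client can tell from the version alone whether its raw RDMA read raced with a server-side update.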
Design option 2
Choosing when to use client and server side
- naive: pick the lowest latency
- sub-optimal - need to keep both the NIC and CPUs occupied
- properly weighting operations
- extremely short transient conditions - outliers, moving average
- stale measurements
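One way to handle the outliers and transient conditions listed above is to drive the client/server choice from a moving average of load samples rather than raw readings. A sketch (class name, weighting, and threshold are illustrative, not from the paper):

```python
class LocalitySelector:
    """Smooth raw server-CPU load samples with an exponential moving
    average so that extremely short transient spikes don't flip the
    client-vs-server routing decision."""
    def __init__(self, alpha=0.2, cpu_threshold=0.8):
        self.alpha = alpha
        self.threshold = cpu_threshold
        self.smoothed = 0.0

    def observe(self, cpu_load):
        # EMA: recent samples dominate, but one outlier barely moves it
        self.smoothed = self.alpha * cpu_load + (1 - self.alpha) * self.smoothed

    def use_client_side(self):
        # route reads to clients only when the server CPU looks saturated
        return self.smoothed > self.threshold

sel = LocalitySelector()
sel.observe(1.0)                 # one transient spike...
print(sel.use_client_side())     # → False (outlier smoothed away)
for _ in range(20):
    sel.observe(1.0)             # ...versus sustained high load
print(sel.use_client_side())     # → True
```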
Design option 3
client side locality selector
C++, 16,000 lines of code
Infiniband with TCP connection mode
Is selectively relaxed locality fast? yes
- 170% performance improvement compared to strict server-side only
- 2 cores per NIC saving
Is selectively relaxed locality faster than client-side only or server-side only? Yes.
- server-side only topped out around 150K operations
- client-side only topped out around 400K operations
- using both achieved over 600K operations
Using mixed workloads (insert, update, get, etc), Cell was the winner in all mixed-workload use cases tested.
Does it handle load spikes? Yes.
Latency was significantly lower when the less-utilised client side was used (switchover within 200ms), and this continued for the duration of the 5-second spike.
Tomorrow's data centres will include RDMA capable, ultra low-latency networks
New system architectures:
- selectively relaxed locality for load balancing and CPU efficiency
- Self-verifying data structures make this practical
- locality-relaxation techniques work at scale
An Evolutionary Study of Linux Memory Management for Fun and Profit
Jian Huang - Georgia Tech
Virtual memory has been developed for over 60 years
As more features have been added and new hardware supported, system reliability has become more of a concern.
This study was a statistical look at the optimisation of memory management, with some analysis of why changes were made.
Findings in memory management since 2.6 kernel:
Far too many stats to include in these notes, but some highlights are below. Worth reading through the paper for more details, explanations etc.
- Code changes are highly concentrated around key functions
- 80% of patches touch just 25% of the total source code
- 60% increase in lines of code
- 13 "hot files" out of 90 files (i.e. development has focussed on 13 files)
- Percentage of bug fixes is increasing and percentage of new features is decreasing (now the focus is largely around bug fixes).
- Main bugs focus on fixing errors in
- memory allocation, VMM, Garbage Collection
Getting Back Up - understanding how enterprise data backups fail
George Amvrosiadis - University of Toronto
Backups in the news
123-reg erased customer data - no backups.
Salesforce lost 4 hours of data - backup incomplete.
Business surveys - Backups fail often
27% have lost data due to backup errors
80% have trouble configuring backup software
Collection of data
Partnered with Veritas to collect telemetry data across their customer base
775M jobs from 20,000 installations over 3 years
- 604M data backup jobs
- 105M data management jobs (e.g. copy to another medium)
- 6.3M data recovery
Conclusion from data:
- Jobs fail often (failure rates of 10-30%)
- Early adopters of new versions of backup software are subject to higher numbers of backup failures
- As the release becomes stable, failures level out around 9%
Errors are not very diverse
- 333 distinct error codes (28% of all documented error codes)
- testing insufficient - 59 of these codes only showed up in production (not during testing phase)
- 64% of errors are for 5 error codes
- 2% insufficient backup windows (job can never be scheduled)
- 7% device full (can't back up data)
- 11% No tapes available
- 15% Invalid filesystem block size or max file size breached
- 25% partial backup due to file permissions
Categorised by three causes
- 75% configuration errors - these can be resolved by taking specific steps (changing parameters etc)
- 22% System errors
- 2% non fatal errors
We need better configuration validation tools or self-healing mechanisms
Categorised by job type
- 28% of all management jobs fail
- 9% of all backup jobs fail
- 2% of all recovery jobs fail
Categorised by Job sizes
- Larger jobs are more likely to fail - except for management jobs (as management jobs are sometimes metadata or deletion jobs)
Backup often to avoid large jobs. Verify integrity of large backup jobs often.
Complexity breeds error diversity
- backup policies ensure consistent data backups
- config parameters differ by policy
As the number of policies increases, the number of distinct error codes increases.
Lesson: design and prefer simpler (and fewer) backup policies.
Error code prediction
Historical data is insufficient for error prediction
- high variability in the inter-arrival times of most errors
They took a Machine Learning approach to evaluate some possibilities for prediction
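The "high variability" claim can be quantified with a coefficient of variation (stdev/mean) over error inter-arrival times - a CV well above 1 indicates bursty arrivals that defeat naive history-based prediction. A sketch with made-up timestamps (this is my illustration, not the paper's method):

```python
import statistics

def interarrival_cv(timestamps):
    """Coefficient of variation (stdev/mean) of the gaps between error
    occurrences. A CV well above 1 means arrivals are bursty."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return statistics.stdev(gaps) / statistics.mean(gaps)

regular = [0, 10, 20, 30, 40]   # evenly spaced errors: CV = 0
bursty = [0, 1, 2, 3, 100]      # a burst, then a long gap: CV ≈ 1.9
print(interarrival_cv(regular) < interarrival_cv(bursty))  # → True
```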
Where to from here:
More targeted error prediction
Config automation instead of defaults
Application-specific configuration validation
Work reduction to reduce needed downtime. If a 12-hour backup window isn't enough, let's look at ways of backing up less data.