Usenix ATC 2016 - Day 4

This is the fourth day of the Usenix Annual Technical Conference (June 2016), held in Denver, Colorado, USA.

As each presentation was delivered by a single researcher, only that main presenter is referenced in the text below. The links to the Usenix conference pages provide more information, as well as copies of the papers submitted for the conference.

Keynote - A Wardrobe For The Emperor

Bryan Cantrill, Joyent (acquired by Samsung)

https://www.usenix.org/conference/atc16/technical-sessions/presentation/cantrill

The keynote was largely about providing practitioners with a platform for making their work known.

Academic forums

Previously (prior to 2004), the only way for a practitioner to publish content was through paper submission to forums such as Usenix. Given the low acceptance rate (perhaps 25% of received papers), a perceived bias toward academic writing (i.e. work directly related to university research) and very stringent criteria to conform to, it was generally unlikely for a practitioner's work to be published through these forums.

Blogging happened

Zero-cost blogging gives practitioners a platform to publish small things that may be interesting to only a very small number of people.

Youtube happened

With the rise of Youtube, a conference becomes a studio audience: more people watch the video of a talk outside the conference than attend the event itself.

Github happened

Profound change: a practitioner no longer talks about work they have done without a link to Github.
The pace of development, the quality of code, the inclusion of contributions from far and wide, etc. have significantly increased. Additional tools such as automated lint testing and CI integration have further accelerated the release cycle of software.

Conference model flaws

Conferences used to be the key mechanism for publishing work. The model is now flawed: far too slow, with many problems.

Journals are not the answer either. The model needs disruption, not just a slightly different forum.

How about adding a social media aspect (comments, reviews, badges etc) to arXiv?

Conferences should push harder to bring academics and practitioners together. Papers should come from both groups, and a mandate for practical bias in submitted research would ensure it isn't merely academic.

Elfen Scheduling: Fine-Grain Principled Borrowing from Latency-Critical Workloads Using Simultaneous Multithreading

Kathryn S. McKinley, Microsoft Research
(paper prepared by Xi Yang of Australian National University, Canberra)

https://www.usenix.org/conference/atc16/technical-sessions/presentation/yang

Making the most of compute resources

Tail latency matters. Google, for instance, observed that a 400ms delay decreased searches per user by 0.59%.

Low utilisation achieves high responsiveness (less than 100ms per request) - e.g. on a single-core SMT system, up to 140 requests per second.

Given that the workload is varied, what about using the spare resources for other tasks (e.g. batch)?

  • context switching is expensive
  • it can take too long to de-couple the batch workload from the CPU when a request comes in for the latency-sensitive workload

Performance tests

When running on 1 core with 2 SMT lanes, without any special tuning, responsiveness for the latency-sensitive workload is still too slow.

Round-robin SMT allocation was tried next; performance is still poor.

Principled Borrowing - Nanonap

The batch process borrows idle hardware whenever the latency-sensitive workload is idle.

Nanonap releases the SMT hardware without releasing the SMT context:

  • the OS cannot schedule an interrupt on the napping lane
  • the OS cannot allocate its resources to other threads

Elfen Scheduling controls when nanonap is entered and exited.

The nanonap crossover time can be tuned to be less aggressive but more reliable.
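
A minimal sketch (in C) of the borrowing loop described above, assuming a hypothetical nanonap() primitive that parks the lane without giving up its SMT context, and a flag that the latency-critical runtime sets while a request is pending - the names are illustrative, not the paper's actual API:

    /* Sketch of Elfen-style borrowing on one SMT lane pair.
     * nanonap() and batch_run_one_quantum() are hypothetical helpers,
     * not the paper's API. */
    #include <stdatomic.h>
    #include <stdbool.h>

    extern void nanonap(void);               /* park this lane, keep its SMT context */
    extern bool batch_run_one_quantum(void); /* run a small slice of batch work */

    void borrow_idle_lane(atomic_bool *lc_has_work)
    {
        for (;;) {
            if (atomic_load_explicit(lc_has_work, memory_order_acquire)) {
                /* A latency-sensitive request arrived on the partner lane:
                 * get out of the way so it effectively runs alone on the core. */
                nanonap();
                continue;
            }
            /* The lane is idle: borrow it for batch work in short quanta so
             * the flag is re-checked frequently. */
            if (!batch_run_one_quantum())
                break;                       /* batch job finished */
        }
    }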

Results

The CPU is kept busy with batch workloads (100% utilisation), yet when a latency-sensitive request is assigned to the SMT lane, the lane is handed back quickly enough that responsiveness remains acceptable under load. Because every cycle is busy, the core never drops into a sleep state.

Coherence Stalls or Latency Tolerance: Informed CPU Scheduling for Socket and Core Sharing

Sharanyan Srikanthan, University of Rochester

https://www.usenix.org/conference/atc16/technical-sessions/presentation/srikanthan

The problem

When splitting two threads across hardware:

  • SMT (same core, same socket): problem - low-level cache sharing
  • same socket, different core: problem - last-level cache sharing
  • inter-socket sharing: problem - other shared resources (ring bus, memory, etc.)

This presentation builds on last year's "SAM" presentation with a further set of metrics and analysis:

https://www.usenix.org/conference/atc15/technical-session/presentation/srikanthan

Design Guidelines for High Performance RDMA Systems

Anuj Kalia - Carnegie Mellon University

This talk looks at clustered environments using RDMA for inter-node transport via RDMA-capable 100Gbps PCIe cards. RDMA is cheap (and fast), and therefore should be exploited.

Problem

RDMA performance depends on complex low-level factors; the talk examines the extent to which these factors can affect RDMA performance.

How to design an RDMA sequencer

One-sided RDMA does not use a remote CPU core when accessing data.
Whenever a client sends a request to the server, the sequencer returns the next value in the sequence.
Although the sequencer is simple, the same design lessons apply elsewhere (e.g. to a key-value store).

RDMA tunable characteristics

Which RDMA operations to use:

  • remote CPU bypass (one-sided):
    • read
    • write
    • fetch-and-add
    • compare-and-swap
  • remote CPU involved (two-sided):
    • send
    • recv

Transport options:

  • reliable
  • unreliable
  • connected
  • datagram

Optimisations:

  • inlined
  • unsignaled
  • doorbell batching
  • WQE shrinking
  • 0B-RECVs

Various combinations will provide desired performance for certain workloads/use-cases.

A one-sided fetch-and-add against a shared counter gives a very simple sequencer. They measured this approach: it achieved only 2.2M operations/sec (throughput the presenter compared to Gigabit Ethernet), even though the RDMA hardware provides 100Gbps.
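
For reference, posting one such fetch-and-add with libibverbs looks roughly like the sketch below; queue-pair setup, memory registration and completion handling are assumed to happen elsewhere, and post_fetch_add and its parameters are illustrative names rather than anything from the paper.

    /* Sketch: post one RDMA fetch-and-add against a remote 8-byte counter.
     * Assumes qp is an already-connected queue pair and that local_buf/lkey
     * and remote_addr/rkey come from earlier registration (omitted). */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    int post_fetch_add(struct ibv_qp *qp, uint64_t *local_buf, uint32_t lkey,
                       uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)local_buf,   /* previous counter value lands here */
            .length = sizeof(uint64_t),
            .lkey   = lkey,
        };
        struct ibv_send_wr wr, *bad_wr = NULL;

        memset(&wr, 0, sizeof(wr));
        wr.opcode                = IBV_WR_ATOMIC_FETCH_AND_ADD;
        wr.sg_list               = &sge;
        wr.num_sge               = 1;
        wr.send_flags            = IBV_SEND_SIGNALED;
        wr.wr.atomic.remote_addr = remote_addr;  /* address of the shared counter */
        wr.wr.atomic.rkey        = rkey;
        wr.wr.atomic.compare_add = 1;            /* increment by one */

        return ibv_post_send(qp, &wr, &bad_wr);  /* completion reaped via the CQ */
    }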

How do we resolve this?

Guidelines:

  • NICs have multiple processing units:
    • avoid contention
    • exploit parallelism
  • PCI Express messages are expensive:
    • reduce CPU-to-NIC messages (MMIOs)
    • reduce NIC-to-CPU messages (DMAs)

NIC on PCI Express is 500ns from L3 cache.

How do they reduce contention? By using CPU cores: instead of bypassing the server CPU with one-sided atomics, the sequencer is implemented as an RPC carried over RDMA writes.

Going from the original atomics approach to RPCs increases throughput from 2.2M to 7M operations/sec.

Next: reduce MMIOs with doorbell batching.

Push mode: the CPU pushes each message descriptor to the NIC, which is MMIO-intensive.

Pull mode (doorbell batching): the CPU "rings the doorbell" on the NIC once, and the NIC then pulls the message descriptors itself via DMA, avoiding the per-message back-and-forth over the PCI bus between CPU and NIC.
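
In verbs terms, doorbell batching amounts to chaining several work requests together and handing them to the NIC with a single post call (one MMIO "doorbell ring"); the NIC then fetches the descriptors itself by DMA. A rough sketch, with the per-request opcode and scatter-gather setup left out:

    /* Sketch: doorbell batching with libibverbs. Linking WRs via wr.next and
     * issuing a single ibv_post_send() lets the CPU ring the doorbell once
     * for the whole batch instead of once per message. */
    #include <infiniband/verbs.h>
    #include <stddef.h>

    int post_batch(struct ibv_qp *qp, struct ibv_send_wr *wrs, int n)
    {
        struct ibv_send_wr *bad_wr = NULL;

        for (int i = 0; i < n; i++)
            wrs[i].next = (i + 1 < n) ? &wrs[i + 1] : NULL;  /* chain the batch */

        return ibv_post_send(qp, &wrs[0], &bad_wr);          /* single doorbell */
    }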

Doorbell batching achieves 16M operations/sec - an improvement, but there is still more to go.

Exploit NIC parallelism with multiple queues

Splitting requests across 3 parallel queues on the NIC increased throughput to 27.4M operations/sec.

Increase to 6 CPU cores

Increasing to 6 cores resulted in 97.2M operations/sec - almost the desired 50x increase, but not quite there.

Reduce DMA size - header-only messages

Move the 64-bit payload into the packet header and drop the remainder of the datagram.
This increased throughput to 122M operations/sec.

Balancing CPU and Network in the Cell Distributed B-Tree Store

Chris Mitchell - New York University

https://www.usenix.org/conference/atc16/technical-sessions/presentation/mitchell

The server maintains locality of data and computation: it receives a request, processes it, and returns the desired data.

Problem: the server is CPU-bound

Load spikes result in saturated CPUs.

Options:

  • over-provision servers (wasteful)
  • spin up extra servers (slow)

Relax locality by processing requests at the client:

  • clients fetch the required state first
  • RDMA enables client-side processing

We don't always want to use client-side processing, though.

Problem: the server is NIC-bound

If the NIC is the bottleneck, use server-side processing:

  • if you have excess server CPU, it is best to use it

Solution

Selectively relaxing locality improves load balancing: look at the load and decide whether to process on the client or the server side.

The researchers call their solution "Cell" - a distributed B-tree store.

Design choice 1

Only send 'read' operations to clients.

Client or server side:

  • search (only when the server is under CPU load)
  • scan (only when the server is under CPU load)

Server side only:

  • insert
  • delete
  • no balancing
  • no distributed locks

Challenge: there is no easy way to order client-side RDMA reads with respect to server-side updates.

Solution: make reads and writes safe by putting versioning (an atomic version number) on the data.
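
A minimal, seqlock-style sketch of the version-check idea: the writer makes the version odd while it updates a node and even again when it finishes, and a reader retries until it sees the same even version before and after copying the data. The layout and names here are purely illustrative and are not Cell's actual node format.

    /* Sketch: validating a client-side read of a B-tree node against
     * concurrent server-side updates using a version number (seqlock style).
     * NODE_BYTES and the struct layout are illustrative only. */
    #include <stdatomic.h>
    #include <stdint.h>
    #include <string.h>

    #define NODE_BYTES 1024

    struct node {
        atomic_uint_fast64_t version;   /* odd while the server is updating */
        uint8_t payload[NODE_BYTES];
    };

    /* Server-side update: bump to odd, write the payload, bump back to even. */
    void server_update(struct node *n, const uint8_t *new_payload)
    {
        atomic_fetch_add_explicit(&n->version, 1, memory_order_release);
        memcpy(n->payload, new_payload, NODE_BYTES);
        atomic_fetch_add_explicit(&n->version, 1, memory_order_release);
    }

    /* Client-side read: retry until a consistent snapshot is observed.
     * (In Cell the version loads and the payload copy would be RDMA reads;
     * plain local loads stand in for them here.) */
    void client_read(struct node *n, uint8_t *out)
    {
        uint_fast64_t v1, v2;
        do {
            v1 = atomic_load_explicit(&n->version, memory_order_acquire);
            memcpy(out, n->payload, NODE_BYTES);
            v2 = atomic_load_explicit(&n->version, memory_order_acquire);
        } while ((v1 & 1) || v1 != v2);
    }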

Design choice 2

Choosing when to use the client or the server side:

  • naive approach: pick the path with the lowest latency
  • this is suboptimal - the goal is to keep both the NIC and the CPUs occupied

Pitfalls:

  • properly weighting operations
    • extremely short transient conditions - handle outliers with a moving average
  • stale measurements

Design choice 3

Client-side locality selector
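
One concrete piece of such a selector is the weighting pitfall above: smoothing the latency samples so that extremely short transients don't flip the decision. A minimal sketch using an exponentially weighted moving average - the names and constants are made up for illustration and are not Cell's:

    /* Sketch: smoothing latency samples for a client-side locality selector.
     * An exponentially weighted moving average damps short transient spikes.
     * The decision logic itself (which also has to keep both the NIC and the
     * server CPUs busy, not just chase the lowest latency) is not shown. */

    #define EWMA_ALPHA 0.2   /* weight given to the newest sample */

    struct path_stats {
        double ewma_us;      /* smoothed latency estimate, in microseconds */
        int    samples;      /* number of samples seen so far */
    };

    static void record_latency(struct path_stats *s, double sample_us)
    {
        if (s->samples++ == 0)
            s->ewma_us = sample_us;   /* first sample seeds the average */
        else
            s->ewma_us = EWMA_ALPHA * sample_us
                       + (1.0 - EWMA_ALPHA) * s->ewma_us;
    }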

Testing
C++, 16,000 lines of code
InfiniBand with TCP connection mode

Results

Is selectively relaxed locality fast? Yes.
  • 170% performance improvement compared to strict server-side-only processing
  • saves 2 cores per NIC

Is selectively relaxed locality faster than client-side or server-side alone? Yes.
  • server-side only tops out around 150K operations/sec
  • client-side only tops out around 400K operations/sec
  • using both achieved over 600K operations/sec

Using mixed workloads (insert, update, get, etc.), Cell was the winner in all cases tested.

Does it handle load spikes? Yes.

Latency is significantly lower when searches move to less-utilised clients (the switch-over happens within 200ms), and stays lower for the full 5-second spike.

Conclusions:

Tomorrow's data centres will include RDMA-capable, ultra-low-latency networks.
New system architectures:

  • selectively relaxed locality for load balancing and CPU efficiency
  • Self-verifying data structures make this practical
  • locality-relaxation techniques work at scale

An Evolutionary Study of Linux Memory Management for Fun and Profit

Jian Huang - Georgia Tech

https://www.usenix.org/conference/atc16/technical-sessions/presentation/huang

Virtual memory has been under development for over 60 years.

As more features have been added and new hardware supported, system reliability has become more of a concern.

This study is a statistical look at how Linux memory management has been optimised over time, with some analysis of why.

Findings in memory management since 2.6 kernel:

Far too many stats to include in these notes, but some highlights are below. Worth reading through the paper for more details, explanations etc.

  • Code changes are highly concentrated around key functions:
    • 80% of patches touch 25% of the total source code
    • a 60% increase in lines of code overall
    • 13 "hot files" out of 90 files (i.e. development has focussed on those 13 files)
  • The percentage of bug fixes is increasing and the percentage of new features is decreasing (the focus is now largely on bug fixes)
  • Most bug fixes target errors in:
    • memory allocation, virtual memory management (VMM), and garbage collection

Getting Back Up: Understanding How Enterprise Data Backups Fail

George Amvrosiadis - University of Toronto

https://www.usenix.org/conference/atc16/technical-sessions/presentation/amvrosiadis

Backups in the news:

  • 123-reg erases customer data - no backups
  • Salesforce loses 4 hours of data - backup incomplete

Business surveys show backups fail often:

  • 27% have lost data due to backup errors
  • 80% have trouble configuring backup software

Collection of data

The researchers partnered with Veritas to collect telemetry data across their customer base: 775M jobs from 20,000 installations over 3 years.

  • 604M data backup jobs
  • 105M data management jobs (e.g. copying to other media)
  • 6.3M data recovery jobs

Conclusions from the data:

  • Jobs fail often (roughly 10-30% of jobs)
    • early adopters of new versions of the backup software are subject to higher failure rates
    • as a release becomes stable, failures level out at around 9%
  • Errors are not very diverse
    • 333 distinct error codes were observed (28% of all documented error codes)
    • testing is insufficient - 59 of these codes only showed up in production (never during the testing phase)
    • 64% of errors come from just 5 error codes:
      • 2% insufficient backup window (the job can never be scheduled)
      • 7% device full (can't back up the data)
      • 11% no tapes available
      • 15% invalid filesystem block size or maximum file size breached
      • 25% partial backup due to file permissions
  • Categorised by cause:
    • 75% configuration errors - these can be resolved by taking specific steps (changing parameters, etc.)
    • 22% system errors
    • 2% non-fatal errors

We need better configuration validation tools or self-healing mechanisms

Categorised by job type:

  • 28% of all management jobs fail
  • 9% of all backup jobs fail
  • 2% of all recovery jobs fail

Categorised by job size:

  • larger jobs are more likely to fail - except for management jobs (which are sometimes just metadata or deletion jobs)

Back up often to avoid large jobs, and verify the integrity of large backup jobs often.

Further notes

Complexity breeds error diversity:

  • backup policies ensure consistent data backups
  • configuration parameters differ by policy

As the number of policies increases, the number of distinct error codes increases.

Lesson: design and prefer simpler (and fewer) backup policies.

Error code prediction

Historical data is insufficient for error prediction

  • high variability in the inter-arrival times of most errors
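
As a rough illustration of what "high variability" means here, one common check is the coefficient of variation (standard deviation divided by mean) of the inter-arrival times: a value well above 1 indicates bursty arrivals that simple history-based prediction handles poorly. A small sketch, not taken from the paper:

    /* Sketch: coefficient of variation (CV) of error inter-arrival times.
     * gaps[] holds the time between consecutive occurrences of one error code;
     * a CV well above 1 indicates bursty, hard-to-predict arrivals. */
    #include <math.h>
    #include <stddef.h>

    double interarrival_cv(const double *gaps, size_t n)
    {
        if (n == 0)
            return 0.0;

        double mean = 0.0, var = 0.0;
        for (size_t i = 0; i < n; i++)
            mean += gaps[i];
        mean /= (double)n;

        for (size_t i = 0; i < n; i++)
            var += (gaps[i] - mean) * (gaps[i] - mean);
        var /= (double)n;

        return mean > 0.0 ? sqrt(var) / mean : 0.0;
    }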

They took a Machine Learning approach to evaluate some possibilities for prediction

Where to from here:

  • more targeted error prediction
  • configuration automation instead of defaults
  • application-specific configuration validation
  • work reduction to reduce the needed downtime - if a 12-hour backup window isn't enough, look at ways of backing up less data