Usenix ATC 2016 - Day 3

This is the third day of the Usenix Annual Technical Conference (June 2016), held in Denver, Colorado, USA. Today the main conference commenced, after the HotStorage/HotCloud workshop series concluded.

As each presentation was delivered by a single researcher, only this main presenter is referenced in the text below. The links to the Usenix conference pages provide more information, as well as copies of the papers submitted for the conference.

Keynote - The Future of Infrastructure

Martin Casado

https://www.usenix.org/conference/atc16/technical-sessions/presentation/casado

How VCs come up with investment decisions

The prevailing view is that (traditional) infrastructure is dead.
Many large infrastructure companies (e.g. IBM) are seeing massive downturns in revenue.
AWS/Azure/SoftLayer/etc. push efforts towards the software/app layer, leaving infrastructure to someone else.

IT is a very large market - $4 trillion per year (not sure if US or global)
IT has been hugely disrupted. The disruption is only getting started.
$220 billion in cloud, only 6% of the total IT market.

Massive change at a business level - IT products are no longer sold only B2B; they are being sold directly to EVERYONE (especially with smartphones, the Internet of Things, etc.).

The market size of IT is further growing due to the increased number of devices connected (IOT, smartphones, etc).

It is very difficult for companies to cannibalise themselves so they can pivot in a different direction. Some have managed this well (e.g. Honeywell now makes thermostats), but many are not able to adapt to the changing IT landscape.

Software defined movement

There are now many Software Defined movements. For instance, the smartphone has disrupted many fixed-function devices - gaming devices, GPS, mail clients, telephones, etc.

This was further driven by significant global uptake - "everyone has an iPhone" - so apps and features are increasingly developed for the emerging platform.

APIs

Previously, the only way to introduce functionality was to wrap it in sheet metal and deliver a fixed-function device (e.g. a TomTom GPS, a Casio calculator, etc). The only standard 'API' was IP (networking protocols). As a startup, there was only one way to deliver something for it to be adopted.

Now, there are many layers of integration (hypervisor, platform, OS, application, container) and commonly adopted API driven frameworks allowing rapid development and widespread consumption.

Nicira story
  • Founded in 2007
  • Implemented a platform where all networking was defined at a software layer
  • Received $40M in government funding to build a software network switch + controller
  • Acquired by VMware for $1.26B in 2012
  • VMware networking is now a $600M p.a. business (as of 2015)
Why did they develop software-defined?

One of the limitations in networking was that features and functionality were always defined by what the vendor was willing to offer, and were typically heavily tied to physical hardware.
Software-defined networking was an answer to this - if you need new features, it is as simple as writing the code.

Traditionally when people think about IT they think about hardware. The supply chain knows how to sell hardware. There was a perception that you couldn't simply create software-defined infrastructure.

Software disrupts the Infrastructure Delivery model

Traditional:

  • Supply chain
    physical design and manufacture
    inventory management
  • Delivery
    physical box
    costly trials
  • Product insertion
    rip & replace
    costly to revert

Software-defined

  • Supply chain
    N/A
  • Delivery
    download & trial
  • Product insertion
    flexible low-cost implementations (parallel, cutover, etc.)
    safe/easy snapshot revert

Software As A Service

Applications made this transition in the late 2000s

Old world: SAP (on-premise)
New World: Salesforce (cloud)

The new world also has better hooks - it allows integration with other services at all layers (security, availability, new features, etc.) without the complexity of supporting physical hardware. It also provides better opportunities for developer and sysadmin support.

Rise of the developers

With the move to software, developers have more control

  • They are now demanding infrastructure for dev/test
  • They are getting budgets
  • Businesses are aligning more with developers
  • They are being rewarded more, business wants developers to be happy

Traditionally, incumbents have a great advantage in access to customers

  • Relationships
    account management etc.
  • Procurement
    certifications
  • Analysts
    magic quadrants etc.
  • Channels
    must purchase through a channel partner
  • "Day 2" operations

The enterprise sales process is complicated and messy; consumers are engaged with channel partners etc. with whom they have never previously dealt. The ability to buy something is so complex that it even requires specialists to understand how to purchase. Think IBM presales - how does one purchase a Power7 server? You don't just go to OfficeWorks and buy one.

The movement to software has completely invalidated this

Devs don't care about

  • Gartner reports
  • Certifications / training
  • Long-standing relationships
  • Golf
  • Slow procurement processes

Developers do care about

  • Traction among other developers
  • Community
  • APIs
  • Try & Buy
  • Low friction to initial adoption
  • Open Source and technical elegance

There are companies which appear out of nowhere (e.g. Atlassian) and generate billions of dollars in revenue because they target the developer-centric software-driven market.

Open Source:

Investment to date

  • $7B Venture Capital has been invested in Open Source
  • $1B has been returned

Why?

So far nobody seems to have discovered the secret to making open source profitable.

Open source platforms aren't typically profitable by themselves; you still need to build a massive army to educate developers etc. in order to successfully innovate on top of these "loss-leader" platforms.
Open source is great for getting traction.

The move is towards "open source as a service" for profitability, e.g. GitHub, DigitalOcean, Databricks, etc.

Incumbent lifecycle

No traditional infrastructure silo is safe

  • storage
  • compute
  • networking
  • security
  • databases
  • analytics
  • development

All startups will become incumbents (or will die). What is innovative today will be irrelevant soon.

One unanswered topic - with the significant push towards software defined everything, driving development trends towards software (i.e. students etc are learning software), how do we facilitate innovation and rapid development in hardware?

VCs don't invest as readily in hardware trends; software brings a safer return. Again, if nobody is willing to invest because the hardware lifecycle is so short, how do we drive it forward?

Leveraging Multi-path Diversity for Transport Loss Recovery in Data Centers

Guo Chen - Tsinghua University

https://www.usenix.org/conference/atc16/technical-sessions/presentation/chen

This paper looked at proactively recovering from TCP losses instead of waiting for timeouts to occur, accelerating loss recovery.

Tail Flow Completion Time

  • Services care about tail flow completion time (FCT); there are a large number of flows per operation
  • Overall performance is governed by the last completed flow (see the small example below)
  • Packet loss hurts tail FCT
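To make the "last flow dominates" point concrete, here is a tiny numeric illustration (the numbers are made up, not from the paper):

```python
# An operation that fans out into many flows completes only when the slowest
# flow completes, so a single flow that hits a loss/timeout dominates the
# user-visible latency.

flow_completion_times_ms = [0.5] * 99 + [100.5]   # 99 clean flows, 1 flow hit a timeout
operation_time_ms = max(flow_completion_times_ms)
print(f"median flow: {sorted(flow_completion_times_ms)[50]} ms, "
      f"operation completes in: {operation_time_ms} ms")
```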

Case Study - Microsoft DCN

They measured packet loss in a Microsoft DCN; the mean loss rate was about 4% over 5 days.

Caused by

  • Congestion loss (buffer overflow)
    due to uneven load balancing and incast
  • Failure loss (silent loss)
    random drops
    packet black holes

Why packet loss hurts tail flow

How TCP handles loss

  • Fast recovery/fast retransmit (triggered by duplicate ACKs)
  • Timeout (if not enough duplicate ACKs arrive, wait for the RTO, then retransmit)
    e.g. RTO (~100ms) >> RTT, so timeout-driven retransmission is slow

Prior work adds aggressiveness to TCP's congestion control to perform loss recovery before the timeout. Deciding how long to wait before resending is a challenge: if recovery packets are sent too quickly, they cause further congestion and compound the loss.
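As a rough illustration of why tail losses are so painful, here is a minimal sketch (not the paper's code) of the standard duplicate-ACK / RTO decision; the RTT and RTO values are assumptions in line with the numbers quoted above:

```python
# Why a lost packet near the end of a short flow often has to wait for the RTO:
# with few packets in flight there may not be enough duplicate ACKs to trigger
# fast retransmit, so recovery falls back to the (much larger) timeout.

RTT = 0.0002          # assumed intra-DC round trip time: 200 microseconds
RTO = 0.1             # assumed minimum retransmission timeout: 100 ms
DUPACK_THRESHOLD = 3  # standard fast-retransmit trigger

def recovery_delay(packets_after_loss: int) -> float:
    """Time until the sender starts retransmitting a single lost packet."""
    if packets_after_loss >= DUPACK_THRESHOLD:
        # Each later packet elicits a duplicate ACK; after three of them the
        # sender fast-retransmits, roughly one RTT after the loss.
        return RTT
    # Tail loss: not enough later packets to generate three duplicate ACKs.
    return RTO

for tail_packets in (0, 1, 3, 10):
    d = recovery_delay(tail_packets)
    print(f"{tail_packets:2d} packets after the loss -> recovery starts after {d*1000:.1f} ms")
```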

Ideal loss recovery algorithm

  • Should be fast for failure loss
  • Should be slow for congestion loss

(these are opposite and incompatible approaches)

Proposed approach

Multi Path Loss Recovery (they call this "FUSO")

  • Use one path for initial data transfer
  • Use a second path for loss recovery

Assume a connection with three sub-paths (think LACP, but this isn't really an LACP implementation).
FUSO identifies the best (least congested) subflow and the worst subflow; loss recovery is performed over an alternate path rather than retrying the same path.

Using a second path for retransmits, it is able to initiate loss recovery faster than traditional TCP loss recovery, resulting in far shorter tail flow completion times.
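The following is a rough sketch of the multi-path recovery idea as I understood it, not the FUSO implementation; the subflow fields and the selection heuristic are my assumptions:

```python
# Sketch: when there is no new data to send, speculatively re-send packets that
# are still unacknowledged on the "worst" subflow over the "best" subflow.

from dataclasses import dataclass, field

@dataclass
class Subflow:
    name: str
    srtt_ms: float                                # smoothed RTT estimate (assumed available)
    retransmissions: int                          # observed retransmits, a proxy for path quality
    unacked: list = field(default_factory=list)   # packets sent but not yet ACKed

def pick_best(subflows):
    # Prefer paths with few retransmissions, then low RTT.
    return min(subflows, key=lambda s: (s.retransmissions, s.srtt_ms))

def pick_worst(subflows):
    return max(subflows, key=lambda s: (s.retransmissions, s.srtt_ms))

def proactive_recovery(subflows, have_new_data: bool):
    if have_new_data:
        return []  # only use spare capacity for recovery
    best, worst = pick_best(subflows), pick_worst(subflows)
    if best is worst or not worst.unacked:
        return []
    # Duplicate the oldest unACKed packet of the worst path onto the best path.
    pkt = worst.unacked[0]
    best.unacked.append(pkt)
    return [(pkt, worst.name, best.name)]

paths = [Subflow("A", 0.2, 0, [7, 8]), Subflow("B", 0.3, 4, [3]), Subflow("C", 0.25, 1, [])]
print(proactive_recovery(paths, have_new_data=False))   # [(3, 'B', 'A')]
```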

In the case study, introducing FUSO reduced packet loss from ~4% to less than 1% and significantly improved tail Flow Completion Time.

StackMap - Low-Latency Networking with the OS Stack and dedicated NICs

Michio Honda (Keio University)

https://www.usenix.org/conference/atc16/technical-sessions/presentation/yasukata

An interesting approach that combines a kernel-bypass-style fast data path with the kernel's TCP/IP stack and the socket API.

Current landscape:

  • Message-oriented communication over TCP is common (e.g. HTTP, CDNs, etc.)
  • The Linux network stack can serve 1KB messages at only 3.5Gbps with a single core

Should the socket API be improved?

One option is a user-space TCP/IP stack, but maintaining and updating a separate copy of today's TCP stack is hard, and it is difficult to keep it compatible.

The StackMap approach achieves greater throughput (4.5Gbps) at lower latencies.

Current approach: many requests are processed within each epoll_wait() cycle, while new requests are queued in the kernel.

Increasing the number of TCP connections results in larger epoll_wait() batches and higher latency.
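For readers unfamiliar with the epoll batching behaviour being described, here is a minimal, self-contained sketch using Python's selectors module (epoll on Linux); the address and the HTTP-ish response are placeholders:

```python
# Each poll returns a *batch* of ready connections; everything queued behind
# that batch waits, so per-request latency grows with the number of busy
# connections.

import selectors
import socket

sel = selectors.DefaultSelector()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))      # any free port; address is an assumption
listener.listen()
listener.setblocking(False)
sel.register(listener, selectors.EVENT_READ, data="accept")

def serve_once(timeout=1.0):
    events = sel.select(timeout)     # one epoll_wait() cycle
    print(f"batch of {len(events)} ready file objects")
    for key, _ in events:
        if key.data == "accept":
            conn, _ = key.fileobj.accept()
            conn.setblocking(False)
            sel.register(conn, selectors.EVENT_READ, data="client")
        else:
            payload = key.fileobj.recv(4096)
            if payload:
                key.fileobj.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
            else:
                sel.unregister(key.fileobj)
                key.fileobj.close()

serve_once()  # with no clients connected this simply reports an empty batch
```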

Where could we improve?

  • Conventional systems introduce end-to-end latencies of tens to hundreds of microseconds
  • The socket API comes at significant cost (read/write/epoll_wait)
  • Packet I/O is expensive
  • TCP/IP processing itself is cheap

Stackmap approach

  • Dedicate a NIC to an application
  • Use the netmap API for the data path (syscall and packet I/O batching, zero copy, run-to-completion)
  • Persistent, fixed-size sk_buffs, to call efficiently into the kernel TCP/IP stack (it still needs to talk to the kernel)
  • Static packet buffers and DMA mappings

Experimental results

The wrk HTTP and memaslap memcached benchmark tools were used.

  1. HTTP serving 1KB messages on a single core
  • Throughput: 4Gbps (traditional) vs 6Gbps (StackMap)
  • Latency: 300ms (traditional) vs 150ms (StackMap)

(I wasn't quick enough to capture the other examples, but they found that as they increased the number of cores the improvements became less significant.)

Conclusion

  • For message-oriented communication over TCP:
    the kernel TCP/IP stack is fast
    but the socket API and packet I/O are slow
  • Most techniques used by kernel-bypass stacks can be brought into the OS stack
  • StackMap provides
    latency reductions of between 4% and 80%
    throughput improvements of between 4% and 391%

Scalable low latency indexes for a Key Value Store

Ankita Kejriwal - Stanford University / PlatformLab

https://www.usenix.org/conference/atc16/technical-sessions/presentation/kejriwal

Can a key value store support strongly consistent secondary indexes whilst operating at low latency and large scale?

They implemented SLIK, achieving both low latency and scalability.

  • Traditional RDBMSs were prevalent
    they provided rich data models and consistency
    but lacked scalability
  • The move to NoSQL gained latency and scalability, often at the cost of data models and/or consistency

Consistency (e.g. when changing a key's value) is achieved by writing the new index entry (pointing at the object) first, then removing the old one, so there is a small window during which both index entries exist. When a query is performed and a stale index entry is returned, the lookup of the actual object identifies that the entry no longer matches and ignores it, returning only the true, up-to-date data.
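A minimal sketch of that validation idea (my own illustration, not SLIK's implementation; the record fields are made up):

```python
# The secondary index may briefly contain stale entries, so every index hit is
# validated against the primary record before being returned.

primary = {}          # primary key -> record (the source of truth)
secondary = {}        # secondary key (e.g. "name") -> set of primary keys

def put(pk, record):
    # 1. add the new index entry first, 2. write the object, 3. clean up the old entry
    old = primary.get(pk)
    secondary.setdefault(record["name"], set()).add(pk)
    primary[pk] = record
    if old and old["name"] != record["name"]:
        secondary[old["name"]].discard(pk)   # stale entry removed last

def lookup_by_name(name):
    results = []
    for pk in secondary.get(name, set()):
        record = primary.get(pk)
        # Validation step: ignore index entries that no longer match the object.
        if record is not None and record["name"] == name:
            results.append(record)
    return results

put(1, {"name": "alice", "age": 30})
put(1, {"name": "alicia", "age": 31})   # rename: old entry briefly coexists
print(lookup_by_name("alice"))          # [] -- stale entry filtered out / cleaned up
print(lookup_by_name("alicia"))         # [{'name': 'alicia', 'age': 31}]
```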

Understanding Manycore Scalability of Filesystems

Changwoo Min - Georgia Institute of Technology

https://www.usenix.org/conference/atc16/technical-sessions/presentation/min

Manycore vs MultiCore Intro

Multicore processors provide high per-core speed (e.g. 3GHz) across a small number of cores, typically up to ~20.
Manycore systems deliver tens or hundreds of cores at lower clock speeds.

Scaling Today

  • applications need to parallelise IO operations.
  • death of single core CPU scaling (frequency staying around 3GHz)
  • cores upper limit around 24 cores

From mechanical HDDs to flash SSDs

  • IOPS up to 1 million
  • Non-volatile memory (e.g. 3D XPoint) offering 1,000x improvements over SSDs

There is typically a lack of understanding of internal scalability within applications.
Quite often, adding more CPU cores will not improve performance linearly: perhaps 10-20 cores will see improvements, but not 50, 100 or 200 cores.
The filesystem often becomes the performance bottleneck once many cores are added.

  1. What filesystem operations are not scalable?
  2. Why are they not scalable?
  3. Is it an implementation problem or a poor design?

Tech challenges

  • Cannot see the next bottleneck until solving the current one.
  • this makes it difficult to understand scalability behaviour

Evaluation & Analysis

The researchers used FxMark to evaluate and analyse manycore scalability across multiple filesystem types, CPU core counts, a RAM disk, and different sharing levels (i.e. processes either share blocks or access independent blocks).

Accessing independent blocks vs shared

When accessing independent blocks, we see linear scalability.
As soon as files are shared, performance degrades as more CPUs are introduced; high levels of file sharing are woefully poor across all filesystems.

This is because page reference counting in the kernel cannot handle page evictions at scale: high contention on a page reference counter results in huge memory stalls, with CPU cycles spent waiting on memory accesses.

Lesson Learnt - scalability of the cache hit is important.
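As an illustration of the kind of contention being described, and one generic way to avoid it, here is a toy sharded reference counter; this is not how the Linux kernel implements page reference counts:

```python
# One way to reduce contention on a hot reference counter is to shard it, so
# most increments and decrements touch a core-local counter instead of a
# single shared cache line.

import threading

class ShardedRefCount:
    def __init__(self, shards=16):
        self.counts = [0] * shards
        self.locks = [threading.Lock() for _ in range(shards)]

    def get(self, shard):
        i = shard % len(self.counts)
        with self.locks[i]:
            self.counts[i] += 1

    def put(self, shard):
        i = shard % len(self.counts)
        with self.locks[i]:
            self.counts[i] -= 1

    def value(self):
        # The expensive global sum is only needed on the slow path
        # (e.g. deciding whether a page can actually be evicted).
        return sum(self.counts)

rc = ShardedRefCount()
rc.get(shard=3)
rc.get(shard=7)
rc.put(shard=3)
print(rc.value())   # 1
```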

Data block overwrite

At low sharing levels, btrfs is copy-on-write (providing a good level of consistency); the new block allocations create an I/O bottleneck as more work is involved in checkpointing.

Lesson Learnt - overwriting can be as expensive as appending; consistency mechanisms need to be scalable.

The entire file is locked regardless of the update range.

Lesson learnt - a file cannot be concurrently updated; the techniques used in parallel filesystems need to be considered.

Pilot

If contention is removed from the filesystem, can we scale beyond ~80 cores?

  • Yes - partitioning the RAM-disk filesystem (into 60 partitions) doubled the performance of the overall system at around 20 cores (see the sketch after this list)
  • However, doing this on a physical HDD significantly reduced performance
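A hypothetical sketch of the partitioning idea, using plain directories to stand in for the 60 RAM-disk partitions (the paths and shard count are assumptions):

```python
# Spread files across several independent filesystem instances so unrelated
# files do not contend on the same in-memory filesystem structures.

import hashlib
import os

PARTITIONS = 60
ROOT = "/tmp/fxmark-partitions"      # assumed scratch location

def partition_for(path: str) -> str:
    digest = hashlib.sha1(path.encode()).hexdigest()
    shard = int(digest, 16) % PARTITIONS
    return os.path.join(ROOT, f"part{shard:02d}")

def write_file(path: str, data: bytes) -> None:
    shard_dir = partition_for(path)
    os.makedirs(shard_dir, exist_ok=True)
    with open(os.path.join(shard_dir, path.replace("/", "_")), "wb") as f:
        f.write(data)

write_file("logs/app/2016-06-24.log", b"hello")
print(partition_for("logs/app/2016-06-24.log"))
```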

Summary:

  • Manycore scalability should be an important consideration in file system design
  • New challenges in scalable file system design:
    minimising contention
    scalable consistency
    spatial locality

ParaFS: A Log-Structured File System to Exploit the Internal Parallelism of Flash Devices

Jiacheng Zhang, Tsinghua University

https://www.usenix.org/conference/atc16/technical-sessions/presentation/zhang

Most flash filesystems are log-structured. The problems with this include

  • Duplicated functions (performed both on the flash device and in the file system)
    space allocation and garbage collection (the filesystem simply issues TRIM and doesn't coordinate its actions with the device)
  • Semantic isolation
    neither layer knows what is happening at the other level
  • The narrow block I/O interface
  • The "log on log" problem

F2FS specific analysis

F2FS has poorer performance than ext4 on SSDs:

  • Lower garbage collection efficiency
    more recycled blocks
  • Internal parallelism conflicts
    broken data grouping
    uncoordinated GC operations
    ineffective I/O scheduling - erase operations always block reads/writes, whilst writes delay reads

Current approaches

  • Log-structured file systems
    (see the problems above)
  • Object-based filesystems
    very aggressive changes, difficult to adopt
    lack of research

ParaFS

Coordinated block mapping, coordinated GC, coordinated scheduling, parallel-aware filesystem.

  • Simplified FTL
    exposes the physical layout to the FS (flash channels, size of flash block, size of flash page)
    static block mapping
  • Aligned block layout
    the GC erase process is simplified
    WL and ECC remain as functions which need hardware support
  • Multi-threaded GC optimisation
    one GC process per region
    a GC control process
  • Request dispatching
    select the least busy channel to dispatch a write request (see the sketch after this list)
  • Request scheduling phase
    time slices for read request scheduling and write/erase request scheduling
    schedule write or erase requests according to space utilisation and the number of concurrently erasing channels
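A hedged sketch of the dispatching and scheduling ideas described above (not the ParaFS code; the queues, thresholds and time-slice logic are my assumptions):

```python
# Writes go to the least-busy flash channel, and each channel alternates
# between a read time slice and a write/erase time slice so erases do not
# starve reads.

class Channel:
    def __init__(self, name):
        self.name = name
        self.write_queue = []
        self.erase_queue = []
        self.read_queue = []

    def pending_work(self):
        return len(self.write_queue) + len(self.erase_queue)

def dispatch_write(channels, request):
    # Parallelism-aware dispatch: pick the channel with the least queued work.
    target = min(channels, key=Channel.pending_work)
    target.write_queue.append(request)
    return target.name

def schedule(channel, slice_is_read, space_utilisation, erasing_channels, max_erasing=2):
    """Return the next request to issue on this channel for the current time slice."""
    if slice_is_read and channel.read_queue:
        return ("read", channel.read_queue.pop(0))
    # Write/erase slice: only erase when space is tight and not too many
    # channels are already erasing (both thresholds are assumptions).
    if channel.erase_queue and space_utilisation > 0.9 and erasing_channels < max_erasing:
        return ("erase", channel.erase_queue.pop(0))
    if channel.write_queue:
        return ("write", channel.write_queue.pop(0))
    return None

channels = [Channel(f"ch{i}") for i in range(4)]
for req in range(10):
    dispatch_write(channels, f"w{req}")
print([c.pending_work() for c in channels])   # roughly balanced: [3, 3, 2, 2]
print(schedule(channels[0], slice_is_read=False, space_utilisation=0.95, erasing_channels=0))
```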

Evaluation of ParaFS

Under light load there isn't much improvement.
Under heavy load, ParaFS generally outperforms all other filesystem types significantly.

  • GC:
    significantly fewer recycled blocks
    high GC efficiency
  • Approximately 30% less written I/O due to efficiency gains

Conclusion

Internal parallelism has not been leveraged in current filesystems

Optimising every operation

Rob Johnson - Stony Brook University

https://www.usenix.org/conference/atc16/technical-sessions/presentation/yuan

This is a re-presentation from FAST16
https://www.usenix.org/conference/fast16/technical-sessions/presentation/yuan

which is a follow-up to last year's Usenix paper: https://www.usenix.org/conference/fast15/technical-sessions/presentation/jannen

What is BetrFS

BetrFS is a high-performance, general-purpose filesystem that attempts to combine the best performance characteristics of several filesystem design approaches.

Trade-offs

Filesystem design choices (such as full filesystem indexing vs inode addressing) dictate how the system behaves in different circumstances. Examples of trade-offs include

  • sequential reads vs random writes
  • renames vs recursive filesystem scans

BetrFS uses full filesystem (full-path) indexing, not inodes.

Recursive grep (in BetrFS)

  • It starts by reading the index to understand the layout
  • It then performs a full sequential read to analyse the data
  • This behaviour is optimal, as it utilises close to the full available I/O throughput of the disk with serial reads
  • BetrFS is 15x faster than ext4 at this activity

Moving a directory (in BetrFS)

  • It needs to re-generate the index, which takes a long time
  • BetrFS is more than 30x slower than ext4

So presently, there isn't a "one size fits all" solution to achieve optimal performance for all activities.

Proposed solution - BetrFS filesystem zoning

A technique for fast renames and scans

This aims to deliver enough indirection (via inode-style table updates) for fast renames and enough locality (leveraging filesystem indexing) for fast scans - the best of both approaches.

A single zone contains a recursive set of directories (i.e. a directory subtree).
Inodes track the zones; the filesystem index contains the set of files within each zone.

  • Moving the root of a zone is cheap
  • Moving contents within a zone to a different zone is more expensive
  • To get fast renames we need small zones
  • To get fast scans we need large zones
  • Ideally we are looking for the sweet spot in the middle

How big should zones be?
Depends on the data in there...
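A small sketch of the zoning idea as I understood it (not BetrFS's on-disk format; the zone table and key scheme are illustrative):

```python
# Paths are split into (zone id, path relative to the zone root). Renaming a
# zone root only updates the small zone table; files inside a zone stay under
# one contiguous key prefix, which keeps recursive scans sequential.

zone_roots = {1: "/", 2: "/home/alice/project"}          # zone id -> root path
index = {                                                # (zone id, relative path) -> metadata
    (2, "src/main.c"): "inode-ish metadata",
    (2, "Makefile"): "inode-ish metadata",
    (1, "etc/fstab"): "inode-ish metadata",
}

def resolve(path):
    """Map an absolute path to its (zone, relative path) key."""
    zid, root = max(
        ((z, r) for z, r in zone_roots.items() if path.startswith(r)),
        key=lambda zr: len(zr[1]),
    )
    return zid, path[len(root):].lstrip("/")

def rename_zone_root(zone_id, new_root):
    # Cheap rename: one entry in the zone table changes, no index keys move.
    zone_roots[zone_id] = new_root

print(resolve("/home/alice/project/src/main.c"))   # (2, 'src/main.c')
rename_zone_root(2, "/home/bob/project")
print(resolve("/home/bob/project/src/main.c"))     # (2, 'src/main.c')
```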

File deletions

Deletion in BetrFS was linear in the file size - it started slow and just got slower. Now, using a key-value pair method, they have improved deletion to be faster than ext4.

Conclusion

BetrFS is still in its early days and, in my opinion, needs to be specifically evaluated before being deployed into an environment. Newer versions seem to achieve better performance, but the trade-off is more up-front design decisions (i.e. we can't simply create a generic BetrFS filesystem and use it). Very interesting approaches to overcoming performance challenges.

Environmental Conditions and Disk Reliability in Free-cooled Datacenters

Ioannis Manousakis - Rutgers, The State University of New Jersey

https://www.usenix.org/conference/atc16/technical-sessions/presentation/manousakis

The Problem:

  • Data centres are expensive and consume a lot of energy
  • Cooling technologies in data centres are evolving
    chiller-based (always-on)
    water-side
    free cooling
  • There is an unexplored trade-off between environmental conditions, reliability and cost

Free (air) cooling takes outdoor air, passes it through water (to cool it), and flows it through the DC. Humidity inside the DC increases, but costs are significantly lower.

The study

The researchers looked at the impact of environmentals on disk failures, root causes and considerations for DC and physical system design.

They collected telemetry on 1 million disks across 9 Microsoft data centres - spanning cold, hot, dry and humid environments - including the location of each failed disk (within the chassis and within the DC).

(Data comes from a product called "Microsoft Autopilot")

They then looked further into root causes such as

  • IO Comms faults
  • Behavioural SMART faults (read/write errors, sector errors, seek errors, etc)
  • Age related (max hours, on-off cycles, etc)

and mapped these against the environmentals.

Results:

Dry DCs - failure rates of 1.5% to 2.3%
Humid DCs - failure rates of 3.1% to 5.4%

Root Causes for the extra failures with high humidity:

Dry DCs - bad sectors (mechanical) account for ~50%-60% of failures
Humid DCs - controller/connectivity failures account for ~60%

Furthermore, they observed corrosion on disks from humid DCs.

Comparing hot/dry and hot/humid DCs

The effect is not instant - corrosion takes time - but they do notice that during a humid summer corrosion accelerates much more quickly.

Cooling vs reliability vs cost trade offs

They calculated data centre cooling costs over 10, 15 and 20 years and added the cost of buying extra replacement disks (especially out of warranty).

In the long term it is significantly cheaper to use free-cooled data centres.
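A back-of-the-envelope sketch of that trade-off; every number here is a made-up assumption, not a figure from the paper:

```python
# Higher AFR in a free-cooled DC costs extra disks, but the cooling energy
# saved can dominate over a decade.

DISKS = 100_000
DISK_COST = 150.0                 # assumed replacement cost per disk (USD)
YEARS = 10

def total_cost(annual_cooling_cost, afr):
    replacements = DISKS * afr * YEARS           # expected number of failed disks
    return annual_cooling_cost * YEARS + replacements * DISK_COST

chiller_cooled = total_cost(annual_cooling_cost=3_000_000, afr=0.02)
free_cooled    = total_cost(annual_cooling_cost=  500_000, afr=0.045)

print(f"chiller-cooled: ${chiller_cooled:,.0f}")
print(f"free-cooled:    ${free_cooled:,.0f}")
```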

Other observations:

  • No increase in mechanical failures due to humidity
  • Failures don't occur instantly

Disk placement within a chassis

Relative humidity changes depending on whether disks are at the front or the back

  • Disks at the back of a chassis have a ~20% lower average failure rate (AFR), as the hotter temperature at the rear results in a lower relative humidity (i.e. the same moisture content at a higher temperature - see the sketch after this list)
  • Server layout has a significant impact on HDD AFRs
  • If moving disks to the back, you are simply moving the corrosion to other components.
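To illustrate the "same moisture, higher temperature, lower relative humidity" point, here is a quick calculation using the Magnus approximation for saturation vapour pressure (the temperatures are assumptions, not measurements from the paper):

```python
# Air with a fixed absolute moisture content has a lower *relative* humidity
# once it has been heated, e.g. by passing over the front-of-chassis components.

from math import exp

def saturation_vapour_pressure(t_celsius: float) -> float:
    """Magnus approximation, result in hPa."""
    return 6.112 * exp(17.62 * t_celsius / (243.12 + t_celsius))

def relative_humidity_after_heating(rh_in: float, t_in: float, t_out: float) -> float:
    """Same absolute moisture content, air heated from t_in to t_out."""
    return rh_in * saturation_vapour_pressure(t_in) / saturation_vapour_pressure(t_out)

front_rh, front_temp = 0.70, 25.0    # assumed air at the front of the chassis
back_temp = 38.0                     # assumed hotter air at the rear

print(f"RH at rear: {relative_humidity_after_heating(front_rh, front_temp, back_temp):.0%}")
```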

Further comments

(These are my own views)

I'd imagine manufacturers of disks and servers will be paying close attention to this research. I wonder whether it could affect server design (e.g. which is the cheapest component to replace, if one has to be chosen to suffer corrosion problems). Would vendors modify their maintenance and warranty pricing (or offer discounts) depending on the data centre environment, given the hard evidence here of differences in failure rates? Further, could this influence environmental policy? Is an air-cooled DC now preferred, and what is the cost of lots of failed disks in landfill compared to the energy consumed to cool DCs?