Usenix ATC 2016 - Day 3

This is the third day of the Usenix Annual Technical Conference (June 2016), held in Denver, Colorado, USA. Today the main conference commenced, after the HotStorage/HotCloud workshop series concluded.

As each presentation was delivered by a single researcher, only this main presenter is referenced in the text below. The links to the Usenix conference pages provide more information, as well as copies of the papers submitted for the conference.

Keynote - The Future of Infrastructure

Martin Casado

https://www.usenix.org/conference/atc16/technical-sessions/presentation/casado

How VCs come up with investment decisions

The prevailing view is that (traditional) infrastructure is dead.
Many large infrastructure companies (e.g. IBM) are seeing massive downturns in revenue.
AWS/Azure/SoftLayer/etc. push efforts towards the software/app layer, leaving infrastructure to someone else.

IT is a very large market - $4 trillion per year (not sure if US or global)
IT has been hugely disrupted. The disruption is only getting started.
$220 billion in cloud, only 6% of the total IT market.

Massive change at a business level - IT products are no longer sold only B2B; they are being sold directly to EVERYONE (especially with smartphones, the Internet of Things, etc.).

The market size of IT is further growing due to the increased number of devices connected (IOT, smartphones, etc).

It is very difficult for companies to cannibalise themselves so they can pivot in a different direction. Some have managed this well (e.g. Honeywell now makes thermostats), but many are not able to adapt to the changing IT landscape.

Software defined movement

There are now many Software Defined movements. For instance, the smartphone has disrupted many fixed-function devices - gaming devices, GPS, mail clients, telephones, etc.

This was further driven by significant global uptake - "everyone has an iPhone" - so apps and features are increasingly developed for the emerging platform.

APIs

Previously, the only way to introduce functionality was to wrap it in sheet metal and deliver a fixed-function device (e.g. a TomTom GPS, a Casio calculator, etc). The only standard 'API' was IP (networking protocols). As a startup, there was only one way to deliver something for it to be adopted.

Now, there are many layers of integration (hypervisor, platform, OS, application, container) and commonly adopted API driven frameworks allowing rapid development and widespread consumption.

Nicira story
  • Founded in 2007
  • Implemented a platform where all networking was defined at a software layer
  • Received $40M in government funding to build a software network switch + controller
  • Acquired by VMware for $1.26B in 2012
  • VMware networking is now a $600M p.a. business (as of 2015)
Why did they develop software-defined?

One of the limitations in networking was that features and functionality were always defined by what the vendor was willing to offer, and were typically heavily tied to physical hardware.
Software-defined networking was an answer to this - if you need new features, it is as simple as writing the code.

Traditionally when people think about IT they think about hardware. The supply chain knows how to sell hardware. There was a perception that you couldn't simply create software-defined infrastructure.

Software disrupts the Infrastructure Delivery model

Traditional:

  • Supply chain
    physical design and manufacture
    inventory management
  • Delivery
    physical box
    costly trials
  • Product insertion
    rip & replace
    costly to revert

Software-defined

  • Supply chain
    N/A
  • Delivery
    download & trial
  • Product insertion
    flexible low-cost implementations (parallel, cutover, etc.)
    safe/easy snapshot revert

Software As A Service

Applications made this transition in the late 2000s

Old world: SAP (on-premise)
New World: Salesforce (cloud)

The new world also has better hooks - it allows integration with other services at all layers (security, availability, new features, etc.) without the complexity of supporting physical hardware. It also provides better opportunities for developer and sysadmin support.

Rise of the developers

With the move to software, developers have more control

  • They are now demanding infrastructure for dev/test
  • They are getting budgets
  • Businesses are aligning more with developers
  • They are being rewarded more, business wants developers to be happy

Traditionally, incumbents have a great advantage in access to customers

  • Relationships
    account management etc.
  • Procurement
    certifications
  • Analysts
    magic quadrants etc.
  • Channels
    must purchase through a channel partner
  • "Day 2" operations

The enterprise sales process is complicated and messy; consumers are engaged with channel partners etc. with whom they have never previously dealt. The ability to buy something is so complex that it even requires specialists to understand how to purchase. Think IBM presales - how does one purchase a Power7 server? You don't just go to OfficeWorks and buy one.

The movement to software has completely invalidated this

Devs don't care about

  • Gartner reports
  • Certifications / training
  • Long-standing relationships
  • Golf
  • Slow procurement processes

Developers do care about

  • Traction among other developers
  • Community
  • APIs
  • Try & Buy
  • Low friction to initial adoption
  • Open Source and technical elegance

There are companies which appear out of nowhere (e.g. Atlassian) and generate billions of dollars in revenue because they target the developer-centric software-driven market.

Open Source:

Investment to date

  • $7B Venture Capital has been invested in Open Source
  • $1B has been returned

Why?

So far nobody seems to have discovered the secret to making open source profitable.

Open source platforms aren't typically profitable by themselves; you still need to build a massive army to educate developers etc. in order to successfully innovate on top of these "loss-leader" platforms.
Open source is great for getting traction.

The move is towards "open source as a service" for profitability, e.g. GitHub, DigitalOcean, Databricks, etc.

Incumbent lifecycle

No traditional infrastructure silo is safe

  • storage
  • compute
  • networking
  • security
  • databases
  • analytics
  • development

All startups will become incumbents (or will die). What is innovative today will be irrelevant soon.

One unanswered topic - with the significant push towards software defined everything, driving development trends towards software (i.e. students etc are learning software), how do we facilitate innovation and rapid development in hardware?

VCs don't invest as readily in hardware trends; software brings a safer return. Again, if nobody is willing to invest because the hardware lifecycle is so short, how do we drive it forward?

Leveraging Multi-path Diversity for Transport Loss Recovery in Data Centers

Guo Chen - Tsinghua University

https://www.usenix.org/conference/atc16/technical-sessions/presentation/chen

This paper looked at proactively recovering from TCP losses instead of waiting for timeouts to occur, accelerating loss recovery.

Tail Flow Completion Time

  • Services care about tail flow completion time (FCT); there are a large number of flows per operation
  • Overall performance is governed by the last completed flow (see the small example below)
  • Packet loss hurts tail FCT
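To make the "last flow dominates" point concrete, here is a tiny numeric illustration (the numbers are made up, not from the paper):

```python
# An operation that fans out into many flows completes only when the slowest
# flow completes, so a single flow that hits a loss/timeout dominates the
# user-visible latency.

flow_completion_times_ms = [0.5] * 99 + [100.5]   # 99 clean flows, 1 flow hit a timeout
operation_time_ms = max(flow_completion_times_ms)
print(f"median flow: {sorted(flow_completion_times_ms)[50]} ms, "
      f"operation completes in: {operation_time_ms} ms")
```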

Case Study - Microsoft DCN

They measured packet loss in a Microsoft DCN; the mean loss rate was about 4% over 5 days.

Caused by

  • Congestion loss (buffer overflow)
    due to uneven load balancing and incast
  • Failure loss (silent loss)
    random drops
    packet black holes

Why packet loss hurts tail flow

How TCP handles loss

  • Fast recovery/fast retransmit (triggered by duplicate ACKs)
  • Timeout (if not enough duplicate ACKs arrive, wait for the RTO, then retransmit)
    e.g. RTO (~100ms) >> RTT, so timeout-driven retransmission is slow

Prior work adds aggressiveness to TCP's congestion control to perform loss recovery before the timeout. Deciding how long to wait before resending is a challenge: if recovery packets are sent too quickly, they cause further congestion and compound the loss.
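As a rough illustration of why tail losses are so painful, here is a minimal sketch (not the paper's code) of the standard duplicate-ACK / RTO decision; the RTT and RTO values are assumptions in line with the numbers quoted above:

```python
# Why a lost packet near the end of a short flow often has to wait for the RTO:
# with few packets in flight there may not be enough duplicate ACKs to trigger
# fast retransmit, so recovery falls back to the (much larger) timeout.

RTT = 0.0002          # assumed intra-DC round trip time: 200 microseconds
RTO = 0.1             # assumed minimum retransmission timeout: 100 ms
DUPACK_THRESHOLD = 3  # standard fast-retransmit trigger

def recovery_delay(packets_after_loss: int) -> float:
    """Time until the sender starts retransmitting a single lost packet."""
    if packets_after_loss >= DUPACK_THRESHOLD:
        # Each later packet elicits a duplicate ACK; after three of them the
        # sender fast-retransmits, roughly one RTT after the loss.
        return RTT
    # Tail loss: not enough later packets to generate three duplicate ACKs.
    return RTO

for tail_packets in (0, 1, 3, 10):
    d = recovery_delay(tail_packets)
    print(f"{tail_packets:2d} packets after the loss -> recovery starts after {d*1000:.1f} ms")
```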

Ideal loss recovery algorithm

  • Should be fast for failure loss
  • Should be slow for congestion loss

(these are opposite and incompatible approaches)

Proposed approach

Multi Path Loss Recovery (they call this "FUSO")

  • Use one path for initial data transfer
  • Use a second path for loss recovery

Assume a connection with three sub-paths (think LACP, but this isn't really an LACP implementation).
FUSO identifies the best (least congested) subflow and the worst subflow; loss recovery is performed over an alternate path rather than retrying the same path.

Using a second path for retransmits, it is able to initiate loss recovery faster than traditional TCP loss recovery, resulting in far shorter tail flow completion times.
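The following is a rough sketch of the multi-path recovery idea as I understood it, not the FUSO implementation; the subflow fields and the selection heuristic are my assumptions:

```python
# Sketch: when there is no new data to send, speculatively re-send packets that
# are still unacknowledged on the "worst" subflow over the "best" subflow.

from dataclasses import dataclass, field

@dataclass
class Subflow:
    name: str
    srtt_ms: float                                # smoothed RTT estimate (assumed available)
    retransmissions: int                          # observed retransmits, a proxy for path quality
    unacked: list = field(default_factory=list)   # packets sent but not yet ACKed

def pick_best(subflows):
    # Prefer paths with few retransmissions, then low RTT.
    return min(subflows, key=lambda s: (s.retransmissions, s.srtt_ms))

def pick_worst(subflows):
    return max(subflows, key=lambda s: (s.retransmissions, s.srtt_ms))

def proactive_recovery(subflows, have_new_data: bool):
    if have_new_data:
        return []  # only use spare capacity for recovery
    best, worst = pick_best(subflows), pick_worst(subflows)
    if best is worst or not worst.unacked:
        return []
    # Duplicate the oldest unACKed packet of the worst path onto the best path.
    pkt = worst.unacked[0]
    best.unacked.append(pkt)
    return [(pkt, worst.name, best.name)]

paths = [Subflow("A", 0.2, 0, [7, 8]), Subflow("B", 0.3, 4, [3]), Subflow("C", 0.25, 1, [])]
print(proactive_recovery(paths, have_new_data=False))   # [(3, 'B', 'A')]
```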

In the case study, introducing FUSO reduced packet loss from ~4% to less than 1% and significantly improved tail Flow Completion Time.

StackMap - Low-Latency Networking with the OS Stack and dedicated NICs

Michio Honda (Keio University)

https://www.usenix.org/conference/atc16/technical-sessions/presentation/yasukata

An interesting approach that combines a kernel-bypass-style fast data path with the kernel's TCP/IP stack and the socket API.

Current landscape:

  • Message-oriented communication over TCP is common (e.g. HTTP, CDNs, etc.)
  • The Linux network stack can serve 1KB messages at only 3.5Gbps with a single core

Should the socket API be improved?

One option is a user-space TCP/IP stack, but maintaining and updating a separate copy of today's TCP stack is hard, and it is difficult to keep it compatible.

The StackMap approach achieves greater throughput (4.5Gbps) at lower latencies.

Current approach: many requests are processed within each epoll_wait() cycle, while new requests are queued in the kernel.

Increasing the number of TCP connections results in larger epoll_wait() batches and higher latency.
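For readers unfamiliar with the epoll batching behaviour being described, here is a minimal, self-contained sketch using Python's selectors module (epoll on Linux); the address and the HTTP-ish response are placeholders:

```python
# Each poll returns a *batch* of ready connections; everything queued behind
# that batch waits, so per-request latency grows with the number of busy
# connections.

import selectors
import socket

sel = selectors.DefaultSelector()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))      # any free port; address is an assumption
listener.listen()
listener.setblocking(False)
sel.register(listener, selectors.EVENT_READ, data="accept")

def serve_once(timeout=1.0):
    events = sel.select(timeout)     # one epoll_wait() cycle
    print(f"batch of {len(events)} ready file objects")
    for key, _ in events:
        if key.data == "accept":
            conn, _ = key.fileobj.accept()
            conn.setblocking(False)
            sel.register(conn, selectors.EVENT_READ, data="client")
        else:
            payload = key.fileobj.recv(4096)
            if payload:
                key.fileobj.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
            else:
                sel.unregister(key.fileobj)
                key.fileobj.close()

serve_once()  # with no clients connected this simply reports an empty batch
```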

Where could we improve?

  • Conventional systems introduce end-to-end latencies of tens to hundreds of microseconds
  • The socket API comes at significant cost (read/write/epoll_wait)
  • Packet I/O is expensive
  • TCP/IP processing itself is cheap

Stackmap approach

  • Dedicate a NIC to an application
  • Use the netmap API for the data path (syscall and packet I/O batching, zero copy, run-to-completion)
  • Persistent, fixed-size sk_buffs, to call efficiently into the kernel TCP/IP stack (it still needs to talk to the kernel)
  • Static packet buffers and DMA mappings

Experimental results

The wrk HTTP and memaslap memcached benchmark tools were used.

  1. HTTP serving 1KB messages on a single core
  • Throughput: 4Gbps (traditional) vs 6Gbps (StackMap)
  • Latency: 300ms (traditional) vs 150ms (StackMap)

(I wasn't quick enough to capture the other examples, but they found that as they increased the number of cores the improvements became less significant.)

Conclusion

  • For message-oriented communication over TCP:
    the kernel TCP/IP stack is fast
    but the socket API and packet I/O are slow
  • Most techniques used by kernel-bypass stacks can be brought into the OS stack
  • StackMap provides
    latency reductions of between 4% and 80%
    throughput improvements of between 4% and 391%

Scalable low latency indexes for a Key Value Store

Ankita Kejriwal - Stanford University / PlatformLab

https://www.usenix.org/conference/atc16/technical-sessions/presentation/kejriwal

Can a key value store support strongly consistent secondary indexes whilst operating at low latency and large scale?

They implemented SLIK, achieving both low latency and scalability.

  • Traditional RDBMSs were prevalent
    they provided rich data models and consistency
    but lacked scalability
  • The move to NoSQL gained latency and scalability, often at the cost of data models and/or consistency

Consistency (e.g. when changing a key's value) is achieved by writing the new index entry (pointing at the object) first, then removing the old one, so there is a small window during which both index entries exist. When a query is performed and a stale index entry is returned, the lookup of the actual object identifies that the entry no longer matches and ignores it, returning only the true, up-to-date data.
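A minimal sketch of that validation idea (my own illustration, not SLIK's implementation; the record fields are made up):

```python
# The secondary index may briefly contain stale entries, so every index hit is
# validated against the primary record before being returned.

primary = {}          # primary key -> record (the source of truth)
secondary = {}        # secondary key (e.g. "name") -> set of primary keys

def put(pk, record):
    # 1. add the new index entry first, 2. write the object, 3. clean up the old entry
    old = primary.get(pk)
    secondary.setdefault(record["name"], set()).add(pk)
    primary[pk] = record
    if old and old["name"] != record["name"]:
        secondary[old["name"]].discard(pk)   # stale entry removed last

def lookup_by_name(name):
    results = []
    for pk in secondary.get(name, set()):
        record = primary.get(pk)
        # Validation step: ignore index entries that no longer match the object.
        if record is not None and record["name"] == name:
            results.append(record)
    return results

put(1, {"name": "alice", "age": 30})
put(1, {"name": "alicia", "age": 31})   # rename: old entry briefly coexists
print(lookup_by_name("alice"))          # [] -- stale entry filtered out / cleaned up
print(lookup_by_name("alicia"))         # [{'name': 'alicia', 'age': 31}]
```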

Understanding Manycore Scalability of Filesystems

Changwoo Min - Georgia Institute of Technology

https://www.usenix.org/conference/atc16/technical-sessions/presentation/min

Manycore vs MultiCore Intro

Multicore processors provide high per-core speed (e.g. 3GHz) across a small number of cores, typically up to ~20.
Manycore systems deliver tens or hundreds of cores at lower clock speeds.

Scaling Today

  • applications need to parallelise IO operations.
  • death of single core CPU scaling (frequency staying around 3GHz)
  • cores upper limit around 24 cores

From mechanical HDDs to flash SSDs

  • IOPS up to 1 million
  • Non-volatile memory (e.g. 3D XPoint) offering 1,000x improvements over SSDs

There is typically a lack of understanding of internal scalability within applications.
Quite often, adding more CPU cores will not improve performance linearly: perhaps 10-20 cores will see improvements, but not 50, 100 or 200 cores.
The filesystem often becomes the performance bottleneck once many cores are added.

  1. What filesystem operations are not scalable?
  2. Why are they not scalable?
  3. Is it an implementation problem or a poor design?

Tech challenges

  • Cannot see the next bottleneck until solving the current one.
  • this makes it difficult to understand scalability behaviour

Evaluation & Analysis

The researchers used FxMark to evaluate and analyse manycore scalability across multiple filesystem types, CPU core counts, a RAM disk, and different sharing levels (i.e. processes either share blocks or access independent blocks).

Accessing independent blocks vs shared

When accessing independent blocks, we see linear scalability.
As soon as files are shared, performance degrades as more CPUs are introduced; high levels of file sharing are woefully poor across all filesystems.

This is because page reference counting in the kernel cannot handle page evictions at scale: high contention on a page reference counter results in huge memory stalls, with CPU cycles spent waiting on memory accesses.

Lesson Learnt - scalability of the cache hit is important.
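As an illustration of the kind of contention being described, and one generic way to avoid it, here is a toy sharded reference counter; this is not how the Linux kernel implements page reference counts:

```python
# One way to reduce contention on a hot reference counter is to shard it, so
# most increments and decrements touch a core-local counter instead of a
# single shared cache line.

import threading

class ShardedRefCount:
    def __init__(self, shards=16):
        self.counts = [0] * shards
        self.locks = [threading.Lock() for _ in range(shards)]

    def get(self, shard):
        i = shard % len(self.counts)
        with self.locks[i]:
            self.counts[i] += 1

    def put(self, shard):
        i = shard % len(self.counts)
        with self.locks[i]:
            self.counts[i] -= 1

    def value(self):
        # The expensive global sum is only needed on the slow path
        # (e.g. deciding whether a page can actually be evicted).
        return sum(self.counts)

rc = ShardedRefCount()
rc.get(shard=3)
rc.get(shard=7)
rc.put(shard=3)
print(rc.value())   # 1
```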

Data block overwrite

At low sharing levels, btrfs is copy-on-write (providing a good level of consistency); the new block allocations create an I/O bottleneck as more work is involved in checkpointing.

Lesson Learnt - overwriting can be as expensive as appending; consistency mechanisms need to be scalable.

The entire file is locked regardless of the update range.

Lesson learnt - a file cannot be concurrently updated; the techniques used in parallel filesystems need to be considered.

Pilot

If contention is removed from the filesystem, can we scale beyond ~80 cores?

  • Yes - partitioning the RAM-disk filesystem (into 60 partitions) doubled the performance of the overall system at around 20 cores (see the sketch after this list)
  • However, doing this on a physical HDD significantly reduced performance
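A hypothetical sketch of the partitioning idea, using plain directories to stand in for the 60 RAM-disk partitions (the paths and shard count are assumptions):

```python
# Spread files across several independent filesystem instances so unrelated
# files do not contend on the same in-memory filesystem structures.

import hashlib
import os

PARTITIONS = 60
ROOT = "/tmp/fxmark-partitions"      # assumed scratch location

def partition_for(path: str) -> str:
    digest = hashlib.sha1(path.encode()).hexdigest()
    shard = int(digest, 16) % PARTITIONS
    return os.path.join(ROOT, f"part{shard:02d}")

def write_file(path: str, data: bytes) -> None:
    shard_dir = partition_for(path)
    os.makedirs(shard_dir, exist_ok=True)
    with open(os.path.join(shard_dir, path.replace("/", "_")), "wb") as f:
        f.write(data)

write_file("logs/app/2016-06-24.log", b"hello")
print(partition_for("logs/app/2016-06-24.log"))
```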

Summary:

  • Manycore scalability should be an important consideration in file system design
  • New challenges in scalable file system design:
    minimising contention
    scalable consistency
    spatial locality

ParaFS: A Log-Structured File System to Exploit the Internal Parallelism of Flash Devices

Jiacheng Zhang, Tsinghua University

https://www.usenix.org/conference/atc16/technical-sessions/presentation/zhang

Most flash filesystems are log-structured. The problems with this include

  • Duplicated functions (performed both on the flash device and in the file system)
    space allocation and garbage collection (the filesystem simply issues TRIM and doesn't coordinate its actions with the device)
  • Semantic isolation
    neither layer knows what is happening at the other level
  • The narrow block I/O interface
  • The "log on log" problem

F2FS specific analysis

F2FS has poorer performance than ext4 on SSDs:

  • Lower garbage collection efficiency
    more recycled blocks
  • Internal parallelism conflicts
    broken data grouping
    uncoordinated GC operations
    ineffective I/O scheduling - erase operations always block reads/writes, whilst writes delay reads

Current approaches

  • Log-structured file systems
    (see the problems above)
  • Object-based filesystems
    very aggressive changes, difficult to adopt
    lack of research

ParaFS

Coordinated block mapping, coordinated GC, coordinated scheduling, parallel-aware filesystem.

  • Simplified FTL
    exposes the physical layout to the FS (flash channels, size of flash block, size of flash page)
    static block mapping
  • Aligned block layout
    the GC erase process is simplified
    WL and ECC remain as functions which need hardware support
  • Multi-threaded GC optimisation
    one GC process per region
    a GC control process
  • Request dispatching
    select the least busy channel to dispatch a write request (see the sketch after this list)
  • Request scheduling phase
    time slices for read request scheduling and write/erase request scheduling
    schedule write or erase requests according to space utilisation and the number of concurrently erasing channels
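A hedged sketch of the dispatching and scheduling ideas described above (not the ParaFS code; the queues, thresholds and time-slice logic are my assumptions):

```python
# Writes go to the least-busy flash channel, and each channel alternates
# between a read time slice and a write/erase time slice so erases do not
# starve reads.

class Channel:
    def __init__(self, name):
        self.name = name
        self.write_queue = []
        self.erase_queue = []
        self.read_queue = []

    def pending_work(self):
        return len(self.write_queue) + len(self.erase_queue)

def dispatch_write(channels, request):
    # Parallelism-aware dispatch: pick the channel with the least queued work.
    target = min(channels, key=Channel.pending_work)
    target.write_queue.append(request)
    return target.name

def schedule(channel, slice_is_read, space_utilisation, erasing_channels, max_erasing=2):
    """Return the next request to issue on this channel for the current time slice."""
    if slice_is_read and channel.read_queue:
        return ("read", channel.read_queue.pop(0))
    # Write/erase slice: only erase when space is tight and not too many
    # channels are already erasing (both thresholds are assumptions).
    if channel.erase_queue and space_utilisation > 0.9 and erasing_channels < max_erasing:
        return ("erase", channel.erase_queue.pop(0))
    if channel.write_queue:
        return ("write", channel.write_queue.pop(0))
    return None

channels = [Channel(f"ch{i}") for i in range(4)]
for req in range(10):
    dispatch_write(channels, f"w{req}")
print([c.pending_work() for c in channels])   # roughly balanced: [3, 3, 2, 2]
print(schedule(channels[0], slice_is_read=False, space_utilisation=0.95, erasing_channels=0))
```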

Evaluation of ParaFS

Under light load there isn't much improvement.
Under heavy load, ParaFS generally outperforms all other filesystem types significantly.

  • GC:
    significantly fewer recycled blocks
    high GC efficiency
  • Approximately 30% less written I/O due to efficiency gains

Conclusion

Internal parallelism has not been leveraged in current filesystems

Optimising every operation

Rob Johnson - Stony Brook University

https://www.usenix.org/conference/atc16/technical-sessions/presentation/yuan

This is a re-presentation from FAST16
https://www.usenix.org/conference/fast16/technical-sessions/presentation/yuan

which is a follow-up to last year's Usenix paper: https://www.usenix.org/conference/fast15/technical-sessions/presentation/jannen

What is BetrFS

BetrFS is a high-performance, general-purpose filesystem that attempts to combine the best performance characteristics of several filesystem design approaches.

Trade-offs

Filesystem design choices (such as full filesystem indexing vs inode addressing) dictate how the system behaves in different circumstances. Examples of trade-offs include

  • sequential reads vs random writes
  • renames vs recursive filesystem scans

BetrFS uses full filesystem (full-path) indexing, not inodes.

Recursive grep (in BetrFS)

  • It starts by reading the index to understand the layout
  • It then performs a full sequential read to analyse the data
  • This behaviour is optimal, as it utilises close to the full available I/O throughput of the disk with serial reads
  • BetrFS is 15x faster than ext4 at this activity

Moving a directory (in BetrFS)

  • It needs to re-generate the index, which takes a long time
  • BetrFS is more than 30x slower than ext4

So presently, there isn't a "one size fits all" solution to achieve optimal performance for all activities.

Proposed solution - BetrFS filesystem zoning

A technique for fast renames and scans

This aims to deliver enough indirection (via inode-style table updates) for fast renames and enough locality (leveraging filesystem indexing) for fast scans - the best of both approaches.

A single zone contains a recursive set of directories (i.e. a directory subtree).
Inodes track the zones; the filesystem index contains the set of files within each zone.

  • Moving the root of a zone is cheap
  • Moving contents within a zone to a different zone is more expensive
  • To get fast renames we need small zones
  • To get fast scans we need large zones
  • Ideally we are looking for the sweet spot in the middle

How big should zones be?
Depends on the data in there...
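A small sketch of the zoning idea as I understood it (not BetrFS's on-disk format; the zone table and key scheme are illustrative):

```python
# Paths are split into (zone id, path relative to the zone root). Renaming a
# zone root only updates the small zone table; files inside a zone stay under
# one contiguous key prefix, which keeps recursive scans sequential.

zone_roots = {1: "/", 2: "/home/alice/project"}          # zone id -> root path
index = {                                                # (zone id, relative path) -> metadata
    (2, "src/main.c"): "inode-ish metadata",
    (2, "Makefile"): "inode-ish metadata",
    (1, "etc/fstab"): "inode-ish metadata",
}

def resolve(path):
    """Map an absolute path to its (zone, relative path) key."""
    zid, root = max(
        ((z, r) for z, r in zone_roots.items() if path.startswith(r)),
        key=lambda zr: len(zr[1]),
    )
    return zid, path[len(root):].lstrip("/")

def rename_zone_root(zone_id, new_root):
    # Cheap rename: one entry in the zone table changes, no index keys move.
    zone_roots[zone_id] = new_root

print(resolve("/home/alice/project/src/main.c"))   # (2, 'src/main.c')
rename_zone_root(2, "/home/bob/project")
print(resolve("/home/bob/project/src/main.c"))     # (2, 'src/main.c')
```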

File deletions

Deletion in BetrFS was linear in the file size - it started slow and just got slower. Now, using a key-value pair method, they have improved deletion to be faster than ext4.

Conclusion

BetrFS is still in its early days and, in my opinion, needs to be specifically evaluated before being deployed into an environment. Newer versions seem to achieve better performance, but the trade-off is more up-front design decisions (i.e. we can't simply create a generic BetrFS filesystem and use it). Very interesting approaches to overcoming performance challenges.

Environmental Conditions and Disk Reliability in Free-cooled Datacenters

Ioannis Manousakis - Rutgers, The State University of New Jersey

https://www.usenix.org/conference/atc16/technical-sessions/presentation/manousakis

The Problem:

  • Data centres are expensive and consume a lot of energy
  • Cooling technologies in data centres are evolving
    chiller-based (always-on)
    water-side
    free cooling
  • There is an unexplored trade-off between environmental conditions, reliability and cost

Free (air) cooling takes outdoor air, passes it through water (to cool it), and flows it through the DC. Humidity inside the DC increases, but costs are significantly lower.

The study

The researchers looked at the impact of environmentals on disk failures, root causes and considerations for DC and physical system design.

They collected telemetry on 1 million disks across 9 Microsoft data centres - spanning cold, hot, dry and humid environments - including the location of each failed disk (within the chassis and within the DC).

(Data comes from a product called "Microsoft Autopilot")

They then looked further into root causes such as

  • IO Comms faults
  • Behavioural SMART faults (read/write errors, sector errors, seek errors, etc)
  • Age related (max hours, on-off cycles, etc)

and mapped these against the environmentals.

Results:

Dry DCs - failure rates of 1.5% to 2.3%
Humid DCs - failure rates of 3.1% to 5.4%

Root Causes for the extra failures with high humidity:

Dry DCs - bad sectors (mechanical) account for ~50%-60% of failures
Humid DCs - controller/connectivity failures account for ~60%

Furthermore, they observed corrosion on disks from humid DCs.

Comparing hot/dry and hot/humid DCs

The effect is not instant - corrosion takes time - but they do notice that during a humid summer corrosion accelerates much more quickly.

Cooling vs reliability vs cost trade offs

They calculated data centre cooling costs over 10, 15 and 20 years and added the cost of buying extra replacement disks (especially out of warranty).

In the long term it is significantly cheaper to use free-cooled data centres.
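A back-of-the-envelope sketch of that trade-off; every number here is a made-up assumption, not a figure from the paper:

```python
# Higher AFR in a free-cooled DC costs extra disks, but the cooling energy
# saved can dominate over a decade.

DISKS = 100_000
DISK_COST = 150.0                 # assumed replacement cost per disk (USD)
YEARS = 10

def total_cost(annual_cooling_cost, afr):
    replacements = DISKS * afr * YEARS           # expected number of failed disks
    return annual_cooling_cost * YEARS + replacements * DISK_COST

chiller_cooled = total_cost(annual_cooling_cost=3_000_000, afr=0.02)
free_cooled    = total_cost(annual_cooling_cost=  500_000, afr=0.045)

print(f"chiller-cooled: ${chiller_cooled:,.0f}")
print(f"free-cooled:    ${free_cooled:,.0f}")
```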

Other observations:

  • No increase in mechanical failures due to humidity
  • Failures don't occur instantly

Disk placement within a chassis

Relative humidity changes depending on whether disks are at the front or the back

  • Disks at the back of a chassis have a ~20% lower average failure rate (AFR), as the hotter temperature at the rear results in a lower relative humidity (i.e. the same moisture content at a higher temperature - see the sketch after this list)
  • Server layout has a significant impact on HDD AFRs
  • If moving disks to the back, you are simply moving the corrosion to other components.
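To illustrate the "same moisture, higher temperature, lower relative humidity" point, here is a quick calculation using the Magnus approximation for saturation vapour pressure (the temperatures are assumptions, not measurements from the paper):

```python
# Air with a fixed absolute moisture content has a lower *relative* humidity
# once it has been heated, e.g. by passing over the front-of-chassis components.

from math import exp

def saturation_vapour_pressure(t_celsius: float) -> float:
    """Magnus approximation, result in hPa."""
    return 6.112 * exp(17.62 * t_celsius / (243.12 + t_celsius))

def relative_humidity_after_heating(rh_in: float, t_in: float, t_out: float) -> float:
    """Same absolute moisture content, air heated from t_in to t_out."""
    return rh_in * saturation_vapour_pressure(t_in) / saturation_vapour_pressure(t_out)

front_rh, front_temp = 0.70, 25.0    # assumed air at the front of the chassis
back_temp = 38.0                     # assumed hotter air at the rear

print(f"RH at rear: {relative_humidity_after_heating(front_rh, front_temp, back_temp):.0%}")
```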

Further comments

(These are my own views)

I'd imagine manufacturers of disks and servers will be paying close attention to this research. I wonder whether it could affect server design (e.g. which is the cheapest component to replace, if one has to be chosen to suffer corrosion problems). Would vendors modify their maintenance and warranty pricing (or offer discounts) depending on the data centre environment, given the hard evidence here of differences in failure rates? Further, could this influence environmental policy? Is an air-cooled DC now preferred, and what is the cost of lots of failed disks in landfill compared to the energy consumed to cool DCs?