Usenix ATC 2016 - Day 2

This is the second day of the Usenix Annual Technical Conference, June 2016, held in Denver, Colorado, USA.
Specifically, the "HotStorage" and "HotCloud" events were held over the first two days of the conference.

As each presentation was delivered by a single researcher, only the main presenter is referenced in the text below. The links to the Usenix conference pages provide more information, as well as copies of the papers submitted for the conference.

Keynote - Changes in Big Data (in the past 10 years)

Matei Zaharia - Creator of Apache Spark / Apache Mesos

https://www.usenix.org/conference/hotcloud16/workshop-program/presentation/keynote-address

Having not worked with Big Data platforms, this was an interesting view for me into the evolution of Big Data over the past decade.

  • Users: engineers have become analysts
  • Hardware: Performance bottleneck has shifted from IO to compute
  • Delivery: Strong trend toward public cloud

The initial big data users were software engineers (Java, C++, etc.) building low-level programs to process data.
The user base then expanded to include scripting / query languages (SQL, awk inspired).

  • New roles: Data scientists (technical domain experts) and Analysts (business-minded, not overly technical)

Challenges for non-engineers

  • API familiarity
  • Performance predictability and debugging
  • The data sets are still massive; standard methods for analysing small data sets are no longer useful for big data sets
  • Debugging on distributed systems is complex (e.g. logging needs to be abstracted away from the nodes, joining performance data sets across multiple nodes can be difficult, etc)
  • Access from small data tools (tools such as Excel aren't easily connected to big data sets)

Case study: Apache Spark

This is a cluster computing engine, with a large number of in-built libraries and APIs for ML, SQL, graph processing and streaming.

The original API was aimed at developers, using Java / Scala. There were challenges with functional APIs and using them in an effective/efficient manner.

More recently, an increase has been seen in Python adoption.

The efficient library for structured data includes:

  • SQL API
  • Data Frame API

Data Frames result in a significant decrease in storage size and a significant increase in performance when analysing the data.

Although Data Frames are only 1 year old, a majority of big data users are already using them - huge adoption in just a year.
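To make the two APIs concrete, here is a minimal PySpark sketch (my own illustration, using the Spark 2.x entry point; the input file and column names are hypothetical):

```python
# A minimal sketch of Spark's two structured-data APIs (Spark 2.x style);
# the input file and column names here are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("structured-apis").getOrCreate()

# Hypothetical JSON-lines input with one event per line
df = spark.read.json("events.json")

# Data Frame API: the engine sees the whole expression and can optimise it
by_user = df.groupBy("user_id").count()

# SQL API: the same query expressed declaratively
df.createOrReplaceTempView("events")
by_user_sql = spark.sql(
    "SELECT user_id, COUNT(*) AS cnt FROM events GROUP BY user_id")

by_user.show()
by_user_sql.show()
```

Both forms compile down to the same optimised query plan, which is a large part of why Data Frames can be both friendlier for analysts and faster than hand-written functional code.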

Hardware Changes in big data

Typical node trends

2010 (per node):

  • 50+ MB/s storage performance (HDD)
  • 1Gbps Ethernet
  • CPU ~3GHz per core

2016 (per node):

  • 500MB/s (SSD)
  • 10Gbps Ethernet
  • CPU ~3GHz per core

Network and storage performance have increased by about a 10x factor.
CPU performance has not increased.
Systems are also now becoming memory-bus bound (e.g. random memory access is quite slow when working with big data sets).

Prior to 2010, IO was the main challenge.

Now, compute efficiency is far more important. This is further complicated by the introduction of GPUs, FPGAs, etc.

Future network cards will run at ~DRAM bandwidth.

Case Study: Spark Effort: Project Tungsten

Optimisation of CPU and memory usage via:

  • Runtime code generation
  • Binary encoding (using less space)

These code optimisations have resulted in almost a 10x increase in rows returned per second.
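As a toy illustration of those two ideas (my own sketch in Python - Tungsten itself generates JVM bytecode and uses off-heap binary row formats), consider:

```python
# Toy illustration of Tungsten-style ideas; NOT Spark's implementation.
import struct

# Binary encoding: pack (id: int32, value: float64) rows into raw bytes
# instead of boxed objects, using far less space.
row_fmt = struct.Struct("<id")  # little-endian: 4-byte int + 8-byte double
rows = [(1, 3.5), (2, 7.25), (3, 1.0)]
buf = b"".join(row_fmt.pack(i, v) for i, v in rows)

# Runtime code generation: compile the query's predicate to real code once,
# rather than interpreting a generic expression tree for every row.
predicate_src = "lambda v: v > 2.0"  # would be emitted from a query plan
predicate = eval(compile(predicate_src, "<generated>", "eval"))

matches = []
for offset in range(0, len(buf), row_fmt.size):
    row_id, value = row_fmt.unpack_from(buf, offset)
    if predicate(value):
        matches.append((row_id, value))

print(matches)  # [(1, 3.5), (2, 7.25)]
```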

Similar code improvements are being seen across other platforms (e.g. the Nested Vector Language project at MIT), including:

  • Hyper Database
  • GraphMat (e.g. used by PageRank)
  • TensorFlow (e.g. used by Word2Vec)

Some challenges in these improvements have been around

  • Getting performance improvements through optimised code whilst maintaining ease of use for non-programmers
  • Optimisations across multiple libraries/systems

Evolution in Cloud Delivery for big data

Some highlights:

  • Many Fortune 100 companies have multiple PB of data in S3 (looking for big data engines to connect to this data and analyse it)
  • AWS revenue is now $10B annually
  • Recent (2015) survey indicates over 50% of respondents running big data on public cloud

Cloud adoption is increasing because:

  • workloads are bursty
  • platform requirements are unpredictable
  • changes in big data processing aren't easy to keep up with in your own data center (e.g. dependence on faster networking, faster IO, etc)

For cloud users:

  • They purchase an end-to-end experience, not just components
  • They are able to rapidly experiment with new solutions/approaches without capital expenditure risk

For software vendors:

  • Better end-to-end service
  • Greater visibility, debugging capabilities, etc
  • Fast iteration
  • Uniform adoption (all consumers using the same version of software, same hardware platforms, etc)

Challenges in cloud:

The cloud requires a new way of building software, which isn't well understood by researchers or traditional software companies:

  • Multi-tenant
  • Secure
  • Noisy neighbour, oversubscription (and lack of visibility for the end user of the whole platform)
  • Highly available (but with continuous delivery of updates, etc)
  • Highly monitored (for billing, security, performance, debugging, etc)
  • Billing models - need to protect the service provider (keep them profitable) and remain simple, but models are continually exploited by consumers

There isn't much academic research on several aspects of cloud challenges for big data processing.

Lessons Learnt:

  • The cloud development model is superior
    • 2-week releases, immediate feedback, visibility
  • State management is difficult at scale
    • per-tenant config, local data, production of VM images, etc
  • A careful testing strategy is crucial
    • feature flags, stress tests, etc
  • Design to maximise dogfooding (using your own product)

Mobile Storage - Why do we always blame the Storage Stack?

Hao Luo - University of Nebraska

https://www.usenix.org/conference/hotstorage16/workshop-program/presentation/luo

Usability Engineering (Jakob Nielsen, 1993)
https://www.nngroup.com/articles/response-times-3-important-limits/

0.1 seconds - react instantaneously
1 second - keep the user's flow of thought uninterrupted
10 seconds - keep the user's attention

Most test frameworks are slow and inaccurate, as the test workflow doesn't take the UI into account. They also often don't take the platform into account (e.g. you can't reliably test an Android app's performance unless you use an Android device - how do you automate this?).

This talk looks at the storage stack within mobile applications, measurement of performance, etc.

Question 1 - how does the DB benefit from storage stack optimisations?

  • In this case, it was 17x

Question 2 - how does the application benefit from storage stack optimisations?

  • In this case, a 5% increase

Question 3 - why is there such a difference?

  • The storage (database lookup) component of a mobile app is only a very small part of rendering a screen / webpage.
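This is Amdahl's law at work; a back-of-the-envelope check (my own arithmetic, not from the paper):

```python
# Amdahl's law: overall speedup when a fraction p of the total time
# is accelerated by a factor s (my own arithmetic, not from the paper).
def overall_speedup(p: float, s: float) -> float:
    return 1.0 / ((1.0 - p) + p / s)

# With a 17x storage speedup, a ~5% overall gain implies that storage
# was only about 5% of the total screen-render time to begin with.
for p in (0.02, 0.05, 0.10):
    print(f"storage fraction {p:.0%} -> overall {overall_speedup(p, 17):.2f}x")
# storage fraction 5% -> overall 1.05x, matching the numbers above
```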

Filesystem fragmentation in mobile apps

Cheng Ji - City University of Hong Kong

https://www.usenix.org/conference/hotstorage16/workshop-program/presentation/ji

Fragmentation emerges when the filesystem doesn't have contiguous free space, and it negatively affects performance.

This talk looks at:

  1. Does fragmentation cause issues on smartphones (flash storage)?
  2. Why does fragmentation occur?
  3. How can it be resolved?

Research indicates SQLite files are very fragmented:

  • As very small DB objects are aged out, small holes are left in the filesystem.
  • Fragmentation was observed on an aged smartphone: for example, Facebook uses 48 files which have 190 fragments, dispersed over 1GB of space. This was the case even on a new deployment.

Looking at IO latency vs fragmentation, IO latency increased significantly on flash storage due to:

  • Increased block IO frequency - a 13% increase in block IO reads (due to fragmentation) can lead to a 10% increase in IO latency (both read and write latency)
  • Dispersed block IO patterns - reads/writes over different flash disk regions
  • Amplified IO frequency when performing fragmented writes
  • IO mapping cache overheads when IO is dispersed over many disk regions

No defragmentation schemes have been proposed for flash storage. Two suggested approaches:

  1. Adjust the mapping cache instead of moving data
  2. Proactively avoid fragmentation by allocating contiguous space (a small sketch follows)
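One way an application can help with the second approach is to preallocate a file's full extent up front, giving the filesystem a chance to find contiguous space. A minimal sketch on Linux (the path and size are hypothetical; this illustrates the general technique, not the paper's proposal):

```python
# Preallocate a file's blocks in one request so the filesystem can try to
# place them contiguously (Unix-only; illustrative, not the paper's scheme).
import os

path = "/tmp/preallocated.db"   # hypothetical database file
size = 64 * 1024 * 1024         # reserve 64 MiB before any writes happen

fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
try:
    os.posix_fallocate(fd, 0, size)  # allocate the full extent now
finally:
    os.close(fd)
```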

Pixelsior - photo management as a platform service for mobile apps

Kyungho Jeon - University at Buffalo

https://www.usenix.org/conference/hotstorage16/workshop-program/presentation/jeon-kyungho

2 billion mobile devices with cameras were sold in 2014.
3 billion photos were shared (Instagram, Facebook, etc) every day in 2015.

Popular features:

  1. Sharing over messaging apps
  2. Editing a photo and synchronising the updates (e.g. with cloud-synced files, etc)
  3. Image analysis - e.g. facial recognition, additional metadata, using this for display/search (e.g. sort by date, location, etc)

How do apps provide these features, and what are the problems?

  • Sharing photos over multiple apps
    • Each app individually sends the image
    • Each app further duplicates resizing, or other editing actions
    • In some situations, the mobile device retains a smaller version whilst the cloud storage maintains the original file
    • This can cause discrepancies when editing the smaller file on the mobile device, and can require the larger file to be manually pulled from the cloud first, etc
  • Image analysis
    • Metadata is typically maintained per-app. Even the same metadata (e.g. facial recognition) needs to be individually computed and stored per app.

The presenter is proposing a back-end service called Pixelsior to manage images and metadata, providing API access to all apps which subscribe to it (e.g. Facebook, Instagram, Messenger, etc).

99 Deduplication Problems

Philip Shilane - EMC

https://www.usenix.org/conference/hotstorage16/workshop-program/presentation/shilane

A lot of this was with multi tenancy and billing in mind.

QoS - background tasks which are resource intensive:

  • Garbage collection
  • Replication
  • Verification (fsck)
  • Disk reconstruction

How do you charge appropriately for these overheads?

Security and reliability

Maintain the advantages of dedupe whilst securely storing data, protecting against:

  • unauthorised access
  • knowledge of content
  • data tampering

Reliability:

  • the data you need is available when required
  • dedupe removes redundancy, but then requires RAID, versioning, replication, etc

how do we compare the reliability of these approaches?

Chargeback

  • QoS across tenants sharing content
  • The service provider must charge appropriately
    • too high and a customer can buy storage themselves
    • too low and the provider loses money
  • Dedupe complicates billing for
    • capacity
    • CPU
    • I/O
    • network
    • other services
  • Billing timeliness is important

How do we test whether dedupe is effective?

  • You need content.
  • A small handful of standard traces exist; however, they aren't precisely representative of real-world data.

Simulation of replicating data with another layout

Satoshi Iwata - Fujitsu

https://www.usenix.org/conference/hotstorage16/workshop-program/presentation/iwata

Looking at moving to media such as:

  • optical
  • tape

These are slow but very cheap.

A use case could be Facebook moving old photos to archive cold storage if they haven't been accessed in a long time.

Also logs (for compliance / evidence):

  • Logs are collected in time order (sequential) and should be stored in the order of generation

Log mining, however, disrupts the layout - e.g. you might store logs which contain all events (users logging in, users executing a process, users deleting files, etc), but you might wish to report only on file deletions, so you can't simply read all data sequentially.

Deduplicating compressed contents in cloud storage

Zhichao Yan - University of Texas

https://www.usenix.org/conference/hotstorage16/workshop-program/presentation/yan

One of the key pain points being faced in IT is managing storage growth (including maintaining appropriate performance whilst reducing cost)

A significant amount of data is now moving into the cloud.

End user data access:

  • currently, most users are still well below 100Mbps at home
  • upload speeds are significantly lower than download speeds
  • even though, when storing in the cloud, far more data is typically sent to the cloud than accessed

Two approaches are being used to address this:

  • compression (find common strings which can be replaced by shorter codes)
  • deduplication (find common files/blocks which can be replaced with pointers to a single instance of that data - see the sketch below)
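For reference, a minimal sketch of fixed-size block deduplication (my own illustration, not any particular vendor's implementation):

```python
# Fixed-size block deduplication in miniature: identical blocks are stored
# once and files become lists of content-hash references.
import hashlib

BLOCK_SIZE = 4096
store = {}  # content hash -> block bytes

def write_file(data: bytes) -> list[str]:
    """Split data into blocks, store unseen blocks, return the hash recipe."""
    recipe = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # each unique block stored only once
        recipe.append(digest)
    return recipe

def read_file(recipe: list[str]) -> bytes:
    return b"".join(store[digest] for digest in recipe)

data = b"A" * 8192 + b"B" * 4096        # two identical blocks, one distinct
recipe = write_file(data)
assert read_file(recipe) == data
print(f"{len(recipe)} blocks referenced, {len(store)} unique blocks stored")
# -> 3 blocks referenced, 2 unique blocks stored
```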

The common scenario is to compress at the client side and dedupe at the cloud side. Challenges include:

  • different users utilise different compression methods for the same data before sending it to the cloud
  • dedupe works well with raw data/text, but not with the shuffled data inside compressed packages
  • redundancy might exist within the compressed data contents

X-Ray Dedupe

  • The same checksum value will be generated (with the same checksum method) for the underlying content, no matter which compression algorithm was used to package it
  • They are proposing to use this to allow for deduplication of identical data within compressed data packages (a rough sketch of the idea follows)
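As I understood it, the trick is to fingerprint the decompressed content rather than the compressed bytes. A minimal sketch (the function is my own; the paper's actual mechanism may well differ):

```python
# Fingerprint the *decompressed* content so the same data compressed with
# different algorithms maps to one hash (illustrative; not the paper's code).
import bz2, gzip, hashlib, lzma

def content_fingerprint(package: bytes) -> str:
    """Try known decompressors, then hash the underlying content."""
    for decompress in (gzip.decompress, bz2.decompress, lzma.decompress):
        try:
            return hashlib.sha256(decompress(package)).hexdigest()
        except Exception:
            continue  # not this format; try the next one
    return hashlib.sha256(package).hexdigest()  # uncompressed: hash as-is

data = b"the same underlying content" * 100
fp_gzip = content_fingerprint(gzip.compress(data))
fp_bz2 = content_fingerprint(bz2.compress(data))
print(fp_gzip == fp_bz2)  # True: the packages dedupe despite differing bytes
```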

The Tail at Scale - how to predict it

Minh Nguyen - University of Texas

https://www.usenix.org/conference/hotcloud16/workshop-program/presentation/nguyen

How do we predict tail latency at scale? Will the system satisfy latency and throughput performance requirements with thousands of nodes?

When a request arrives, a set of tasks will be scheduled and processed in parallel. The total processing time of the request is determined by the slowest task (e.g. if 4 tasks are parallelised, and three take 2 seconds while one takes 5 seconds, the whole request takes 5 seconds).

The tasks running in parallel may run on distributed platforms with individual response times potentially quite unpredictable (varied loads on each system, queue depths, lack of measurements passed through to load balancers, etc).

This variability is referred to as the latency tail.

The challenge is to ensure that this tail is small, i.e. that the latency of each component is predictable.
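A tiny simulation (my own illustration, not the paper's prediction model) shows why the tail grows with fan-out:

```python
# Fan-out amplifies the per-task tail: a request is as slow as its slowest
# task, so even a 1% per-task tail dominates at high fan-out (illustrative).
import random

random.seed(42)

def task_latency() -> float:
    # 99% of tasks take ~10 ms; 1% hit a 100 ms tail
    return 100.0 if random.random() < 0.01 else 10.0

def request_latency(fanout: int) -> float:
    return max(task_latency() for _ in range(fanout))

for fanout in (1, 10, 100):
    trials = 10_000
    slow = sum(request_latency(fanout) > 50 for _ in range(trials)) / trials
    print(f"fan-out {fanout:3d}: {slow:.0%} of requests hit the tail")
# roughly 1%, 10%, 63% - analytically, P(slow) = 1 - 0.99**fanout
```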

There is some good reading online on this topic, including:
http://highscalability.com/blog/2012/3/12/google-taming-the-long-latency-tail-when-more-machines-equal.html
https://syslab.cs.washington.edu/papers/latency-socc14.pdf

This talk was looking at some prediction models, which were more advanced than my comfort level in this topic, so I'll focus more on reading these links.

Interestingly, whilst they weren't necessarily focussing on reducing tail latency (as other papers seem to do), the presenter has proposed a reliable methodology for predicting system tail latency. Generally, at higher system load levels (90% or higher), the accuracy was around 95% to 98%.

On the (ir)relevance of Network performance for data processing

Animesh Trivedi - IBM Research, Zurich

https://www.usenix.org/conference/hotcloud16/workshop-program/presentation/trivedi

The present understanding of the network's role in data analytics performance is that a 1Gbps network only costs 1%-2% in performance compared to a 10Gbps network in a clustered environment.

This presenter actually found the opposite - up to 64% improvement between 1Gbps and 10Gbps, across TeraSort, PageRank, SQL, WordCount, etc.
TeraSort in particular was relevant, as it maps then shuffles a lot of data over the network between the nodes, which re-assemble the data and sort it further.

However, using the same datasets from 10Gbps to 40Gbps, they only saw another 1% - 2% improvement.

Wondering - is this Apache Spark specific? No.

How important is the network?

  • at 1Gbps, 48%
  • at 10Gbps, 8%
  • at 40Gbps, 3%

So, as the network has largely been optimised, further gains are to be had by optimising CPU rather than network (for now). Research efforts are being put into "how do we make the network a problem again" (i.e. how do we solve the CPU bottlenecks).