This is the second day of the Usenix Annual Technical conference June 2016, held in Denver, Colorado, USA.
Specifically, the "HotStorage" and "HotCloud" events were held over the first two days of the conference.
As each presentation was delivered by a single researcher, only this main presenter is referenced in the below text. The links to the Usenix conference page provide more information as well as copies of the papers submitted for the conference.
Keynote - Changes in Big Data (in the past 10 years)
Matei Zaharia - Creator of Apache Spark / Apache Mesos
https://www.usenix.org/conference/hotcloud16/workshop-program/presentation/keynote-address
Having not worked with Big Data platforms, this was an interesting view for me into the evolution of Big Data over the past decade.
- Users: engineers have become analysts
- Hardware: Performance bottleneck has shifted from IO to compute
- Delivery: Strong trend toward public cloud
Initial big data users have been software engineers - Java, C++ etc, building low-level programs to process data
User base was expanded to include scripting / query languages (sql, awk inspired)
- New roels: Data scientists (technical domain experts) and Analysts (business-minded, not overly technical)
Challenges for non-engineers
- API familiarity
- Performance predictability and debugging
- The data set is still massive, standard methods for analysing small data sets are no longer useful for big data sets
- Debugging on distributed systems is complex (e.g. logging needs to be abstracted away from the nodes, joining performance data sets across multiple nodes can be difficult etc)
- Access from snall data tools (tools such as Excel aren't easily connected to big data sets)
Case study: Apache Spark
This is a cluster computing engine, with a large amount of inbuild Libraries and APIs for ML, SQL, Graph processing and streaming.
Original API was aimed at developers, using Java / Scala. Challenges with Functional APIs and using them in an effective/efficient manner.
More recently an increase has been seen in Python adoption etc.
Efficient library for Structured Data includes
- SQL API
- Data Frame API
Dataframes result in significant decrease in storage sizes and a significant increase in performance when analysing the data.
Although Data Frames only 1 year old, a majority of Big Data users are using data frames - huge adoption in just a year.
Hardware Changes in big data
Typical node trends
2010 (per node):
- 50+ MB/s storage performance (HDD)
- 1Gbps Ethernet
- CPU ~3GHz per core
2016 (per node):
- 500MB/s (SSD)
- 10Gbps Ethernet
- CPU ~3GHz per core
Network and storage have increased by about 10x factor
CPU has not increased
Systems are also now being memory-bus bound (e.g. random memory access is quite slow when working with big data sets)
Prior to 2010, IO was the main challenge
Now, compute efficiency is far more important. Further complicated by introduction of GPUs, FPGA etc.
Future network cards will be at ~DRAM bandwidth
Case Study: Spark Efort: Project Tungsten
Optimisation of CPU and memory usage via
- runtime code generation
- Binary encoding (using less space)
Performance due to these code optimisations of almost 10x increase in rows returned per second.
Similar Code improvements are being seen across other platforms (e.g. Nested Vector Language project at MIT), including
- Hyper Database
- GraphMat (e.g. used by PageRank)
- TensorFlow (e.g. used by Word2Vec)
Some challenges in these improvements have been around
- Getting performance improvements through optimised code whilst maintaingin ease of use for non-programmers
- optimisations across multiple libraries/systems
Evolution in Cloud Delivery for big data
Some highlights:
- Many Fortune 100 companies have multiple PB of data in S3 (looking for big data engines to connect to this data and analyse it)
- AWS revenue is now $10B annually
- Recent (2015) survey indicates over 50% of respondents running big data on public cloud
Cloud adoption increasing because
- workloads are bursty
- platform requirements are unpredictable
- changes in big data processing aren't easy to keep up with in your own data center (e.g. dependence on faster networking, faster IO etc)
- Cloud users:
- The purchase an end to end experience, not just components
- Able to rapidly experiement with new solutions/approaches without capital expenditure risk
- For software vendors:
- Better end to end service
- greater visibility, debugging capabilities, etc
- Fast iteration
- Uniform adpotion (all consumers using the same version of software, same hardware platforms, etc)
Challenges in cloud:
Requires a new way of building software which isn't well understood by researchers or traditional software companies
- Multi tenant
- secure
- noisy neighbour, oversubscription (and lack of visibility to the end user of the whole platform)
- Highly available (but with continuous delivery of updates etc)
- Highly monitored (for billing, security, performance, debugging, etc)
- Billing models - need to protect the service provider (keep them profitable), remain simple, but models are continually exploited by consumers.
There isn't much academic research on several aspects of cloud challenges for big data processing.
Lessons Learnt:
- Cloud development model is superior
- 2 week releases, immediate feedback, vistbility
- state management is difficult at scale
- per tenant config, local data, production of VM Images etc
- careful testing stratey is crucial
- feature flags, stress tests, etc
- design to maximise dogfooding (using your own product)
Mobile Storage - Why do we always blame the Storage Stack?
Hao Luo - University of Nebraska
https://www.usenix.org/conference/hotstorage16/workshop-program/presentation/luo
Useability Engineering (1993)
https://www.nngroup.com/articles/response-times-3-important-limits/
0.1 seconds - react instantaneously
1 second - keep the user's flow of thought uninterrupted
10 seconds - keep the user's attention
Most test frameworks are slow and inaccurate as the test workflow doesn't take UI into account. Also often doesn't take platform into account (e.g. you can't test an Android app performance reliably unless using an Android device - how do you automate this?)
This talk looks at the storage stack within mobile applications, measurement of performance, etc.
Question 1 - how does the DB benefit from storage stack optimisations?
- in this case, it was 17x
Question 2 - how does the application benefit from storage stack optimisations? - in this case, 5% increase (
Question 3 - why is there such a difference? - The storage (database lookup) component of a mobile app is only a very small part of rendering a screen / webpage.
Filesystem fragmentation in mobile apps
Cheng Ji - University of Hong Kong
https://www.usenix.org/conference/hotstorage16/workshop-program/presentation/ji
Emerges when filesystem doesn't have contiguous free space
negatively affects performance
This talk looks at
- does fragmentation cause issues on smartphones (flash storage)
- Why does fragmentation occur?
- How to resolve
Research indicates sqlite files are very fragmented
- As very small DB objects are aged out, it leaves small holes in the filesystem.
- Fragmentation on an aged smartphone
For example, facebook uses 48 files have 190 fragments, dispersed over 1GB of space. This was even on a new deployment.
Looking at IO latency vs fragmentation
- IO latency increased significantly on flash storage when
Increased IO block grequency. - 13% increase in block IO reads (due to fragmentation) can lead to a 10% increase in IO latency (both read and write latency)
- Dispersed block IO patterns. reads/writes over different flash disk regions.
- Amplified IO Frequency when performing fragmented writes.
- IO Mapping cache overheads when IO is dispersed over many disk regions.
No defragmentation schemes have been proposed for flash.
- Adjust the mapping cache instead of moving data
- Proactively avoid fragmentation by allocating contiguous space
Pixelsior - photo management as a platform service for mobile apps
Kyungho Jeon - University at Buffalo
https://www.usenix.org/conference/hotstorage16/workshop-program/presentation/jeon-kyungho
2 billion mobile devices with cameras sold in 2014
3 billion photos shared (instagram, facebook etc) every day in 2015
Popular features:
- Sharing over messaging apps
- editing a photo and synchronising the updates (e.g. with cloud sync'd files etc)
- Image analysis - e.g. Facial recognition, additional metadata, using this for display/search (e.g. sort by date, location, etc)
How do apps use provide these features and what are the problems?
- Sharing photos over multiple apps
- Each app individually sends the image
- Each app further duplicates resizing, or other editing actions
- In some situations, the mobile device retains a smaller version whilst the cloud storage maintains the original file
- This can cause discrepancies when editing the smaller file on the mobile device, can require the larger file to be manually pulled from the cloud first, etc
- Image analysis
- Metadata is typically maintained per-app. Even the same metadata (e.g. facial recognition) needs to be individually computed and stored per app.
This presenter is proposing a back-end app called pixelsior to manage images and metadata, providing API access to all apps which subscribe to it (e.g. facebook, instagram, messenger, etc)
99 Deduplication Problems
Philip Shilane - EMC
https://www.usenix.org/conference/hotstorage16/workshop-program/presentation/shilane
A lot of this was with multi tenancy and billing in mind.
QOS - Background tasks which are resource intensive
- Garbage Collection
- replication
- verification fsck
- disk reconstruction
How do you charge appropriately for these overheads?
Security and reliability
maintain advantages of dedupe whilst securely storing data
- unauthorised access
- knowledge of content
- data tampering
reliability
- the data you need is available when required
- dedupe removes redundancy but requires RAID, versioning, replication etc
how do we compare the reliability of these approaches?
Chargeback
- QOS across tenants sharing content
- Serv Provider myst charge appropriately
- too high and a customer can buy storage themselves
- too low and the provider loses money
- dedupe complicates billing
- capacity
- CPU
- I/O
- network
- other services
- Billing timelines is important.
How do we test whether dedupe is effective?
- you need content.
- a small handful of standard traces exist, however it isn't precisely representative of real world data
Simulation of replicating data with another layout
Satoshi Iwata, Fujitsu
https://www.usenix.org/conference/hotstorage16/workshop-program/presentation/iwata
Looking at moving to media such as
- optical
- tape
which is slow but very cheap
A use case could be facebook moving old photos to archive cold storage if they haven't been accessed in a long time.
Also logs (for compliance / evidence)
- Logs are collected in time order (sequential) and should be stored in the order or generation
Log mining however disrupts the layout - e.g. you might store logs which contain all events (e.g. users log in, users exec a process, users delete files, etc), but you might wish to report only on file deletion, therefore you can't simply read all data sequentially
Deduplicating compressed contents in cloud storage
Zhichao Yan - University of Texas
https://www.usenix.org/conference/hotstorage16/workshop-program/presentation/yan
One of the key pain points being faced in IT is managing storage growth (including maintaining appropriate performance whilst reducing cost)
A significant amount of data is now moving into the cloud.
End user data access
- currently most users still well below 100Mbps at home
- Upload speeds significantly lower than downloads
- even though when storing in the cloud, far more data is typically sent to the cloud than accessed
Two approaches being used for addressing these
- compression (find common strings which can be replaced by more efficient code)
- deduplication (find common files/blocks which can be replaced with pointers to single instance of that piece of data)
Common scenario is comress at client side and dedupe at cloud side. Challenges include
- different users utilise different compression methods for the same data to compress before sending to the cloud
- dedupe works well with raw data/text, but not with shuffled data in compressed packages
- redundancy might exist within the compressed data contents.
X-Ray Dedupe
- The same checksum value will be generated (with the same checksum method) for compressed data no matter which compression algorithm is used
- They are proposing to use this to allow for deduplication of identical data within compressed data packages
The Tail at Scale - how to predict it
Minh Nguyen, University of Texas
https://www.usenix.org/conference/hotcloud16/workshop-program/presentation/nguyen
How to predict tail latency at scale. Will it satisfy latency and throughput performance requirements with thousands of nodes?
When a request arrives, a set of tasks will be scheduled and processed in parallel. The total processing time of the request is determined by the slowest task (eg. if 4 tasks are parallelised, three take 2 seconds and one takes 5 seconds, the whole request is 5 seconds).
The tasks running in parallel may run on distributed platforms with individual response times potentially quite unpredictable (varied loads on each system, queue depths, lack of measurements passed through to load balancers, etc).
This variability is referred to as the latency tail.
The challenge is to ensure that this tail is small, that the latency of each component is predictable.
Some good reading online on this topic, including
http://highscalability.com/blog/2012/3/12/google-taming-the-long-latency-tail-when-more-machines-equal.html
https://syslab.cs.washington.edu/papers/latency-socc14.pdf
This talk was looking at some prediction models, which were more advanced than my comfort level in this topic, so I'll focus more on reading these links.
Interestingly, whislt they weren't necessarily focussing on reducing it (as other papers seem to do), the presenter has proposed a reliable methodology for predicting system tail latency. Generally at higher system load levels (90% or higher), the accuracy was around 95% to 98%.
On the (ir)relevance of Network performance for data processing
Animesh Trivedi - IBM Research, Zurich
https://www.usenix.org/conference/hotcloud16/workshop-program/presentation/trivedi
Present understanding of network in data analytics performance is that a 1Gbps network only really costs 1%-2% compared to 10Gbps network in a clustered environment.
This presenter actually found the opposite - up to 64% improvements between 1Gbps and 10Gbps
TeraSort, PageRank, SQL, WordCount, etc
Terasort in particular was relevant as it maps then shuffles a lot of data over the network between the nodes which re-assemble the data and further sorts.
However, using the same datasets from 10Gbps to 40Gbps, they only saw another 1% - 2% improvement.
Wondering - is this Apache Spark specific?
No
How important is the network?
- at 1Gbps, 48%
- at 10Gbps, 8%
- at 40Gbps, 3%
So as network has largely been optimised, further gains are to be had in optimising CPU, rather than network (for now). Research efforts being put into "how do we make network a problem again" (i.e. how do we solve the CPU bottlenecks).