Apache: Big Data 2016 has ended

Monitoring-Benchmarking
Wednesday, May 11

10:50am PDT

Using a Relative Index of Performance (RIP) to Determine Optimum Configuration Settings Compared to Random Forest Assessment Using Spark - Diane Feddema, Red Hat Inc, Canada
Computer systems can be configured with a myriad of options, and determining an optimal set-up for any particular application can be difficult. This pilot study demonstrates how numerous I/O performance tests with varied hardware and software configurations can be efficiently compared to determine an optimal set-up for an application. To simplify this process, a statistic was developed to provide a quick relative performance comparison. This metric can be arithmetically manipulated to provide meaningful averaging of multiple performance tests into a single overall performance indicator. We will illustrate how RIP is used by comparing I/O performance test results on approximately 50-100 different hardware/software set-ups; RIP results will be compared to results from a random forest repeat-sampling technique to determine the most influential performance factors.
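The abstract does not give the formula behind RIP. One plausible construction for such a relative index (an assumption for illustration, not the speaker's definition) normalizes each test's throughput against a baseline configuration and aggregates the ratios with a geometric mean, which makes averaging across tests with different units meaningful:

```python
import math

def relative_index(results, baseline):
    """Relative performance index for one configuration.

    `results` and `baseline` map test name -> throughput (higher = better).
    Each test is expressed as a ratio to the baseline configuration, then
    the ratios are combined with a geometric mean so heterogeneous tests
    can be averaged into a single overall indicator.
    """
    ratios = [results[t] / baseline[t] for t in baseline]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical throughput numbers for two set-ups:
baseline = {"seq_read": 100.0, "seq_write": 80.0, "rand_read": 40.0}
config_a = {"seq_read": 120.0, "seq_write": 80.0, "rand_read": 50.0}

rip = relative_index(config_a, baseline)  # > 1.0 means faster than baseline
```

By construction the baseline scores exactly 1.0 against itself, and indices from different test suites can themselves be averaged the same way.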

Diane Feddema

Principal Software Engineer, Red Hat
Diane Feddema is a principal software engineer at Red Hat Inc Canada, in the AI Center of Excellence. Diane is currently focused on developing and applying machine learning techniques for performance analysis using hardware accelerators, automating these analyses and displaying data…

Wednesday May 11, 2016 10:50am - 11:40am PDT
Georgia B

11:50am PDT

Experiences Using Apache HTrace (Incubating) in Distributed Web Search - Lewis McGibbney, NASA JPL
Recent developments within the tracing community have brought projects like Apache HTrace (Incubating) into the Apache Incubator, opening up the possibility of using tracing logic to better understand distributed applications, systems and systems-of-systems. As many will know, tracing involves a specialized use of logging to record information about a program's execution. Although many use cases involve tracing within distributed systems such as Hadoop and databases, few tracing experiments have been carried out in the field of large-scale, distributed Web search. This presentation will combine the comprehensive tracing mechanisms of Apache HTrace (Incubating) with the scalable, flexible crawling architecture provided by Apache Nutch. Key takeaways from this presentation are development and implementation details, tracing guidance for your web search stack, and future work in this area.

Lewis McGibbney

Enterprise Search Technologist III, Jet Propulsion Laboratory

Wednesday May 11, 2016 11:50am - 12:40pm PDT
Georgia B

2:00pm PDT

HiBench - The Benchmark Suite for Hadoop, Spark and Streaming - Carson Wang, Intel
HiBench is an open-source, Apache-licensed big data benchmark suite that helps evaluate different big data frameworks in terms of speed, throughput and system resource utilization. It contains a set of Hadoop, Spark and streaming workloads, including Sort, WordCount, TeraSort, PageRank, Bayes, Kmeans, enhanced DFSIO, etc. It also contains several streaming workloads for Spark Streaming, Storm and Samza. In this presentation, Carson Wang will introduce the features of HiBench and walk through how to use HiBench to benchmark different big data frameworks. The talk will also cover tuning guidance for workloads with different characteristics.
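HiBench summarizes each run as a duration and a derived throughput. A minimal sketch of that derivation (the workload names and numbers below are hypothetical, not real benchmark results):

```python
# Hypothetical rows in the shape of a benchmark summary:
# (workload, input size in bytes, duration in seconds)
runs = [
    ("WordCount/Hadoop", 32 * 10**9, 410.0),
    ("WordCount/Spark",  32 * 10**9, 160.0),
]

for name, input_bytes, seconds in runs:
    # Throughput is simply bytes processed per second of wall-clock time.
    throughput = input_bytes / seconds
    print(f"{name}: {throughput / 10**6:.1f} MB/s")
```

Comparing frameworks on the same workload and input size then reduces to comparing these throughput figures.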


Carson Wang

Carson Wang is a software engineer on Intel's big data team. He is an active open source contributor to the Spark and Tachyon projects.

Wednesday May 11, 2016 2:00pm - 2:50pm PDT
Georgia B

3:00pm PDT

Monitoring in a Distributed World - Felix Massem, codecentric AG
The IT infrastructure for distributed applications is getting bigger and more complex every day, and with it the sheer volume of observed events is growing. To ensure reliable IT operations, we need a distributed and scalable monitoring architecture to evaluate these events. This session shows how to build such an architecture on open source software.
Starting with some basics of monitoring IT infrastructure and applications, we will look at key terms such as monitoring, alerting, diagnostics and reporting. Based on this, we will build up a monitoring architecture.
We will elaborate on and integrate the following modules: log file shipping and analysis (Logstash), system monitoring (collectd), event storage (Elasticsearch), metric generation and storage (StatsD and Graphite), as well as different dashboards (Grafana, Seyren, Kibana).
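A minimal sketch of the log-shipping leg of such a stack: a Logstash pipeline reading a local log file, parsing it, and writing to a local Elasticsearch node (the file path and host are placeholders, not values from the talk):

```
input {
  file {
    path => "/var/log/app/*.log"
    start_position => "beginning"
  }
}
filter {
  grok {
    # Parse standard web-server access-log lines into structured fields
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}
```

The other modules (collectd, StatsD/Graphite, the dashboards) attach around this pipeline rather than inside it.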


Felix Massem

Consultant, codecentric AG
Felix Massem works as a consultant for codecentric AG. His main focus is on Continuous Delivery and technologies around infrastructure as code and log analysis. Besides this, he is most interested in topics like DevOps, Data Mining and Big Data technologies. As an author…

Wednesday May 11, 2016 3:00pm - 3:50pm PDT
Georgia B

4:10pm PDT

Effective HBase Healthcheck and Troubleshooting - Jayesh Thakrar, Conversant
We all know HBase as a robust, resilient, scalable and performant big data datastore. Once configured well, it can run hands-off for months without any maintenance or care and feeding. The only occasional attention needed is hardware maintenance and system troubleshooting. Since an HBase cluster is often made up of several servers and the system can be on "auto-pilot", it's the applications that may notice problems first when they occur. At those times, identifying and resolving the root cause or symptom needs to be done quickly.

Other than HDFS itself, HBase is probably the oldest and most mature component of the Hadoop ecosystem, and it is bundled with a number of tools and utilities. This presentation will cover how to effectively make them part of your troubleshooting toolbox, as well as how to formulate your own key performance and health indicators.

Jayesh Thakrar

Sr. Software Engineer, Conversant
Jayesh Thakrar is a Sr. Data Engineer at Conversant (http://www.conversantmedia.com/). He is a data geek who gets to build and play with large data systems consisting of Hadoop, Spark, HBase, Cassandra, Flume and Kafka. To rest after a good day's work, he uses OpenTSDB with 500+ million…

Wednesday May 11, 2016 4:10pm - 5:00pm PDT
Georgia B

5:10pm PDT

Less Is More: Doubling Storage Efficiency with HDFS Erasure Coding - Zhe Zhang, LinkedIn & Kai Zheng, Intel
Ever since its creation, HDFS has relied on data replication to shield against most failure scenarios. However, with the explosive growth in data volume, replication is getting expensive: the default 3x replication scheme incurs a 200% storage overhead. Erasure coding (EC) uses far less storage space while still providing the same level of fault tolerance. Under typical configurations, EC reduces the storage cost by ~50% compared with 3x replication.

In this talk we will introduce the design and implementation of HDFS-EC, and recommended use cases. We will also provide preliminary performance results. Equipped with the Intel ISA-L library, HDFS-EC has largely eliminated the computational overhead in codec calculation. Under sequential I/O workloads, it achieves twice the throughput compared with 3x replication, by performing striped I/O to multiple DataNodes in parallel.
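The 200% and ~50% figures follow from simple arithmetic; a sketch assuming the Reed-Solomon (6,3) layout commonly cited for HDFS-EC (the specific scheme is an assumption here, the abstract only says "typical configurations"):

```python
def storage_overhead(data_units, parity_units):
    """Extra storage as a fraction of raw data size."""
    return parity_units / data_units

# 3x replication: each block is stored 3 times -> 2 extra copies per block.
replication_overhead = storage_overhead(1, 2)   # 2.0, i.e. 200% overhead

# Reed-Solomon(6,3): 6 data cells protected by 3 parity cells.
ec_overhead = storage_overhead(6, 3)            # 0.5, i.e. 50% overhead

# Total bytes stored per byte of user data:
replication_cost = 1 + replication_overhead     # 3.0
ec_cost = 1 + ec_overhead                       # 1.5 -> half the cost of 3x
```

Both schemes tolerate the loss of any three replicas/cells of a block group, which is why the comparison is apples-to-apples on fault tolerance.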


Zhe Zhang

Zhe Zhang is a software engineer at LinkedIn working on Hadoop. He's an Apache Hadoop Committer and author of HDFS Erasure Coding. Before joining LinkedIn in February 2016, Zhe was an engineer on Cloudera's HDFS team. Prior to that he worked at the IBM T. J. Watson Research Center…

Kai Zheng

Kai is a senior software engineer at Intel who has worked in the big data and security fields for several years. He is a key Apache Kerby initiator, an Apache Directory PMC member and an Apache Hadoop committer.

Wednesday May 11, 2016 5:10pm - 6:00pm PDT
Georgia B