Apache: Big Data 2016 has ended

Operations-Use Cases
Monday, May 9

10:40am PDT

Big Data DumbOps - Kevin Monroe, Canonical
In this talk, Kevin will explore the idea of Big Data DumbOps -- not "dumb" because standing up a Big Data stack is easy, but "dumb" because it should be. Few people give much thought to apt-get install foo. Why can’t ’foo’ be a Big Data analytics stack, complete with ingestion, processing, and visualization components? The hard part of solving Big Data problems is the deployment and configuration of all the services that need to work together (NameNodes, DataNodes, ResourceManagers, oh my). Wouldn’t it be great if there were an easy way to model a Big Data platform, stand that up in a cloud, and get down to business? "Yes" is the right answer, and Juju does just that. Kevin will cover some of the Big Data services available in the Juju ecosystem (Hadoop, Spark, Kafka, etc.) and then show how easily these can be deployed to make way for the real fun -- solving Big Data problems.


Kevin Monroe

During his tenure at IBM, Kevin’s projects covered a wide range of development, from tiny embedded operating systems to Linux enablement of the POWER8 hardware platform. Kevin moved to Canonical in 2014 with his focus set on modeling workload deployments at scale. He found his niche...

Monday May 9, 2016 10:40am - 11:30am PDT
Georgia B

11:40am PDT

Big Data in Biology - Omkar Reddy, Dhirubhai Ambani University, India
Big data has been a key player in data mining and analytics. Large amounts of data are shared and dumped on the internet every day. Our body can be considered a big mine of data that has been explored for years. In this presentation we will view recent discoveries and implementations of big data in biology. We will see the amount of data a single genome of a single protein produces and how useful it is for studying and finding cures and preventions for many diseases and viruses. We will also look at different sectors of biology, the data produced in each sector, and the significance of big data in studying this data. We will also look at potential technologies that can be engineered using big data analytics tools and data mining to predict, track and cure diseases. Finally, we will take a look at how big data has influenced synthetic biology.


Omkar Reddy

Dhirubhai Ambani Institute of Information and Communication Technology
I am Omkar Reddy, a student pursuing my B.Tech in Information and Communication Technology at Dhirubhai Ambani University, India. I am fond of computer science and algorithms. I have been involved with the Apache Open Climate Workbench project. I am a member of the Project Management...

Monday May 9, 2016 11:40am - 12:30pm PDT
Georgia B

2:00pm PDT

Standing on Shoulders of Giants: Ampool Story - Milind Bhandarkar, Ampool, Inc.
Today, if unforeseen events change the decision model, we wait until the next batch model build for new insights. By extending fast “time-to-decisions” into the world of Big Data Analytics to get fast “time-to-insights”, applications will get what used to be batch insights in near real time. Enabling this is technology such as smart in-memory data storage, new storage class memory, and products designed to do one or more parts of an analysis pipeline very well. In this talk we describe how Ampool is contributing to and building upon Apache Geode and several other ASF projects (in the Open Data Platform Initiative, ODPi) to allow Big Data analysis solutions to work together with a scalable smart storage class memory layer, so that fast and complex end-to-end pipelines can be built, closing the loop and providing dramatically lower time to critical insights.

Milind Bhandarkar

Founder, Ampool
Milind Bhandarkar was the founding member of the team at Yahoo! that took Apache Hadoop from a 20-node prototype to a datacenter-scale production system. Parallel programming languages and paradigms have been his area of focus for over 20 years. He worked at several HPC companies, Yahoo...

Monday May 9, 2016 2:00pm - 2:50pm PDT
Georgia B

3:00pm PDT

Netflix Keystone - How We Built a 700B/day Stream Processing Cloud Platform in a Year - Peter Bakas, Netflix
Keystone processes over 700 billion events per day with at-least-once processing semantics in the cloud. We will explore in detail how we modified and leveraged Kafka, Samza, Docker, and Linux at scale to implement a multi-tenant pipeline in the Amazon AWS cloud within a year.

* Pipeline Evolution & Architecture
* Why we chose Kafka, Samza, Docker
* How to effectively use these technologies together in the Cloud
* Alterations to Kafka & Samza
* Scaling and Managing Kafka, Samza & Docker
* Deployment & Monitoring details
* Fault tolerance and failover strategies
* Performance numbers


Peter Bakas

Peter leads the Real Time Data Infrastructure team at Netflix. His team is responsible for building common infrastructure to collect, transport, aggregate, process and visualize over 700 billion events a day. Prior to Netflix, Peter has built and led teams responsible for developing...

Monday May 9, 2016 3:00pm - 3:50pm PDT
Georgia B

4:10pm PDT

Data Science with News Headlines - Analyzing and Visualizing a Whole Decade - Christian Winkler & Stephanie Fischer, mgm Technology Partners GmbH
We will show how to use Apache tools to dig through unstructured text, analyze and visualize the data, and iterate on this to create a compelling visual experience and drill down into the data. As visualizations we will use word clouds and histograms from D3.js. We will explain the whole procedure from getting, preparing and indexing real-world data to formulating the query and using the results.

Then we will present the animated visualization of the above data. We will demonstrate the flexibility of our Big Data solution and show how it can be adapted to specific users’ needs in real time. Data-wise, we will also visualize Hacker News. We will elaborate on both the benefits and the limits of word clouds as a Big Data visualization tool.

We will give an outlook on other possible use cases (like mailing list mood detection, wikis, etc.) and talk about detecting trends and outliers.

Stephanie Fischer

Big Data, Agile and Change Management, mgm consulting partners
I concentrate on user-centricity of Big Data technologies. My focus is finding the questions really worth solving. I think Big Data has the potential to advance humanity in a desirable direction. I have a background in organizational development, agility and business analytics...

Christian Winkler

Enterprise architect, mgm technology partners GmbH
Christian has worked for 20 years with Internet technologies. Recently, he has focused on working with large amounts of data or many users. As big data applications become more and more popular, lots of applications evolve. Many aggregates have to be calculated to describe characteristics...

Monday May 9, 2016 4:10pm - 5:00pm PDT
Georgia B

5:10pm PDT

Building a Durable Real-Time Data Pipeline: Apache BookKeeper at Twitter - Sijie Guo & Leigh Stewart, Twitter
The log has proven to be a very powerful data structure for addressing challenging distributed systems problems. DistributedLog is a replicated log service built on top of Apache BookKeeper, providing infinite, ordered, append-only streams that can be used for building robust real-time systems. It is the foundation of Twitter’s durable real-time data pipeline, and has been used widely elsewhere at Twitter in applications including a transactional database system, a search ingestion pipeline, and a real-time streaming data-analytics platform. In this talk, Sijie Guo will discuss the challenges of building a durable real-time data pipeline, how Twitter addressed them, and how the pipeline supports workloads with different characteristics, from a strongly consistent distributed database to a real-time data analytics pipeline.


Sijie Guo

Currently work for Twitter on DistributedLog/BookKeeper. Apache BookKeeper PMC Chair. Previously worked for Yahoo! on push notification system.

Monday May 9, 2016 5:10pm - 6:00pm PDT
Georgia B
Tuesday, May 10

9:00am PDT

Cancer Outlier Profile Analysis Using Spark - Mahmoud Parsian, Illumina, Inc.
Cancer Outlier Profile Analysis (COPA) is a method to find genes that undergo recurrent fusion in a given cancer type by finding pairs of genes that have mutually exclusive outlier profiles. COPA is used for detecting translocations of the second type using microarray data. The goal of COPA is to identify genes that have a subset of disease samples with outstandingly high or low values. We have implemented COPA in Spark for production, where it can process millions of biomarkers for one-sided and two-sided analysis, and each biomarker may have thousands of genes. Selecting Spark for the COPA implementation was a natural choice, since Spark offers join and filter operations (the main steps in a COPA implementation) at a very high level, which the traditional MapReduce API lacks. This presentation will show how we used Spark to solve a complex COPA problem.
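
As a rough illustration of the outlier-scoring idea behind COPA, here is a single-machine Python sketch. It is not the speaker's Spark implementation. It assumes the commonly described COPA transform (center each gene's values on their median, scale by the median absolute deviation, then read off a high percentile of the result); the function names and sample values are hypothetical.

```python
import statistics


def copa_transform(values):
    """Center values on the median and scale by the median absolute deviation (MAD)."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # All values sit on the median; no sample can stand out.
        return [0.0 for _ in values]
    return [(v - med) / mad for v in values]


def copa_score(values, percentile=90):
    """Return the given percentile (nearest-rank) of the transformed values.

    A large score suggests that a subset of samples has unusually high values.
    """
    transformed = sorted(copa_transform(values))
    # Nearest-rank percentile using integer ceiling division.
    rank = max(1, (percentile * len(transformed) + 99) // 100)
    return transformed[rank - 1]


# Hypothetical expression values for one gene across five samples.
expression = [1, 2, 3, 4, 100]
score = copa_score(expression)
```

In a cluster setting, the same per-gene scoring would be applied across many genes in parallel, with the join and filter steps the abstract mentions handling the pairing of genes.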


Mahmoud Parsian

Illumina, Inc.
Mahmoud Parsian, Ph.D. in Computer Science, is a practicing software professional with 30 years of experience as a developer, designer, architect, and author. For the past 15 years, he has been involved in Java server-side, databases, MapReduce, Hadoop, Spark, and distributed...

Tuesday May 10, 2016 9:00am - 9:50am PDT
Georgia B

10:00am PDT

Breaking Spark: Top 5 Mistakes to Avoid When Using Apache Spark in Production - Neelesh Srinivas Salian, Cloudera
Apache Spark has been growing in deployments for the past year. The increasing amount of data being analyzed and processed through the framework is massive and it continues to push the boundaries of the engine.

This talk will focus on common problematic issues observed in a cluster environment set up with Apache Spark, based on the presenter’s experiences across 150+ production deployments.

When planning an Apache Spark deployment in a cluster, it is recommended to follow certain guidelines to help set up a real-world environment. The issues that can occur fall into the following classes:

1) Scaling of the Architecture
2) Memory Configurations
3) End user Code
4) Incompatible Dependencies
5) Administration/Operation related issues.

These observations are useful because they help improve the usability and supportability of Apache Spark and help avoid such issues in future deployments.

Neelesh Srinivas Salian

Software Engineer, Stitch Fix
Neelesh Srinivas Salian is a Software Engineer on the Data Platform team at Stitch Fix, where he works on the compute infrastructure used by data scientists. He helps build services that are part of Stitch Fix’s Data Warehouse ecosystem. Currently he is working to build Data Lineage...

Tuesday May 10, 2016 10:00am - 10:50am PDT
Georgia B

11:20am PDT

A Java Implementer’s Guide to Boosting Apache Spark Performance - Tim Ellison, IBM
Apache Spark has rocked the big data landscape, becoming the largest open source big data community with over 750 contributors from more than 200 organizations. Spark’s core tenets of speed, ease of use, and its unified programming model fit neatly with the high performance, scalable, and manageable characteristics of modern Java runtimes. In this talk Tim Ellison, a JVM developer at IBM, shows some of the unique Java 8 capabilities in the JIT compiler, fast networking, serialization techniques, and GPU off-loading that deliver the ultimate big data platform for solving business problems. Tim will demonstrate how solutions, previously infeasible with regular Java programming, become possible with this high performance Spark core runtime, enabling you to solve problems smarter and faster.

Tim Ellison

Tim Ellison is currently a Senior Technical Staff Member with IBM's Java Technology Centre in the UK. He has worldwide responsibility for Open Source Engineering in the Java SDK underpinning a broad selection of IBM's flagship products. He is a Member of the Apache Software Foundation...

Tuesday May 10, 2016 11:20am - 12:10pm PDT
Georgia B

2:00pm PDT

Using Apache Big Data Stack to Analyse Storm-Scale Numerical Weather Prediction Data - Suresh Marru, Indiana University
This talk will discuss the adaptation of Apache Big Data technologies to analyze large, self-described, structured scientific data sets. We will present initial results for the problem of analyzing petabytes of weather forecasting simulation data produced as part of the National Oceanic and Atmospheric Administration’s annual Hazardous Weather Testbed. The challenge is to enable weather researchers to perform investigative queries over the full forecast simulation outputs to find the signatures for severe weather phenomena like tornadogenesis. Given the size of the data and the complexity of weather phenomena, these data sets are candidates for exploration by machine learning techniques that can identify heretofore unknown relationships in the dozens of weather parameters generated by the simulations, guiding researchers into developing new scientific models.

Suresh Marru

Member, Indiana University
Suresh Marru is a Member of the Apache Software Foundation and is the current PMC chair of the Apache Airavata project. He is the deputy director of Science Gateways Research Center at Indiana University. Suresh focuses on research topics at the intersection of application domain...

Tuesday May 10, 2016 2:00pm - 2:50pm PDT
Georgia B

3:00pm PDT

Focused Crawling with Apache Nutch - Sujen Shah, NASA JPL
The vast nature of the Web has forced researchers to continually develop advanced data acquisition strategies that overcome a multitude of obstacles in order to acquire relevant topical content and assimilate it with their needs. Many groups have researched focused Web crawling techniques in order to better guide their data acquisition efforts; however, few approaches consider the scenario where one wishes to undertake DD on the open Web for which no prior semantic knowledge resources are available. Sujen and his team have investigated and developed a new application of the cosine similarity metric (CSM), which has been implemented as part of a novel strategy for domain-specific DD.

In this presentation, Sujen will review the recent work in focused crawling and the ability to run similarity scoring within a production-ready, scalable Web crawler, Apache Nutch.
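
For readers unfamiliar with the cosine similarity metric mentioned above, the following Python sketch shows the general idea of scoring a fetched page against a topic description. It is a minimal illustration under stated assumptions, not the Apache Nutch scoring plugin: the whitespace tokenizer, raw term counts, function names, and sample strings are all hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a bag-of-words term-count vector (assumed tokenizer: lowercase, split on whitespace)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as unrelated
    return dot / (norm_a * norm_b)


# Hypothetical topic description and page text.
topic = term_vector("mars rover mission")
page = term_vector("the mars rover landed")
relevance = cosine_similarity(topic, page)
```

A focused crawler could use such a score to prioritize outlinks from pages that score above some threshold; the choice of threshold is left open here.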

Sujen Shah

Research Intern, NASA Jet Propulsion Laboratory
Sujen is a Masters student pursuing Computer Science at the University of Southern California, Los Angeles. As a committer and member of the Apache Nutch PMC, his work includes augmenting the focused crawling capabilities of Nutch. These new scoring plugins are supporting the efforts...

Tuesday May 10, 2016 3:00pm - 3:50pm PDT
Georgia B
Wednesday, May 11

10:50am PDT

Tailored for Spark - Petr Igrevski, eBay
We went big with Spark at eBay. Let us tell you the story of how we built a custom-tailored Spark system leveraging cloud and disaggregated storage. Watch us demonstrate our Spark developer experience as we walk you through our custom Spark-as-a-service offering. Come and learn how eBay embraced Spark, how we created a delightful environment for our data developers, and how we use this environment today.


Wednesday May 11, 2016 10:50am - 11:40am PDT
Plaza B

11:50am PDT

Network DVR Meets Big Data - Stephen Kraiman, ARRIS
Network traffic for massive data ingest systems is often ignored, yet can become a significant cost factor when designing a cluster. Network traffic on Hadoop- and object-store-based network DVR recorders was simulated, with surprising results.

A network DVR is an example of a class of application that generates massive amounts of data on the cluster. This session explores how different implementation models affect the network traffic generated. The presenters will explore the implementation and the simulation results. The presentation will cover a variety of open source technologies including HDFS, Spark, and Kafka.


Stephen Kraiman

Stephen Kraiman, Principal Architect at ARRIS, is primarily focused on the design of systems and CDN technology that monetize the storage, management and transport of video over IP networks. Stephen was a cofounder of Digital Video Arts Ltd, which was acquired by SeaChange International...

Wednesday May 11, 2016 11:50am - 12:40pm PDT
Plaza B

2:00pm PDT

ODPi and ASF Collaboration: Ask Us Anything! - John Mertic, ODPi & Jim Jagielski, Apache Software Foundation

The Apache Software Foundation (ASF) has long been the champion of open source projects that compose the larger Apache Hadoop ecosystem. ODPi is complementary to those efforts, solely focused on easing integration and standardization for downstream application vendors and end-users that build upon Apache Hadoop®. Since ODPi’s launch in 2015, there has been some confusion around how its work may overlap, or potentially compete, with that of the ASF.

Jim Jagielski, Founding Member and Board Director of the Apache Software Foundation, and John Mertic, Director of Program Management for ODPi, will clear up this confusion. During the discussion, attendees will learn how ASF and ODPi are collaborating to accelerate enterprise adoption of Apache Hadoop and big data technologies. There will also be an open Q&A, where attendees can ask about ASF and ODPi projects, their work together, where the big data ecosystem is heading, and anything else that comes to mind.

Jim Jagielski

Developer, Uber
Jim Jagielski is a well-known and acknowledged expert and visionary in open source, an accomplished coder, and frequent engaging presenter on all things open, web, blockchain, and cloud related. As a developer, he’s made substantial code contributions to just about every core technology...

John Mertic

Director of Program Management, The Linux Foundation
John Mertic is the Director of Program Management for The Linux Foundation. Under his leadership, he has helped ASWF, ODPi, Open Mainframe Project, and R Consortium accelerate open source innovation and transform industries. John has an open source career spanning two decades, both...

Wednesday May 11, 2016 2:00pm - 2:50pm PDT
Plaza B

3:00pm PDT

Scylla: A Revolutionary Design for NoSQL Performs at 1.8M TPS/node - Don Marti & Tzach Livyatan, ScyllaDB
Scylla is a new NoSQL database, compatible with Apache Cassandra, that is capable of a 10x improvement in throughput on the same hardware, with predictable low latency that dramatically improves the performance of analytics originally developed for Cassandra. The database is now in use in production and in pilot projects internationally.

Scylla applies kernel programming techniques to a horizontally scalable NoSQL design to achieve extreme performance improvements and the elimination of garbage collection pauses. The Scylla design is based on a modern shared-nothing approach. A new architecture for the NoSQL server is necessary because of new growth in, and limitations of, modern server hardware. As CPU core counts continue to grow, along with the raw speed of networking and storage devices available on a modern system, software design approaches that were valid and safe even a few years ago are no longer sustainable. Scylla runs multiple engines, one per core, each with its own memory, CPU and multi-queue NIC.

With extra performance to work with, NoSQL projects can have more flexibility to focus on other concerns, such as functionality and time to market. Scylla enables faster cluster scaling, more overhead to handle complex queries, and the power to do complex analytics tasks at the same time as routine administration operations.

Tzach Livyatan

VP Product, Scylla
Tzach Livyatan has a B.A. and MSc in Computer Science (Technion, Summa Cum Laude), and has had a 15-year career in development, system engineering and product management. In the past he worked in the Telecom domain, focusing on carrier-grade systems, signalling, policy and charging...

Don Marti

Don Marti has written for Linux Weekly News, Linux Journal, and other publications. He co-founded the Linux consulting firm Electric Lichen. Don is a strategic advisor for Mozilla, and has previously served as president and vice president of the Silicon Valley Linux Users Group and...

Wednesday May 11, 2016 3:00pm - 3:50pm PDT
Plaza B

4:10pm PDT

Apache HBase: Overview and Use Cases - Sean Busbey, Cloudera Inc
NoSQL databases are critical in building Big Data applications. Apache HBase, one of the most popular NoSQL databases, is used by Facebook, Apple, eBay and hundreds of other enterprises to store, analyze and profit from their petabyte-scale volumes of data. This tutorial, using a hands-on session with Apache HBase, will explain basic concepts of non-relational databases. Then we’ll explore some commonly seen big data usage patterns in industry, and when and how to use Apache HBase (or another better-suited NoSQL database).


Sean Busbey

Sean Busbey currently works at Cloudera as a software engineer on distributed storage systems. In addition to being a Member of the Apache Software Foundation, he is actively involved in several projects including: HBase, Yetus, Avro, NiFi, and Accumulo. Outside of the ASF, he is...

Wednesday May 11, 2016 4:10pm - 5:00pm PDT
Plaza A

4:10pm PDT

Data Science for the Datacenter: Analyzing Logs with Apache Spark - William Benton, Red Hat, Inc
Contemporary applications and infrastructure software leave behind a tremendous volume of metric and log data. This “digital exhaust” is inscrutable to humans and difficult for computers to analyze, since it is vast, complex, and not explicitly structured.

In this session, Will Benton will introduce the log processing domain and give you practical advice for using Apache Spark to analyze log data, including data engineering techniques to impose structure on disparate log sources; data science approaches to detect infrastructure failures; language-processing techniques to characterize the text of log messages; best practices for tuning Spark and using newer Spark features; and how to visualize your results. You’ll learn from Benton’s experience developing applications that analyze the vast log data generated within Red Hat’s network and leave well-prepared to analyze your own logs.

William Benton

Manager, Software Engineering and Sr. Principal Engineer, Red Hat, Inc
William Benton leads a team of data scientists and engineers at Red Hat, where he has applied machine learning to problems ranging from forecasting cloud infrastructure costs to designing better cycling workouts. His current focus is investigating the best ways to build and deploy...

Wednesday May 11, 2016 4:10pm - 5:00pm PDT
Plaza B

5:10pm PDT

Data Management at Scale - Tom Barber, Meteorite Consulting
Apache OODT is relatively easy to get up and running with the RADiX distribution but how do you administer it at scale?

Managing a data management cluster can be daunting, especially when it's distributed around the globe in various data centres. We’ll take a look at options for large scale distributed roll outs of Apache OODT across multiple continents and how to connect, support and administer them, maximising the throughput of the system and ensuring users have access to all the data they require.

Container technology has drastically altered the DevOps landscape. Using service orchestration tools and Apache Mesos to maintain your cluster can make managing OODT relatively easy and infinitely scalable, and we'll also look at how to connect it to other data services. Find out more in a live (seat of the pants) demo.

Tom Barber

Technical Director, Spicule LTD
Tom Barber is the director of Meteorite BI and Spicule BI. A member of the Apache Software Foundation and regular speaker at ApacheCon, Tom has a passion for simplifying technology. The creator of Saiku Analytics and open source stalwart, when not working for NASA, Tom currently deals...

Wednesday May 11, 2016 5:10pm - 6:00pm PDT
Plaza B