Apache: Big Data 2016: Full Schedule

10:40am PDT

Big Data DumbOps - Kevin Monroe, Canonical

In this talk, Kevin will explore the idea of Big Data DumbOps -- not "dumb" because standing up a Big Data stack is easy, but "dumb" because it should be. Few people give much thought to apt-get install foo. Why can’t ‘foo’ be a Big Data analytics stack, complete with ingestion, processing, and visualization components? The hard part of solving Big Data problems is the deployment and configuration of all the services that need to work together (NameNodes, DataNodes, ResourceManagers, oh my). Wouldn’t it be great if there were an easy way to model a Big Data platform, stand it up in a cloud, and get down to business? "Yes" is the right answer, and Juju does just that. Kevin will cover some of the Big Data services available in the Juju ecosystem (Hadoop, Spark, Kafka, etc.) and then show how easily these can be deployed to make way for the real fun -- solving Big Data problems.

Speakers

Kevin Monroe

Canonical

During his tenure at IBM, Kevin’s projects covered a wide range of development, from tiny embedded operating systems to Linux enablement of the POWER8 hardware platform. Kevin moved to Canonical in 2014 with his focus set on modeling workload deployments at scale. He found his niche...

Big Data DumbOps pdf

Monday May 9, 2016 10:40am - 11:30am PDT
Georgia B

Operations-Use Cases, Intermediate

11:40am PDT

Big Data in Biology - Omkar Reddy, Dhirubhai Ambani University, India

Big data has been a key player in data mining and analytics. Large amounts of data are shared and dumped on the internet every day. Our body can be considered a big mine of data that has been explored for years. In this presentation we will review recent discoveries and implementations of big data in biology. We will see how much data a single genome of a single protein produces and how useful it is for studying and finding cures and preventions for many diseases and viruses. We will also look at different sectors of biology, the data produced in each, and the significance of big data in studying this data. We will also look at potential technologies that can be engineered using big data analytics tools and data mining to predict, track, and cure diseases. Finally, we will look at how big data has influenced synthetic biology.

Speakers

Omkar Reddy

Dhirubhai Ambani Institute of Information and Communication Technology

I am Omkar Reddy, a student pursuing my B.Tech in Information and Communication Technology at Dhirubhai Ambani University, India. I am fond of computer science and algorithms. I have been involved with the Apache Open Climate Workbench project. I am a member of the Project Management...

Big Data in Biology pdf

Monday May 9, 2016 11:40am - 12:30pm PDT
Georgia B

Operations-Use Cases, Beginner

2:00pm PDT

Standing on Shoulders of Giants: Ampool Story - Milind Bhandarkar, Ampool, Inc.

Today, if unforeseen events change the decision model, we wait until the next batch model build for new insights. By extending fast “time-to-decisions” into the world of Big Data Analytics to get fast “time-to-insights”, applications will get what used to be batch insights in near real time. Enabling this are technologies such as smart in-memory data storage, new storage class memory, and products designed to do one or more parts of an analysis pipeline very well. In this talk we describe how Ampool is contributing to, and building upon, Apache Geode and several other ASF projects (in the Open Data Platform Initiative, ODPi) so that Big Data analysis solutions can work together with a scalable, smart storage class memory layer. This allows fast and complex end-to-end pipelines to be built, closing the loop and providing dramatically lower time to critical insights.

Speakers

Milind Bhandarkar

Founder, Ampool

Milind Bhandarkar was the founding member of the team at Yahoo! that took Apache Hadoop from a 20-node prototype to a datacenter-scale production system. Parallel programming languages and paradigms have been his area of focus for over 20 years. He worked at several HPC companies, Yahoo...

Standing on Shoulders of Giants Ampool Story pdf

Monday May 9, 2016 2:00pm - 2:50pm PDT
Georgia B

Operations-Use Cases, Intermediate

3:00pm PDT

Netflix Keystone - How We Built a 700B/day Stream Processing Cloud Platform in a Year - Peter Bakas, Netflix

Keystone processes over 700 billion events per day with at-least-once processing semantics in the cloud. We will explore in detail how we modified and leveraged Kafka, Samza, Docker, and Linux at scale to implement a multi-tenant pipeline in the AWS cloud within a year.

* Pipeline Evolution & Architecture
* Why we chose Kafka, Samza, Docker
* How to effectively use these technologies together in the Cloud
* Alterations to Kafka & Samza
* Scaling and Managing Kafka, Samza & Docker
* Deployment & Monitoring details
* Fault tolerance and failover strategies
* Performance numbers

Speakers

Peter Bakas

Netflix

Peter leads the Real Time Data Infrastructure team at Netflix. His team is responsible for building common infrastructure to collect, transport, aggregate, process, and visualize over 700 billion events a day. Prior to Netflix, Peter built and led teams responsible for developing...

How We Built a Stream Processing Cloud Platform pdf

Monday May 9, 2016 3:00pm - 3:50pm PDT
Georgia B

Operations-Use Cases, Advanced

4:10pm PDT

Data Science with News Headlines - Analyzing and Visualizing a Whole Decade - Christian Winkler & Stephanie Fischer, mgm Technology Partners GmbH

We will show how to use Apache tools to dig through unstructured text, analyze and visualize the data, and iterate on this to create a compelling visual experience and drill down into the data. As visualizations we will use word clouds and histograms from D3.js. We will explain the whole procedure, from getting, preparing, and indexing real-world data to formulating the query and using the results.

Then we will present the animated visualization of the above data. We will demonstrate the flexibility of our Big Data solution and show how it can be adapted to specific users’ needs in real time. Data-wise, we will also visualize Hacker News. We will elaborate on both the benefits and the limits of word clouds as a Big Data visualization tool.

We will give an outlook on other possible use cases (such as mailing list mood detection, wikis, etc.) and talk about detecting trends and outliers.
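The abstract does not show how the word-cloud data is computed, so the following is only a minimal plain-Python sketch of the underlying idea: counting term frequencies across headlines, which is the kind of aggregate a word cloud or histogram is drawn from. The function name, the stop-word list, and the sample headlines are all made up for illustration and are not taken from the talk, which relies on Apache tools for indexing and querying.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "for", "is"}


def term_frequencies(headlines, top_n=3):
    """Count non-stop-word terms across all headlines and return the top_n."""
    counts = Counter()
    for headline in headlines:
        # Lowercase and split into simple alphanumeric tokens.
        tokens = re.findall(r"[a-z0-9']+", headline.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(top_n)


# Invented sample headlines, used only to exercise the function.
sample_headlines = [
    "Spark release brings faster streaming",
    "Kafka and Spark power streaming pipelines",
    "The state of streaming in 2016",
]

top_terms = term_frequencies(sample_headlines, top_n=2)
```

Each (term, count) pair could then be passed to a front-end library such as D3.js to size the words in a cloud; a real pipeline would typically compute these counts in the search index rather than in application code.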

Speakers

Stephanie Fischer

Big Data, Agile and Change Management, mgm consulting partners

I concentrate on user-centricity of Big Data technologies. My focus is finding the questions really worth solving. I think Big Data has the potential to advance humanity in a desirable direction. I have a background in organizational development, agility and business analytics...

Christian Winkler

Enterprise architect, mgm technology partners GmbH

Christian has worked for 20 years with Internet technologies. Recently, he has focused on working with large amounts of data or many users. As big data applications become more and more popular, lots of applications evolve. Many aggregates have to be calculated to describe characteristics...

data science with news headlines pdf

Monday May 9, 2016 4:10pm - 5:00pm PDT
Georgia B

Operations-Use Cases, Beginner

5:10pm PDT

Building a Durable Real-Time Data Pipeline: Apache BookKeeper at Twitter - Sijie Guo & Leigh Stewart, Twitter

The log has proven to be a very powerful data structure for addressing challenging distributed systems problems. DistributedLog is a replicated log service built on top of Apache BookKeeper, providing infinite, ordered, append-only streams that can be used for building robust real-time systems. It is the foundation of Twitter’s durable real-time data pipeline, and has been used widely elsewhere at Twitter in applications including a transactional database system, a search ingestion pipeline, and a real-time streaming data-analytics platform. In this talk, Sijie Guo will discuss the challenges of building a durable real-time data pipeline, how Twitter achieved it, and how it is used to support workloads with different characteristics, from a strongly consistent distributed database to a real-time data analytics pipeline.

Speakers

Sijie Guo

Twitter

Currently working at Twitter on DistributedLog/BookKeeper. Apache BookKeeper PMC Chair. Previously worked at Yahoo! on a push notification system.

Leigh Stewart

Twitter

Building a Durable Real Time Data Pipeline Apache BookKeeper at Twitter pdf

Monday May 9, 2016 5:10pm - 6:00pm PDT
Georgia B

Operations-Use Cases, Any

9:00am PDT

Cancer Outlier Profile Analysis Using Spark - Mahmoud Parsian, Illumina, Inc.

Cancer Outlier Profile Analysis (COPA) is a method to find genes that undergo recurrent fusion in a given cancer type by finding pairs of genes that have mutually exclusive outlier profiles. COPA is used for detecting translocations of the second type using microarray data. The goal of COPA is to identify genes that have a subset of disease samples with outstandingly high or low values. We have implemented COPA in Spark for production, where it can process millions of biomarkers for one-sided and two-sided analysis, and each biomarker may have thousands of genes. Spark was a natural choice for the COPA implementation, since it offers join and filter operations (the main steps in the COPA implementation) at a very high level, which the traditional MapReduce API lacks. This presentation will show how we used Spark to solve a complex COPA problem.
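The abstract does not include the Spark code, so the following is only a rough plain-Python sketch of the idea it describes, not the presenter's implementation. It assumes the commonly described COPA transformation (center each gene's values on the median, scale by the median absolute deviation), flags outlier samples for one-sided or two-sided analysis, and checks whether two genes have mutually exclusive outlier profiles. The function names, the threshold of 5.0, and the sample values are illustrative assumptions.

```python
from statistics import median


def copa_scores(values):
    """Median-center the values and scale by the median absolute deviation."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread around the median: treat every sample as a non-outlier.
        return [0.0] * len(values)
    return [(v - med) / mad for v in values]


def outlier_samples(values, threshold=5.0, two_sided=False):
    """Return the set of sample indices whose scaled value is an outlier."""
    scores = copa_scores(values)
    if two_sided:
        # Two-sided analysis: unusually high or unusually low values.
        return {i for i, s in enumerate(scores) if abs(s) > threshold}
    # One-sided analysis: unusually high values only.
    return {i for i, s in enumerate(scores) if s > threshold}


def mutually_exclusive(outliers_a, outliers_b):
    """True when both genes have outliers and no sample is an outlier in both."""
    return bool(outliers_a) and bool(outliers_b) and outliers_a.isdisjoint(outliers_b)


# Invented expression values for two hypothetical genes across eight samples.
gene_a = [1.0, 1.2, 0.9, 1.1, 9.0, 1.0, 1.05, 0.95]
gene_b = [2.0, 2.1, 1.9, 2.0, 2.05, 8.5, 1.95, 2.1]

outliers_a = outlier_samples(gene_a)
outliers_b = outlier_samples(gene_b)
exclusive = mutually_exclusive(outliers_a, outliers_b)
```

In a Spark version, the per-gene scoring would plausibly map onto a filter step and the pairing of genes onto a join, which matches the two operations the abstract names as the main steps.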

Speakers

Mahmoud Parsian

Illumina, Inc.

Mahmoud Parsian, Ph.D. in Computer Science, is a practicing software professional with 30 years of experience as a developer, designer, architect, and author. For the past 15 years, he has been involved in Java server-side, databases, MapReduce, Hadoop, Spark, and distributed...

Cancer Outlier Profile Analysis Using Spark pdf

Tuesday May 10, 2016 9:00am - 9:50am PDT
Georgia B

Operations-Use Cases, Intermediate

10:00am PDT

Breaking Spark: Top 5 Mistakes to Avoid When Using Apache Spark in Production - Neelesh Srinivas Salian, Cloudera

Apache Spark has been growing in deployments for the past year. The increasing amount of data being analyzed and processed through the framework is massive and it continues to push the boundaries of the engine.

This talk will focus on common problems observed in cluster environments set up with Apache Spark, based on the presenter’s experiences across 150+ production deployments.

When planning an Apache Spark deployment in a cluster, it is recommended to follow certain guidelines to help set up a real-world environment. The issues that can occur fall into these categories:

1) Scaling of the Architecture
2) Memory Configurations
3) End user Code
4) Incompatible Dependencies
5) Administration/Operation related issues.

These observations help improve the usability and supportability of Apache Spark and help avoid such issues in future deployments.

Speakers

Neelesh Srinivas Salian

Software Engineer, Stitch Fix

Neelesh Srinivas Salian is a Software Engineer on the Data Platform team at Stitch Fix, where he works on the compute infrastructure used by data scientists. He helps build services that are part of Stitch Fix’s Data Warehouse ecosystem. Currently he is working to build Data Lineage...

Spark Talk Cloudera pdf

Tuesday May 10, 2016 10:00am - 10:50am PDT
Georgia B

Operations-Use Cases, Beginner

11:20am PDT

A Java Implementer’s Guide to Boosting Apache Spark Performance - Tim Ellison, IBM

Apache Spark has rocked the big data landscape, becoming the largest open source big data community with over 750 contributors from more than 200 organizations. Spark’s core tenets of speed, ease of use, and a unified programming model fit neatly with the high performance, scalable, and manageable characteristics of modern Java runtimes. In this talk Tim Ellison, a JVM developer at IBM, shows some of the unique Java 8 capabilities in the JIT compiler, fast networking, serialization techniques, and GPU off-loading that deliver the ultimate big data platform for solving business problems. Tim will demonstrate how solutions, previously infeasible with regular Java programming, become possible with this high performance Spark core runtime, enabling you to solve problems smarter and faster.

Speakers

Tim Ellison

Tim Ellison is currently a Senior Technical Staff Member with IBM's Java Technology Centre in the UK. He has worldwide responsibility for Open Source Engineering in the Java SDK underpinning a broad selection of IBM's flagship products. He is a Member of the Apache Software Foundation...

Tuesday May 10, 2016 11:20am - 12:10pm PDT
Georgia B

Operations-Use Cases, Intermediate