Apache: Big Data 2016: Full Schedule

10:40am PDT

Big Data DumbOps - Kevin Monroe, Canonical

In this talk, Kevin will explore the idea of Big Data DumbOps -- not "dumb" because standing up a Big Data stack is easy, but "dumb" because it should be. Few people give much thought to apt-get install foo. Why can’t ‘foo’ be a Big Data analytics stack, complete with ingestion, processing, and visualization components? The hard part of solving Big Data problems is deploying and configuring all the services that need to work together (NameNodes, DataNodes, ResourceManagers, oh my). Wouldn’t it be great if there were an easy way to model a Big Data platform, stand it up in a cloud, and get down to business? "Yes" is the right answer, and Juju does just that. Kevin will cover some of the Big Data services available in the Juju ecosystem (Hadoop, Spark, Kafka, etc.) and then show how easily these can be deployed to make way for the real fun -- solving Big Data problems.

Speakers

Kevin Monroe

Canonical

During his tenure at IBM, Kevin’s projects covered a wide range of development, from tiny embedded operating systems to Linux enablement of the POWER8 hardware platform. Kevin moved to Canonical in 2014 with his focus set on modeling workload deployments at scale. He found his niche...

Big Data DumbOps pdf

Monday May 9, 2016 10:40am - 11:30am PDT
Georgia B

Operations-Use Cases, Intermediate

2:00pm PDT

Standing on Shoulders of Giants: Ampool Story - Milind Bhandarkar, Ampool, Inc.

Today, if unforeseen events change the decision model, we wait until the next batch model build for new insights. By extending fast “time-to-decisions” into the world of Big Data Analytics to get fast “time-to-insights”, applications will get what used to be batch insights in near real time. Enabling this is technology such as smart in-memory data storage, new storage class memory, and products designed to do one or more parts of an analysis pipeline very well. In this talk we describe how Ampool is contributing to, and building upon, Apache Geode and several other ASF projects (in the Open Data Platform Initiative, ODPi) to allow Big Data analysis solutions to work together with a scalable, smart storage class memory layer, so that fast and complex end-to-end pipelines can be built, closing the loop and providing dramatically lower time to critical insights.

Speakers

Milind Bhandarkar

Founder, Ampool

Milind Bhandarkar was the founding member of the team at Yahoo! that took Apache Hadoop from 20-node prototype to datacenter-scale production system. Parallel programming languages and paradigms have been his area of focus for over 20 years. He worked at several HPC companies, Yahoo...

Standing on Shoulders of Giants Ampool Story pdf

Monday May 9, 2016 2:00pm - 2:50pm PDT
Georgia B

Operations-Use Cases, Intermediate

9:00am PDT

Cancer Outlier Profile Analysis Using Spark - Mahmoud Parsian, Illumina, Inc.

Cancer Outlier Profile Analysis (COPA) is a method to find genes that undergo recurrent fusion in a given cancer type by finding pairs of genes that have mutually exclusive outlier profiles. COPA is used for detecting translocations of the second type using microarray data. The goal of COPA is to identify genes that have a subset of disease samples with outstandingly high or low values. We have implemented COPA in Spark for production, where it can process millions of biomarkers for one-sided and two-sided analysis, and each biomarker may have thousands of genes. Spark was a natural choice for the COPA implementation, since it offers join and filter operations (the main steps in a COPA implementation) at a very high level, which the traditional MapReduce API lacks. This presentation will show how we used Spark to solve a complex COPA problem.

Speakers

Mahmoud Parsian

Illumina, Inc.

Mahmoud Parsian, Ph.D. in Computer Science, is a practicing software professional with 30 years of experience as a developer, designer, architect, and author. For the past 15 years, he has been involved in Java server-side, databases, MapReduce, Hadoop, Spark, and distributed...

Cancer Outlier Profile Analysis Using Spark pdf

Tuesday May 10, 2016 9:00am - 9:50am PDT
Georgia B

Operations-Use Cases, Intermediate

11:20am PDT

A Java Implementer’s Guide to Boosting Apache Spark Performance - Tim Ellison, IBM

Apache Spark has rocked the big data landscape, becoming the largest open source big data community with over 750 contributors from more than 200 organizations. Spark’s core tenets of speed, ease of use, and a unified programming model fit neatly with the high performance, scalable, and manageable characteristics of modern Java runtimes. In this talk Tim Ellison, a JVM developer at IBM, shows some of the unique Java 8 capabilities in the JIT compiler, fast networking, serialization techniques, and GPU off-loading that deliver the ultimate big data platform for solving business problems. Tim will demonstrate how solutions, previously infeasible with regular Java programming, become possible with this high performance Spark core runtime, enabling you to solve problems smarter and faster.

Speakers

Tim Ellison

Tim Ellison is currently a Senior Technical Staff Member with IBM's Java Technology Centre in the UK. He has worldwide responsibility for Open Source Engineering in the Java SDK underpinning a broad selection of IBM's flagship products. He is a Member of the Apache Software Foundation...

Tuesday May 10, 2016 11:20am - 12:10pm PDT
Georgia B

Operations-Use Cases, Intermediate

2:00pm PDT

Using Apache Big Data Stack to Analyse Storm-Scale Numerical Weather Prediction Data - Suresh Marru, Indiana University

This talk will discuss the adaptation of Apache Big Data technologies to analyze large, self-described, structured scientific data sets. We will present initial results for the problem of analyzing petabytes of weather forecasting simulation data produced as part of the National Oceanic and Atmospheric Administration’s annual Hazardous Weather Testbed. The challenge is to enable weather researchers to perform investigative queries over the full forecast simulation outputs to find the signatures of severe weather phenomena like tornadogenesis. Given the size of the data and the complexity of weather phenomena, these data sets are candidates for exploration by machine learning techniques that can identify heretofore unknown relationships among the dozens of weather parameters generated by the simulations, guiding researchers in developing new scientific models.

Speakers

Suresh Marru

Member, Indiana University

Suresh Marru is a Member of the Apache Software Foundation and is the current PMC chair of the Apache Airavata project. He is the deputy director of Science Gateways Research Center at Indiana University. Suresh focuses on research topics at the intersection of application domain...

Tuesday May 10, 2016 2:00pm - 2:50pm PDT
Georgia B

Operations-Use Cases, Intermediate

10:50am PDT

Tailored for Spark - Petr Igrevski, eBay

We went big with Spark at eBay. Let us tell you the story of how we built a custom-tailored Spark system leveraging cloud and disaggregated storage. Watch us demonstrate our Spark developer experience as we walk you through our custom Spark-as-a-service offering. Come and learn how eBay embraced Spark, how we created a delightful environment for our data developers, and how we use this environment today.

Speakers

Petr Igrevski

Tailored for Spark pdf

Wednesday May 11, 2016 10:50am - 11:40am PDT
Plaza B

Operations-Use Cases, Intermediate

4:10pm PDT

Data Science for the Datacenter: Analyzing Logs with Apache Spark - William Benton, Red Hat, Inc

Contemporary applications and infrastructure software leave behind a tremendous volume of metric and log data. This “digital exhaust” is inscrutable to humans and difficult for computers to analyze, since it is vast, complex, and not explicitly structured.

In this session, Will Benton will introduce the log processing domain and give you practical advice for using Apache Spark to analyze log data, including data engineering techniques to impose structure on disparate log sources; data science approaches to detect infrastructure failures; language-processing techniques to characterize the text of log messages; best practices for tuning Spark and using newer Spark features; and how to visualize your results. You’ll learn from Benton’s experience developing applications that analyze the vast log data generated within Red Hat’s network and leave well-prepared to analyze your own logs.

Speakers

William Benton

Manager, Software Engineering and Sr. Principal Engineer, Red Hat, Inc

William Benton leads a team of data scientists and engineers at Red Hat, where he has applied machine learning to problems ranging from forecasting cloud infrastructure costs to designing better cycling workouts. His current focus is investigating the best ways to build and deploy...

Data Science for the Datacenter Analyzing Logs with Apache Spark pdf

Wednesday May 11, 2016 4:10pm - 5:00pm PDT
Plaza B

Operations-Use Cases, Intermediate

5:10pm PDT

Data Management at Scale - Tom Barber, Meteorite Consulting

Apache OODT is relatively easy to get up and running with the RADiX distribution, but how do you administer it at scale?

Managing a data management cluster can be daunting, especially when it’s distributed around the globe in various data centres. We’ll take a look at options for large-scale distributed rollouts of Apache OODT across multiple continents and how to connect, support and administer them, maximising the throughput of the system and ensuring users have access to all the data they require.

Container technology has drastically altered the DevOps landscape. Using service orchestration tools and Apache Mesos to maintain your cluster can make managing OODT relatively easy and infinitely scalable. We’ll also look at how to connect it to other data services. Find out more in a live (seat of the pants) demo.

Speakers

Tom Barber

Technical Director, Spicule LTD

Tom Barber is the director of Meteorite BI and Spicule BI. A member of the Apache Software Foundation and regular speaker at ApacheCon, Tom has a passion for simplifying technology. The creator of Saiku Analytics and open source stalwart, when not working for NASA, Tom currently deals...

Wednesday May 11, 2016 5:10pm - 6:00pm PDT
Plaza B

Operations-Use Cases, Intermediate