Apache: Big Data 2016: Full Schedule

5:00pm PDT

Registration

Sunday May 8, 2016 5:00pm - 7:00pm PDT
Georgia Foyer

7:30am PDT

Breakfast

Monday May 9, 2016 7:30am - 9:00am PDT
Regency Foyer

7:30am PDT

Registration

Monday May 9, 2016 7:30am - 9:00am PDT
Georgia Foyer

8:00am PDT

Technology Showcase

Monday May 9, 2016 8:00am - 4:10pm PDT
Regency Foyer

9:00am PDT

Welcome & Opening Remarks - Rich Bowen, Executive Vice President, Apache Software Foundation

Speakers

Rich Bowen

Open Source Strategist, AWS

Rich Bowen has been involved in open source since before we started calling it that. He's a member of the Apache Software Foundation, where he currently serves as a board member and VP Conferences. Rich is an Open Source Strategist at AWS.

Monday May 9, 2016 9:00am - 9:05am PDT
Regency CD

Keynote

9:05am PDT

Keynote: How Netflix Leverages Big Data - Brian Sullivan, Director of Streaming Analytics, Netflix

Netflix is the world's leading internet television network. That didn't happen by accident or simple fortune - we are data-driven as part of our culture, and have built the tools needed to navigate the uncharted waters of delivering internet video at scale and becoming the first truly global storyteller in movies and television.

Speakers

Brian Sullivan

Director of Streaming Analytics, Netflix

Brian Sullivan is the Director of the Streaming Data Science and Engineering team at Netflix, the world’s leading Internet television network. His team builds analytic systems and delivers insight into the streaming activity across hundreds of client devices, world-class server...

Monday May 9, 2016 9:05am - 9:25am PDT
Regency CD

Keynote

9:30am PDT

Keynote: How Enterprises are Leveraging Open Source Analytics Platforms for Making Game Changing Decisions - Luciano Resende, Architect, Spark Technology Center, IBM

In this keynote, Luciano Resende, Architect, Spark Technology Center at IBM, will showcase open source analytics platforms. Luciano will also discuss how different organizations are leveraging them to upend their competition and enable new use cases.

Speakers

Luciano Resende

Architect, Spark Technology Center, IBM

Luciano Resende is an Architect in IBM Analytics. He has been contributing to open source at the ASF for over 10 years, is a member of the ASF, and is currently contributing to various big data related Apache projects including Spark, Zeppelin, and Bahir. Luciano is the project chair for...

Monday May 9, 2016 9:30am - 9:45am PDT
Regency CD

Keynote

9:50am PDT

Keynote: It Takes a Village: Making Data Projects Work - Amy Gaskins, Big Data Project Director

We've all seen an innovative data project that fails and all too often the reason isn't a lack of technical skill among the team members. In order to succeed in complex organizations, data project teams require both diversity and versatility (and their sources aren't always where you might expect). From the battlefield to the boardroom, Amy's experience demonstrates that incongruous teams can achieve remarkable results.

Speakers

Amy Gaskins

Amy Gaskins has previously worked for NOAA and MetLife. Amy was an Assistant Vice President in MetLife’s Global Technology & Operations, managing data science projects in Europe, the Middle East, and North Africa. In her previous government service, Amy spent over 10 years as...

It Takes a Village pdf

Monday May 9, 2016 9:50am - 10:10am PDT
Regency CD

Keynote

10:10am PDT

Coffee Break

Monday May 9, 2016 10:10am - 10:40am PDT
Regency Foyer

10:40am PDT

A Faster Way for Faster Workflows - Ken Krugler, Scale Unlimited

Cascading is a popular open source project that makes it easier to create workflows for processing big data. In the past these always ran on top of Hadoop, but now there’s a new option - run them using Flink, a fundamentally stream-oriented dataflow engine that takes full advantage of available RAM.

In this presentation, Ken Krugler will briefly describe Flink, and then discuss a real-world example of converting a complex workflow (100+ jobs, NLP processing of text, SVM-based classification, etc.) from Hadoop to Flink.

Speakers

Ken Krugler

Scale Unlimited

Ken Krugler is a veteran entrepreneur, developer and instructor. He is the president of Scale Unlimited, a provider of consulting and training services for big data analytics, search, and machine learning using Hadoop, Cascading, Mahout, Cassandra and Solr. Ken is an Apache Tika committer...

A Faster Way for Faster Workflows pdf

Monday May 9, 2016 10:40am - 11:30am PDT
Plaza C

Faster-Better, Intermediate

10:40am PDT

Seven Habits of a Highly Effective Big Data Programmer - Rekha Joshi, Intuit Inc

With examples of Big data application use cases, this talk will delve into the Seven Habits of a Highly Effective Big Data Programmer:
1. Not just linearly: Sometimes doing more of the same helps, while other times what is required is to break the mould
2. Explore: Whenever you are stuck, explore
3. Design and Redesign: Do it in at least 3 different ways; debate
4. Newton SAW the apple: Really observe what happens. Check the metrics, performance, security. Monitor everything! Every technology has a strong card and an Achilles heel. Deliberate on fitting!
5. Evaluate and Reevaluate: Technology will change, so will your customer needs
6. Get Your Mathematics KungFu on: Being savvy with napkin mathematics/modeling can take the scare out of big numbers
7. Networking is the King: In distributed computing, it's distribution that’s critical. Often the edge goes to whoever understands networking better

Speakers

Rekha Joshi

Principal Software Engineer, Intuit

Rekha Joshi is a Principal Software Engineer at Intuit, and is getting amazing work done in finance on the Big data ecosystem. Previously at Yahoo!, she worked on Apache Hadoop since its initial versions. She has worked in diverse domains of advertising, supply chain and research. She is an open...

Seven Habits Of Highly Effective Big Data Programmers pdf

Monday May 9, 2016 10:40am - 11:30am PDT
Georgia A

Interfacing with Big Data, Any

10:40am PDT

Building Distributed Systems Using Apache Helix - Aditya Auradkar, LinkedIn

Building distributed data systems is hard! Especially systems that store and process vast quantities of data and yet need to be operable. Problems like partition management, resource distribution, state management, and leader election are often solved in isolation by various big data systems. Wouldn’t it be nice if we had one mechanism with which to model these concepts across systems?

Apache Helix aims to do exactly that. It is a generic cluster management framework that can be used to manage resources spread across a pool of nodes. This talk will provide an overview of Apache Helix and how different data systems at LinkedIn leverage it to solve some of their hardest problems in a uniform manner.

Speakers

Aditya Auradkar

Engineering Manager, Uber

Aditya manages the Streaming Data platform team at Uber. Powering pub-sub style event transport, streaming/batch analytics and ingestion are some examples of use-cases. Previously at LinkedIn, he managed the Apache Kafka engineering team and was one of the earliest members of the...

Building Distributed Systems Using Apache Helix pdf

Monday May 9, 2016 10:40am - 11:30am PDT
Regency A

Managing Distributed Systems, Intermediate

10:40am PDT

How ODPi Leveraged Apache Bigtop to Get to Market Faster (and You Can Too!) - Roman Shaposhnik & Konstantin Boudnik, Pivotal Inc.

Apache Bigtop has always tried to be to the Apache Big Data ecosystem what Debian has been to the Linux universe, so it is no surprise that ODPi.org has leveraged it to produce its first official release. Come get an overview of the origins of Apache Bigtop, why organizations like ODPi, Cloudera, Wandisco, and Amazon Web Services rely on Bigtop for their own big data component distribution efforts, and where the project is going after its last 1.0 release. You will also learn how contributions from ODPi members are helping Bigtop get even stronger and provide an integration platform for the next generation of big data technologies.

Speakers

Konstantin Boudnik

CEO, Memcore

Dr. Konstantin Boudnik, co-founder and CEO of Memcore Inc, is one of the early developers of Hadoop and a co-author of Apache BigTop, the open source framework and the community around creation of software stacks for data processing projects. With more than 20 years of experience in...

Roman Shaposhnik

Director of Open Source, Linux Foundation

Apache Software Foundation and Data, oh but also unikernels

Monday May 9, 2016 10:40am - 11:30am PDT
Plaza B

Math & Standards, Intermediate

10:40am PDT

The Evolution of Apache Kylin: Realtime and Plugin Architecture in Kylin2 - Luke Han, Apache Kylin

After a successful MOLAP implementation, Apache Kylin’s evolution is turning to enabling realtime analysis, supporting different input and output data sources, and leveraging different computing engines. In Apache Kylin2, the newly designed architecture supports pluggable adaptors for Hive, SparkSQL, Kafka and others, and makes it possible to store data in storage systems other than HBase, such as Kudu. This session will introduce the details of these changes and the upcoming features. It will also cover one production use case that already uses streaming.
Agenda:
1. Apache Kylin Overview
2. Plugin Architecture
3. Streaming Cubing
4. Realtime Analysis
5. Use Cases.

Speakers

Luke Han

Co-Founder & CEO, Kyligence

Luke Han is Co-Founder and CEO at Kyligence, and the co-creator and VP of the open source Apache Kylin project, contributing his passion to driving the project's strategy, roadmap and product design. For the past few years he has been working on growing Apache Kylin's community...

Monday May 9, 2016 10:40am - 11:30am PDT
Regency E

New Projects, Intermediate

10:40am PDT

Big Data DumbOps - Kevin Monroe, Canonical

In this talk, Kevin will explore the idea of Big Data DumbOps -- not "dumb" because standing up a Big Data stack is easy, but "dumb" because it should be. Few people give much thought to apt-get install foo. Why can’t ’foo’ be a Big Data analytics stack, complete with ingestion, processing, and visualization components? The hard part with solving Big Data problems is in the deployment and configuration of all the services that need to work together (NameNodes, DataNodes, ResourceManagers, oh my). Wouldn’t it be great if there was an easy way to model a Big Data platform, stand that up in a cloud, and get down to business? "Yes" is the right answer, and Juju does just that. Kevin will cover some of the Big Data services available in the Juju ecosystem (Hadoop, Spark, Kafka, etc) and then show how easily these can be deployed to make way for the real fun -- solving Big Data problems.

Speakers

Kevin Monroe

Canonical

During his tenure at IBM, Kevin’s projects covered a wide range of development, from tiny embedded operating systems to Linux enablement of the POWER8 hardware platform. Kevin moved to Canonical in 2014 with his focus set on modeling workload deployments at scale. He found his niche...

Big Data DumbOps pdf

Monday May 9, 2016 10:40am - 11:30am PDT
Georgia B

Operations-Use Cases, Intermediate

10:40am PDT

Apache Hadoop 3 Current Status - Akira Ajisaka, NTT DATA

Do you want a Hadoop 3 release? It has been over 4 years since Hadoop 3 and Hadoop 2 diverged, and there are a lot of great improvements in Hadoop 3, such as Shell Script Rewrite and MapReduce Native Optimization. Once Hadoop 3 is released, users can enjoy the benefits of the new features.
In this session, we will introduce the new features and incompatible changes in Hadoop 3, and how the release is being discussed in the Apache Hadoop community. In addition, Akira Ajisaka would like to discuss releasing Hadoop 3 with the participants here if possible.

Speakers

Akira Ajisaka

Software Engineer, NTT DATA Corporation

Akira Ajisaka is a software engineer working at NTT DATA, Japan. He belongs to OSS Professional Services team and deploys and operates Hadoop clusters for customers. He sometimes troubleshoots them by investigating source code and creating patches to fix the problem. He is an Apache...

Apache Hadoop 3 Current Status Ajisaka pdf

Monday May 9, 2016 10:40am - 11:30am PDT
Regency B

State-Future of $foo, Intermediate

10:40am PDT

Streaming SQL with Apache Calcite - Julian Hyde, Hortonworks

With the rise of the Internet of Things (IoT) and low-latency analytics, streaming data becomes ever more important. Surprisingly, one of the most promising approaches for processing streaming data is SQL. In this presentation, Julian Hyde shows how to build streaming SQL analytics that deliver results with low latency, adapt to network changes, and play nicely with BI tools and stored data. He also describes how Apache Calcite optimizes streaming queries, and the ongoing collaborations between Calcite and the Storm, Flink and Samza projects.

Speakers

Julian Hyde

Julian Hyde is an expert in query optimization and in-memory analytics. He is PMC chair of Apache Calcite, an engine for query optimization and data virtualization. He also founded Mondrian, the most popular open source OLAP engine. He is an architect at Hortonworks.

Monday May 9, 2016 10:40am - 11:30am PDT
Plaza A

Streams, Intermediate

11:40am PDT

Migrating Hundreds of Hadoop Pipelines into Docker Containers - Noa Resare, Spotify

Spotify maintains hundreds of big data pipelines built over a number of years, most of which run one or more transformations on our 1800 node on-premise Hadoop cluster. There has been steady evolution with regard to languages, frameworks and development strategies over those years, and the result is a highly heterogeneous set of pipelines with lots of specific demands on the execution environment. To ensure stability while encouraging innovation, we are now leveraging Docker to contain some of the complexity and have a unified interface for the scheduling infrastructure. This talk is all about what we have learned in the process and how Spotify’s experience in running a large fleet of Docker containers for production services has helped shape our efforts.

Speakers

Noa Resare

Free Software Ombudsman, Spotify

Noa Resare is a senior engineer and the Spotify Free Software Ombudsman. Noa is an accomplished public speaker who has given talks at conferences such as Cloud Open, USENIX LISA and LinuxCon on a wide variety of technical subjects.

Monday May 9, 2016 11:40am - 12:30pm PDT
Plaza C

Faster-Better, Intermediate

11:40am PDT

Building Large Scale Applications in Apache Hadoop YARN with Apache Twill - Henry Saputra & Terence Yim, Apache Software Foundation

Apache Twill (incubating) is a new Apache Incubator project that provides a higher-level abstraction for building distributed applications on top of Apache Hadoop YARN. Developing distributed applications using YARN is hard because YARN does not provide higher-level APIs, and a lot of boilerplate code needs to be duplicated to deploy each application. YARN applications are usually developed by framework developers, such as those of Apache Flink or Apache Spark, who need to leverage YARN as the resource manager for deploying the framework in a distributed way.
Using Apache Twill, application developers just need to know the basic concepts of the Java programming model when using the Apache Twill APIs, so they can focus on solving business problems. In this talk I would also like to present the example of the Cask Data Application Platform (CDAP), which heavily uses Apache Twill for resource management.

Speakers

Henry Saputra

Software Engineer, ASF

Member of the Apache Software Foundation (ASF) PMC, Committer, and contributor to several Apache Software Foundation projects.

Terence Yim

Software Engineer, Cask Data Inc.

Terence Yim is a Software Engineer at Cask, responsible for designing and building the Cask Data Application Platform (CDAP). He is also the lead developer and a PPMC member of the Apache Twill and Apache Tephra projects. Prior to joining Cask, Terence worked at both LinkedIn and...

Building Large Scale Applications in Apache Hadoop YARN with Apache Twill pdf

Monday May 9, 2016 11:40am - 12:30pm PDT
Georgia A

Interfacing with Big Data, Intermediate

11:40am PDT

We're Watching You: An Introduction to the Apache Unomi Project - Nikhil Patel, Jahia

Nikhil Patel of Jahia will present an example of tracking that respects customers’ privacy. This presentation will first address the problems with existing tracking and privacy policies. In addition, Nikhil will provide an introduction to the new Apache Unomi project, currently in incubation. Unomi is a reference implementation of the OASIS ConteXt Server specification aimed at standardizing personalization technologies while addressing privacy issues at the same time.

Speakers

Nikhil Patel

Sr. Solutions Architect, Jahia

We are watching you (Apache Unomi) by Nikhil Patel pdf

Monday May 9, 2016 11:40am - 12:30pm PDT
Regency A

Managing Distributed Systems, Any

11:40am PDT

Using the SDACK Architecture to Build a Big Data Product - Yu-Hsin Yeh, Trend Micro

You have definitely heard about the SMACK architecture, which stands for Spark, Mesos, Akka, Cassandra, and Kafka. It’s especially suitable for building a lambda architecture system. But what is SDACK? It’s very similar to SMACK, except the “D” stands for Docker. While SMACK is an enterprise-scale solution with multi-tenant support, the SDACK architecture is particularly suitable for building a data product. In this talk, I’ll discuss the advantages of the SDACK architecture, and how Trend Micro uses it to build an anomaly detection data product. The talk will cover:
1) The architecture we designed based on SDACK to support both batch and streaming workloads.
2) The data pipeline built on Akka Stream, which is flexible, scalable, and self-healing.
3) The Cassandra data model designed to support time series data writes and reads.

Speakers

Evans Ye

ASF member, Apache Bigtop Committer/PMC member/Former VP, Director of Taiwan Data Engineering Association, Apache Software Foundation

Yu-Hsin Yeh (Evans Ye) is a former VP, and currently a committer and PMC member, of Apache Bigtop. He loves to code, automate things, and tackle big data challenges. Aside from engineering stuff, he is also an enthusiast in giving talks to share software innovations and cutting-edge technologies...

Using the SDACK Architecture to Build a Big Data Product pdf

Monday May 9, 2016 11:40am - 12:30pm PDT
Plaza B

Math & Standards, Any

11:40am PDT

Apache Trafodion Brings Operational Workloads to Hadoop - Rohit Jain, Esgyn

Apache Trafodion is a world class Transactional SQL RDBMS running on HBase/Hadoop, currently in Apache incubation.

In this talk we will discuss:
• How operational workloads are different from BI and analytical workloads
• The operational (OLTP & Operational Data Store) use cases Trafodion addresses
• Why Trafodion is the right solution for these use cases. That is, what is the recipe for a world class database engine, and how Trafodion implements the ingredients that make up that recipe:
1. Time, money, and talent!
2. World class query optimizer
3. World class parallel data flow execution engine
4. World class distributed transaction management system
• Other important aspects such as performance, scale, availability, and future directions

Speakers

Rohit Jain

CTO, Esgyn

Rohit Jain is Co-Founder and CTO at Esgyn, an open source database company. Rohit provided the vision behind Apache Trafodion, an enterprise-class MPP SQL Database for Big Data, donated to the Apache Software Foundation by HP in 2015. A veteran database technologist over the past...

Apache Trafodion Brings Operational Workloads to Hadoop Jain pdf

Monday May 9, 2016 11:40am - 12:30pm PDT
Regency E

New Projects, Intermediate

11:40am PDT

Big Data in Biology - Omkar Reddy, Dhirubhai Ambani University, India

Big data has been a key player in data mining and analytics. Large amounts of data are shared and dumped on the internet every day. Our body can be considered a big mine of data that has been explored for years. In this presentation we will review recent discoveries and implementations of big data in biology. We will see the amount of data a single genome of a single protein produces and how useful it is for studying and finding cures and preventive measures for many diseases and viruses. We will also look at different sectors of biology, the data produced in each of them, and the significance of big data in studying this data. We will then look at potential technologies that can be engineered using big data analytics tools and data mining to predict, track and cure diseases, and at how big data has influenced synthetic biology.

Speakers

Omkar Reddy

Dhirubhai Ambani Institute of Information and Communication Technology

I am Omkar Reddy, a student pursuing my B.Tech in Information and Communication Technology at Dhirubhai Ambani University, India. I am fond of computer science and algorithms. I have been involved with the Apache Open Climate Workbench project. I am a member of the Project Management...

Big Data in Biology pdf

Monday May 9, 2016 11:40am - 12:30pm PDT
Georgia B

Operations-Use Cases, Beginner

11:40am PDT

Recent Development in HBase - Zhihong Yu, Hortonworks

HBase has been powering a variety of applications over the past 8 years.
In this presentation, I will talk about the following recent developments:

Enhancement to compaction: Properly selecting and tuning the compaction strategy is at the heart of providing consistent performance. The FIFO compaction policy collects expired store files. Since no real compaction is done, we do not use CPU and IO (disk and network). This results in improved throughput and latency for both writes and reads.

Region normalization feature: This serves time series data well (in combination with the FIFO compaction policy) where a non-default TTL is specified. As aging data is archived, adjacent empty regions are continuously merged. This keeps the table in well-managed shape.

Bulk Loaded HFile Replication: HBase replication is enhanced to support replication of bulk loaded data. This completes the disaster tolerance scenario.
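
The FIFO idea described above can be sketched in a few lines. This is only an illustrative model under stated assumptions, not HBase code: `StoreFile`, `select_expired` and `fifo_compact` are hypothetical names, and a file is treated as expired when its newest cell is older than the TTL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoreFile:
    """Hypothetical stand-in for a store file; not an HBase class."""
    name: str
    max_timestamp_ms: int  # timestamp of the newest cell in the file


def select_expired(files, now_ms, ttl_ms):
    """Return the files whose newest cell is already older than the TTL."""
    return [f for f in files if now_ms - f.max_timestamp_ms > ttl_ms]


def fifo_compact(files, now_ms, ttl_ms):
    """Drop fully expired files; nothing is read, merged or rewritten."""
    expired = select_expired(files, now_ms, ttl_ms)
    expired_names = {f.name for f in expired}
    kept = [f for f in files if f.name not in expired_names]
    return kept, expired


files = [StoreFile("a", 1_000), StoreFile("b", 5_000), StoreFile("c", 9_000)]
kept, dropped = fifo_compact(files, now_ms=10_000, ttl_ms=4_000)
```

Because whole files are discarded rather than rewritten, the only work in this sketch is a metadata comparison per file, which is the source of the CPU and IO savings the abstract mentions.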

Speakers

Zhihong Yu

Staff Engineer, VMware

I have been an Apache HBase PMC member for five and a half years. I am also a committer for Apache Slider and Apache Bahir. I contribute to Apache Phoenix and Apache Spark. I have presented at the past 3 ApacheCon NA events.

Recent Development in HBase pdf

Monday May 9, 2016 11:40am - 12:30pm PDT
Regency B

State-Future of $foo, Advanced

11:40am PDT

SAMOA: A Platform for Mining Big Data Streams - Nicolas Kourtellis, Telefonica

In this talk, Nicolas Kourtellis will introduce Apache SAMOA (Scalable Advanced Massive Online Analysis), an open-source platform for mining big data streams (http://samoa.incubator.apache.org). Apache SAMOA provides a collection of distributed streaming algorithms for data mining tasks such as classification, regression, and clustering. The models built can be updated as new data arrive without the need to define data batches or update frequencies. The platform features a pluggable architecture that can run on existing and well-tested distributed stream processing engines such as Storm, S4, Samza and Flink, for scalability and fault tolerance.

Speakers

Nicolas Kourtellis

Researcher, Telefonica I+D

Nicolas Kourtellis is a Researcher at Telefonica Research. Previously he was a Researcher in the Web Mining Research Group at Yahoo Labs, Barcelona. He holds a Ph.D. in Computer Science and Engineering from the University of South Florida (2012), an MSc in Computer Science from the...

SAMOA ApacheCon BigData NA 2016 pdf

Monday May 9, 2016 11:40am - 12:30pm PDT
Plaza A

Streams, Intermediate

12:30pm PDT

Lunch (Attendees on Own)

Monday May 9, 2016 12:30pm - 2:00pm PDT
TBA

2:00pm PDT

Accelerating Cloud with FPGAs - Eric Fukuda, University of Toronto

In our project, we are trying to make easy-to-use, scalable multi-FPGA fabrics available in data centers. Several recent projects try to employ FPGAs for accelerating data center applications. However, those projects focus on accelerating specific applications rather than making their platforms usable for general developers. To make FPGAs available to a wide range of application developers, we are trying to virtualize FPGAs in data centers. We use OpenStack for allocating FPGAs placed in a data center, Apache Zookeeper to distribute the jobs across the FPGAs, and Apache Drill as a prospective application to use distributed FPGAs. As this is a work in progress, our first goal was to achieve functionality. We have observed good scalability and expect the performance to improve as we incorporate SQL acceleration techniques for FPGAs.

Speakers

Eric Fukuda

University of Toronto

Eric is a Postdoctoral Fellow at the Department of Electrical and Computer Engineering, University of Toronto. During his Ph.D. at Hokkaido University, he worked on a project to accelerate memcached with an FPGA. He is interested in accelerating large-scale databases with FPGAs.

Monday May 9, 2016 2:00pm - 2:50pm PDT
Plaza C

Faster-Better, Any

2:00pm PDT

SMACK Stack - Data Done Right - Stefan Siprell, codecentric AG

A talk covering the best-of-breed platform consisting of Spark, Mesos, Akka, Cassandra and Kafka. SMACK is more of a toolbox of technologies to allow the building of resilient ingestion pipelines, offering a high degree of freedom in the selection of analysis and query possibilities and baked in support for flow-control. More and more customers are using this stack, which is rapidly becoming the new industry standard for Big Data solutions.

Speakers

Stefan Siprell

General Manager, codecentric AG

Anything which required integrating has been integrated by Stefan in his career. Currently he is working as an Architect at codecentric. His projects have become much more demanding, which has resulted in more resilient platforms supporting previously unimaginable data transfer...

Monday May 9, 2016 2:00pm - 2:50pm PDT
Georgia A

Interfacing with Big Data, Intermediate

2:00pm PDT

Apache HAWQ Resource Management and Integration with YARN - Yi Jin, Pivotal

Apache HAWQ is an advanced analytics MPP database for enterprises, featuring exceptional analytics performance, robust ANSI SQL compliance and Hadoop ecosystem integration and manageability. The integration of Apache HAWQ and YARN is the key to cluster scaling flexibility and on-demand resource consumption when driving high performance concurrent MPP analytics. In this presentation, Yi Jin will introduce the resource management and fault-tolerance components of Apache HAWQ, including their internal / external behaviours, their architecture, and the important technical challenges dealt with by Apache HAWQ due to the conflict between the reality of a dynamic cluster with restricted, variable resources and the expectation of high query performance and concurrency.

Speakers

Yi Jin

Pivotal

Yi Jin obtained his Ph.D. degree in computer theory and software from Beihang University (BUAA) in 2009. He is now working for Pivotal and is mainly responsible for contributing to and maintaining the Apache HAWQ resource management and fault tolerance components. He has seven-year developing...

Apache HAWQ Resource Management and Integration with YARN pdf

Monday May 9, 2016 2:00pm - 2:50pm PDT
Regency A

Managing Distributed Systems, Any

2:00pm PDT

Convergence Rank and Its Applications - Dalmo Cirne, mParticle, Inc.

In this paper we explore an algorithm to determine the relevance of each item in a finite set of items in reference to each other, where in order to address an item you have to first go through a convergence or proxy item. If we imagine a media streaming company (convergence item) and all its available genres for playback (items in a finite set), how relevant is each music genre at different moments in time? Or a sports media company and the sports it covers: how does the relevance of each sport change throughout the year as sports seasons begin and end?

The applications of this algorithm are immediate and plentiful in possibilities: When should investments be made during the lifecycle of a sports season? Where to allocate resources in a financial portfolio based on ranks and trends? How to compensate artists given the relevance of the traffic they are generating?

Speakers

Dalmo Cirne

Senior Director of Mobile Engineering, mParticle, Inc.

Dalmo Cirne is a software engineer and mathematician with more than 10 years of experience (in startups and large corporations) creating applications, from their conception to architecture definition, development, deployment, and adoption by users worldwide.

Convergence Rank and its applications pdf

Monday May 9, 2016 2:00pm - 2:50pm PDT
Plaza B

Math & Standards, Intermediate

2:00pm PDT

Data Science Applied: A Utilities Sector Case Study - Bram Steurtewagen

Automated Metering Infrastructure (AMI) is gaining traction within the utilities sector and has brought with it numerous improvements in all related fields. Specifically in tariff setting and demand response models, classification of smart meter readings into load profiles helps find the right segments to target. The methodology explained in this tutorial combines commercial, government and open data with the internal company data to accurately predict the load profile of a new customer using high performing classification models in both R and PySpark. Load profiles are generated using a clustering algorithm and are subsequently used as the dependent variable in our classification model. The results of this model are then scored and interpreted in a business context. During the entire process, possible business hurdles will be identified and solutions will be offered.

Speakers

Bram Steurtewagen

Ghent University

Bram Steurtewagen received his M.Sc. degree in Commercial Engineering (2013) and his M.Sc. degree in Marketing Analytics (2014) from Ghent University in Belgium. Since then, he has been pursuing a PhD in Marketing Analytics at the Faculty of Economics and Business Administration of...

Data Analytics Applied pdf

Monday May 9, 2016 2:00pm - 2:50pm PDT
Regency E

New Projects, Beginner

2:00pm PDT

Standing on Shoulders of Giants: Ampool Story - Milind Bhandarkar, Ampool, Inc.

Today, if unforeseen events change the decision model, we wait until the next batch model build for new insights. By extending fast “time-to-decisions” into the world of Big Data Analytics to get fast “time-to-insights”, applications will get what used to be batch insights in near real time. Enabling this is technology such as smart in-memory data storage, new storage class memory, and products designed to do one or more parts of an analysis pipeline very well. In this talk we describe how Ampool is contributing to, and building upon, Apache Geode & several other ASF projects (in Open Data Platform Initiative, ODPi) to allow Big Data analysis solutions to work together with a scalable smart storage class memory layer, so that fast & complex end-to-end pipelines can be built, closing the loop and providing dramatically lower time to critical insights.

Speakers

Milind Bhandarkar

Founder, Ampool

Milind Bhandarkar was the founding member of the team at Yahoo! that took Apache Hadoop from 20-node prototype to datacenter-scale production system. Parallel programming languages and paradigms have been his area of focus for over 20 years. He worked at several HPC companies, Yahoo...

Standing on Shoulders of Giants Ampool Story pdf

Monday May 9, 2016 2:00pm - 2:50pm PDT
Georgia B

Operations-Use Cases, Intermediate

2:00pm PDT

Apache Bigtop: Overview and 2016 Community Update - Nate DAmico, Reactor 8 & Konstantin Boudnik, Memcore

Apache Bigtop is setting the standard for the integration, testing and deployment of the leading open source Big Data components. In this presentation we will give an introductory overview of Bigtop and its origins, including usage in the wild with various commercial vendors and industry standards based on it. We will also cover how the project is evolving in 2016 and where the community is driving it with lots of new participating components being added, such as Apache Apex (Incubating), Apache Zeppelin (Incubating) and Flink.

Speakers

Konstantin Boudnik

CEO, Memcore

Nate DAmico

Nate has been working in the enterprise and mobile software industry for 14 years in various capacities. In recent years his tech efforts have focused around areas of mobile computer vision as well as the rise of the consumerization of IT Operations. Three years ago he started Reactor8... Read More →

Monday May 9, 2016 2:00pm - 2:50pm PDT
Regency B

State-Future of $foo, Any

2:00pm PDT

Will It Scale? The Secrets Behind Scaling Stream-processing Applications - Navina Ramesh, LinkedIn

Scaling stream processing applications is sometimes seen as akin to scaling batch processing applications. You may re-partition your input stream to scale throughput, similar to re-sharding a batch. However, it becomes challenging for “stateful” applications to “stay realtime”, as they frequently require fault-tolerant state management. Providing low-latency, fault-tolerant processing for high-volume input streams is fundamentally governed by the state-management primitives provided by the stream processing system. In this talk, we will discuss how such stateful applications are supported in open-source stream-processing systems such as Apache Flink, Spark Streaming and Apache Samza. We will then provide a deep dive into Apache Samza’s approach to state management and fault tolerance and discuss how it can be effectively used to scale stateful applications.
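To make the idea of partitioned, fault-tolerant state concrete, here is a minimal sketch in plain Python. It is not taken from the talk and does not use any Flink, Spark or Samza API: the names `partition_for`, `CountingTask` and `run` are hypothetical, and the in-memory `changelog` list only stands in for a durable, replicated log. It shows one common pattern, in which events are routed to a task by a stable hash of their key, each task keeps local state, and every state update is also appended to a changelog so that a replacement task can rebuild the state after a failure.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Stable hash, so the same key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class CountingTask:
    """Hypothetical per-partition task: local state plus a changelog of updates."""

    def __init__(self):
        self.state = defaultdict(int)
        self.changelog = []  # assumption: stand-in for a durable, replicated log

    def process(self, key: str) -> None:
        # Update local state, then record the new value in the changelog.
        self.state[key] += 1
        self.changelog.append((key, self.state[key]))

    @classmethod
    def restore(cls, changelog):
        """Rebuild a task's state by replaying its changelog from the start."""
        task = cls()
        for key, value in changelog:
            task.state[key] = value
        task.changelog = list(changelog)
        return task


def run(events, num_partitions: int):
    """Route each event to the task that owns its key."""
    tasks = [CountingTask() for _ in range(num_partitions)]
    for key in events:
        tasks[partition_for(key, num_partitions)].process(key)
    return tasks
```

Because all events for a given key land on the same task, adding partitions spreads the keys over more tasks, while the changelog bounds what has to be replayed to recover any one of them.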

Speakers

Navina Ramesh

Navina Ramesh started her career in Yahoo! India, where she contributed to scaling the Yahoo! Search clusters for 3 years. At LinkedIn, she has worked on developing the Feed Personalization pipeline and improved the caching and pagination models in the Feed Infrastructure. She has... Read More →

Will it Scale Final pdf

Monday May 9, 2016 2:00pm - 2:50pm PDT
Plaza A

Streams, Beginner

3:00pm PDT

Happier Developers and Happier Software Through Distributed Testing - Andrew Wang, Cloudera

A thorough unit test suite is a positive indicator of software quality. However, as the size of a test suite grows, its runtime can span hours or even days, to the point that it is unwieldy to run the full suite. Also, test runs at this scale are unlikely to succeed due to intermittent test failures. Together, these issues make the test suite less accessible to developers, which lowers developer productivity and decreases software quality.

Distributed testing offers a solution to these issues. Using our open-source distributed test infrastructure, we are able to speed up Apache Hadoop’s test suite by approximately 100x. This same framework is also in use by Apache HBase and Apache Kudu (incubating), with further projects planned.

In this talk, I will describe our distributed testing framework, how we use it at Cloudera, and how you could use this same framework for your project.

Speakers

Andrew Wang

Software Engineer, Cloudera

Andrew Wang is a software engineer at Cloudera on the HDFS team, where he has worked on projects including in-memory caching, transparent encryption, and erasure coding. Previously, he was a PhD student in the AMP Lab at UC Berkeley, where he worked on problems related to distributed... Read More →

distributed testing apache big data 2016 pptx

Monday May 9, 2016 3:00pm - 3:50pm PDT
Plaza C

Faster-Better, Intermediate

3:00pm PDT

Druid: Interactive Exploratory Analytics at Scale - Fangjin Yang, Imply

Cluster computing frameworks such as Hadoop or Spark are tremendously beneficial in processing and deriving insights from data. However, long query latencies make these frameworks sub-optimal choices to power interactive applications. Organizations frequently rely on dedicated query layers, such as relational databases and key/value stores, for faster query latencies, but these technologies suffer many drawbacks for analytic use cases. In this session, we discuss using Druid for analytics, and why the architecture is well suited to power analytic dashboards.

Speakers

Fangjin Yang

CEO, Imply

Fangjin is a co-author of the open source Druid project and a co-founder of Imply, a San Francisco based technology company. Fangjin previously held senior engineering positions at Metamarkets (acquired by Snap, Inc.) and Cisco. He holds a BASc in Electrical Engineering and a MASc... Read More →

Monday May 9, 2016 3:00pm - 3:50pm PDT
Georgia A

Interfacing with Big Data, Intermediate

3:00pm PDT

Managing Enterprise Hadoop Clusters with Apache Ambari - Jayush Luniya, Hortonworks

The primary stakeholders of an enterprise Hadoop environment need capabilities beyond simple provisioning of clusters. Apache Ambari is a 100% open-source platform for provisioning, managing, monitoring and upgrading of Hadoop clusters. Apache Ambari is being extensively used by the enterprise and Apache community as a cluster management platform.

In this session, you will learn how to use Apache Ambari for managing large scale Hadoop clusters. You will be introduced to a wide range of Ambari capabilities: Stack Orchestration, Stack Upgrades, Blueprints, Kerberos Automation, Ambari Metrics, Alerts and Views framework. This talk will include a technical deep dive into the sub-system architectures, along with short demonstrations to showcase these features. Finally, we will discuss the long-term roadmap and new capabilities planned in future Apache Ambari releases.

Speakers

Jayush Luniya

Hortonworks

Jayush Luniya is a Technical Lead on the Ambari team at Hortonworks. He is an active Apache Ambari committer and Apache Ambari PMC member. He has made significant contributions to the Ambari upgrade framework, Ambari stack orchestration and enabling Ambari support on Windows. At present... Read More →

Managing Enterprise Hadoop Clusters with Apache Ambari pdf

Monday May 9, 2016 3:00pm - 3:50pm PDT
Regency A

Managing Distributed Systems, Any

3:00pm PDT

Big Data Analytics Using R and PySpark for Business, Finance and Marketing - Dirk Van den Poel, Ghent University

In this talk, we share our experience in researching and practicing Business Analytics with a strong emphasis on predictive and prescriptive analytics. We present our findings using a series of platforms ranging from (1) large shared memory systems (e.g. for open-source R code, #rstats), over (2) dedicated Apache Spark clusters using Python Jupyter Notebooks to (3) very large HPC settings with hundreds of nodes (using HOD, https://github.com/hpcugent/hanythingondemand).
More specifically, we discuss our experience (a) running huge equity price-direction prediction models for S&P 100 stocks, (b) analyzing analytical CRM databases for a large retailer, and (c) researching the link between tweets and weblog data.

Speakers

Dirk Van den Poel

Professor of Data Analytics, Ghent University

Dirk Van den Poel (PhD) is Senior Full Professor of Data Analytics/Big Data at Ghent University, Belgium. He teaches courses such as Statistical Computing, Big Data, Predictive and Prescriptive Analytics. He co-authored 80+ international peer-reviewed publications in journals such... Read More →

UGent at Apache Big Data North America May 2016 v16 doorgestuurd b pdf

Monday May 9, 2016 3:00pm - 3:50pm PDT
Plaza B

Math & Standards, Intermediate

3:00pm PDT

Introduction to Apache Kudu (Incubating) for Timeseries Storage - Dan Burkert, Cloudera

Apache Kudu (Incubating) is a new columnar storage engine for the Hadoop ecosystem. Kudu is designed to handle the stresses of the modern analytics pipeline, enabling real time ingestion with instant querying capability at scale.

This talk will introduce Kudu, giving an overview of the architecture and internals. After discussing what makes Kudu different from existing Hadoop storage platforms, we will discuss why Kudu is particularly well suited for storing and querying large timeseries datasets. The talk will conclude by demonstrating a realtime timeseries analytics dashboard powered by Kudu.

Speakers

Dan Burkert

Dan Burkert is a software engineer at Cloudera and committer on Apache Kudu (Incubating). Prior to joining Cloudera, Dan worked on data processing pipelines for machine learning, search, and analytics. Dan received his bachelor’s degree from the University of Virginia.

Introduction to Apache Kudu pdf

Monday May 9, 2016 3:00pm - 3:50pm PDT
Regency E

New Projects, Intermediate

3:00pm PDT

Netflix Keystone - How We Built a 700B/day Stream Processing Cloud Platform in a Year - Peter Bakas, Netflix

Keystone processes over 700 billion events per day with at-least-once processing semantics in the cloud. We will explore in detail how we have modified and leveraged Kafka, Samza, Docker, and Linux at scale to implement a multi-tenant pipeline in the Amazon AWS cloud within a year.

* Pipeline Evolution & Architecture
* Why we chose Kafka, Samza, Docker
* How to effectively use these technologies together in the Cloud
* Alterations to Kafka & Samza
* Scaling and Managing Kafka, Samza & Docker
* Deployment & Monitoring details
* Fault tolerance and failover strategies
* Performance numbers

Speakers

Peter Bakas

Netflix

Peter leads the Real Time Data Infrastructure team at Netflix. His team is responsible for building common infrastructure to collect, transport, aggregate, process and visualize over 700 billion events a day. Prior to Netflix, Peter has built and led teams responsible for developing... Read More →

How We Built a Stream Processing Cloud Platform pdf

Monday May 9, 2016 3:00pm - 3:50pm PDT
Georgia B

Operations-Use Cases, Advanced

3:00pm PDT

Dockerized Hadoop Platform and Recent Updates in Apache Bigtop - Amir Sanjar, IBM & Yu-Hsin Yeh, Trend Micro

Apache Bigtop is a project that focuses on packaging, testing and configuration management solutions all around the Hadoop ecosystem. In this presentation, we’ll talk about how Bigtop Provisioner integrates with Docker Swarm, Docker Compose, and Docker Machine to give you the ability to run a fully distributed Hadoop cluster on Docker anywhere. In addition, the newly developed image pre-build feature substantially improves the user experience by cutting down the provisioning time to less than a minute. In the past few months, another exciting development in Bigtop has been the IBM PowerPC integration. So, to sum up the content of this talk:
1) How Bigtop Provisioner integrates with the Docker ecosystem to achieve multi-host Hadoop cluster deployment.
2) The integration of IBM PowerPC with Apache Bigtop.
3) Newly added Hadoop ecosystem components and some new features we’ve developed recently.

Speakers

Amir Sanjar

Sr. Software Eng, IBM - Apache Bigtop PMC

Amir Sanjar has many years of experience in big data software and solution development at companies including IBM and Canonical. He is the inventor of several patents in areas of enterprise solution automation and wireless/cell technology. Currently, he leads big data ecosystem and... Read More →

Evans Ye

ASF member, Apache Bigtop Committer/PMC member/Former VP, Director of Taiwan Data Engineering Association, Apache Software Foundation

Dockerized Hadoop Platform and Recent Updates in Apache Bigtop pdf

Monday May 9, 2016 3:00pm - 3:50pm PDT
Regency B

State-Future of $foo, Intermediate

3:00pm PDT

Generating Many Resources from One Set of Schemas with Apache Streams - Steve Blackmon, People Pattern

Apache has many good programming languages, databases, and analytics libraries. Most have some unique competency or value that justifies their application in certain situations. Use the right tool for the right job. However, mastering the data definition file formats of multiple platforms and keeping representations of your data (and partner data) current can be challenging and tedious.

Apache Streams (incubating) contains libraries and patterns for specifying, publishing, and inter-linking data schemas, and can convert data between the representation, format, and encoding preferred by supported platforms. The talk will cover using Streams to specify your object schemas, bind them across languages (Java, Scala), serializations (JSON, XML), databases (Cassandra, Elasticsearch, Mongo, HBase), and analytics tools (Spark, Pig, Hive), as well as re-use object definitions created by others.

Speakers

Steve Blackmon

VP Technology, People Pattern, Inc.

VP Technology at People Pattern, previously Director of Data Science at W2O Group, co-founder of Ravel, stints at Boeing, Lockheed Martin, and Accenture. Committer and PMC for Apache Streams (incubating). Experienced user of Spark, Storm, Hadoop, Pig, Hive, Nutch, Cassandra, Tinkerpop... Read More →

Generating Many Resources from One Set of Schemas with Apache Streams pdf

Monday May 9, 2016 3:00pm - 3:50pm PDT
Plaza A

Streams, Advanced

3:50pm PDT

Coffee Break

Monday May 9, 2016 3:50pm - 4:10pm PDT
Regency Foyer

4:10pm PDT

Delivering Realtime and Agile Analytics Using Apache Kafka, Spark & Drill - Neeraja Rentachintala, MapR Technologies

Data is the biggest asset in modern organizations, enabling them to build value-added products and services as well as optimize operations. Real time analytic pipelines, with a messaging system such as Apache Kafka to capture the data followed by a general purpose transformation layer such as Spark to process and analyze it, have become the prominent infrastructure for delivering relevant and timely information to a variety of users and applications. Given the extreme diversity of the data sources, an additional consideration for such pipelines is the agility to adapt to changes in the underlying structure of data without incurring a lot of development costs and missing SLAs. In this session, Neeraja will cover how Apache Drill’s ability to query complex and dynamically evolving datasets can complement these solutions, and the new use cases enabled by using Drill, Spark and Kafka together.

Speakers

Neeraja Rentachintala

Director of Product Management, MapR Technologies

As Sr Director of Product Management, Neeraja is responsible for the product strategy, roadmap and requirements of MapR SQL initiatives. Prior to MapR, Neeraja held numerous product management and engineering roles at Informatica, Microsoft SQL Server, Oracle and Expedia.com, most... Read More →

Monday May 9, 2016 4:10pm - 5:00pm PDT
Plaza C

Faster-Better, Beginner

4:10pm PDT

Next Gen Big Data Analytics with Apache Apex - Thomas Weise, DataTorrent

Apache Apex is a next gen big data analytics platform. Originally developed at DataTorrent, it comes with a powerful stream processing engine, a rich set of functional building blocks and an easy to use API for the developer to build real-time and batch applications. Apex runs natively on YARN and HDFS and is used in production in various industries. You will learn about the Apex architecture, including its unique features for scalability, fault tolerance and processing guarantees, programming model and use cases.

Speakers

Thomas Weise

CTO, Atrato.io

Thomas is Apache Apex PMC Chair and CTO at Atrato. Prior to founding Atrato he was Architect at DataTorrent and led the development of Apex from the beginning of the project. Before that he was a member of the Hadoop Team at Yahoo! and contributed to several of the big data ecosystem... Read More →

Next Gen Big Data Analytics with Apache Apex pdf

Monday May 9, 2016 4:10pm - 5:00pm PDT
Georgia A

Interfacing with Big Data, Intermediate

4:10pm PDT

YARN: A Resource Manager for Analytic Platform - Tsuyoshi Ozawa, NTT

Hadoop Yet Another Resource Negotiator (YARN) is a resource manager for processing big data. YARN can run various major distributed processing frameworks including not only MapReduce, but also Spark, Tez, Flink and so on to support various workloads.
I will talk about the architecture of YARN, how YARN manages resources in a cluster and how YARN is integrated with these processing frameworks. In particular, I will introduce best practices for maximizing the throughput of the processing frameworks on YARN, with the optimization techniques of Spark on YARN and Tez on YARN as examples. I will also talk about the areas where our YARN community needs more help and feedback.

Speakers

Tsuyoshi Ozawa

NTT

I’m a Research Engineer on topics in distributed computing working at NTT (Nippon Telegraph and Telephone Corporation), which is one of the largest carrier companies in Japan. I’ve been a committer and PMC member on the Apache Hadoop project. Prior to working on Hadoop, I researched the time... Read More →

YARN A Resource Manager for Analytic Platform pdf

Monday May 9, 2016 4:10pm - 5:00pm PDT
Regency A

Managing Distributed Systems, Intermediate

4:10pm PDT

ODPi: Advancing Open Data for the Enterprise - A Panel Discussion Moderated by Roman Shaposhnik, Pivotal Inc.

This panel will be an opportunity for members of the Open Data Platform Initiative to share the benefits of ODP with the Apache community.

Moderators

Roman Shaposhnik

Director of Open Source, Linux Foundation

Apache Software Foundation and Data, oh but also unikernels

Speakers

Milind Bhandarkar

Founder, Ampool

Alan Gates

Co-founder and Architect, Hortonworks

Alan is a founder of Hortonworks and an original member of the engineering team that took Pig from a Yahoo! Labs research project to a successful Apache open source project. Alan has done extensive work in Hive, including adding ACID transactions. Alan has a BS in Mathematics from... Read More →

Susan Malaika

Susan Malaika is Senior Technical Staff in IBM’s Open Technologies team focusing on data initiatives. Her background spans software development, data modeling, open data, creating and delivering workshops. Susan loves reading, writing and participating in meet-ups.

John Mertic

Director of Program Management, The Linux Foundation

John Mertic is the Director of Program Management for The Linux Foundation. Under his leadership, he has helped ASWF, ODPi, Open Mainframe Project, and R Consortium accelerate open source innovation and transform industries. John has an open source career spanning two decades, both... Read More →

Monday May 9, 2016 4:10pm - 5:00pm PDT
Plaza B

Math & Standards, Beginner

4:10pm PDT

Everyone Plays: Collaborative Data Science with Zeppelin - Trevor Grant, Market6

Data Science is best played as a team sport. Zeppelin facilitates this collaboration via a web based notebook interface to state-of-the-art big data tools (Flink, Spark, Hive, Cassandra, and many more), with custom visualization powered by AngularJS built in. Markdown allows for rich notation in-line with the code. Work can be shared seamlessly across the organization. Further, interactive visualizations can be shared with business analysts and sales reps, which is great for prototyping and proofs of concept. But the collaboration also runs between technologies, by leveraging the Zeppelin Context to share variables between contexts. E.g. the results of a Flink paragraph can be passed to a Spark paragraph; the best tool for the job can be used at each step in the analytics pipeline, and a data scientist who loves Scala Flink can easily work with a data scientist who loves pyspark.

Speakers

Trevor Grant

Director of Developer Relations, Arrikto

Trevor is the Director of Developer Relations at Arrikto and an international speaker excited to be back on the road after a 2 year COVID hiatus. He is also a member and involved with leadership of several projects at the Apache Software Foundation, PMC Chair of Apache Mahout, and... Read More →

Everyone Plays Collaborative Data Science with Zeppelin pdf

Monday May 9, 2016 4:10pm - 5:00pm PDT
Regency E

New Projects, Intermediate

4:10pm PDT

Data Science with News Headlines - Analyzing and Visualizing a Whole Decade - Christian Winkler & Stephanie Fischer, mgm Technology Partners GmbH

We will show how to use Apache tools to dig through unstructured text, analyze and visualize the data and iterate this to create a compelling visual experience and drill down the data. As visualizations we will use word clouds and histograms from D3.js. We will explain the whole procedure from getting, preparing and indexing real-world data to formulating the query and using the results.

Then we will present the animated visualization of the above data. We will demonstrate the flexibility of our Big Data solution and show how it can be adapted to specific users’ needs in realtime. Data-wise, we will also visualize Hacker News. We will elaborate on both the benefits and the limits of word clouds as a Big Data visualization tool.

We will give an outlook to possible other use cases (like mailing list mood detection, wikis etc.) and talk about detecting trends and outliers.

Speakers

Stephanie Fischer

Big Data, Agile and Change Management, mgm consulting partners

I concentrate on user-centricity of Big Data technologies. My focus is finding the questions really worth solving. I think Big Data has the potential to advance humanity into a desirable direction. I have a background in organizational development, agility and business analytics... Read More →

Christian Winkler

Enterprise architect, mgm technology partners GmbH

Christian has worked for 20 years with Internet technologies. Recently, he has focused on working with large amounts of data or many users. As big data applications become more and more popular, lots of applications evolve. Many aggregates have to be calculated to describe characteristics... Read More →

data science with news headlines pdf

Monday May 9, 2016 4:10pm - 5:00pm PDT
Georgia B

Operations-Use Cases, Beginner

4:10pm PDT

On the Bleeding Edge - Cassandra 3.4 and Beyond - Jonathan Haddad, DataStax

Cassandra is recognized as the best distributed database leveraging continuous availability and partition-tolerance for global deployments. With a strong open source history that began at Facebook to solve problems of absurdly massive scale, Cassandra has grown to be a huge project with a bright future. In this talk we will unpack exactly what that future is all about. With a brand new, high performance Secondary Index implementation, SSTable encryptions, a paradigm shift in architecture moving away from SEDA and towards threads per core, Materialized Views and Aggregations, Cassandra is maturing as a powerful front-runner on the bleeding edge of the NoSQL space.

Speakers

Jon Haddad

Apache Cassandra Committer & Tech Consultant, Rustyrazorblade Consulting

Jon Haddad is a Cassandra committer and a member of the Cassandra PMC. He’s worked with Cassandra at startups, DataStax, The Last Pickle, Apple, and now Netflix. Jon has twenty years as a software developer and database operator, has been using Cassandra for ten years and is widely... Read More →

On the Bleeding Edge Cassandra 3.4 and Beyond pdf

Monday May 9, 2016 4:10pm - 5:00pm PDT
Regency B

State-Future of $foo, Intermediate

4:10pm PDT

Designing Workflows with OODT - Tom Barber, Meteorite Consulting

When building a data management platform, flexible and effective workflows are key to the scalability and effectiveness of the platform.

OODT (originally developed by NASA JPL) has a very flexible and powerful workflow engine that is at the core of pretty much any data processing you will do within the platform, but understanding it can sometimes be a challenge.

In this talk we’ll take a deep dive into the guts of workflows inside OODT using CAS PGE to help lower the barrier to entry. We’ll run through a number of real world examples: how you build them, and how you deploy and trigger them.

We’ll also look at monitoring and feedback. Lastly we’ll tackle resource management and how you make sure your workflows run in the correct server pool, without swamping your resources.

Speakers

Tom Barber

Technical Director, Spicule LTD

Tom Barber is the director of Meteorite BI and Spicule BI. A member of the Apache Software Foundation and regular speaker at ApacheCon, Tom has a passion for simplifying technology. The creator of Saiku Analytics and open source stalwart, when not working for NASA, Tom currently deals... Read More →

Monday May 9, 2016 4:10pm - 5:00pm PDT
Plaza A

Streams, Advanced

5:10pm PDT

Building While Flying: Lessons Learned from Operating and Developing a Graph Service with TinkerPop - Keith Lohnes & David Pitera, IBM

Apache TinkerPop is an open source graph computing framework which uses Gremlin, a domain-specific language for graph mutation and traversal. IBM Graph offers an Apache TinkerPop3 compatible API as a service. This service can be used for building recommendation engines, analyzing social networks, fraud detection and more. During this session, we will cover:

* What a graph is and why to use one
* Challenges faced and lessons learned while building and operating a service based on the TinkerPop3 stack

Speakers

Keith Lohnes

Software Engineer, IBM

Keith Lohnes graduated from Northeastern University with a Degree in Computer Science and Music and has been working as a developer for 7 years. He started working with graph databases about 2 years ago and joined IBM to work on their IBM Graph offering.

David Pitera

Software Engineer, IBM

David works on JanusGraph on Compose, a managed graphdb cloud solution using Scylla as the primary data store... Read More →

Building While Flying Lessons Learned pdf

Monday May 9, 2016 5:10pm - 6:00pm PDT
Plaza C

Faster-Better, Any

5:10pm PDT

Hands-on Apache NiFi - Oleg Zhurakousky, Hortonworks

While Apache NiFi provides out-of-the-box support to build powerful and scalable directed graphs of data routing, transformation, and system mediation logic, sometimes "the world is not enough".
Roll up your sleeves and put your hands on the keyboard, as this hands-on talk, structured as a set of quick tutorials, will take you on a journey of developing in NiFi. It will cover extension points such as Processor, ControllerService and ReportingTasks, as well as other lesser-known areas of NiFi internals, sharing some tips and tricks along the way.

Speakers

Oleg Zhurakousky

Hortonworks

Open source practitioner with over 17 years of experience in software engineering across multiple disciplines including Big Data, software architecture and design, consulting, business analysis and application development. Speaker who presented at dozens of conferences worldwide (i.e... Read More →

Monday May 9, 2016 5:10pm - 6:00pm PDT
Georgia A

Interfacing with Big Data, Intermediate

5:10pm PDT

Large Scale SolrCloud Cluster Management via APIs - Anshum Gupta, IBM Watson

Apache Solr is widely used by organizations to power their search platforms and often supports multiple users. A lot of cluster management APIs were introduced over the last few releases, allowing users to manage operations ranging from replica placement to forcing leader elections via API calls. By the end of this talk, intermediate Solr users will understand what’s available and when they can avoid direct interference with the system, leading to more stable clusters and lower chances of nodes going down. Attendees will also be much better equipped to build their own SolrCloud cluster management tools. I will also talk about when not to use these APIs and what’s planned in the near future to handle specific operational use cases.

Speakers

Anshum Gupta

Sr. Software Engineer, IBM Watson

Anshum Gupta is a Lucene/Solr committer and PMC member with over 10 years of experience with search. He is a part of the search team at IBM Watson, where he works on extending the limits and improving SolrCloud. Prior to this, he was a part of the open source team at Lucidworks and... Read More →

Large Scale SolrCloud Cluster Management via APIs pdf

Monday May 9, 2016 5:10pm - 6:00pm PDT
Regency A

Managing Distributed Systems, Intermediate

5:10pm PDT

Next-Gen Decision Making in Under 2ms - Ilya Ganelin, Capital One Data Innovation Lab

What if we had reached that point where open source can handle massively difficult streaming problems with enterprise-grade durability?

Today, Ilya presents Capital One’s novel solution for real-time decisioning on Apache Apex. With an analysis of the dominant streaming frameworks, he’ll show how Apex provides unique capabilities ensuring less than 2ms latency in an enterprise-grade solution on Hadoop.

He’ll first take a detailed dive into the business requirements of a new real-time decisioning platform for model building, feature computation, and model scoring. Next, he’ll survey the leading open source technologies for stream processing and the tradeoffs we considered when selecting our technology stack. Lastly, he’ll show how Apex provides unparalleled performance and meets the stringent performance, scalability, and durability requirements necessary for enterprise-grade decisioning.

Speakers

Ilya Ganelin

Senior Data Engineer, Capital One Data Innovation Lab

Ilya is a roboticist turned data engineer. At the University of Michigan he built self-discovering robots and then worked on embedded DSP software with cell phone radios at Boeing. Today, he drives innovation at Capital One. Ilya is a contributor to the core components of Apache Spark... Read More →

Monday May 9, 2016 5:10pm - 6:00pm PDT
Plaza B

Math & Standards, Intermediate

5:10pm PDT

The New Time Series Kid on the Block - Florian Lautenschlager, QAware GmbH

There is a new open source time series database on the block that allows one to store billions of time series points and access them within a few milliseconds.
Chronix [1] is a young but mature open source time series database that achieves a compression rate of 98% compared to data in CSV files, with an average query taking 21 milliseconds. Chronix is built on top of Apache Solr [2], a bulletproof NoSQL database with impressive search capabilities. Chronix relies on Solr plugins, and anyone who has Solr running can create a new Chronix core within a few minutes.
In this session we show how Chronix achieves its efficiency in both respects by means of ideal chunking, by selecting the best compression technique, by enhancing the stored data with pre-computed attributes, and by specialized time series query functions.

[1] http://chronix.io
[2] http://lucene.apache.org/solr/

Speakers

Florian Lautenschlager

Engineer, QAware GmbH

Florian Lautenschlager is an architect at QAware GmbH Germany. He is also a guest researcher at FAU Erlangen-Nürnberg. Florian studied Computer Science at the University of Applied Science Rosenheim. He works on a research project called Design for Diagnosability in which time series... Read More →

Apache Con The new time series kid on the block pdf

Monday May 9, 2016 5:10pm - 6:00pm PDT
Regency E

New Projects, Beginner

5:10pm PDT

Building a Durable Real-Time Data Pipeline: Apache BookKeeper at Twitter - Sijie Guo & Leigh Stewart, Twitter

The log has proven to be a very powerful data structure for addressing challenging distributed systems problems. DistributedLog is a replicated log service built on top of Apache BookKeeper, providing infinite, ordered, append-only streams that can be used for building robust real-time systems. It is the foundation of Twitter’s durable real-time data pipeline, and has been used widely elsewhere at Twitter in applications including a transactional database system, a search ingestion pipeline, and a real-time streaming data-analytics platform. In this talk, Sijie Guo will discuss the challenges of building a durable real-time data pipeline, how they achieved it, and how they use it to support workloads with different characteristics, from a strongly-consistent distributed database to a real-time data analytics pipeline.
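As a rough illustration of why an ordered, append-only log is a useful building block, the sketch below keeps records in a single in-memory list and lets independent readers track their own offsets. It is a toy under stated assumptions: `AppendOnlyLog` and `Reader` are hypothetical names, there is no replication, durability or networking, and none of this is the DistributedLog or BookKeeper API.

```python
class AppendOnlyLog:
    """Hypothetical in-memory log: records are only appended, never changed."""

    def __init__(self):
        self._records = []

    def append(self, record) -> int:
        """Append a record and return its offset (its position in the log)."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset: int):
        """Return all records at or after the given offset, in write order."""
        return self._records[offset:]


class Reader:
    """Each reader remembers its own offset, so readers progress independently."""

    def __init__(self, log: AppendOnlyLog):
        self.log = log
        self.offset = 0

    def poll(self):
        # Fetch everything not yet seen, then advance past it.
        batch = self.log.read_from(self.offset)
        self.offset += len(batch)
        return batch
```

A reader that starts late, or restarts from a saved offset, sees the same records in the same order as every other reader, which is the property that lets very different workloads share one log.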

Speakers

Sijie Guo

Twitter

Currently working at Twitter on DistributedLog/BookKeeper. Apache BookKeeper PMC Chair. Previously worked at Yahoo! on a push notification system.

Leigh Stewart

Twitter

Building a Durable Real Time Data Pipeline Apache BookKeeper at Twitter pdf

Monday May 9, 2016 5:10pm - 6:00pm PDT
Georgia B

Operations-Use Cases, Any

5:10pm PDT

Apache Tika - What’s New with 2.0? - Nick Burch, Quanticate

Apache Tika detects and extracts metadata and text from a huge range of file formats and types. From Search to Big Data, single file to internet scale, if you’ve got files, Tika can help you get out useful information!

Apache Tika has been around for nearly 10 years now, and with the passage of all that time, plus the new 2.0 release, a lot has changed. Not only has there been a huge increase in the number of supported formats, but the ways of using Tika have expanded, and some of the philosophies on the best way to handle things have altered with experience. Tika has gained support for a wide range of programming languages too, and more recently, Big Data scale support.

Whether you’re an old-hand with Tika looking to know what’s hot or different with 2.0, or someone new looking to learn more about the power of Tika, this talk will have something in it for you!

Speakers

Nick Burch

CTO, Quanticate

Nick began contributing to Apache projects in 2003, and hasn't looked back since! He's mostly involved in "Content" projects like Apache POI, Apache Tika and Apache Chemistry, as well as foundation-wide activities like Conferences and Travel Assistance. Nick is CTO at Quanticate, a... Read More →

Apache Tika What’s New with 2.0 pdf

Monday May 9, 2016 5:10pm - 6:00pm PDT
Regency B

State-Future of $foo, Intermediate

5:10pm PDT

Speaking the Language of Big Data - With Apache Avro and Apache Thrift - Ranganathan Balashanmugam, ThoughtWorks

With the advent of feature-based teams, software architecture styles like microservices and deployment patterns like DevOps are taking over. Each team makes autonomous decisions about the technologies it uses, but there is always a need to define a common language for the services to communicate with each other. This gives the application a common wire format and avoids a lot of mappers across it. The other common scenario is in big data projects, where the nodes of a cluster need to communicate efficiently and effectively through an easy-to-use API.
This talk highlights Apache Avro and Apache Thrift, which are used in big data solutions as a common language across different services/nodes. These technologies provide a language- and platform-neutral way of serializing structured data. The talk also shows examples and demos highlighting the pain points they solve.

Speakers

Ranganathan Balashanmugam

Head of Engineering - India, Aconex

Ranganathan has nearly twelve years of experience in developing awesome products and loves to work on the full stack - from front end, to backend and scale. He is Head of Engineering - India at Aconex and prior to that was Technology Lead at ThoughtWorks. He is a Microsoft MVP for Data... Read More →

Apache Big Data 2016 SpeakingTheLanguageOfBigData pdf

Monday May 9, 2016 5:10pm - 6:00pm PDT
Plaza A

Streams, Intermediate

6:00pm PDT

Onsite Attendee Reception & Technology Showcase

Monday May 9, 2016 6:00pm - 7:30pm PDT
Regency Foyer

7:30am PDT

Breakfast

Tuesday May 10, 2016 7:30am - 9:00am PDT
Regency Foyer

8:00am PDT

Registration

Tuesday May 10, 2016 8:00am - 9:00am PDT
Georgia Foyer

8:00am PDT

Technology Showcase

Tuesday May 10, 2016 8:00am - 4:15pm PDT
Regency Foyer

9:00am PDT

Open Geospatial Standards and Open Source - George Percivall, Open Geospatial Consortium (OGC)

The aim of this talk and the geospatial track is to increase the benefits of implementing open source consistent with open geospatial standards. Open standards capture geospatial knowledge gained from previous experience for reuse. Accuracy in data exchange is increased by using standards. Even “simple” use cases done inconsistently cause errors, e.g. coordinate order. Standards from the Open Geospatial Consortium (OGC) applicable to Apache projects include: coordinate systems, geometry, grids, spatial relations, web services, encodings, metadata. Multiple Apache projects include geospatial implementations as highlighted in this track. To aid in code reuse, this track seeks to increase coordination across Apache projects based on geospatial standards as well as with other external activities. An anticipated outcome of this track is increased geo-collaboration between Apache and OGC.

Speakers

George Percivall

CTO, Chief Engineer, OGC

As CTO and Chief Engineer of the Open Geospatial Consortium (OGC), George Percivall is responsible for the OGC Interoperability Program and the OGC Compliance Program. His roles include articulating OGC standards as a coherent architecture, as well as addressing implications of technology... Read More →

Open Geospatial Standards and Open Source pdf

Tuesday May 10, 2016 9:00am - 9:50am PDT
Plaza A

Geospatial, Intermediate

9:00am PDT

Shared or Distributed HDFS - What’s Right for Me? - Janet George, SanDisk

Currently, there are two competing architectures for how to implement HDFS. The original HDFS approach utilizes storage colocated with the compute servers. An emerging alternative relies on dedicated storage resources shared by the compute cluster. This talk will compare and contrast these two approaches and provide definitive quantitative guidelines to planners and architects to help them identify the best solutions for their needs.

Speakers

Janet George

Fellow, Chief Data Scientist Big Data Platform/Data Science/Cognitive Computing, SanDisk

At SanDisk, Janet is involved with building global core competencies, shaping, driving and implementing the Big Data platform, products and technologies, using advanced analytics and pattern matching with semiconductor manufacturing data from the ground up. Janet's industry experience... Read More →

Tuesday May 10, 2016 9:00am - 9:50am PDT
Regency B

HDFS-Storage, Intermediate

9:00am PDT

Apache Kafka at Rocana - Alan Gardner, Rocana

Rocana Ops is a platform for collecting and performing analysis on IT Operations data at a massive scale. This presentation will discuss why we chose Apache Kafka as part of our infrastructure, and why Kafka makes it possible for us to scale our ingestion framework to accept data from tens of thousands of servers. It will also discuss potential obstacles and challenges for organizations looking to adopt Kafka.

Speakers

Alan Gardner

Rocana

Alan works on the Platform team at Rocana, focusing on data collection. In the past he’s also worked as a consultant, designing scalable solutions for data ingestion, storage and querying using Kafka and Hadoop ecosystem tools. He loves distributed systems and performance optimization... Read More →

Apache Kafka at Rocana pdf

Tuesday May 10, 2016 9:00am - 9:50am PDT
Regency A

Kafka, Beginner

9:00am PDT

Wikimedia Content API: A Cassandra Use-case - Eric Evans, Wikimedia Foundation

The Wikimedia Foundation is a charitable organization with a vision of a world where everyone can freely share in the sum of all knowledge. Each month it serves over 18 billion page views to 500 million unique visitors around the world.

Among the resources offered by Wikimedia is an API providing low-latency access to full-history content, in many formats. Its results are often the product of computationally intensive transforms, and must be pre-generated and stored to meet latency expectations. Unsurprisingly, there are many challenges to providing low-latency access to such a large data-set, in a demanding, globally distributed environment.

This talk will cover the Wikimedia content API and its use of Apache Cassandra as storage for a diverse and growing set of use-cases. Trials, tribulations, and triumphs, of both a development and operational nature will be discussed.

Speakers

Eric Evans

Senior Software Engineer, Wikimedia Foundation

Eric has more than a decade of experience with the engineering and operations of large-scale distributed systems. He joined Rackspace as a startup, and implemented a global DNS infrastructure utilizing IP anycast (possibly the first), and a novel data-center-wide IDS for which a patent... Read More →

Tuesday May 10, 2016 9:00am - 9:50am PDT
Plaza B

NoSQL, Intermediate

9:00am PDT

Cancer Outlier Profile Analysis Using Spark - Mahmoud Parsian, Illumina, Inc.

Cancer Outlier Profile Analysis (COPA) is a method to find genes that undergo recurrent fusion in a given cancer type by finding pairs of genes that have mutually exclusive outlier profiles. COPA is used for detecting translocations of the second type using microarray data. The goal of COPA is to identify genes that have a subset of disease samples with outstandingly high/low values. We have implemented COPA in Spark for production; the implementation can process millions of biomarkers for one-sided and two-sided analysis, where each biomarker may have thousands of genes. Spark was a natural choice for the COPA implementation, since it offers natural join and filter operations (the main steps in the COPA implementation) at a very high level, which the traditional MapReduce API lacks. This presentation will show how we used Spark to solve a complex COPA problem.

Speakers

Mahmoud Parsian

Illumina, Inc.

Mahmoud Parsian, Ph.D. in Computer Science, is a practicing software professional with 30 years of experience as a developer, designer, architect, and author. For the past 15 years, he has been involved in Java server-side, databases, MapReduce, Hadoop, Spark, and distributed... Read More →

Cancer Outlier Profile Analysis Using Spark pdf

Tuesday May 10, 2016 9:00am - 9:50am PDT
Georgia B

Operations-Use Cases, Intermediate

9:00am PDT

Enforcing Fine Grained Role Based Authorization in Multi-tenant Streaming Data Platforms - Ashish Singh, Cloudera

Reliable, high-rate ingestion of data from a large variety of sources is the first step toward answering big analytical questions, and Apache Kafka, a scalable publish-subscribe messaging system, is a popular choice for this goal. With the increasing adoption of Kafka, security has become more important than ever. In this talk, Ashish Singh will review recent advancements made in Kafka towards closing security gaps, and discuss how the addition of pluggable authorization in Kafka has enabled Apache Sentry (incubating) to provide enterprise-grade, fine-grained, role-based authorization in Kafka. The talk will conclude with a demonstration of a working example of how administrators can rely on Sentry for enforcing fine-grained authorization in multi-tenant Kafka platforms.

Speakers

Ashish Singh

Software Engineer, Cloudera

Ashish Singh is a Software Engineer, working with Cloudera to empower the Hadoop ecosystem to answer bigger questions. Ashish studied Computer Science and Engineering at Ohio State University. Before working in the Big Data space, he worked on optimizing MPI collective communications... Read More →

Enforcing Fine Grained Role Based Authorization in Multi tenant Streaming Data Platforms pdf

Tuesday May 10, 2016 9:00am - 9:50am PDT
Regency E

Security, Any

9:00am PDT

Random Forest Clustering with Apache Spark - Erik Erlandson, Red Hat, Inc.

Analytics applications often boil down to grouping objects into two or more clusters having similar elements. Defining what “similar” means can be surprisingly difficult when data elements have many columns or dimensions. Having tools at hand to generate quality clusters from high-dimensional data greatly increases the variety of applications that can successfully leverage clustering.

In this presentation, Erik Erlandson will introduce the basic principles and advantages of Random Forest learning models and Random Forest clustering. He will explain how to build up an implementation of Random Forest clustering in the Apache Spark analytics framework, based on the Spark MLlib Random Forest modeling API.

The presentation will include examples of Random Forest clustering applied to VM installed-package profiles and a discussion of practical issues encountered along the way.

Speakers

Erik Erlandson

Senior Principal Software Engineer, Red Hat

Erik Erlandson is a Software Engineer at Red Hat Emerging Technologies, where he leads a team dedicated to exploring tools, methodologies and use cases at the intersection of Data Science workloads and the Kubernetes ecosystem.

Random Forest Clustering with Apache Spark pdf

Tuesday May 10, 2016 9:00am - 9:50am PDT
Plaza C

Spark, Intermediate

9:00am PDT

SQL on Hadoop/Big Data - Architecture, Technology and Roadmap - Sumit Pal, Big Data Consultant

Talk Topic - "SQL on Hadoop - Architecture, Technology and Road Ahead"

This talk gives an exhaustive overview of how SQL is done on Hadoop, with a focus on low-latency SQL. It covers the various open source and commercial tools for performing SQL on Hadoop and their internal architectures, including Hive, Hive on Tez, Spark SQL, Impala, Apache Drill, Presto, and Tachyon-based architectures.

The talk also covers how SQL can be used for structured, unstructured, and streaming data and the concepts behind each, and shows demos of using SQL on JSON, structured, and streaming data.

Finally, the talk covers the changes coming in this field, with products like OLAP on Hadoop, BlinkDB, NuoDB, and HTAP-based solutions.

Speakers

Sumit Pal

Director Strategic Solutions Architecture, Ontotext

Sumit is an ex-Gartner VP Analyst in the Data Management & Analytics space. At Gartner he advised CTOs, CDOs, CDAOs, Enterprise Architects and Data Architects on Data Strategy, Data Architectures, implementation and choosing tools, frameworks and vendors for building data platforms... Read More →

SQL on Hadoop Big Data Architecture, Technology and Roadmap pdf

Tuesday May 10, 2016 9:00am - 9:50am PDT
Georgia A

SQL Interaction, Intermediate

10:00am PDT

Big Data Trends in Today's Big Job Market - Richard Maldonado, Dice

Through aggressive investments in Data Science, Dice has transformed massive collections of tech resumes, profiles, job postings, and surveys into a vast array of meaningful insights such as market trends, salary predictions, skill mapping, and career pathing, to name just a few. In this session, Dice will share a variety of trends happening in the Big Data job market as well as demonstrate how to leverage the Dice tech community and ecosystem to gain visibility for Apache Big Data projects that are important to you.

Speakers

Richard Maldonado

Lead Product Manager, Dice

Richard is the lead Product Manager at Dice for the Tech Pro digital experience. With over 15 years in Enterprise Software and IT solutions, he has managed market leading enterprise software products and spearheaded several efforts to build, partner, and leverage open source projects. Richard... Read More →

Tuesday May 10, 2016 10:00am - 10:50am PDT
Regency B

HDFS-Storage, Any

10:00am PDT

Apache Bigtop Hadoop Dev Test Benchmark on Your Favorite Cloud or Laptop - Antonio Rosales, Canonical & Konstantin Boudnik, Memcore

Apache Bigtop is the foundation for Open Source Big Data projects. In this talk, we discuss how you can deploy an Apache Bigtop multi-node Hadoop cluster with Ganglia monitoring, using a single command, to your favorite cloud or onto containers on your laptop. Then dev/test and benchmark your cluster, all with Open Source tools.

Speakers

Antonio Rosales

Canonical

Collaborating with communities to help folks get answers faster.

Tuesday May 10, 2016 10:00am - 10:50am PDT
Plaza A

Interfacing with Big Data, Any

10:00am PDT

Kafka at Peak Performance - Todd Palino, LinkedIn

Big Data means big hardware, and the less of it we can use to do the job properly, the better the bottom line. Apache Kafka makes up the core of the data pipelines at many organizations, including LinkedIn, and we are on a perpetual quest to squeeze as much as we can out of our systems, from ZooKeeper, to the brokers, to the various client applications. This means we need to know how well the system is running, and only then can we start turning the knobs to optimize it. In this talk, we will explore how best to monitor Kafka and its clients to ensure they are working well. Then we will dive into how to get the best performance from Kafka, including how to pick hardware and the effect of a variety of configurations in both the broker and clients. We’ll also talk about setting up Kafka for no data loss.

Speakers

Todd Palino

Staff Site Reliability Engineer, http://linkedin.com/

Todd Palino is a Staff Site Reliability Engineer at LinkedIn, tasked with keeping Zookeeper, Kafka, and Samza deployments fed and watered. He is responsible for architecture, day-to-day operations, and tools development, including the creation of an advanced monitoring and notification... Read More →

Tuesday May 10, 2016 10:00am - 10:50am PDT
Regency A

Kafka, Intermediate

10:00am PDT

SASI - A Revolution for Secondary Indexes in Cassandra - Hanneli Tavante, Codeminer 42

Secondary indexes in Cassandra are a frequent topic of discussion. Despite the benefits they bring in terms of providing a better view and sorting of data, they sometimes come with performance issues that you have to handle.
Several strategies have been presented, and in 2015 Apple open sourced its implementation of secondary indexes, called SSTableAttachedSecondaryIndex, or just SASI (GitHub - https://github.com/xedin/sasi ). The main goal of this talk is to show the benefits of this implementation and how it could be used to reduce performance issues. A step-by-step walkthrough of the implementation will also be provided, explaining the insights behind the adopted data structures and the project's general architecture.

Speakers

Hanneli Tavante

Software Developer, Codeminer 42

Hanneli is a software developer at Codeminer 42. She enjoys learning new programming languages, blowing capacitors and helping the community by organising meetups (Neo4j, Cassandra, Rust, Science) and presenting talks around the globe. She also likes Math, Lego, dogs, hardware and... Read More →

SASI A Revolution for Secondary Indexes in Cassandra pdf

Tuesday May 10, 2016 10:00am - 10:50am PDT
Plaza B

NoSQL, Any

10:00am PDT

Breaking Spark: Top 5 Mistakes to Avoid When Using Apache Spark in Production - Neelesh Srinivas Salian, Cloudera

Apache Spark has been growing in deployments for the past year. The increasing amount of data being analyzed and processed through the framework is massive and it continues to push the boundaries of the engine.

This talk will focus on common problematic issues observed in a cluster environment set up with Apache Spark, based on the presenter’s experiences across 150+ production deployments.

When planning an Apache Spark deployment in a cluster, it is recommended to follow certain guidelines to help set up a real-world environment. The issues that can occur fall into these categories:

1) Scaling of the Architecture
2) Memory Configurations
3) End user Code
4) Incompatible Dependencies
5) Administration/Operation related issues.

These observations are very useful as they help to improve the usability and supportability of Apache Spark to avoid such issues in future deployments.

Speakers

Neelesh Srinivas Salian

Software Engineer, Stitch Fix

Neelesh Srinivas Salian is a Software Engineer on the Data Platform team at Stitch Fix, where he works on the compute infrastructure used by data scientists. He helps build services that are part of Stitch Fix’s Data Warehouse ecosystem. Currently he is working to build Data Lineage... Read More →

Spark Talk Cloudera pdf

Tuesday May 10, 2016 10:00am - 10:50am PDT
Georgia B

Operations-Use Cases, Beginner

10:00am PDT

Protecting Enterprise Data in Hadoop - Owen O'Malley, Hortonworks

Hadoop has long had strong authentication via integration with Kerberos, authorization via User/Group/Other HDFS permissions, and auditing via the audit log. Recent developments in Hadoop have added HDFS file access control lists, pluggable encryption key provider APIs, HDFS snapshots, and HDFS encryption zones. These features combine to give important new data protection features that every company should be using to protect their data. This talk will cover what the new features are and when and how to use them in enterprise production environments. Upcoming features including columnar encryption in the ORC columnar format will also be covered. By encrypting particular columns, enterprises can control which users have access to particularly sensitive columns that contain personally identifiable information or financial information.

Speakers

Owen O’Malley

Co-founder & Sr Architect, Hortonworks

Owen O’Malley is a co-founder and architect at Hortonworks, which develops the completely open source Hortonworks Data Platform (HDP). HDP includes Hadoop and the large ecosystem of big data tools that enterprises need for data analytics. Owen has been working on Hadoop since 2006... Read More →

Protecting Enterprise Data in Hadoop pdf

Tuesday May 10, 2016 10:00am - 10:50am PDT
Regency E

Security, Intermediate

10:00am PDT

Clickstream Analysis with Apache Spark - Andreas Zitzelsberger, QAware GmbH

On large-scale web sites, users leave thousands of traces every second. Businesses need to process and interpret these traces in real time to be able to react to the behavior of their users.
In this talk, Andreas will show a real-world example of the power of a modern open-source stack.
He will walk you through the design of a real-time clickstream analysis PaaS solution based on Apache Spark, Kafka, Parquet and HDFS, explain our decision making and present our lessons learned.

Speakers

Andreas Zitzelsberger

Principal Software Architect, QAware GmbH

Andreas is Principal Software Architect at QAware, an independent cloud native software manufacturer that has been repeatedly awarded Best IT Workplace in Germany. His focus is cloud native computing in all its glory. He is responsible for the heavy lifting at a large-scale cloud... Read More →

apachebigdata clickstream analysis pdf

Tuesday May 10, 2016 10:00am - 10:50am PDT
Plaza C

Spark, Any

10:00am PDT

Apache Hive 2.0: SQL, Speed, Scale - Alan Gates, Hortonworks

Apache Hive is the most commonly used SQL interface for Hadoop. To meet users’ data warehousing needs, it must scale to petabytes of data, provide the necessary SQL, and perform in interactive time. The Hive community is working towards a 2.0 release of Hive that includes significant improvements. These include:
* LLAP, a daemon layer that enables sub-second response time.
* HBase to store Hive’s metadata, resulting in significantly reduced planning time.
* Expanding Hive’s support for managing changing data in a transactionally consistent way with SQL MERGE.
* Using Apache Calcite to enable Hive to use multiple storage engines (e.g. HBase).
This talk will cover the use cases these changes enable and the architectural changes being made in Hive as part of building these features, and will share performance test results showing how these improvements are speeding up Hive.

Speakers

Alan Gates

Co-founder and Architect, Hortonworks

Tuesday May 10, 2016 10:00am - 10:50am PDT
Georgia A

SQL Interaction, Beginner

10:50am PDT

Coffee Break

Tuesday May 10, 2016 10:50am - 11:20am PDT
Regency Foyer

11:20am PDT

Applying Geospatial Analytics Using Apache Spark Running on Apache Mesos - Adam Mollenkopf, Esri

This session will explore how to apply spatiotemporal analytics using Apache Spark on high-velocity streaming data-in-motion and high-volume batch data-at-rest. Available open source geospatial libraries will be compared, including Apache SIS, Magellan, JTS, and esri/geometry-api-java. Demonstrations will show how to integrate a geospatial library with Spark analytics and how these analytics can be run on an Apache Mesos cluster to provide a highly scalable solution with elastic capabilities. Examples will focus on applications in the connected car, smart city, and smart community spaces.

Speakers

Adam Mollenkopf

Real-Time & Big Data GIS Capability Lead, Esri

Adam Mollenkopf is responsible for the strategic direction Esri takes towards enabling real-time and big data capabilities in the ArcGIS platform. This includes having the ability to ingest real-time data streams from a wide variety of sources, performing continuous and recurring... Read More →

Applying Spatiotemporal Analytics pdf

Tuesday May 10, 2016 11:20am - 12:10pm PDT
Plaza A

Geospatial, Intermediate

11:20am PDT

HDFS and Private Cloud - Janet George, SanDisk

Traditionally, Hadoop clusters have been built using dedicated hardware separated from the rest of the data center IT infrastructure. The rapid growth of HDFS/Hadoop/Spark/YARN applications makes it desirable to share the services-oriented virtualized infrastructure commonly known as private cloud. While virtualizing compute and network interconnectivity is a relatively well-solved problem, virtualizing the HDFS storage component in the private cloud poses unique challenges. This talk explores those challenges and offers multiple prescriptive solutions along with criteria to allow planners and architects to meaningfully compare and contrast the different approaches.

Speakers

Janet George

Fellow, Chief Data Scientist Big Data Platform/Data Science/Cognitive Computing, SanDisk

Tuesday May 10, 2016 11:20am - 12:10pm PDT
Regency B

HDFS-Storage, Intermediate

11:20am PDT

Streaming Data Integration at Scale with Kafka - Ewen Cheslack-Postava, Confluent

The last decade has seen a dramatic shift in the complexity of data pipelines. Data is stored in more systems, queried in more ways, and comes from more sources. Complex data pipelines combined with the need for applications that can analyze and respond to that data in real time leave traditional approaches to data integration struggling to keep up.

This talk will describe how data integration is shifting to a streaming model and how Kafka supports this new model. Specifically, it will focus on a new tool included with Kafka, Kafka Connect, that handles streaming "E" and "L". It will describe Kafka Connect’s data and execution models, which provide scalable fault-tolerant import and export between Kafka and other data systems. Finally, it will show how this can be combined with other tools such as stream processing frameworks to create a complete streaming data integration solution.

Speakers

Ewen Cheslack-Postava

Confluent

Ewen Cheslack-Postava is a Kafka committer and engineer at Confluent building a stream data platform based on Apache Kafka to help organizations reliably and robustly capture and leverage all their real-time data. He received his PhD from Stanford University where he developed Sirikata... Read More →

Streaming Data Integration at Scale with Kafka pdf

Tuesday May 10, 2016 11:20am - 12:10pm PDT
Regency A

Kafka, Intermediate

11:20am PDT

Cassandra Multi-datacenter Operations Essentials - Julien Anguenot, iland Internet Solutions, Corp

Apache Cassandra operations have a reputation for being quite simple on single-datacenter and/or low-volume clusters, but they become far more complex on high-latency multi-datacenter clusters: basic operations such as repair, compaction or hints delivery can have dramatic consequences even on a healthy cluster.

In this presentation, Julien will go through Cassandra operations in detail: bootstrapping new nodes and/or datacenters, repair strategies, compaction strategies, GC tuning, OS tuning, removal of large batches of data, and Apache Cassandra upgrade strategy.

Julien will give you tips and techniques on how to anticipate issues inherent to multi-datacenter clusters: how and what to monitor, hardware and network considerations, as well as data model and application-level bad design / anti-patterns that can affect your multi-datacenter cluster performance.

Speakers

Julien Anguenot

VP Software Engineering, iland Internet Solutions, Corp

Julien is an accomplished and organized software craftsman with a creative and entrepreneurial spirit. Julien serves as iland’s Vice President of Software Engineering and is responsible for the strategic vision and development of iland’s Cloud Services platform. Under his leadership... Read More →

iland apache con 2016 long pdf

Tuesday May 10, 2016 11:20am - 12:10pm PDT
Plaza B

NoSQL, Intermediate

11:20am PDT

A Java Implementer’s Guide to Boosting Apache Spark Performance - Tim Ellison, IBM

Apache Spark has rocked the big data landscape, becoming the largest open source big data community with over 750 contributors from more than 200 organizations. Spark’s core tenets of speed, ease of use, and its unified programming model fit neatly with the high performance, scalable, and manageable characteristics of modern Java runtimes. In this talk Tim Ellison, a JVM developer at IBM, shows some of the unique Java 8 capabilities in the JIT compiler, fast networking, serialization techniques, and GPU off-loading that deliver the ultimate big data platform for solving business problems. Tim will demonstrate how solutions, previously infeasible with regular Java programming, become possible with this high performance Spark core runtime, enabling you to solve problems smarter and faster.

Speakers

Tim Ellison

Tim Ellison is currently a Senior Technical Staff Member with IBM's Java Technology Centre in the UK. He has worldwide responsibility for Open Source Engineering in the Java SDK underpinning a broad selection of IBM's flagship products. He is a Member of the Apache Software Foundation... Read More →

Tuesday May 10, 2016 11:20am - 12:10pm PDT
Georgia B

Operations-Use Cases, Intermediate

11:20am PDT

Apache Eagle - Identify Threats Instantly Through Policy Engine and User Profile - Medha Samant, eBay

Apache Eagle is an Open Source Monitoring framework for Hadoop to instantly identify access to sensitive data, recognize attacks and malicious activities in Hadoop, and take action in real time. Eagle provides a distributed, fault-tolerant policy engine and out-of-the-box machine learning models that create user profiles offline based on historical user behavior and detect anomalies online.

Apache Eagle was initially created to fill some obvious gaps in the Hadoop security landscape and soon expanded to Hadoop system monitoring, including map/reduce job monitoring, data node anomaly detection, master node garbage collection activity monitoring, etc.

Apache Eagle’s core is the fully distributed policy evaluation engine. It solves common yet hard problems for traditional monitoring, like horizontal scalability, data skew, policy fault-tolerance, fluent stream DSL, etc.

Speakers

Medha Samant

Director, Product Management, eBay

Medha Samant is Director of Product Management at eBay, driving product strategy and execution for the Data Analytics and Business Intelligence platform. Medha has over 20 years of extensive and diversified experience across product development, product innovation and strategy, product... Read More →

Tuesday May 10, 2016 11:20am - 12:10pm PDT
Regency E

Security, Advanced

11:20am PDT

Spark Cyborgs - Deep Integration of Spark with Parallel Relational Engines - Torsten Steinbach & Gustavo Arocena, IBM

In this session we describe a family of hybrid engines that result from a deep two-way integration between Spark and parallel RDBMSs. This integration differs from projects like Hive on Spark, that leverage Spark purely as an execution framework. It also goes beyond what’s possible with the current version of the DataSources API in terms of leveraging the capabilities of the storage backend. In our presentation you will learn about four essential building blocks of the hybrid engines:
1. Derive DataFrame partitioning implicitly from parallel RDBMS partitioning
2. Colocation and efficient data movement between Spark and RDBMS processes
3. Hybrid queries by augmenting parallel RDBMS with Spark
4. Spark machine learning integrated in RDBMS for relational data

Speakers

Gustavo Arocena

Big Data Architect, IBM

Gustavo Arocena is a Big Data Architect at the IBM Toronto Lab, with more than 10 years of experience in database technology and language processing. Recently he has led the design and implementation of several components of the Big SQL engine, including the Hive-compatible IO layer... Read More →

Torsten Steinbach

IBM

Torsten has been a software architect for database technology at IBM for many years. He led product development for DB2 performance management tooling, Netezza workload management and in-database analytics. Currently he works on IBM’s cloud data warehouse dashDB and its integrated... Read More →

Spark Cyborgs pdf

Tuesday May 10, 2016 11:20am - 12:10pm PDT
Plaza C

Spark, Intermediate

11:20am PDT

Get the Best Out of Hive and Spark - Xuefu Zhang, Uber

Apache Hive is widely used for batch-oriented SQL workloads for ETL and data analytics in the Hadoop ecosystem. Its rich features haven’t been matched by any other available SQL-on-Hadoop tool. In fact, many of these tools are tied to and depend on Hive in one way or another. Apache Spark, on the other hand, offers a general data processing framework positioned to replace MapReduce with its faster data processing and efficient memory utilization. Moreover, one doesn’t have to abandon one for the other or juggle between the two in order to get both sets of benefits, as Hive on Spark maintains Hive’s feature richness while providing faster SQL-on-Hadoop execution. With Hive on Spark being adopted for production use, this presentation will share best practices for deployment and performance tuning that enable you to get the best out of the two projects.

Speakers

Xuefu Zhang

Software Engineer, Uber Technologies

Xuefu Zhang has over 10 years’ experience in software development. Earlier this year he joined Uber as a software engineer from Cloudera, where he spent his main efforts on Apache Hive and Pig. He also worked in the Hadoop team at Yahoo when the majority of the development on... Read More →

Tuesday May 10, 2016 11:20am - 12:10pm PDT
Georgia A

SQL Interaction, Any

12:10pm PDT

Lunch (Attendees on Own)

Tuesday May 10, 2016 12:10pm - 2:00pm PDT
TBA

2:00pm PDT

SciSpark: MapReduce in Atmospheric Sciences - Kim Whitehall, NASA Jet Propulsion Laboratory

The atmospheric science (AS) community generates model and observational data to simulate and monitor the Earth system. Big data in the AS community has arrived: high volumes (petabytes), at increasing velocity (to AS groups worldwide) and variety (of data formats and resolutions), are needed for the veracity of models and observation systems that add value to the policy-making process. As scientists require solutions that allow interaction with these big data, the community is interested in the MapReduce paradigm and Apache Spark. This talk presents a specific NASA Advanced Information Systems Technology (AIST) project called “SciSpark” that marries Apache Spark with climate science. SciSpark is a scalable system for interactive AS analysis. We will demonstrate SciSpark’s scientific data ingestion, visual interaction and metrics generation using the Spark engine.

Speakers

Kim Whitehall

NASA Jet Propulsion Laboratory

Kim is a scientific applications software engineer at NASA’s Jet Propulsion Laboratory.

SciSpark MapReduce in Atmospheric Sciences pdf

Tuesday May 10, 2016 2:00pm - 2:50pm PDT
Plaza A

Geospatial, Beginner

2:00pm PDT

Leveraging YCSB for Your Project - Sean Busbey, Cloudera

YCSB is an open source framework for evaluating data storage systems that has become the de facto standard for use with NoSQL projects. After several quiet years the project has returned to life as a community effort, which has led to substantial utility and system coverage improvements. Sean Busbey will review the last six months of development, explain how users of Apache projects can get a better understanding of their preferred storage system, and discuss ways some folks proactively use YCSB to improve the quality of their storage project.

Speakers

Sean Busbey

Cloudera

Sean Busbey currently works at Cloudera as a software engineer on distributed storage systems. In addition to being a Member of the Apache Software Foundation, he is actively involved in several projects including: HBase, Yetus, Avro, NiFi, and Accumulo. Outside of the ASF, he is... Read More →

Tuesday May 10, 2016 2:00pm - 2:50pm PDT
Regency B

HDFS-Storage, Intermediate

2:00pm PDT

Building a Self-serve Kafka Ecosystem - Joel Koshy, LinkedIn

Apache Kafka has enjoyed widespread adoption as a messaging backbone for data pipelines and stream processing platforms.

LinkedIn runs one of the largest known deployments of Kafka and serves hundreds of applications within the company. As new use-cases for Kafka emerge it is becoming increasingly critical to provide self-serve features for users without having to always engage Kafka specialists. Users need to create topics with non-default configurations and easily examine various topic metadata and schemas; topic owners may want to specify authorization rules and be able to encrypt their data. Furthermore, Kafka brokers need mechanisms in place to protect against rogue clients that impact the cluster and other clients.

In this talk we will describe how we are addressing these practical challenges in providing a truly multi-tenant messaging service.

Speakers

Joel Koshy

Joel Koshy is a Staff Software Engineer in LinkedIn’s Data Infrastructure team. He is also a PMC member and committer on the Apache Kafka project. Joel has worked on distributed systems infrastructure and applications for the past eight years. Prior to LinkedIn, he was with the... Read More →

Tuesday May 10, 2016 2:00pm - 2:50pm PDT
Regency A

Kafka, Intermediate

2:00pm PDT

Zipkin & Apache Cassandra: A Big Data Tracing Case Study - Mick Semb Wever, The Last Pickle

Monitoring provides information on system performance; however, tracing is necessary to understand the performance of individual requests.

Systems such as Zipkin, Dapper, and HTrace provide distributed tracing; as does CQL request tracing in Apache Cassandra. Such tracing is invaluable when diagnosing individual requests, yet knowing which database queries to trace and why they were made still requires deep technical knowledge. And while each solution provides insight, the problem of providing a single tracing view across a distributed application stack remains.

This talk will introduce using Zipkin to record Cassandra request traces, to provide a single tracing view from HTTP server to database. Starting with CQL request tracing, we will move on to Zipkin, ongoing work to record request traces via Zipkin, and the efforts of the OpenTracing community to create a common tracing API.

Speakers

Mick Semb Wever

Team Member, The Last Pickle

Mick Semb Wever works at The Last Pickle helping customers deliver and improve Apache Cassandra based solutions. Prior to TLP he spent seven years at FINN.no building their Microservices platform utilizing Apache Cassandra, Hadoop, Spark and Kafka. He is the PMC Chair for Apache Tiles... Read More →

Zipkin and Apache Cassandra A Big Data Tracing Case Study pdf

Tuesday May 10, 2016 2:00pm - 2:50pm PDT
Plaza B

NoSQL, Intermediate

2:00pm PDT

Using Apache Big Data Stack to Analyse Storm-Scale Numerical Weather Prediction Data - Suresh Marru, Indiana University

This talk will discuss adaptation of Apache Big Data Technologies to analyze large, self-described, structured scientific data sets. We will present initial results for the problem of analyzing petabytes of weather forecasting simulation data produced as part of National Oceanic and Atmospheric Administration’s annual Hazardous Weather Testbed. The challenge is to enable weather researchers to perform investigative queries over the full forecast simulation outputs to find the signatures for severe weather phenomena like tornadogenesis. Given the size of the data and the complexity of weather phenomena, these data sets are candidates for exploration by machine learning techniques that can identify heretofore unknown relationships in the dozens of weather parameters generated by the simulations, guiding researchers into developing new scientific models.

Speakers

Suresh Marru

Member, Indiana University

Suresh Marru is a Member of the Apache Software Foundation and is the current PMC chair of the Apache Airavata project. He is the deputy director of Science Gateways Research Center at Indiana University. Suresh focuses on research topics at the intersection of application domain... Read More →

Tuesday May 10, 2016 2:00pm - 2:50pm PDT
Georgia B

Operations-Use Cases, Intermediate

2:00pm PDT

Apache Kerby for Big Data Security - Kai Zheng, Intel

A Big Data platform based on Apache Hadoop presents numerous security, compliance, and integration challenges in both enterprise and Internet domains. This session will present a new, comprehensive authentication solution that uses Apache Kerby in Hadoop, allowing everyone to be connected everywhere in the ecosystem in a low-risk yet secure manner while benefiting from Kerberos. Apache Kerby has been a sub-project of Apache Directory since January 2015. It is an implementation of Kerberos in Java and will provide a rich, intuitive and interoperable library and facilities that integrate multiple authentication mechanisms, including PKINIT, OTP and token (OAuth 2.0). We will introduce and discuss the solution, the state of community development, highlighted features, and the roadmap. We will also show a demo and explain how Kerby’s embedded nature can be leveraged in Hadoop.

Speakers

Kai Zheng

Kai is a senior software engineer at Intel who has worked in the big data and security fields for quite a few years. He is a key Apache Kerby initiator, Directory PMC member and Apache Hadoop committer.

Introduce Kerby To Hadoop pdf

Tuesday May 10, 2016 2:00pm - 2:50pm PDT
Regency E

Security, Intermediate

2:00pm PDT

On the Fly Retraining of Predictive Analytical Models Using Spark Streaming: An Equity-price Direction Prediction Case Study - Tijl Carpels, Ghent University

FinTech companies face the challenge of predicting the direction of equity prices. In this study we used algorithms provided in Spark MLlib to address this problem. Due to the characteristics of the equity market, this happens in a streaming environment, requiring us to continuously monitor the performance of the predictive model. When the performance drops below a certain threshold, we trigger a batch training of the model. We built a proof of concept using different open-source tools (Apache Spark and Spark-notebook).

Speakers

Tijl Carpels

Doctoral Researcher - Data Scientist, Ghent University

Tijl Carpels received his M.Sc. degree in Business Engineering (major: Finance) in 2015 after writing a dissertation in the field of fraud prediction. Afterwards he accepted a research position at Ghent University in order to pursue a PhD in Data Analytics at the Faculty of Economics... Read More →

On the Fly Retraining of Predictive Analytical Models Using Spark Streaming pdf

Tuesday May 10, 2016 2:00pm - 2:50pm PDT
Plaza C

Spark, Intermediate

2:00pm PDT

Hive on ACID - Alan Gates, Hortonworks

Apache Hive provides SQL access for data in Hadoop. Traditionally data in Hadoop is write once read many. But with traditional data warehousing use cases moving to Hadoop there is a need to support transactional update and delete of records. Hive has recently implemented ACID compliant row level insert, update, and delete as well as very low latency ingestion of streaming data from tools like Storm and Flume. This is done with snapshot isolation between queries. This talk will cover the intended use cases, architectural challenges of implementing updates and deletes in a write-once file system, and details of changes to the file storage formats and transaction management system.

Speakers

Alan Gates

Co-founder and Architect, Hortonworks

Tuesday May 10, 2016 2:00pm - 2:50pm PDT
Georgia A

SQL Interaction, Intermediate

3:00pm PDT

Geospatially Enable Your Hadoop, Accumulo, and Spark Applications with LocationTech Projects

What is the average predicted temperature of BC from 2050-2099 based on forecasting models? How many tweets containing the hashtag #apachecon were sent from Canada? In general: how do we ask questions concerning location to very large sets of geospatial data? To answer these types of questions, existing large data processing frameworks like Hadoop, Accumulo and Spark need to be "geospatially enabled". LocationTech is a working group inside of the Eclipse Foundation that is home to 4 open source projects doing exactly that: GeoTrellis, GeoWave, GeoMesa, and GeoJinni (sense a pattern?). In this talk, I will give an introduction to what geospatial data is, talk about challenges in processing large sets of geospatial data, and talk about how these four LocationTech projects work with Apache projects to overcome those challenges and let us get the most out of our large geospatial data.

Speakers

Robert Emanuele

Software Developer, Azavea

Rob Emanuele is the maintainer of the open source geospatial library GeoTrellis, which provides geospatial capabilities to Apache Spark. He was the program chair for FOSS4G North America in 2015 and 2016. He is a member of the LocationTech Project Management Committee.

Geospatially Enable Your Hadoop, Accumulo, and Spark Applications with LocationTech Projects pdf

Tuesday May 10, 2016 3:00pm - 3:50pm PDT
Plaza A

Geospatial, Beginner

3:00pm PDT

Hadoop Object Store - Ozone - Anu Engineer & Arpit Agarwal, Hortonworks

Ozone is an AWS S3-like object store for Hadoop that will scale to trillions of objects. Ozone uses primitives from HDFS and will support MapReduce-like paradigms. Ozone will plug into existing Hadoop deployments seamlessly and share storage with HDFS. This talk dives deep into the motivations, design and technical challenges in making HDFS and Ozone scale to trillions of objects.

Speakers

Arpit Agarwal

Hortonworks Inc.

Anu Engineer

Hortonworks is a major Hadoop vendor. I have been working for Hortonworks for the last year and have been working on the Ozone object store (an S3-like interface for HDFS). I am a contributor to Hadoop, especially to HDFS. Anu Engineer: Apache Hadoop contributor, works for Hortonworks... Read More →

Hadoop Object Store Ozone pdf

Tuesday May 10, 2016 3:00pm - 3:50pm PDT
Regency B

HDFS-Storage, Advanced

3:00pm PDT

Apache Flume or Apache Kafka? How About Both? - Jayesh Thakrar, Conversant

Flume and Kafka are seen by some as serving the same function and are often considered mutually exclusive. This presentation is about an implementation where both are used together as parts of a heterogeneous streaming data pipeline.

The presentation will cover the evolution of the pipeline and how it grew from being designed to handle 20 billion log lines to 90+ billion log lines a day. It will also cover Flume customization for ensuring data uniqueness as well as to allow fractional bifurcation of data from production to QA systems for continuous regression testing.

Finally, the presentation will cover monitoring of the pipeline from a holistic view as well as a detailed drill-down and associated alerting.

Speakers

Jayesh Thakrar

Sr. Software Engineer, Conversant

Jayesh Thakrar is a Sr. Data Engineer at Conversant (http://www.conversantmedia.com/). He is a data geek who gets to build and play with large data systems consisting of Hadoop, Spark, HBase, Cassandra, Flume and Kafka. To rest after a good day's work, he uses OpenTSDB with 500+ million... Read More →

Tuesday May 10, 2016 3:00pm - 3:50pm PDT
Regency A

Kafka, Intermediate

3:00pm PDT

From Big Data to Mobile Data with Apache CouchDB and PouchDB - Bradley Holt, IBM Cloudant

It’s all too easy for mobile app developers to assume that their apps will run on fast and reliable networks. The reality for end users, though, is often slow, unreliable networks with spotty coverage. What happens when the network doesn’t work, or when a device is in airplane mode? You get unhappy, frustrated users. One solution is to take an offline-first approach. An offline-first app is an app that works, without error, when there is no network connection. Offline-first apps built with Apache CouchDB and PouchDB (an open source JavaScript database) can provide better, faster user experiences by storing data locally and then synchronizing with a cloud database when a network connection is available.

Speakers

Bradley Holt

Developer Advocate, IBM Cloud Data Services

Bradley Holt is a Developer Advocate with IBM Cloud Data Services. He is the author of several publications including Scaling CouchDB and Writing and Querying MapReduce Views in CouchDB (both published by O'Reilly Media). He has spoken at numerous conferences including the O'Reilly... Read More →

Offline First 2016 05 10 pdf

Tuesday May 10, 2016 3:00pm - 3:50pm PDT
Plaza B

NoSQL, Intermediate

3:00pm PDT

Focused Crawling with Apache Nutch - Sujen Shah, NASA JPL

The vast nature of the Web has forced researchers to continually develop advanced data acquisition strategies that overcome a multitude of obstacles in order to acquire relevant topical content and assimilate it with their needs. Many groups have researched focused Web crawling techniques in order to better guide their data acquisition efforts; however, few approaches consider the scenario where one wishes to undertake DD on the open Web for which no prior semantic knowledge resources are available. Sujen and his team have investigated and developed a new application of the cosine similarity metric (CSM), which has been implemented as part of a novel strategy for domain-specific DD.

In this presentation, Sujen will review the recent work in focused crawling and the ability to run similarity scoring within a production-ready, scalable Web crawler, Apache Nutch.

Speakers

Sujen Shah

Scientific Applications Software Engineer, NASA Jet Propulsion Laboratory

Focused crawling with Nutch ApacheCon 2016 (4) pdf

Tuesday May 10, 2016 3:00pm - 3:50pm PDT
Georgia B

Operations-Use Cases, Any

3:00pm PDT

Enabling Universal Authorization Models Using Sentry - Hao Hao & Anne Yu, Cloudera

Sentry is a framework that provides fine-grained access control on data stored on a Hadoop cluster. Sentry has been leveraged to manage authorization policies for Hive, Solr, Impala, and Kafka. A new generic authorization model has been implemented in Sentry to make it easy to enable protections on the various types of data in Hadoop engines. It is critical to support different authorization models that meet diverse security requirements. The architecture of the new generic model is designed to plug in various authorization models, such as role-based access control and attribute-based access control, easily using the same storage service. In this talk, Anne and Hao will present the architecture of the generic model in the Sentry framework, which satisfies all those targets. They will also elaborate on how to create policies to protect data for different Hadoop engines.

Speakers

Hao Hao

Hao Hao is a software engineer at Cloudera. She is working on the Sentry project, a granular, role-based authorization module for Hadoop clusters. She is also a committer on the Apache Sentry (incubating) project. Hao has performed extensive research on smartphone security, web security while... Read More →

Anne Yu

Software Engineer, Cloudera

I am a software engineer working at Cloudera. I am also a PMC member and committer of Apache Sentry. I am interested in big data and its security technologies. I am also very interested in technology-driven education systems, such as AltSchool.

Enabling Universal Authorization Models Using Sentry pdf

Tuesday May 10, 2016 3:00pm - 3:50pm PDT
Regency E

Security, Any

3:00pm PDT

Real Time BOM Explosions with Apache Solr and Spark - Andreas Zitzelsberger, QAware GmbH

Bills of materials (BOMs) are at the heart of every manufacturing process. Especially large BOMs can be found in the automotive industry, where a complex and highly variable product meets high production volumes.
Drawing on experience gained in an ongoing real-world project for a major car manufacturer, Andreas will provide an in-depth view of how Apache Solr and Apache Spark were used to power an innovative architecture that provides lightning-fast BOM explosions, demand forecasts and scenario-based planning on 20 billion records per scenario.

Speakers

Andreas Zitzelsberger

Principal Software Architect, QAware GmbH

apachebigdata realtime bom explosions pdf

Tuesday May 10, 2016 3:00pm - 3:50pm PDT
Plaza C

Spark, Any

3:00pm PDT

Using Kafka and Kudu for Fast, Low-latency SQL Analytics on Streaming Data - Mike Percy & Ashish Singh, Cloudera

Apache Kudu (incubating) is a fast new columnar data store for the Hadoop ecosystem designed to enable high-performing, flexible analytic pipelines. In this talk, Mike Percy and Ashish Singh will demonstrate how Apache Kafka can be combined with Kudu to achieve low latency, high throughput analytics on streaming data. We will compare various approaches to building such a solution and demonstrate a working system for analyzing tweets in real time by combining Kafka, Kudu, and Apache Impala (incubating).

Speakers

Mike Percy

Software Engineer, Cloudera

Mike Percy is a software engineer at Cloudera and a PMC member on Apache Kudu, an open source distributed column store for the Hadoop ecosystem. He is also a PMC member on Apache Flume. Prior to joining Cloudera, Mike worked at Yahoo! building machine learning infrastructure for Big... Read More →

Ashish Singh

Software Engineer, Cloudera

Kafka Kudu talk latest pdf

Tuesday May 10, 2016 3:00pm - 3:50pm PDT
Georgia A

SQL Interaction, Any

3:50pm PDT

Coffee Break

Tuesday May 10, 2016 3:50pm - 4:30pm PDT
Regency Foyer

4:15pm PDT

Keynote: ODPi 101: Who We Are, What We Do and Don't Do - Alan Gates, Co-founder, Hortonworks

It's no surprise that application developers find it difficult to keep up with the breathtakingly large ecosystem of new and emerging Hadoop-related technologies. Hadoop, its components, and Hadoop Distros, are innovating very quickly and in different ways.

What's needed to push Hadoop even further in the enterprise is standardization and simplification. That's the mission behind the new Open Data Platform initiative (ODPi) that launched last year and warrants extra explanation.

In this session, Alan Gates, Co-Founder of Hortonworks, will outline why close to 30 companies came together to be part of the nonprofit ODPi. Organized to support the ASF, ODPi promotes innovation and development of upstream projects like Hadoop and Ambari. While not a distribution, ODPi Core is a stable base against which big data solutions providers can qualify solutions over multiple Apache Hadoop® distributions. ODPi Core is a set of software components, a detailed certification and a set of open source tests to make it easier to create big data solutions and data-driven applications.

The well-defined ODPi Core and ODPi Certification Program are designed to drive interoperability, a broad set of use cases and major growth for the big data ecosystem, not to mention a new level of choice for enterprises and end users. The reference implementation frees up developers and SIs to focus on building business-driven applications for things like fraud detection, customer behavior and data warehouse optimization.

Speakers

Alan Gates

Co-founder and Architect, Hortonworks

Tuesday May 10, 2016 4:15pm - 4:25pm PDT
Regency CD

Keynote

4:30pm PDT

Keynote: More Fun, Less Friction: How Open Source Operations Will Take Big Data to the Next Level - Mark Shuttleworth, Founder, Canonical

Speakers

Mark Shuttleworth

Canonical

Mark is founder of Ubuntu and leads product design at Canonical. Mark founded Thawte, an internet commerce security company in 1996 while studying finance and IT at the University of Cape Town. In 2000 he founded HBD, an investment company, and created the Shuttleworth Found... Read More →

Tuesday May 10, 2016 4:30pm - 4:40pm PDT
Regency CD

Keynote

4:45pm PDT

Keynote: A Look Ahead at Spark 2.0 - Ion Stoica, Co-founder & Executive Chairman, Databricks

During this keynote talk, Ion will discuss the key features of the upcoming Apache Spark 2.0 release, and the longer term development directions.

Speakers

Ion Stoica

Co-founder & Executive Chairman, Databricks

Ion Stoica is a Professor in the EECS Department at University of California at Berkeley. He does research on cloud computing and networked computer systems. Ion's past work includes Apache Spark, Apache Mesos, Tachyon, Chord DHT, and Dynamic Packet State (DPS). He is an ACM Fellow... Read More →

Tuesday May 10, 2016 4:45pm - 5:05pm PDT
Regency CD

Keynote

5:10pm PDT

Lightning Talks aka Big Data Shark Tank

This year lightning talks have been overrun by sharks. Which means, at this point, you may be wondering: is it a panel? Is it a talk? It is a Big Data Shark Tank! Back by popular demand with even sharkier judges! What is it, you ask? Well, this is just like the Shark Tank TV show (think speed dating between entrepreneurs and investors) but instead of “Squirrel Boss” and “Man Candle” you'll be hearing pitches for Apache Incubator Big Data projects. Also, instead of Mark Cuban and Kevin O'Leary, you'll be pitching to a panel of ASF grey beards and money men (trying to convince them that your project is worthy of their esteemed attention and endorsement). There will be snark, there may be prizes, there will be reciting of the Apache Way creed. But most of all there will be fun. We guarantee that!

Moderators

Roman Shaposhnik

Director of Open Source, Linux Foundation

Apache Software Foundation and Data, oh but also unikernels

Speakers

Milind Bhandarkar

Founder, Ampool

Shane Curcuru

Founder, Punderthings Consulting

Shane serves as V.P. of Brand Management for the ASF, setting trademark and brand policy for all 250+ Apache projects, and has served as five-time Director, and member and mentor for Conferences and the Incubator. Shane's Punderthings consultancy is here to help both companies and... Read More →

Jim Jagielski

Developer, Uber

Jim Jagielski is a well-known and acknowledged expert and visionary in open source, an accomplished coder, and frequent engaging presenter on all things open, web, blockchain, and cloud related. As a developer, he’s made substantial code contributions to just about every core technology... Read More →

Mark Shuttleworth

Canonical

Tuesday May 10, 2016 5:10pm - 5:50pm PDT
Regency CD

Keynote

6:00pm PDT

BoF: Apache Beam

Are you passionate about a topic and want to share that with others? If so, sign up to lead a Birds of a Feather (BoF) discussion.

To sign up for a BoF session slot, there will be a bulletin board placed near registration in the Regency Foyer and there are seven rooms available, from 6:00pm - 7:00pm.

If there are any questions, please come to the registration desk where a staff member can assist you.

Tuesday May 10, 2016 6:00pm - 7:00pm PDT
Plaza A

BoF

6:00pm PDT

BoF: Apache Flink

Tuesday May 10, 2016 6:00pm - 7:00pm PDT
Regency A

BoF

6:00pm PDT

BoF: Aurora/Mesos

Tuesday May 10, 2016 6:00pm - 7:00pm PDT
Plaza C

BoF

6:00pm PDT

BoF: Big Data Experts - Our Responsibility for Creating Future Society

Tuesday May 10, 2016 6:00pm - 7:00pm PDT
Plaza B

BoF

6:00pm PDT

BoF: Bigtop JuJu Charms Community Meetup with Mark Shuttleworth, Canonical

Tuesday May 10, 2016 6:00pm - 7:00pm PDT
Regency E

BoF

6:00pm PDT

BoF: Geospatial in Apache Projects

Speakers

George Percivall

CTO, Chief Engineer, OGC

Tuesday May 10, 2016 6:00pm - 7:00pm PDT
Regency B

BoF

6:00pm PDT

Crypto Security Workshop hosted by Milagro

Learn how to secure what’s really important with Apache Milagro (incubating): multi-factor authentication and certificate-less TLS for IoT, mobile apps, containers and end users. Bring your laptops! This workshop will introduce Milagro Multi-Factor Authentication with the Apache Web Server and standard off-the-shelf modules to secure web applications that are immune to authentication credential theft (i.e., password database breaches) while improving the user experience. We will also preview Milagro TLS in client/server mode using standard IoT devices for certificate-less TLS with perfect forward secrecy.

Tuesday May 10, 2016 6:00pm - 9:00pm PDT
Mobify 3rd Floor, 948 Homer Street, Vancouver

6:00pm PDT

Vancouver Spark Meetup

See below for the agenda and please visit http://www.meetup.com/Vancouver-Spark/events/229692936/ for more information.

6:00-6:30 Networking
6:30 Chris Fregly
7:00 Mike Percy & Dan Burkert
7:30 Xuefu Zhang
8:00 Networking and wrap

Tuesday May 10, 2016 6:00pm - 9:00pm PDT
Georgia B

7:00am PDT

5k Run to Stanley Park

5k Run to Stanley Park! Meet in the Hyatt Regency Vancouver Lobby at 7am. For any questions, please contact: jfclere@gmail.com

Wednesday May 11, 2016 7:00am - 8:00am PDT
TBA

7:30am PDT

Breakfast

Wednesday May 11, 2016 7:30am - 9:00am PDT
Regency Foyer

8:00am PDT

Registration

Wednesday May 11, 2016 8:00am - 9:00am PDT
Georgia Foyer

8:00am PDT

Technology Showcase

Wednesday May 11, 2016 8:00am - 4:10pm PDT
Regency Foyer

9:00am PDT

Keynote: Apache Hadoop at 10 - Doug Cutting, Chief Architect, Cloudera

2016 marks the 10th anniversary of Apache Hadoop. This birthday provides us an opportunity to celebrate, and also to reflect on how we got here and where we are going. Ten years ago, digital business was mostly limited to a few sectors, like e-commerce and media. Since then, we have seen digital technology become central to nearly every industry. Hadoop did not create this digital transformation, but it is a critical character in this larger story. Thus by exploring Hadoop’s tale we can better understand the century we are now in.

Speakers

Doug Cutting

Cloudera

Doug (@cutting) is the founder of several successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera in 2009, after previously working at Yahoo!, Excite, Apple, and Xerox PARC. Doug holds a Bachelor’s degree from Stanford University and is the... Read More →

Wednesday May 11, 2016 9:00am - 9:20am PDT
Regency CD

Keynote

9:25am PDT

Keynote: Role of Apache in Transforming eBay’s Data Platform - Seshu Adunuthula, Sr. Director of Analytics Infrastructure, eBay

eBay has one of the most sophisticated data platforms in the industry, with over 200 PB of data stored in our Hadoop and Teradata warehouses. On average, 30 TB of transactional and behavioral data is extracted on a daily basis, and thousands of metrics are computed, analyzed and monitored for decision making and detecting anomalies. eBay has embarked on an ambitious project to transform the batch-oriented ETL processes, which could take 24 to 48 hours, into near-real-time infrastructure. Apache Big Data projects continue to play a critical role in this transformation process.

Speakers

Seshu Adunuthula

Sr. Director of Analytics Infrastructure, eBay

Seshu Adunuthula is Sr Director of Analytics Infrastructure at eBay responsible for managing some of the world’s largest deployments of Hadoop, Teradata and ETL Ingest platforms. He is an industry veteran with over 20 years of Distributed Computing and Analytics Experience. Prior... Read More →

Wednesday May 11, 2016 9:25am - 9:45am PDT
Regency CD

Keynote

9:30am PDT

BarCampApache

A BarCampApache is a BarCamp facilitated by a group of people involved in the Apache Software Foundation (ASF). All topics are still welcome, however! As the ASF is helping to organize, there will be a lot of people around who know a lot about Apache projects / communities / technologies, so there are normally quite a few sessions proposed on those areas. It's not exclusively Apache though, so everyone should come and talk about fun new ideas, projects and technologies. BarCampApache will be a dynamic get-together open to the public. Like other unconferences, the schedule will be determined by the participants, both Apache and non! We strongly encourage lots of people to come along and share their knowledge and ideas. We want it to be a great day of sharing for everyone, not just those at the event. Everyone coming in for the conference is encouraged to come early, as it will be a great day for all.

Wednesday May 11, 2016 9:30am - 3:00pm PDT
Seymour

BarCampApache, Any

9:50am PDT

Keynote: Making Data Accessible - Ashish Thusoo, Co-founder & CEO, Qubole

Every organization is handling data in one way or another, but today’s data tools and infrastructure continue to hinder an organization’s ability to make data accessible to less technical users. In this keynote, Ashish Thusoo, CEO and co-founder of Qubole, will discuss the gaps in organizations’ data ambitions and ability to execute.

He will cover the gap in organizations’ ability to operationalize the infrastructure needed to support ubiquitous access to data, specifically regarding administrative expertise in data systems, the ability to predict capacity, and the ability to centrally monitor and govern usage. To address this, Ashish will discuss how cloud platforms can offer the elasticity, automation and access planes to alleviate these issues and provide a more accessible data platform.

Additionally, despite a new class of user-friendly tools, there is still a gap in companies’ ability to make data accessible throughout the organization. To truly bridge this gap, Ashish will offer strategies on how developers can take a verticalized approach to building applications on top of data so that users can benefit from easy-to-use visualizations and other tools.

Speakers

Ashish Thusoo

Co-founder, Qubole

Before co-founding Qubole, Ashish ran Facebook’s Data Infrastructure team; under his leadership the team built one of the largest data processing and analytics platforms in the world. This platform achieved not just the bold aim of making data accessible to analysts, engineers... Read More →

Wednesday May 11, 2016 9:50am - 10:10am PDT
Regency CD

Keynote

10:15am PDT

Keynote: ODPi and ASF: Building a Stronger Hadoop Ecosystem - John Mertic, Director of Program Management, ODPi

ODPi Director of Program Management, John Mertic, will explain how the work of the ODPi complements and supports that of the ASF. Since ODPi’s launch in 2015, there has been some confusion around how its work may overlap, or potentially compete, with that of the ASF. Mr. Mertic will detail how the ODPi’s specifications and by-laws reinforce the role of the ASF as the singular place where Hadoop development occurs. He will also explain how the ODPi’s focus on the downstream Hadoop ecosystem oxygenates the Big Data market and stimulates growth.

Speakers

John Mertic

Director of Program Management, The Linux Foundation

ODPi and Apache pdf

Wednesday May 11, 2016 10:15am - 10:25am PDT
Regency CD

Keynote

10:25am PDT

Coffee Break

Wednesday May 11, 2016 10:25am - 10:40am PDT
Regency Foyer

10:50am PDT

Hiding Some of Geospatial Complexity - Martin Desruisseaux, Geomatys

It is tempting to ignore the complexity of international geospatial standards on the assumption that everyone today uses coordinates given by GPS. But even though it is obsolescent, the NAD27 datum, for instance, is still critically important in the U.S., where it has been used to define many legal boundaries. Even with a modern datum, support for polar areas or supplemental dimensions can be challenging. In this talk, we will present a few key Apache SIS methods that handle a lot of this complexity, e.g. how to get Coordinate Reference Systems from strings and an estimate of transformation accuracy, through an API that avoids diving too deeply into the complexity of GIS. We will show an example of what happens under the hood during a cube transformation, to demonstrate what developers gain with SIS. Finally, we will present applications for trying Apache SIS without programming.

Speakers

Martin Desruisseaux

Developer, Geomatys

I hold a Ph.D. in oceanography, but have continuously developed tools to support analysis work. I used C/C++ before switching to Java in 1997. I have developed geospatial libraries since then, initially as a personal project, then as a GeoTools contributor until 2008. I'm now... Read More →

Hiding Some of Geospatial Complexity pdf

Wednesday May 11, 2016 10:50am - 11:40am PDT
Plaza A

Geospatial, Any

10:50am PDT

SystemML - Declarative Machine Learning - Luciano Resende, IBM

Machine learning in the enterprise is an iterative process. Data scientists tweak or replace their learning algorithm on a small data sample until they find an approach that works for the business problem, and then apply the analytics to the full data set. Apache SystemML is a new system that accelerates this kind of exploratory algorithm development for large-scale machine learning problems. SystemML provides a high-level language to quickly implement and run machine learning algorithms on Spark. SystemML’s cost-based optimizer takes care of low-level decisions about how to use Spark’s parallelism, allowing users to focus on the algorithm and the real-world problem it is trying to solve. This talk will introduce you to SystemML and get you started building declarative analytics with SystemML, using a simple Zeppelin notebook running in an Apache Spark environment.

Speakers

Luciano Resende

Architect, Spark Technology Center, IBM

Wednesday May 11, 2016 10:50am - 11:40am PDT
Georgia A

Machine Learning, Intermediate

10:50am PDT

Using a Relative Index of Performance (RIP) to Determine Optimum Configuration Settings Compared to Random Forest Assessment Using Spark - Diane Feddema, Red Hat Inc, Canada

Computer systems can be set with a myriad of options, and determining an optimal set-up for any particular application can be difficult. This pilot study demonstrates how numerous I/O performance tests with varied hardware and software configurations can be efficiently compared to determine an optimal set-up for an application. To simplify this process, a statistic was developed to provide a quick relative performance comparison. This metric can be arithmetically manipulated to provide meaningful averaging of multiple performance tests into a single overall performance indicator. We will illustrate how RIP is used by comparing I/O performance test results on approximately 50-100 different hardware/software set-ups; RIP results will be compared to results from a random forest repeat sampling technique to determine the most influential performance factors.

Speakers

Diane Feddema

Principal Software Engineer, AI/ML Performance on RHEL and OpenShift Operator Development, Red Hat

Diane Feddema is a principal software engineer at Red Hat Inc, in the Performance and Scale team. Diane is currently focused on developing and applying machine learning techniques for performance analysis using hardware accelerators, automating these analyses and displaying data in... Read More →

Using a Relative Index of Performance (RIP) to Determine Optimum Configuration Settings Compared to Random Forest Assessment Usi pdf

Wednesday May 11, 2016 10:50am - 11:40am PDT
Georgia B

Monitoring-Benchmarking, Any

10:50am PDT

Apache Yetus - Helping Solve the Last Mile Problem - Allen Wittenauer, Altiscale

In this time of rapidly growing software projects and software capabilities, where it is expected for “software to eat the world,” there is still a huge challenge going from source code to a tested, fully functional release. This is the “last mile problem,” ensuring that vision and coding become real, deployable software. To help address this problem, members of the extended Apache Hadoop/“big data” ecosystem have joined forces to create tools that reduce the burden of pre-commit testing, release note compilation and interface documentation. In this talk, Allen Wittenauer, a PMC member of the Apache Yetus project, will discuss the various components that make up the Yetus toolset, as well as how Apache Hadoop and other projects are using Apache Yetus to improve release quality.

Speakers

Allen Wittenauer

Apache Yetus PMC Member, Apache Software Foundation

Allen Wittenauer has been involved with Apache Hadoop since May 2007, when he was hired by Yahoo! to bring large-scale operational experience to the fledgling project. His work there helped create the basic blueprints that almost all Hadoop deployments follow today. At LinkedIn, his... Read More →

Wednesday May 11, 2016 10:50am - 11:40am PDT
Regency A

New Projects, Intermediate

10:50am PDT

Tailored for Spark - Petr Igrevski, eBay

We went big with Spark at eBay. Let us tell you the story of how we built a custom-tailored Spark system leveraging cloud and disaggregated storage. Watch us demonstrate our Spark developer experience as we walk you through our custom Spark-as-a-service offering. Come and learn how eBay embraced Spark, how we created a delightful environment for our data developers, and how we use this environment today.

Speakers

Petr Igrevski

Tailored for Spark pdf

Wednesday May 11, 2016 10:50am - 11:40am PDT
Plaza B

Operations-Use Cases, Intermediate

10:50am PDT

Spark After Dark 2.0: Complete End-to-End, Real-time Advanced Analytics, Big Data Reference Pipeline Including Machine Learning, Graph Processing, and Text/NLP Analytics, and Streaming Approximations Using Kafka, Spark Streaming, Spark ML, Spark SQL - Chr

The audience will participate in a live, interactive demo that generates personalized, real-time recommendations using the latest open source streaming and big data processing tools available. We’ll dive deep into not only the architecture and application code, but also the Spark, Cassandra, and Elasticsearch internal codebases that power this awesome combination of technologies. All code and demos are available on GitHub and Docker Hub. Follow the links @ advancedspark.com.

Speakers

Chris Fregly

Solution Architect, AI and machine learning, AWS

Wednesday May 11, 2016 10:50am - 11:40am PDT
Plaza C

Spark, Any

11:50am PDT

Geospatial Querying in Apache Marmotta - Sergio Fernandez, Redlink GmbH

Apache Marmotta provides different means of querying: SPARQL, LDPath, LDP, etc. GeoSPARQL extends the SPARQL constructs to represent and query geospatial data. The talk will present the recent development work to add GeoSPARQL support to Marmotta, going through the challenges and potential of this new set of features and demoing some of them during the talk.

Speakers

Sergio Fernández

Software Engineer, Redlink GmbH

I'm a Software engineer specialized in innovation, with a focus on Data Architectures. My interests include Distributed Architectures, Data Integration, Linked Data and System Engineering. I've worked as software engineer and project manager in different industries, but always somehow... Read More →

Geospatial Querying in Apache Marmotta pdf

Wednesday May 11, 2016 11:50am - 12:40pm PDT
Plaza A

Geospatial, Intermediate

11:50am PDT

Boost Spark ML Performance with Project Mnemonic - Yanping Wang & Gang Wang, Intel Corp.

Project Mnemonic is an open-source, structured data in-place persistence library for Java-based applications and frameworks. It provides unified interfaces for data manipulation on heterogeneous block/byte-addressable devices, such as DRAM, SSD, NVMe, and Cloud/network devices.
In this presentation, we will first introduce Project Mnemonic and its non-volatile Java object model, which defines in-memory non-volatile objects that can be stored directly in persistent memory. We will discuss how it can be used to allocate and reclaim heterogeneous memory and storage resources directly on DRAM, NVMe, other persistent memories, and SSD. Then we will show how in-memory non-volatile RDDs can be implemented in Spark. Finally, we will show that a 2X-plus performance boost can be achieved on a Spark ML workload after removing SerDe RDDs, caching hot data, and dramatically reducing GC pause time.

Speakers

Yanping Wang

Software Engineer, Intel Corp

As a Senior Software Performance Engineer at Intel, Yanping has been working on Java and Big Data applications performance for the past 15 years. Currently, she is focusing on improving Big Data applications performance by reducing garbage collection and serialization/de-serialization... Read More →

Gang Wang

Intel

Boost Spark ML Performance with Project Mnemonic pdf

Wednesday May 11, 2016 11:50am - 12:40pm PDT
Georgia A

Machine Learning, Intermediate

11:50am PDT

Experiences Using Apache HTrace (Incubating) in Distributed Web Search - Lewis McGibbney, NASA JPL

Recent developments within the tracing community have brought projects like Apache HTrace (Incubating) into the Apache Incubator, opening up the possibility of using tracing logic to better understand distributed applications, systems and systems-of-systems. As many will know, tracing involves a specialized use of logging to record information about a program’s execution. Although many use cases involve tracing within distributed systems such as Hadoop and databases, few tracing experiments have been carried out in the field of large-scale, distributed Web search. This presentation will combine the comprehensive tracing mechanisms in Apache HTrace (Incubating) with the scalable, flexible crawling architecture provided by Apache Nutch. Key takeaways from this presentation are development and implementation, tracing guidance for your web search stack, and future work in this area.

Speakers

Lewis McGibbney

Enterprise Search Technologist III, Jet Propulsion Laboratory

Experiences Using Apache HTrace (Incubating) in Distributed Web Search pdf

Wednesday May 11, 2016 11:50am - 12:40pm PDT
Georgia B

Monitoring-Benchmarking, Intermediate

11:50am PDT

Apache Zeppelin and Its Pluggable Architecture for Your Data Science Environment - Moon Soo Lee, NFLabs

Apache Zeppelin (incubating) is an interactive data analytics environment for distributed data processing systems. It provides a beautiful interactive web-based interface, data visualization, a collaborative work environment and many other nice features to make your data analytics more fun and enjoyable.

Zeppelin provides a pluggable architecture for backend integration, visualization, and notebook persistence storage. This presentation will describe how this pluggable architecture works and how your project can leverage it for your data science environment, as well as how to write pluggable components and register them in the package registry. Moon soo Lee will demonstrate example use cases for each pluggable component.

He will also discuss the future roadmap.

Speakers

Moon

CTO, NFLabs

Moon soo Lee is a creator of Apache Zeppelin and a co-founder and CTO at NFLabs. For the past few years he has been working on bootstrapping the Zeppelin project and its community. His recent focus is growing the Zeppelin community and driving adoption.

Wednesday May 11, 2016 11:50am - 12:40pm PDT
Regency A

New Projects, Any

11:50am PDT

Network DVR Meets Big Data - Stephen Kraiman, ARRIS

Network traffic for massive data ingest systems is often ignored, yet it can become a significant cost factor when designing a cluster. Network traffic on Hadoop and object-store-based network DVR recorders was simulated, with surprising results.

A network DVR is an example of a class of application that generates massive amounts of data on the cluster. This session explores how different implementation models affect the network traffic generated. The presenters will explore the implementation and the simulation results. The presentation will cover a variety of open source technologies including HDFS, Spark, and Kafka.

Speakers

Stephen Kraiman

ARRIS

Stephen Kraiman, Principal Architect at ARRIS, is primarily focused on the design of systems and CDN technology that monetize the storage, management and transport of video over IP networks. Stephen was cofounder for Digital Video Arts Ltd, which was acquired by SeaChange International... Read More →

Network DVR Meets Big Data pdf

Wednesday May 11, 2016 11:50am - 12:40pm PDT
Plaza B

Operations-Use Cases, Beginner

11:50am PDT

Introducing Datasets: Bringing Compile Time Type Checking and Functional Transformations to Spark DataFrames - Holden Karau, IBM

Apache Spark is a fast and general engine for distributed computing & big data processing with APIs in Scala, Java, Python, and R. DataFrames are a key part of the Spark SQL interface, allowing for relational style transformations and additional optimizations over Spark’s RDDs. Datasets bring much of the power, and compile time type checking, to Spark SQL allowing more developers to benefit from the Catalyst optimizer.

DataFrames allow developers in Apache Spark to access the power of the Catalyst optimizer while continuing to write Scala/Java/Python code. Datasets offer the ability for developers to easily write functional style transformations while still taking advantage of the Catalyst optimizer, compact bit level representation, and so on. Datasets are new in Spark 1.6 and the API will be changing in future versions. This talk will introduce and contrast the APIs.

Speakers

Holden Karau

Developer Advocate, Google

Holden Karau is a transgender Canadian open source developer advocate at Google focusing on Apache Spark, Beam, and related big data tools. Previously, she worked at IBM, Alpine, Databricks, Google (yes, this is her second time), Foursquare, and Amazon. Holden is the coauthor of Learning... Read More →

Wednesday May 11, 2016 11:50am - 12:40pm PDT
Plaza C

Spark, Intermediate

12:40pm PDT

Lunch

Wednesday May 11, 2016 12:40pm - 2:00pm PDT
Regency Foyer

2:00pm PDT

Spatial Data Based People/Vehicles Trails Analysis to Support Precision Urban Planning - Yonghua Zeng, IBM

In this session, the presenter will share experience using Hadoop-based big data technology with large volumes of cellular signal, RFID, and GPS data to analyze and predict the trails of people and vehicles in support of precision urban planning. The overall architecture includes:
1) Data ingestion with Kafka and streaming to collect and preprocess the cellular signal, RFID and GPS data generated in real time (200G+ per day)
2) An algorithm model with spatial data computation using Spark core and MLlib to analyze and predict people and vehicle trails from the collected data
3) SQL-on-Hadoop technology to provide interactive query and analysis for frontend applications
4) Spatial data visualization with GIS and grid technology to render the heatmap, people residency distribution, real road traffic status, OD map, etc.

Speakers

Yonghua Zeng

Solution Architect of Big Data, IBM

Henry Zeng is a senior architect of big data and analytics based in the IBM China Development Lab. Henry has more than 10 years of experience in the development and architecture of data management products, systems and applications, and has published two books in this area. He is now the solution... Read More →

Wednesday May 11, 2016 2:00pm - 2:50pm PDT
Plaza A

Geospatial, Intermediate

2:00pm PDT

Combining Machine Learning Frameworks with Apache Spark - Tim Hunter, Databricks, Inc.

Machine Learning (ML) workflows involve a sequence of processing and learning stages. Realistic workflows combine specialized libraries with more general data management workflows.

Apache Spark is well-known as a powerful platform to perform iterative computations required for ML. This talk presents how to combine the strengths of Spark’s ML library (MLlib) with popular packages such as scikit-learn and TensorFlow. Scikit-learn is the de facto standard ML library for Python, and TensorFlow is a library for deep learning recently open-sourced by Google.

We also discuss the improvements of MLlib in Spark 2.0 and the future of MLlib’s APIs. On the roadmap are both more algorithms and features for users, and more utilities and abstractions to aid developers.

Speakers

Tim Hunter

Databricks, Inc.

Tim Hunter is a software engineer at Databricks and contributes to the Spark MLlib project. He has been building distributed Machine Learning systems with Spark since version 0.5, before Spark was an Apache Software Foundation project.

Wednesday May 11, 2016 2:00pm - 2:50pm PDT
Georgia A

Machine Learning, Intermediate

2:00pm PDT

HiBench - The Benchmark Suite for Hadoop, Spark and Streaming - Carson Wang, Intel

HiBench is an open-source, Apache-licensed big data benchmark suite that helps evaluate different big data frameworks in terms of speed, throughput and system resource utilization. It contains a set of Hadoop, Spark and streaming workloads, including Sort, WordCount, TeraSort, PageRank, Bayes, Kmeans, enhanced DFSIO, etc. It also contains several streaming workloads for Spark Streaming, Storm and Samza. In this presentation, Carson Wang will introduce the features of HiBench and go through how to use HiBench to benchmark different big data frameworks. The talk will also cover tuning guides for workloads with different characteristics.

Speakers

Carson Wang

Carson Wang is a software engineer on Intel's big data team. He is an active open source contributor to the Spark and Tachyon projects.

HiBench The Benchmark Suite for Hadoop, Spark and Streaming pdf

Wednesday May 11, 2016 2:00pm - 2:50pm PDT
Georgia B

Monitoring-Benchmarking, Beginner

2:00pm PDT

Apache REEF - Stdlib for Big Data - Sergiy Matusevych, Microsoft

Resource managers like Apache YARN and Mesos have emerged as a critical layer in the cloud computing system stack, but the developer abstractions for leasing cluster resources and instantiating application logic are very low-level. We present Apache REEF, a powerful yet simple framework that helps developers of big data systems retain fine-grained control over cloud resources and address common problems of fault tolerance, task scheduling and coordination, caching, interprocess communication, and bulk-data transfers. We will guide developers through a simple REEF application and discuss the current state of the Apache REEF project and its place in the Hadoop ecosystem.

Speakers

Sergiy Matusevych

Sr. Research Engineer, Microsoft

Sergiy is a research engineer at Microsoft Cloud and Information Services Lab, where he is building large scale distributed systems for big data and machine learning. He is a committer to the Apache REEF project. Prior to Microsoft, Sergiy worked as a data research engineer at Yahoo... Read More →

Apache REEF Stdlib for Big Data pdf

Wednesday May 11, 2016 2:00pm - 2:50pm PDT
Regency A

New Projects, Intermediate

2:00pm PDT

ODPi and ASF Collaboration: Ask Us Anything! - John Mertic, ODPi & Jim Jagielski, Apache Software Foundation

The Apache Software Foundation (ASF) has long been the champion of open source projects that compose the larger Apache Hadoop ecosystem. ODPi is complementary to those efforts, solely focused on easing integration and standardization for downstream application vendors and end-users that build upon Apache Hadoop®. Since ODPi’s launch in 2015, there has been some confusion around how its work may overlap, or potentially compete, with that of the ASF.

Founding Member and Board Director - Apache Software Foundation, Jim Jagielski, and Director of Program Management for ODPi, John Mertic, will clear up this confusion. During the discussion, attendees will learn how ASF and ODPi are collaborating to accelerate enterprise adoption of Apache Hadoop and big data technologies. There will also be an open Q&A, where attendees can ask about ASF and ODPi projects, their work together, where the big data ecosystem is heading, and anything else that comes to mind.

Speakers

Jim Jagielski

Developer, Uber

John Mertic

Director of Program Management, The Linux Foundation

Wednesday May 11, 2016 2:00pm - 2:50pm PDT
Plaza B

Operations-Use Cases, Any

2:00pm PDT

Shared Memory Layer for Spark Applications - Dmitriy Setrakyan, GridGain

In this presentation we will talk about the need to share state across different Spark jobs and applications, and several technologies that make it possible, including Tachyon and Apache Ignite. We will dive into the importance of in-memory file systems and shared in-memory RDDs with Apache Ignite, and present a hands-on demo showing the advantages and disadvantages of each approach. We will also discuss the requirements for storing data off-heap in order to achieve large horizontal and vertical scale for applications using Spark and Ignite.

Speakers

Dmitriy Setrakyan

EVP Engineering, GridGain

Dmitriy Setrakyan is founder and Chief Product Officer at GridGain. Dmitriy has been working with distributed architectures for over 15 years and has expertise in the development of various middleware platforms, financial trading systems, CRM applications and similar systems. Prior... Read More →

Wednesday May 11, 2016 2:00pm - 2:50pm PDT
Plaza C

Spark, Beginner

3:00pm PDT

Crowd Learning for Indoor Positioning - Thomas Burgess, indoo.rs GmbH

Real-time, accurate indoor positioning brings many new possibilities and challenges. At indoo.rs (an Austrian start-up founded in 2010), we enable positioning within mobile applications (Android/iOS) so that users can find themselves and navigate through floor plans. In practice, we estimate location and movement using motion sensors and comparisons of radio scans (WiFi/iBeacon) to pre-measured reference measurements (fingerprints). We are currently transitioning from dedicated measurements to an approach that learns and updates references by analyzing data from navigating users. This approach uses the Hadoop ecosystem to combine the output of the IoT network of mobiles and beacons with big-data-based machine learning and near-real-time analytics (including visualizations). The complete solution reduces the implementation and maintenance cost of installing indoor location.

Speakers

Thomas Burgess

Director of research, indoo.rs GmbH

Thomas is the CRO of indoo.rs and has led its research efforts since 2012. Earlier, he did his PhD in particle physics at Stockholm University for the AMANDA/IceCube neutrino telescopes, and worked as a postdoctoral researcher at University of Bergen for the ATLAS experiment at the... Read More →

Crowd Learning for Indoor Positioning pdf

Wednesday May 11, 2016 3:00pm - 3:50pm PDT
Plaza A

Geospatial, Intermediate

3:00pm PDT

Real-world Analytics with Solr Cloud and Spark - Johannes Weigend, QAware GmbH

Apache Solr is a distributed NoSQL database with impressive search capabilities. Apache Spark makes M/R faster and richer. This code-intensive session shows how to combine both to solve real-time search and processing problems. The demos feature a portable Solr Cloud / Spark cluster based on Intel NUC hardware.

Speakers

Johannes Weigend

CTO, QAware GmbH

Johannes has worked as a software architect with Java since 1999 and was honoured as "Java Rockstar" at JavaOne 2015. He is a lecturer at the University of Applied Sciences in Rosenheim, Germany and technical director at QAware, a decorated software engineering company located in Munich... Read More →

Real world Analytics with Solr Cloud and Spark pdf

Wednesday May 11, 2016 3:00pm - 3:50pm PDT
Georgia A

Machine Learning, Intermediate

3:00pm PDT

Monitoring in a Distributed World - Felix Massem, codecentric AG

The IT infrastructure for distributed applications is getting bigger and more complex every day, and the sheer volume of observed events is growing with it. To ensure safe IT operation, we need a distributed and scalable monitoring architecture to evaluate these events. This session shows how to build such an architecture on open source software.
Starting with some basics on monitoring IT infrastructure and applications, we will look at key terms such as monitoring, alerting, diagnostics and reporting. Based on this, we will start to build up a monitoring architecture.
We will elaborate on and integrate the following modules: log file shipping and analysis (Logstash), system monitoring (collectd), event storage (Elasticsearch), metric generation and storage (StatsD and Graphite), as well as different dashboards (Grafana, Seyren, Kibana).

Speakers

Felix Massem

codecentric AG

Felix Massem works as a consultant for codecentric AG. His main focus is Continuous Delivery and technologies around infrastructure as code and log analysis. Besides this, he is most interested in topics like DevOps, data mining and Big Data technologies. As an author... Read More →

Wednesday May 11, 2016 3:00pm - 3:50pm PDT
Georgia B

Monitoring-Benchmarking, Beginner

3:00pm PDT

Apache S2Graph: A Large Scale Distributed Graph Database - Doyung Yoon & Hyunsung Jo, Kakao

S2Graph, a new Apache Incubator project, is a distributed and scalable OLTP graph database that supports fast traversal of extremely large graph data. S2Graph provides a set of fully asynchronous APIs for data management operations and fast breadth-first-search querying on a property graph model.
S2Graph has not only served as one of Kakao’s main storage systems, managing more than a trillion edges with 3 billion real-time and 50 billion batch updates daily, but has also provided a common API that processes 70k social graph queries per second for dozens of successful mobile services.
Maintaining large mutable graphs, merging real-time data with batch data, and providing BFS traversal over them are difficult technical problems. S2Graph has solved them successfully, so we’d like to introduce our methodology and architecture. We will also introduce use cases and feature updates since the last ApacheCon.

Speakers

Hyunsung Jo

Kakao

Seoul-based developer interested in large scale data systems and cloud computing. Currently, working as a data systems developer at Kakao Corp., Korea with open source projects such as Apache S2Graph (incubating) and Druid among others. Previous work experience include software... Read More →

Doyung Yoon

Software Engineer, Kakao

Doyung works in a distributed graph database team at Kakao as software engineer, where his focus is on performance and usability. He developed Apache S2Graph, an open-source distributed graph database, and has previously presented it at ApacheCon BigData Europe and ApacheCon BigData... Read More →

S2Graph Apache Big Data pdf

Wednesday May 11, 2016 3:00pm - 3:50pm PDT
Regency A

New Projects, Beginner

3:00pm PDT

Scylla: A Revolutionary Design for NoSQL Performs at 1.8M TPS/node - Don Marti & Tzach Livyatan, ScyllaDB

Scylla is a new NoSQL database, compatible with Apache Cassandra, that is capable of a 10x improvement in throughput on the same hardware, with predictable low latency that dramatically improves the performance of analytics originally developed for Cassandra. The database is now in use in production and in pilot projects internationally.

Scylla applies kernel programming techniques to a horizontally scalable NoSQL design to achieve extreme performance improvements and the elimination of garbage collection pauses. The Scylla design is based on a modern shared-nothing approach. A new architecture for the NoSQL server is necessary because of new growth in, and limitations of, modern server hardware. As CPU core counts continue to grow, along with the raw speed of networking and storage devices available on a modern system, software design approaches that were valid and safe even a few years ago are no longer sustainable. Scylla runs multiple engines, one per core, each with its own memory, CPU and multi-queue NIC.

With extra performance to work with, NoSQL projects can have more flexibility to focus on other concerns, such as functionality and time to market. Scylla enables faster cluster scaling, more overhead to handle complex queries, and the power to do complex analytics tasks at the same time as routine administration operations.

Speakers

Tzach Livyatan

VP Product, Scylla

Tzach Livyatan has a B.A. and MSc in Computer Science (Technion, Summa Cum Laude), and has had a 15 year career in development, system engineering and product management. In the past he worked in the Telecom domain, focusing on carrier-grade systems, signalling, policy and charging... Read More →

Don Marti

ScyllaDB

Don Marti has written for Linux Weekly News, Linux Journal, and other publications. He co-founded the Linux consulting firm Electric Lichen. Don is a strategic advisor for Mozilla, and has previously served as president and vice president of the Silicon Valley Linux Users Group and... Read More →

NoSQL Goes Native Scylla at Apache Big Data pdf

Wednesday May 11, 2016 3:00pm - 3:50pm PDT
Plaza B

NoSQL, Any | Operations-Use Cases, Any

3:00pm PDT

Time Series Processing with Apache Spark - Josef Adersberger, QAware GmbH

A lot of data is best represented as time series: operational data, financial data, and even general-purpose DWHs, where the dominant dimension is time. The area of time series databases is growing rapidly, but support in Spark for processing and analyzing time series data is still in the early stages. We present Chronix Spark, which provides a mature TimeSeriesRDD implementation for fast retrieval and complex analysis of time series data. Chronix Spark is open source software and battle-tested at a large German car manufacturer and a German telco. We show how we’ve used Chronix Spark in a real-life project and provide some benchmarks showing how it has outperformed common time series databases like OpenTSDB, KairosDB and InfluxDB. We lift the curtain and take a deep dive into the internals to show how we’ve achieved this.

Speakers

Josef Adersberger

CTO, QAware

Josef Adersberger is co-founder & CTO of QAware, a German custom software development company and CNCF silver member. He studied computer science in Rosenheim and Munich and holds a doctoral degree in software engineering. He is currently responsible for a large-scale cloud migration... Read More →

adersberger timeseries analysis spark pdf

Wednesday May 11, 2016 3:00pm - 3:50pm PDT
Plaza C

Spark, Beginner

3:50pm PDT

Coffee Break

Wednesday May 11, 2016 3:50pm - 4:10pm PDT
Regency Foyer

4:10pm PDT

Distributed Machine Learning with Apache Mahout - Suneel Marthi, Red Hat

Data science tools like R and scikit-learn are popular because they offer a convenient and familiar syntax for analysis tasks. However, these systems are limited to operating serially on data sets that can fit on a single node. Mahout-Samsara is a linear algebra environment that offers both an easy-to-use Scala DSL and efficient distributed execution for linear algebra operations. In this talk, we will look at Mahout’s distributed linear algebra capabilities and build a simple ML algorithm using the Samsara DSL. We’ll be demonstrating this using Apache Flink as the backend distributed engine. ML practitioners will come away from this talk with a better understanding of how Samsara’s linear algebra environment can help simplify developing highly scalable ML algorithms by focusing solely on the declarative specification of the algorithm while not worrying about the details of scalable distributed implementation.

Speakers

Suneel Marthi

AWS

Suneel is a Member of the Apache Software Foundation and is a Committer and PMC member on Apache Mahout, Apache OpenNLP, and Apache Streams. He's presented in the past at Flink Forward, Hadoop Summit, Berlin Buzzwords, Machine Learning Conference, Big Data Tech Warsaw and Apache Big Data.

Distributed Machine Learning with Apache Mahout pdf

Wednesday May 11, 2016 4:10pm - 5:00pm PDT
Georgia A

Machine Learning, Intermediate

4:10pm PDT

Effective HBase Healthcheck and Troubleshooting - Jayesh Thakrar, Conversant

We all know of HBase as a robust, resilient, scalable, and performant big data datastore. Once configured well, it can run hands-off for months without need for any maintenance or care and feeding. The only occasional attention needed is hardware maintenance and system troubleshooting. Since an HBase cluster is often made up of several servers and the system could be on "auto-pilot", it's the applications that may notice problems first when they occur. At those times, identifying and resolving the root cause or symptom needs to be done quickly.

Other than HDFS itself, HBase is probably the oldest and most mature component of the Hadoop ecosystem, and it is bundled with a number of tools and utilities. This presentation will cover how to effectively make them part of your troubleshooting toolbox, as well as how to formulate your own key performance and health indicators.

Speakers

Jayesh Thakrar

Sr. Software Engineer, Conversant

Wednesday May 11, 2016 4:10pm - 5:00pm PDT
Georgia B

Monitoring-Benchmarking, Beginner

4:10pm PDT

Graph Processing with Apache TinkerPop - Jason Plurad, IBM

Graphs are growing in popularity, but the landscape is becoming a hairball. Learn how to unravel it with the Apache TinkerPop graph computing framework and Gremlin, a functional, data flow language for traversing graphs. This session helps you distinguish between OLTP and OLAP graph processing as well as how to bridge the gap between graph databases and graph engines. We will offer TinkerPop alternatives for effective graph processing that go beyond Spark GraphX. We will also cover how to spin up a graph development environment quickly with Apache Ambari.

Speakers

Jason Plurad

Software Engineer, IBM

Jason Plurad is a software engineer from IBM Open Technology. He is a PMC member and committer on Apache TinkerPop, an open source graph computing framework. Jason engages in various development (including front end, web tier, NoSQL databases, and big data analytics) and promotes... Read More →

Graph Processing with Apache TinkerPop pdf

Wednesday May 11, 2016 4:10pm - 5:00pm PDT
Regency A

New Projects, Intermediate

4:10pm PDT

Apache HBase: Overview and Use Cases - Sean Busbey, Cloudera Inc

NoSQL databases are critical in building Big Data applications. Apache HBase, one of the most popular NoSQL databases, is used by Facebook, Apple, eBay and hundreds of other enterprises to store, analyze and profit from their petabyte-scale volumes of data. This tutorial, using a hands-on session with Apache HBase, will explain basic concepts of non-relational databases. Then we’ll explore some commonly seen big data usage patterns in industry, and when and how to use Apache HBase (or another, better-suited NoSQL database).

Speakers

Sean Busbey

Cloudera

Wednesday May 11, 2016 4:10pm - 5:00pm PDT
Plaza A

Operations-Use Cases, Beginner

4:10pm PDT

Data Science for the Datacenter: Analyzing Logs with Apache Spark - William Benton, Red Hat, Inc

Contemporary applications and infrastructure software leave behind a tremendous volume of metric and log data. This “digital exhaust” is inscrutable to humans and difficult for computers to analyze, since it is vast, complex, and not explicitly structured.

In this session, Will Benton will introduce the log processing domain and give you practical advice for using Apache Spark to analyze log data, including data engineering techniques to impose structure on disparate log sources; data science approaches to detect infrastructure failures; language-processing techniques to characterize the text of log messages; best practices for tuning Spark and using newer Spark features; and how to visualize your results. You’ll learn from Benton’s experience developing applications that analyze the vast log data generated within Red Hat’s network and leave well-prepared to analyze your own logs.
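As a small illustration of what "imposing structure on disparate log sources" can look like, the sketch below parses one log line into named fields. It is plain Python rather than the talk's Spark code, and the line layout (timestamp, host, service, level, message), the `LOG_PATTERN` regular expression, and the `parse_line` helper are all hypothetical. Real sources would each need their own pattern.

```python
import re
from datetime import datetime

# Hypothetical layout: "<YYYY-MM-DD HH:MM:SS> <host> <service>: <LEVEL> <message>"
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<service>[\w.-]+): "
    r"(?P<level>[A-Z]+) "
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of structured fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # keep unparseable lines out of the structured data set
    record = match.groupdict()
    record["timestamp"] = datetime.strptime(
        record["timestamp"], "%Y-%m-%d %H:%M:%S"
    )
    return record


sample = "2016-05-11 16:10:00 node42 httpd: ERROR upstream timed out"
record = parse_line(sample)
```

A function of this shape could be applied to each line of a distributed collection of raw log text, with the `None` results filtered out before analysis.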

Speakers

William Benton

Manager, Software Engineering and Sr. Principal Engineer, Red Hat, Inc

William Benton leads a team of data scientists and engineers at Red Hat, where he has applied machine learning to problems ranging from forecasting cloud infrastructure costs to designing better cycling workouts. His current focus is investigating the best ways to build and deploy... Read More →

Data Science for the Datacenter Analyzing Logs with Apache Spark pdf

Wednesday May 11, 2016 4:10pm - 5:00pm PDT
Plaza B

Operations-Use Cases, Intermediate

4:10pm PDT

Secure Spark Shuffle: A Fast and Convenient Approach Using Chimera - Cheng Xu, Intel

Shuffle is the key process in the Spark computing model, and it is very performance-sensitive. With security-related crimes and accidents becoming more frequent, data encryption is increasingly important for an enterprise-ready product. In this talk, we will describe how we use Chimera to secure shuffle data. Chimera is a cryptography library optimized with AES-NI (Advanced Encryption Standard New Instructions). It provides a Java API at both the cipher level and the Java stream level. It originates from Intel Diceros and Hadoop encryption at rest. It limits the performance impact by using hardware acceleration and helps users avoid the issues caused by native code. In this presentation, we will also show the performance results after enabling shuffle encryption in Spark.

Speakers

Cheng Xu

Senior Software Engineer, Intel

I am a software engineer at Intel. I currently work on the Apache Hive, Apache Parquet and Apache Spark projects, and I am a committer on the Apache Hive project. I am now focused on Spark authorization, especially in the Spark SQL component, and on performance improvements in Apache Parquet... Read More →

Secure Spark Shuffle A Fast and Convenient Approach Using Chimera pdf

Wednesday May 11, 2016 4:10pm - 5:00pm PDT
Plaza C

Spark, Intermediate

5:10pm PDT

Action! Does Your Project Want to Join Geospatial? - George Percivall, Open Geospatial Consortium (OGC)

This will be a capstone discussion of the geospatial track of Apache BD. Now that we have heard several excellent presentations about different projects implementing geospatial functions, is there interest in future activities?

Topics for Discussion

Is there interest in coordination across projects?
Is there interest in coordination outside of Apache?

Come join us! Future events on Big Geo Data

FOSS4G in Bonn, August 24 – 26 - confirmed
Apache in Seville later this year - To be discussed.

Speakers

George Percivall

CTO, Chief Engineer, OGC

Wednesday May 11, 2016 5:10pm - 6:00pm PDT
Plaza A

Geospatial

5:10pm PDT

Less Is More: Doubling Storage Efficiency with HDFS Erasure Coding - Zhe Zhang, LinkedIn & Kai Zheng, Intel

Ever since its creation, HDFS has been relying on data replication to shield against most failure scenarios. However, with the explosive growth in data volume, replication is getting expensive: the default 3x replication scheme incurs a 200% storage overhead. Erasure coding (EC) uses far less storage space while still providing the same level of fault tolerance. Under typical configurations, EC reduces the storage cost by ~50% compared with 3x replication.

In this talk we will introduce the design and implementation of HDFS-EC, and recommended use cases. We will also provide preliminary performance results. Equipped with the Intel ISA-L library, HDFS-EC has largely eliminated the computational overhead in codec calculation. Under sequential I/O workloads, it achieves twice the throughput compared with 3x replication, by performing striped I/O to multiple DataNodes in parallel.
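The storage figures quoted above can be checked with a short calculation. The abstract does not name the erasure coding scheme, so this sketch assumes a Reed-Solomon layout with 6 data units and 3 parity units, written RS(6,3). The function names are illustrative only.

```python
def storage_overhead(data_units, redundancy_units):
    """Extra raw storage needed per unit of user data, as a fraction."""
    return redundancy_units / data_units


def raw_storage_factor(data_units, redundancy_units):
    """Total raw storage consumed per unit of user data."""
    return 1 + storage_overhead(data_units, redundancy_units)


# 3x replication: 1 original block plus 2 extra copies.
replication_overhead = storage_overhead(1, 2)

# Assumed RS(6,3): 6 data units plus 3 parity units.
ec_overhead = storage_overhead(6, 3)

# Fraction of raw storage saved by moving from 3x replication to RS(6,3).
saving = 1 - raw_storage_factor(6, 3) / raw_storage_factor(1, 2)

print(f"3x replication overhead: {replication_overhead:.0%}")
print(f"RS(6,3) overhead:        {ec_overhead:.0%}")
print(f"Raw storage saved:       {saving:.0%}")
```

Under this assumption, each unit of user data occupies 3 units of raw storage with replication and 1.5 units with erasure coding, which matches the 200% overhead and the roughly 50% cost reduction stated in the abstract.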

Speakers

Zhe Zhang

Microsoft

Zhe Zhang is a software engineer at LinkedIn working on Hadoop. He’s an Apache Hadoop Committer and author of HDFS Erasure Coding. Before joining LinkedIn in February 2016, Zhe was an engineer on the Cloudera HDFS team. Prior to that he worked at the IBM T. J. Watson Research Center... Read More →

Kai Zheng

Wednesday May 11, 2016 5:10pm - 6:00pm PDT
Georgia B

Monitoring-Benchmarking, Intermediate

5:10pm PDT

The Many Faces of Apache Ignite - David Robinson, IBM

Come explore the capabilities of Apache Ignite in-memory grid through a series of experimental use cases to improve the performance and behavior of an existing graph package. Learn how you can improve your data processing and analytics through the judicious use of a memory grid. We will cover topics like in-memory RDD capabilities with Apache TinkerPop, how Ignite can provide a power assist to Apache Kafka for data streaming, and more.

Speakers

David Robinson

Software Engineer, IBM

David Robinson is a software engineer with IBM. David works in IBM’s Open Technologies group contributing to open source projects such as Apache TinkerPop and Titan. He is often engaged with product teams and customers developing solutions around open technology in the big data... Read More →

The Many Faces of Apache Ignite pdf

Wednesday May 11, 2016 5:10pm - 6:00pm PDT
Regency A

New Projects, Beginner

5:10pm PDT

Data Management at Scale - Tom Barber, Meteorite Consulting

Apache OODT is relatively easy to get up and running with the RADiX distribution but how do you administer it at scale?

Managing a data management cluster can be daunting, especially when it's distributed around the globe in various data centres. We’ll take a look at options for large-scale distributed rollouts of Apache OODT across multiple continents and how to connect, support and administer them, maximising the throughput of the system and ensuring users have access to all the data they require.

Container technology has drastically altered the DevOps landscape. Using service orchestration tools and Apache Mesos to maintain your cluster can make managing OODT relatively easy and infinitely scalable. We'll also show how to connect it to other data services. Find out more in a live (seat of the pants) demo.

Speakers

Tom Barber

Technical Director, Spicule LTD

Wednesday May 11, 2016 5:10pm - 6:00pm PDT
Plaza B

Operations-Use Cases, Intermediate

5:10pm PDT

Mining Public Datasets Using Apache Zeppelin (incubating) and Spark - Alexander Bezzubov, NFLabs

There are a lot of public datasets available in the wild and the number is growing. Meanwhile, the ASF provides a plethora of free tools for any practitioner to build upon. In this talk Alexander will show how to leverage two of them, Zeppelin and Spark, for exploratory data analytics and for building a data product over two real datasets: CommonCrawl (http://commoncrawl.org) and GithubArchive (https://www.githubarchive.org).

Speakers

Alexander Bezzubov

Software Engineer, NFLabs

Alexander Bezzubov is an Apache Zeppelin contributor and PMC member, and a software engineer at NFLabs. Previous speaking experience includes Apache BigData NA 2016 in Vancouver, FOSSASIA 2016 in Singapore, and Apache BigData EU 2015 in Budapest.

Mining Public Datasets pdf

Wednesday May 11, 2016 5:10pm - 6:00pm PDT
Plaza C

Spark, Beginner

6:30pm PDT

Evening Event at Steamworks Brewing Company

Join ApacheCon attendees at Steamworks Brewing Company, Canada's only steam-generated brewery, for a night of local brews, hors d'oeuvres, networking and fun.

Wednesday May 11, 2016 6:30pm - 9:00pm PDT
Steamworks Brewing Company

7:30am PDT

Breakfast

Thursday May 12, 2016 7:30am - 9:00am PDT
Regency Foyer

8:00am PDT

Registration

Thursday May 12, 2016 8:00am - 9:00am PDT
Georgia Foyer

9:00am PDT

Getting Started with Apache OODT - Tom Barber, Meteorite Consulting (Additional Fee)

With data becoming more and more prevalent, along with a requirement to store it, managing it becomes an ever greater problem. How can Apache OODT fill that void?

Apache OODT is a distributed data processing and management platform. In this talk we’ll go through installation and configuration, and how to start, deploy and test a project. We’ll run through the various components you’re likely to use, how to customise them, and how to make your users embrace data management. We’ll also take a look at workflows and resources, and at how to build simple workflows. During this presentation we’ll also connect Apache OODT to a number of different data sources to demonstrate data ingestion and metadata capture. Finally, it’s all well and good capturing data, but how do you get data out to your end users? We’ll go through the options for data extraction and dissemination to end users.

Speakers

Tom Barber

Technical Director, Spicule LTD

Thursday May 12, 2016 9:00am - 12:00pm PDT
Plaza A

Tutorial, Beginner

9:00am PDT

Getting Started with Machine Learning & Spark - Holden Karau, IBM (Additional Fee)

Apache Spark is a fast and general engine for distributed computing & big data processing with APIs in Scala, Java, Python, and R. Apache Spark ships with built in libraries for a variety of purposes including: SQL, Streaming, Graph Analysis, and Machine Learning. This talk will focus on how to use Spark for Machine Learning.

Apache Spark has two APIs for Machine Learning, the newer of which is focused on creating Machine Learning Pipelines. This talk will explore a simple classification problem in both of the APIs, followed by a tour of some of the different machine learning models. We will then talk about loading and saving models and the challenges faced when attempting to construct a real-time serving solution from Spark ML’s models. From there we will explore some of the performance improvement work being done inside Spark for machine learning.

Speakers

Holden Karau

Developer Advocate, Google

Thursday May 12, 2016 9:00am - 12:00pm PDT
Constable

Tutorial, Intermediate

9:00am PDT

Interactive Data Science from Scratch with Apache Zeppelin and Apache Spark - Felix Cheung (Additional Fee)

How do you find the needle in the haystack?

With Big Data, finding insight is a big problem. Visualization and exploratory analysis help converge on insights, and Apache Zeppelin (incubating) is an essential tool for that.

In this tutorial, Felix Cheung will introduce you to Apache Zeppelin, and provide step-by-step guides to get you up-and-running with Apache Zeppelin to run Big Data analysis with Apache Spark.

This is going to be a heavily hands-on session; no previous experience with Zeppelin, Data Science, or Statistics is necessary. Bring your laptop - attendees are expected to be able to handle some software installation steps.

You can view the materials here:
http://www.slideshare.net/felixcss/interactive-data-science-from-scratch-with-apache-zeppelin-and-apache-spark

Speakers

Felix Cheung

Engineering Manager, Uber

Felix started in the big data space about 5 years ago with the then state-of-the-art MapReduce. Since then, he has (re-)built Hadoop clusters from metal more times than he would like, created a Hadoop “distro” from two dozen or so projects into .rpm/.deb, and kicked off clusters in... Read More →

Thursday May 12, 2016 9:00am - 12:00pm PDT
Lord Byron

Tutorial, Beginner

9:00am PDT

Mission to NARs with Apache NiFi - Aldrin Piri, Hortonworks (Additional Fee)

Apache NiFi is both a powerful application and a platform for developing reliable dataflows to process and distribute data. During this tutorial, Aldrin will showcase creating a dataflow using out-of-the-box components and determining where custom components can help create more robust and expressive dataflows. Aldrin will illustrate the process and ease of creating new components and bundles (NiFi Archives, or NARs) for Apache NiFi, allowing developers to focus on core functionality while getting framework features such as concurrency, provenance, metrics, and the associated UI components for free.

Speakers

Aldrin Piri

Hortonworks

Aldrin is a Senior Member of Technical Staff at Hortonworks working on Hortonworks Data Flow (HDF). Following the open source release of NiFi by the NSA in late 2014, Aldrin has become a PMC member and committer for Apache NiFi and helps run the DMV Apache NiFi Users Group. Long time... Read More →

Thursday May 12, 2016 9:00am - 1:00pm PDT
Kensington

Tutorial, Intermediate

12:00pm PDT

Coffee Break

Thursday May 12, 2016 12:00pm - 1:00pm PDT
Regency Foyer