Apache: Big Data 2016 has ended

New Projects
Monday, May 9

10:40am PDT

The Evolution of Apache Kylin: Realtime and Plugin Architecture in Kylin2 - Luke Han, Apache Kylin
After a successful MOLAP implementation, Apache Kylin is evolving to enable realtime analysis, to support different input and output data sources, and to leverage different computing engines. In Apache Kylin2, the newly designed architecture supports pluggable adaptors for Hive, SparkSQL, Kafka, and others, and makes it possible to store data in storage systems other than HBase, such as Kudu. This session will introduce the details of these changes and upcoming features. It will also cover one production use case that already supports streaming.
1. Apache Kylin Overview
2. Plugin Architecture
3. Streaming Cubing
4. Realtime Analysis
5. Use Cases.

Luke Han

Co-Founder & CEO, Kyligence
Luke Han is Co-Founder and CEO at Kyligence, and the co-creator and VP of the open source Apache Kylin project, contributing his passion to driving the project's strategy, roadmap and product design. For the past few years he has been working on growing Apache Kylin's community...

Monday May 9, 2016 10:40am - 11:30am PDT
Regency E

11:40am PDT

Apache Trafodion Brings Operational Workloads to Hadoop - Rohit Jain, Esgyn
Apache Trafodion is a world class Transactional SQL RDBMS running on HBase/Hadoop, currently in Apache incubation.

In this talk we will discuss:
• How operational workloads are different from BI and analytical workloads
• The operational (OLTP & Operational Data Store) use cases Trafodion addresses
• Why Trafodion is the right solution for these use cases. That is, what is the recipe for a world class database engine, and how Trafodion implements the ingredients that make up that recipe:
1. Time, money, and talent!
2. World class query optimizer
3. World class parallel data flow execution engine
4. World class distributed transaction management system
• Other important aspects such as performance, scale, availability, and future directions


Rohit Jain

CTO, Esgyn
Rohit Jain is Co-Founder and CTO at Esgyn, an open source database company. Rohit provided the vision behind Apache Trafodion, an enterprise-class MPP SQL Database for Big Data, donated to the Apache Software Foundation by HP in 2015. A veteran database technologist over the past...

Monday May 9, 2016 11:40am - 12:30pm PDT
Regency E

2:00pm PDT

Data Science Applied: A Utilities Sector Case Study - Bram Steurtewagen
Automated Metering Infrastructure (AMI) is gaining traction within the utilities sector and has brought with it numerous improvements in all related fields. Specifically in tariff setting and demand response models, classification of smart meter readings into load profiles helps find the right segments to target. The methodology explained in this tutorial combines commercial, government and open data with the internal company data to accurately predict the load profile of a new customer using high performing classification models in both R and PySpark. Load profiles are generated using a clustering algorithm and are subsequently used as the dependent variable in our classification model. The results of this model are then scored and interpreted in a business context. During the entire process, possible business hurdles will be identified and solutions will be offered.
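
The two-stage idea in this abstract (cluster meter readings into load profiles, then predict a new customer's profile from data known at sign-up) can be sketched in a few lines. This is a minimal illustration only, not the speaker's method: the toy readings, the customer attributes, the plain k-means routine, and the 1-nearest-neighbour classifier are all assumptions standing in for the R and PySpark models the talk describes.

```python
from math import dist

def kmeans(points, centroids, iterations=10):
    """Plain k-means on tuples; returns (labels, centroids)."""
    labels = []
    for _ in range(iterations):
        # Assignment step: each curve joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centroids[k])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids

# Hypothetical smart meter readings: average kWh for
# (morning, midday, evening, night), one tuple per existing customer.
load_curves = [
    (0.5, 0.3, 1.5, 0.4), (0.6, 0.2, 1.4, 0.5), (0.4, 0.3, 1.6, 0.4),
    (1.2, 1.8, 0.4, 0.1), (1.1, 1.7, 0.5, 0.2), (1.3, 1.9, 0.3, 0.1),
]

# Hypothetical attributes known at sign-up: (occupants, is_business).
attributes = [(2, 0), (4, 0), (3, 0), (10, 1), (12, 1), (8, 1)]

# Stage 1: cluster the curves; the cluster index is the load profile.
profiles, _ = kmeans(load_curves, [load_curves[0], load_curves[3]])

# Stage 2: use the profile as the dependent variable of a classifier.
# A 1-nearest-neighbour rule stands in for a real classification model.
def predict_profile(new_customer):
    nearest = min(range(len(attributes)),
                  key=lambda i: dist(new_customer, attributes[i]))
    return profiles[nearest]

print(profiles)
print(predict_profile((3, 0)), predict_profile((9, 1)))
```

In practice the classifier would be trained and scored on held-out customers, which is the evaluation step the abstract refers to.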

Bram Steurtewagen

Ghent University
Bram Steurtewagen received his M.Sc. degree in Commercial Engineering (2013) and his M.Sc. degree in Marketing Analytics (2014) from Ghent University in Belgium. Since then, he has been pursuing a PhD in Marketing Analytics at the Faculty of Economics and Business Administration of...

Monday May 9, 2016 2:00pm - 2:50pm PDT
Regency E

3:00pm PDT

Introduction to Apache Kudu (Incubating) for Timeseries Storage - Dan Burkert, Cloudera
Apache Kudu (Incubating) is a new columnar storage engine for the Hadoop
ecosystem. Kudu is designed to handle the stresses of the modern analytics
pipeline, enabling real time ingestion with instant querying capability at

This talk will introduce Kudu, giving an overview of the architecture and
internals. After discussing what makes Kudu different from existing Hadoop
storage platforms, we will discuss why Kudu is particularly well suited for
storing and querying large timeseries datasets. The talk will conclude by
demonstrating a realtime timeseries analytics dashboard powered by Kudu.


Dan Burkert

Dan Burkert is a software engineer at Cloudera and committer on Apache Kudu (Incubating). Prior to joining Cloudera, Dan worked on data processing pipelines for machine learning, search, and analytics. Dan received his bachelor’s degree from the University of Virginia.

Monday May 9, 2016 3:00pm - 3:50pm PDT
Regency E

4:10pm PDT

Everyone Plays: Collaborative Data Science with Zeppelin - Trevor Grant, Market6
Data Science is best played as a team sport. Zeppelin facilitates this collaboration via a web-based notebook interface to state-of-the-art big data tools (Flink, Spark, Hive, Cassandra, and many more), with custom visualization powered by AngularJS built in. Markdown allows for rich notation in-line with the code. Work can be shared seamlessly across the organization. Further, interactive visualizations can be shared with business analysts and sales reps, which is great for prototyping and proofs of concept. But the collaboration also runs between technologies, by leveraging the Zeppelin Context to share variables between contexts. For example, the results of a Flink paragraph can be passed to a Spark paragraph; the best tool for the job can be used at each step in the analytics pipeline, and a data scientist who loves Scala Flink can easily work with a data scientist who loves PySpark.

Trevor Grant

Open Source AI / IoT Evangelist, IBM
Trevor is an open source evangelist at IBM in Watson IoT. He is also a PMC member of the Apache Mahout, Apache Streams, and Apache Community Development projects. He has spoken at conferences and Meetups internationally.

Monday May 9, 2016 4:10pm - 5:00pm PDT
Regency E

5:10pm PDT

The New Time Series Kid on the Block - Florian Lautenschlager, QAware GmbH
There is a new open source time series database on the block that allows one to store billions of time series points and access them within a few milliseconds.
Chronix [1] is a young but mature open source time series database that achieves a compression rate of 98% compared to data in CSV files, with an average query time of 21 milliseconds. Chronix is built on top of Apache Solr [2], a bulletproof NoSQL database with impressive search capabilities. Chronix relies on Solr plugins, and anyone who has Solr running can create a new Chronix core within a few minutes.
In this session we show how Chronix achieves its efficiency on both counts by means of ideal chunking, selecting the best compression technique, enhancing the stored data with pre-computed attributes, and specialized time series query functions.

[1] http://chronix.io
[2] http://lucene.apache.org/solr/
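
The combination of chunking, compression, and pre-computed attributes named in this abstract can be illustrated with a small standard-library sketch. It is not Chronix's actual storage format: the chunk layout, the delta encoding of timestamps, the use of zlib, and every function name below are assumptions chosen for illustration.

```python
import struct
import zlib

def encode_chunk(points):
    """Pack a sorted list of (timestamp_ms, value) pairs into one chunk."""
    timestamps = [t for t, _ in points]
    values = [v for _, v in points]
    # Delta-encode timestamps: regular sampling gives repeated deltas,
    # which a general-purpose compressor handles very well.
    deltas = [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]
    raw = struct.pack(f"<{len(deltas)}q{len(values)}d", *deltas, *values)
    return {
        # Pre-computed attributes stored next to the compressed payload.
        "start": timestamps[0], "end": timestamps[-1], "count": len(points),
        "min": min(values), "max": max(values),
        "data": zlib.compress(raw, 9),
    }

def decode_chunk(chunk):
    """Reverse encode_chunk and return the original (timestamp, value) pairs."""
    n = chunk["count"]
    unpacked = struct.unpack(f"<{n}q{n}d", zlib.decompress(chunk["data"]))
    timestamps, current = [], 0
    for delta in unpacked[:n]:
        current += delta
        timestamps.append(current)
    return list(zip(timestamps, unpacked[n:]))

def query_max(chunks, start, end):
    """Maximum value in [start, end]; decompress only partly covered chunks."""
    best = None
    for chunk in chunks:
        if chunk["end"] < start or chunk["start"] > end:
            continue  # no overlap with the query range
        if start <= chunk["start"] and chunk["end"] <= end:
            candidate = chunk["max"]  # answered from the stored attribute
        else:
            inside = [v for t, v in decode_chunk(chunk) if start <= t <= end]
            if not inside:
                continue
            candidate = max(inside)
        best = candidate if best is None else max(best, candidate)
    return best

# One reading per second for 1000 seconds, split into chunks of 100 points.
points = [(1000 * i, float(i % 10)) for i in range(1000)]
chunks = [encode_chunk(points[i:i + 100]) for i in range(0, len(points), 100)]

print(len(chunks), sum(len(c["data"]) for c in chunks))
print(query_max(chunks, 0, 999_000), query_max(chunks, 0, 3000))
```

The chunk size is the main tuning knob in a design like this: larger chunks compress better, while smaller chunks mean less data to decompress for a narrow query.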

Florian Lautenschlager

Engineer, QAware GmbH
Florian Lautenschlager is an architect at QAware GmbH Germany. He is also a guest researcher at FAU Erlangen-Nürnberg. Florian studied Computer Science at the University of Applied Science Rosenheim. He works on a research project called Design for Diagnosability in which time series...

Monday May 9, 2016 5:10pm - 6:00pm PDT
Regency E
Wednesday, May 11

10:50am PDT

Apache Yetus - Helping Solve the Last Mile Problem - Allen Wittenauer, Altiscale
In this time of rapidly growing software projects and software capabilities, where it is expected for “software to eat the world,” there is still a huge challenge going from source code to a tested, fully functional release. This is the “last mile problem,” ensuring that vision and coding become real, deployable software. To help address this problem, members of the extended Apache Hadoop/“big data” ecosystem have joined forces to create tools that reduce the burden of pre-commit testing, release note compilation and interface documentation. In this talk, Allen Wittenauer, a PMC member of the Apache Yetus project, will discuss the various components that make up the Yetus toolset, as well as how Apache Hadoop and other projects are using Apache Yetus to improve release quality.

Allen Wittenauer

Apache Yetus PMC Member, Apache Software Foundation
Allen Wittenauer has been involved with Apache Hadoop since May 2007, when he was hired by Yahoo! to bring large-scale operational experience to the fledgling project. His work there helped create the basic blueprints that almost all Hadoop deployments follow today. At LinkedIn, his...

Wednesday May 11, 2016 10:50am - 11:40am PDT
Regency A

11:50am PDT

Apache Zeppelin and Its Pluggable Architecture for Your Data Science Environment - Moon Soo Lee, NFLabs
Apache Zeppelin (incubating) is an interactive data analytics environment for distributed data processing systems. It provides a beautiful interactive web-based interface, data visualization, a collaborative work environment, and many other nice features to make your data analytics more fun and enjoyable.

Zeppelin provides a pluggable architecture for backend integration, visualization, and notebook persistence storage. This presentation will describe how these pluggable architectures work and how your project can leverage them for your data science environment, as well as how to write pluggable components and register them in the package registry. Moon soo Lee will demonstrate example use cases for each pluggable component.

He will also discuss the future roadmap.

Moon Soo Lee

CTO, NFLabs
Moon soo Lee is a creator of Apache Zeppelin and Co-Founder and CTO at NFLabs. For the past few years he has been working on bootstrapping the Zeppelin project and its community. His recent focus is growing the Zeppelin community and driving adoption.

Wednesday May 11, 2016 11:50am - 12:40pm PDT
Regency A

2:00pm PDT

Apache REEF - Stdlib for Big Data - Sergiy Matusevych, Microsoft
Resource managers like Apache YARN and Mesos have emerged as a critical layer in the cloud computing system stack, but the developer abstractions for leasing cluster resources and instantiating application logic are very low-level. We present Apache REEF, a powerful yet simple framework that helps developers of big data systems retain fine-grained control over cloud resources and address common problems of fault tolerance, task scheduling and coordination, caching, interprocess communication, and bulk-data transfers. We will guide developers through a simple REEF application and discuss the current state of the Apache REEF project and its place in the Hadoop ecosystem.

Sergiy Matusevych

Sr. Research Engineer, Microsoft
Sergiy is a research engineer at Microsoft Cloud and Information Services Lab, where he is building large scale distributed systems for big data and machine learning. He is a committer to the Apache REEF project. Prior to Microsoft, Sergiy worked as a data research engineer at Yahoo...

Wednesday May 11, 2016 2:00pm - 2:50pm PDT
Regency A

3:00pm PDT

Apache S2Graph: A Large Scale Distributed Graph Database - Doyung Yoon & Hyunsung Jo, Kakao
S2Graph, the new Apache incubator project, is a distributed and scalable OLTP graph database that supports fast traversal of extremely large graph data. S2Graph provides a set of fully asynchronous APIs for data management operations and fast breadth-first-search querying on a property graph model.
S2Graph has not only been used as one of Kakao's main storage systems, managing more than a trillion edges with 3 billion real-time and 50 billion batch updates daily, but has also provided a common API for processing 70k social graph queries per second for dozens of successful mobile services.
Maintaining large mutable graphs, merging real-time data with batch data, and providing BFS traversal over them are difficult technical problems. S2Graph has solved them successfully, so we'd like to introduce our methodology and architecture. We will also introduce use cases and feature updates since the last ApacheCon.
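
For readers unfamiliar with breadth-first-search querying on a property graph, the following minimal in-memory sketch shows the idea. It does not use S2Graph's API or storage layout; the adjacency-list structure, the edge labels, and the `bfs` function are hypothetical and for illustration only.

```python
from collections import deque

# Hypothetical property graph: source vertex -> list of
# (edge label, target vertex, edge properties).
graph = {
    "alice": [("friend", "bob", {"since": 2014}),
              ("friend", "carol", {"since": 2015})],
    "bob": [("friend", "dave", {"since": 2016})],
    "carol": [("friend", "dave", {"since": 2013}),
              ("follows", "erin", {"since": 2016})],
    "dave": [("friend", "frank", {"since": 2016})],
}

def bfs(graph, start, label, max_depth):
    """Return {vertex: hops} for vertices reachable from `start`
    over edges with the given label, within `max_depth` hops."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if depths[vertex] == max_depth:
            continue  # do not expand beyond the requested depth
        for edge_label, target, _properties in graph.get(vertex, []):
            if edge_label == label and target not in depths:
                depths[target] = depths[vertex] + 1
                queue.append(target)
    return depths

# Friends and friends-of-friends of alice.
print(bfs(graph, "alice", "friend", 2))
```

A distributed implementation has to run the same level-by-level expansion across storage nodes while edges are being updated, which is where the difficulties named in the abstract come from.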

Hyunsung Jo

Seoul-based developer interested in large scale data systems and cloud computing. Currently working as a data systems developer at Kakao Corp., Korea, with open source projects such as Apache S2Graph (incubating) and Druid among others. Previous work experience includes software...

Doyung Yoon

Software Engineer, Kakao
Doyung works in a distributed graph database team at Kakao as a software engineer, where his focus is on performance and usability. He developed Apache S2Graph, an open-source distributed graph database, and has previously presented it at ApacheCon BigData Europe and ApacheCon BigData...

Wednesday May 11, 2016 3:00pm - 3:50pm PDT
Regency A

4:10pm PDT

Graph Processing with Apache TinkerPop - Jason Plurad, IBM
Graphs are growing in popularity, but the landscape is becoming a hairball. Learn how to unravel it with the Apache TinkerPop graph computing framework and Gremlin, a functional, data flow language for traversing graphs. This session helps you distinguish between OLTP and OLAP graph processing as well as how to bridge the gap between graph databases and graph engines. We will offer TinkerPop alternatives for effective graph processing that go beyond Spark GraphX. We will also cover how to spin up a graph development environment quickly with Apache Ambari.


Jason Plurad

Software Engineer, IBM
Jason Plurad is a software engineer from IBM Open Technology. He is a PMC member and committer on Apache TinkerPop, an open source graph computing framework. Jason engages in various development (including front end, web tier, NoSQL databases, and big data analytics) and promotes...

Wednesday May 11, 2016 4:10pm - 5:00pm PDT
Regency A

5:10pm PDT

The Many Faces of Apache Ignite - David Robinson, IBM
Come explore the capabilities of the Apache Ignite in-memory grid through a series of experimental use cases to improve the performance and behavior of an existing graph package. Learn how you can improve your data processing and analytics through the judicious use of a memory grid. We will cover topics like in-memory RDD capabilities with Apache TinkerPop, how Ignite can provide a power assist to Apache Kafka for data streaming, and more.


David Robinson

Software Engineer, IBM
David Robinson is a software engineer with IBM. David works in IBM’s Open Technologies group contributing to open source projects such as Apache TinkerPop and Titan. He is often engaged with product teams and customers developing solutions around open technology in the big data...

Wednesday May 11, 2016 5:10pm - 6:00pm PDT
Regency A