Apache: Big Data 2016 has ended


Spark
Tuesday, May 10

9:00am PDT

Random Forest Clustering with Apache Spark - Erik Erlandson, Red Hat, Inc.
Analytics applications often boil down to grouping objects into two or more clusters having similar elements. Defining what “similar” means can be surprisingly difficult when data elements have many columns or dimensions. Having tools at hand to generate quality clusters from high-dimensional data greatly increases the variety of applications that can successfully leverage clustering.

In this presentation, Erik Erlandson will introduce the basic principles and advantages of Random Forest learning models and Random Forest clustering. He will explain how to build up an implementation of Random Forest clustering in the Apache Spark analytics framework, based on the Spark MLlib Random Forest modeling API.

The presentation will include examples of Random Forest clustering applied to VM installed-package profiles and a discussion of practical issues encountered along the way.

Erik Erlandson

Principal Software Engineer, Red Hat

Tuesday May 10, 2016 9:00am - 9:50am PDT
Plaza C

10:00am PDT

Clickstream Analysis with Apache Spark - Andreas Zitzelsberger, QAware GmbH
On large-scale web sites, users leave thousands of traces every second. Businesses need to process and interpret these traces in real time to be able to react to the behavior of their users.
In this talk, Andreas will show a real-world example of the power of a modern open-source stack.
He will walk you through the design of a real-time clickstream analysis PaaS solution based on Apache Spark, Kafka, Parquet and HDFS, explain the design decisions, and present the lessons learned.

Andreas Zitzelsberger

Principal Software Architect, QAware GmbH
Andreas is Principal Software Architect at QAware, an independent cloud native software manufacturer that has been repeatedly awarded Best IT Workplace in Germany. His focus is cloud native computing in all its glory. He is responsible for the heavy lifting at a large-scale cloud...

Tuesday May 10, 2016 10:00am - 10:50am PDT
Plaza C

11:20am PDT

Spark Cyborgs - Deep Integration of Spark with Parallel Relational Engines - Torsten Steinbach & Gustavo Arocena, IBM
In this session we describe a family of hybrid engines that result from a deep two-way integration between Spark and parallel RDBMSs. This integration differs from projects like Hive on Spark, which leverage Spark purely as an execution framework. It also goes beyond what’s possible with the current version of the DataSources API in terms of leveraging the capabilities of the storage backend. In our presentation you will learn about four essential building blocks of the hybrid engines:
1. Derive DataFrame partitioning implicitly from parallel RDBMS partitioning
2. Colocation and efficient data movement between Spark and RDBMS processes
3. Hybrid queries by augmenting parallel RDBMS with Spark
4. Spark machine learning integrated in RDBMS for relational data


Gustavo Arocena

Big Data Architect, IBM
Gustavo Arocena is a Big Data Architect at the IBM Toronto Lab, with more than 10 years of experience in database technology and language processing. Recently he has led the design and implementation of several components of the Big SQL engine, including the Hive-compatible IO layer...

Torsten Steinbach

Torsten has been a software architect for database technology at IBM for many years. He led product development for DB2 performance management tooling, Netezza workload management and in-database analytics. Currently he works on IBM’s cloud data warehouse dashDB and its integrated...

Tuesday May 10, 2016 11:20am - 12:10pm PDT
Plaza C

2:00pm PDT

On the Fly Retraining of Predictive Analytical Models Using Spark Streaming: An Equity-price Direction Prediction Case Study - Tijl Carpels, Ghent University
FinTech companies face the challenge of predicting the direction of equity prices. In this study we used algorithms provided in Spark MLlib to address this problem. Due to the characteristics of the equity market, prediction happens in a streaming environment, which requires us to continuously monitor the performance of the predictive model. When performance drops below a certain threshold, we trigger a batch training of the model. We built a proof of concept using open-source tools (Apache Spark and Spark-notebook).
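
The monitor-and-retrain loop described in this abstract can be sketched roughly as follows. This is a minimal plain-Python illustration, not the speaker's Spark Streaming code: the `DriftMonitor` class, the window size, the 0.55 threshold, and the `retrain` callback are hypothetical names and values chosen only for the example.

```python
from collections import deque
from typing import Callable, Deque, List


class DriftMonitor:
    """Tracks rolling accuracy of a price-direction predictor and requests a
    batch retrain when accuracy falls below a threshold (illustrative only)."""

    def __init__(self, window: int, threshold: float,
                 retrain: Callable[[], None]) -> None:
        # Keep only the most recent `window` hit/miss outcomes.
        self.outcomes: Deque[bool] = deque(maxlen=window)
        self.threshold = threshold
        self.retrain = retrain

    def accuracy(self) -> float:
        # Fraction of correct predictions in the current window.
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 1.0

    def observe(self, predicted_up: bool, actual_up: bool) -> bool:
        """Record one prediction; return True if a retrain was triggered."""
        self.outcomes.append(predicted_up == actual_up)
        window_full = len(self.outcomes) == self.outcomes.maxlen
        if window_full and self.accuracy() < self.threshold:
            self.retrain()          # hypothetical hook for the batch training job
            self.outcomes.clear()   # start a fresh window for the new model
            return True
        return False


# Toy stream of (predicted direction, actual direction) pairs.
retrain_calls: List[str] = []
monitor = DriftMonitor(window=4, threshold=0.55,
                       retrain=lambda: retrain_calls.append("retrain"))
stream = [(True, True), (True, False), (False, True), (True, False), (True, True)]
triggered = [monitor.observe(p, a) for p, a in stream]
```

Waiting for a full window before comparing against the threshold is one possible design choice; it avoids triggering a retrain on the first one or two misses after a model has just been refreshed.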

Tijl Carpels

Doctoral Researcher - Data Scientist, Ghent University
Tijl Carpels received his M.Sc. degree in Business Engineering (major: Finance) in 2015 after writing a dissertation in the field of fraud prediction. Afterwards he accepted a research position at Ghent University in order to pursue a PhD in Data Analytics at the Faculty of Economics...

Tuesday May 10, 2016 2:00pm - 2:50pm PDT
Plaza C

3:00pm PDT

Real Time BOM Explosions with Apache Solr and Spark - Andreas Zitzelsberger, QAware GmbH
Bills of materials (BOMs) are at the heart of every manufacturing process. Especially large BOMs can be found in the automotive industry, where a complex and highly variable product meets high production volumes.
Drawing from experience gained in an ongoing real-world project for a major car manufacturer, Andreas will provide an in-depth view of how Apache Solr and Apache Spark were used to power an innovative architecture that provides lightning-fast BOM explosions, demand forecasts and scenario-based planning on 20 billion records per scenario.
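
For readers unfamiliar with the term, a BOM explosion expands a product into the total quantities of the components needed to build it. The sketch below shows the idea on a toy in-memory structure; it is not the Solr/Spark architecture presented in the talk, and the `BOM` dictionary, part names, and quantities are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical single-level BOM: parent -> list of (child, quantity per parent).
BOM: Dict[str, List[Tuple[str, int]]] = {
    "car": [("engine", 1), ("wheel", 4)],
    "engine": [("piston", 4), ("bolt", 20)],
    "wheel": [("rim", 1), ("bolt", 5)],
}


def explode(item: str, quantity: int = 1) -> Dict[str, int]:
    """Return total quantities of leaf components needed to build
    `quantity` units of `item`. Assumes the BOM contains no cycles."""
    totals: Dict[str, int] = defaultdict(int)
    stack = [(item, quantity)]
    while stack:
        part, qty = stack.pop()
        children = BOM.get(part)
        if not children:
            # Leaf component: accumulate the required quantity.
            totals[part] += qty
            continue
        for child, per_parent in children:
            # Multiply quantities down each level of the structure.
            stack.append((child, qty * per_parent))
    return dict(totals)


requirements = explode("car")
```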

Andreas Zitzelsberger

Principal Software Architect, QAware GmbH
Andreas is Principal Software Architect at QAware, an independent cloud native software manufacturer that has been repeatedly awarded Best IT Workplace in Germany. His focus is cloud native computing in all its glory. He is responsible for the heavy lifting at a large-scale cloud...

Tuesday May 10, 2016 3:00pm - 3:50pm PDT
Plaza C
Wednesday, May 11

10:50am PDT

Spark After Dark 2.0: Complete End-to-End, Real-time Advanced Analytics, Big Data Reference Pipeline Including Machine Learning, Graph Processing, and Text/NLP Analytics, and Streaming Approximations Using Kafka, Spark Streaming, Spark ML, Spark SQL - Chr
The audience will participate in a live, interactive demo that generates personalized, real-time recommendations using the latest open source streaming and big data processing tools available. We’ll dive deep into not only the architecture and application code, but also the Spark, Cassandra, and Elasticsearch internal codebases that power this awesome combination of technologies. All code and demos are available on GitHub and DockerHub. Follow the links @ advancedspark.com.

Chris Fregly

Developer Advocate, AI and Machine Learning, AWS

Wednesday May 11, 2016 10:50am - 11:40am PDT
Plaza C

11:50am PDT

Introducing Datasets: Bringing Compile Time Type Checking and Functional Transformations to Spark DataFrames - Holden Karau, IBM
Apache Spark is a fast and general engine for distributed computing & big data processing with APIs in Scala, Java, Python, and R. DataFrames are a key part of the Spark SQL interface, allowing for relational-style transformations and additional optimizations over Spark’s RDDs. Datasets bring much of the power, and compile-time type checking, to Spark SQL, allowing more developers to benefit from the Catalyst optimizer.

DataFrames allow developers in Apache Spark to access the power of the Catalyst optimizer while continuing to write Scala/Java/Python code. Datasets let developers easily write functional-style transformations while still taking advantage of the Catalyst optimizer, compact bit-level representation, and so on. Datasets are new in Spark 1.6 and the API will be changing in future versions. This talk will introduce and contrast the APIs.

Holden Karau

Developer Advocate, Google
Holden Karau is a transgender Canadian open source developer advocate at Google focusing on Apache Spark, Beam, and related big data tools. Previously, she worked at IBM, Alpine, Databricks, Google (yes, this is her second time), Foursquare, and Amazon. Holden is the coauthor of Learning...

Wednesday May 11, 2016 11:50am - 12:40pm PDT
Plaza C

2:00pm PDT

Shared Memory Layer for Spark Applications - Dmitriy Setrakyan, GridGain
In this presentation we will talk about the need to share state across different Spark jobs and applications and several technologies that make it possible, including Tachyon and Apache Ignite. We will dive into the importance of in-memory file systems and shared in-memory RDDs with Apache Ignite, and present a hands-on demo showing the advantages and disadvantages of one approach over another. We will also discuss the requirements for storing data off-heap in order to achieve large horizontal and vertical scale for applications using Spark and Ignite.


Dmitriy Setrakyan

EVP Engineering, GridGain
Dmitriy Setrakyan is founder and Chief Product Officer at GridGain. Dmitriy has been working with distributed architectures for over 15 years and has expertise in the development of various middleware platforms, financial trading systems, CRM applications and similar systems. Prior...

Wednesday May 11, 2016 2:00pm - 2:50pm PDT
Plaza C

3:00pm PDT

Time Series Processing with Apache Spark - Josef Adersberger, QAware GmbH
A lot of data is best represented as time series: operational data, financial data, and even in general-purpose DWHs the dominant dimension is time. The area of time series databases is growing rapidly, but support in Spark for processing and analyzing time series data is still in its early stages. We present Chronix Spark, which provides a mature TimeSeriesRDD implementation for fast retrieval and complex analysis of time series data. Chronix Spark is open source software and battle-proven at a big German car manufacturer and a German telco. We show how we've used Chronix Spark in a real-life project and provide some benchmarks showing how it has outperformed common time series databases like OpenTSDB, KairosDB and InfluxDB. We lift the curtain and dive deep into the internals of how we've achieved this.

Josef Adersberger

CTO, QAware
Josef Adersberger is co-founder & CTO of QAware, a German custom software development company and CNCF silver member. He studied computer science in Rosenheim and Munich and holds a doctoral degree in software engineering. He is currently responsible for a large-scale cloud migration...

Wednesday May 11, 2016 3:00pm - 3:50pm PDT
Plaza C

4:10pm PDT

Secure Spark Shuffle: A Fast and Convenient Approach Using Chimera - Cheng Xu, Intel
Shuffle is the key process in the Spark computing model, and it is very sensitive to performance. With security-related crimes and incidents becoming more frequent, data encryption is increasingly important for an enterprise-ready product. In this talk, we will discuss how we use Chimera to secure shuffle data. Chimera is a cryptography library optimized with AES-NI (Advanced Encryption Standard New Instructions). It provides a Java API at both the cipher level and the Java stream level. It originates from Intel Diceros and Hadoop encryption at rest. It limits the performance impact by using hardware acceleration and helps users avoid issues caused by native code. In this presentation, we will also show performance results after enabling shuffle encryption in Spark.

Cheng Xu

Senior Software Engineer, Intel
I am a software engineer at Intel. I am now working on the Apache Hive, Apache Parquet, and Apache Spark projects. I am a committer on the Apache Hive project. I am currently focused on Spark authorization, especially in the Spark SQL component, and on performance improvements in Apache Parquet...

Wednesday May 11, 2016 4:10pm - 5:00pm PDT
Plaza C

5:10pm PDT

Mining Public Datasets Using Apache Zeppelin (incubating) and Spark - Alexander Bezzubov, NFLabs
There are a lot of public datasets available in the wild, and the number is growing. In the meantime, the ASF provides a plethora of free tools for any practitioner to build upon. In this talk Alexander will show how to leverage two of them, Zeppelin and Spark, for exploratory data analytics and for building a data product over two real datasets: CommonCrawl (http://commoncrawl.org) and GithubArchive (https://www.githubarchive.org).


Alexander Bezzubov

Software Engineer, NFLabs
Alexander Bezzubov is an Apache Zeppelin contributor, PMC member and software engineer at NFLabs. Previous speaking experience includes Apache BigData NA 2016 in Vancouver, FOSSASIA 2016 in Singapore, and Apache BigData EU 2015 in Budapest.

Wednesday May 11, 2016 5:10pm - 6:00pm PDT
Plaza C