Apache: Big Data 2016 has ended

Spark
Tuesday, May 10
 

9:00am

Random Forest Clustering with Apache Spark - Erik Erlandson, Red Hat, Inc.
Analytics applications often boil down to grouping objects into two or more clusters having similar elements. Defining what “similar” means can be surprisingly difficult when data elements have many columns or dimensions. Having tools at hand to generate quality clusters from high-dimensional data greatly increases the variety of applications that can successfully leverage clustering.

In this presentation, Erik Erlandson will introduce the basic principles and advantages of Random Forest learning models and Random Forest clustering. He will explain how to build up an implementation of Random Forest clustering in the Apache Spark analytics framework, based on the Spark MLlib Random Forest modeling API.

The presentation will include examples of Random Forest clustering applied to VM installed-package profiles and a discussion of practical issues encountered along the way.

Speakers

Erik Erlandson

Principal Software Engineer, Red Hat



Tuesday May 10, 2016 9:00am - 9:50am
Plaza C

11:20am

Spark Cyborgs - Deep Integration of Spark with Parallel Relational Engines - Torsten Steinbach & Gustavo Arocena, IBM
In this session we describe a family of hybrid engines that result from a deep two-way integration between Spark and parallel RDBMSs. This integration differs from projects like Hive on Spark, which leverage Spark purely as an execution framework. It also goes beyond what's possible with the current version of the DataSources API in terms of leveraging the capabilities of the storage backend. In our presentation you will learn about four essential building blocks of the hybrid engines:
1. Derive DataFrame partitioning implicitly from parallel RDBMS partitioning
2. Colocation and efficient data movement between Spark and RDBMS processes
3. Hybrid queries by augmenting parallel RDBMS with Spark
4. Spark machine learning integrated in RDBMS for relational data

Speakers

Gustavo Arocena

Big Data Architect, IBM
Gustavo Arocena is a Big Data Architect at the IBM Toronto Lab, with more than 10 years of experience in database technology and language processing. Recently he has led the design and implementation of several components of the Big SQL engine, including the Hive-compatible IO layer...

Torsten Steinbach

IBM
Torsten has been a software architect for database technology at IBM for many years. He led product development for DB2 performance management tooling, Netezza workload management and in-database analytics. Currently he works on IBM's cloud data warehouse dashDB and its integrated...



Tuesday May 10, 2016 11:20am - 12:10pm
Plaza C

2:00pm

On the Fly Retraining of Predictive Analytical Models Using Spark Streaming: An Equity-price Direction Prediction Case Study - Tijl Carpels, Ghent University
FinTech companies face the challenge of predicting the direction of equity prices. In this study we used algorithms provided in Spark MLlib to address this problem. Because of the characteristics of the equity market, prediction happens in a streaming environment, which requires us to continuously monitor the performance of the predictive model. When the performance drops below a certain threshold, we trigger a batch training of the model. We built a proof of concept using different open-source tools (Apache Spark and Spark-notebook).
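
The monitor-and-retrain loop described in this abstract can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the speaker's implementation: it does not use Spark Streaming or MLlib, and the names RetrainingMonitor, run_stream, and train_fn, the rolling-accuracy metric, the window size, and the threshold value are all hypothetical.

```python
from collections import deque


class RetrainingMonitor:
    """Tracks rolling accuracy of up/down predictions (hypothetical helper)."""

    def __init__(self, window_size=50, threshold=0.55):
        # Both defaults are illustrative assumptions, not values from the talk.
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, predicted_up, actual_up):
        # Store True when the predicted direction matched the observed one.
        self.window.append(predicted_up == actual_up)

    def accuracy(self):
        if not self.window:
            return None
        return sum(self.window) / len(self.window)

    def should_retrain(self):
        # Judge the model only once a full window of outcomes is available.
        return len(self.window) == self.window.maxlen and self.accuracy() < self.threshold


def run_stream(events, model, train_fn, monitor):
    """Score each event as it arrives; retrain in batch when accuracy drops.

    events   : iterable of (features, actual_up) pairs
    model    : callable mapping features to a predicted direction (True = up)
    train_fn : callable mapping the collected history to a new model
    """
    history = []
    retrain_count = 0
    for features, actual_up in events:
        monitor.record(model(features), actual_up)
        history.append((features, actual_up))
        if monitor.should_retrain():
            model = train_fn(history)   # batch training on everything seen so far
            monitor.window.clear()      # start measuring the new model from scratch
            retrain_count += 1
    return model, retrain_count
```

In a real deployment the history would be bounded or persisted rather than kept in an in-memory list, and train_fn would wrap whichever batch learner is in use.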

Speakers

Tijl Carpels

Doctoral Researcher - Data Scientist, Ghent University
Tijl Carpels received his M.Sc. degree in Business Engineering (major: Finance) in 2015 after writing a dissertation in the field of fraud prediction. Afterwards he accepted a research position at Ghent University in order to pursue a PhD in Data Analytics at the Faculty of Economics...



Tuesday May 10, 2016 2:00pm - 2:50pm
Plaza C
 
Wednesday, May 11
 

11:50am

Introducing Datasets: Bringing Compile Time Type Checking and Functional Transformations to Spark DataFrames - Holden Karau, IBM
Apache Spark is a fast and general engine for distributed computing & big data processing with APIs in Scala, Java, Python, and R. DataFrames are a key part of the Spark SQL interface, allowing for relational-style transformations and additional optimizations over Spark's RDDs. Datasets bring much of the power of RDDs, along with compile-time type checking, to Spark SQL, allowing more developers to benefit from the Catalyst optimizer.

DataFrames allow developers in Apache Spark to access the power of the Catalyst optimizer while continuing to write Scala/Java/Python code. Datasets let developers easily write functional-style transformations while still taking advantage of the Catalyst optimizer, compact bit-level representation, and so on. Datasets are new in Spark 1.6 and the API will be changing in future versions. This talk will introduce and contrast the APIs.

Speakers

Holden Karau

Developer Advocate, Google
Holden Karau is a transgender Canadian open source developer advocate at Google focusing on Apache Spark, Beam, and related big data tools. Previously, she worked at IBM, Alpine, Databricks, Google (yes, this is her second time), Foursquare, and Amazon. Holden is the coauthor of Learning...


Wednesday May 11, 2016 11:50am - 12:40pm
Plaza C

4:10pm

Secure Spark Shuffle: A Fast and Convenient Approach Using Chimera - Cheng Xu, Intel
Shuffle is the key process in the Spark computing model, and it is very sensitive to performance. Because security-related crimes and incidents are frequent, data encryption is becoming more and more important for an enterprise-ready product. In this talk, we will discuss how we use Chimera to secure shuffle data. Chimera is a cryptography library optimized with AES-NI (Advanced Encryption Standard New Instructions). It provides a Java API at both the cipher level and the Java stream level. It originates from Intel Diceros and Hadoop encryption at rest. It limits the performance impact by using hardware acceleration and helps users avoid the native issues introduced by native code. In this presentation, we will also show performance results after enabling shuffle encryption in Spark.

Speakers

Cheng Xu

Senior Software Engineer, Intel
I am a software engineer at Intel. I am now working on the Apache Hive, Apache Parquet, and Apache Spark projects. I am a committer on the Apache Hive project. I am currently focused on Spark authorization, especially in the Spark SQL component, and on performance improvements in Apache Parquet...



Wednesday May 11, 2016 4:10pm - 5:00pm
Plaza C