Apache: Big Data 2016 has ended
Register Now or Visit the Website for more Information 
View analytic
Wednesday, May 11 • 11:50am - 12:40pm
Introducing Datasets: Bringing Compile Time Type Checking and Functional Transformations to Spark DataFrames - Holden Karau, IBM

Sign up or log in to save this to your schedule and see who's attending!

Apache Spark is a fast and general engine for distributed computing & big data processing with APIs in Scala, Java, Python, and R. DataFrames are a key part of the Spark SQL interface, allowing for relational style transformations and additional optimizations over Spark’s RDDs. Datasets bring much of the power, and compile time type checking, to Spark SQL allowing more developers to benefit from the Catalyst optimizer.

DataFrames allow developers in Apache Spark to access the power of the Catalyst optimizer while continuing to write Scala/Java/Python code. Datasets offer the ability for developers to easily write functional style transformations while still taking advantage of the Catalyst optimizer, compact bit level representation, and so on. Datasets are new in Spark 1.6 and the API will be changing in future versions. This talk will introduce and contrast the APIs.

avatar for Holden Karau

Holden Karau

Developer Advocate, Google
Holden is a transgender Canadian open source developer advocate @ Google with a focus on Apache Spark, Airflow, and related "big data" tools. She is the co-author of Learning Spark, High Performance Spark, and another Spark book that's a bit more out of date. She is a committer and... Read More →

Wednesday May 11, 2016 11:50am - 12:40pm
Plaza C