Apache: Big Data 2016 has ended
Wednesday, May 11 • 11:50am - 12:40pm
Introducing Datasets: Bringing Compile Time Type Checking and Functional Transformations to Spark DataFrames - Holden Karau, IBM

Apache Spark is a fast, general-purpose engine for distributed computing and big data processing, with APIs in Scala, Java, Python, and R. DataFrames are a key part of the Spark SQL interface, allowing relational-style transformations and additional optimizations beyond what Spark’s RDDs offer. Datasets bring much of that power, along with compile-time type checking, to Spark SQL, allowing more developers to benefit from the Catalyst optimizer.

DataFrames give Apache Spark developers access to the Catalyst optimizer while they continue to write Scala, Java, or Python code. Datasets let developers write functional-style transformations easily while still taking advantage of the Catalyst optimizer, a compact bit-level representation, and more. Datasets are new in Spark 1.6, and the API will change in future versions. This talk will introduce and contrast the two APIs.
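
To make the compile-time versus runtime distinction concrete, here is a minimal sketch in plain Java. It deliberately uses no Spark API: ordinary collections stand in for untyped rows (DataFrame-like) and typed objects (Dataset-like), and every name in it (`TypedVsUntypedSketch`, `Person`, `row`, the `name` and `age` columns) is a hypothetical chosen for illustration, not something taken from the talk.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustration only: plain Java collections stand in for Spark's untyped rows
// and typed objects. Nothing here is part of the Spark API.
final class TypedVsUntypedSketch {

    // Typed element, analogous to the object type a Dataset would carry.
    static final class Person {
        final String name;
        final int age;

        Person(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    // Helper that builds one untyped row keyed by column name.
    static Map<String, Object> row(String name, int age) {
        Map<String, Object> r = new HashMap<>();
        r.put("name", name);
        r.put("age", age);
        return r;
    }

    // Untyped style: the column is looked up by name at runtime, so a
    // misspelled column name compiles fine and only fails when executed.
    static List<String> adultNamesUntyped(List<Map<String, Object>> rows, String ageColumn) {
        return rows.stream()
                .filter(r -> (Integer) r.get(ageColumn) >= 18)
                .map(r -> (String) r.get("name"))
                .collect(Collectors.toList());
    }

    // Typed style: fields are checked by the compiler, so writing p.agee
    // instead of p.age would be rejected before the program ever runs.
    static List<String> adultNamesTyped(List<Person> people) {
        return people.stream()
                .filter(p -> p.age >= 18)
                .map(p -> p.name)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = Arrays.asList(row("Ada", 36), row("Tim", 12));
        List<Person> people = Arrays.asList(new Person("Ada", 36), new Person("Tim", 12));
        System.out.println(adultNamesUntyped(rows, "age"));
        System.out.println(adultNamesTyped(people));
    }
}
```

Both methods return the same result for valid input. The difference shows up on mistakes: passing a misspelled column name such as `"agee"` to the untyped version fails only at runtime, while the equivalent typo in the typed version does not compile.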

Holden Karau

Developer Advocate, Google
Holden Karau is a transgender Canadian open source developer advocate at Google focusing on Apache Spark, Beam, and related big data tools. Previously, she worked at IBM, Alpine, Databricks, Google (yes, this is her second time), Foursquare, and Amazon. Holden is the coauthor of Learning...

Wednesday May 11, 2016 11:50am - 12:40pm PDT
Plaza C