Wednesday, May 11 • 11:50am - 12:40pm
Introducing Datasets: Bringing Compile Time Type Checking and Functional Transformations to Spark DataFrames - Holden Karau, IBM

Apache Spark is a fast, general-purpose engine for distributed computing and big data processing, with APIs in Scala, Java, Python, and R. DataFrames are a key part of the Spark SQL interface, allowing relational-style transformations and additional optimizations beyond what Spark’s RDDs offer. Datasets bring much of that power, plus compile-time type checking, to Spark SQL, allowing more developers to benefit from the Catalyst optimizer.

DataFrames let Apache Spark developers access the power of the Catalyst optimizer while continuing to write Scala, Java, or Python code. Datasets let developers write functional-style transformations while still taking advantage of the Catalyst optimizer and a compact bit-level representation. Datasets are new in Spark 1.6, and the API will change in future versions. This talk will introduce and contrast the two APIs.
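
As a rough illustration of the contrast described above (not material from the talk itself), the following Scala sketch runs the same query in both styles. It assumes the later `SparkSession` entry point from Spark 2.x rather than the 1.6 `SQLContext`, since the abstract notes the API was expected to change. The `Person` case class, the sample rows, and the object name are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type, used only for illustration.
case class Person(name: String, age: Int)

object DatasetSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local Spark 2.x+ session; Spark 1.6 used SQLContext instead.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataset-sketch")
      .getOrCreate()
    import spark.implicits._ // encoders for case classes and the $"col" syntax

    val people: Dataset[Person] =
      Seq(Person("Ada", 36), Person("Linus", 17)).toDS()

    // DataFrame style: columns are referenced by name, so a misspelled
    // column name is only reported at runtime, when the query is analyzed.
    val df: DataFrame = people.toDF()
    val adultNamesDf: DataFrame = df.filter($"age" >= 18).select("name")

    // Dataset style: ordinary Scala lambdas over Person, so a misspelled
    // field (for example _.agee) is rejected by the compiler.
    val adultNamesDs: Dataset[String] =
      people.filter(_.age >= 18).map(_.name)

    adultNamesDf.show()
    adultNamesDs.show()

    spark.stop()
  }
}
```

Both queries are planned by Catalyst. The typed version trades some optimizer visibility, because lambdas are opaque to it, for compile-time checking of field names and types.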

Holden Karau

Principal Software Engineer, IBM
Holden Karau is a software development engineer and is active in open source. She is a co-author of Learning Spark and Fast Data Processing with Spark and has taught intro Spark workshops. Prior to IBM she worked on a variety of big data, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelor of Mathematics in Computer Science. Outside of computers she...

Wednesday May 11, 2016 11:50am - 12:40pm
Plaza C