Apache: Big Data 2016: Full Schedule

10:40am PDT

A Faster Way for Faster Workflows - Ken Krugler, Scale Unlimited

Cascading is a popular open source project that makes it easier to create workflows for processing big data. In the past these always ran on top of Hadoop, but now there’s a new option - run them using Flink, a fundamentally stream-oriented dataflow engine that takes full advantage of available RAM.

In this presentation, Ken Krugler will briefly describe Flink and then discuss a real-world example of converting a complex workflow (100+ jobs, NLP processing of text, SVM-based classification, etc.) from Hadoop to Flink.

Speakers

Ken Krugler

Scale Unlimited

Ken Krugler is a veteran entrepreneur, developer and instructor. He is the president of Scale Unlimited, a provider of consulting and training services for big data analytics, search, and machine learning using Hadoop, Cascading, Mahout, Cassandra and Solr. Ken is an Apache Tika committer...

A Faster Way for Faster Workflows pdf

Monday May 9, 2016 10:40am - 11:30am PDT
Plaza C

Faster-Better, Intermediate

11:40am PDT

Migrating Hundreds of Hadoop Pipelines into Docker Containers - Noa Resare, Spotify

Spotify maintains hundreds of big data pipelines built over a number of years, most of which run one or more transformations on our 1,800-node on-premises Hadoop cluster. Languages, frameworks and development strategies have evolved steadily over those years, and the result is a highly heterogeneous set of pipelines that place many specific demands on the execution environment. To ensure stability while encouraging innovation, we are now using Docker to contain some of the complexity and to provide a unified interface for the scheduling infrastructure. This talk is all about what we have learned in the process and how Spotify’s experience in running a large fleet of Docker containers for production services has helped shape our efforts.

Speakers

Noa Resare

Free Software Ombudsman, Spotify

Noa Resare is a senior engineer and the Spotify Free Software Ombudsman. Noa is an accomplished public speaker who has given talks at conferences such as CloudOpen, USENIX LISA and LinuxCon on a wide variety of technical subjects.

Monday May 9, 2016 11:40am - 12:30pm PDT
Plaza C

Faster-Better, Intermediate

2:00pm PDT

Accelerating Cloud with FPGAs - Eric Fukuda, University of Toronto

In our project, we are trying to make easy-to-use, scalable multi-FPGA fabrics available in data centers. Several recent projects try to employ FPGAs for accelerating data center applications. However, those projects focus on accelerating specific applications rather than making their platforms usable for general developers. To make FPGAs available to a wide range of application developers, we are trying to virtualize FPGAs in data centers. We use OpenStack to allocate the FPGAs placed in a data center, Apache ZooKeeper to distribute jobs across the FPGAs, and Apache Drill as a prospective application that uses the distributed FPGAs. As this is a work in progress, our first goal was to achieve functionality. We have observed good scalability and expect the performance to improve as we incorporate SQL acceleration techniques for FPGAs.

Speakers

Eric Fukuda

University of Toronto

Eric is a Postdoctoral Fellow at the Department of Electrical and Computer Engineering, University of Toronto. During his Ph.D. at Hokkaido University, he worked on a project to accelerate memcached with an FPGA. He is interested in accelerating large-scale databases with FPGAs.

Monday May 9, 2016 2:00pm - 2:50pm PDT
Plaza C

Faster-Better, Any

3:00pm PDT

Happier Developers and Happier Software Through Distributed Testing - Andrew Wang, Cloudera

A thorough unit test suite is a positive indicator of software quality. However, as the size of a test suite grows, its runtime can span hours or even days, to the point that it is unwieldy to run the full suite. Also, test runs at this scale are unlikely to succeed due to intermittent test failures. Together, these issues make the test suite less accessible to developers, which lowers developer productivity and decreases software quality.

Distributed testing offers a solution to these issues. Using our open-source distributed test infrastructure, we are able to speed up Apache Hadoop’s test suite by approximately 100x. This same framework is also in use by Apache HBase and Apache Kudu (incubating), with further projects planned.

In this talk, I will describe our distributed testing framework, how we use it at Cloudera, and how you could use this same framework for your project.

Speakers

Andrew Wang

Software Engineer, Cloudera

Andrew Wang is a software engineer at Cloudera on the HDFS team, where he has worked on projects including in-memory caching, transparent encryption, and erasure coding. Previously, he was a PhD student in the AMP Lab at UC Berkeley, where he worked on problems related to distributed...

distributed testing apache big data 2016 pptx

Monday May 9, 2016 3:00pm - 3:50pm PDT
Plaza C

Faster-Better, Intermediate

4:10pm PDT

Delivering Realtime and Agile Analytics Using Apache Kafka, Spark & Drill - Neeraja Rentachintala, MapR Technologies

Data is the biggest asset in modern organizations, enabling them to build value-added products and services and to optimize operations. Real-time analytic pipelines, with a messaging system such as Apache Kafka to capture the data followed by a general-purpose transformation layer such as Spark to process and analyze it, have become the prominent infrastructure for delivering relevant and timely information to a variety of users and applications. Given the extreme diversity of the data sources, an additional consideration for such pipelines is the agility to adapt to changes in the underlying structure of the data without incurring a lot of development cost or missing SLAs. In this session, Neeraja will cover how Apache Drill’s ability to query complex and dynamically evolving datasets can complement these solutions, and the new use cases enabled by using Drill, Spark and Kafka together.

Speakers

Neeraja Rentachintala

Director of Product Management, MapR Technologies

As Sr Director of Product Management, Neeraja is responsible for the product strategy, roadmap and requirements of MapR SQL initiatives. Prior to MapR, Neeraja held numerous product management and engineering roles at Informatica, Microsoft SQL Server, Oracle and Expedia.com, most...

Monday May 9, 2016 4:10pm - 5:00pm PDT
Plaza C

Faster-Better, Beginner

5:10pm PDT

Building While Flying: Lessons Learned from Operating and Developing a Graph Service with TinkerPop - Keith Lohnes & David Pitera, IBM

Apache TinkerPop is an open source graph computing framework that uses Gremlin, a domain-specific language for graph mutation and traversal. IBM Graph offers an Apache TinkerPop3 compatible API as a service. This service can be used for building recommendation engines, analyzing social networks, fraud detection and more. During this session, we will cover:

What a graph is and why to use one
Challenges faced and lessons learned while building and operating a service based on the TinkerPop3 stack

Speakers

Keith Lohnes

Software Engineer, IBM

Keith Lohnes graduated from Northeastern University with a degree in Computer Science and Music and has been working as a developer for 7 years. He started working with graph databases about 2 years ago and joined IBM to work on its IBM Graph offering.

David Pitera

Software Engineer, IBM

David works on JanusGraph on Compose, a managed graphdb cloud solution using Scylla as the primary data store...

Building While Flying Lessons Learned pdf

Monday May 9, 2016 5:10pm - 6:00pm PDT
Plaza C

Faster-Better, Any