Apache: Big Data 2016 has ended

Machine Learning
Wednesday, May 11
 

10:50am

SystemML - Declarative Machine Learning - Luciano Resende, IBM
Machine learning in the enterprise is an iterative process. Data scientists tweak or replace their learning algorithm on a small data sample until they find an approach that works for the business problem, and then apply the analytics to the full data set. Apache SystemML is a new system that accelerates this kind of exploratory algorithm development for large-scale machine learning problems. SystemML provides a high-level language to quickly implement and run machine learning algorithms on Spark. SystemML’s cost-based optimizer takes care of low-level decisions about how to use Spark’s parallelism, allowing users to focus on the algorithm and the real-world problem that the algorithm is trying to solve. This talk will introduce you to SystemML and get you started building declarative analytics with SystemML, using a simple Zeppelin notebook running in an Apache Spark environment.

Speakers

Luciano Resende

Architect, Spark Technology Center, IBM
Luciano Resende is an Architect in IBM Analytics. He has been contributing to open source at the ASF for over 10 years. He is a member of the ASF and currently contributes to various big data related Apache projects, including Spark, Zeppelin, and Bahir. Luciano is the project chair for...


Wednesday May 11, 2016 10:50am - 11:40am
Georgia A

11:50am

Boost Spark ML Performance with Project Mnemonic - Yanping Wang & Gang Wang, Intel Corp.
Project Mnemonic is an open-source, structured data in-place persistence library for Java-based applications and frameworks. It provides unified interfaces for data manipulation on heterogeneous block/byte-addressable devices, such as DRAM, SSD, NVMe, and Cloud/network devices.
In this presentation, we will first introduce Project Mnemonic and its non-volatile Java object model, which defines in-memory non-volatile objects that can be stored directly in persistent memory. We will discuss how it can be used to allocate and reclaim heterogeneous memory and storage resources directly on DRAM, NVMe, other persistent memories, and SSD. Then we will show how in-memory non-volatile RDDs can be implemented in Spark. Finally, we will show that a performance boost of more than 2X can be achieved on a Spark ML workload after removing SerDe from RDDs, caching hot data, and dramatically reducing GC pause time.

Speakers

Yanping Wang

Software Engineer, Intel Corp
As a Senior Software Performance Engineer at Intel, Yanping has been working on Java and Big Data applications performance for the past 15 years. Currently, she is focusing on improving Big Data applications performance by reducing garbage collection and serialization/de-serialization...



Wednesday May 11, 2016 11:50am - 12:40pm
Georgia A

2:00pm

Combining Machine Learning Frameworks with Apache Spark - Tim Hunter, Databricks, Inc.
Machine Learning (ML) workflows involve a sequence of processing and learning stages. Realistic workflows combine specialized libraries with more general data management workflows.

Apache Spark is well known as a powerful platform for the iterative computations that ML requires. This talk presents how to combine the strengths of Spark’s ML library (MLlib) with popular packages such as scikit-learn and TensorFlow. Scikit-learn is the de facto standard ML library for Python, and TensorFlow is a library for deep learning recently open-sourced by Google.

We also discuss the improvements to MLlib in Spark 2.0 and the future of MLlib’s APIs. The roadmap includes more algorithms and features for users, as well as more utilities and abstractions to aid developers.

Speakers

Tim Hunter

Databricks, Inc.
Tim Hunter is a software engineer at Databricks and contributes to the Spark MLlib project. He has been building distributed Machine Learning systems with Spark since version 0.5, before Spark was an Apache Software Foundation project.


Wednesday May 11, 2016 2:00pm - 2:50pm
Georgia A

3:00pm

Real-world Analytics with Solr Cloud and Spark - Johannes Weigend, QAware GmbH
Apache Solr is a distributed NoSQL database with impressive search capabilities. Apache Spark makes M/R faster and richer. This code-intensive session shows how to combine both to solve real-time search and processing problems. The demos feature a portable Solr Cloud / Spark cluster based on Intel NUC hardware.

Speakers

Johannes Weigend

CTO, QAware GmbH
Johannes has worked as a software architect with Java since 1999 and was honoured as a "Java Rockstar" at JavaOne 2015. He is a lecturer at the University of Applied Sciences in Rosenheim, Germany, and technical director at QAware, a decorated software engineering company located in Munich...



Wednesday May 11, 2016 3:00pm - 3:50pm
Georgia A

4:10pm

Distributed Machine Learning with Apache Mahout - Suneel Marthi, Red Hat
Data science tools like R and scikit-learn are popular because they offer a convenient and familiar syntax for analysis tasks. However, these systems are limited to operating serially on data sets that can fit on a single node. Mahout-Samsara is a linear algebra environment that offers both an easy-to-use Scala DSL and efficient distributed execution for linear algebra operations. In this talk, we will look at Mahout’s distributed linear algebra capabilities and build a simple ML algorithm using the Samsara DSL. We’ll demonstrate this using Apache Flink as the backend distributed engine. ML practitioners will come away from this talk with a better understanding of how Samsara’s linear algebra environment can simplify the development of highly scalable ML algorithms by focusing solely on the declarative specification of the algorithm, without worrying about the details of a scalable distributed implementation.

Speakers

Suneel Marthi

AWS
Suneel is a Member of the Apache Software Foundation and a Committer and PMC member on Apache Mahout, Apache OpenNLP, and Apache Streams. He has presented in the past at Flink Forward, Hadoop Summit, Berlin Buzzwords, Machine Learning Conference, Big Data Tech Warsaw, and Apache Big Data.



Wednesday May 11, 2016 4:10pm - 5:00pm
Georgia A