Apache: Big Data 2016 has ended

Operations-Use Cases
Monday, May 9

5:10pm PDT

Building a Durable Real-Time Data Pipeline: Apache BookKeeper at Twitter - Sijie Guo & Leigh Stewart, Twitter
The log has proven to be a very powerful data structure for addressing challenging distributed systems problems. DistributedLog is a replicated log service built on top of Apache BookKeeper that provides infinite, ordered, append-only streams for building robust real-time systems. It is the foundation of Twitter’s durable real-time data pipeline and is used widely elsewhere at Twitter, in applications including a transactional database system, a search ingestion pipeline, and a real-time streaming data-analytics platform. In this talk, Sijie Guo will discuss the challenges of building a durable real-time data pipeline, how the team at Twitter achieved it, and how they use it to support workloads with different characteristics, from a strongly consistent distributed database to a real-time data analytics pipeline.
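
To make the log abstraction concrete, here is a minimal in-memory sketch of an ordered, append-only stream in Python. It is not the DistributedLog or BookKeeper API: the class `AppendOnlyLog` and its methods `append` and `read_from` are hypothetical names chosen for illustration, and replication and durability are deliberately left out.

```python
from typing import Any, List, Tuple


class AppendOnlyLog:
    """Hypothetical in-memory log: records get increasing sequence numbers."""

    def __init__(self) -> None:
        self._records: List[Any] = []

    def append(self, record: Any) -> int:
        """Add a record at the tail and return its sequence number."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, seq: int) -> List[Tuple[int, Any]]:
        """Return (sequence number, record) pairs from seq onward, in order."""
        if seq < 0:
            raise ValueError("sequence numbers start at 0")
        return list(enumerate(self._records[seq:], start=seq))


# Writers only ever append; readers replay from any position they remember.
log = AppendOnlyLog()
first = log.append("tweet-created")
second = log.append("tweet-indexed")
replayed = log.read_from(first)
```

Because existing records are never modified, any reader that remembers a sequence number can resume from it and see the same records in the same order.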


Sijie Guo

Currently works at Twitter on DistributedLog/BookKeeper. Apache BookKeeper PMC Chair. Previously worked at Yahoo! on a push notification system.

Monday May 9, 2016 5:10pm - 6:00pm PDT
Georgia B
Tuesday, May 10

3:00pm PDT

Focused Crawling with Apache Nutch - Sujen Shah, NASA JPL
The vast nature of the Web has forced researchers to continually develop advanced data acquisition strategies that overcome a multitude of obstacles in order to acquire relevant topical content and fit it to their needs. Many groups have researched focused Web crawling techniques to better guide their data acquisition efforts; however, few approaches consider the scenario where one wishes to undertake DD on the open Web for which no prior semantic knowledge resources are available. Sujen and his team have investigated and developed a new application of the cosine similarity metric (CSM), which has been implemented as part of a novel strategy for domain-specific DD.

In this presentation, Sujen will review recent work in focused crawling and the ability to run similarity scoring within a production-ready, scalable Web crawler, Apache Nutch.
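
As a rough illustration of the cosine similarity metric mentioned in the abstract, the Python sketch below scores page text against a topic description using plain term-frequency vectors. It is not the Nutch scoring plugin: the function `cosine_similarity`, the whitespace tokenization, the topic string, and the page names are all assumptions made for this example.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty text is treated as unrelated to everything.
    return dot / (norm_a * norm_b)


# Hypothetical topic description and fetched page texts.
topic = "mars rover mission science"
pages = {
    "page-a": "mars rover landing mission update",
    "page-b": "celebrity cooking recipes",
}
scores = {name: cosine_similarity(topic, text) for name, text in pages.items()}
# A focused crawler would prioritise the highest-scoring pages.
ranked = sorted(scores, key=scores.get, reverse=True)
```

A focused crawler can use such a score to decide which pages and outlinks to follow first, which is the role the abstract assigns to similarity scoring.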

Sujen Shah

Research Intern, NASA Jet Propulsion Laboratory
Sujen is a Master's student in Computer Science at the University of Southern California, Los Angeles. As a committer and member of the Apache Nutch PMC, he works on augmenting the focused crawling capabilities of Nutch. These new scoring plugins are supporting the efforts...

Tuesday May 10, 2016 3:00pm - 3:50pm PDT
Georgia B
Wednesday, May 11

2:00pm PDT

ODPi and ASF Collaboration: Ask Us Anything! - John Mertic, ODPi & Jim Jagielski, Apache Software Foundation

The Apache Software Foundation (ASF) has long been the champion of open source projects that compose the larger Apache Hadoop ecosystem. ODPi is complementary to those efforts, solely focused on easing integration and standardization for downstream application vendors and end-users that build upon Apache Hadoop®. Since ODPi’s launch in 2015, there has been some confusion around how its work may overlap, or potentially compete, with that of the ASF.

Jim Jagielski, founding member and board director of the Apache Software Foundation, and John Mertic, Director of Program Management for ODPi, will clear up this confusion. During the discussion, attendees will learn how ASF and ODPi are collaborating to accelerate enterprise adoption of Apache Hadoop and big data technologies. There will also be an open Q&A, where attendees can ask about ASF and ODPi projects, their work together, where the big data ecosystem is heading, and anything else that comes to mind.

Jim Jagielski

Developer, Uber
Jim Jagielski is a well-known and acknowledged expert and visionary in open source, an accomplished coder, and a frequent, engaging presenter on all things open, web, blockchain, and cloud-related. As a developer, he’s made substantial code contributions to just about every core technology...

John Mertic

Director of Program Management, The Linux Foundation
John Mertic is the Director of Program Management for The Linux Foundation. He has helped ASWF, ODPi, Open Mainframe Project, and R Consortium accelerate open source innovation and transform industries. John has an open source career spanning two decades, both...

Wednesday May 11, 2016 2:00pm - 2:50pm PDT
Plaza B

3:00pm PDT

Scylla: A Revolutionary Design for NoSQL Performs at 1.8M TPS/node - Don Marti & Tzach Livyatan, ScyllaDB
Scylla is a new NoSQL database, compatible with Apache Cassandra, that is capable of a 10x improvement in throughput on the same hardware, with predictable low latency that dramatically improves the performance of analytics originally developed for Cassandra. The database is now in use in production and in pilot projects internationally.

Scylla applies kernel programming techniques to a horizontally scalable NoSQL design to achieve extreme performance improvements and the elimination of garbage collection pauses. The Scylla design is based on a modern shared-nothing approach. A new architecture for the NoSQL server is necessary because of new growth in, and limitations of, modern server hardware. As CPU core counts continue to grow, along with the raw speed of networking and storage devices available on a modern system, software design approaches that were valid and safe even a few years ago are no longer sustainable. Scylla runs multiple engines, one per core, each with its own memory, CPU and multi-queue NIC.

With extra performance to work with, NoSQL projects have more flexibility to focus on other concerns, such as functionality and time to market. Scylla enables faster cluster scaling, more headroom to handle complex queries, and the power to do complex analytics tasks at the same time as routine administration operations.
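
The shared-nothing, one-engine-per-core idea described above can be modelled in a few lines of Python. This is only a single-process sketch of the routing principle, not Scylla's implementation: the classes `Shard` and `ShardedStore`, the CRC32-based key routing, and the sample key are assumptions introduced for illustration.

```python
import os
import zlib
from typing import Dict, List, Optional


class Shard:
    """Owns a private slice of the data; nothing is shared between shards."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}


class ShardedStore:
    """Routes every key to exactly one shard, so shards never contend."""

    def __init__(self, num_shards: int) -> None:
        self.shards: List[Shard] = [Shard() for _ in range(num_shards)]

    def _owner(self, key: str) -> Shard:
        # A deterministic hash decides which shard owns the key.
        index = zlib.crc32(key.encode("utf-8")) % len(self.shards)
        return self.shards[index]

    def put(self, key: str, value: str) -> None:
        self._owner(key).data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._owner(key).data.get(key)


# One shard per available CPU core, mirroring "one engine per core".
store = ShardedStore(num_shards=os.cpu_count() or 1)
store.put("user:42", "alice")
value = store.get("user:42")
```

Because each key has exactly one owner, a real system built this way can run every shard on its own core without locks on the data path; the sketch shows only the ownership rule, not the threading.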

Tzach Livyatan

VP Product, Scylla
Tzach Livyatan has a B.A. and MSc in Computer Science (Technion, Summa Cum Laude), and has had a 15-year career in development, system engineering and product management. In the past he worked in the Telecom domain, focusing on carrier-grade systems, signalling, policy and charging...

Don Marti

Don Marti has written for Linux Weekly News, Linux Journal, and other publications. He co-founded the Linux consulting firm Electric Lichen. Don is a strategic advisor for Mozilla, and has previously served as president and vice president of the Silicon Valley Linux Users Group and...

Wednesday May 11, 2016 3:00pm - 3:50pm PDT
Plaza B