Apache: Big Data 2016 has ended

Monitoring-Benchmarking
Wednesday, May 11

11:50am PDT

Experiences Using Apache HTrace (Incubating) in Distributed Web Search - Lewis McGibbney, NASA JPL
Recent developments within the tracing community have brought projects like Apache HTrace (Incubating) into the Apache Incubator, opening up the possibility of using tracing to better understand distributed applications, systems and systems-of-systems. As many will know, tracing is a specialized use of logging to record information about a program’s execution. Although many tracing use cases involve distributed systems such as Hadoop and databases, few tracing experiments have been carried out in the field of large-scale, distributed Web search. This presentation will combine the comprehensive tracing mechanisms in Apache HTrace (Incubating) with the scalable, flexible crawling architecture provided by Apache Nutch. Key takeaways are development and implementation experience, tracing guidance for your web search stack, and future work in this area.

Lewis McGibbney

Enterprise Search Technologist III, Jet Propulsion Laboratory

Wednesday May 11, 2016 11:50am - 12:40pm PDT
Georgia B

5:10pm PDT

Less Is More: Doubling Storage Efficiency with HDFS Erasure Coding - Zhe Zhang, LinkedIn & Kai Zheng, Intel
Ever since its creation, HDFS has been relying on data replication to shield against most failure scenarios. However, with the explosive growth in data volume, replication is getting expensive: the default 3x replication scheme incurs a 200% storage overhead. Erasure coding (EC) uses far less storage space while still providing the same level of fault tolerance. Under typical configurations, EC reduces the storage cost by ~50% compared with 3x replication.
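
The storage figures above can be checked with a little arithmetic. The sketch below is illustrative only: it assumes a Reed-Solomon layout with 6 data units and 3 parity units, written RS(6,3), as the "typical configuration", which the abstract itself does not name. The function names are hypothetical and are not part of any HDFS API.

```python
def storage_overhead(data_units: int, redundancy_units: int) -> float:
    """Extra storage consumed, as a fraction of the original data size."""
    return redundancy_units / data_units


def total_cost(overhead: float) -> float:
    """Total bytes stored per byte of user data."""
    return 1.0 + overhead


# 3x replication: one original copy plus two redundant copies.
replication_overhead = storage_overhead(data_units=1, redundancy_units=2)

# Assumed EC layout RS(6,3): every 6 data units carry 3 parity units.
ec_overhead = storage_overhead(data_units=6, redundancy_units=3)

# Relative reduction in total storage when moving from replication to EC.
savings = 1.0 - total_cost(ec_overhead) / total_cost(replication_overhead)

print(f"3x replication overhead: {replication_overhead:.0%}")
print(f"RS(6,3) overhead:        {ec_overhead:.0%}")
print(f"Storage cost reduction:  {savings:.0%}")
```

Under this assumed layout, replication stores 3 bytes per byte of data and EC stores 1.5, which is where the roughly 50% reduction comes from. RS(6,3) can lose any 3 units of a stripe, while 3x replication can lose 2 of its 3 copies.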

In this talk we will introduce the design and implementation of HDFS-EC, and recommended use cases. We will also provide preliminary performance results. Equipped with the Intel ISA-L library, HDFS-EC has largely eliminated the computational overhead in codec calculation. Under sequential I/O workloads, it achieves twice the throughput compared with 3x replication, by performing striped I/O to multiple DataNodes in parallel.


Zhe Zhang

Zhe Zhang is a software engineer at LinkedIn working on Hadoop. He’s an Apache Hadoop Committer and author of HDFS Erasure Coding. Before joining LinkedIn in February 2016, Zhe was an engineer on the Cloudera HDFS team. Prior to that he worked at the IBM T. J. Watson Research Center...

Kai Zheng

Kai is a senior software engineer at Intel who has worked in the big data and security fields for quite a few years. He is a key Apache Kerby initiator, Directory PMC member and Apache Hadoop committer.

Wednesday May 11, 2016 5:10pm - 6:00pm PDT
Georgia B