
The upcoming delivery of Apache Hadoop 3 later this year will bring big changes to how customers store and process data on clusters. Here at the annual Apache Big Data show in Miami, Florida, a pair of Hadoop project committers from Cloudera shared details on how the changes will impact YARN and HDFS.

The biggest change coming to HDFS with Hadoop 3 is the addition of erasure coding, says Cloudera engineer Andrew Wang, the Hadoop 3 release manager for the Apache Hadoop project at the Apache Software Foundation. HDFS has historically replicated each piece of data three times to ensure reliability and durability. However, all those replicas come at a big cost to customers, Wang says.

“Many clusters are HDFS-capacity bound, which means that they’re always adding more nodes to clusters, not for CPU or more processing, but just to store more data,” he tells Datanami. “That means this 3x replication overhead is very substantial from a cost point-of-view.”

The Apache Hadoop community considered the problem and decided to pursue erasure coding, a data-striping method similar to RAID 5 or 6 that has historically been used in object storage systems. It’s a technology we first told you was coming to Hadoop 3 exactly one year ago, during last year’s Apache Big Data shindig.

“The benefit of using a scheme like erasure coding is you can gain much better storage efficiency,” Wang says. “So instead of paying a 3x cost, you’re paying a 1.5x cost. So you’re saving 50% compared to the 3x replication, when you look at purely disk expenditure. Many of our Hadoop customers are storage bound, so being able to save them half their money in hard disk cost is pretty huge.”
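The 1.5x and 50% figures Wang cites can be checked with a little arithmetic. The sketch below is illustrative only: the article does not name the coding scheme, so it assumes a Reed-Solomon layout with 6 data units and 3 parity units per stripe, and the helper name `storage_overhead` is made up for this example.

```python
from fractions import Fraction


def storage_overhead(data_units: int, parity_units: int) -> Fraction:
    """Raw units stored per unit of user data (hypothetical helper)."""
    if data_units <= 0 or parity_units < 0:
        raise ValueError("need at least one data unit and no negative parity")
    return Fraction(data_units + parity_units, data_units)


# 3x replication: one original block plus two extra full copies.
replication = storage_overhead(data_units=1, parity_units=2)

# Assumption: a Reed-Solomon (6 data, 3 parity) stripe; the article
# only states the resulting 1.5x cost, not the scheme behind it.
erasure_coded = storage_overhead(data_units=6, parity_units=3)

# Fraction of raw disk saved by moving from replication to erasure coding.
savings = 1 - erasure_coded / replication

print(f"replication:    {float(replication):.1f}x")
print(f"erasure coding: {float(erasure_coded):.1f}x")
print(f"disk saved:     {float(savings):.0%}")
```

Under that assumed layout, a stripe survives the loss of any three of its nine units, whereas 3x replication survives the loss of two of its three copies, so the saving in disk does not come from giving up fault tolerance.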
