Hadoop 3 confronts the realities of storage growth
When it came to storage, the promises of Hadoop sounded redolent of those associated with the early days of nuclear power: commodity storage would be too cheap to meter. The original design premise was that if you could make massively parallel computing linearly scalable and bring it close to the data, using commodity hardware would make storage costs an afterthought. That was the assumption behind Hadoop’s original approach to high availability: with storage so cheap, make data available (and local) with three full replicas of the data.

But of course the world is hardly running out of data; most predictions have it growing geometrically or exponentially. At some point, big data storage must obey the laws of gravity, if not physics; having too much of it will prove costly either in overhead or dollars, and more likely, both. The Hadoop 3.0 spec acknowledges that, at some point, too much cheap becomes expensive.

Hadoop 3.0 is borrowing some concepts from the world of enterprise storage. With erasure coding, it adopts RAID-style techniques for reducing data sprawl: compared to traditional 3x replication, erasure coding can cut the raw storage footprint by roughly 50%. The price of that, of course, is that while replicated data is immediately available, data managed through RAID approaches must be restored from parity when blocks are lost, meaning you won’t get failover access instantly. So, in practice, the new erasure coding support of Hadoop 3.0 will likely be utilized as a data tiering strategy, where colder (which is usually older) data is stored on cheaper, often slower media.

For Hadoop, these approaches are not new. Dell EMC has offered its own support of HDFS emulation for its Isilon scale-out file storage for several years. The difference with 3.0 is that the goal of this approach, reducing data sprawl, can now be realized through open source technology.
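To put rough numbers on the "about 50%" claim, here is a minimal back-of-the-envelope sketch. It assumes a Reed-Solomon layout with 6 data blocks and 3 parity blocks per stripe; that layout is our assumption for illustration, as are the function names, and other layouts will give different savings.

```python
from fractions import Fraction


def replication_footprint(logical_tb, replicas=3):
    """Raw storage consumed when every block is stored `replicas` times."""
    return logical_tb * replicas


def erasure_coding_footprint(logical_tb, data_units=6, parity_units=3):
    """Raw storage consumed when each stripe of `data_units` data blocks
    carries `parity_units` extra parity blocks (assumed layout: 6 + 3)."""
    return logical_tb * Fraction(data_units + parity_units, data_units)


def savings_vs_replication(logical_tb, replicas=3, data_units=6, parity_units=3):
    """Fraction of raw storage saved by erasure coding relative to replication."""
    replicated = replication_footprint(logical_tb, replicas)
    coded = erasure_coding_footprint(logical_tb, data_units, parity_units)
    return 1 - Fraction(coded) / replicated


if __name__ == "__main__":
    logical_tb = 100  # hypothetical amount of cold data, in terabytes
    print("3x replication (TB raw):", replication_footprint(logical_tb))
    print("6+3 erasure coding (TB raw):", float(erasure_coding_footprint(logical_tb)))
    print("savings:", f"{float(savings_vs_replication(logical_tb)):.0%}")
```

Under these assumptions, 100 TB of logical data occupies 300 TB with three replicas and 150 TB with the 6+3 layout, which is where the halving comes from. The sketch covers capacity only; it says nothing about the reconstruction cost that makes erasure coding a better fit for colder data.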
Containing sprawl this way means that Hadoop adopters must embrace information lifecycle management processes for deprecating colder data; we expect that providers of data archival and information lifecycle management tooling will fill the vacuum newly created by the open source world.

So now that we’ve come up with a way to contain storage sprawl, we’ll raise the urgency for it by scaling out YARN. Hadoop 3.0 adds support for federating multiple YARN clusters. Initial design assumptions for YARN pointed to an upper limit of roughly 10,000 nodes; while only Yahoo exceeds that count today, federating YARN allows for a future where more organizations are likely to breach that wall.

Driving all this, of course, are several factors. First, and most obviously, the generation of data is growing exponentially. Second, those organizations that got into Hadoop earlier are by now likely to have multiple clusters, either by design or inertia. And in many cases, that proliferation is abetted by the fact that deployments might be sprawled, not only on premises, but also in the cloud. YARN federation is not necessarily about governance per se, but it could open the door in the Hadoop 4.x generation to support for federation across Hadoop’s data governance projects, like Atlas, Ranger, or Sentry.

But wait, there’s more. With 3.0, Hadoop is laying the groundwork for managing more heterogeneous clusters. It’s not that you couldn’t mix and match different kinds of hardware before. Until now, the variants tended to revolve around adding new generations of nodes that might pack more compute or storage than what you already had in place; YARN supported that. But with the emergence of GPUs and other specialized hardware for new types of workloads (think deep learning), there was no way for YARN to explicitly manage those allocations. In the 3.0 version, the groundwork is being laid for managing new hardware types.
For GPUs and FPGAs, that support will come to fruition in the 3.1 and 3.2 dot releases, respectively.

Hadoop 3.0 also takes the next step in bolstering NameNode reliability. While Hadoop 2.0 supported deployment of two NameNodes (one active, one standby), you can now have as many standby NameNodes as you want for more robust high availability and failover. Or as one of my colleagues remarked, it’s about bloody time that the Apache community added this important feature.