How Yahoo’s Internal Hadoop Cluster Does Double-Duty on Deep Learning
Five years ago, many bleeding-edge IT shops had either implemented a Hadoop cluster for production use or at least had a cluster set aside to explore the mysteries of MapReduce and the HDFS storage system. While it is not clear all these years later how many ultra-scale production Hadoop deployments exist in earnest (something we are analyzing for a later in-depth piece), those same shops are likely at the forefront of trying to exploit the next big thing in the datacenter: machine learning, or for the more intrepid, deep learning.

For shops that got large-scale Hadoop clusters into production and now enjoy a level of productivity on those systems, integrating deep learning and machine learning presents a challenge, at least if that workload is not being moved to an entirely separate cluster. How can frameworks like Caffe and TensorFlow work with existing data in HDFS on the same cluster? It turns out to be easier than one might expect, even with the addition of beefy GPU-enabled nodes to handle part of the training stage of the deep learning workflow.

Work on this integration of deep learning and Hadoop comes from the least surprising of quarters: Yahoo, the home of Hadoop and MapReduce over a decade ago. Yahoo's main internal cluster, which supports research, user data, production workloads across its many brands and services (search, ad delivery, Flickr, email), and now deep learning, runs on a mature Hadoop-centered stack. Over time, teams at Yahoo have integrated the many cousins of Hadoop, including Spark and Tez, but they are now looking to capture trends in open source that drift away from the Apache base for large-scale analytics they have cultivated over the years.