Categories: Hadoop LinkedIn Pepperdata

Getting jobs to run on #Hadoop is one thing, but getting them to run well is something else entirely. With a nod to the pain that parallelism and big data diversity bring, #LinkedIn unveiled a new release of #DrElephant that aims to simplify the process of writing tight code for Hadoop. #Pepperdata also introduced new software that takes Dr. Elephant the next step into DevOps.

The Hadoop infrastructure at LinkedIn is as big and complex as you likely imagine it to be. Distributed clusters run backend metrics, power experiments, and drive production data products that are used by millions of people. Thousands of internal users interact with Hadoop via dozens of stacks each day, while hundreds of thousands of data flows move data to where it needs to be.

The social media company appreciates a tidy cluster, just like everybody else, but keeping things orderly on Hadoop was beginning to resemble an impossible task, according to Carl Steinbach, a senior staff software engineer at LinkedIn. “We found that sub-optimized jobs were wasting the time of our users, using our hardware in an inefficient manner, and making it difficult for us to scale the efforts of the core Hadoop team,” Steinbach wrote in a blog post yesterday.

While LinkedIn does have a team of technical Hadoop experts at the ready, it would be a waste of time to have them tune each user’s job manually. “At the same time, it would be equally inefficient to try and train the thousands of Hadoop users at the company on the intricacies of the tuning process,” Steinbach wrote.
