Hadoop and Spark: Synergy Is Possible
When people mention Hadoop and Spark together, they usually contrast these two popular big data frameworks. According to Ahrefs, 1,200 Google visitors search for “Spark vs. Hadoop” each month, while only 90 search for “Spark and Hadoop”. The frameworks seem to have gradually gained a reputation for being mutually exclusive. But this is not always the case: there are multiple ways for businesses to benefit from their synergy. Let’s take a closer look at Hadoop and Spark and discover scenarios where they can work together.

Apache Hadoop Defined

Apache Hadoop is an open-source framework for distributed data storage and parallel processing. First released in 2006, with version 1.0 following in 2011, Hadoop triggered the evolution of big data. Distributed data storage allowed companies to cope with big data volumes. They no longer needed to buy extremely expensive custom hardware. Instead, they could use multiple affordable computers to store data. This approach also gave the solution much-needed scalability: when the amount of data to be stored and processed grew, companies could handle it by adding extra computers. Because the data was spread across many machines, it also had to be processed in parallel. This became another distinctive feature of Hadoop.

Apache Spark Defined

Apache Spark is an open-source framework for parallel processing. Spark, whose version 1.0 was released in 2014, was designed to address the shortcomings of Hadoop MapReduce, chiefly its processing speed. Unlike Hadoop MapReduce, which has to write interim results back to disk and then read the data again at every step, Spark processes data in memory. As a result, it can be up to 100 times faster than Hadoop MapReduce. Note that the real alternatives are Apache Spark and Hadoop MapReduce, not Spark and Hadoop as a whole. This is already evident from Spark’s definition: it is a processing engine, not a storage system.
Using Hadoop and Spark Together

One important remark: the Hadoop ecosystem consists of several components, among them the Hadoop Distributed File System (HDFS for short), Apache Hive (a query engine), Hadoop MapReduce (a framework for the parallel processing of large distributed datasets), and more. With this in mind, let’s take a look at possible synergy scenarios.

HDFS + Apache Spark

We have already clarified that Apache Spark’s intended purpose is data processing. But to process data, the engine first needs to take it from some storage. HDFS is not the only option available, but it is a frequent one. The reason is simple: as fellow Apache Software Foundation projects, HDFS and Spark are highly compatible. An illustrative example of such a synergy is a word count. The sequence of operations is as follows: Apache Spark takes a text file from HDFS, divides each line into separate words, sets the value 1 for each word, calculates the sum of values for each word, and writes the result back to HDFS.

Apache Hive + Apache Spark

The combination of Apache Spark and Apache Hive (which is built on top of HDFS) helps solve many business tasks, for example, customer behavior analytics. Imagine a company that accumulates data from multiple sources: clickstream data, comments and posts on social media, data from customer mobile apps, etc. Let’s say that the company has chosen HDFS to store its data and Apache Hive to act as an intermediary between HDFS and Spark. Apache Hive makes it possible to query the data using a SQL-like language. As a result, Spark, which has built-in support for Hive, can easily access the data and process it. In the end, the company can understand the preferences and behavior patterns of each customer.

Real-Life Examples of Spark and Hadoop Duets

Real-life examples of using Hadoop and Spark together are not rare in big data consulting practices.
The list of companies that adopt such an approach includes many well-known names. Their solutions naturally differ in complexity, since these companies are solving different business tasks. Still, one thing unites them: their big data technology stack includes both Hadoop and Spark. Let’s look at the following two examples.

TripAdvisor uses Hadoop and Spark together to deliver a seamless customer experience. They introduced auto-tagging, which is based on the analysis of visitors’ reviews and tags. This feature allows TripAdvisor to predict whether a visitor’s impression of a particular location will be the same as that of other visitors. Another interesting feature is improved photo selection. A website visitor can now get a more precise picture of any location thanks to a better choice of visuals. For instance, if a hotel has a pool, machine learning algorithms will pick the photo of the pool and show it to the visitor.

Uber is doing a great job of managing its big data to improve its service. The company knows the typical behavior of each customer (starting and destination points, usual day and time of their journeys, etc.). It also uses real-time traffic data to adjust the number of drivers needed at a particular time and in a particular location. To make this possible, Uber loads raw data into HDFS, exposes it through Hive for SQL-powered analysis, and uses Spark to process millions of events.

Conclusion

Now you can see that Hadoop and Spark can work together smoothly. The real-life examples above show that the synergy of these big data frameworks is possible not only in theory but also in practice. When making a choice, just remember one of the maxims of big data: your big data technology stack should suit your business goals.
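
For readers who want to see the word count from the HDFS + Apache Spark scenario spelled out, here is a minimal sketch in plain Python. It is not Spark code: it only mimics the three processing steps on an in-memory list of lines, and the function name word_count and the sample input are assumptions made for illustration. The comments name the Spark RDD operations (flatMap, map, reduceByKey) that would typically play the same role in a real job.

```python
from collections import defaultdict


def word_count(lines):
    """Mimic the word-count steps on an in-memory list of text lines."""
    # Step 1: divide each line into separate words (Spark: flatMap).
    words = [word for line in lines for word in line.split()]
    # Step 2: set the value 1 for each word (Spark: map).
    pairs = [(word, 1) for word in words]
    # Step 3: sum the values for each word (Spark: reduceByKey).
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical sample input standing in for a text file read from HDFS.
    sample = ["spark reads from hdfs", "spark writes to hdfs"]
    for word, count in sorted(word_count(sample).items()):
        print(word, count)
```

In an actual Spark job, the input would usually be read with SparkContext.textFile and the result written with RDD.saveAsTextFile, both pointing at HDFS paths, so the data would never have to fit on a single machine.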