Categories: Hadoop

Although Hadoop and Big Data (whatever that is) are the new kids on the block, don’t be too quick to write off relational database technology. In this article, I’ll explain the differences between the two solutions and the benefits of each.

Hadoop Is NOT a Database!
Despite what the marketing hype would have us believe, Hadoop is NOT a database. It is a collection of open-source software built around a distributed storage framework (HDFS) for managing very large data sets. Its primary purpose is the storage, management, and delivery of data for analytical purposes. It’s hard to talk about Hadoop without getting into keywords and jargon (for example, Impala, YARN, Parquet, and Spark), so I’ll start by explaining the basics.

At the very core of Hadoop is HDFS (Hadoop Distributed File System). So, it’s not a database after all — at its core, it’s a file system, but a very powerful one.

Hadoop Is a Different Kind of Animal
It’s impossible to really understand Hadoop without understanding its underlying hardware architecture, which gives it two of its biggest strengths: scalability and massively parallel processing (MPP) capability.

The diagram below illustrates a typical database architecture, in which a user executes SQL queries against a single large database server. Despite sophisticated caching techniques, the biggest bottleneck for most Business Intelligence applications is still the ability to fetch data from disk into memory for processing. This limits both the system’s processing throughput and its ability to scale, that is, to grow quickly to deal with increasing data volumes.

As there’s a single server, it also needs redundant hardware to guarantee availability. This includes dual redundant power supplies, network connections, and disk mirroring, which on very large platforms can make the system expensive to build and maintain.

Compare this with the Hadoop Distributed Architecture below. In this solution, the user executes SQL queries against a cluster of commodity servers, and the entire process is run in parallel. As effort is distributed across several machines, the disk bottleneck is less of an issue, and as data volumes grow, the solution can be extended with additional servers to hundreds or even thousands of nodes.
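To make the idea concrete, here is a toy sketch of the partition-and-combine pattern described above. It is not Hadoop code: the worker threads merely stand in for cluster nodes, and the function names and sample data are hypothetical, chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, num_nodes):
    """Split rows into num_nodes roughly equal partitions (round-robin)."""
    return [rows[i::num_nodes] for i in range(num_nodes)]


def local_sum(rows):
    """Work done independently on one 'node': aggregate its own partition."""
    return sum(amount for _, amount in rows)


def parallel_total(rows, num_nodes=4):
    """Scatter partitions to workers, then combine the partial results."""
    parts = partition(rows, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(local_sum, parts))
    return sum(partials)


# Hypothetical sample data: 100 orders with amounts 1..100.
sales = [("order-%d" % i, i) for i in range(1, 101)]
total = parallel_total(sales, num_nodes=4)
```

Adding capacity in this sketch is just a matter of raising num_nodes: each worker only ever touches its own partition, which is the property that lets a real cluster grow by adding servers.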

Hadoop has automatic recovery built in such that if one server becomes unavailable, the work is automatically redistributed among the surviving nodes, which avoids the huge cost overhead of an expensive standby system. This can lead to a huge advantage in availability, as a single machine can be taken down for service, maintenance or an operating system upgrade with zero overall system downtime.
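The redistribution idea can be pictured with a small sketch. This is only an illustration of the principle, under the assumption that work is tracked as a simple list of tasks per node; real Hadoop scheduling is far more involved, and the function and node names here are hypothetical.

```python
def assign(tasks, nodes):
    """Spread tasks across nodes round-robin; returns {node: [tasks]}."""
    plan = {node: [] for node in nodes}
    for i, task in enumerate(tasks):
        plan[nodes[i % len(nodes)]].append(task)
    return plan


def redistribute(plan, failed_node):
    """Move the failed node's tasks onto the surviving nodes."""
    orphaned = plan[failed_node]
    survivors = {n: list(t) for n, t in plan.items() if n != failed_node}
    if not survivors:
        raise RuntimeError("no surviving nodes to take over the work")
    for task in orphaned:
        # Hand each orphaned task to whichever survivor has the least work.
        target = min(survivors, key=lambda n: len(survivors[n]))
        survivors[target].append(task)
    return survivors


plan = assign(list(range(12)), ["node-a", "node-b", "node-c"])
new_plan = redistribute(plan, "node-b")
```

No task is lost when node-b disappears; the two remaining nodes simply carry more work each until it returns.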

The 3 Vs and the Cloud
Hadoop has several other potential advantages over a traditional RDBMS, most often explained by the three (and counting) Vs.

Volume: Its distributed MPP architecture makes it ideal for dealing with large data volumes. Multi-terabyte data sets can be automatically partitioned (spread) across several servers and processed in parallel.
Variety: Unlike an RDBMS, where you need to define the structure of your data before loading it, in HDFS loading data can be as simple as copying a file, which can be in any format. This means Hadoop can just as easily manage, store, and integrate data from a database extract, a free-text document, JSON or XML documents, digital photos, or emails.
Velocity: Again, the MPP architecture and the powerful streaming and in-memory tools from the wider Hadoop ecosystem (including Spark, Storm, and Kafka) make it well suited to real-time or near-real-time streaming feeds that arrive at velocity. This means you can use it to deliver analytics-based solutions in real time, for example, using predictive analytics to recommend options to a customer.
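The Variety point, storing files as they are and imposing structure only when reading, can be sketched as follows. This is a minimal illustration rather than anything Hadoop-specific: the in-memory dictionary stands in for a file system, and the file names, fields, and default value are assumptions made up for the example.

```python
import csv
import io
import json

# Raw files land in storage untouched; nothing validates them on the way in.
raw_store = {
    "orders.csv": "id,amount\n1,9.50\n2,20.00\n",
    "clicks.json": '{"id": 3, "amount": "4.25"}\n{"id": 4}\n',
}


def read_amounts(name):
    """Apply a schema only at read time: yield (id, amount) pairs."""
    text = raw_store[name]
    if name.endswith(".csv"):
        records = csv.DictReader(io.StringIO(text))
    else:
        # One JSON document per line.
        records = (json.loads(line) for line in text.splitlines() if line)
    for rec in records:
        # A missing field is tolerated when reading rather than rejected on load.
        yield int(rec["id"]), float(rec.get("amount", 0.0))


all_amounts = [pair for name in sorted(raw_store) for pair in read_amounts(name)]
```

Contrast this with an RDBMS, where the second JSON record would have to satisfy the table definition before it could be loaded at all.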

The advent of the cloud brings an even greater advantage (although not another “V” in this case): elasticity.

That’s the ability to provide on-demand scalability using cloud-based servers to deal with unexpected or unpredictable workloads. This means entire networks of machines can be spun up as needed to deal with massive data processing challenges, while hardware costs are contained by a pay-as-you-go model. Of course, in a highly regulated industry (e.g., financial services) with highly sensitive data, the cloud may well be treated with suspicion, in which case you may want to consider an “On-Premises Cloud”-based solution to secure your data.
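The arithmetic behind elasticity and pay-as-you-go pricing can be sketched in a few lines. Every figure and name below is a made-up assumption for illustration (throughput per node, deadline, hourly rate, cluster limits); no real provider's pricing or API is implied.

```python
import math


def nodes_needed(pending_gb, gb_per_node_hour, deadline_hours,
                 min_nodes=1, max_nodes=100):
    """Smallest cluster size that clears the backlog before the deadline."""
    required = math.ceil(pending_gb / (gb_per_node_hour * deadline_hours))
    # Clamp to the cluster's allowed size range.
    return max(min_nodes, min(max_nodes, required))


def pay_as_you_go_cost(nodes, hours, rate_per_node_hour):
    """Cost is proportional to the node-hours actually used."""
    return nodes * hours * rate_per_node_hour


# Hypothetical spike: 1,000 GB to process within 2 hours at 50 GB per node-hour.
spike_nodes = nodes_needed(pending_gb=1000, gb_per_node_hour=50, deadline_hours=2)
spike_cost = pay_as_you_go_cost(spike_nodes, hours=2, rate_per_node_hour=0.5)
```

In this sketch the cluster grows to ten nodes for the two-hour spike and can shrink back afterwards, so the bill covers twenty node-hours rather than a permanently oversized server.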