How to make sure your Hadoop data lake doesn’t become a swamp
The term “data lake” has been popular for a few years now, particularly in the context of Hadoop-based systems for large-scale data processing. But as Constellation Research VP and principal analyst Doug Henschen notes in an in-depth new report, it’s no simple task to create a data lake that lives up to the concept’s potential:

“The rough idea of the data lake is to serve as the first destination for data in all its forms, including structured transactional records and unstructured and semi-structured data types such as log files, clickstreams, email, images, social streams and text documents. Some label unstructured and semi-structured as ‘new’ data types, but most have been around a long time. We just couldn’t afford to retain or analyze this information, until now.

“Data lakes can handle all forms of data, including structured data, but they are not a replacement for an enterprise data warehouse that supports predictable production queries and reports against well-structured data. The value in the data lake is in exploring and blending data and using the power of data at scale to find correlations, model behaviors, predict outcomes, make recommendations, and trigger smarter decisions and actions.

“The key challenge is that a Hadoop deployment does not magically turn into a data lake. As the number of use cases and data diversity increase over time, a data lake can turn into a swamp if you fail to plan and implement a well-ordered data architecture.”
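One common way to put Henschen’s “well-ordered data architecture” into practice is a zoned directory layout that separates immutable raw data, curated query-ready data, and exploratory scratch space. The sketch below is a hypothetical illustration, not something prescribed in the report: the zone names, dataset names, and date partitions are assumptions, and a real deployment would create these paths in HDFS rather than on a local filesystem.

```python
# A minimal sketch of a zoned data lake layout. All names here are
# illustrative assumptions; in production these directories would live
# in HDFS (e.g. created with `hdfs dfs -mkdir -p ...`).
from pathlib import Path

LAKE_ROOT = Path("datalake")

ZONES = [
    # Raw zone: data exactly as ingested, partitioned by source and date,
    # never modified in place.
    "raw/clickstream/dt=2015-06-01",
    "raw/crm_exports/dt=2015-06-01",
    # Curated zone: cleaned, deduplicated data in query-friendly formats.
    "curated/sessions",
    # Sandbox zone: per-user scratch space, kept apart from shared data.
    "sandbox/analyst_jane",
]

for zone in ZONES:
    (LAKE_ROOT / zone).mkdir(parents=True, exist_ok=True)

# List the resulting directory tree.
for p in sorted(LAKE_ROOT.rglob("*")):
    print(p)
```

Keeping raw, curated, and sandbox data in clearly separated zones with consistent naming is one way to preserve the lake’s exploratory value without letting ad hoc datasets pile up unmanaged, which is exactly how a lake drifts toward a swamp.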