Building the Enterprise-Ready Data Lake: What It Takes To Do It Right
The last year has seen significant growth in the number of companies launching data lake initiatives as their first mainstream production Hadoop-based project. This isn't surprising given the compelling technical and economic arguments in favor of Hadoop as a data management platform, and the continued maturation of Hadoop and its associated ecosystem of open source projects. The value is undeniable: providing a true "Data as a Service" solution within the enterprise keeps business users engaged, productive, and driving immediate value.

Cloudera's and Hortonworks' (NASDAQ: HDP) continued work on the Atlas, Sentry, Ranger, RecordService, Knox and Navigator projects signals ongoing efforts to improve data security, metadata management, and data governance for data in Hadoop. The problem is that despite these incremental improvements, Hadoop alone still lacks many of the essential capabilities required to securely manage and deliver data to business users through a data lake at enterprise scale. For example, the significant challenge and critical task of automatically and accurately ingesting data from a diverse set of traditional, legacy, and big data sources into the data lake (on HDFS) can only be addressed by custom coding or tooling. Even with tooling, challenges such as data validation, character set conversions, and history management, to name a few, are often not fully understood or, worse, neglected altogether.

The open source projects also don't give business users an easy way to collaborate on creating and sharing insights about data in the lake through crowd-sourced business metadata, analytics, and data views.
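To make the ingestion challenges above concrete, here is a minimal, hypothetical sketch of the kinds of checks custom ingestion code must perform before data lands in the lake: character set conversion from a legacy encoding, basic row validation, and a load timestamp for history management. The function name, encodings, and CSV format are illustrative assumptions, not part of any of the projects mentioned above.

```python
import csv
import io
from datetime import datetime, timezone

def ingest_rows(raw_bytes, source_encoding="latin-1", expected_columns=3):
    """Illustrative pre-landing checks for data lake ingestion.

    Decodes legacy-encoded CSV bytes to text (character set conversion),
    rejects rows with the wrong column count (basic validation), and
    stamps each accepted row with a UTC load timestamp (history management).
    Returns (accepted_rows, rejected_rows).
    """
    text = raw_bytes.decode(source_encoding)          # charset conversion
    load_ts = datetime.now(timezone.utc).isoformat()  # history stamp
    accepted, rejected = [], []
    for row in csv.reader(io.StringIO(text)):
        if len(row) == expected_columns:              # structural validation
            accepted.append(row + [load_ts])
        else:
            rejected.append(row)
    return accepted, rejected
```

Even this toy version must make policy decisions real tooling often glosses over: which encoding each source uses, what to do with rejected rows, and how load history is recorded for later auditing.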