Data Lakes vs. Data Warehousing – How Each Works in the Digital Technology Boom
Let’s first define the data lake. James Dixon, the founder and CTO of Pentaho, has been credited with coming up with the term. This is how he describes a data lake: “If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in or take samples.”

The problem data lakes are meant to address is often referred to as information siloing. PricewaterhouseCoopers wrote that data lakes could “put an end to data silos.” In its study on data lakes, the firm noted that enterprises were “starting to extract and place data for analytics into a single, Hadoop-based repository.”

The term “data warehouse” was coined by William H. Inmon in the 1970s. Inmon, known as the Father of Data Warehousing, described a data warehouse as being “a subject-oriented, integrated, time-variant and non-volatile collection of data that supports management’s decision-making process.”

Emmett Torney of DATUM said, “Smart devices, hyper connectivity, supercomputing and cloud are quickly changing the world we live in and the way companies conduct business. All of these technological drivers are being fuelled by one important asset: Data.”

In her article “Data Lake vs Data Warehouse: Key Differences”, Tamara Dull, Director of Emerging Technologies at SAS Institute, shares key differences between the data warehouse and the data lake.