Data Warehouse and Data Lake Analytics Collaboration
This blog was written with the thoughtful assistance of @DavidLeibowitz, @DellEMC Director of #BusinessIntelligence, #Analytics & @BigData So data warehousing may not be cool anymore, you say? It’s yesterday’s technology (or 1990’s technology if you’re as old as me) that served yesterday’s business needs. And while it’s true that recent big data and #datascience technologies, architectures and methodologies seems to have rendered #datawarehousing to the back burner, it is entirely false that there is not a critical role for the data warehouse and Business Intelligence in digitally transformed organizations. Maybe the best way to understand today’s role of the data warehouse is with a bit of history. And please excuse us if we take a bit of liberty with history (since we were there for most of this!). Phase 1: The Data Warehouse Era Phase 1: In the beginning, Gods (Ralph Kimble and Bill Inmon, depending upon your data warehouse religious beliefs) created the data warehouse. And it was good. The data warehouse, coupled with Business Intelligence (BI) tools, served the management and operational reporting needs of the organization so that executives and line-of-business managers could quickly and easily understand the status of the business, identify opportunities, and highlight potential areas of under-performance (see Figure 1). Figure 1: The Data Warehouse Era The data warehouse served as a central integration point; collecting, cleansing and aggregating a variety of data sources from AS/400, relational and file based (such as EDI). For the first time, data from supply chain, warehouse management, AP/AR, HR, point of sale was available in a “single version of the truth.” Using extraction-transform-load (ETL) processing wasn’t always quick, and could require a degree of technical gymnastics to bring together all of these disparate data sources. At one point, the “enterprise service bus” entered the playing field to lighten the load on ETL maintenance, but routines quickly went from proprietary data sources, to proprietary (and sometimes arcane) middleware business logic code (anyone remember Monk?). The data warehouse supported reports and interactive dashboards that enabled business management to have a full grasp on the state of the business. That said, report authoring was static and not really enabled for democratizing data. Typically, the nascent concept of self-service BI was limited to cloning a subset of the data warehouse to smaller data marts, and extracts to Excel for business analysis purposes. This proliferation of additional data silos created reporting environments that were out of sync (remember the heated sales meetings where teams couldn’t agree as to which report figures were correct?) and the analysis paralysis caused by spreadmarts meant that more time was spent working the data rather than driving insight. But we all dealt with it, as it was agreed that some information (no matter the effort it took to acquire) was more important that no data. Phase 2: Optimize the Data Warehouse But IT man grew unhappy with being held captive by proprietary data warehouse vendors. The costs of proprietary software and expensive hardware (and let’s not even get started on user-defined functions in PL/SQL and proprietary SQL extensions that created architectural lock-in) forced organizations to limit the amount and granularity of data in the data warehouse. IT Man grew restless and looked for ways to reduce the costs associated with operating these proprietary data warehouses while delivering more value to Business Man. Then Hadoop was born out of the ultra-cool and hip labs of Yahoo. Hadoop provided a low-cost data management platform that leveraged commodity hardware and open sources software that was an estimated to be 20x to 100x cheaper than proprietary data warehouses. Man soon realized the financial and operational benefits afforded by a commodity-based, natively parallel, open source Hadoop platform to provide an Operational Data Store (now that’s really going old school!) to off-load those nasty Extract Load and Transform (ETL) processes off the expensive data warehouse (see Figure 2). Figure 2: Optimize the Data Warehouse The Hadoop-based Operational Data Store was deemed very good as it helped IT Man to decrease spending on the data warehouse (guess not so good if you were a vendor of those proprietary data warehouse solutions…and you know who you are T-man!). Since it’s estimated that ETL consumes 60% to 90% of the data warehouse processing cycles, and since some vendors licensed their products based upon those cycles – this concept of “ETL Offload” could provide substantial cost reductions. So in an environment limited by Service Level Agreements (because outside of Doc Brown’s DeLorean equipped with a flux capacitor, there’s still only 24 hours in a day in which to do all the ETL work), Hadoop provided a low-cost, high-performance environment for dramatically slowing the investment in proprietary data warehouse platforms. Things were getting better, but still weren’t perfect. While IT Man could shave costs, he couldn’t make the tools easy to use by simple data consumers (like Executive Man). And while Hadoop was great for storing unstructured and semi-structured data, it couldn’t always keep up to the speed relied upon for relational or cube based reporting from traditional transactional systems. See blog “The Data Warehouse Modernization Act” for more details on the role of the Hadoop-based Operational Data Store and how it has helped to “modernize” today’s existing data warehouse environment. Phase 3: Introducing Data Science Then God created the Data Scientists, or maybe it was the Devil based upon one’s perspective. The data scientists needed an environment where they could rapidly ingest high volumes of granular structured (tables), semi-structured (log files) and unstructured data (text, video, images). They realized that data beyond the firewall was needed in order to drive intelligent insight. Data such as weather, social, sensor and third party could be mashed up with the traditional data stores in the EDW and Hadoop to determine customer insight, customer behavior and product effectiveness. This made Marketing Man happy. The scientists needed an environment where they could quickly test new data sources, new data transformations and enrichments, and new analytic techniques in search of those variables and metrics that might be better predictors of business and operational performance. Thusly, the analytic sandbox, which also runs on Hadoop, was born (see Figure 3). Figure 3: Introducing Data Science The characteristics of a data science “sandbox” couldn’t be more different than the characteristics of a data warehouse: Finance Man tried desperately to combine these two environments but the audiences, responsibilities and business outcomes were just too varying to create an cost-effectively business reporting and predictive analytics in single bubble. Ultimately, the analytic sandbox became one of the drivers for the creation of the data lake that could support both the data science and data warehousing (Operational Data Store) needs. Data access was getting better for the data scientists but we again were moving towards proprietary process and a technical skill reserved for the elite. Still, things were good as IT Man, Finance Man and Marketing Man could work through the data scientists to drive innovation. But they soon wanted more. See the following blogs for more details on the complementary nature of the data warehouse and the data lake: “Dynamic Duo of Analytic Power” “Why Do I Need A Data Lake?” Phase 4: Creating Actionable Dashboards But Executive Man was still unsatisfied. The Data Scientists were developing wonderful predictions about what was likely to happen and prescriptions about what to do, but the promise of self-service BI was missing. Instead of the old days, and having to run to IT Man for reports, now he was requesting them of the Data Scientist. The reports and dashboards created to support executive and front-line management in Stage 1 were the natural channel for rendering the predictive and prescriptive insights, effectively closing the loop between the data warehouse and the data lake. With data visualization tools like Tableau and Power BI, IT Man could finally deliver on the promise of self-service BI by providing interactive descriptive and predictive dashboards that even Executive Man could operate (see Figure 4). Figure 4: Closing the Analytics Loop See the blog “Creating Actionable Dashboards” for more details on how to convert existing reports and dashboards into actionable reports and dashboards! And Man was happy (until the advent of Terminator robots began making decisions for us).