Categories: google Hadoop MapReduce Yahoo

#Hadoop was born in the cloud, as a #bigdata project designed to take advantage of #Yahoo’s infrastructure and based on a paper from #Google about the #mapreduce algorithm. It was never originally designed for on-prem enterprise deployment. I recently wrote that some of the issues with Hadoop adoption are cultural, but management overhead is another problem: setting up clusters, capacity planning, maintaining and updating versions, and keeping clusters running, the basic care and feeding that any distributed system requires.
Another management issue is the sheer pace of innovation in open source big data tooling. Hadoop is great for counting and sorting in batch mode, and the Hadoop Distributed File System (HDFS) is a powerful data reservoir, but then what? The query infrastructure is still immature, and targeted at highly skilled and expensive data scientists with programming language skills rather than common or garden SQL tooling. As data workloads increasingly became stream-based, Spark took off. Then we needed a message bus, and Kafka emerged as the platform of choice. But how is all of this stuff supposed to fit together?
Enterprises, in general, don’t want to be systems integrators (except, of course, the ones that do) and prefer to outsource the packaging of technology to third parties. Cloudera, Hortonworks and MapR were founded as Hadoop distribution providers, and have responded to the rate-of-change issue by broadening their story and embracing Spark and Kafka, positioning themselves as broad next-generation data platforms. But many early deployments were on prem, which meant management overheads remained, especially in a world where new versions of every piece of the stack are emerging at a furious pace. Updating on-prem software sucks. Even in the cloud, which promises elastic scalability, capacity planning is an issue: what happens when your Hadoop cluster outgrows the sizing you have set up on AWS?
Hadoop and its associated tooling carry a fairly significant management overhead. While the distribution players can mitigate these issues to some extent, the alternative is managed services from AWS, Azure and Google Cloud Platform (GCP). If lighthouse customers are anything to go by, Google may have found a sweet spot in picking up customers that are fed up with running Hadoop, and are looking for an integrated set of offerings that doesn’t carry a management overhead. Google has its ducks in a row from a packaging perspective.
One of the first major GCP wins for its Big Data services was Spotify, which said loud and clear that it was willing to trade openness for convenience and extra capability. HSBC spent tens of millions of pounds standing up its own Hadoop infrastructure for anti-money laundering (AML) but was disappointed with the results. It has now migrated to GCP for AML, and is beginning to migrate other workloads there, such as finance liquidity reporting. The strategy is Cloud First, and GCP is really well positioned there.
Ocado is an online grocery delivery platform, which also offers third-party digital and fulfilment services. It’s a classic platform play, and it has gone all in on Google Cloud for data. Interestingly, it would be even more aggressive about adopting Google infrastructure were it not for corporate restrictions on adopting beta versions.
