Categories: Apache Big Data Hadoop

Big data is opening a lot of doors for companies, providing real insights for those that want to understand their customers better and identify new products and services to offer. But big data has a big problem: managing and storing all of that data.

When Apache Hadoop arrived, it was considered an on-site storage panacea. But then came the proliferation of Hadoop projects, which presented major challenges. First, data scientists were trying to contort Hadoop to do things it wasn’t built to do. Second, they discovered they had multiple copies of data across silos, with each line of business creating its own “version of the truth” through multiple iterations and transformations. And finally, because traditional Hadoop could be scaled only in tightly coupled blocks of storage and compute, enterprises found themselves over-provisioned on compute when all they needed was more storage.

In response, many companies have created a “data lake,” a single vast pool into which they pour raw, unfiltered, untreated data. Often that data lake is located in a public cloud, which has the scalability to handle this ever-expanding pool, and corporate users can use analytics tools offered by the cloud provider. Some users instead have tools that retrieve data from the cloud so they can run analytics on-premises. Other companies are investing in on-premises storage that users can draw from to run analytics.

Data Gravity

Unfortunately, all of these variations have drawbacks. Companies find it expensive to handle the high volume of data generated every day, not all of which is equally valuable. If the data is stored in a public cloud, the cost of transferring it into, and especially out of, the cloud for analytics is both high and unpredictable.
