
As part of its Big Data Cloud Service, Oracle provides a set of internal and external tools designed to help users efficiently deploy and manage Hadoop-based big data systems. Oracle’s Big Data Cloud Service offers enterprise users a platform on which to quickly and easily implement a big

The service utilizes Oracle’s cloud infrastructure and other technologies, from Oracle and elsewhere, to provide a complete environment to set up, manage and elastically scale Hadoop clusters through a centralized portal. Big Data Cloud eliminates many of the cluster implementation complexities for Oracle Hadoop users by providing the tools necessary to deploy a system, secure its environment and integrate it with other services.

The heart of the service doesn’t come from Oracle itself; it comes from Cloudera Inc.’s CDH distribution of Hadoop and related big data tools, which together comprise a scalable, integrated architecture for managing massive volumes of heterogeneous data.

For those not in the know, Hadoop is an open source framework for building distributed processing systems across clusters built on commodity hardware. Because of its distributed architecture, Hadoop can effectively manage petabyte-scale data sets and support sophisticated analytics while controlling security, governance and data access.

Hadoop and more

CDH includes four core Hadoop modules that support storage and processing operations: Hadoop Common, a set of utilities that supports the other modules; the Hadoop Distributed File System (HDFS), which can store a mix of structured, semistructured and unstructured data; Hadoop YARN, the job scheduler and cluster resource manager; and MapReduce, the processing engine and programming framework.
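To make the MapReduce programming model more concrete, here is a minimal single-process Python sketch of its three phases (map, shuffle, reduce) applied to a word count. It does not use Hadoop's own APIs. The function names map_phase, shuffle_phase, reduce_phase and word_count are illustrative assumptions, and on a real cluster these steps would run in parallel across nodes against data stored in HDFS.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    # Hypothetical sample input; a real job would read its input from HDFS.
    sample = ["big data on Hadoop", "Hadoop stores big data"]
    print(word_count(sample))
```

The split into separate phases is the point of the sketch: because each map call looks at one record and each reduce call looks at one key, the framework is free to spread both across many machines.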

In addition to the core Hadoop components, CDH offers a number of other Apache technologies that work in conjunction with Hadoop to expand on or add to its capabilities. Many of them are integrated into the Big Data Cloud Service along with the Hadoop distribution.

One of the most important is the Apache Spark processing engine, which supports a wide range of operations, including data transformations, machine learning, batch and real-time stream processing, and advanced modeling and analytics. IT teams often use Spark as a batch processing engine rather than MapReduce because of Spark’s flexibility and in-memory processing capabilities, which offer significant performance improvements over MapReduce.
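The following toy Python sketch illustrates, under simplifying assumptions, why keeping intermediate results in memory can help a multistage job. It uses neither Spark nor Hadoop. The helper names run_with_disk_handoff and run_in_memory are hypothetical, and the local JSON files merely stand in for the intermediate output that chained MapReduce jobs write to storage between stages.

```python
import json
import os
import tempfile


def run_with_disk_handoff(records, stages, workdir):
    """Apply each stage, writing its output to a file and reading it back."""
    disk_writes = 0
    data = list(records)
    for index, stage in enumerate(stages):
        data = [stage(item) for item in data]
        path = os.path.join(workdir, f"stage_{index}.json")
        with open(path, "w") as handle:
            json.dump(data, handle)  # hand-off to the next stage via storage
        disk_writes += 1
        with open(path) as handle:
            data = json.load(handle)
    return data, disk_writes


def run_in_memory(records, stages):
    """Apply each stage directly to the previous stage's in-memory output."""
    data = list(records)
    for stage in stages:
        data = [stage(item) for item in data]
    return data, 0


if __name__ == "__main__":
    # Hypothetical three-stage pipeline over a tiny data set.
    stages = [lambda x: x * 2, lambda x: x + 1, lambda x: x * x]
    with tempfile.TemporaryDirectory() as workdir:
        print(run_with_disk_handoff([1, 2, 3], stages, workdir))
    print(run_in_memory([1, 2, 3], stages))
```

Both functions produce the same records; the difference is the number of round trips to storage, which grows with the number of stages in the first version and stays at zero in the second. Real-world gains depend on the workload, data volume and cluster configuration.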

The Apache technologies supported by Big Data Cloud also include:

HBase, a nonrelational, key-value data store for handling large data sets distributed across HDFS clusters;
Hive, a data warehouse infrastructure built on top of Hadoop deployments that supports analytics, data summarization and ad hoc queries against large data sets;
Oozie, a workflow scheduler for managing Hadoop jobs;
Pig, a data flow language and execution framework for performing complex analytics, aggregations and transformations against large data sets;
Sqoop, a tool for transferring bulk data between structured data stores and Hadoop clusters; and
ZooKeeper, a coordination service for maintaining and synchronizing configuration and naming information for distributed applications.

Like other cloud-based services, Big Data Cloud evolves quickly, so the list of supported Apache tools will likely change over time. Refer to Oracle’s documentation for the most current list of what’s available as part of the service to support Oracle Hadoop deployments.
