Posted by on
Categories: AI Amazon Apache ARTIFICIAL INTELLIGENCE google Hadoop Machine Microsoft

Since its introduction six years ago, @Apache #Hadoop has been the #analytics platform of choice for #bigdata workloads in areas such as #genomics, #machinelearning, #artificialintelligence ( #AI) and nearly anything that requires sifting through massive amounts of data. Such applications require large data sets that continue to grow in size and number, requiring extremely high throughput to process the data as quickly as possible. Data is first loaded into Hadoop’s file system (HDFS) from a central data store, meaning that large amounts of network traffic is generated that adversely impacts network performance for other applications and users. Besides being extremely inefficient, moving data requires additional administrative oversight. While still a powerful platform, Hadoop and HDFS are beginning to show their age and may quickly fade as IT organizations adopt new software based storage architectures, inexpensive NVM Express (NVMe) flash devices, and upgrade networks to run 10 to 100 times faster. Networks are becoming less of a bottleneck to performance; however, the Hadoop file system itself causes several performance problems. For example, HDFS uses a very simplistic data protection scheme by maintaining three full copies of the source data (known as triple replication). This creates three times more write overhead on the network and requires three times as much storage, power, and cooling. In the original design of HDFS, it was believed that these additional copies could offset some of the inefficiency by placing data near the processing server (known as data locality). However, today’s networks are fast enough that data locality does not matter. In fact, locality hurts performance because some files are accessed more frequently that others. Placing a file near one server puts another server that also needs access to the file at a disadvantage. More important than data locality is the concept of data distribution. Breaking files into small chunks and distributing them across storage devices not only provides better data protection but also minimizes delays caused when applications must wait to access a heavily used file resulting in what is known as a hot spot. Finally, the Hadoop file system was optimized for hard disk drives and very large files. It has difficulty handling modern workloads comprised of very small files such as sensor data and data used for metadata operations, and its triple replication prematurely ages solid state disk systems, whose lifespan is consumed with each write cycle. Hadoop is still a very relevant analytical platform, but like an aging prizefighter, the technology’s better days appear to be behind it. Data is the source of competitive advantage in today’s economy, and modern analytical platforms need a storage efficient, high-performance file system optimized for flash memory that can linearly scale to meet the performance and capacity needs of any workload.  It must support both batch and real-time processing, protect data through snapshots that can be saved to the cloud for disaster recovery, and provide insights into data usage patterns for proper data placement (known as tiering) and management.  Artificial intelligence and machine learning are key drivers of change. Internet of Things (IoT) sensors are widespread in today’s data centers. Physical sensors monitor everything from temperature, humidity, and airflow while digital sensors monitor network traffic and security, server utilization, and power and storage consumption. Such monitoring applications create millions of files daily, all of which must be analyzed. Modern applications in scientific research can generate hundreds of petabytes (PB) of data. While HDFS struggles to keep up, modern flash-native file systems can easily scale to these levels and provide instant access to data regardless of file size. Even though prices for NAND flash continue to decline rapidly, flash memory remains too expensive to be the sole storage medium for big data analytics. The key to sensible economics at scale is to merge the performance benefits of flash with the economics of hard drives. This is best achieved by tiering less active data to object storage, the underlying technology used by Amazon, Goggle, and Microsoft. However, taking advantage of the economics of object storage often requires fundamental changes to existing applications, a painful, costly, and time-consuming task. HDFS was not designed to work with Amazon’s Simple Storage Service (S3), the defacto standard interface for virtually all cloud storage. As new analytic platforms evolve to leverage the performance of flash, these platforms will need a file system that supports both S3 and traditional file interfaces.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.