It’s 3 a.m. — Do You Know What Your Cluster’s Doing?
Performance challenges in Hadoop environments are par for the course as organizations attempt to capture the benefits of big data. The growing ecosystem of tools and applications for Hadoop (including new analytics platforms) is becoming increasingly distributed, which makes running Hadoop in production far more complicated. That raises the question: do you know how your cluster will react to this kind of growth and change in usage? It might be operating just fine today, but what happens two months from now, once hundreds of new workloads have been added?
Distributed systems make optimal performance a particularly hard goal to achieve. When your cluster has hundreds of nodes, and each node has dozens of jobs running independently and consuming CPU, RAM, disk and network, with levels of resource consumption that are constantly changing, it quickly becomes an extremely chaotic system. Businesses need a way to bring order to that chaos, especially when usage is constantly in flux and therefore affects performance.
So where does one start? Finding a way to measure Hadoop performance is the first step, and the best way to do that is by establishing Quality of Service (QoS) for Hadoop. QoS provides the ability to ensure performance service levels for Hadoop applications by enabling prioritization of critical jobs. It means that multiple jobs can run side by side, safely and effectively, since bottlenecks and contention can be averted. The types of workloads that typically take priority on the cluster include transactional systems like HBase, data ingest/ETL jobs that carry strict service level agreements (SLAs), and daily runs of critical data products such as direct mailings sent to customers. These jobs take priority because they need to maintain consistent performance in the face of other, less critical jobs such as ad hoc or analytic workloads.
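To make the idea of prioritizing critical jobs concrete, here is a minimal sketch in Python. It is a toy model, not how Hadoop or any particular scheduler implements QoS: the `Job` class, the `allocate` function, the priority numbers, and the resource figures are all hypothetical and chosen purely for illustration. It shows only the core principle, which is that when demand exceeds capacity, critical workloads are served first and less critical ones absorb the shortfall.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int  # hypothetical convention: lower number = more critical
    demand: int    # resource units requested (for example, containers)


def allocate(jobs, capacity):
    """Grant capacity to jobs in priority order; return {name: units granted}."""
    grants = {}
    remaining = capacity
    # sorted() is stable, so jobs with equal priority keep their submission order.
    for job in sorted(jobs, key=lambda j: j.priority):
        granted = min(job.demand, remaining)
        grants[job.name] = granted
        remaining -= granted
    return grants


# Illustrative workloads: total demand (150) exceeds cluster capacity (100).
jobs = [
    Job("adhoc-query", priority=3, demand=60),
    Job("hbase-serving", priority=1, demand=40),
    Job("etl-ingest", priority=2, demand=50),
]

grants = allocate(jobs, capacity=100)
print(grants)
```

In this toy run the transactional and ETL jobs receive everything they ask for, while the ad hoc query is left with whatever remains. A real QoS mechanism would also have to react to consumption that changes from moment to moment, which this static sketch does not attempt.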
Today’s multi-tenant, multi-workload clusters face major performance issues caused by the fundamental limitations of Hadoop, especially in large-scale big data implementations, because beyond a certain threshold contention is inevitable. The implications of these problems are massive. From a business perspective, companies are wasting time and money trying to fix cluster performance issues that prevent them from realizing the full benefits and ROI of their big data efforts, or from gaining any competitive advantage linked to big data initiatives. From a technology perspective, unreliable Hadoop performance leads to late jobs, missed SLAs, overbuilt clusters, and underutilized hardware.