APP-CAP2956: Inside the Hadoop Machine

This is a session blog for APP-CAP2956, Inside The Hadoop Machine. Presenting are Jeff Buell (VMware), Richard McDougall (VMware), and Sanjay Radia (Hortonworks).

Some use cases for Hadoop include log processing, click stream analysis, machine learning, web crawling, and image/XML processing. But what is Hadoop? Hadoop is a programming framework for parallel processing; it enables distributed processing of large data sets across clusters of computers. Hadoop also assumes that machines will fail, and it is designed to tolerate those failures. Hadoop works by taking an input file, breaking it up into pieces (the size of those pieces is driven by the HDFS block size), and handing those pieces out to the systems in the Hadoop cluster. The first phase is the map phase; then comes the reduce phase, where the results are sorted and merged and then written to the output file. (Hence the name MapReduce.)
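
To make the map and reduce phases a bit more concrete, here is a minimal single-process sketch in Python of a word count, the usual teaching example. It is not Hadoop code: the function names are made up for illustration, the "blocks" are split by word count rather than by HDFS block size, and everything runs in one process instead of across a cluster.

```python
from collections import defaultdict

def split_into_blocks(text, block_size):
    """Break the input into fixed-size pieces, standing in for HDFS blocks.

    Assumption: pieces are measured in words here, not bytes.
    """
    words = text.split()
    return [words[i:i + block_size] for i in range(0, len(words), block_size)]

def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one block."""
    return [(word.lower(), 1) for word in block]

def shuffle(mapped_blocks):
    """Group all emitted values by key so each key ends up in one place."""
    grouped = defaultdict(list)
    for pairs in mapped_blocks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts per word and return them sorted by word."""
    return sorted((word, sum(counts)) for word, counts in grouped.items())

def word_count(text, block_size=4):
    blocks = split_into_blocks(text, block_size)
    # On a real cluster each block would be mapped on a different machine.
    mapped = [map_phase(block) for block in blocks]
    return reduce_phase(shuffle(mapped))

if __name__ == "__main__":
    print(word_count("the cat saw the dog the end"))
```

Each block is mapped independently, which is the property that lets the work spread across many machines; only the grouping step needs to see the output of every block.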

Hadoop consists of two pieces: MapReduce, which is the programming framework for parallel processing; and the Hadoop Distributed File System (HDFS), which provides distributed data storage for the MapReduce jobs. HDFS is not POSIX-compliant; it behaves more like Amazon S3 or EBS.

HDFS has two components: a name node and a data node. The name node is responsible for the directory hierarchy and the namespace. HDFS 1 only has a single name node; this represents a single point of failure for the filesystem. Data nodes store the actual data blocks, and provide replication for the blocks as well in the event of failure (typically 3 copies are made, and the third copy will be in a separate rack/fault domain).
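
As a rough illustration of the split between namespace and block storage, the Python sketch below plays the role of a name node deciding where each block's copies should live. It follows the simplified description from the session (three copies, with the last one on a separate rack) and is not HDFS's actual placement policy; the function name, rack names, and node names are all hypothetical.

```python
def place_blocks(file_size, block_size, racks, replicas=3):
    """Return {block_index: [(rack, node), ...]} for a file of file_size bytes.

    racks maps a rack name to the list of data nodes in that rack.
    Simplified policy (an assumption): all copies but the last go to
    distinct nodes in one rack, and the last copy goes to a different rack.
    """
    rack_names = sorted(racks)
    if len(rack_names) < 2:
        raise ValueError("need at least two racks for a separate fault domain")
    if any(len(racks[name]) < replicas - 1 for name in rack_names):
        raise ValueError("every rack needs at least replicas - 1 data nodes")

    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block in range(num_blocks):
        # Rotate the starting rack so blocks spread across the cluster.
        home = rack_names[block % len(rack_names)]
        away = rack_names[(block + 1) % len(rack_names)]
        home_nodes, away_nodes = racks[home], racks[away]
        local = [(home, home_nodes[(block + i) % len(home_nodes)])
                 for i in range(replicas - 1)]
        remote = [(away, away_nodes[block % len(away_nodes)])]
        placement[block] = local + remote
    return placement

if __name__ == "__main__":
    cluster = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
    # A 300 MB file with 128 MB blocks needs three blocks.
    for block, copies in place_blocks(300, 128, cluster).items():
        print(block, copies)
```

Because every block has a copy on a second rack, losing a whole rack in this sketch still leaves at least one copy of each block readable.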

Hadoop is topology-aware, meaning that it knows how to distribute jobs and tasks across nodes and across racks and data centers. That’s fine for physical environments, but doesn’t necessarily translate well to a virtualized environment.
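
A small Python sketch of what topology awareness means for task placement: prefer a node that already holds the data, then a node in the same rack, then anything else. The names and the three-level scoring are assumptions made for illustration, and Hadoop's real scheduler is considerably more involved. Note that the sketch has no notion of several VMs sharing one physical host, which is exactly the extra layer a virtualized environment adds.

```python
def locality(replica_nodes, node, rack_of):
    """Score a candidate node: 0 = holds the data, 1 = same rack, 2 = elsewhere."""
    if node in replica_nodes:
        return 0
    if rack_of[node] in {rack_of[replica] for replica in replica_nodes}:
        return 1
    return 2

def pick_node(replica_nodes, free_nodes, rack_of):
    """Choose the free node closest to the task's data (ties broken by name)."""
    return min(free_nodes, key=lambda n: (locality(replica_nodes, n, rack_of), n))

if __name__ == "__main__":
    rack_of = {"dn1": "rack1", "dn2": "rack1", "dn3": "rack2", "dn4": "rack2"}
    # The block lives on dn1; dn1 is busy, so the rack-local dn2 wins over dn3.
    print(pick_node(["dn1"], ["dn3", "dn2"], rack_of))
```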

Some challenges with enterprise deployments of Hadoop include:

  • Slow to provision, complex to keep running
  • Single points of failure with name node and job tracker
  • No HA for Hadoop Framework Components (such as Hive, HCatalog, etc.)
  • No easy way to share resources between Hadoop and non-Hadoop workloads
  • Lack of resource containment
  • No multi-tenancy or security/performance isolation functionality
  • Lack of configuration isolation (can’t run multiple versions on the cluster)

Virtualization presents itself as a potential solution to help address these enterprise Hadoop challenges.

This leads to a discussion of Project Serengeti, which (in its first release) helps automate the process of provisioning a virtual Hadoop cluster in just a matter of minutes. Serengeti deploys as a single vApp. After logging into the vApp (the CLI is similar to Hadoop’s), issue the “cluster create” command, and Serengeti will deploy a fully functional 3-node Hadoop cluster. Serengeti is driven by a JSON file that specifies the distribution of Hadoop, the cluster size, the type of storage (shared or local), etc. Serengeti supports a number of commercial and open source projects around Hadoop.

Hadoop relies heavily on the storage subsystem. Local storage is extraordinarily cheap but unreliable; when used in conjunction with something like HDFS, however, it looks quite attractive. In vSphere environments, though, the most common form of storage is SAN/NAS. How does this work when you bring Hadoop into the mix? By using a hybrid model that leverages both shared and local storage. Hadoop seems to generate the majority of its disk I/O for temporary storage; you can use the hybrid model to keep the data that must be reliable on SAN/NAS and the temporary working data on local storage. Recall that HDFS protects the data by creating multiple copies of it, so failure of a local host (and its associated storage) isn’t that big of a deal.

Four common performance concerns that come up with running Hadoop on vSphere:

  1. How well does Hadoop run virtualized?
  2. Performance implications of running Hadoop on shared storage?
  3. What’s the effect of protecting Hadoop master daemons using FT?
  4. Should we rent (public) or build (private) for running Hadoop?

Quick comparisons of Hadoop on native hardware versus Hadoop virtualized show a small decrease in performance with only a single VM, but when the workload is divided across 2 or 4 VMs, performance returns closer to native. This was especially true with the TeraSort workload VMware used in their testing. With regard to storage, local storage was faster in all instances, with the exception of the TeraValidate workload working against a JBOD configuration on the SAN. There were significant performance impacts when SAN RAID overhead was introduced. This was especially true for the TeraGen workload, which is write-heavy.

For FT usage (question #3 above), the testing showed a 2-4% slowdown for TeraSort when FT was enabled for the name node and job tracker nodes.

The cost comparison showed that a significantly smaller vSphere implementation needed only 5x more time to run TeraSort, and was able to do it at 1/3 the cost.

The presentation now shifts to a discussion of some joint development work between VMware and Hortonworks, specifically around efforts to provide HA (high availability) for Hadoop components. If the name node fails, the entire job fails and has to be restarted; thus, the name node is typically protected with great effort. Would vSphere HA be enough to protect the name node? Normally jobs would fail. What VMware and Hortonworks have done is add some Hadoop-specific awareness (a JVM monitor to serve as a watchdog for the name node services) as well as add some pause/retry functionality for Hadoop-enabled applications. The Hortonworks HDP 1.0 release has these patches and the patches have been pushed upstream (to Apache, I believe). This means that name nodes (or job trackers) that you want to protect with HA need to be on shared storage, but worker nodes (which can be physical or virtual) will use local storage.
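
The pause/retry idea can be sketched in a few lines of Python. This is only an illustration of the general pattern, not the actual Hortonworks patch: the exception class, the function name, and the retry parameters are all hypothetical.

```python
import time

class NameNodeUnavailable(Exception):
    """Raised by a hypothetical client while the name node is being restarted."""

def call_with_retry(operation, attempts=5, delay=1.0, sleep=time.sleep):
    """Run operation(); if the name node is unavailable, pause and try again.

    The sleep function is a parameter so callers (and tests) can replace it.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except NameNodeUnavailable:
            if attempt == attempts:
                raise  # give up once the retry budget is exhausted
            sleep(delay)

if __name__ == "__main__":
    state = {"calls": 0}

    def list_directory():
        # Simulate a name node that comes back after two failed calls.
        state["calls"] += 1
        if state["calls"] < 3:
            raise NameNodeUnavailable()
        return ["/data/part-00000"]

    print(call_with_retry(list_directory, delay=0.1))
```

The point of the pattern is that a client waits out a short name node restart instead of failing the whole job on the first error.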

The third and final topic for Hadoop on vSphere involves elastic scaling. The solution that VMware arrived at was to separate the job node and the data node into separate VMs; this enabled multi-tenancy and also fixed some of the “content isolation” and “configuration isolation” issues mentioned earlier.

At this point, the session wrapped up.

  1. John Strange

    Great info!

    Do you have any comment on Cassandra vs Hadoop? I am trying to wrap my head around the idea and my developers gave me this:

    Cassandra:
    1) It clusters better (one might say that it’s “easier” to cluster).
    2) It doesn’t have a single point of failure (all nodes are masters — whereas if the NameNode in HBase dies, the whole thing is down).
    3) Querying is built-in (CQL).
    4) Bigger community.

    To me it seems Hadoop is stronger, but I really do not know the right questions to ask to determine which to use. From your post it sounds like there is a strong effort to make clustering easier and protect NameNode.

  2. slowe

    John, I don’t have enough experience or expertise to provide a detailed answer to your question, but I can confirm—based on the presentation—that there is indeed a fair amount of effort being expended to protect the name node and alleviate that potential single point of failure. Perhaps other readers could weigh in on the Cassandra-v-Hadoop discussion?

  3. Carter Shanklin

    John,

    It will help to clarify one thing for starters:

    Cassandra is a NoSQL database in the Dynamo model.

    Hadoop is a data platform that supports many different data management solutions. At its heart Hadoop is a distributed filesystem (HDFS) and a distributed compute platform (through Map-Reduce). On top of these things people have built a number of interfaces. For instance, there is Hive, which looks very SQL-like, and there is Pig, which allows data processing in a procedural way. There is also a NoSQL store called HBase, which is in the BigTable model. There are other up-and-coming data management systems and tools for Hadoop which are not very mature yet. With Hadoop you get a lot of options.

    Overall in the “new style data management” space there’s nothing that’s extremely mature, you need to pick the solution that fits your problem. Cassandra focuses on being high-throughput and strives to always be write-available, meaning you can write to it almost regardless of how bad things get. Is that what you need? Hadoop focuses on accepting and processing any amount of data on commodity hardware and offers a wide variety of interfaces to store and access the data. Is that what you need?

    Cassandra has a quality of “no special nodes”. Deployment and basic scale-out of Cassandra is quite easy. But don’t extrapolate too far, Cassandra operation can be quite tricky at scale, particularly if you’re using quorum consistency to speed things up or doing multi-geo. There are questions of anti-entropy and node repairs that people don’t like to think much about.

    Hadoop 1 does have SPOF problems. When Hadoop 2 becomes stable enough for production these problems go away. In the interim there are also solutions. For instance, I work for Hortonworks and we have built high availability for Hadoop that takes advantage of either VMware HA or RHEL HA to protect Hadoop’s critical components.

    Community size: Hadoop’s is bigger, although Cassandra’s community is quite robust. The comment your devs made may have been made specifically about HBase rather than Hadoop.

    The fact that Hadoop is a data platform is possibly the most important thing. It’s common that Hadoop users use multiple tools in the stack on the same data, they might write the data in using HBase, then query it using Hive. I’m even aware of a company that uses Cassandra for storing sensor data, moves it into a Hadoop cluster, then does long-running queries there. CQL is nice, but it won’t get you very far in terms of deep insight.

    Again, the key thing is to understand what you need and match against what the solutions offer.

  4. Dan Baskette

    Cassandra really compares with HBase. HBase and Cassandra are key-value datastores based in some part on the Google BigTable project. HBase and Cassandra are both NoSQL databases that are aimed at transactional processing, whereas Hadoop (HDFS/MapReduce) is aimed at sequential batch processing. So, it all comes down to what you are trying to accomplish. Worry less about the infrastructure and more about the application; there are already multiple options for a highly available namenode in Hadoop. Greenplum MR (OEM of MapR) provides namenode functionality that is distributed across multiple nodes and supports HBase. EMC Isilon supports Namenode/Datanode functionality running on the storage hardware nodes to provide a distributed namenode as well.

  5. John Strange

    Carter and Dan, thank you very much! After posting I did realize it is really HBase vs Cassandra, but your explanations really helped me put this all together. I will push the Dev team to explain what they need!

  6. Dheeraj Pandey

    Cassandra and HBase are both NoSQL key-value stores from Apache. HBase began as a Hadoop sub-project; Cassandra is a separate project.
    o Cassandra is write-optimized, HBase is read-optimized
    o Cassandra can run on general-purpose Linux filesystems (shared-nothing). HBase uses a distributed filesystem such as HDFS underneath.
    o Cassandra vs. HBase is a religious war: http://b.qr.ae/bmLBcS, http://bit.ly/RPrbWA. WARNING: Don’t tell me you weren’t warned!