During his remarks in the General Session at PEX 2014, VMware CTO Ben Fathi said that Hadoop clusters were part of the 20% of the enterprise data center not being virtualized. This makes sense for the use case where you have a very large data set that is either static or slowly growing. However, with the adoption of ideas like “The Internet of Things”, more companies are using Hadoop to process large, bursty, time-sensitive data. The elasticity of the software-defined data center is perfect for this use case.
Another reason to virtualize Hadoop is availability. There are three kinds of nodes in Hadoop: the Datamaster, which runs the NameNode service; the ComputeMaster, which runs the Job Tracker service; and the Worker Nodes, which store HDFS data (DataNodes) and run the Task Tracker service. In the 1.x implementation of Hadoop you can have one NameNode, one Job Tracker, and N DataNodes. Because HDFS stores data in triples (each block is written to three DataNodes), it is inherently fault tolerant. However, the NameNode and Job Tracker are single points of failure. Running these nodes on vSphere with HA compensates for that limitation.
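For reference, the triple replication described above is controlled by the standard `dfs.replication` property in `hdfs-site.xml` (3 is the Hadoop default, so you normally don't need to change it — this fragment is just illustrative):

```xml
<!-- hdfs-site.xml: how many DataNodes each HDFS block is written to -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```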
Note: The single-point-of-failure issue is being addressed in Hadoop 2.0.
You can run Hadoop nodes on vSphere out of the box if you build your cluster manually, but there is a better way: vSphere Big Data Extensions (vBDE), included in the vCloud Suite. vBDE lets you provision entire clusters through a simple wizard, and it supports almost every major Hadoop distribution. In the following sections I will cover how to deploy vBDE and create a cluster using the default Apache Hadoop template.
Deploying vSphere Big Data Extensions
Installing vBDE involves downloading the .ova file and deploying the vApp. Inside the vApp there are two VMs, the Serengeti Management Server and a generic Hadoop template.
Once the management-server VM has started, open the web page https://management-server-ip-address:8443/register-plugin.
Leave the “Install” radio button selected, enter the connection information for your vCenter, and click “Submit”. This will register the vBDE plugin with vCenter.
If you have vCenter open, close it and reopen it. The Big Data Extensions icon is now available under Inventory.
Now that the plugin is installed, you need to connect to the Management Server. Click the vBDE icon, then click the Summary tab. In the Connect Server box, click “Connect Server…”
Navigate the tree to management-server, select it and click OK.
The server should add quickly. Now select Resources in the left pane under Inventory Lists and add the Datastores and Networks you wish to make available to vBDE.
Now click on Back to Big Data Extensions and select Big Data Clusters under Inventory Lists. Click the New Big Data Cluster icon.
On Page 1, name the cluster, choose your distribution (by default vBDE is deployed with Apache Hadoop 1.2.1 only), choose your deployment type, and size the nodes. Then click Next.
On Page 2 you define the Topology and Network. “HOST_AS_RACK” treats each ESXi host as a rack, so that, where possible, HDFS stores each copy of a data “triple” on a different ESXi host.
On Page 3, select the resource pool. This is tricky: the wizard lets you select any pool, but vBDE is not supported on child pools, so you need to pick a cluster.
On Page 4 you set a cluster password. If you set your own, only “a-z”, “A-Z”, “0-9”, and “_” are valid characters, and it must be at least 8 characters long.
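Those constraints are easy to check before you run the wizard. A minimal sketch in bash (the password value is just a placeholder, not a real credential):

```shell
# Validate a candidate vBDE cluster password: at least 8 characters,
# drawn only from a-z, A-Z, 0-9, and underscore.
pw="my_hadoop_01"   # placeholder value
if [[ "$pw" =~ ^[A-Za-z0-9_]{8,}$ ]]; then
  echo "password ok"
else
  echo "password rejected"
fi
```

Anything with spaces, punctuation, or fewer than 8 characters fails the pattern, matching the wizard's rules.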
On Page 5 confirm your choices and click “Finish”.
Wait a few minutes and voila! You have a fully functional Hadoop cluster. Inject your dataset and query away!
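If you want a quick smoke test of the new cluster, the stock WordCount example works well. This sketch assumes you are on a node with the `hadoop` CLI on the PATH; the HDFS paths and `sample.txt` are placeholders, and the examples jar name matches the Apache Hadoop 1.2.1 distribution that ships with vBDE:

```shell
# Copy a local file into HDFS, run the bundled WordCount example,
# and print the first few results.
hadoop fs -mkdir /user/hadoop/input
hadoop fs -put sample.txt /user/hadoop/input/    # sample.txt is a placeholder
hadoop jar "$HADOOP_HOME"/hadoop-examples-1.2.1.jar wordcount \
    /user/hadoop/input /user/hadoop/output
hadoop fs -cat /user/hadoop/output/part-* | head
```

If the job completes and the output lists words with their counts, HDFS and MapReduce are both healthy.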
Stay tuned for Part 2 where I discuss how to add other supported distributions of Hadoop to vSphere Big Data Extensions.