Monday, January 7, 2013

Multinode HDInsight cluster




Source: http://social.msdn.microsoft.com/Forums/en-US/hdinsight/thread/885efc22-fb67-4df8-8648-4ff38098dac6

I've set up HDInsight to work with multiple nodes in a lab environment. (Meaning I've ignored security, etc., and there's no easy way to control the start-up and shutdown of the nodes in the cluster.)

Setting up a multi-node cluster on Windows with HDInsight is not significantly different from setting up a multi-node Hadoop cluster on any other platform. I do recommend reading the following link: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/


Breaking it down to a few simple steps:
1. Install HDInsight on your master (hmaster) and your nodes (hnode1..hnodeN). For ease of configuration, make sure the master and the nodes have static IPs.
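If you want to set a static IP from the command line, something like the following works from an elevated prompt. (The interface name, address, mask and gateway below are only placeholders for your own network.)

netsh interface ipv4 set address name="Local Area Connection" static 192.168.0.10 255.255.255.0 192.168.0.1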

2. Turn off all firewalls and other port-blocking software on all nodes. (Make sure the machines have access to each other.)
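In a lab environment, one quick (and deliberately insecure) way to do this from an elevated command prompt on each machine is:

netsh advfirewall set allprofiles state off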

3. On the master and on ALL nodes, edit %WinDir%\system32\drivers\etc\hosts and add the following lines:

<ip-to-master> hmaster
<ip-to-node1> hnode1
......
<ip-to-nodeN> hnodeN

(Replace <ip-to-xxx> with the IP address of the corresponding machine.)
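For example, with made-up lab addresses the entries could look like this:

192.168.0.10 hmaster
192.168.0.11 hnode1
192.168.0.12 hnode2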

4. On the master, edit the masters file (found in C:\Hadoop\hadoop-1.1.0-SNAPSHOT\conf\ on my setup) and make sure the only line in the file is the name of the master (it must match what you just entered in the hosts file):

  hmaster

5. On the master, edit the slaves file (found in C:\Hadoop\hadoop-1.1.0-SNAPSHOT\conf\ on my setup) and add the names of all the nodes. The names must match what you just entered in the hosts file:

hnode1
...
hnodeN

6. On the master and on ALL nodes, edit the file core-site.xml (found in C:\Hadoop\hadoop-1.1.0-SNAPSHOT\conf\ on my setup).

Find the property fs.default.name and change its value to hdfs://hmaster:8020.
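The entry should end up looking like this:

<property>
  <name>fs.default.name</name>
  <value>hdfs://hmaster:8020</value>
</property>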

7. On the master and on ALL nodes, edit the file mapred-site.xml (found in C:\Hadoop\hadoop-1.1.0-SNAPSHOT\conf\ on my setup).

Modify all values that reference localhost to reference hmaster instead.
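For example, the jobtracker address ends up pointing at hmaster. The port below is only an example; keep whatever port is already in your file and change just the host:

<property>
  <name>mapred.job.tracker</name>
  <value>hmaster:50300</value>
</property>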

8. On the master and on ALL nodes, edit the file hdfs-site.xml (found in C:\Hadoop\hadoop-1.1.0-SNAPSHOT\conf\ on my setup).

Modify all values that reference localhost to reference hmaster instead.

Also make sure that the value of the dfs.replication property is less than or equal to the total number of nodes in your system. (Note that the description of this property seems to be wrong in the conf files supplied with HDInsight.)
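For example, assuming a cluster with three nodes in total, the replication entry could look like this:

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>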

9. Now, on the master and on all nodes, open a command prompt and execute the shutdown script stop-onebox.cmd (found in C:\Hadoop).

10. During the HDInsight install, all the machines were installed as namenodes and the HDFS filesystem was initialized on the first service start-up, so the filesystem on each node now has a different ID. If you try to start the system at this point you'll get errors in the logs about invalid IDs. The easiest way to fix this on a fresh system is to delete the HDFS folder (c:\Hadoop\hdfs on my system) on each node and let the system recreate it on start-up.
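From a command prompt on each node, that can be as simple as (assuming the same c:\Hadoop\hdfs location as on my system):

rmdir /s /q c:\Hadoop\hdfs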

11. Log on to the master, start a command prompt, and invoke the start-up script (c:\Hadoop\start-onebox.cmd).

Repeat Step 11 for all nodes.
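In other words, on the master and then on every node:

cd /d c:\Hadoop
start-onebox.cmd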

Monitor your namenode status by navigating to
http://hmaster:50070/

and your MapReduce status at
http://hmaster:50030/jobtracker.jsp

There are several drawbacks to this approach, and you should never go live with this setup. One is security; another is that all the nodes are configured as namenodes and run a few services that should not be running on the worker nodes. The installation includes scripts to start the system as a node, but I've tried to keep this walkthrough as simple as possible.

I do recommend starting from scratch when setting up a multi-node cluster, rather than migrating from a single-node cluster to a multi-node cluster. I ran into several issues with Hive and its metadata (which it stores in Derby by default) when doing this. Hive stores the full HDFS path in the metadata tables (the DB_LOCATION_URI column in DBS and LOCATION in SDS), so if you change from a single-node to a multi-node configuration you're likely to run into Hive trying to access the data on localhost. To fix the issue you need to modify the data in the Derby tables.
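As an illustration only, the fix amounts to updates along these lines against the Derby metastore (the paths here are hypothetical; look up the actual values in your own DBS and SDS tables first and rewrite each affected row accordingly):

UPDATE DBS SET DB_LOCATION_URI = 'hdfs://hmaster:8020/hive/warehouse'
  WHERE DB_LOCATION_URI = 'hdfs://localhost:8020/hive/warehouse';
UPDATE SDS SET LOCATION = 'hdfs://hmaster:8020/hive/warehouse/mytable'
  WHERE LOCATION = 'hdfs://localhost:8020/hive/warehouse/mytable';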




I would be glad to know if this post helped you.