http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201009.mbox/%3CA3EF3F6AF24E204B812D1D24CCC8D71A03688F76@mse16be2.mse16.exchange.ms%3E
Thanks for the responses. I especially appreciate the details, Matthew!

Just for the record, I appreciate that having multiple DataNodes on a single machine defeats the purpose of, and loses the advantages of, spreading them across machines and racks. I intend to go to that model as we grow.

Cheers Arv

-----Original Message-----
From: Matthew Foley [mailto:mattf@yahoo-inc.com]
Sent: September 15, 2010 8:45 PM
To: common-user@hadoop.apache.org
Cc: Matthew Foley
Subject: Re: Multiple DataNodes on a single machine

Hello Arv,

It is possible to run multiple datanodes on a single machine, and this can be useful for small-scale test scenarios. Also, you mentioned in your previous message that you have a Hadoop implementation with only one physical datanode server and want to replicate within it, between spindles. This also makes sense, and will work. Of course, if you have two datanodes running you will get only order-2 replication, not order-3, even if the replication has been set to 3.

I will describe the config in a moment, but I would first like to point out that in clusters with even a few datanode servers, one is better off with cross-server replication. Without cross-server replication, losing the system disk will make ALL data volumes unavailable. And of course, multiple datanodes running on one server will compete for cores, NICs, bus, and memory access, even if not for spindles.

A previous responder suggested running two namenodes also, but it wasn't clear whether he meant two primaries or one primary and one secondary/checkpoint namenode. The latter is fine, but running two primary namenodes is definitely not the thing to do!

Anyway, here's how you set it up. I have done this recently with v0.21.0, with two datanode processes in a single box (along with the namenode sharing the same box), and it did replicate correctly between the two.
I haven't tried it with > 2 datanodes, and I don't know what the impact on process efficiency would be, but that would probably work too.

1. In your HADOOP_HOME directory, copy the "conf" directory to, say, "conf2".

2. In the conf2 directory, edit as follows:

   a) In hadoop-env.sh, provide unique non-default HADOOP_IDENT_STRING, e.g. ${USER}_02

   b) In hdfs-site.xml, change dfs.data.dir to show the desired targets/volumes for datanode#2, and of course make sure the corresponding target directories exist. Also remove these targets from the dfs.data.dir target list for datanode#1 in conf/hdfs-site.xml.

   c) In hdfs-site.xml, set the four following "address:port" strings to something non-conflicting with the other datanode and other processes running on this box:
      - dfs.datanode.address (default 0.0.0.0:50010)
      - dfs.datanode.ipc.address (default 0.0.0.0:50020)
      - dfs.datanode.http.address (default 0.0.0.0:50075)
      - dfs.datanode.https.address (default 0.0.0.0:50475)

   Note: the defaults above are what datanode#1 is probably running on. I added 2 to each port number for datanode#2 and it seemed to work okay.

   You might also wish to note the default ports associated with the namenode and job/task tracker processes, in case they are running on the same box:
      - fs.default.name 0.0.0.0:9000
      - dfs.http.address 0.0.0.0:50070
      - dfs.https.address 0.0.0.0:50470
      - dfs.secondary.http.address 0.0.0.0:50090
      - mapred.job.tracker.http.address 0.0.0.0:50030
      - mapred.task.tracker.report.address 127.0.0.1:0
      - mapred.task.tracker.http.address 0.0.0.0:50060

3. At this point, launching with:

      bin/hdfs --config $HADOOP_HOME/conf2 datanode

   will work.
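The port changes in step 2c can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function names shift_port, print_property, and datanode2_properties are made up for this example, the property names and default ports are the ones listed above, and the offset of 2 simply follows the note in step 2c. It prints the four property elements for conf2/hdfs-site.xml; dfs.data.dir still has to be edited by hand as described in step 2b.

```shell
#!/bin/sh
# Sketch only: print the four datanode address properties for a second
# datanode, shifting each default port by a fixed offset.
# All function names here are hypothetical helpers, not part of Hadoop.

# shift_port ADDRESS OFFSET
# Prints ADDRESS ("host:port") with its port increased by OFFSET.
shift_port() {
    host=${1%:*}      # everything before the last colon
    port=${1##*:}     # everything after the last colon
    echo "${host}:$(( $port + $2 ))"
}

# print_property NAME VALUE
# Prints one <property> element in hdfs-site.xml syntax.
print_property() {
    printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' "$1" "$2"
}

# datanode2_properties [OFFSET]
# Prints the four address properties, using the defaults quoted in the
# message as the starting point. OFFSET defaults to 2.
datanode2_properties() {
    offset=${1:-2}
    print_property dfs.datanode.address       "$(shift_port 0.0.0.0:50010 "$offset")"
    print_property dfs.datanode.ipc.address   "$(shift_port 0.0.0.0:50020 "$offset")"
    print_property dfs.datanode.http.address  "$(shift_port 0.0.0.0:50075 "$offset")"
    print_property dfs.datanode.https.address "$(shift_port 0.0.0.0:50475 "$offset")"
}

datanode2_properties 2
```

The printed elements are meant to be pasted inside the existing configuration element of conf2/hdfs-site.xml. With an offset of 2, dfs.datanode.address comes out as 0.0.0.0:50012; check that none of the shifted ports collide with the namenode and tracker ports listed above before using them.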
To make it convenient to launch as a service, you can add a couple of lines to the end of the bin/start-dfs.sh script, like:

   HADOOP_CONF_DIR2=$HADOOP_HOME/conf2
   "$HADOOP_COMMON_HOME"/bin/hadoop-daemons.sh --config $HADOOP_CONF_DIR2 --script "$bin"/hdfs start datanode $dataStartOpt

Hope this helps,
--Matt

On Sep 15, 2010, at 8:50 AM, Arv Mistry wrote:

Hi,

Is it possible to run multiple data nodes on a single machine? I currently have a machine with multiple disks and enough disk capacity for replication across them. I don't need redundancy at the machine level but would like to be able to handle a single disk failure.

So I was thinking that if I can run multiple DataNodes on a single machine, each assigned a separate disk, that would give me the protection I need against disk failure.

Can anyone give me any insight into how I would set up multiple DataNodes to run on a single machine?

Thanks in advance,

Cheers Arv