Hadoop employs a master/slave architecture for both distributed storage and distributed computation. The distributed storage system is called the Hadoop Distributed File System, or HDFS. The NameNode is the master of HDFS that directs the slave DataNode daemons to perform the low-level I/O tasks. The NameNode is the bookkeeper of HDFS; it keeps track of how your files are broken down into file blocks, which nodes store those blocks, and the overall health of the distributed filesystem.
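The NameNode's bookkeeping can be pictured as two mappings: files to block IDs, and block IDs to the DataNodes that hold them. The following is a toy sketch of that idea, not Hadoop code; the 64 MB block size matches the classic HDFS default, but the file names, block IDs, and node names are illustrative assumptions.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # classic HDFS default block size: 64 MB

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return how many blocks a file of file_size bytes occupies."""
    return (file_size + block_size - 1) // block_size  # ceiling division

# The NameNode's metadata, in miniature: file -> block IDs -> DataNodes.
namespace = {
    "/logs/access.log": ["blk_1", "blk_2"],
}
block_locations = {
    "blk_1": ["datanode1", "datanode3"],  # each block lives on several nodes
    "blk_2": ["datanode2", "datanode3"],
}

print(split_into_blocks(150 * 1024 * 1024))  # a 150 MB file -> 3 blocks
```

Notice that the NameNode holds only this metadata, never the block contents themselves, which is why its workload is dominated by memory and I/O rather than disk capacity.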
The function of the NameNode is memory and I/O intensive. As such, the server hosting the NameNode typically doesn't store any user data or perform any computations for a MapReduce program, in order to lower the workload on the machine. This means that the NameNode server doesn't double as a DataNode or a TaskTracker.
There is unfortunately a negative aspect to the importance of the NameNode: it's a single point of failure of your Hadoop cluster. For any of the other daemons, if their host nodes fail for software or hardware reasons, the Hadoop cluster will likely continue to function smoothly, or you can quickly restart it. Not so for the NameNode.
Each slave machine in your cluster will host a DataNode daemon to perform the grunt work of the distributed filesystem: reading and writing HDFS blocks to actual files on the local filesystem. When you want to read or write an HDFS file, the file is broken into blocks and the NameNode will tell your client which DataNode each block resides in. Your client communicates directly with the DataNode daemons to process the local files corresponding to the blocks. Furthermore, a DataNode may communicate with other DataNodes to replicate its data blocks for redundancy.
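The read flow just described, where metadata comes from the NameNode but data comes straight from the DataNodes, can be sketched as follows. This is a toy simulation, not the real HDFS client protocol; all class names, block IDs, and node names here are illustrative.

```python
class NameNode:
    """Holds only metadata: which blocks make up a file, and where they live."""
    def __init__(self, block_map):
        self.block_map = block_map  # path -> list of (block_id, [datanodes])

    def get_block_locations(self, path):
        return self.block_map[path]

class DataNode:
    """Holds the actual block contents."""
    def __init__(self, blocks):
        self.blocks = blocks  # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]

def read_file(namenode, datanodes, path):
    """Client-side read: ask the NameNode where each block is, then
    fetch each block directly from one of its DataNode replicas."""
    data = b""
    for block_id, replicas in namenode.get_block_locations(path):
        data += datanodes[replicas[0]].read_block(block_id)  # first replica
    return data

nn = NameNode({"/demo.txt": [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2"])]})
dns = {
    "dn1": DataNode({"blk_1": b"hello "}),
    "dn2": DataNode({"blk_1": b"hello ", "blk_2": b"world"}),
}
print(read_file(nn, dns, "/demo.txt"))  # b'hello world'
```

The key design point this illustrates is that file data never flows through the NameNode, so the master doesn't become a bandwidth bottleneck as the cluster grows.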
The Secondary NameNode (SNN) is an assistant daemon for monitoring the state of the cluster HDFS. Like the NameNode, each cluster has one SNN, and it typically resides on its own machine as well. No other DataNode or TaskTracker daemons run on the same server. The SNN differs from the NameNode in that this process doesn't receive or record any real-time changes to HDFS. Instead, it communicates with the NameNode to take snapshots of the HDFS metadata at intervals defined by the cluster configuration. As mentioned earlier, the NameNode is a single point of failure for a Hadoop cluster, and the SNN snapshots help minimize the downtime and loss of data. Nevertheless, a NameNode failure requires human intervention to reconfigure the cluster to use the SNN as the primary NameNode.
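The interval-based snapshotting can be sketched in a few lines. This is a hedged illustration, not SNN code: the one-hour default mirrors the classic fs.checkpoint.period setting, but the class and method names are invented for this example.

```python
import copy

CHECKPOINT_PERIOD = 3600  # seconds between snapshots (classic default: 1 hour)

class SnapshotTaker:
    """Periodically copies metadata so a recent snapshot survives a failure."""
    def __init__(self, period=CHECKPOINT_PERIOD):
        self.period = period
        self.last_checkpoint_time = 0.0
        self.snapshot = None

    def maybe_checkpoint(self, metadata, now):
        """Take a point-in-time snapshot if the interval has elapsed."""
        if now - self.last_checkpoint_time >= self.period:
            self.snapshot = copy.deepcopy(metadata)
            self.last_checkpoint_time = now
            return True
        return False

snn = SnapshotTaker()
meta = {"/a.txt": ["blk_1"]}
print(snn.maybe_checkpoint(meta, now=3600.0))  # True: interval elapsed
print(snn.maybe_checkpoint(meta, now=4000.0))  # False: too soon
```

Because the snapshot is only as fresh as the last checkpoint, any metadata changes made after it are lost on a NameNode failure, which is why recovery still involves some data loss and manual intervention.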
The JobTracker daemon is the liaison between your application and Hadoop. Once you submit your code to your cluster, the JobTracker determines the execution plan by determining which files to process, assigns nodes to different tasks, and monitors all tasks as they're running. Should a task fail, the JobTracker will automatically relaunch the task, possibly on a different node, up to a predefined limit of retries. There is only one JobTracker daemon per Hadoop cluster. It's typically run on a server as a master node of the cluster.
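The relaunch-with-a-retry-limit behavior can be sketched as a small loop. This is illustrative, not JobTracker code; the limit of 4 mirrors the classic mapred.map.max.attempts default, while the scheduling (rotating through nodes) and all names are assumptions made for the example.

```python
MAX_ATTEMPTS = 4  # classic default for mapred.map.max.attempts

def run_with_retries(task, nodes, max_attempts=MAX_ATTEMPTS):
    """Try a task on successive nodes; give up once the limit is reached."""
    for attempt in range(max_attempts):
        node = nodes[attempt % len(nodes)]  # possibly a different node each try
        try:
            return task(node)
        except RuntimeError:
            continue  # task failed on this node; relaunch elsewhere
    raise RuntimeError("task failed after %d attempts" % max_attempts)

# Example: a task that only succeeds on node "n3".
def flaky_task(node):
    if node != "n3":
        raise RuntimeError("task failed on " + node)
    return "done on " + node

print(run_with_retries(flaky_task, ["n1", "n2", "n3"]))  # done on n3
```

Rescheduling on a different node matters because many task failures are caused by the host (a bad disk, an overloaded machine) rather than by the task's own code.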
Each TaskTracker is responsible for executing the individual tasks that the JobTracker assigns. Although there is a single TaskTracker per slave node, each TaskTracker can spawn multiple JVMs to handle many map or reduce tasks in parallel. One responsibility of the TaskTracker is to constantly communicate with the JobTracker. If the JobTracker fails to receive a heartbeat from a TaskTracker within a specified amount of time, it will assume the TaskTracker has crashed and will resubmit the corresponding tasks to other nodes in the cluster.
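The heartbeat check amounts to comparing each tracker's last report time against a timeout. The sketch below is a toy version of that logic, not Hadoop code; the 10-minute window matches the classic mapred.tasktracker.expiry.interval default, and the tracker names and timestamps are invented.

```python
HEARTBEAT_TIMEOUT = 600.0  # seconds; classic default expiry is 10 minutes

def find_dead_trackers(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Return trackers whose most recent heartbeat is older than the timeout.
    Their tasks would be resubmitted to other nodes."""
    return sorted(t for t, ts in last_heartbeat.items() if now - ts > timeout)

heartbeats = {"tracker1": 1100.0, "tracker2": 1590.0, "tracker3": 400.0}
print(find_dead_trackers(heartbeats, now=1650.0))  # ['tracker3']
```

Note that a slow or network-partitioned TaskTracker looks the same as a crashed one from the JobTracker's side; the heartbeat mechanism trades a little wasted re-execution for not having to distinguish the two cases.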
For small clusters, the SNN can reside on one of the slave nodes. On the other hand, for large clusters, separate the NameNode and the JobTracker onto two machines. The slave machines each host a DataNode and a TaskTracker, for running tasks on the same node where their data is stored.