Hadoop on Windows with Eclipse

 

http://ebiquity.umbc.edu/Tutorials/Hadoop/00%20-%20Intro.html

 

Introduction

Hadoop is a powerful framework that allows for automatic parallelization of computing tasks. Unfortunately, programming for it poses certain challenges: Hadoop programs are hard to understand and debug. One way to ease things a little is to run a simplified version of a Hadoop cluster locally on the developer's machine. This tutorial describes how to set up such a cluster on a computer running Microsoft Windows, and how to integrate this cluster with the Eclipse development environment. Eclipse is a prime environment for Java development.

How this tutorial is organized.

Since Hadoop is a very complex environment, I broke this tutorial down into several smaller steps. Each step involves setting up some aspect of the system and verifying that those actions were executed correctly. For better understanding, each step is accompanied by screenshots and a recorded video.

 

Questions, Suggestions, Comments

You can leave questions, suggestions, and comments about this tutorial here.

Prerequisites

Before we begin, make sure the following components are installed on your workstation:

This tutorial was written for and tested with Hadoop version 0.19.1; if you are using another version, some things may not work for you.

Make sure that you have exactly the same versions of the software as listed above. Hadoop will not work with versions of Java prior to 1.6, and the Eclipse plugin will not work with versions of Eclipse after 3.3.2 due to plugin API incompatibility.

 

Installing Cygwin

After you have made sure that the above prerequisites are installed, the next step is to install the Cygwin environment. Cygwin is a set of UNIX packages ported to Microsoft Windows. It is needed to run the scripts supplied with Hadoop, since they are all written for the UNIX platform.

To install the cygwin environment follow these steps:

  1. Download the Cygwin installer from here.
  2. Run the downloaded file. You will see the window shown on the screenshots below.


    Cygwin installer

    Cygwin Installer
  3. When you see the above screenshot, keep pressing the 'Next' button until you see the package selection screen shown below. Make sure that the 'openssh' package is selected; it is required for the correct functioning of the Hadoop cluster and the Eclipse plugin.


  4. After you have selected these packages, press the 'Next' button to complete the installation.

Set Environment Variables

The next step is to set up the PATH environment variable so that the Eclipse IDE can access Cygwin commands.

To set environment variables follow these steps.

  1. Find the "My Computer" icon either on the desktop or in the Start menu, right-click on it, and select the Properties item from the menu.
  2. When you see the Properties dialog box, click on the Environment Variables button as shown below.

    Click on environment variables button
  3. When the Environment Variables dialog shows up, click on the Path variable located in the System Variables box and click the Edit button.

  4. When the Edit dialog appears, append the following text to the end of the Variable Value field:

    c:/cygwin/bin;c:/cygwin/usr/bin
    Note: If you installed Cygwin in a non-standard location, correct the above value accordingly.
  5. Close all three dialog boxes by pressing the OK button of each dialog box.
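For reference, the dialog edit above is equivalent to appending the two Cygwin directories to PATH from a shell. A minimal sketch, assuming the default C:\cygwin install location (note that an export only lasts for the current shell session, while the dialog edit is persistent):

```shell
# Shell equivalent of the PATH edit above (session-only).
# Assumes Cygwin is installed in the default location C:\cygwin;
# adjust the paths if you installed it elsewhere.
export PATH="$PATH:/cygdrive/c/cygwin/bin:/cygdrive/c/cygwin/usr/bin"
echo "$PATH"
```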

Setup SSH daemon

Both the Hadoop scripts and the Eclipse plugin need passwordless SSH to operate. This section describes how to set it up in the Cygwin environment.

Configure ssh daemon

  1. Open the Cygwin command prompt.
  2. Execute the following command:
    ssh-host-config
  3. When asked if privilege separation should be used, answer no.
  4. When asked if sshd should be installed as a service, answer yes.
  5. When asked about the value of CYGWIN environment variable enter ntsec.
  6. Here is an example session of this command; note that input typed by the user is shown in pink and output from the system is shown in gray.

    Example of using ssh-host-config

Start SSH daemon

  1. Find the My Computer icon either on your desktop or in the Start menu, right-click on it and select Manage from the context menu.
  2. Open Services and Applications in the left-hand panel, then select the Services item.
  3. Find the CYGWIN sshd item in the main section and right-click on it.
  4. Select Start from the context menu.

    Start SSHD service

  5. A small window should pop up indicating the progress of the service start-up. After that window disappears, the status of the CYGWIN sshd service should change to Started.

Setup authorization keys

The Eclipse plugin and the Hadoop scripts require SSH authentication to be performed through authorization keys rather than through passwords. To enable key-based authorization you have to set up authorization keys. The following steps describe how to do it.

  1. Open the Cygwin command prompt.
  2. Execute the following command to generate keys:
    ssh-keygen
  3. When prompted for filenames and passphrases, press ENTER to accept the default values.
  4. After the command has finished generating the key, enter the following command to change into your .ssh directory:
    cd ~/.ssh
  5. Check that the keys were indeed generated by executing the following command:
    ls -l

    You should see two files, id_rsa.pub and id_rsa, with recent creation dates. These files contain the authorization keys.
  6. To register the new authorization keys, enter the following command. Note the double angle brackets; they are very important (they append rather than overwrite).
    cat id_rsa.pub >> authorized_keys
  7. Now check that the keys were set up correctly by executing the following command
    ssh localhost
    Since this is a new SSH installation, you will be warned that the authenticity of the host could not be established and asked whether you really want to connect. Answer yes and press ENTER. You should see the Cygwin prompt again, which means that you have successfully connected.
  8. Now execute the command again
    ssh localhost
    This time you should not be prompted for anything.

Setting up authorization keys
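The key-setup steps above can also be run as one non-interactive sequence. The sketch below writes into a scratch directory so it will not touch an existing ~/.ssh; in the real setup the files live in ~/.ssh:

```shell
# Non-interactive sketch of the authorization-key setup above.
# KEYDIR stands in for ~/.ssh so the sketch is safe to re-run.
KEYDIR=$(mktemp -d)
ssh-keygen -q -t rsa -N "" -f "$KEYDIR/id_rsa"         # empty passphrase, default answers
cat "$KEYDIR/id_rsa.pub" >> "$KEYDIR/authorized_keys"  # note the double angle brackets
ls -l "$KEYDIR"
```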

Download, Copy and Unpack Hadoop

The next step is to download and unpack the hadoop distribution.

 

  1. Download hadoop 0.19.1 and place it in some folder on your computer, such as C:/Java.
  2. Open Cygwin command prompt.
  3. Execute the following command
    cd .
  4. Then execute the following command to get your home directory folder shown in the Windows Explorer window.
    explorer .
  5. Open another explorer window and navigate to the folder that contains the downloaded hadoop archive.
  6. Copy the hadoop archive into your home directory folder.

Unpack Hadoop Installation

The next step is to unpack the downloaded and copied package.

To unpack the package follow these steps:

  1. Open the new cygwin window
  2. After the new Cygwin window appears, execute the following command:
    tar -xzf hadoop-0.19.1.tar.gz
    This will start the process of unpacking the Hadoop distribution. After several minutes you should see a new Cygwin prompt again, as shown on the screenshot below.

  3. When you see the new prompt, execute the following command:
    ls -l
    This command will list the contents of your home directory. You should see a newly created directory called hadoop-0.19.1.
  4. Next execute the following commands
    cd hadoop-0.19.1
    ls -l

    If you see output like the following, everything was unpacked correctly and you can go to the next step.
    total 4145
    -rw-r--r--   1 vlad None  295315 Feb 19 19:13 CHANGES.txt
    -rw-r--r--   1 vlad None   11358 Feb 19 19:13 LICENSE.txt
    -rw-r--r--   1 vlad None     101 Feb 19 19:13 NOTICE.txt
    -rw-r--r--   1 vlad None    1366 Feb 19 19:13 README.txt
    drwxr-xr-x+  2 vlad None       0 Feb 26 05:41 bin
    -rw-r--r--   1 vlad None   58440 Feb 19 19:13 build.xml
    drwxr-xr-x+  4 vlad None       0 Feb 19 19:18 c++
    drwxr-xr-x+  2 vlad None       0 Mar 10 13:46 conf
    drwxr-xr-x+ 12 vlad None       0 Feb 19 19:12 contrib
    drwxr-xr-x+  7 vlad None       0 Feb 26 05:41 docs
    -rw-r--r--   1 vlad None    6839 Feb 19 19:12 hadoop-0.19.1-ant.jar
    -rw-r--r--   1 vlad None 2384306 Feb 19 19:18 hadoop-0.19.1-core.jar
    -rw-r--r--   1 vlad None  134119 Feb 19 19:12 hadoop-0.19.1-examples.jar
    -rw-r--r--   1 vlad None 1276792 Feb 19 19:18 hadoop-0.19.1-test.jar
    -rw-r--r--   1 vlad None   52295 Feb 19 19:12 hadoop-0.19.1-tools.jar
    drwxr-xr-x+  4 vlad None       0 Feb 26 05:41 lib
    drwxr-xr-x+  3 vlad None       0 Feb 26 05:41 libhdfs
    drwxr-xr-x+  2 vlad None       0 Feb 26 05:41 librecordio
    drwxr-xr-x+  4 vlad None       0 Mar 10 13:46 logs
    drwxr-xr-x+ 15 vlad None       0 Feb 26 05:41 src
    -rwxr-xr-x   1 vlad None    1079 Mar  1 16:41 testProj.jar
    drwxr-xr-x+  8 vlad None       0 Feb 19 19:12 webapps
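If you want to rehearse the unpack-and-verify pattern before touching the real archive, the same tar flags can be exercised on a scratch tarball. The demo-0.1 name below is made up for illustration; the real archive is hadoop-0.19.1.tar.gz in your Cygwin home directory:

```shell
# Rehearsal of the unpack-and-check pattern on a scratch archive.
WORK=$(mktemp -d)
cd "$WORK"
mkdir -p demo-0.1/bin
echo "hello" > demo-0.1/README.txt
tar -czf demo-0.1.tar.gz demo-0.1   # build the scratch tarball
rm -r demo-0.1                      # remove the original directory
tar -xzf demo-0.1.tar.gz            # -x extract, -z gunzip, -f archive file
ls -l demo-0.1                      # verify the unpacked contents
```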

    Configure Hadoop

    Now that we have unpacked Hadoop we are ready to configure it.

    1. Open a new cygwin window and execute the following commands
      cd hadoop-0.19.1
      cd conf
      explorer .

      Bringing up explorer window

       

    2. As a result of the last command you will see the Explorer window for the 'conf' directory pop up. Minimize it for now or move it to the side.
    3. Launch Eclipse.
    4. Bring up the 'conf' Explorer window opened in step 2 and drag the file hadoop-site.xml to the Eclipse main window.
    5. Insert the following lines between <configuration> and </configuration> tags.
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9100</value>
      </property>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9101</value>
      </property>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>

      Editing site configuration file

       

    6. Close Eclipse, the Cygwin command window, and the Explorer window.
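After step 5 the complete hadoop-site.xml should look roughly like the sketch below (the XML declaration and the empty <configuration> element are already present in the stock file; only the three properties are new):

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9100</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9101</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```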

    Format the namenode

    The next step is to format the namenode in order to create a Hadoop Distributed File System (HDFS).

    1. Open a new cygwin window.
    2. Then execute the following commands
      cd hadoop-0.19.1
      mkdir logs
      bin/hadoop namenode -format

      Format the namenode
    3. The last command will run for some time and then produce output similar to the one shown below:

      Hadoop Distributed File System created.

      Now that the filesystem has been created you can proceed to the next step.

      Install Hadoop plugin

      The next step is to install and check the Hadoop plugin for eclipse.

      1. Open a new Cygwin window and execute the following commands:
        cd hadoop-0.19.1
        cd contrib
        cd eclipse-plugin
        explorer .
        Navigate to Hadoop eclipse plugin folder
      2. Shrink the newly popped window and move it to the right side of the screen.
      3. Open another Explorer window, either through the "My Computer" icon or by using the "Start -> Run" menu. Navigate to your Eclipse installation and open its "plugins" folder.
      4. Copy the file "hadoop-0.19.1-eclipse-plugin.jar" from the Hadoop Eclipse plugin folder to the Eclipse plugins folder, as shown on the figure below.

        Copy Hadoop Eclipse Plugin

      5. Close both explorer windows
      6. Start Eclipse
      7. Click on the open perspective icon, which is usually located in the upper-right corner of the Eclipse window, then select Other from the menu.
      8. Select Map/Reduce from the list of perspectives and press "OK" button.
      9. As a result your IDE should open a new perspective that looks similar to the image below.

        Map/Reduce perspective

      Now that we have installed and configured the Hadoop cluster and the Eclipse plugin, it is time to test the setup by running a simple project.

      Start the local hadoop cluster

      The next step is to launch your newly configured cluster.

       

      1. Close all the windows on the desktop, then open five Cygwin windows and arrange them in a fashion similar to the image below.

      2. Start the namenode in the first window by executing
        cd hadoop-0.19.1
        bin/hadoop namenode
      3. Start the secondary namenode in the second window by executing
        cd hadoop-0.19.1
        bin/hadoop secondarynamenode
      4. Start the job tracker in the third window by executing
        cd hadoop-0.19.1
        bin/hadoop jobtracker
      5. Start the datanode in the fourth window by executing
        cd hadoop-0.19.1
        bin/hadoop datanode
      6. Start the task tracker in the fifth window by executing
        cd hadoop-0.19.1
        bin/hadoop tasktracker
      7. Now you should have an operational Hadoop cluster. If everything went fine, your screen should look like the image below:

      At this point the cluster is running and you can proceed to the next step.

      Setup Hadoop Location in Eclipse

      The next step is to configure the Hadoop location in the Eclipse environment.

      1. Launch the Eclipse environment.
      2. Open the Map/Reduce perspective by clicking on the open perspective icon, selecting "Other" from the menu, and then selecting "Map/Reduce" from the list of perspectives.

      3. After you have switched to the Map/Reduce perspective, select the Map/Reduce Locations tab located at the bottom of your Eclipse environment. Then right-click on the blank space in that tab and select "New Hadoop location..." from the context menu. You should see a dialog box similar to the one shown below.

        Setting up new Map/Reduce location

      4. Fill in the following items, as shown on the figure above.
        • Location Name -- localhost
        • Map/Reduce Master
          • Host -- localhost
          • Port -- 9101
        • DFS Master
          • Check "Use M/R Master Host"
          • Port -- 9100
        • User name -- User

        Then press the Finish button.

      5. After you close the Hadoop location settings dialog, you should see a new location appear in the "Map/Reduce Locations" tab.

      6. In the Project Explorer tab on the left-hand side of the Eclipse window, find the DFS Locations item. Open it up using the "+" icon on its left; inside it you should see the localhost location reference with the blue elephant icon. Keep opening the items below it until you see something like the image below.

        Browsing HDFS location

      Now that this step is completed, you can move on to the next one.

      Upload data to HDFS

      Now we are almost ready to run our first Map/Reduce project. The only thing missing is the data. This section explains how to upload data into the Hadoop Distributed File System (HDFS).

      Upload Files To HDFS

      1. Open a new CYGWIN command window.

      2. Execute the following commands in the new CYGWIN window as shown on the image above. 

        cd hadoop-0.19.1
        bin/hadoop fs -mkdir In
        bin/hadoop fs -put *.txt In

        When the last of the above commands starts executing, you should see some activity in the other Hadoop windows, as shown on the image below.

        The result of these commands is a newly created directory in HDFS named In, which contains a set of text files that come with the Hadoop distribution.

      3. Close the Cygwin Window.

      Verify if the files were uploaded correctly

      In this section we will check whether the files were uploaded correctly.

      1. Open Eclipse Environment

      2. Open the DFS Locations folder, which is located in the Project Explorer tab of the Map/Reduce perspective.
      3. Open the localhost folder contained in the DFS Locations folder.
      4. Keep opening HDFS folders until you navigate to the newly created In directory. As shown on the image below.

        Verifying that the data was uploaded correctly

         

      5. When you get to the In directory, double-click on the file LICENSE.txt to open it.
      6. If you see something similar to the image above then the data was uploaded correctly and we can proceed to creating our first Hadoop project.

      Create and run Hadoop project

      Now we are ready to create and run our first Hadoop project.

      Creating and configuring Hadoop eclipse project.

      1. Launch Eclipse
      2. Right-click on the blank space in the Project Explorer window and select New -> Project... to create a new project.
      3. Select Map/Reduce Project from the list of project types, as shown on the image below.

      4. Press the Next button.
      5. You will see the project properties window similar to the one shown below

      6. Fill in the project name and then click on the Configure Hadoop Installation link, which is located on the right side of the project configuration window. This will bring up the project Preferences window shown on the image below.

      7. When the Preferences window shows up, enter the location of the Hadoop directory in the Hadoop Installation Directory field as shown above.
        If you are not sure of your Hadoop home directory location, refer to step 1 of this section; the Hadoop home directory is one level up from the conf directory.
      8. After you have entered the location, close the Preferences window by pressing the OK button, then close the project window with the Finish button.
      9. Now you have created your first Hadoop Eclipse project. You should see its name in the Project Explorer tab.

      Creating Map/Reduce driver class

      1. Right click on the newly created Hadoop project in the Project Explorer tab and select New -> Other from the context menu.
      2. Go to the Map/Reduce folder, select MapReduceDriver, then press the Next button, as shown on the image below.

      3. When the MapReduce Driver wizard appears, enter TestDriver in the Name field and press the Finish button. This will create the skeleton code for the MapReduce driver.

      4. Unfortunately the Hadoop plugin for Eclipse is slightly out of step with the recent Hadoop API, so we need to edit the driver code a bit.

        Find the two following lines in the source code and comment them out:
        conf.setInputPath(new Path("src"));
        conf.setOutputPath(new Path("out"));

        Enter the following code immediately below the two lines you just commented out.

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path("In"));
        FileOutputFormat.setOutputPath(conf, new Path("Out"));
        As shown on the image below 

      5. After you have changed the code, you will see the new lines marked as incorrect by Eclipse. Click on the error icon for each of these lines and select Eclipse's suggestion to import the missing class.

        You need to import the following classes: TextInputFormat, TextOutputFormat, FileInputFormat, FileOutputFormat.
      6. After the missing classes are imported we are ready to run our project.

      Running Hadoop Project

       

      1. Right-click on the TestDriver class in the Project Explorer tab and select Run As --> Run on Hadoop. This will bring up a window like the one shown below.

      2. In the window shown above, select "Choose existing hadoop location", then select localhost from the list below. After that, click the Finish button to start your project.
      3. If you see console output similar to the one shown below, congratulations, you have started the project successfully.
