Debugging Hadoop Source Code with IntelliJ IDEA

Debug Hadoop source code using an IDE (IntelliJ IDEA)


This is my 100th blog post ;-)

This post is for you if you want to dive into the Hadoop source code, get a feel for the implementation details that Hadoop's architectural overviews abstract away, and get your hands dirty by modifying a thing or two, maybe because you have just started your master's research on Hadoop or simply because you want to understand the control flow.
I use IntelliJ IDEA Community Edition as my IDE (yes, because I don't like Eclipse). This post should still be fairly understandable to Eclipse fans, although I won't provide the steps for Eclipse. If you are not proficient with Eclipse, please download IntelliJ IDEA from here and use it instead of posting mundane comments asking how to perform step number X in Eclipse (or NetBeans or JCreator or Java IDE #9510). Make sure you scroll down and choose the Community Edition to download. If you are on *nix, it is better to use your distro's package manager to get it.
OK, let's start:
Step #1: Download Hadoop
Downloading the latest version of Hadoop along with its source code is simple. Just type Download Hadoop in your browser's omni search bar and follow your instinct. For the lazy souls in the kingdom of Dark Room at 3AM, here is the link. There are two tarballs of interest. One is hadoop-<version>.tar.gz, which is around 60MB in size, and the other is hadoop-<version>-bin.tar.gz, which is around 33MB in size. The one with bin in the name doesn't have the source code; it contains only the binaries. So download the one without bin in the name.

Step #2: Unpack the tarball and import in IntelliJ Idea
After the download, unpack the tarball with the following command (if you are on *nix):
tar -xzf hadoop-<version>.tar.gz

Now fire up IntelliJ IDEA. If you have just installed it, you will need to accept the license agreement. You will then get to a screen like this:
Tip #1: Full resolution images
Click on any screenshot thumbnail to view the large image.

Click Import Project and choose the directory named hadoop-<version>, e.g. hadoop-1.0.4, which was created when you unpacked the tarball. An Import Project dialog will open. Then just keep clicking Next. During this, IDEA will first search for sources, then libraries, then modules, and then move on to selecting the project SDK. I recommend setting the SDK to Sun Java 6. If you don't have it on your machine and you only have OpenJDK, download it from Oracle's site here. Extract the JDK somewhere, for example /opt, and point IntelliJ IDEA there on the Select Project SDK page of the Import Project wizard. Afterwards, it will search for frameworks in use and find nothing. Here are the screenshots for all these steps, in case you get stuck somewhere.







Click Finish in the last step and you have successfully imported Hadoop into the IDE. You will then be greeted by a screen like this.


Step #3: Add the build.xml as Ant build file
Right-click the file build.xml in the left pane (Project Structure) and click the last option, which says Add as Ant build file.
To test whether all is well, click the Ant Build button in the bar on the extreme right to reveal the Ant Build dock. Then double-click the clean target to execute it. Once it has executed successfully, double-click the compile target.

If all is well, both the clean and compile targets should execute successfully. If the compile target gets stuck at Executing task: get, you probably need a non-proxied internet connection. You can still get it working over a proxy, but that is beyond the scope of this post.
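If you prefer a terminal, the same check can be sketched in shell. This is only an illustration: it assumes Apache Ant is installed and on your PATH (the steps above use the Ant integration of the IDE instead), and both the function name and the ANT override variable are made up for this example.

```shell
# Run the clean and compile targets of the build.xml found in directory $1.
# ANT is a hypothetical override hook; by default the ant on PATH is used.
ant_clean_compile() {
  # The subshell keeps the caller's working directory unchanged.
  ( cd "$1" && "${ANT:-ant}" clean compile )
}

# Possible usage, from the directory that contains the unpacked tarball:
#   ant_clean_compile hadoop-1.0.4
```

Ant runs the targets in the order given, so clean finishes before compile starts, just like double-clicking them one after the other in the Ant Build dock.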

Tip #2: Change keymap to Eclipse
But before we get into the source code, I recommend setting your keymap to Eclipse style. That can be done in File > Settings > Keymap, as shown in the screenshot below.
We do this because Eclipse is ubiquitous and most of you are familiar with Eclipse shortcuts.

Step #4: Create a debug configuration
Now we have to set up a Run/Debug configuration. In the Run menu, click Edit Configurations. Click the + sign at the top left and click Application.

What goes in the text fields of this dialog? Let's find out!
Open the file hadoop-1.0.4/bin/hadoop in a text editor. Scroll down to the end and, in the two lines that begin with exec, replace exec with echo, as shown in the screenshots below.
Change exec to echo.
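If you would rather not edit the script by hand, the same change can be sketched with sed. Treat this as a rough illustration: the function name is made up, and it assumes the only lines in bin/hadoop that start with exec are the two launch lines mentioned above, so check with grep before relying on it.

```shell
# Print a copy of the file $1 in which every line that starts with "exec "
# (after optional indentation) starts with "echo " instead.
to_echo() {
  sed 's/^\([[:space:]]*\)exec /\1echo /' "$1"
}

# Possible usage, keeping the original launcher as a backup:
#   grep -n '^[[:space:]]*exec ' bin/hadoop   # see which lines would change
#   cp bin/hadoop bin/hadoop.orig
#   to_echo bin/hadoop.orig > bin/hadoop
```

Working from a backup copy makes it easy to restore the real launcher once you have captured the command line.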

This will let us see the exact command line used to run a MapReduce job. Now open a terminal, navigate to the hadoop directory, and type this command:
bin/hadoop jar hadoop-examples-1.0.4.jar wordcount conf output
You will get a huge output. The syntax is as follows:
javaExecutablePath VMOptions mainClassFile programArguments
The output on my machine looks like this:
/opt/java/bin/java -Dproc_jar -Xmx1000m -Dhadoop.log.dir=...
...
...jsp-api-2.1.jar org.apache.hadoop.util.RunJar hadoop-examples-1.0.4.jar wordcount conf output

  • /opt/java/bin/java is my javaExecutablePath.
  • org.apache.hadoop.util.RunJar is the mainClassFile that will start Hadoop.
  • hadoop-examples-1.0.4.jar wordcount conf output is the programArguments list.
  • The huge part denoted with dots above is the VM options.
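The breakdown above can also be done mechanically. The following is a minimal sketch, not part of Hadoop: the function and variable names are made up, and it assumes a well-formed command in which -classpath and -cp are the only VM options whose value is a separate word and no option contains spaces.

```shell
# Split "java [VM options] MainClass [program arguments]" into its parts.
# Results are left in java_bin, vm_opts, main_class and program_args.
split_java_cmd() {
  java_bin=$1
  shift
  vm_opts=
  while [ $# -gt 0 ]; do
    case $1 in
      -classpath|-cp)
        # These options take the next word as their value.
        vm_opts="${vm_opts:+$vm_opts }$1 $2"
        shift 2
        ;;
      -*)
        vm_opts="${vm_opts:+$vm_opts }$1"
        shift
        ;;
      *)
        # First word that is not an option: the main class.
        break
        ;;
    esac
  done
  main_class=${1:-}
  [ $# -gt 0 ] && shift
  program_args=$*
}

# Possible usage with the echo-modified script (unquoted on purpose, so the
# echoed line is split into words):
#   split_java_cmd $(bin/hadoop jar hadoop-examples-1.0.4.jar wordcount conf output)
#   printf 'Main class: %s\nProgram arguments: %s\nVM options: %s\n' \
#     "$main_class" "$program_args" "$vm_opts"
```

The three printed values correspond to the main class, program arguments, and VM options fields of the Run/Debug configuration dialog.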
So, fill in the text fields in the debug configuration dialog accordingly. In the Before Launch section, add the Ant targets clean and compile as shown in the screenshot. In the Use Classpath of Module field, select hadoop-1.0.4. The screenshot below shows my configuration.
Click OK. Now let's test our configuration. Click the Debug Hadoop button in the toolbar as shown in the screenshot.
If all goes well, you will get the expected output in the Console tab of the bottom dock as shown in the screenshot.

Next, let us see how to set breakpoints and step through the code.

Step #5: Add breakpoints in source code
Press Ctrl+Shift+R and type RunJar. Select RunJar.java from the dropdown list and press Enter. RunJar is the main class that the hadoop jar command launches in Hadoop 1.0.4.
The source for RunJar.java will open. Press Ctrl+O, type main, and press Enter. You will jump to the main method. At the first line of code in the main method, click in the gutter to add a breakpoint on that line. See the screenshot below: click at the location where a red circle is shown. That is the gutter area. For you, the red circle will appear after clicking.
Now that you have added a breakpoint, click the Debug button in the toolbar. After the clean and compile targets have executed, program execution will begin and stop at the line where you added the breakpoint. From there, you can step into, step over, and step out of the code from the Run menu or with the F5, F6, and F7 keys.
Now you are free to modify the Hadoop code and test your changes.

Once you have done this and spent some time on it, you will find that you can't follow the JobTracker's or the TaskTracker's execution. This is because they are separate processes that run in different JVMs. In the next blog post I will cover how to debug the JobTracker and TaskTracker.