Step-by-Step Guide to Setting Up an R-Hadoop System
This is a step-by-step guide to setting up an R-Hadoop system. I have tested it both on a single computer and on a cluster of computers. Note that this process is for Mac OS X and some steps or settings might be different for Windows or Ubuntu.
Detailed instructions for installing Hadoop on Windows can be found elsewhere online. Below is a list of software used for this setup.
This process should work with Hadoop 2.2 or above and newer versions of HBase as well, but I haven't tested it yet. Homebrew is the missing package manager for Mac OS X, and it is needed to install git, pkg-config and thrift. On other operating systems, the equivalents of Homebrew are apt-get on Ubuntu and yum on CentOS. By the way, the two painful steps in this process are setting up HBase on Hadoop in cluster mode and installing rhbase. If you want a quick start, or are not going to use HBase, you do not need to install thrift, HBase or rhbase, and can therefore skip those steps.
1. Set up single-node Hadoop
If you are building a Hadoop system for the first time, you are suggested to start with stand-alone mode first, and then switch to pseudo-distributed mode and cluster (fully-distributed) mode.
1.1 Download Hadoop
Download Hadoop from http://hadoop.apache.org/releases.html#Download and then unpack it.
1.2 Set up Hadoop in standalone mode
1.2.1 Set JAVA_HOME
In file conf/hadoop-env.sh, add the line below:
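For example, on a standard Mac OS X Java installation the line might look like the following (pick the path of your own JDK if it differs):

```shell
# In conf/hadoop-env.sh — /usr/libexec/java_home is the usual way to locate the JDK on Mac OS X
export JAVA_HOME=$(/usr/libexec/java_home)
```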
1.2.2 Set up remote desktop and enable self-login
Open the “System Preferences” window and click “Sharing” (under “Internet & Wireless”). In the list of services, check “Remote Login”. For extra security, you can select the radio button “Allow access for only these users” and choose your account, which we assume is “hadoop”. After that, save the authorized keys so that you can log in to localhost without typing a password.
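A typical way to generate a key and authorize it for passwordless login is sketched below (the file names are the ssh defaults):

```shell
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa        # generate a key with an empty passphrase
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys  # authorize the key for this account
chmod 600 ~/.ssh/authorized_keys
ssh localhost                                    # should now log in without a password
```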
The above step to set up remote desktop and self-login was picked up from http://wiki.apache.org/hadoop/Running_Hadoop_On_OS_X_10.5_64-bit_%28Single-Node_Cluster%29, which provides detailed instructions on setting up Hadoop on Mac.
1.2.3 Run Hadoop
After that, run the commands below in a system console to check whether Hadoop has been installed properly in stand-alone mode.
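A quick check might look like the following (the version number in the directory name is an assumption based on this guide):

```shell
cd hadoop-1.1.2
bin/hadoop version   # prints the Hadoop version if the installation is intact
```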
After running the commands, Hadoop should print its version and usage information, indicating that it has been installed properly.
1.3 Test Hadoop
Then we test Hadoop with two examples to make sure that it works.
1.3.1 Example 1 - calculate pi
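The pi example can be run roughly as below; the example jar name depends on your Hadoop version and is an assumption here:

```shell
# 10 maps, 100 samples per map
bin/hadoop jar hadoop-examples-1.1.2.jar pi 10 100
```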
In the above code, the first argument (10) is the number of maps and the second is the number of samples per map. A more accurate value of pi can be obtained by setting a larger value for the second argument, although that would take longer to run.
1.3.2 Example 2 - word count
In this example, all files in local folder hadoop-1.1.2/conf are copied to an HDFS directory input, to be used as input for pattern searching. Of course, you can use other available text files as input.
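A sketch of the commands, again assuming the hadoop-examples jar name; the grep example searches the input files for a regular-expression pattern:

```shell
bin/hadoop fs -put conf input                             # copy local conf/ files to HDFS directory "input"
bin/hadoop jar hadoop-examples-1.1.2.jar grep input output 'dfs[a-z.]+'
bin/hadoop fs -cat output/*                               # view the matches found
```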
2. Set up Hadoop in cluster mode
If Hadoop works in standalone mode, you can then proceed to cluster (fully-distributed) mode.
2.1 Switching between different modes
You may want to keep the settings for all three modes, because you will likely need to switch between modes for trouble-shooting during HBase and RHadoop installation at later stages. Therefore, you are suggested to keep the settings for the three modes in three separate directories, conf.single, conf.pseudo and conf.cluster, and use the commands below to choose a specific setting. The same applies to HBase settings.
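One way to switch is to point conf at the active settings directory; the symlink approach below is a sketch of my own, not necessarily the author's original commands:

```shell
cd hadoop-1.1.2
mv conf conf.single         # one-time: preserve the original settings as the stand-alone set
ln -sfn conf.pseudo conf    # switch modes; use conf.single or conf.cluster as needed
```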
2.2 Set up the name node (master machine)
Configure the following three files on the master machine:
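For Hadoop 1.x, these are typically conf/core-site.xml, conf/hdfs-site.xml and conf/mapred-site.xml. A minimal cluster sketch, with “master” as an example hostname, might be:

```xml
<!-- conf/core-site.xml -->
<property><name>fs.default.name</name><value>hdfs://master:9000</value></property>
<!-- conf/hdfs-site.xml -->
<property><name>dfs.replication</name><value>2</value></property>
<!-- conf/mapred-site.xml -->
<property><name>mapred.job.tracker</name><value>master:9001</value></property>
```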
Set masters and slaves files
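For example (the hostnames here are placeholders):

```shell
echo master > conf/masters                 # host running the secondary namenode
printf 'slave1\nslave2\n' > conf/slaves    # one datanode/tasktracker host per line
```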
2.3 Set JAVA_HOME, set up remote desktop and enable self-login on all nodes
This is similar to step 1.2.2.
2.4 Copy public key
Copy the public key created on the master node to all slave nodes.
2.5 Firewall
Enable incoming connections for Java on all machines; otherwise, the slaves will not be able to receive any jobs.
2.6 Set up data nodes (slave machines)
Tar the Hadoop directory on the master node, copy it to all slaves and then untar it.
2.7 Format name node
Go to the Hadoop directory and run:
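In Hadoop 1.x, the formatting command is:

```shell
bin/hadoop namenode -format
```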
2.8 Run Hadoop
Start Hadoop:
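For Hadoop 1.x, starting everything and checking the daemons looks like:

```shell
bin/start-all.sh   # starts the HDFS and MapReduce daemons
jps                # lists running Java processes, e.g. NameNode and JobTracker on the master
```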
Monitor nodes and jobs with a browser:
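The default web interfaces for Hadoop 1.x are at the addresses below (replace master with your namenode/jobtracker host):

```shell
open http://master:50070   # NameNode: HDFS status and live data nodes
open http://master:50030   # JobTracker: MapReduce job progress
```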
Stop Hadoop and MapReduce:
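In Hadoop 1.x:

```shell
bin/stop-all.sh
```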
2.9 Test Hadoop
To test Hadoop in cluster mode, use the same code given at step 1.3.
2.10 Further information
More instructions on setting up Hadoop are available at the links below.
2.10.1 Single-node mode
2.10.2 Cluster mode
3. Set up HBase
3.1 Set up HBase
You can skip this step if you are not going to use HBase. See the links below for detailed instructions on setting up HBase on Hadoop.
I used the settings given in section 2.4 - Example Configurations at this link to set up HBase in fully distributed mode.
3.2 Switching between different modes
As with Hadoop, you are suggested to start with stand-alone mode first. After that, you can switch to pseudo-distributed or cluster mode. However, you are suggested to keep the settings for all three modes, e.g., for possible switching between modes when you install RHadoop at a later stage. See step 2.1 for details on switching between modes.
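As a hedged sketch of what a fully-distributed configuration involves (property names from the HBase documentation; hostnames are examples):

```xml
<!-- conf/hbase-site.xml -->
<property><name>hbase.rootdir</name><value>hdfs://master:9000/hbase</value></property>
<property><name>hbase.cluster.distributed</name><value>true</value></property>
<property><name>hbase.zookeeper.quorum</name><value>master</value></property>
```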
4. Install R
The version of R that I used is 3.1.0, the latest version as of May 2014. I previously set up an R-Hadoop system with R 2.15.2, so this process should work with other versions of R, at least with R 2.15.2 and above. It is also recommended to install RStudio, if it is not installed yet; although not mandatory, it makes R programming and managing R projects easier.
5. Install GCC, Homebrew, git, pkg-config and thrift
GCC, Homebrew, git, pkg-config and thrift are mandatory for installing rhbase. If you do not use HBase or rhbase, you do not need to install pkg-config or thrift.
5.1 Download and install GCC
Download GCC at https://github.com/kennethreitz/osx-gcc-installer. Without GCC, you will get the error “Make Command Not Found” when installing some R packages from source.
5.2 Install Homebrew
Homebrew is the missing package manager for Mac OS X. To install Homebrew, the current user account needs to be an administrator or be granted administrator privileges using “su”.
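The install command published on the Homebrew site at the time was along these lines; check http://brew.sh for the current one, as it changes over the years:

```shell
ruby -e "$(curl -fsSL https://raw.github.com/Homebrew/homebrew/go/install)"
```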
Refer to the Homebrew website at http://brew.sh if there are any errors at the above step.
5.3 Install git and pkg-config
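With Homebrew installed, this is simply:

```shell
brew install git
brew install pkg-config
```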
5.4 Install thrift 0.9.0
Thrift is needed for installing rhbase. If you do not use HBase, you can skip the thrift installation. Install thrift 0.9.0 instead of 0.9.1. I first installed thrift 0.9.1 (the latest version at that time) and found that it did not work well for the rhbase installation; it was then a painful process to figure out the reason, uninstall 0.9.1 and install 0.9.0. Do NOT run the command below, which installs the latest version of thrift (0.9.1 as of 9 May 2014).
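That command is:

```shell
# Do NOT run this — it installs the latest thrift rather than 0.9.0:
brew install thrift
```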
Instead, follow steps below to install thrift 0.9.0.
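At the time, Homebrew could list the formula revisions for earlier versions with the subcommand below (it has since been removed from Homebrew; see the Stack Overflow link in the next step for current alternatives):

```shell
brew versions thrift   # shows a git checkout command per available version
```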
Find the formula for thrift 0.9.0 in the above list, and install with that formula.
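The install then looks roughly like the following; the commit hash comes from the listing above and is deliberately left as a placeholder here:

```shell
cd $(brew --prefix)
git checkout <commit> Library/Formula/thrift.rb   # <commit> is the hash shown for 0.9.0
brew install thrift
```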
Then we check whether the pkg-config path is correct.
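The check is:

```shell
pkg-config --cflags thrift
```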
The above command should return -I/usr/local/Cellar/thrift/0.9.0/include/thrift or -I/usr/local/include/thrift. Note that it should end with /include/thrift instead of /include; otherwise, you will come across errors saying that some .h files cannot be found when installing rhbase. If you have any problems installing thrift 0.9.0, see details on how to install a specific version of a formula with Homebrew at http://stackoverflow.com/questions/3987683/homebrew-install-specific-version-of-formula.
5.5 More instructions
If there are problems with installing the other packages above, more instructions can be found at the links below.
Note that there are some differences between this process and the instructions from the links below. For example, on Mac OS X there is no libthrift-0.9.0.so but libthrift-0.9.0.dylib, so I have not run the command below to copy the Thrift library.
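That command, from the linked instructions, is shown below for reference (again, not needed on Mac OS X):

```shell
cp /usr/local/lib/libthrift-0.9.0.so /usr/lib/
```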
6. Environment settings
Run the code below in R to set environment variables for Hadoop.
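For example (the paths below are assumptions; point them at your own Hadoop installation):

```r
Sys.setenv(HADOOP_CMD="/Users/hadoop/hadoop-1.1.2/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/Users/hadoop/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar")
```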
Alternatively, set the corresponding variables in ~/.bashrc so that you do not need to set them every time.
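The equivalent shell settings for ~/.bashrc (same assumed paths) would be:

```shell
export HADOOP_CMD=/Users/hadoop/hadoop-1.1.2/bin/hadoop
export HADOOP_STREAMING=/Users/hadoop/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar
```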
7. Install RHadoop: rhdfs, rhbase, rmr2 and plyrmr
7.1 Install relevant R packages
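The dependency list below is taken from the rmr2 documentation of the era and may vary by version; the important point is installing into the system-wide library:

```r
install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", "functional",
                   "stringr", "plyr", "reshape2", "caTools"),
                 lib="/Library/Frameworks/R.framework/Versions/3.1/Resources/library")
```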
The RHadoop packages depend on the above packages, which should be installed for all users, instead of in a personal library. Otherwise, you may see RHadoop jobs fail with an error saying “package *** is not installed”. For example, to make sure that package functional is installed in the correct library, run the commands below; it should be in /Library/Frameworks/R.framework/Versions/3.1/Resources/library/functional, instead of /Users/YOUR_USER_ACCOUNT/Library/R/3.1/library/functional. If it is in the library under your user account, you need to reinstall it to /Library/Frameworks/R.framework/Versions/3.1/Resources/library/. If your account has no access to that directory, use an administrator account. The destination library can be set with the lib argument of function install.packages().
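One way to check is shown below (find.package is my suggestion, not necessarily the author's original command):

```r
find.package("functional")   # should print a path under /Library/Frameworks/R.framework/...
```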
In addition to the above packages, you are also suggested to install
7.2 Set environment variables HADOOP_CMD and HADOOP_STREAMING
Set the environment variables for Hadoop, if you have not done so at step 6.
7.3 Install RHadoop packages
Download the rhdfs, rhbase, rmr2 and plyrmr packages.
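Installation from the downloaded source archives might look like this; the file names and version numbers are examples only:

```r
install.packages("rhdfs_1.0.8.tar.gz", repos=NULL, type="source")
install.packages("rmr2_3.1.0.tar.gz", repos=NULL, type="source")
install.packages("plyrmr_0.2.0.tar.gz", repos=NULL, type="source")
install.packages("rhbase_1.2.0.tar.gz", repos=NULL, type="source")  # only if you use HBase
```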
7.4 Further information
If you follow the above instructions but still come across errors at this step, refer to rmr prerequisites and installation at https://github.com/RevolutionAnalytics/RHadoop/wiki/rmr#prerequisites-and-installation.
8. Run an R job on Hadoop
Below is an example of counting words in text files from HDFS folder wordcount/data. The R code is from Jeffrey Breen's presentation on Using R with Hadoop. First, we copy some text files to HDFS folder wordcount/data.
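A sketch of the job, adapted from Breen's example using the rmr2 API (details may differ from the original slides):

```r
library(rmr2)

# map: split each line into words and emit (word, 1) pairs
map <- function(k, lines) {
  words <- unlist(strsplit(lines, split = " "))
  keyval(words, 1)
}

# reduce: sum the counts for each word
reduce <- function(word, counts) {
  keyval(word, sum(counts))
}

wordcount <- function(input, output = NULL) {
  mapreduce(input = input, output = output,
            input.format = "text", map = map, reduce = reduce)
}

out <- wordcount("wordcount/data")
results <- from.dfs(out)   # retrieve the (word, frequency) pairs
```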
If you can see a list of words and their frequencies, congratulations and now you are ready to do MapReduce work with R!
9. Setting up multiple users
Now you might want to set up accounts for other users to use Hadoop. Detailed instructions can be found at Setting Up Multiple Users in Hadoop Clusters.
10. Further readings
More examples of R jobs on Hadoop with rmr2 can be found at
To learn MapReduce and Hadoop, below are some documents to read. Besides RHadoop, another way to run R jobs on Hadoop is RHIPE.
11. Contact and feedback
If you have successfully built your R-Hadoop system, could you please share your success with R users at this thread in the RDataMining group? Please also do not forget to forward this tutorial to your friends and colleagues who are interested in running R on Hadoop. If you have any comments or suggestions, or find errors in the above process, please feel free to contact Yanchang Zhao at yanchang@rdatamining.com, or post your questions to my RDataMining group on LinkedIn. Thanks.