Step-by-Step Guide to Setting Up an R-Hadoop System
This is a step-by-step guide to setting up an R-Hadoop system. I have tested it both on a single computer and on a cluster of computers. Note that this process is for Mac OS X and some steps or settings might be different for Windows or Ubuntu.
Detailed instructions for installing Hadoop on Windows can be found elsewhere online. Below is a list of software used for this setup.
This process should work with Hadoop 2.2 or above and newer versions of HBase as well, but I haven't tested it yet. Homebrew is the missing package manager for Mac OS X, and it is needed to install git, pkg-config and thrift. On other operating systems, the equivalents of Homebrew are apt-get on Ubuntu and yum on CentOS. By the way, the two painful steps in this process are setting up HBase on Hadoop in cluster mode and installing rhbase. If you want a quick start, or are not going to use HBase, you do not need to install thrift, HBase or rhbase, and can therefore skip those steps.
1. Set up single-node Hadoop
If you are building a Hadoop system for the first time, you are suggested to start with stand-alone mode first, and then switch to pseudo-distributed mode and cluster (fully-distributed) mode.
1.1 Download Hadoop
Download Hadoop from http://hadoop.apache.org/releases.html#Download and then unpack it.
1.2 Set up Hadoop in standalone mode
1.2.1 Set JAVA_HOME
In file conf/hadoop-env.sh, add the line below:
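For example, on a standard Mac OS X Java installation the line might look like the following (pick the path of your own JDK if it differs):

```shell
# In conf/hadoop-env.sh — /usr/libexec/java_home is the usual way to locate the JDK on Mac OS X
export JAVA_HOME=$(/usr/libexec/java_home)
```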
1.2.2 Set up remote desktop and enable self-login
Open the “System Preferences” window and click “Sharing” (under “Internet & Wireless”). In the list of services, check “Remote Login”. For extra security, you can select the radio button “Allow access for only these users” and choose your account, which we assume is “hadoop”. After that, save the authorized keys so that you can log in to localhost without typing a password.
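A typical way to generate a key and authorize it for passwordless login is sketched below (the file names are the ssh defaults):

```shell
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa        # generate a key with an empty passphrase
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys  # authorize the key for this account
chmod 600 ~/.ssh/authorized_keys
ssh localhost                                    # should now log in without a password
```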
The above step to set up remote desktop and self-login was picked up from http://wiki.apache.org/hadoop/Running_Hadoop_On_OS_X_10.5_64-bit_%28Single-Node_Cluster%29, which provides detailed instructions on setting up Hadoop on Mac.
1.2.3 Run Hadoop
After that, run the commands below in a system console to check whether Hadoop has been installed properly in stand-alone mode.
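A quick check might look like the following (the version number in the directory name is an assumption based on this guide):

```shell
cd hadoop-1.1.2
bin/hadoop version   # prints the Hadoop version if the installation is intact
```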
After running the commands, Hadoop should print its version and usage information, indicating that it has been installed properly.
1.3 Test Hadoop
Then we test Hadoop with two examples to make sure that it works.
1.3.1 Example 1 - calculate pi
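The pi example can be run roughly as below; the example jar name depends on your Hadoop version and is an assumption here:

```shell
# 10 maps, 100 samples per map
bin/hadoop jar hadoop-examples-1.1.2.jar pi 10 100
```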
In the above code, the first argument (10) is the number of maps and the second is the number of samples per map. A more accurate value of pi can be obtained by setting a larger value for the second argument, although that would take longer to run.
1.3.2 Example 2 - word count
In this example, all files in local folder hadoop-1.1.2/conf are copied to an HDFS directory input, to be used as input for pattern searching. Of course, you can use other available text files as input.
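A sketch of the commands, again assuming the hadoop-examples jar name; the grep example searches the input files for a regular-expression pattern:

```shell
bin/hadoop fs -put conf input                             # copy local conf/ files to HDFS directory "input"
bin/hadoop jar hadoop-examples-1.1.2.jar grep input output 'dfs[a-z.]+'
bin/hadoop fs -cat output/*                               # view the matches found
```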
2. Set up Hadoop in cluster mode
If Hadoop works in standalone mode, you can then proceed to cluster (fully-distributed) mode.
2.1 Switching between different modes
You may want to keep the settings for all three modes, because you will likely need to switch between modes for trouble-shooting during HBase and RHadoop installation at later stages. Therefore, you are suggested to keep the settings for the three modes in three separate directories, conf.single, conf.pseudo and conf.cluster, and use the commands below to choose a specific setting. The same applies to HBase settings.
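One way to switch is to point conf at the active settings directory; the symlink approach below is a sketch of my own, not necessarily the author's original commands:

```shell
cd hadoop-1.1.2
mv conf conf.single         # one-time: preserve the original settings as the stand-alone set
ln -sfn conf.pseudo conf    # switch modes; use conf.single or conf.cluster as needed
```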
2.2 Set up the name node (master machine)
Configure the following three files on the master machine:
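For Hadoop 1.x, these are typically conf/core-site.xml, conf/hdfs-site.xml and conf/mapred-site.xml. A minimal cluster sketch, with “master” as an example hostname, might be:

```xml
<!-- conf/core-site.xml -->
<property><name>fs.default.name</name><value>hdfs://master:9000</value></property>
<!-- conf/hdfs-site.xml -->
<property><name>dfs.replication</name><value>2</value></property>
<!-- conf/mapred-site.xml -->
<property><name>mapred.job.tracker</name><value>master:9001</value></property>
```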
Set masters and slaves files
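For example (the hostnames here are placeholders):

```shell
echo master > conf/masters                 # host running the secondary namenode
printf 'slave1\nslave2\n' > conf/slaves    # one datanode/tasktracker host per line
```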
2.3 Set JAVA_HOME, set up remote desktop and enable self-login on all nodes
This is similar to step 1.2.2.
2.4 Copy public key
Copy the public key created on the master node to all slave nodes.
2.5 Firewall
Enable incoming connections for Java on all machines; otherwise, the slaves will not be able to receive any jobs.
2.6 Set up data nodes (slave machines)
Tar the Hadoop directory on the master node, copy it to all slaves and then untar it.
2.7 Format name node
Go to the Hadoop directory and run:
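In Hadoop 1.x, the formatting command is:

```shell
bin/hadoop namenode -format
```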
2.8 Run Hadoop
Start Hadoop:
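For Hadoop 1.x, starting everything and checking the daemons looks like:

```shell
bin/start-all.sh   # starts the HDFS and MapReduce daemons
jps                # lists running Java processes, e.g. NameNode and JobTracker on the master
```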
Monitor nodes and jobs with a browser:
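The default web interfaces for Hadoop 1.x are at the addresses below (replace master with your namenode/jobtracker host):

```shell
open http://master:50070   # NameNode: HDFS status and live data nodes
open http://master:50030   # JobTracker: MapReduce job progress
```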
Stop Hadoop and MapReduce:
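In Hadoop 1.x:

```shell
bin/stop-all.sh
```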
2.9 Test Hadoop
To test Hadoop in cluster mode, use the same code given at step 1.3.
2.10 Further information
More instructions on setting up Hadoop are available at the links below.
2.10.1 Single-node mode
2.10.2 Cluster mode
3. Set up HBase
3.1 Set up HBase
You can skip this step if you are not going to use HBase. See the links below for detailed instructions on setting up HBase on Hadoop.
I used the settings given in section 2.4 - Example Configurations at this link to set up HBase in fully distributed mode.
3.2 Switching between different modes
As with Hadoop, you are suggested to start with stand-alone mode first. After that, you can switch to pseudo-distributed or cluster mode. However, you are suggested to keep the settings for all three modes, e.g., for possible switching between modes when you install RHadoop at a later stage. See step 2.1 for details on switching between modes.
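As a hedged sketch of what a fully-distributed configuration involves (property names from the HBase documentation; hostnames are examples):

```xml
<!-- conf/hbase-site.xml -->
<property><name>hbase.rootdir</name><value>hdfs://master:9000/hbase</value></property>
<property><name>hbase.cluster.distributed</name><value>true</value></property>
<property><name>hbase.zookeeper.quorum</name><value>master</value></property>
```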
4. Install R
The version of R that I used is 3.1.0, the latest version as of May 2014. I previously set up an R-Hadoop system with R 2.15.2, so this process should work with other versions of R, at least with R 2.15.2 and above. It is also recommended to install RStudio, if it is not installed yet; although not mandatory, it makes R programming and managing R projects easier.
5. Install GCC, Homebrew, git, pkg-config and thrift
GCC, Homebrew, git, pkg-config and thrift are mandatory for installing rhbase. If you do not use HBase or rhbase, you do not need to install pkg-config or thrift.
5.1 Download and install GCC
Download GCC at https://github.com/kennethreitz/osx-gcc-installer. Without GCC, you will get the error “Make Command Not Found” when installing some R packages from source.
5.2 Install Homebrew
Homebrew is the missing package manager for Mac OS X. To install Homebrew, the current user account needs to be an administrator or be granted administrator privileges using “su”.
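The install command published on the Homebrew site at the time was along these lines; check http://brew.sh for the current one, as it changes over the years:

```shell
ruby -e "$(curl -fsSL https://raw.github.com/Homebrew/homebrew/go/install)"
```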
Refer to the Homebrew website at http://brew.sh if there are any errors at the above step.
5.3 Install git and pkg-config
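With Homebrew installed, this is simply:

```shell
brew install git
brew install pkg-config
```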
5.4 Install thrift 0.9.0
Thrift is needed for installing rhbase. If you do not use HBase, you can skip the thrift installation. Install thrift 0.9.0 instead of 0.9.1. I first installed thrift 0.9.1 (the latest version at that time) and found that it did not work well for the rhbase installation; it was then a painful process to figure out the reason, uninstall 0.9.1 and install 0.9.0. Do NOT run the command below, which installs the latest version of thrift (0.9.1 as of 9 May 2014).
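That command is:

```shell
# Do NOT run this — it installs the latest thrift rather than 0.9.0:
brew install thrift
```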
Instead, follow steps below to install thrift 0.9.0.
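At the time, Homebrew could list the formula revisions for earlier versions with the subcommand below (it has since been removed from Homebrew; see the Stack Overflow link in the next step for current alternatives):

```shell
brew versions thrift   # shows a git checkout command per available version
```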
Find the formula for thrift 0.9.0 in the above list, and install with that formula.
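The install then looks roughly like the following; the commit hash comes from the listing above and is deliberately left as a placeholder here:

```shell
cd $(brew --prefix)
git checkout <commit> Library/Formula/thrift.rb   # <commit> is the hash shown for 0.9.0
brew install thrift
```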
Then we check whether the pkg-config path is correct.
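The check is:

```shell
pkg-config --cflags thrift
```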
The above command should return -I/usr/local/Cellar/thrift/0.9.0/include/thrift or -I/usr/local/include/thrift. Note that it should end with /include/thrift instead of /include; otherwise, you will come across errors saying that some .h files cannot be found when installing rhbase. If you have any problems installing thrift 0.9.0, see details on how to install a specific version of a formula with Homebrew at http://stackoverflow.com/questions/3987683/homebrew-install-specific-version-of-formula.
5.5 More instructions
If there are problems with installing the other packages above, more instructions can be found at the links below.
Note that there are some differences between this process and the instructions from the links below. For example, on Mac OS X there is no libthrift-0.9.0.so but libthrift-0.9.0.dylib, so I have not run the command below to copy the Thrift library.
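That command, from the linked instructions, is shown below for reference (again, not needed on Mac OS X):

```shell
cp /usr/local/lib/libthrift-0.9.0.so /usr/lib/
```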
6. Environment settings
Run the code below in R to set environment variables for Hadoop.
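For example (the paths below are assumptions; point them at your own Hadoop installation):

```r
Sys.setenv(HADOOP_CMD="/Users/hadoop/hadoop-1.1.2/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/Users/hadoop/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar")
```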
Alternatively, set the corresponding variables in ~/.bashrc so that you do not need to set them every time.
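The equivalent shell settings for ~/.bashrc (same assumed paths) would be:

```shell
export HADOOP_CMD=/Users/hadoop/hadoop-1.1.2/bin/hadoop
export HADOOP_STREAMING=/Users/hadoop/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar
```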
7. Install RHadoop: rhdfs, rhbase, rmr2 and plyrmr
7.1 Install relevant R packages
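The dependency list below is taken from the rmr2 documentation of the era and may vary by version; the important point is installing into the system-wide library:

```r
install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", "functional",
                   "stringr", "plyr", "reshape2", "caTools"),
                 lib="/Library/Frameworks/R.framework/Versions/3.1/Resources/library")
```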
The RHadoop packages depend on the above packages, which should be installed for all users, instead of in a personal library. Otherwise, you may see RHadoop jobs fail with an error saying “package *** is not installed”. For example, to make sure that package functional is installed in the correct library, run the commands below; it should be in /Library/Frameworks/R.framework/Versions/3.1/Resources/library/functional, instead of /Users/YOUR_USER_ACCOUNT/Library/R/3.1/library/functional. If it is in the library under your user account, you need to reinstall it to /Library/Frameworks/R.framework/Versions/3.1/Resources/library/. If your account has no access to that directory, use an administrator account. The destination library can be set with the lib argument of function install.packages().
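One way to check is shown below (find.package is my suggestion, not necessarily the author's original command):

```r
find.package("functional")   # should print a path under /Library/Frameworks/R.framework/...
```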
In addition to the above packages, you are also suggested to install
7.2 Set environment variables HADOOP_CMD and HADOOP_STREAMING
Set the environment variables for Hadoop, if you have not done so at step 6.
7.3 Install RHadoop packages
Download the rhdfs, rhbase, rmr2 and plyrmr packages.
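Installation from the downloaded source archives might look like this; the file names and version numbers are examples only:

```r
install.packages("rhdfs_1.0.8.tar.gz", repos=NULL, type="source")
install.packages("rmr2_3.1.0.tar.gz", repos=NULL, type="source")
install.packages("plyrmr_0.2.0.tar.gz", repos=NULL, type="source")
install.packages("rhbase_1.2.0.tar.gz", repos=NULL, type="source")  # only if you use HBase
```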
7.4 Further information
If you follow the above instructions but still come across errors at this step, refer to rmr prerequisites and installation at https://github.com/RevolutionAnalytics/RHadoop/wiki/rmr#prerequisites-and-installation.
8. Run an R job on Hadoop
Below is an example of counting words in text files from HDFS folder wordcount/data. The R code is from Jeffrey Breen's presentation on Using R with Hadoop. First, we copy some text files to HDFS folder wordcount/data.
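A sketch of the job, adapted from Breen's example using the rmr2 API (details may differ from the original slides):

```r
library(rmr2)

# map: split each line into words and emit (word, 1) pairs
map <- function(k, lines) {
  words <- unlist(strsplit(lines, split = " "))
  keyval(words, 1)
}

# reduce: sum the counts for each word
reduce <- function(word, counts) {
  keyval(word, sum(counts))
}

wordcount <- function(input, output = NULL) {
  mapreduce(input = input, output = output,
            input.format = "text", map = map, reduce = reduce)
}

out <- wordcount("wordcount/data")
results <- from.dfs(out)   # retrieve the (word, frequency) pairs
```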
If you can see a list of words and their frequencies, congratulations and now you are ready to do MapReduce work with R!
9. Setting up multiple users
Now you might want to set up accounts for other users to use Hadoop. Detailed instructions can be found at Setting Up Multiple Users in Hadoop Clusters.
10. Further readings
More examples of R jobs on Hadoop with rmr2 can be found at
To learn MapReduce and Hadoop, below are some documents to read. Besides RHadoop, another way to run R jobs on Hadoop is RHIPE.
11. Contact and feedback
If you have successfully built your R-Hadoop system, could you please share your success with R users at this thread in the RDataMining group? Please also do not forget to forward this tutorial to your friends and colleagues who are interested in running R on Hadoop. If you have any comments or suggestions, or find errors in the above process, please feel free to contact Yanchang Zhao at yanchang@rdatamining.com, or post your questions to my RDataMining group on LinkedIn. Thanks.