Creating Hadoop MapReduce Job with Spring Data Apache Hadoop

This tutorial describes how we can create a Hadoop MapReduce job with Spring Data Apache Hadoop. As an example, we will analyze the text of the novel The Adventures of Sherlock Holmes and find out how many times the last name of Sherlock's loyal sidekick, Dr. Watson, is mentioned in it.

Note: This blog entry assumes that we have already installed and configured the used Apache Hadoop instance.

We can create a Hadoop MapReduce Job with Spring Data Apache Hadoop by following these steps:

  1. Get the required dependencies by using Maven.
  2. Create the mapper component.
  3. Create the reducer component.
  4. Configure the application context.
  5. Load the application context when the application starts.

These steps are explained in more detail in the following sections. We will also learn how we can run the created Hadoop job.

Getting the Required Dependencies with Maven

We can download the required dependencies with Maven by following these steps:

  1. Add the Spring milestone repository to the list of repositories.
  2. Configure the required dependencies.

Because we are using version 1.0.0.M2 of Spring Data Apache Hadoop, we have to add the Spring milestone repository to our pom.xml file. We can do this by adding the following repository declaration to the POM file:

<repositories>
    <repository>
        <id>spring-milestone</id>
        <name>Spring Maven Milestone Repository</name>
        <url>http://repo.springframework.org/milestone</url>
    </repository>
</repositories>

Our next step is to configure the required dependencies. We have to add the dependency declarations of Spring Data Apache Hadoop and Apache Hadoop Core to our POM file. We can declare these dependencies by adding the following lines to our pom.xml file:

<!-- Spring Data Apache Hadoop -->
<dependency>
    <groupId>org.springframework.data</groupId>
    <artifactId>spring-data-hadoop</artifactId>
    <version>1.0.0.M2</version>
</dependency>
<!-- Apache Hadoop Core -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.0.3</version>
</dependency>

Creating the Mapper Component

A mapper is a component that divides the original problem into smaller problems that are easier to solve. We can create a custom mapper component by extending the Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> class and overriding its map() method. The type parameters of the Mapper class are described below:

  • KEYIN describes the type of the key that is provided as an input to the mapper component.
  • VALUEIN describes the type of the value that is provided as an input to the mapper component.
  • KEYOUT describes the type of the mapper component’s output key.
  • VALUEOUT describes the type of the mapper component’s output value.

Each type parameter must implement the Writable interface. Apache Hadoop provides several implementations of this interface. A list of the existing implementations is available in the API documentation of Apache Hadoop.
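
For example, the Writable implementations used in this tutorial simply wrap ordinary Java values. The following minimal sketch demonstrates how they are created and read back; the class name WritableExamples and its main() method are only illustrative and are not part of the example application:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class WritableExamples {

    public static void main(String[] arguments) {
        // Text wraps a String, IntWritable wraps an int, and LongWritable wraps a long.
        Text word = new Text("Watson");
        IntWritable one = new IntWritable(1);
        LongWritable byteOffset = new LongWritable(0L);

        // The wrapped values can be read back with toString() and get().
        System.out.println(word.toString() + " " + one.get() + " " + byteOffset.get());
    }
}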

Our mapper processes the contents of the input file one line at a time and produces key-value pairs in which the key is a single word of the processed line and the value is always one. Our implementation of the map() method has the following steps:

  1. Split the given line into words.
  2. Iterate through each word and remove all Unicode characters that are not either letters or numbers.
  3. Create an output key-value pair by calling the write() method of the Mapper.Context class and providing the required parameters.

The source code of the WordMapper class looks as follows:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer lineTokenizer = new StringTokenizer(line);
        while (lineTokenizer.hasMoreTokens()) {
            String cleaned = removeNonLettersOrNumbers(lineTokenizer.nextToken());
            word.set(cleaned);
            context.write(word, new IntWritable(1));
        }
    }

    /**
     * Replaces all Unicode characters that are not either letters or numbers with
     * an empty string.
     * @param original  The original string.
     * @return  A string that contains only letters and numbers.
     */
    private String removeNonLettersOrNumbers(String original) {
        return original.replaceAll("[^\\p{L}\\p{N}]", "");
    }
}

Creating the Reducer Component

A reducer is a component that removes the unwanted intermediate values and passes forward only the relevant key-value pairs. We can implement our reducer by extending the Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> class and overriding its reduce() method. The type parameters of the Reducer class are described below:

  • KEYIN describes the type of the key that is provided as an input to the reducer component. The value of this type parameter must match the KEYOUT type parameter of the used mapper.
  • VALUEIN describes the type of the value that is provided as an input to the reducer component. The value of this type parameter must match the VALUEOUT type parameter of the used mapper.
  • KEYOUT describes the type of the reducer component’s output key.
  • VALUEOUT describes the type of the reducer component’s output value.

Our reducer processes each key-value pair produced by our mapper and creates a key-value pair that contains the answer to our question. We can implement the reduce() method by following these steps:

  1. Verify that the input key contains the wanted word.
  2. If the key contains the wanted word, count how many times the word was found.
  3. Create a new output key-value pair by calling the write() method of the Reducer.Context class and providing the required parameters.

The source code of the WordReducer class looks as follows:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    protected static final String TARGET_WORD = "Watson";

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        if (containsTargetWord(key)) {
            int wordCount = 0;
            for (IntWritable value : values) {
                wordCount += value.get();
            }
            context.write(key, new IntWritable(wordCount));
        }
    }

    private boolean containsTargetWord(Text key) {
        return key.toString().equals(TARGET_WORD);
    }
}

Configuring the Application Context

Because Spring Data Apache Hadoop 1.0.0.M2 does not support Java configuration, we have to configure the application context of our application by using XML. We can do this by following these steps:

  1. Create a properties file that contains the values of configuration properties.
  2. Configure a property placeholder that fetches the values of configuration properties from the created property file.
  3. Configure Apache Hadoop.
  4. Configure the executed Hadoop job.
  5. Configure the job runner that runs the created Hadoop job.

Creating the Properties File

Our properties file contains the values of our configuration parameters. We can create this file by following these steps:

  1. Specify the value of the fs.default.name property. The value of this property must match the configuration of our Apache Hadoop instance.
  2. Specify the value of the input.path property.
  3. Add the value of the output.path property to the properties file.

The contents of the application.properties file look as follows:

fs.default.name=hdfs://localhost:9000

input.path=/input/
output.path=/output/

Configuring the Property Placeholder

We can configure the needed property placeholder by adding the following element to the applicationContext.xml file:

<context:property-placeholder location="classpath:application.properties" />

Configuring Apache Hadoop

We can use the configuration namespace element for providing configuration parameters to Apache Hadoop. In order to execute our job by using our Apache Hadoop instance, we have to configure the default file system. We can configure the default file system by adding the following element to the applicationContext.xml file:

<hdp:configuration>
    fs.default.name=${fs.default.name}
</hdp:configuration>

Configuring the Hadoop Job

We can configure our Hadoop job by following these steps:

  1. Configure the input path that contains the input files of the job.
  2. Configure the output path of the job.
  3. Configure the name of the mapper class.
  4. Configure the name of the reducer class.

Note: If the configured output path exists, the execution of the Hadoop job fails. This is a safety mechanism that ensures that the results of a MapReduce job cannot be overwritten accidentally.

We have to add the following job declaration to our application context configuration file:

<hdp:job id="wordCountJob"
            input-path="${input.path}"
            output-path="${output.path}"
            mapper="net.petrikainulainen.spring.data.apachehadoop.WordMapper"
            reducer="net.petrikainulainen.spring.data.apachehadoop.WordReducer"/>

Configuring the Job Runner

The job runner is responsible for executing the jobs after the application context has been loaded. We can configure our job runner by following these steps:

  1. Declare the job runner bean.
  2. Configure the executed jobs.

The declaration of our job runner bean looks as follows:

<bean class="org.springframework.data.hadoop.mapreduce.JobRunner">
    <property name="jobs">
        <list>
            <!-- Configures the reference to the actual Hadoop job. -->
            <ref bean="wordCountJob"/>
        </list>
    </property>
</bean>
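
For reference, the complete applicationContext.xml could look roughly as follows. This is a minimal sketch: the namespace and schema location declarations are assumptions based on the standard Spring and Spring Data Apache Hadoop XML schemas, so they should be checked against the used versions.

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:context="http://www.springframework.org/schema/context"
       xmlns:hdp="http://www.springframework.org/schema/hadoop"
       xsi:schemaLocation="
           http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
           http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd
           http://www.springframework.org/schema/hadoop http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">

    <!-- Fetches the configuration property values from the classpath. -->
    <context:property-placeholder location="classpath:application.properties"/>

    <!-- Configures the default file system of Apache Hadoop. -->
    <hdp:configuration>
        fs.default.name=${fs.default.name}
    </hdp:configuration>

    <!-- Configures the executed Hadoop job. -->
    <hdp:job id="wordCountJob"
             input-path="${input.path}"
             output-path="${output.path}"
             mapper="net.petrikainulainen.spring.data.apachehadoop.WordMapper"
             reducer="net.petrikainulainen.spring.data.apachehadoop.WordReducer"/>

    <!-- Runs the configured job after the application context has been loaded. -->
    <bean class="org.springframework.data.hadoop.mapreduce.JobRunner">
        <property name="jobs">
            <list>
                <ref bean="wordCountJob"/>
            </list>
        </property>
    </bean>
</beans>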

Loading the Application Context When the Application Starts

We can execute the created Hadoop job by loading the application context when our application is started. We can do this by creating a new ClassPathXmlApplicationContext object and providing the name of our application context configuration file as a constructor parameter. The source code of our Main class looks as follows:

import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class Main {
    public static void main(String[] arguments) {
        ApplicationContext ctx = new ClassPathXmlApplicationContext("applicationContext.xml");
    }
}
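
As a small optional variation that is not part of the original example, we can also register a shutdown hook so that the application context is closed cleanly when the JVM exits:

import org.springframework.context.ConfigurableApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class Main {
    public static void main(String[] arguments) {
        // Loading the application context triggers the configured job runner.
        ConfigurableApplicationContext ctx =
                new ClassPathXmlApplicationContext("applicationContext.xml");
        // Closes the application context when the JVM shuts down.
        ctx.registerShutdownHook();
    }
}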

Running the MapReduce Job

We have now learned how we can create a Hadoop MapReduce job with Spring Data Apache Hadoop. Our next step is to execute the created job. The first thing we have to do is to download The Adventures of Sherlock Holmes. We must download the plain text version of this novel manually since the website of Project Gutenberg is blocking download utilities such as wget.

After we have downloaded the input file, we are ready to run our MapReduce job. We can run the created job by starting our Apache Hadoop instance in pseudo-distributed mode and following these steps:

  1. Upload our input file to HDFS.
  2. Run our MapReduce job.
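
If the used Apache Hadoop instance is not already running, we can start its daemons first. Assuming a standard Apache Hadoop 1.x installation whose bin directory is on the PATH, we can do this by running the following commands at the command prompt:

start-dfs.sh
start-mapred.sh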

Uploading the Input File to HDFS

Our next step is to upload our input file to HDFS. We can do this by running the following command at the command prompt:

hadoop dfs -put pg1661.txt /input/pg1661.txt

We can check that everything went fine by running the following command at the command prompt:

hadoop dfs -ls /input

If the file was uploaded successfully, we should see the following directory listing:

Found 1 items
-rw-r--r--   1 xxxx supergroup     594933 2012-08-05 12:07 /input/pg1661.txt

Running Our MapReduce Job

We have two alternative methods for running our MapReduce job:

  • We can execute the main() method of the Main class from our IDE.
  • We can build a binary distribution of our example project by running the command mvn assembly:assembly at the command prompt. This creates a zip package in the target directory. We can run the created MapReduce job by unzipping this package and using the provided startup scripts (see the sketch below).
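
A minimal sketch of the maven-assembly-plugin declaration could look as follows. The descriptor file name assembly.xml is only an illustrative placeholder, and the actual configuration of the example project may differ:

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-assembly-plugin</artifactId>
    <version>2.3</version>
    <configuration>
        <descriptors>
            <!-- assembly.xml is a hypothetical assembly descriptor that packages the
                 application jar, its dependencies, and the startup scripts into a zip file. -->
            <descriptor>assembly.xml</descriptor>
        </descriptors>
    </configuration>
</plugin>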

Note: If you are not familiar with the Maven assembly plugin, you might want to read my blog entry that describes how you can create a runnable binary distribution with the Maven assembly plugin.

The outcome of our MapReduce job does not depend on the method that is used to run it. The output of our job should be written to the configured output directory of HDFS.

Note: If the execution of our MapReduce job fails because the output directory exists, we can delete the output directory by running the following command at the command prompt:

hadoop dfs -rmr /output

We can check the output of our job by running the following command at the command prompt:

hadoop dfs -ls /output

This command lists the files found in the /output directory of HDFS. If everything went fine, we should see a directory listing similar to the following:

Found 2 items
-rw-r--r--   3 xxxx supergroup          0 2012-08-05 12:31 /output/_SUCCESS
-rw-r--r--   3 xxxx supergroup         10 2012-08-05 12:31 /output/part-r-00000

Now we will finally find out the answer to our question. We can get the answer by running the following command at the command prompt:

hadoop dfs -cat /output/part-r-00000

If everything went fine, we should see the following output:

Watson  81

We now know that the last name of Dr. Watson is mentioned 81 times in the novel The Adventures of Sherlock Holmes.

What is Next?

My next blog entry about Apache Hadoop describes how we can create a streaming MapReduce job by using Hadoop Streaming and Spring Data Apache Hadoop.

PS. The fully functional example application described in this blog entry is available on GitHub.
