Wordcount program in Hadoop

wolfyouto

Published 2024-06-09 18:23:16

Article link: https://blog.csdn.net/wolfyouto/article/details/139564285

Lab 05 - Wordcount program in Hadoop
Full name:
Student ID:

Tasks:
1. Open Eclipse inside the Cloudera platform
Note: If Eclipse is not installed inside your Cloudera VM, please install it.
2. Create a Java project and a java file:
a. File > New > Java Project
b. Give the project a name, e.g. “WordCount”
c. Hit “Next” and in the next page, click on “Libraries” tab
d. Click on “Add External JARs” button
e. Navigate to “File System > usr > lib > hadoop”
f. Select all *.jar files and click “OK”
g. Click on “Add External JARs” button again
h. Navigate to “File System > usr > lib > hadoop > client”
i. Select all *.jar files (Ctrl+A) and click “OK”
j. Wait a moment for all the JAR files to load. Once they appear in the list
of libraries, click the “Finish” button
k. In the “Package Explorer”, under the project that you created above
(WordCount), right-click src and select “New” > “Class”
l. In the Java Class window, give the class a meaningful name (e.g.
WordCount) and then click the “Finish” button. This creates a Java file with
the given name (e.g. WordCount.java)
m. Delete the content of the newly created file, then copy the source code
from the link below or from the Resources section at the bottom and paste it
into the file.
i. Source code link:
https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Source_Code

n. Check the code and make sure there are no errors before you save it.
o. Right-click the project name “WordCount” in the “Package Explorer”
panel and select “Export”.
p. Expand “Java” in the list, select “JAR file”, then click “Next”
q. In the next window, click the “Browse…” button and navigate to a specific
folder, e.g. /home/cloudera
r. Give the exported JAR file a name (e.g. WordCount.jar) and click “OK”
s. Click the “Finish” button
t. Check that the JAR file is available inside the “cloudera” folder
3. Open terminal
4. Create a simple text file (use any of the methods you learned in previous
labs)
a. Use the command: cat > /home/cloudera/Processfile.txt
b. Type some lines and include some duplicated words
c. To save and close the file opened by cat, press Ctrl+D on an empty line.
(Ctrl+Z only suspends cat; it does not close the file.)
d. Check that the file was created and that the content you typed was saved:
cat /home/cloudera/Processfile.txt
5. Check hdfs files and directory list: hdfs dfs -ls /
6. Create a folder inside the root of hdfs: hdfs dfs -mkdir /inputfolder
7. Check the folder is created
8. Copy Processfile.txt from local storage into inputfolder inside hdfs
a. Command: hdfs dfs -put /home/cloudera/Processfile.txt /inputfolder/
b. Check the content of file: hdfs dfs -cat /inputfolder/Processfile.txt
9. Run the WordCount job with the hadoop command
a. Command: hadoop jar /home/cloudera/WordCount.jar WordCount /inputfolder/Processfile.txt /outputfolder
b. Hit enter and wait for the job to finish
10. Read the messages that Hadoop prints during the execution, try to
understand them, and write your findings in your report
11. After the execution of the program is finished, check the folder outputfolder inside
hdfs to see if there is any output file: hdfs dfs -ls /outputfolder
12. Open the file with a name similar to “part-r-00000”, or any file that was created and
whose size is greater than zero, e.g. hdfs dfs -cat /outputfolder/part-r-00000
13. Check the word counts and cross-check them against the content of Processfile.txt to
validate that the counting is correct.
14. Run the same wordcount program with another text file of your choice.
Avoid a large file for this practice, so that the job does not overload your VM
and take a long time to complete. Use a new output folder name (or remove the
old one first with hdfs dfs -rm -r /outputfolder), because the job fails if the
output folder already exists.
15. Check the source code used in task 2.m and write your understanding of the code in
your report.
16. Save your lab report as a PDF using the filename format below and submit it
via the submission link for this lab, available on MyTIMeS.
Filename format: Name_ID_Lab05
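
For the cross-check in task 13, the counts can also be recomputed locally and compared with the job output. The following is a minimal sketch in plain Java with no Hadoop dependency. The class name LocalWordCount, its methods, and the sample strings are hypothetical and not part of the lab files. It assumes each output line has the default “word<TAB>count” layout, and it splits tokens with StringTokenizer as the mapper does, so “Hello”, “hello” and “hello,” count as three different words.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical helper for checking results by hand; not part of WordCount.jar.
class LocalWordCount {

    // Counts whitespace-separated tokens, the same split the mapper uses.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<String, Integer>(); // sorted keys
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            String word = itr.nextToken();
            Integer old = counts.get(word);
            counts.put(word, old == null ? 1 : old + 1);
        }
        return counts;
    }

    // Parses job output lines, assumed to look like "word<TAB>count".
    static Map<String, Integer> parseJobOutput(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String line : lines) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // skip blank or malformed lines
            }
            String word = line.substring(0, tab);
            int n = Integer.parseInt(line.substring(tab + 1).trim());
            counts.put(word, n);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample values only: replace with the text of Processfile.txt
        // and the lines printed from part-r-00000.
        String text = "hello hadoop hello world";
        List<String> jobOutput = Arrays.asList("hadoop\t1", "hello\t2", "world\t1");

        Map<String, Integer> expected = count(text);
        Map<String, Integer> actual = parseJobOutput(jobOutput);
        System.out.println("local counts: " + expected);
        System.out.println("counts match: " + expected.equals(actual));
    }
}
```

The sketch uses plain loops rather than newer language features so that it should also compile on an older JDK inside the VM. If the two maps differ, printing both side by side shows which words were counted differently.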

Resources:
Source code for WordCount from hadoop.apache.org website

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: splits each input line into tokens and emits (word, 1) per token.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();  // reused for every token

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      // StringTokenizer splits on whitespace only; case and punctuation are kept.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: receives all counts for one word and writes their sum.
  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: args[0] is the HDFS input path, args[1] the HDFS output folder.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // The reducer doubles as a combiner, pre-summing counts on the map side.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Exit code 0 if the job succeeds, 1 otherwise.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
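
As a starting point for task 15, the data flow of the job can be modelled without a cluster. The sketch below is a simplification under stated assumptions, not how Hadoop is implemented: the class MapReduceSketch and all of its method names are hypothetical, everything runs in one JVM on in-memory lists, and the combiner step is left out. It only shows the three stages the source code relies on: map emits (word, 1) pairs, the shuffle groups the pairs by word in sorted order, and reduce sums each group.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-JVM model of the WordCount data flow; uses no Hadoop classes.
class MapReduceSketch {

    // Map stage: one (word, 1) pair per whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<Map.Entry<String, Integer>>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(new SimpleEntry<String, Integer>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle stage: collect all values that share a key, with keys kept sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<String, List<Integer>>();
        for (Map.Entry<String, Integer> pair : pairs) {
            List<Integer> values = groups.get(pair.getKey());
            if (values == null) {
                values = new ArrayList<Integer>();
                groups.put(pair.getKey(), values);
            }
            values.add(pair.getValue());
        }
        return groups;
    }

    // Reduce stage: sum the values of one key.
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three stages over all input lines and returns word -> count.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<Map.Entry<String, Integer>>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<String, Integer>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample input lines, chosen only for illustration.
        System.out.println(run(Arrays.asList("to be or not", "to be")));
        // prints {be=2, not=1, or=1, to=2}
    }
}
```

In the real job the summing logic is also registered as a combiner, so partial sums may already be computed on the map side before the shuffle. Because addition gives the same total however the values are grouped, the final counts are unaffected.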