Wordcount program in Hadoop


Lab 05 - Wordcount program in Hadoop 
Full name: 
Student ID: 
 
Tasks: 
1. Open Eclipse inside the Cloudera platform 
Note: If Eclipse is not installed inside your Cloudera VM, please install it. 
2. Create a Java project and a Java file: 
a. File > New > Java Project 
b. Give the project a name, e.g. “WordCount” 
c. Click “Next” and, on the next page, open the “Libraries” tab 
d. Click the “Add External JARs” button 
e. Navigate to “File System>usr>lib>hadoop” 
f. Select all *.jar files and click “OK” 
g. Click the “Add External JARs” button again 
h. Navigate to “File System>usr>lib>hadoop>client” 
i. Select all *.jar files (Ctrl+A) and click “OK” 
j. Wait a moment for all the JAR files to load. Once they appear in the list 
of libraries, click the “Finish” button 
k. In the “Package Explorer”, under the project that you created above 
(WordCount), right-click “src” and select New > Class 

 
l. In the Java Class window, give the class a meaningful name (e.g. 
WordCount) and click the “Finish” button. This creates a Java file with 
the given name (e.g. WordCount.java) 
m. Delete the content of the newly created file, then copy the source code 
from the link below or from the Resources section at the bottom and paste it 
into the file. 
i. Source code link: 
https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Source_Code

n. Check the code and make sure there are no errors before you save it. 
o. Right-click the project name “WordCount” in the “Package Explorer” 
panel and select “Export”. 
p. Expand “Java” in the list and select “JAR file”, then click “Next” 
q. In the next window, click the “Browse…” button and navigate to a specific 
folder, e.g. /home/cloudera 
r. Enter a name for the exported JAR file (e.g. WordCount.jar) and click “OK” 
s. Click the “Finish” button 
t. Check that the JAR file is present in the “cloudera” folder 
3. Open a terminal 
4. Create a simple text file (use any of the methods you learned in previous 
labs), for example: 
a. Use the command: cat > /home/cloudera/Processfile.txt 
b. Type a few lines and repeat some words so that there are duplicates to count 

 
c. To save and close the file opened by cat, press Ctrl+D on an empty line 
(Ctrl+Z only suspends cat; it does not close the file) 
d. Check that the file was created and that the typed content was 
saved: cat /home/cloudera/Processfile.txt 
5. List the files and directories in the HDFS root: hdfs dfs -ls / 
6. Create a folder in the HDFS root: hdfs dfs -mkdir /inputfolder 
7. Check that the folder was created 
8. Copy Processfile.txt from local storage into inputfolder in HDFS 
a. Command: hdfs dfs -put /home/cloudera/Processfile.txt /inputfolder/ 
b. Check the content of the file: hdfs dfs -cat /inputfolder/Processfile.txt 
9. Run the WordCount job with the hadoop command 
a. Command (type it on a single line): 
hadoop jar /home/cloudera/WordCount.jar WordCount /inputfolder/Processfile.txt /outputfolder 
b. Press Enter and wait for the job to finish 
10. Read the messages that Hadoop prints during the execution, 
try to understand them, and write your findings in your report 
11. After the program has finished executing, check outputfolder in 
HDFS to see whether it contains any output files: hdfs dfs -ls /outputfolder 
12. Open the file named “part-r-00000”, or any other file that was created with a 
size greater than zero 
13. Check the word counts against the content of Processfile.txt to 
validate that the counting is correct. 

 
14. Run the same WordCount program with another text file of your choice. 
Avoid a large file for this exercise, because processing it inside your VM 
would take a long time to complete. Use a new output folder name, because 
the job fails if the output folder already exists. 
15. Review the source code used in task 2.m and describe your understanding of it in 
your report. 
16. Save your lab document as a PDF with the filename format below and 
submit your report via the submission link for this lab on MyTIMeS. 
Filename format: Name_ID_Lab05 
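
The cross-check in task 13 can also be done with a small local program instead of counting by hand. The following is only a sketch and is not part of the Hadoop tutorial: the class name LocalWordCount and its methods are hypothetical helpers written for this lab. It splits the text with StringTokenizer, the same way TokenizerMapper does, and assumes the job output uses the default "word<TAB>count" line format.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical helper for this lab; it does not use the Hadoop API.
class LocalWordCount {

    // Splits the text on whitespace and counts how often each token occurs.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            String word = itr.nextToken();
            Integer old = counts.get(word);
            counts.put(word, old == null ? 1 : old + 1);
        }
        return counts;
    }

    // Parses lines of the assumed form "word<TAB>count"; empty lines are skipped.
    static Map<String, Integer> parseJobOutput(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            int tab = line.indexOf('\t');
            if (tab < 0) {
                throw new IllegalArgumentException("No tab in line: " + line);
            }
            counts.put(line.substring(0, tab),
                       Integer.parseInt(line.substring(tab + 1).trim()));
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("Usage: java LocalWordCount <text file> [local copy of part-r-00000]");
            return;
        }
        String text = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        Map<String, Integer> local = count(text);
        if (args.length > 1) {
            // Compare the local counts with the counts written by the job.
            Map<String, Integer> job =
                parseJobOutput(Files.readAllLines(Paths.get(args[1]), StandardCharsets.UTF_8));
            System.out.println(local.equals(job) ? "Counts match" : "Counts differ");
        } else {
            // No job output given: print the local counts in "word<TAB>count" form.
            for (Map.Entry<String, Integer> e : local.entrySet()) {
                System.out.println(e.getKey() + "\t" + e.getValue());
            }
        }
    }
}
```

To try it, copy the job output to local storage first, for example with hdfs dfs -get /outputfolder/part-r-00000 /home/cloudera/, then compile with javac LocalWordCount.java and run java LocalWordCount /home/cloudera/Processfile.txt /home/cloudera/part-r-00000.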
 
Resources: 
Source code for WordCount from hadoop.apache.org website 
  
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: splits each input value into tokens and emits (word, 1) per token.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as combiner): sums all counts received for one word.
  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: args[0] is the input path, args[1] is the output path.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
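
The driver above registers IntSumReducer both as the combiner and as the reducer. For task 15 it may help to see why that is safe for a sum: adding partial sums gives the same total as adding the individual ones. The following in-memory sketch illustrates this without the Hadoop API. The class CombinerSketch, its methods, and the two sample strings standing in for input splits are all hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical in-memory illustration; it does not use the Hadoop API.
class CombinerSketch {

    // "Map" step for one split: emit a 1 for every token, grouped by word.
    static Map<String, List<Integer>> mapSplit(String split) {
        Map<String, List<Integer>> emitted = new TreeMap<>();
        StringTokenizer itr = new StringTokenizer(split);
        while (itr.hasMoreTokens()) {
            String word = itr.nextToken();
            List<Integer> ones = emitted.get(word);
            if (ones == null) {
                ones = new ArrayList<>();
                emitted.put(word, ones);
            }
            ones.add(1);
        }
        return emitted;
    }

    // Same arithmetic as IntSumReducer.reduce: add up all values of one key.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // Runs map, an optional per-split combine, a grouping by key, and the reduce.
    static Map<String, Integer> run(List<String> splits, boolean useCombiner) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String split : splits) {
            for (Map.Entry<String, List<Integer>> e : mapSplit(split).entrySet()) {
                List<Integer> bucket = shuffled.get(e.getKey());
                if (bucket == null) {
                    bucket = new ArrayList<>();
                    shuffled.put(e.getKey(), bucket);
                }
                if (useCombiner) {
                    bucket.add(sum(e.getValue()));   // one partial sum per split
                } else {
                    bucket.addAll(e.getValue());     // every single 1 is passed on
                }
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffled.entrySet()) {
            result.put(e.getKey(), sum(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("big data big", "data big cluster");
        System.out.println(run(splits, false));
        System.out.println(run(splits, true));
    }
}
```

Both lines printed by main should show the same counts. With the combiner enabled, fewer values are passed from the map side to the reduce side, which is the reason the tutorial code sets a combiner at all.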

 
