I. System Installation and Operation
This section describes the experimental environment and the Hadoop installation and operation.
1. Experimental Environment
Processor: Intel Core i5-2500S CPU @ 2.70GHz
OS: Ubuntu 16.04 LTS 64-bit
Memory: 2 GB
Disk: 20 GB
Java environment:
java version "1.8.0_111"
Java(TM) SE Runtime Environment (build 1.8.0_111-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.111-b14, mixed mode)
JDK path: /usr/java/jdk1.8.0_111
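A quick way to confirm that the shell resolves this JDK is to print the version; the output should match the build listed above:
wup@ubuntu:~$ java -version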
2. Hadoop Installation and Operation
The Hadoop installation procedure is described step by step below.
(1) Create the hadoop-user group and the hadoop user
Create the hadoop-user group and the hadoop user:
wup@ubuntu:~$ sudo addgroup hadoop-user
wup@ubuntu:~$ sudo adduser hadoop
wup@ubuntu:~$ sudo adduser hadoop hadoop-user
Note: adduser creates the user's home directory and related files automatically, whereas useradd does not.
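For comparison, a roughly equivalent useradd invocation would have to request the home directory and shell explicitly (shown only as an illustration; this was not a step in the actual installation):
wup@ubuntu:~$ sudo useradd -m -d /home/hadoop -s /bin/bash hadoop
wup@ubuntu:~$ sudo usermod -aG hadoop-user hadoop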
(2) Configure environment variables
First switch to the hadoop user:
wup@ubuntu:~$ su hadoop
Create the Hadoop installation directory, download Hadoop and move it there, then unpack and inspect it:
hadoop@ubuntu:~$ mkdir ~/hadoop_installs; cd ~/hadoop_installs
hadoop@ubuntu:~/hadoop_installs$ tar -zxvf hadoop-2.7.3.tar.gz
Configure the environment variables: open the ~/.bash_profile file with gedit and append the following:
PATH=$PATH:$HOME/bin
export JAVA_HOME=/usr/java/jdk1.8.0_111
export HADOOP_HOME=/home/hadoop/hadoop_installs/hadoop-2.7.3
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin
export CLASSPATH=$JAVA_HOME/lib:.
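The new variables only take effect in a fresh login shell; to apply and sanity-check them immediately, something along these lines can be used:
hadoop@ubuntu:~$ source ~/.bash_profile
hadoop@ubuntu:~$ echo $JAVA_HOME
hadoop@ubuntu:~$ hadoop version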
(3) Configure passphrase-free SSH access
Generate the key pair with the following command:
hadoop@ubuntu:~$ ssh-keygen -t rsa
This generates the public-key file id_rsa.pub under ~/.ssh/; copying it to authorized_keys enables passphrase-free SSH login:
hadoop@ubuntu:~$ cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
hadoop@ubuntu:~/hadoop_installs$ ssh localhost
Welcome to Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-42-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

129 packages can be updated.
0 updates are security updates.

*** System restart required ***
Last login: Sat Oct 22 06:09:01 2016 from 127.0.0.1
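If ssh localhost still prompts for a password, sshd is usually rejecting authorized_keys because the file permissions are too loose; tightening them normally resolves it:
hadoop@ubuntu:~$ chmod 700 ~/.ssh
hadoop@ubuntu:~$ chmod 600 ~/.ssh/authorized_keys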
(4) Configure Hadoop
The Hadoop configuration files are located in /home/hadoop/hadoop_installs/hadoop-2.7.3/etc/hadoop.
core-site.xml is as follows:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop/hadoop-${user.name}</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
hdfs-site.xml is as follows:
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop/hadoop_installs/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop/hadoop_installs/dfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
mapred-site.xml is as follows:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>mapreduce.cluster.local.dir</name>
    <value>/home/hadoop/hadoop_installs/mapred/local</value>
  </property>
  <property>
    <name>mapreduce.jobtracker.system.dir</name>
    <value>/home/hadoop/hadoop_installs/mapred/system</value>
  </property>
</configuration>
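After editing the three files, the values that Hadoop actually resolves can be spot-checked with hdfs getconf; for example, with the settings above:
hadoop@ubuntu:~$ hdfs getconf -confKey fs.default.name
hdfs://localhost:9000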
(5) Format the NameNode and start HDFS and MapReduce
For convenience, append /home/hadoop/hadoop_installs/hadoop-2.7.3/bin and /home/hadoop/hadoop_installs/hadoop-2.7.3/sbin to the PATH variable in /etc/profile. Then format the NameNode and start HDFS and MapReduce with the following two commands:
hadoop@wup:~$ hadoop namenode -format
hadoop@wup:~$ /home/hadoop/hadoop_installs/hadoop-2.7.3/sbin/start-all.sh
The jps command shows whether Hadoop is running:
hadoop@ubuntu:~/hadoop_installs/hadoop-2.7.3$ jps
26384 DataNode
26754 ResourceManager
26260 NameNode
26600 SecondaryNameNode
31950 Jps
26878 NodeManager
This shows that Hadoop and HDFS have started successfully.
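Beyond jps, the state of HDFS can also be confirmed with dfsadmin (in Hadoop 2.x the NameNode web UI is likewise available at http://localhost:50070):
hadoop@ubuntu:~$ hdfs dfsadmin -report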
II. WordCount Experiment
This section describes the data used in the WordCount experiment, the commands run, and the results; the results include the file output in bash and a screenshot from the Hadoop web UI.
1. Experimental Data
The data come from a Google News corpus; three articles were merged into a single txt file.
2. Experimental Procedure
First create the input directory test on HDFS:
hadoop@ubuntu:~$ hadoop fs -mkdir test
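Note that a relative HDFS path such as test resolves to /user/hadoop/test; if that home directory does not exist yet, it may need to be created first (a precaution, not a step recorded in this run):
hadoop@ubuntu:~$ hadoop fs -mkdir -p /user/hadoop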
Then put the local files into the test directory on HDFS and list them:
hadoop@ubuntu:~/txtfile$ hadoop fs -put -f *.txt test
hadoop@ubuntu:~$ hadoop fs -ls test
Found 3 items
-rw-r--r--   1 hadoop supergroup         24 2016-10-22 21:43 test/file1.txt
-rw-r--r--   1 hadoop supergroup         24 2016-10-22 21:43 test/file2.txt
-rw-r--r--   1 hadoop supergroup     114957 2016-10-23 19:57 test/news.txt
Note: -f overwrites the destination files if they already exist.
Create a new Java project in IntelliJ IDEA and write the WordCount program shown below:
package example;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

/**
 * Created by hadoop on 10/22/16.
 */
public class WordCount {

    // Mapper: maps <LongWritable, Text> pairs to <Text, IntWritable>
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // reporter reports progress and signals that the task is alive
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                // set() converts the String token into a Text
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer: called once per key; IntWritable wraps a plain int
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        //conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        // input and output formats
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
Note: the project needs Hadoop's JARs on its classpath. In IntelliJ: File -> Project Structure -> Modules -> click the green "+" on the right and add the lib folder under the Hadoop directory.
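As an alternative to the IDE, the program can also be compiled and packaged on the command line, using the classpath reported by the hadoop command itself (a sketch assuming the source file sits at example/WordCount.java):
hadoop@ubuntu:~/src$ mkdir classes
hadoop@ubuntu:~/src$ javac -classpath `hadoop classpath` -d classes example/WordCount.java
hadoop@ubuntu:~/src$ jar -cvf WordCount.jar -C classes .
A JAR built this way has no Main-Class in its manifest, so the class name must be passed explicitly when running it: hadoop jar WordCount.jar example.WordCount test test-out.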
Export the JAR:
File -> Project Structure -> Artifacts -> green "+" -> JAR -> From modules with dependencies -> Apply
Build -> Build Artifacts
Run the JAR:
hadoop@ubuntu:~/txtfile$ hadoop jar WordCount.jar test test-out
Finally, inspect the result on HDFS and get it to the local filesystem:
hadoop@ubuntu:~$ hadoop fs -ls test-out
Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2016-10-23 00:59 test-out/_SUCCESS
-rw-r--r--   1 hadoop supergroup         40 2016-10-23 00:59 test-out/part-00000
hadoop@ubuntu:~/exp$ hadoop fs -get /user/hadoop/test-out/part-00000 .
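The counts can also be previewed directly on HDFS without copying the file locally, e.g.:
hadoop@ubuntu:~$ hadoop fs -cat test-out/part-00000 | head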
3. Experimental Results
After the JAR finishes, open http://localhost:8088 in a browser and click Node Labels in the left-hand menu; the running state is shown in Figure 1.
Figure 1. Job running status
The experiment's output as viewed in bash is shown in Figure 2.
Figure 2. Experiment output in bash
P.S. Thanks to senior classmate Zhao for the lab report template and guidance.