Experiment: Installing a Pseudo-Distributed Hadoop System and Running the WordCount Program

I. System Installation and Operation

This part describes the system environment used for the experiment and the installation and operation of Hadoop.

1. Experimental Environment

Processor: Intel Core i5-2500S CPU @ 2.70GHz

OS: Ubuntu 16.04 LTS 64-bit

Memory: 2 GB

Disk: 20 GB

Java environment:

java version "1.8.0_111"

Java(TM) SE Runtime Environment (build 1.8.0_111-b14)

Java HotSpot(TM) 64-Bit Server VM (build 25.111-b14, mixed mode)

JDK path: /usr/java/jdk1.8.0_111

2. Hadoop Installation and Operation

The Hadoop installation steps are described in order below.

(1) Create the hadoop-user group and the hadoop user

Create the hadoop-user group and the hadoop user:

wup@ubuntu:~$ adduser hadoop

wup@ubuntu:~$ adduser hadoop hadoop-user

Note: adduser creates the user's home directory and related files automatically, whereas useradd requires you to create them yourself.
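
Note also that these commands need root privileges on a stock Ubuntu system, and the hadoop-user group itself is not created by them. A minimal sketch of the missing step (assuming sudo access is available):

wup@ubuntu:~$ sudo addgroup hadoop-user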

(2) Configure environment variables

First switch to the hadoop user:

wup@ubuntu:~$ su hadoop

Create the Hadoop installation directory, download the Hadoop archive and move it there, then extract it:

hadoop@ubuntu:~$ mkdir ~/hadoop_installs; cd ~/hadoop_installs

hadoop@ubuntu:~/hadoop_installs$ tar -zxvf hadoop-2.7.3.tar.gz

Configure the environment variables: open the ~/.bash_profile file with gedit and add the following lines:

PATH=$PATH:$HOME/bin

export JAVA_HOME=/usr/java/jdk1.8.0_111

export HADOOP_HOME=/home/hadoop/hadoop_installs/hadoop-2.7.3

export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin

export CLASSPATH=$JAVA_HOME/lib:.
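
To make the new variables take effect in the current shell, the file can be sourced and the settings checked, for example:

hadoop@ubuntu:~$ source ~/.bash_profile

hadoop@ubuntu:~$ echo $HADOOP_HOME

hadoop@ubuntu:~$ $JAVA_HOME/bin/java -version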

(3) Configure passwordless SSH access

Generate the key pair with the following command:

hadoop@ubuntu:~$ ssh-keygen -t rsa

This command generates the public key file id_rsa.pub under ~/.ssh/; copying it to authorized_keys enables passwordless SSH access:

hadoop@ubuntu:~$ cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
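
On some systems key-based login also requires restrictive permissions on ~/.ssh and authorized_keys; if the login below still asks for a password, tightening them usually helps:

hadoop@ubuntu:~$ chmod 700 ~/.ssh

hadoop@ubuntu:~$ chmod 600 ~/.ssh/authorized_keys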

hadoop@ubuntu:~/hadoop_installs$ ssh localhost

Welcome to Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-42-generic x86_64)

 

 * Documentation:  https://help.ubuntu.com

 * Management:     https://landscape.canonical.com

 * Support:        https://ubuntu.com/advantage

 

129 packages can be updated.

0 updates are security updates.

 

*** System restart required ***

Last login: Sat Oct 22 06:09:01 2016 from 127.0.0.1

(4) Configure Hadoop

The Hadoop configuration files are located under /home/hadoop/hadoop_installs/hadoop-2.7.3/etc/hadoop.

core-site.xml is as follows:

<configuration>

        <property>

                <name>hadoop.tmp.dir</name>

                <value>/tmp/hadoop/hadoop-${user.name}</value>

        </property>

        <property>

                <name>fs.default.name</name>

                <value>hdfs://localhost:9000</value>

        </property>

</configuration>

hdfs-site.xml is as follows:

<configuration>

        <property>

                <name>dfs.namenode.name.dir</name>

                <value>/home/hadoop/hadoop_installs/dfs/name</value>

        </property>

        <property>

                <name>dfs.datanode.data.dir</name>

                <value>/home/hadoop/hadoop_installs/dfs/data</value>

        </property>

        <property>

                <name>dfs.replication</name>

                <value>1</value>

        </property>

</configuration>

mapred-site.xml is as follows:

<configuration>

        <property>

                <name>mapred.job.tracker</name>

                <value>localhost:9001</value>

        </property>

        <property>

                <name>mapreduce.cluster.local.dir</name>

                <value>/home/hadoop/hadoop_installs/mapred/local</value>

        </property>

        <property>

                <name>mapreduce.jobtracker.system.dir</name>

                <value>/home/hadoop/hadoop_installs/mapred/system</value>

        </property>

</configuration>
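
In addition, if Hadoop cannot locate Java when the daemons start, JAVA_HOME can also be set explicitly in etc/hadoop/hadoop-env.sh (using the JDK path from the experimental environment above):

export JAVA_HOME=/usr/java/jdk1.8.0_111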

(5) Format the NameNode and start HDFS and MapReduce

For convenience, append the paths /home/hadoop/hadoop_installs/hadoop-2.7.3/bin and /home/hadoop/hadoop_installs/hadoop-2.7.3/sbin to the PATH variable in /etc/profile. Then format the NameNode and start HDFS and MapReduce with the following two commands:

hadoop@wup:~$ hadoop namenode -format

hadoop@wup:~$ /home/hadoop/hadoop_installs/hadoop-2.7.3/sbin/start-all.sh
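
As a side note, Hadoop 2.x also exposes the format step through the hdfs command, and the daemons can later be stopped with the matching script in the same sbin directory:

hadoop@wup:~$ hdfs namenode -format

hadoop@wup:~$ /home/hadoop/hadoop_installs/hadoop-2.7.3/sbin/stop-all.sh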

The jps command shows the running Hadoop processes:

hadoop@ubuntu:~/hadoop_installs/hadoop-2.7.3$ jps

26384 DataNode

26754 ResourceManager

26260 NameNode

26600 SecondaryNameNode

31950 Jps

26878 NodeManager

This shows that the HDFS and YARN daemons have started successfully.
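
HDFS itself can be verified with a simple listing, and the NameNode web interface should also be reachable (on Hadoop 2.x the default address is http://localhost:50070):

hadoop@ubuntu:~$ hadoop fs -ls /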

 

II. WordCount Experiment

This part describes the data used in the WordCount experiment, the commands involved, and the results; the results include the output file viewed in bash and screenshots from the Hadoop web interface.

1. Experimental Data

The data comes from a Google News corpus; three articles were merged into one txt file.

2. Experimental Procedure

First, create the input directory test on HDFS:

hadoop@ubuntu:~$ hadoop fs -mkdir test
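
Because test is a relative path it resolves to /user/hadoop/test; if that HDFS home directory does not exist yet, it can be created first, for example:

hadoop@ubuntu:~$ hadoop fs -mkdir -p /user/hadoop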

Next, put the local files into the test directory on HDFS and list them:

hadoop@ubuntu:~/txtfile$ hadoop fs -put -f *.txt test

hadoop@ubuntu:~$ hadoop fs -ls test

Found 3 items

-rw-r--r--   1 hadoop supergroup         24 2016-10-22 21:43 test/file1.txt

-rw-r--r--   1 hadoop supergroup         24 2016-10-22 21:43 test/file2.txt

-rw-r--r--   1 hadoop supergroup     114957 2016-10-23 19:57 test/news.txt

Note: the -f flag allows existing files to be overwritten.

Create a new Java project in IntelliJ IDEA and write the WordCount program as shown below:

package example;

 

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.LongWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapred.*;

 

import java.io.IOException;

import java.util.Iterator;

import java.util.StringTokenizer;

 

/**

 * Created by hadoop on 10/22/16.

 */

 

public class WordCount {

    // Mapper: maps <LongWritable, Text> input records to <Text, IntWritable> pairs

    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);

        private Text word = new Text();

        // the Reporter can be used to report progress and show that the task is still alive

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {

            String line = value.toString();

            StringTokenizer tokenizer = new StringTokenizer(line);

            while (tokenizer.hasMoreTokens()) {

                // wrap the current token in the reusable Text key

                word.set(tokenizer.nextToken());

                output.collect(word, one);

            }

        }

    }

 

    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

        // reduce() is called once per key; IntWritable is a writable wrapper around an int

        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {

            int sum = 0;

            while (values.hasNext()) {

                sum += values.next().get();

            }

            output.collect(key, new IntWritable(sum));

        }

    }

 

    public static void main(String[] args) throws Exception {

        JobConf conf = new JobConf(WordCount.class);

        conf.setJobName("wordcount");

 

        conf.setOutputKeyClass(Text.class);

        conf.setOutputValueClass(IntWritable.class);

 

        conf.setMapperClass(Map.class);

        //conf.setCombinerClass(Reduce.class);

        conf.setReducerClass(Reduce.class);

 

        //input format

        conf.setInputFormat(TextInputFormat.class);

        conf.setOutputFormat(TextOutputFormat.class);

 

        FileInputFormat.setInputPaths(conf, new Path(args[0]));

        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

 

        JobClient.runJob(conf);

    }

}

Note: the project needs the Hadoop libraries on its classpath. In IntelliJ: File->Project Structure->Modules-> click the green "+" on the right and add the lib folders from the Hadoop installation directory.

Export the jar:

File->Project Structure->Artifacts-> green "+" ->JAR->From modules with dependencies ->Apply

Build->Build Artifacts
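
As an alternative to exporting the jar from the IDE, the program can also be compiled and packaged on the command line; a sketch, assuming WordCount.java sits in the current directory and the PATH settings above are in effect:

hadoop@ubuntu:~/exp$ mkdir -p classes

hadoop@ubuntu:~/exp$ javac -classpath "$(hadoop classpath)" -d classes WordCount.java

hadoop@ubuntu:~/exp$ jar cvf WordCount.jar -C classes .

A jar built this way has no Main-Class entry in its manifest, so the main class would have to be named explicitly when running it, e.g. hadoop jar WordCount.jar example.WordCount test test-out.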

Run the jar:

hadoop@ubuntu:~/txtfile$ hadoop jar WordCount.jar test test-out

Finally, list the result on HDFS and get it to the local file system:

hadoop@ubuntu:~$ hadoop fs -ls test-out

Found 2 items

-rw-r--r--   1 hadoop supergroup          0 2016-10-23 00:59 test-out/_SUCCESS

-rw-r--r--   1 hadoop supergroup         40 2016-10-23 00:59 test-out/part-00000

hadoop@ubuntu:~/exp$ hadoop fs -get /user/hadoop/test-out/part-00000 .
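
The word counts can also be inspected directly on HDFS without copying the file, for example:

hadoop@ubuntu:~/exp$ hadoop fs -cat test-out/part-00000 | head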

3. Experimental Results

After running the jar, open http://localhost:8088 in a browser and click Node Labels on the left; the job status is shown in Figure 1.


Figure 1. Job running status

The experiment output viewed in bash is shown in Figure 2.


Figure 2. Experiment output in bash

P.S. Thanks to senior classmate Zhao for the lab report template and guidance.

