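This post is a personal record only and is not meant as a reference.
Environment setup reference: http://www.ityouknow.com/hadoop/2017/07/24/hadoop-cluster-setup.html
Word-count code reference: https://blog.csdn.net/a60782885/article/details/71308256
1. Environment setup
Three virtual machines are used in total: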
master:192.168.21.130
slave1:192.168.21.131
slave2:192.168.21.132
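1.1 First, install the virtual machines. The physical host runs Windows 10 and the VMs run CentOS 7, installed in minimal mode. After installation the network interface may need to be enabled: edit /etc/sysconfig/network-scripts/ifcfg-xxxx (ifcfg-ens33 in my case) and change ONBOOT=no to ONBOOT=yes so that the machine can get online.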
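The ONBOOT change can also be scripted. This is only a minimal sketch: the function name enable_onboot is made up for illustration, and the file name assumes the interface is ens33 as on my machines.

```shell
# Hypothetical helper: set ONBOOT=yes in an ifcfg-* file.
# It writes through a temporary file so it does not depend on `sed -i`.
enable_onboot() (
    cfg="$1"
    tmp="$(mktemp)" || exit 1
    sed 's/^ONBOOT=.*/ONBOOT=yes/' "$cfg" > "$tmp" && cat "$tmp" > "$cfg"
    status=$?
    rm -f "$tmp"
    exit "$status"
)

# Usage (as root; adjust the file name to your interface):
#   enable_onboot /etc/sysconfig/network-scripts/ifcfg-ens33
#   systemctl restart network
```

All other lines in the file are left untouched; the network service still has to be restarted (or the VM rebooted) for the change to take effect.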
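1.2 After installing the three VMs one by one, change their host names to master, slave1, and slave2. Edit /etc/sysconfig/network: on the master machine add HOSTNAME=master, and on the other two machines add HOSTNAME=slave1 and HOSTNAME=slave2 respectively. (On CentOS 7 the host name can also be set with hostnamectl set-hostname followed by the name.)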
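1.3 Edit the hosts file (/etc/hosts) on all three machines and add the following (use the IPs of your own machines):
192.168.21.130 master
192.168.21.131 slave1
192.168.21.132 slave2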
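Since the same host entries go onto all three machines, a small idempotent helper saves some typing. This is a sketch under my own assumptions: add_host_entry is a made-up name, and the IPs are the ones listed at the top of this post.

```shell
# Hypothetical helper: append "IP NAME" to a hosts file unless NAME is
# already mapped on some non-comment line.
add_host_entry() (
    file="$1"; ip="$2"; name="$3"
    if awk -v n="$name" '
            !/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }' "$file"; then
        exit 0          # already present, nothing to do
    fi
    printf '%s %s\n' "$ip" "$name" >> "$file"
)

# Usage (as root, on each of the three machines):
#   add_host_entry /etc/hosts 192.168.21.130 master
#   add_host_entry /etc/hosts 192.168.21.131 slave1
#   add_host_entry /etc/hosts 192.168.21.132 slave2
```

Running it twice does not duplicate an entry, so it is safe to re-run after adding a machine.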
1.4 Software installation, starting with the JDK.
wget http://download.oracle.com/otn-pub/java/jdk/8u161-b12/2f38c3b165be4555a1fa6e98c45e0808/jdk-8u161-linux-x64.tar.gz
tar -zxvf jdk-8u161-linux-x64.tar.gz
mv jdk1.8.0_161 jdk180161
Edit the environment variables (in /etc/profile):
export JAVA_HOME=/root/jdk180161
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
source /etc/profile
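1.5 Passwordless SSH login
The idea behind passwordless login:
Machine A should be able to log in to machine B without a password.
First generate a key pair on machine A:
ssh-keygen -t rsa
Then copy A's public key into authorized_keys on machine B, and that is all.
Here master logging in to slave1 is used as the example.
① Log in to master and run ssh-keygen -t rsa; you can press Enter at every prompt.
② Log in to slave1 and run scp root@master:~/.ssh/id_rsa.pub /root/
③ On slave1, run cat /root/id_rsa.pub >> ~/.ssh/authorized_keys
(If it fails, run chmod 600 ~/.ssh/authorized_keys)
④ On master, test with ssh slave1; if you can log in, it worked.
Then configure passwordless login between all three machines in the same way, including each machine to itself (for example, running ssh master on master should log in).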
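Steps ② and ③ have to be repeated for every pair of machines, so I find it handy to wrap the "append the key and fix the permissions" part in a function. This is a sketch only; install_pubkey is a name I made up, and it assumes the public key file has already been copied over as in step ②.

```shell
# Hypothetical helper: append a public key to HOME_DIR/.ssh/authorized_keys
# (only once) and apply the strict permissions that sshd expects.
install_pubkey() (
    pub="$1"; home_dir="$2"
    mkdir -p "$home_dir/.ssh" || exit 1
    chmod 700 "$home_dir/.ssh"
    auth="$home_dir/.ssh/authorized_keys"
    touch "$auth"
    # -x: whole-line match, -F: fixed string, so the key is not added twice.
    grep -qxF -e "$(cat "$pub")" "$auth" || cat "$pub" >> "$auth"
    chmod 600 "$auth"
)

# Usage (on slave1, after step 2 has copied the key to /root/):
#   install_pubkey /root/id_rsa.pub /root
```

If ssh-copy-id is installed, running ssh-copy-id root@slave1 on master should achieve the same result as steps ② and ③ in one go.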
1.6 Hadoop configuration
Run the following on each of the three machines.
wget http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.5/hadoop-2.7.5.tar.gz
tar -zxvf hadoop-2.7.5.tar.gz
Edit the environment variables:
export HADOOP_HOME=/root/hadoop-2.7.5
export PATH=$PATH:$HADOOP_HOME/bin
Then edit Hadoop's configuration files, which live in etc/hadoop under the Hadoop installation directory.
Four files need to be changed:
① core-site.xml
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/root/hadoop-2.7.5/tmp</value><!-- change to the tmp directory under your own Hadoop installation -->
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value><!-- "master" must match the master node's host name; change it if yours is named differently -->
    </property>
</configuration>
② hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value><!-- set according to your actual situation -->
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>/root/hadoop-2.7.5/hdfs/name</value>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>/root/hadoop-2.7.5/hdfs/data</value>
    </property>
</configuration>
③ Copy mapred-site.xml.template to mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapred.job.tracker</name>
        <value>http://master:9001</value>
    </property>
</configuration>
④ yarn-site.xml
<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
    </property>
</configuration>
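With four files edited by hand on three machines, a typo is easy to miss. The following sketch prints the value configured for a property so the files can be compared quickly. get_prop is a made-up helper, and it assumes the one-element-per-line layout used in the files above; it is not a real XML parser.

```shell
# Hypothetical helper: print the <value> that follows <name>NAME</name>
# in a Hadoop *-site.xml file with one element per line.
get_prop() {
    awk -v want="<name>$2</name>" '
        index($0, want) { hit = 1; next }
        hit && /<value>/ {
            sub(/.*<value>/, ""); sub(/<\/value>.*/, "")
            print; exit
        }
    ' "$1"
}

# Usage:
#   get_prop /root/hadoop-2.7.5/etc/hadoop/core-site.xml fs.defaultFS
#   get_prop /root/hadoop-2.7.5/etc/hadoop/hdfs-site.xml dfs.replication
```

Once Hadoop's bin directory is on the PATH, hdfs getconf -confKey fs.defaultFS should report the value Hadoop itself resolves, which is the more reliable check.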
Next come two important changes. On all three machines:
vi /root/hadoop-2.7.5/etc/hadoop/masters
Add: master
On the master host (master only):
vi /root/hadoop-2.7.5/etc/hadoop/slaves
## add
slave1
slave2
1.7 Starting Hadoop
1.7.1 Format the HDFS file system
bin/hadoop namenode -format (run from the Hadoop directory)
1.7.2 Start Hadoop
sbin/start-all.sh
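After start-all.sh returns, jps shows which Java daemons are actually running. The sketch below reports the ones that are missing. The helper name missing_daemons is my own, and the daemon lists in the usage lines are what I would expect from the configuration above (master is not listed in slaves, so it runs no DataNode), not something the original setup guide states.

```shell
# Hypothetical helper: read `jps` output on stdin and print every expected
# daemon name (given as arguments) that is not listed.
missing_daemons() {
    awk -v expected="$*" '
        { running[$2] = 1 }              # jps prints "<pid> <name>"
        END {
            n = split(expected, want, " ")
            for (i = 1; i <= n; i++)
                if (!(want[i] in running)) print want[i]
        }
    '
}

# Usage:
#   on master:  jps | missing_daemons NameNode SecondaryNameNode ResourceManager
#   on slaves:  jps | missing_daemons DataNode NodeManager
```

Empty output means every expected daemon was found; anything printed is worth looking up in the logs directory under the Hadoop installation.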
1.8 Possible problems
JAVA_HOME is not set and could not be found
vi /root/hadoop-2.7.5/etc/hadoop/hadoop-env.sh
## setting
export JAVA_HOME=/path/to/your/jdk
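The same edit to hadoop-env.sh is needed on every node, so here is a sketch that does it non-interactively. set_hadoop_java_home is a hypothetical name, and the JDK path in the usage line is the one from section 1.4.

```shell
# Hypothetical helper: point hadoop-env.sh at an explicit JDK path.
set_hadoop_java_home() (
    env_file="$1"; jdk="$2"
    tmp="$(mktemp)" || exit 1
    if grep -q '^export JAVA_HOME=' "$env_file"; then
        # Replace the existing line; "|" is the delimiter because the path contains "/".
        sed "s|^export JAVA_HOME=.*|export JAVA_HOME=$jdk|" "$env_file" > "$tmp"
    else
        # No such line yet: keep the file and append one.
        cat "$env_file" > "$tmp"
        printf 'export JAVA_HOME=%s\n' "$jdk" >> "$tmp"
    fi
    cat "$tmp" > "$env_file"
    rm -f "$tmp"
)

# Usage:
#   set_hadoop_java_home /root/hadoop-2.7.5/etc/hadoop/hadoop-env.sh /root/jdk180161
```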
1.9 Word-count program
The program is simple, so I will go through it quickly.
Create a quick-start project with Maven.
pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>cn.edu.bupt.wcy</groupId>
    <artifactId>wordcount</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>
    <name>wordcount</name>
    <url>http://maven.apache.org</url>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>
    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.1</version>
        </dependency>
    </dependencies>
</project>
Three Java classes: the mapper, the reducer, and the runner (main class):
mapper:
package cn.edu.bupt.wcy.wordcount;

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, LongWritable>.Context context)
            throws IOException, InterruptedException {
        // Split the input line on spaces and emit (word, 1) for every word.
        String[] words = StringUtils.split(value.toString(), " ");
        for (String word : words) {
            context.write(new Text(word), new LongWritable(1));
        }
    }
}
reducer:
package cn.edu.bupt.wcy.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values,
            Reducer<Text, LongWritable, Text, LongWritable>.Context context)
            throws IOException, InterruptedException {
        // Sum the counts for this word; long matches LongWritable and avoids int overflow.
        long sum = 0;
        for (LongWritable num : values) {
            sum += num.get();
        }
        context.write(key, new LongWritable(sum));
    }
}
runner:
package cn.edu.bupt.wcy.wordcount;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountRunner {
    public static void main(String[] args)
            throws IllegalArgumentException, IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        // Job.getInstance replaces the deprecated new Job(conf) constructor.
        Job job = Job.getInstance(conf);
        job.setJarByClass(WordCountRunner.class);
        job.setJobName("wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // The input path is read from args[1] and the output path from args[2].
        FileInputFormat.addInputPath(job, new Path(args[1]));
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        job.waitForCompletion(true);
    }
}
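After packaging it as a jar, run it on the cluster.
First create a directory in HDFS on the cluster:
hdfs dfs -mkdir /input_wordcount
Then put a word file into it, for example: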
hello world
I like playing basketball
hello java
...
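Run: hadoop jar WordCount.jar (the jar) WordCountRunner (the main class) /input_wordcount /output_wordcount
Note that the runner reads its paths from args[1] and args[2]. That lines up with this command only if the jar's manifest already names the main class, in which case hadoop jar passes WordCountRunner through as args[0]; presumably that is how the jar was packaged here.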
When the job finishes, check the output:
hdfs dfs -ls /output_wordcount
The result files have been generated; cat them to view the contents.
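To sanity-check the job's result, the same counting can be done locally with standard tools. This is a cross-check sketch, not part of the Hadoop job: local_wordcount and the file name words.txt are made up for illustration. It splits on spaces like the mapper and prints word, tab, count, sorted bytewise, which should match the layout of the job's output.

```shell
# Hypothetical cross-check: count space-separated words from stdin and
# print "word<TAB>count", sorted in byte order.
local_wordcount() {
    tr -s ' ' '\n' |
        grep -v '^$' |
        LC_ALL=C sort |
        uniq -c |
        awk '{ printf "%s\t%s\n", $2, $1 }'
}

# Usage (words.txt is the local copy of the uploaded word file):
#   hdfs dfs -put words.txt /input_wordcount
#   local_wordcount < words.txt > expected.txt
#   hdfs dfs -cat /output_wordcount/part-r-00000 | diff - expected.txt
```

For the three sample lines above, hello comes out with a count of 2 and every other word with 1. The part-r-00000 name is the usual one for a single reducer; check the hdfs dfs -ls listing if yours differs.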