Two pitfalls to note up front:
1. Although Linux allows some special characters, Java does not. The hostnames of the machines in a Hadoop cluster therefore must not contain underscores, and dots are not allowed either; otherwise, running a jar program after the cluster is configured will throw an error.
2. On Ubuntu, configuring passwordless ssh login requires modifying the sshd_config file; the details are given below.
Step 1: change the hostnames of the four machines to Master, slaver1, slaver3 and slaver4, look up each machine's fixed IP with the ifconfig command, and add the IP-to-hostname mappings to the hosts file on every machine:
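A minimal sketch of the /etc/hosts entries; the IP addresses below are placeholders and must be replaced with the actual addresses reported by ifconfig on your machines:
# placeholder IPs, replace with the addresses from ifconfig
192.168.1.100   Master
192.168.1.101   slaver1
192.168.1.103   slaver3
192.168.1.104   slaver4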
Step 2: on the Master machine switch to the root account (all of the following configuration is done as root), create a java folder under /usr, unpack the jdk archive into the java folder, and configure the JDK environment in /etc/profile.
su root
cd /usr
mkdir java
cd java
tar -zxvf <jdk archive name>
Configure the JDK (append the following lines at the end of the /etc/profile file):
export JAVA_HOME=/usr/java/jdk1.8.0_77
export CLASSPATH=.:$JAVA_HOME/lib:$JAVA_HOME/jre/lib:$CLASSPATH
export PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$PATH
After saving and exiting, run the following so that the profile takes effect immediately:
source /etc/profile
java -version    # run in the terminal; if the java version is printed, the JDK is configured correctly
Step 3: unpack the Hadoop archive under /usr, create a spark folder under /usr, and unpack the scala and spark archives into the spark folder.
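A sketch of the unpacking commands for this step, assuming the downloaded archives have been copied to /usr and that their file names match the versions used later in this guide (adjust the names to the packages you actually have):
cd /usr
tar -zxvf hadoop-2.7.2.tar.gz    # archive name assumed
mkdir spark
cd spark
tar -zxvf ../scala-2.10.5.tgz    # archive name assumed
tar -zxvf ../spark-1.6.1-bin-hadoop2.6.tgz    # archive name assumed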
Configure the Hadoop, scala and spark environment variables in /etc/profile as follows.
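A sketch of these additions, assuming the unpack locations used elsewhere in this guide (/usr/hadoop-2.7.2, /usr/spark/scala-2.10.5 and /usr/spark/spark-1.6.1-bin-hadoop2.6):
export HADOOP_HOME=/usr/hadoop-2.7.2
export SCALA_HOME=/usr/spark/scala-2.10.5
export SPARK_HOME=/usr/spark/spark-1.6.1-bin-hadoop2.6
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SCALA_HOME/bin:$SPARK_HOME/bin:$PATH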
Again run the (source /etc/profile) command in the terminal so that the environment variables take effect immediately; you can then run (hadoop version) in the terminal to check the Hadoop version.
Step 4: install ssh (for passwordless login) and set up communication between Master and the other machines. Enter the following commands in the terminal in turn:
apt-get install ssh
# enter Y when prompted during the installation
service sshd stop
service ssh stop
service sshd start
service ssh start
Then modify the /etc/ssh/sshd_config file, replacing all of its previous contents with the following:
# Package generated configuration file
# See the sshd_config(5) manpage for details
# What ports, IPs and protocols we listen for
Port 22
# Use these options to restrict which interfaces/protocols sshd will bind to
#ListenAddress ::
#ListenAddress 0.0.0.0
Protocol 2
# HostKeys for protocol version 2
HostKey /etc/ssh/ssh_host_rsa_key
HostKey /etc/ssh/ssh_host_dsa_key
HostKey /etc/ssh/ssh_host_ecdsa_key
HostKey /etc/ssh/ssh_host_ed25519_key
#Privilege Separation is turned on for security
UsePrivilegeSeparation yes
# Lifetime and size of ephemeral version 1 server key
KeyRegenerationInterval 3600
ServerKeyBits 1024
# Logging
SyslogFacility AUTH
LogLevel INFO
# Authentication:
LoginGraceTime 120
#PermitRootLogin without-password
StrictModes yes
RSAAuthentication yes
PubkeyAuthentication yes
#AuthorizedKeysFile %h/.ssh/authorized_keys
# Don't read the user's ~/.rhosts and ~/.shosts files
IgnoreRhosts yes
# For this to work you will also need host keys in /etc/ssh_known_hosts
RhostsRSAAuthentication no
# similar for protocol version 2
#HostbasedAuthentication no
# Uncomment if you don't trust ~/.ssh/known_hosts for RhostsRSAAuthentication
#IgnoreUserKnownHosts yes
# To enable empty passwords, change to yes (NOT RECOMMENDED)
PermitEmptyPasswords yes
# Change to yes to enable challenge-response passwords (beware issues with
# some PAM modules and threads)
ChallengeResponseAuthentication no
# Change to no to disable tunnelled clear text passwords
PasswordAuthentication yes
# Kerberos options
#KerberosAuthentication no
#KerberosGetAFSToken no
#KerberosOrLocalPasswd yes
#KerberosTicketCleanup yes
# GSSAPI options
#GSSAPIAuthentication no
#GSSAPICleanupCredentials yes
X11Forwarding yes
X11DisplayOffset 10
PrintMotd no
PrintLastLog yes
TCPKeepAlive yes
#UseLogin no
#MaxStartups 10:30:60
#Banner /etc/issue.net
# Allow client to pass locale environment variables
AcceptEnv LANG LC_*
Subsystem sftp /usr/lib/openssh/sftp-server
# Set this to 'yes' to enable PAM authentication, account processing,
# and session processing. If this is enabled, PAM authentication will
# be allowed through the ChallengeResponseAuthentication and
# PasswordAuthentication. Depending on your PAM configuration,
# PAM authentication via ChallengeResponseAuthentication may bypass
# the setting of "PermitRootLogin without-password".
# If you just want the PAM account and session checks to run without
# PAM authentication, then enable this but set PasswordAuthentication
# and ChallengeResponseAuthentication to 'no'.
UsePAM yes
Restart ssh so that the configuration file takes effect:
service ssh restart    # the service is named ssh on Ubuntu
Check whether there are any files in the /root/.ssh directory; if there are, delete them, otherwise move on to the next step.
Run the following commands in the terminal:
cd /root/.ssh
ssh-keygen -t rsa    # press Enter three times
ssh-copy-id -i /root/.ssh/id_rsa.pub master
ssh master
After logging in to master via ssh (a password is required the first time), exit and log in again; if no password is needed the second time, passwordless ssh login has succeeded. Otherwise, delete all files under /root/.ssh and set the keys up again.
Next, communication with the other machines can be established with the following commands:
ssh-copy-id -i /root/.ssh/id_rsa.pub slaver1
ssh-copy-id -i /root/.ssh/id_rsa.pub slaver3
ssh-copy-id -i /root/.ssh/id_rsa.pub slaver4
Step 5: configure Hadoop. In the /usr/hadoop-2.7.2 directory, enter the following commands in turn:
mkdir tmp
mkdir hdfs
mkdir hdfs/name
mkdir hdfs/data
Edit the configuration file core-site.xml under hadoop-2.7.2/etc/hadoop and add the following between <configuration> and </configuration>:
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://Master:9000</value>
</property>
<property>
    <name>hadoop.tmp.dir</name>
    <value>file:/usr/hadoop-2.7.2/tmp</value>
</property>
<property>
    <name>io.file.buffer.size</name>
    <value>131702</value>
</property>
Edit the configuration file hdfs-site.xml under hadoop-2.7.2/etc/hadoop and add the following between <configuration> and </configuration>:
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>Master:50090</value>
</property>
<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/hadoop-2.7.2/hdfs/name</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/hadoop-2.7.2/hdfs/data</value>
</property>
<property>
    <name>dfs.replication</name>
    <value>3</value>
</property>
<property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
</property>
Edit the configuration file mapred-site.xml under hadoop-2.7.2/etc/hadoop (if only mapred-site.xml.template exists, copy it to mapred-site.xml first) and add the following between <configuration> and </configuration>:
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>Master:10020</value>
</property>
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>Master:19888</value>
</property>
<property>
    <name>mapreduce.jobtracker.http.address</name>
    <value>Master:50030</value>
</property>
Edit the configuration file yarn-site.xml under hadoop-2.7.2/etc/hadoop and add the following between <configuration> and </configuration>:
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
    <name>yarn.resourcemanager.address</name>
    <value>Master:8032</value>
</property>
<property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>Master:8030</value>
</property>
<property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>Master:8031</value>
</property>
<property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>Master:8033</value>
</property>
<property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>Master:8088</value>
</property>
Then edit hadoop-env.sh and yarn-env.sh in the /hadoop-2.7.2/etc/hadoop directory, adding the following line to each:
export JAVA_HOME=/usr/java/jdk1.8.0_77
Next, edit the slaves file in the /hadoop-2.7.2/etc/hadoop directory and add the hostnames of your slave servers (i.e. the other machines), as shown below.
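For the cluster in this guide, the slaves file would contain:
slaver1
slaver3
slaver4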
After all of the above has been configured successfully, send all of the configuration to each slave server:
scp -r /usr/hadoop-2.7.2 root@slaver1:/usr
scp -r /usr/java root@slaver1:/usr
scp -r /etc/profile root@slaver1:/etc/profile
scp -r /usr/hadoop-2.7.2 root@slaver3:/usr
scp -r /usr/java root@slaver3:/usr
scp -r /etc/profile root@slaver3:/etc/profile
scp -r /usr/hadoop-2.7.2 root@slaver4:/usr
scp -r /usr/java root@slaver4:/usr
scp -r /etc/profile root@slaver4:/etc/profile
Finally, run the following on the Master machine to format the namenode:
cd $HADOOP_HOME
bin/hdfs namenode -format
The Hadoop cluster can then be started with the following commands:
cd sbin
./start-all.sh
# or use the following three commands instead
./start-dfs.sh
./start-yarn.sh
./mr-jobhistory-daemon.sh start historyserver
Once the cluster is running, it can be used for the simplest word count example:
cd $HADOOP_HOME/bin
./hadoop fs -mkdir /wmh    # create a folder on hdfs
./hadoop fs -put ../etc/hadoop/yarn-env.sh /wmh    # upload a file to hdfs
# run wordcount
./hadoop jar /usr/hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /wmh/yarn-env.sh /wmh/yarn_out
Then open master:8088 in a browser to view the job information.
Step 6: configure the Spark cluster.
cd /usr/spark/spark-1.6.1-bin-hadoop2.6/conf
cp spark-env.sh.template spark-env.sh
cp slaves.template slaves
vim spark-env.sh
# add the following at the bottom
export SCALA_HOME=/usr/spark/scala-2.10.5
export JAVA_HOME=/usr/java/jdk1.8.0_77
export SPARK_WORKER_MEMORY=2g
export HADOOP_CONF_DIR=/usr/hadoop-2.7.2/etc/hadoop
vim slaves
# delete localhost from the file and add the node names as follows
slaver1
slaver3
slaver4
At this point spark is fully configured. Now send the configured spark, scala and environment variables from Master to every slave:
scp -r /usr/spark root@slaver1:/usr
scp -r /etc/profile root@slaver1:/etc/profile
ssh slaver1
source /etc/profile
exit
scp -r /usr/spark root@slaver3:/usr
scp -r /etc/profile root@slaver3:/etc/profile
ssh slaver3
source /etc/profile
exit
scp -r /usr/spark root@slaver4:/usr
scp -r /etc/profile root@slaver4:/etc/profile
ssh slaver4
source /etc/profile
exit
Start the spark cluster (its status can be viewed at master:8080):
cd /usr/spark/spark-1.6.1-bin-hadoop2.6/sbin
./start-all.sh
# enter the scala shell
cd ../bin
./spark-shell
To stop the spark cluster:
cd /usr/spark/spark-1.6.1-bin-hadoop2.6/sbin
./stop-all.sh
With the spark cluster running, the simplest word count can also be done on it:
# run spark-shell under bin to enter the scala shell
val rdd = sc.textFile("/wmh/yarn-env.sh")    // read the file from hdfs
val wordcount = rdd.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
wordcount.collect
// detailed job information can be viewed at master:4040
If you know python, you can instead run pyspark under bin to enter a python shell and do the word count with the following commands:
rdd = sc.textFile("/wmh/yarn-env.sh")
wordcount = rdd.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)
wordcount.collect()