The CDH distribution I was using ships with Spark 1.6, and the Spark 2 setup described on the official site appears to require Cloudera Manager (CM), which does not run on macOS. To learn Spark 2 I therefore had to configure a native Apache Hadoop pseudo-distributed environment. This post records the whole process in detail.
1. Environment: macOS Sierra 10.12.6
JDK: java version "1.8.0_131"
Maven: Apache Maven 3.5.0
Scala: Scala 2.12.2
MySQL: Server version 5.7.18 MySQL Community Server
Passwordless SSH login already configured
2. Package downloads:
apache-flume-1.7.0-bin.tar
apache-hive-2.3.0-bin.tar
apache-mahout-distribution-0.13.0.tar
hadoop-2.8.1.tar
hbase-1.3.1-bin.tar
kafka_2.12-0.11.0.0.tar
pig-0.17.0.tar
spark-2.2.0-bin-hadoop2.7.tar
sqoop-1.99.7-bin-hadoop200.tar
sqoop-1.99.7.tar
zookeeper-3.4.9.tar
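All of these are standard Apache tarballs. A quick sketch for unpacking them, assuming they sit in ~/Downloads and are gzipped tarballs (the extensions in the list above look truncated):
cd ~/Downloads
tar -xf hadoop-2.8.1.tar.gz    # tar auto-detects the compression; repeat for each package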
3. Hadoop pseudo-distributed configuration
3.1 Create a Hadoop/ directory under your home directory and move the extracted hadoop-2.8.1 into it
cd ~
mkdir Hadoop/
mv ~/Downloads/hadoop-2.8.1 ~/Hadoop/
3.2 Configure hdfs-site.xml
cd ~/Hadoop/hadoop-2.8.1/etc/hadoop
vi hdfs-site.xml
Add the following:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/Users/hwg/Hadoop/hadoop-2.8.1/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/Users/hwg/Hadoop/hadoop-2.8.1/dfs/data</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>true</value>
    <description>'true' enables HDFS permission checking, 'false' disables it</description>
  </property>
</configuration>
3.3 Edit hadoop-env.sh
cd ~/Hadoop/hadoop-2.8.1/etc/hadoop
vi hadoop-env.sh
Set the JDK path:
# The java implementation to use.
export JAVA_HOME="/Library/Java/JavaVirtualMachines/jdk1.8.0_131.jdk/Contents/Home"
3.4 Configure core-site.xml
cd ~/Hadoop/hadoop-2.8.1/etc/hadoop
vi core-site.xml
Add the following configuration:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/Users/hwg/Documents/apache/hadoop-2.8.1/hadoop_tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>hdfs://localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
3.5 Configure mapred-site.xml
cd ~/Hadoop/hadoop-2.8.1/etc/hadoop
cp mapred-site.xml.template mapred-site.xml
vi mapred-site.xml
Add the following:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
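Note that mapred.job.tracker is a legacy MRv1 (JobTracker) setting; on Hadoop 2.x, MapReduce jobs normally run on YARN instead. If you want jobs to go through the ResourceManager/NodeManager that appear in the jps output below, the standard property is:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>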
3.6 Start Hadoop
~/Hadoop/hadoop-2.8.1/bin/hdfs namenode -format
~/Hadoop/hadoop-2.8.1/sbin/start-all.sh
If there are no errors, check the running processes with jps:
13760 NodeManager
13686 ResourceManager
13591 SecondaryNameNode
13435 NameNode
13503 DataNode
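As a quick sanity check (a minimal sketch; the /user/hwg path is just an example), create a directory on HDFS, upload a file, and list it:
cd ~/Hadoop/hadoop-2.8.1
bin/hadoop fs -mkdir -p /user/hwg
bin/hadoop fs -put etc/hadoop/core-site.xml /user/hwg/
bin/hadoop fs -ls /user/hwg
The NameNode web UI should also be reachable at http://localhost:50070.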
4. Hive configuration
4.1 Metastore database configuration
I use a local MySQL instance as the Hive metastore database.
After MySQL is installed, log in: mysql -u root
Create a user: mysql> create user 'hive' identified by '123456';
Grant privileges: mysql> grant all on *.* to 'hive'@'%' identified by '123456';
mysql> grant all on *.* to 'hive'@'localhost' identified by '123456';
Create the metastore database: mysql> create database metastore;
Initialize the schema (run this after finishing 4.2-4.4 below, since schematool reads hive-site.xml and needs the MySQL driver jar): ~/Hadoop/apache-hive-2.3.0-bin/bin/schematool -dbType mysql -initSchema
4.2 Configure hive-site.xml
cd ~/Hadoop/apache-hive-2.3.0-bin/conf
cp hive-default.xml.template hive-site.xml
vi hive-site.xml
The following properties need to be changed:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/metastore?characterEncoding=UTF-8</value>
  <description>
    JDBC connect string for a JDBC metastore.
    To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
    For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
  </description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
  <description>Username to use against metastore database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>123456</value>
  <description>password to use against metastore database</description>
</property>
<property>
  <name>datanucleus.schema.autoCreateAll</name>
  <value>true</value>
  <description>Auto creates necessary schema on a startup if one doesn't exist. Set this to false, after creating it once. To enable auto create also set hive.metastore.schema.verification=false. Auto creation is not recommended for production use cases, run schematool command instead.</description>
</property>
<property>
  <name>hive.exec.local.scratchdir</name>
  <value>/tmp/hive/iotmp</value>
  <description>Local scratch space for Hive jobs</description>
</property>
<property>
  <name>hive.downloaded.resources.dir</name>
  <value>/tmp/hive/iotmp</value>
  <description>Temporary local directory for added resources in the remote file system.</description>
</property>
<property>
  <name>hive.querylog.location</name>
  <value>/Users/hwg/hive/iotmp</value>
  <description>Location of Hive run time structured log file</description>
</property>
4.3 Configure hive-env.sh
cd ~/Hadoop/apache-hive-2.3.0-bin/conf
cp hive-env.sh.template hive-env.sh
vi hive-env.sh
Add the following:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_131.jdk/Contents/Home
export HIVE_CONF_DIR=/Users/hwg/Hadoop/apache-hive-2.3.0-bin/conf
export HADOOP_HOME=/Users/hwg/Hadoop/hadoop-2.8.1
4.4 Add the MySQL Connector/J jar
cp ~/Downloads/mysql-connector-java-5.1.42/mysql-connector-java-5.1.42-bin.jar ~/Hadoop/apache-hive-2.3.0-bin/lib/
4.5 Start Hive
~/Hadoop/apache-hive-2.3.0-bin/bin/hive
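As a smoke test, you can create the student table that is queried from Spark in section 9; the column types below are my assumption, inferred from the query output shown there:
hive> create table student(name string, age int, score double);
hive> insert into table student values ('John', 20, 88.0);
hive> select * from student;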
5. HBase configuration
5.1 Configure hbase-env.sh
cd ~/Hadoop/hbase-1.3.1/conf
vi hbase-env.sh
Add the following:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_131.jdk/Contents/Home
export HBASE_CLASSPATH=/Users/hwg/Hadoop/hbase-1.3.1/conf
export HBASE_OPTS="-XX:+UseConcMarkSweepGC"
export HBASE_MANAGES_ZK=true
5.2 Configure hbase-site.xml
cd ~/Hadoop/hbase-1.3.1/conf
vi hbase-site.xml
Add the following configuration:
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000</value>
    <description>The HDFS location where the HRegion servers store their data</description>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Number of replicas for HLog and HFile files; it must not exceed the number of HDFS DataNodes. In pseudo-distributed mode there is only one DataNode, so set this to 1.</description>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
</configuration>
5.3 Start HBase
HDFS must be running first.
Then start HBase:
cd ~/Hadoop/hbase-1.3.1/
bin/start-hbase.sh
Check with jps:
32161 ResourceManager
31954 DataNode
33458 HRegionServer
33523 Jps
31876 NameNode
32052 SecondaryNameNode
32244 NodeManager
33303 HQuorumPeer
33352 HMaster
OK!
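A quick check from the HBase shell (the table and column-family names are arbitrary examples):
~/Hadoop/hbase-1.3.1/bin/hbase shell
hbase(main):001:0> create 'test', 'cf'
hbase(main):002:0> put 'test', 'row1', 'cf:a', 'value1'
hbase(main):003:0> scan 'test'
hbase(main):004:0> disable 'test'
hbase(main):005:0> drop 'test'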
6. ZooKeeper configuration (seemingly of little use in pseudo-distributed mode, since HBase above was configured to manage its own ZooKeeper)
6.1 Set the ZK_HOME environment variable
vi ~/.profile
Add the following:
export ZK_HOME="/Users/hwg/Hadoop/zookeeper-3.4.9"
export PATH=${ZK_HOME}/bin:${JAVA_HOME}/bin:${PATH}
Run source ~/.profile to apply the changes immediately.
Test:
zkServer.sh status
Output:
ZooKeeper JMX enabled by default
Using config: /Users/hwg/Hadoop/zookeeper-3.4.9/bin/../conf/zoo.cfg
grep: /Users/hwg/Hadoop/zookeeper-3.4.9/bin/../conf/zoo.cfg: No such file or directory
mkdir: : No such file or directory
grep: /Users/hwg/Hadoop/zookeeper-3.4.9/bin/../conf/zoo.cfg: No such file or directory
grep: /Users/hwg/Hadoop/zookeeper-3.4.9/bin/../conf/zoo.cfg: No such file or directory
Error contacting service. It is probably not running.
This is expected, since zoo.cfg does not exist yet; continue with the configuration.
6.2 Configure zoo.cfg
http://blog.csdn.net/u011523533/article/details/48626199
This article is very detailed, and following its steps produces a working configuration; a minimal standalone sketch is also given below.
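For a minimal standalone setup it is enough to start from the bundled sample and point dataDir somewhere sensible (the path below is my choice):
cp $ZK_HOME/conf/zoo_sample.cfg $ZK_HOME/conf/zoo.cfg
Then in zoo.cfg:
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/Users/hwg/Hadoop/zookeeper-3.4.9/data
clientPort=2181
After this, zkServer.sh start followed by zkServer.sh status should report Mode: standalone.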
7. Mahout configuration
7.1 Add environment variables
vi ~/.profile
Add the following:
export MAHOUT_HOME="/Users/hwg/Hadoop/apache-mahout-distribution-0.13.0"
export MAHOUT_CONF_DIR="/Users/hwg/Hadoop/apache-mahout-distribution-0.13.0/conf"
export PATH=${MAHOUT_HOME}/bin:${MAHOUT_CONF_DIR}:$PATH
Run source ~/.profile to apply immediately.
7.2 Run mahout
mahout
Output:
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /Users/hwg/Hadoop/hadoop-2.8.1/bin/hadoop HADOOP_CONF_DIR=/Users/hwg/Hadoop/hadoop-2.8.1/etc/hadoop
MAHOUT-JOB: /Users/hwg/Hadoop/apache-mahout-distribution-0.13.0/mahout-examples-0.13.0-job.jar.
An example program must be given as the first argument.
Valid program names are:
No problems here; the "MAHOUT_LOCAL is not set" message is expected (Mahout will run on Hadoop).
8. Flume installation
8.1 Add environment variables
vi ~/.profile
Add the following:
export FLUME_HOME="/Users/hwg/Hadoop/apache-flume-1.7.0-bin"
export PATH=${FLUME_HOME}/bin:${PATH}
Run source ~/.profile to apply immediately.
8.2 Configure flume-env.sh
cd $FLUME_HOME/conf
cp flume-env.sh.template flume-env.sh
vi flume-env.sh
Add the following:
export JAVA_HOME="/Library/Java/JavaVirtualMachines/jdk1.8.0_131.jdk/Contents/Home"
export HADOOP_HOME="/Users/hwg/Hadoop/hadoop-2.8.1"
8.3 Verify the version
flume-ng version
Here I ran into a problem:
Error: Could not find or load main class org.apache.flume.tools.GetJavaProperty
Cause: this error appears once HBase is installed, because flume-ng picks up HBase's configuration.
Solution: comment out the line
export HBASE_CLASSPATH=/Users/hwg/Hadoop/hbase-1.3.1/conf
in ~/Hadoop/hbase-1.3.1/conf/hbase-env.sh.
Verify the version again:
Output:
Flume 1.7.0
Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
Revision: 511d868555dd4d16e6ce4fedc72c2d1454546707
Compiled by bessbd on Wed Oct 12 20:51:10 CEST 2016
From source with checksum 0d21b3ffdc55a07e1d08875872c00523
OK!
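To push some data through Flume, the single-node example from the official user guide works well: a netcat source feeding a logger sink through a memory channel. Save this as example.conf:
a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.sinks.k1.type = logger

a1.channels.c1.type = memory

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start the agent with flume-ng agent --conf $FLUME_HOME/conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console, then telnet localhost 44444 and type a line; it should appear in the agent's console log.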
9. Spark installation and configuration
9.1 Add environment variables (in ~/.profile, as before):
export SPARK_HOME="/Users/hwg/Hadoop/spark-2.2.0-bin-hadoop2.7"
export PATH=${SPARK_HOME}/bin:${PATH}
9.2 Configure slaves
cd $SPARK_HOME/conf
cp slaves.template slaves
9.3 Configure spark-env.sh
cd $SPARK_HOME/conf/
cp spark-env.sh.template spark-env.sh
vi spark-env.sh
Add the following:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_131.jdk/Contents/Home
export HADOOP_HOME=/Users/hwg/Hadoop/hadoop-2.8.1
export SCALA_HOME=/Users/hwg/scala-2.12.2
export HADOOP_CONF_DIR=/Users/hwg/Hadoop/hadoop-2.8.1/etc/hadoop
export SPARK_MASTER_IP=localhost
export SPARK_WORKER_MEMORY=512M
9.4 Start Spark
Prerequisite: the Hadoop pseudo-distributed cluster is already running.
$SPARK_HOME/sbin/start-all.sh
$SPARK_HOME/bin/spark-shell
Here I hit a problem:
org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwxr-xr-x;
Solution:
hadoop fs -chmod 777 /tmp/hive
Start spark-shell again:
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/09/07 21:38:00 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/09/07 21:38:08 WARN metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://127.0.0.1:4040
Spark context available as 'sc' (master = local[*], app id = local-1504791481406).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.
However, the spark-shell log then showed another error:
... ...
NestedThrowablesStackTrace:
java.lang.reflect.InvocationTargetException
... ...
Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke the "BONECP" plugin to create a ConnectionPool gave an error : The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver.
... ...
Caused by: org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException: The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver.
... ...
Solution:
vi $HIVE_HOME/conf/hive-site.xml
Change the hive.metastore.uris property (it must point at the host running the metastore service; on this single-machine setup that is localhost):
<name>hive.metastore.uris</name>
<value>thrift://localhost:9083</value>
Copy the updated hive-site.xml to $SPARK_HOME/conf/ again:
rm $SPARK_HOME/conf/hive-site.xml
cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/
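Note: with hive.metastore.uris set, Spark connects to a standalone Hive metastore service instead of opening the JDBC connection itself, so that service needs to be running before spark-shell starts. Start it in the background with:
$HIVE_HOME/bin/hive --service metastore &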
Run $SPARK_HOME/bin/spark-shell again; the error is gone.
Test:
scala> spark.sql("select * from student").show()   (note: the student table was created in Hive earlier)
Output:
+-----+---+-----+
| name|age|score|
+-----+---+-----+
| John| 20| 88.0|
|Marry| 21| 93.0|
| Pet| 22| 78.0|
| Tom| 22| 89.0|
| Judy| 22| 90.0|
| Andy| 24| 91.0|
+-----+---+-----+
OK!
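Note that the shell above came up with master = local[*]; to exercise the standalone master and worker launched by sbin/start-all.sh, you can submit the bundled SparkPi example (the master URL assumes the default port 7077):
$SPARK_HOME/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://localhost:7077 \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.2.0.jar 10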
10. Sqoop2 installation and configuration
10.1 Configure Hadoop proxy-user access
vi $HADOOP_HOME/etc/hadoop/core-site.xml
Add the following, replacing hwg with the user that runs the Sqoop2 server (the Sqoop docs write this placeholder as $SERVER_USER):
<property>
  <name>hadoop.proxyuser.hwg.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hwg.groups</name>
  <value>*</value>
</property>
10.2 Add environment variables
vi ~/.profile
Add the following:
export SQOOP_HOME="/Users/hwg/Hadoop/sqoop-1.99.7-bin-hadoop200"
export SQOOP_SERVER_EXTRA_LIB=$SQOOP_HOME/extra   # the MySQL driver jar must be copied into this directory
10.3 Sqoop server configuration
For $SQOOP_HOME/conf/sqoop_bootstrap.properties, the default values are fine.
In $SQOOP_HOME/conf/sqoop.properties, set:
org.apache.sqoop.submission.engine.mapreduce.configuration.directory=/Users/hwg/Hadoop/hadoop-2.8.1/etc/hadoop
org.apache.sqoop.security.authentication.type=SIMPLE
org.apache.sqoop.security.authentication.handler=org.apache.sqoop.security.authentication.SimpleAuthenticationHandler
org.apache.sqoop.security.authentication.anonymous=true
10.4 Verify the installation
$SQOOP_HOME/bin/sqoop2-tool verify
Output:
Sqoop home directory: /Users/hwg/Hadoop/sqoop-1.99.7-bin-hadoop200
Sqoop tool executor:
Version: 1.99.7
Revision: 435d5e61b922a32d7bce567fe5fb1a9c0d9b1bbb
Compiled on Tue Jul 19 16:08:27 PDT 2016 by abefine
Running tool: class org.apache.sqoop.tools.tool.VerifyTool
0 [main] INFO org.apache.sqoop.core.SqoopServer - Initializing Sqoop server.
9 [main] INFO org.apache.sqoop.core.PropertiesConfigurationProvider - Starting config file poller thread
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/hwg/Hadoop/hadoop-2.8.1/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/hwg/Hadoop/apache-hive-2.3.0-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Verification was successful.
Tool class org.apache.sqoop.tools.tool.VerifyTool has finished correctly.
OK!
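With verification passing, the server can be started and inspected from the Sqoop2 shell (both scripts ship in $SQOOP_HOME/bin):
$SQOOP_HOME/bin/sqoop2-server start
$SQOOP_HOME/bin/sqoop2-shell
sqoop:000> show version --all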