Setting up a pseudo-distributed Hadoop + Spark + MySQL + Hive learning environment on Ubuntu


Just follow the steps in order:

(1)raini@biyuzhe:~$ gedit .bashrc

#java
export JAVA_HOME=/home/raini/app/jdk1.7.0_79
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib:$CLASSPATH
export PATH=${JAVA_HOME}/bin:$JRE_HOME/bin:$PATH

#scala
export SCALA_HOME=/home/raini/app/scala-2.10.6
export PATH=${SCALA_HOME}/bin:$PATH

#spark
export SPARK_HOME=/home/raini/spark1
export PATH=$PATH:$SPARK_HOME/bin

# hadoop2.6
export HADOOP_PREFIX=/home/raini/hadoop2
export CLASSPATH=".:$JAVA_HOME/lib:$CLASSPATH"
export PATH="$JAVA_HOME/:$HADOOP_PREFIX/bin:$PATH"
export HADOOP_PREFIX PATH CLASSPATH
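After saving .bashrc, a quick sanity check (an extra step, not part of the original walkthrough) confirms the variables point at real installs:

raini@biyuzhe:~$ source .bashrc
raini@biyuzhe:~$ java -version         # should report 1.7.0_79
raini@biyuzhe:~$ echo $HADOOP_PREFIX   # should print /home/raini/hadoop2
raini@biyuzhe:~$ scala -version        # should report 2.10.6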


(2)raini@biyuzhe:~$ sudo apt-get install rsync

(3)raini@biyuzhe:~$ sudo apt-get install openssh-server

cd ~/.ssh/   # if this directory does not exist, run ssh localhost once first
ssh-keygen -t rsa   # just press Enter at every prompt
cat id_rsa.pub >> authorized_keys  # authorize the key
Run ssh localhost to check that you can now log in without a password.
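If ssh localhost still asks for a password, directory permissions are the usual culprit; a common fix (not part of the original steps) is:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys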

(4)raini@biyuzhe:~$ sudo gedit /etc/hosts

127.0.0.1    localhost
127.0.1.1    biyuzhe
#10.155.243.206  biyuzhe
# some guides say this mapping must be changed here, otherwise connection-refused errors may appear later

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

(5) Edit the configuration file etc/hadoop/hadoop-env.sh

    export JAVA_HOME=/home/raini/app/jdk
    export HADOOP_COMMON_HOME=/home/raini/hadoop

(6)raini@biyuzhe:~$ gedit .bashrc

    Add the Hadoop bin and sbin directories to PATH, e.g.:
    export PATH="/home/raini/hadoop/bin:/home/raini/hadoop/sbin:$JAVA_HOME/:$HADOOP_PREFIX/bin:$PATH"

(7) Edit etc/hadoop/core-site.xml

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/home/raini/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
  <property>
      <name>io.file.buffer.size</name>
      <value>131072</value>
    </property>
   <property>
       <name>hadoop.proxyuser.master.hosts</name>
        <value>*</value>
   </property>
   <property>
       <name>hadoop.proxyuser.master.groups</name>
       <value>*</value>
   </property>
</configuration>
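Since hadoop.tmp.dir points at a local path, it does no harm to create that directory up front (a small extra step, not in the original):

raini@biyuzhe:~$ mkdir -p /home/raini/hadoop/tmp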


(8) Edit etc/hadoop/hdfs-site.xml:

<configuration>
  <!-- Secondary NameNode HTTP address and port (optional, left commented out here)
      <property>
            <name>dfs.namenode.secondary.http-address</name>
            <value>localhost:9001</value>
     </property>-->
  <!-- Officially only fs.defaultFS and dfs.replication are required to run,
       but if hadoop.tmp.dir is not set the default temporary directory /tmp/hadoop-${user.name} is used.
       That directory may be cleared on reboot, the NameNode process then disappears and a re-format is required, hence the settings below. -->
     <property>
            <name>dfs.namenode.name.dir</name>
            <value>file:/home/raini/hadoop/tmp/dfs/namenode</value>
     </property>
     <property>
            <name>dfs.datanode.data.dir</name>
            <value>file:/home/raini/hadoop/tmp/dfs/datanode</value>
     </property>

  <!-- Number of replicas -->
     <property>
            <name>dfs.replication</name>
            <value>1</value>
     </property>
  <!-- Must be true, otherwise WebHDFS operations that list file status, such as LISTSTATUS and GETFILESTATUS, cannot be used, because that information is held by the NameNode. -->
     <property>
            <name>dfs.webhdfs.enabled</name>
            <value>true</value>
     </property>
</configuration>


(9) Edit mapred-site.xml
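In Hadoop 2.x only mapred-site.xml.template ships by default, so copy it first (assuming the standard distribution layout), then fill it in as below:

raini@biyuzhe:~/hadoop$ cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml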

<configuration>

    <property>
          <name>mapreduce.framework.name</name>
          <value>yarn</value>
    </property>
<!-- mapred.job.tracker is the old MapReduce v1 setting and is not needed under YARN:
    <property>
          <name>mapred.job.tracker</name>
          <value>localhost:9001</value>
    </property>
-->
    <property>
          <name>mapreduce.jobhistory.address</name>
          <value>localhost:10020</value>
    </property>

     <property>
          <name>mapreduce.jobhistory.webapp.address</name>
          <value>localhost:19888</value>
     </property>

</configuration>


(10) Edit yarn-site.xml

<configuration>  

<!-- Site specific YARN configuration properties-->

    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>

    <property>
      <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
<!-- ResourceManager address -->
    <property>
       <name>yarn.resourcemanager.address</name>
       <value>localhost:8032</value>
    </property>
<!-- ResourceManager scheduler address -->
    <property>
         <name>yarn.resourcemanager.scheduler.address</name>
         <value>localhost:8030</value>
    </property>

    <property>
         <name>yarn.resourcemanager.resource-tracker.address</name>
         <value>localhost:8031</value>
    </property>
<!-- ResourceManager admin address -->
     <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>localhost:8033</value>
    </property>
<!-- ResourceManager web UI port, used to monitor job resource scheduling -->
    <property>
         <name>yarn.resourcemanager.webapp.address</name>
         <value>localhost:8088</value>
    </property>

</configuration>


(11)    raini@biyuzhe:~$ source .bashrc
    
    raini@biyuzhe:~/hadoop$ sbin/start-dfs.sh

Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/raini/app/hadoop-2.7.2/logs/hadoop-raini-namenode-biyuzhe.out
biyuzhe: starting datanode, logging to /home/raini/app/hadoop-2.7.2/logs/hadoop-raini-datanode-biyuzhe.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
ECDSA key fingerprint is SHA256:7Th7Qu6av5WOqmmVLemv3YN+52LAcHw4BuFBNwBt5DU.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /home/raini/app/hadoop-2.7.2/logs/hadoop-raini-secondarynamenode-biyuzhe.out
raini@biyuzhe:~/hadoop$ jps
14242 Jps
14106 SecondaryNameNode
13922 DataNode ------------------ (no NameNode)


(12) The NameNode is missing because HDFS has not been formatted yet, so format it and restart HDFS:

raini@biyuzhe:~/hadoop$ hdfs namenode -format

raini@biyuzhe:~/hadoop$ sbin/stop-dfs.sh
Stopping namenodes on [localhost]
localhost: no namenode to stop
biyuzhe: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode

raini@biyuzhe:~/hadoop$ sbin/start-dfs.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/raini/app/hadoop-2.7.2/logs/hadoop-raini-namenode-biyuzhe.out
biyuzhe: starting datanode, logging to /home/raini/app/hadoop-2.7.2/logs/hadoop-raini-datanode-biyuzhe.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /home/raini/app/hadoop-2.7.2/logs/hadoop-raini-secondarynamenode-biyuzhe.out

raini@biyuzhe:~/hadoop$ jps
14919 NameNode ----------------------- (NameNode now present)
15407 Jps
15271 SecondaryNameNode
15073 DataNode


(13)raini@biyuzhe:~/hadoop$ sbin/start-yarn.sh  

starting yarn daemons
starting resourcemanager, logging to /home/raini/hadoop/logs/yarn-raini-resourcemanager-biyuzhe.out
biyuzhe: starting nodemanager, logging to /home/raini/app/hadoop-2.7.2/logs/yarn-raini-nodemanager-biyuzhe.out
raini@biyuzhe:~/hadoop$ jps
15625 NodeManager
14919 NameNode
15271 SecondaryNameNode
15073 DataNode
15937 Jps
15501 ResourceManager


(14) Verification: YARN web UI: http://localhost:8088/

    HDFS NameNode web UI: http://localhost:50070
        
 

Overview 'localhost:9000' (active)

Started: Sat Apr 23 14:04:17 CST 2016
Version: 2.7.2, rb165c4fe8a74265c792ce23f546c64604acf0e41
Compiled: 2016-01-26T00:08Z by jenkins from (detached from b165c4f)
Cluster ID: CID-b0ad8d51-6ea3-4bfc-a1d8-ee0cbc9a8ff6
Block Pool ID: BP-890697487-127.0.1.1-1461391390144
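As an optional smoke test (extra commands, not from the original post), copy a few files into HDFS and list them back:

raini@biyuzhe:~$ hdfs dfs -mkdir -p /user/raini/input
raini@biyuzhe:~$ hdfs dfs -put ~/hadoop/etc/hadoop/*.xml /user/raini/input
raini@biyuzhe:~$ hdfs dfs -ls /user/raini/input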

-------------------------------- Spark installation

(2) Configure the Spark environment variables in .bashrc (step (1), downloading and extracting Spark, is assumed)

export SPARK_HOME=/home/raini/spark
export PATH=${SPARK_HOME}/bin:$PATH

(3) Configure spark-env.sh
export JAVA_HOME=/home/raini/app/jdk
export SCALA_HOME=/home/raini/app/scala
export SPARK_WORKER_MEMORY=4g
export SPARK_MASTER_IP=biyuzhe            # Spark 1.x reads SPARK_MASTER_IP, not SPARK_MASTER
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8099       # likewise SPARK_MASTER_WEBUI_PORT, not SPARK_MASTER_WEBUI
export SPARK_WORKER_CORES=2
export HADOOP_CONF_DIR=/home/raini/hadoop/etc/hadoop

(4) cp slaves.template slaves (in Spark's conf directory), then set the worker hostname in it:

#localhost
biyuzhe

 

Start Spark with spark/sbin/start-all.sh
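A quick way to confirm the master and worker are up (an extra check, not in the original; the master URL follows the settings above) is to run a trivial job from spark-shell:

raini@biyuzhe:~$ spark-shell --master spark://biyuzhe:7077
scala> sc.parallelize(1 to 100).sum()   // should return 5050.0
scala> :quit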

------------------------ MySQL and Hive 2.0.0 installation

1) MySQL installation

$sudo apt-get install mysql-server

Log in to MySQL: $ mysql -u root -p

Create the hive database: mysql> create database hive;

                mysql> show databases;   -- check that it was created

Be sure to change the hive database's character set to latin1, and do it before Hive starts for the first time (otherwise delete operations will hang later):
                 mysql> alter database hive character set latin1;

Create the hive user and grant privileges: mysql> grant all on hive.* to hive@'%' identified by 'hive';

(Alternative: mysql> DROP USER 'hive'@'%';

         mysql> create user 'hive'@'%' identified by 'hive';
       then grant privileges: grant all privileges on *.* to 'hive'@'%' with grant option;

     )

Reload the privilege tables: mysql> flush privileges;

Check the MySQL version: mysql> select version();   -- 5.7.11-0ubuntu6 here
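It is also worth confirming that the new hive account can actually log in (an extra check, not in the original):

raini@biyuzhe:~$ mysql -u hive -p -e "show databases;"
(should list the hive database; if access is denied from localhost, additionally grant privileges to 'hive'@'localhost')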

Download the MySQL JDBC driver from: http://dev.mysql.com/downloads/connector/j/

Download mysql-connector-java-5.1.38.tar.gz and copy the MySQL JDBC driver jar into Hive's lib directory.
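For example (the jar name matches the version mentioned above; adjust if a different release was downloaded):

raini@biyuzhe:~$ tar -zxvf mysql-connector-java-5.1.38.tar.gz
raini@biyuzhe:~$ cp mysql-connector-java-5.1.38/mysql-connector-java-5.1.38-bin.jar ~/app/hive-2.0.0/lib/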

2) Hive installation

Download apache-hive-2.0.0-bin.tar.gz from http://hive.apache.org/ and extract it (here under /home/raini/app/hive-2.0.0, matching HIVE_HOME below).

Environment configuration

Add the following to .bashrc:
#Hive
export HIVE_HOME=/home/raini/app/hive-2.0.0
export PATH=$PATH:${HIVE_HOME}/bin
export CLASSPATH=$CLASSPATH:.:${HIVE_HOME}/lib

3) Configure hive-env.sh

Copy hive-env.sh.template to hive-env.sh and edit it.

Set HADOOP_HOME and HIVE_CONF_DIR as follows:

HADOOP_HOME=/home/.../hadoop

export HIVE_CONF_DIR=/home/.../hive/conf

# export HADOOP_HEAPSIZE=512

# Folder containing extra libraries required for hive compilation/execution can be controlled by:
export HIVE_AUX_JARS_PATH=/home/raini/app/hive-2.0.0/lib


4) Configure hive-site.xml

      Hive uses Hadoop, so:
 
    you must have Hadoop in your path OR
    export HADOOP_HOME=<hadoop-install-dir>
 
In addition, you must create /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) and set them chmod g+w in HDFS before you can create a table in Hive.
 
Commands to perform this setup (the directories need group write permission):

raini@biyuzhe:~$ hadoop fs -mkdir -p  /user/hive/tmp
raini@biyuzhe:~$ hadoop fs -mkdir -p /user/hive/log
raini@biyuzhe:~$ hadoop fs -mkdir -p /user/hive/warehouse
raini@biyuzhe:~$ hadoop fs -chmod g+w   /user/hive/tmp

raini@biyuzhe:~$ hadoop fs -chmod g+w   /user/hive/log
raini@biyuzhe:~$ hadoop fs -chmod g+w   /user/hive/warehouse

You may find it useful, though it's not necessary, to set HIVE_HOME:
 
  $ export HIVE_HOME=<hive-install-dir>
    export HIVE_HOME=/home/raini/app/hive

$ sudo /etc/init.d/mysql status

hive/lib must contain mysql-connector-java-5.1.38-bin.jar

5) Hive configuration: store the metastore in MySQL. Hive keeps its metadata in an RDBMS; by default it is configured to use the embedded Derby database.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

<property>
   <name>hive.metastore.local</name>
   <value>true</value>
  <description>Store metadata using a MySQL server on this machine; this mode requires a local MySQL server to be running.</description>
</property>

<property>
   <name>javax.jdo.option.ConnectionURL</name>
   <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value> <!-- do not use the hostname biyuzhe here -->
   <description>Metastore database connection URL; a ?characterEncoding=UTF-8 suffix can also be appended</description>
</property>

<property>
   <name>javax.jdo.option.ConnectionDriverName</name>
   <value>com.mysql.jdbc.Driver</value>
   <description>JDBC driver class used for the connection</description>
</property>

<property>
   <name>javax.jdo.option.ConnectionUserName</name>
   <value>hive</value>
   <description>MySQL user name</description>
</property>

<property>
   <name>javax.jdo.option.ConnectionPassword</name>
   <value>hive</value>
</property>
<!--
<property>
     <name>hive.metastore.uris</name>
     <value>thrift://localhost:9083</value>
     <description>uri1,uri2,... With this parameter the Hive metastore runs in remote mode instead of local mode; required when connecting to Hive over JDBC/ODBC with a MySQL-backed metastore.</description>
</property>
-->
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
  <description>Where table data is stored in HDFS; this is the directory created above with hadoop fs -mkdir.</description>
</property>

<property>
    <name>hive.exec.scratchdir</name>
    <value>/user/hive/tmp</value>
    <description>HDFS root scratch dir for Hive jobs, created with write-all (733) permission. For each connecting user an HDFS scratch dir ${hive.exec.scratchdir}/&lt;username&gt; is created with ${hive.scratch.dir.permission}. This is the /user/hive/tmp directory created above.</description>
  </property>

<property>
    <name>hive.querylog.location</name>
    <value>/user/hive/log</value>
    <description>Directory used for Hive query logs.</description>
</property>

<property>
    <name>hive.cli.print.current.db</name>
    <value>true</value>
</property>

</configuration>

-------------------------------------------finish hive-site.xml

cp hive-log4j.properties.template  hive-log4j.properties

vi hive-log4j.properties

hive.log.dir=

This controls where Hive writes its log files at runtime.

(mine: hive.log.dir=/usr/hive/log/${user.name})

hive.log.file=hive.log

This is the name of the Hive log file.

The default is fine, as long as you can recognize it as the log.

Only one setting really needs to be changed, otherwise a warning is reported:

log4j.appender.EventCounter=org.apache.hadoop.log.metrics.EventCounter

Without this change you will see:

WARNING: org.apache.hadoop.metrics.EventCounter is deprecated.

please use org.apache.hadoop.log.metrics.EventCounter  in all the  log4j.properties files.

(Just change it as the warning suggests.)
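If you prefer, the same change can be made with a one-line sed (an equivalent shortcut, assuming the template contains the deprecated class name):

raini@biyuzhe:~/app/hive-2.0.0/conf$ sed -i 's/org.apache.hadoop.metrics.EventCounter/org.apache.hadoop.log.metrics.EventCounter/' hive-log4j.properties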

-------------------------------------------------------finish all

Command to start the Hive metastore server:
hive --service metastore -p <port_num>

raini@biyuzhe:~/app/hive/tmp$ hive --service metastore > /tmp/hive_metastore.log 2>&1 &
[1] 26856

Here the Hive metastore (metadata store) runs in local mode, not remote mode.

Error reported:
    Exception in thread "main" java.lang.RuntimeException: Hive metastore database is not initialized. Please use schematool (e.g. ./schematool -initSchema -dbType ...) to create the schema. If needed, don't forget to include the option to auto-create the underlying database in your JDBC connection string (e.g. ?createDatabaseIfNotExist=true for mysql)


The first time, the metastore schema must be initialized:
raini@biyuzhe:~$ schematool -initSchema -dbType mysql -userName=hive -passWord=hive



Check the schema info after initialization: $ schematool -dbType mysql -info

Start the Hadoop services: $ sbin/start-dfs.sh and $ sbin/start-yarn.sh

 

Start Hive: raini@biyuzhe:~/app$ hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/raini/app/hive2.0.0/lib/hive-jdbc-2.0.0-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/raini/app/hive2.0.0/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/raini/app/spark-1.6.1-bin-hadoop2.6/lib/spark-assembly-1.6.1-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/raini/app/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in jar:file:/home/raini/app/hive2.0.0/lib/hive-common-2.0.0.jar!/hive-log4j2.properties
Sun Apr 24 11:25:41 CST 2016 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
(the same SSL warning is printed several more times)
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
hive (default)> show databases;
OK
default
Time taken: 1.017 seconds, Fetched: 1 row(s)
hive (default)>

hive (default)> create table test(id int, name string) row format delimited FIELDS TERMINATED BY ',';

Error: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:For direct MetaStore DB connections, we don't support retries at the client level.)


Start the metastore:

raini@biyuzhe:~/app$ hive --service metastore
Starting Hive Metastore Server


hive (default)> create table test(id int, name string) row format delimited FIELDS TERMINATED BY ',';
OK
Time taken: 1.613 seconds

The table's metadata can now be seen in MySQL:

raini@biyuzhe:~$ mysql -u hive -p

mysql> select * from TBLS;
+--------+-------------+-------+------------------+-------+-----------+-------+----------+---------------+--------------------+--------------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER | RETENTION | SD_ID | TBL_NAME | TBL_TYPE      | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT |
+--------+-------------+-------+------------------+-------+-----------+-------+----------+---------------+--------------------+--------------------+
|     41 |  1461469991 |     1 |                0 | raini |         0 |    41 | test     | MANAGED_TABLE | NULL               | NULL               |
+--------+-------------+-------+------------------+-------+-----------+-------+----------+---------------+--------------------+--------------------+
1 row in set (0.00 sec)

 

View the files created in HDFS:

raini@biyuzhe:~$ hdfs dfs -ls /user/hive/warehouse/
Found 1 items
drwxrwxrwx   - raini supergroup          0 2016-04-24 11:53 /user/hive/warehouse/test
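As a final end-to-end check (the sample rows below are hypothetical, not from the original post), load a small CSV into the test table and query it back:

raini@biyuzhe:~$ printf "1,tom\n2,jerry\n" > /tmp/test.csv
hive (default)> LOAD DATA LOCAL INPATH '/tmp/test.csv' INTO TABLE test;
hive (default)> SELECT * FROM test;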

 
