**Note:** For ease of learning, all CentOS operations below are performed as the root user, and the virtual machines communicate with the host via NAT.
I. Big Data Host Planning and Installation
1. Host Planning
ZooKeeper cluster:
192.168.215.154 (niit10)
192.168.215.155 (niit11)
192.168.215.156 (niit12)
Hadoop cluster:
192.168.215.154 (niit10) NameNode1 ResourceManager1 Journalnode
192.168.215.155 (niit11) NameNode2 ResourceManager2 Journalnode
192.168.215.156 (niit12) DataNode1 NodeManager1
192.168.215.157 (niit13) DataNode2 NodeManager2
HBase cluster:
192.168.215.154 (niit10) HMaster1
192.168.215.155 (niit11) HMaster2
192.168.215.156 (niit12) HRegionServer
192.168.215.157 (niit13) HRegionServer
Flume cluster:
192.168.215.154 (niit10)
192.168.215.155 (niit11)
192.168.215.156 (niit12)
192.168.215.157 (niit13)
Kafka cluster:
192.168.215.155 (niit11)
192.168.215.156 (niit12)
192.168.215.157 (niit13)
Hive installation: installed on niit10 only; the MySQL instance on the Windows host stores the Hive metastore:
192.168.215.154 (niit10)
2. ZooKeeper Cluster Installation
(1) Set up on niit10:
tar -zxvf zookeeper-3.4.5.tar.gz -C /training/
Environment variables (configure on all three hosts):
export ZOOKEEPER_HOME=/training/zookeeper-3.4.5
export PATH=$ZOOKEEPER_HOME/bin:$PATH
Core configuration file conf/zoo.cfg:
dataDir=/training/zookeeper-3.4.5/tmp    (create the directory first: mkdir /training/zookeeper-3.4.5/tmp)
server.1=niit10:2888:3888
server.2=niit11:2888:3888
server.3=niit12:2888:3888
Create a file named myid under /training/zookeeper-3.4.5/tmp containing:
1
Copy the configured ZooKeeper directory to the other nodes:
scp -r zookeeper-3.4.5/ root@niit11:/training
scp -r zookeeper-3.4.5/ root@niit12:/training
Change the myid file on niit11 to:
2
Change the myid file on niit12 to:
3
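Once zkServer.sh start has been run on each node (step (9) of the Hadoop section below also does this), a quick way to confirm the ensemble formed is to check each node's role. A minimal sketch, assuming passwordless SSH as root between the nodes (the same assumption the scp commands above rely on):
#!/bin/bash
# start ZooKeeper on all three nodes, then report each node's role
for host in niit10 niit11 niit12; do
    ssh root@$host "source ~/.bash_profile; zkServer.sh start"
done
sleep 5
for host in niit10 niit11 niit12; do
    echo "--- $host ---"
    ssh root@$host "source ~/.bash_profile; zkServer.sh status"   # expect Mode: leader on one node, follower on the others
done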
3. Hadoop Cluster Installation with HA (install and configure on niit10)
Note: the ZooKeeper cluster must be started first, otherwise the ResourceManager will fail to start.
(1) Upload the Hadoop package, extract it, and configure environment variables:
tar -zvxf /tools/hadoop-2.7.3.tar.gz -C /training/
Set the variables on all four hosts (niit10 through niit13):
export HADOOP_HOME=/training/hadoop-2.7.3
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
(2) mkdir /training/hadoop-2.7.3/tmp
(3) Edit core-site.xml:
<configuration>
<!-- Set the HDFS nameservice to ns1 -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://ns1</value>
</property>
<!-- Hadoop temporary directory -->
<property>
<name>hadoop.tmp.dir</name>
<value>/training/hadoop-2.7.3/tmp</value>
</property>
<!-- ZooKeeper quorum addresses -->
<property>
<name>ha.zookeeper.quorum</name>
<value>niit10:2181,niit11:2181,niit12:2181</value>
</property>
</configuration>
(4) Edit hdfs-site.xml (declares the NameNodes inside the nameservice):
<configuration>
<!-- The HDFS nameservice is ns1; must match core-site.xml -->
<property>
<name>dfs.nameservices</name>
<value>ns1</value>
</property>
<!-- ns1 has two NameNodes: nn1 and nn2 -->
<property>
<name>dfs.ha.namenodes.ns1</name>
<value>nn1,nn2</value>
</property>
<!-- RPC address of nn1 -->
<property>
<name>dfs.namenode.rpc-address.ns1.nn1</name>
<value>niit10:9000</value>
</property>
<!-- HTTP address of nn1 -->
<property>
<name>dfs.namenode.http-address.ns1.nn1</name>
<value>niit10:50070</value>
</property>
<!-- RPC address of nn2 -->
<property>
<name>dfs.namenode.rpc-address.ns1.nn2</name>
<value>niit11:9000</value>
</property>
<!-- HTTP address of nn2 -->
<property>
<name>dfs.namenode.http-address.ns1.nn2</name>
<value>niit11:50070</value>
</property>
<!-- Where the NameNode edit log is stored on the JournalNodes -->
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://niit10:8485;niit11:8485/ns1</value>
</property>
<!-- Local directory where the JournalNodes store their data -->
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/training/hadoop-2.7.3/journal</value>
</property>
<!-- Enable automatic NameNode failover -->
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<!-- Failover proxy provider implementation -->
<property>
<name>dfs.client.failover.proxy.provider.ns1</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<!-- Fencing methods; separate multiple mechanisms with newlines, one per line -->
<property>
<name>dfs.ha.fencing.methods</name>
<value>
sshfence
shell(/bin/true)
</value>
</property>
<!-- sshfence requires passwordless SSH -->
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/root/.ssh/id_rsa</value>
</property>
<!-- Timeout for the sshfence mechanism -->
<property>
<name>dfs.ha.fencing.ssh.connect-timeout</name>
<value>30000</value>
</property>
</configuration>
(5) Edit mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
(6) Edit yarn-site.xml:
<configuration>
<!-- Enable ResourceManager HA -->
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<!-- ResourceManager cluster id -->
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>yrc</value>
</property>
<!-- Logical names of the two ResourceManagers -->
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value>
</property>
<!-- Hostname of each ResourceManager -->
<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>niit10</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>niit11</value>
</property>
<!-- ZooKeeper cluster addresses -->
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>niit10:2181,niit11:2181,niit12:2181</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
(7) Edit slaves:
niit12
niit13
(8) Copy the configured Hadoop to the other nodes:
scp -r /training/hadoop-2.7.3/ root@niit11:/training/
scp -r /training/hadoop-2.7.3/ root@niit12:/training/
scp -r /training/hadoop-2.7.3/ root@niit13:/training/
(9) Start the ZooKeeper cluster, i.e. run on every host where ZooKeeper is installed:
zkServer.sh start
(10) Start the JournalNodes on niit10 and niit11:
hadoop-daemon.sh start journalnode
(11) Format HDFS (run on niit10):
1. hdfs namenode -format
Log: common.Storage: Storage directory /training/hadoop-2.7.3/tmp/dfs/name has been successfully formatted.
2. Copy /training/hadoop-2.7.3/tmp to /training/hadoop-2.7.3/ on niit11:
scp -r tmp/ root@niit11:/training/hadoop-2.7.3/
3. Format the HA state in ZooKeeper:
hdfs zkfc -formatZK
Log: 17/07/13 00:34:33 INFO ha.ActiveStandbyElector: Successfully created /hadoop-ha/ns1 in ZK.
(12) Start the Hadoop cluster on niit10:
start-all.sh
Log output:
Starting namenodes on [niit10 niit11]
niit11: starting namenode, logging to /training/hadoop-2.7.3/logs/hadoop-root-namenode-niit11.out
niit10: starting namenode, logging to /training/hadoop-2.7.3/logs/hadoop-root-namenode-niit10.out
niit13: starting datanode, logging to /training/hadoop-2.7.3/logs/hadoop-root-datanode-niit13.out
niit12: starting datanode, logging to /training/hadoop-2.7.3/logs/hadoop-root-datanode-niit12.out
Starting journal nodes [niit10 niit11 ]
niit11: journalnode running as process 9249. Stop it first.
niit10: journalnode running as process 10871. Stop it first.
Starting ZK Failover Controllers on NN hosts [niit10 niit11]
niit10: starting zkfc, logging to /training/hadoop-2.7.3/logs/hadoop-root-zkfc-niit10.out
niit11: starting zkfc, logging to /training/hadoop-2.7.3/logs/hadoop-root-zkfc-niit11.out
starting yarn daemons
starting resourcemanager, logging to /training/hadoop-2.7.3/logs/yarn-root-resourcemanager-niit10.out
niit13: starting nodemanager, logging to /training/hadoop-2.7.3/logs/yarn-root-nodemanager-niit13.out
niit12: starting nodemanager, logging to /training/hadoop-2.7.3/logs/yarn-root-nodemanager-niit12.out
(13) The ResourceManager on niit11 must be started separately:
yarn-daemon.sh start resourcemanager
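To confirm the HA setup is healthy, the state of both NameNodes and ResourceManagers can be queried directly; a small verification sketch (the nn1/nn2 and rm1/rm2 ids come from the hdfs-site.xml and yarn-site.xml above):
hdfs haadmin -getServiceState nn1    # expect: active (or standby)
hdfs haadmin -getServiceState nn2    # expect: standby (or active)
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2
# Expected daemons per node when running jps:
#   niit10/niit11: NameNode, JournalNode, DFSZKFailoverController, ResourceManager, QuorumPeerMain
#   niit12: DataNode, NodeManager, QuorumPeerMain
#   niit13: DataNode, NodeManager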
4. HBase Cluster Installation and HA
4.1 Fully Distributed Mode (four hosts: niit10 (master), niit11 (backup master), niit12 (slave), niit13 (slave))
Important: unless otherwise stated, all of the following operations are performed on the master node (niit10).
1) Upload hbase-1.3.1-bin.tar.gz to the /tools directory.
2) Extract hbase-1.3.1-bin.tar.gz into /training:
tar -zvxf hbase-1.3.1-bin.tar.gz -C /training/
3) Configure environment variables (on all four hosts):
vi ~/.bash_profile
Add the following to .bash_profile:
export HBASE_HOME=/training/hbase-1.3.1
export PATH=$HBASE_HOME/bin:$PATH
4) Apply the environment variables:
source ~/.bash_profile
5) Verify that the HBase environment variables took effect:
hbase    ----- if the output shows "Usage: hbase [<options>] <command> [<args>]", the configuration works; otherwise it is wrong.
6) Go to /training/hbase-1.3.1/conf:
cd /training/hbase-1.3.1/conf
Edit the following files in that directory:
(a) vi hbase-env.sh and make these changes:
(1) Find the line # export JAVA_HOME=/usr/java/jdk1.6.0/, remove the #, and replace /usr/java/jdk1.6.0 with your own JAVA_HOME path.
My JAVA_HOME is /training/jdk1.8.0_171, so after editing the line reads:
export JAVA_HOME=/training/jdk1.8.0_171
(2) Find # export HBASE_MANAGES_ZK=true, remove the #, and change true to false.
(3) Save and exit.
(b) vi hbase-site.xml and add the following between <configuration> and </configuration>.
Note: change the addresses below to your own hosts' addresses:
<!-- HDFS directory where HBase stores its data (with the HA HDFS above, hdfs://ns1/hbase would avoid depending on niit10 being the active NameNode) -->
<property>
<name>hbase.rootdir</name>
<value>hdfs://niit10:9000/hbase</value>
</property>
<!-- Whether this is a distributed deployment -->
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<!-- ZooKeeper quorum addresses -->
<property>
<name>hbase.zookeeper.quorum</name>
<value>niit10,niit11,niit12</value>
</property>
<!-- Replication factor -->
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<!-- Maximum allowed clock skew between the master and the region servers (ms) -->
<property>
<name>hbase.master.maxclockskew</name>
<value>180000</value>
</property>
(c) vi regionservers to configure the slave nodes.
Replace localhost with the IP addresses or hostnames:
niit12
niit13
7) Go to /training and copy the entire hbase-1.3.1 directory to the other three nodes:
scp -r hbase-1.3.1/ root@niit11:/training
scp -r hbase-1.3.1/ root@niit12:/training
scp -r hbase-1.3.1/ root@niit13:/training
8) Start HBase:
start-hbase.sh
9) Use jps to check the processes: HMaster should be running on the master node and HRegionServer on the slave nodes; if these processes exist, HBase was installed and configured successfully:
HRegionServer
HMaster
10) Open the web console to check:
http://niit10:16010
4.2 HBase HA (building on the fully distributed mode above; per the host planning, niit11 is the backup HMaster)
To use niit11 as the backup HMaster, simply start an HMaster process on niit11 by hand:
hbase-daemon.sh start master
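A quick smoke test from the HBase shell verifies that both masters and the region servers are registered; a minimal sketch (t1/cf are throwaway names used only for this test):
echo "status" | hbase shell          # shows 1 active master, 1 backup master, 2 region servers
hbase shell <<'EOF'
create 't1', 'cf'
put 't1', 'row1', 'cf:a', 'value1'
scan 't1'
disable 't1'
drop 't1'
EOF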
5. Kafka Cluster Installation (on the three hosts niit11 through niit13)
Note: unless otherwise stated, the following installation steps are performed on niit11.
1. Prepare ZooKeeper: skip if already installed; otherwise follow the installation steps above.
2. JDK: skip if already installed; otherwise install it.
3. Extract the tarball:
tar -zvxf kafka_2.11-0.10.0.1.tgz -C /training/
4. Environment variables (configure on all three hosts niit11 through niit13):
vi ~/.bash_profile
Add the following:
export KAFKA_HOME=/training/kafka_2.11-0.10.0.1
export PATH=$PATH:$KAFKA_HOME/bin
5. Create the logs directory:
mkdir /training/kafka_2.11-0.10.0.1/logs
6. Configure Kafka.
Configuration file: [/training/kafka_2.11-0.10.0.1/config/server.properties]
Find and modify the following entries:
broker.id=11
listeners=PLAINTEXT://:9092
log.dirs=/training/kafka_2.11-0.10.0.1/logs
zookeeper.connect=niit10:2181,niit11:2181,niit12:2181
7. Distribute the kafka_2.11-0.10.0.1 directory to niit12 and niit13, then adjust broker.id on each:
scp -r /training/kafka_2.11-0.10.0.1 root@niit12:/training/
scp -r /training/kafka_2.11-0.10.0.1 root@niit13:/training/
On niit12, change broker.id in server.properties to 12.
On niit13, change broker.id in server.properties to 13.
8. Start the Kafka servers (each broker must be started individually):
a) Start ZooKeeper first (if it is not already running).
b) Start Kafka on niit11 through niit13; from /training/kafka_2.11-0.10.0.1/bin run:
./kafka-server-start.sh ../config/server.properties &    ----> start as a background process
c) Verify that the Kafka server is listening:
netstat -anop | grep 9092
9. Create a topic; from /training/kafka_2.11-0.10.0.1/bin run:
./kafka-topics.sh --create --zookeeper niit11:2181 --replication-factor 3 --partitions 3 --topic test
10. List the topics; from /training/kafka_2.11-0.10.0.1/bin run:
./kafka-topics.sh --list --zookeeper niit11:2181
11. Start a console producer; from /training/kafka_2.11-0.10.0.1/bin run:
./kafka-console-producer.sh --broker-list niit11:9092 --topic test
12. Start a console consumer; from /training/kafka_2.11-0.10.0.1/bin run:
./kafka-console-consumer.sh --bootstrap-server niit12:9092 --topic test --from-beginning
13. Type "hello world" in the producer console; the consumer should display the message.
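The produce/consume round trip can also be scripted instead of using two interactive consoles; a sketch run from /training/kafka_2.11-0.10.0.1/bin (--timeout-ms makes the consumer exit after 10 s of silence):
./kafka-topics.sh --describe --zookeeper niit11:2181 --topic test     # check leaders and ISR for all 3 partitions
echo "hello world" | ./kafka-console-producer.sh --broker-list niit11:9092 --topic test
./kafka-console-consumer.sh --bootstrap-server niit11:9092 --topic test --from-beginning --timeout-ms 10000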
6. Flume Installation and Deployment (niit10 through niit13)
All of the following operations are performed on niit10 unless stated otherwise.
1) Upload Flume to the /tools directory.
2) Extract it:
tar -zvxf apache-flume-1.7.0-bin.tar.gz -C /training/
3) Environment variables:
export FLUME_HOME=/training/apache-flume-1.7.0-bin
export PATH=$PATH:$FLUME_HOME/bin
4) In /training/apache-flume-1.7.0-bin/conf, create the Kafka-facing configuration file eshop.conf.
First create Flume's spoolDir: mkdir -p /training/nginx-1.14.0/logs/flume
vi eshop.conf
Add the following:
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /training/nginx-1.14.0/logs/flume
a1.sources.r1.fileHeader = true
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = eshop
a1.sinks.k1.kafka.bootstrap.servers = niit11:9092,niit12:9092,niit13:9092
a1.channels.c1.type = memory
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
5) Distribute the Flume environment to the other hosts:
scp -r apache-flume-1.7.0-bin root@niit11:/training/
scp -r apache-flume-1.7.0-bin root@niit12:/training/
scp -r apache-flume-1.7.0-bin root@niit13:/training/
6) Verify the Flume installation:
flume-ng version    // "ng" stands for next generation
7) Startup
a) Start the ZooKeeper cluster on niit10 through niit12 (if it is not already running):
$>zkServer.sh start //niit10
$>zkServer.sh start //niit11
$>zkServer.sh start //niit12
b) Start the Kafka cluster on niit11 through niit13 (if it is not already running):
kafka-server-start.sh /training/kafka_2.11-0.10.0.1/config/server.properties &    ---> start as a background process
c) Create the eshop topic on niit11:
kafka-topics.sh --zookeeper niit11:2181 --topic eshop --create --partitions 3 --replication-factor 3
d) List the topics on niit11:
$>kafka-topics.sh --zookeeper niit11:2181 --list
e) Start Flume (on every host):
//niit10~niit13
cd /training/apache-flume-1.7.0-bin/conf
flume-ng agent -f eshop.conf -n a1
7. Test that Flume collects data and delivers it to Kafka. Verify:
1) From the command line, start a consumer to read the messages:
./kafka-console-consumer.sh --bootstrap-server niit11:9092 --topic eshop --from-beginning
2) Or write a native consumer program to consume the messages.
Notes:
(*) Permissions: hdfs dfs -chmod -R 777 /eshop_logs/raw
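A quick end-to-end check of the Flume-to-Kafka path: drop a file into the spooling directory on niit10 and watch it arrive on the eshop topic. A sketch; the sample line below is an assumption matching the comma-delimited log format produced by rolllog.sh later in this document:
echo "niit10,192.168.215.1,-,28/Aug/2018:10:00:00 +0800,GET /eshop/phone/mi.html HTTP/1.0,200,213,-,ApacheBench/2.3,-" > /training/nginx-1.14.0/logs/flume/sample_$(date +%s).log
kafka-console-consumer.sh --bootstrap-server niit11:9092 --topic eshop --from-beginning
# Flume renames the consumed file to *.COMPLETED once it has been delivered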
8. Hive Installation
1. Upload hive-2.1.0.tar.gz to the /tools directory.
2. Extract it:
$>tar -xzvf hive-2.1.0.tar.gz -C /training    // extract
$>cd /training
$>ln -s hive-2.1.0 hive    // symbolic link
3. Configure environment variables:
[/etc/profile]
export HIVE_HOME=/training/hive
export PATH=$PATH:$HIVE_HOME/bin
4. Verify the Hive installation:
$>hive --version
5. Configure Hive to store its metadata in the MySQL on the Windows host.
a) Copy the MySQL JDBC driver into Hive's lib directory.
b) In /training/hive/conf/, configure hive-site.xml.
Copy hive-default.xml.template to hive-site.xml:
cp hive-default.xml.template hive-site.xml
or create a new hive-site.xml.
Change the connection settings to the MySQL URL, replace the ${system:... placeholders with concrete paths, and set the doAs property to false.
[hive/conf/hive-site.xml]
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
<description>password to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>Username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://192.168.215.1:3306/hive?useSSL=false</value> <!-- in a cluster environment, use the host's virtual (NAT) NIC address here -->
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<!-- Local directory used by Hive to execute jobs -->
<property>
<name>hive.exec.local.scratchdir</name>
<value>/home/hive/scratchdir</value>
<description>Local scratch space for Hive jobs</description>
</property>
<property>
<name>hive.downloaded.resources.dir</name>
<value>/home/hive/downloads</value>
<description>Temporary local directory for added resources in the remote file system.</description>
</property>
<property>
<name>hive.querylog.location</name>
<value>/home/hive/querylogs</value>
<description>Location of Hive run time structured log file</description>
</property>
<property>
<name>hive.server2.logging.operation.log.location</name>
<value>/home/hive/server2_logs</value>
<description>Top level directory where operation logs are stored if logging functionality is enabled</description>
</property>
<property>
<name>hive.server2.enable.doAs</name>
<value>false</value>
<description>
Setting this property to true will have HiveServer2 execute
Hive operations as the user making the calls to it.
</description>
</property>
</configuration>
6) Create the database in MySQL that will hold the Hive metadata:
mysql>create database hive ;
7) Initialize the Hive metastore (table structures) into MySQL:
$>cd /training/hive/bin
$>schematool -dbType mysql -initSchema
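If the initialization succeeded, the metastore tables appear in the hive database in MySQL; a quick check (run wherever the mysql client is available; -p123456 matches the password configured in hive-site.xml above):
schematool -dbType mysql -info                                        # prints the metastore schema version
mysql -h 192.168.215.1 -u root -p123456 -e "use hive; show tables;"   # expect tables such as DBS, TBLS, PARTITIONS, SDS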
8) Working with Hive
Create a partitioned table in Hive, partitioned by year=/month=/day=/hour=/minute=.
-------------------
1. Create the database:
$>hive
$hive>create database eshop ;
2. Create the Hive partitioned table (note: the external keyword would create an external table and is not wanted here, so it is omitted):
create table eshop.logs (
hostname string,
remote_addr string,
remote_user string,
time_local string,
request string,
status string,
body_bytes_sent string,
http_referer string,
http_user_agent string,
http_x_forwarded_for string
)
partitioned by(year int ,month int,day int,hour int,minute int)
row format DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
3. Create a CentOS cron job that adds table partitions periodically.
The partition-adding statement used:
hive>alter table eshop.logs add partition(year=2018,month=8,day=23,hour=17,minute=3)
4. View the Hive table partitions from MySQL:
$mysql>select * from hive.partitions ;
5. Create the partition-adding script:
[/usr/local/bin/addpar.sh]
#!/bin/bash
y=`date +%Y`
m=`date +%m`
d=`date +%d`
h=`date +%H`
mi=`date +%M`
hive -e "alter table eshop.logs add partition(year=${y},month=${m},day=${d},hour=${h},minute=${mi})"
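To test the script by hand before wiring it into cron, make it executable, run it once, and check that the partition shows up; a small sketch:
chmod a+x /usr/local/bin/addpar.sh
addpar.sh                                  # adds the partition for the current minute
hive -e "show partitions eshop.logs"       # the new year=/month=/day=/hour=/minute= entry should be listed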
6. CentOS date arithmetic reference (for scheduling):
date -d "1 day" +%Y%m%d      ---> one day later
date -d "-1 days" +%Y%m%d    ---> one day earlier
date -d "1 month" +%Y%m%d    ---> one month later
date -d '-1 hours' +%H       ---> one hour earlier
date -d '-1 minutes' +%M     ---> one minute earlier
7. /etc/crontab:
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin:/training/jdk1.8.0_171/bin:/training/hadoop-2.7.3/sbin:/training/hadoop-2.7.3/bin:/training/hive/bin
MAILTO=root
# For details see man 4 crontabs
# Example of job definition:
# .---------------- minute (0 - 59)
# | .------------- hour (0 - 23)
# | | .---------- day of month (1 - 31)
# | | | .------- month (1 - 12) OR jan,feb,mar,apr ...
# | | | | .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# | | | | |
# * * * * * user-name command to be executed
* * * * * root rolllog.sh
* * * * * root addpar.sh
8. Test loading data into the Hive table:
$>hive
//LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
$hive>load data inpath 'hdfs://mycluster/user/centos/eshop/raw/2017/02/28/17/35/s203.log' into table eshop.logs partition(year=2017,month=3,day=1,hour=11,minute=31)
If you are not on an HA cluster: load data inpath "hdfs://niit00:9000/electricity/raw/2018/08/29/07/58/niit00.log" into table logs partition(year=2018,month=8,day=29,hour=7,minute=58);
9. Re-create the Hive table delimited by commas (I did not need this step, because my data was comma-delimited from the start):
create table eshop.logs (
hostname string,
remote_addr string,
remote_user string,
time_local string,
request string,
status string,
body_bytes_sent string,
http_referer string,
http_user_agent string,
http_x_forwarded_for string
)
partitioned by(year int ,month int,day int,hour int,minute int)
row format DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
9. Sqoop Installation and Configuration
1) Download sqoop-1.4.6.tar.gz and sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz.
2) Upload them to the /tools directory.
3) Install:
(1) tar -zvxf sqoop-1.4.6.tar.gz -C /training/
(2) Configure environment variables:
export SQOOP_HOME=/training/sqoop-1.4.6
export PATH=$PATH:$SQOOP_HOME/bin
(3) Apply them: source ~/.bash_profile
(4) Go to /training/sqoop-1.4.6/conf/:
cp sqoop-env-template.sh sqoop-env.sh
vi sqoop-env.sh
Modify the corresponding options:
#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/training/hadoop-2.7.3
#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/training/hadoop-2.7.3
#set the path to where bin/hbase is available
export HBASE_HOME=/training/hbase-1.3.1
#Set the path to where bin/hive is available
export HIVE_HOME=/training/hive
#Set the path for where zookeper config dir is
export ZOOCFGDIR=/training/zookeeper-3.4.5/conf
(5) Upload the MySQL driver mysql-connector-java-5.1.44-bin.jar into /training/sqoop-1.4.6/lib.
(6) Go to /tools and extract sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz:
tar -zvxf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
cd /tools/sqoop-1.4.6.bin__hadoop-2.0.4-alpha
Copy sqoop-1.4.6.jar into /training/sqoop-1.4.6/lib/ (without this step you get the error: Could not find or load main class org.apache.sqoop.Sqoop):
cp sqoop-1.4.6.jar /training/sqoop-1.4.6/lib/
4) Test the sqoop command:
sqoop help
If no error is reported, the installation is correct.
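Beyond sqoop help, connectivity to the MySQL on the Windows host can be checked with list-databases; a sketch using the same credentials as the Hive configuration:
sqoop list-databases --connect jdbc:mysql://192.168.215.1:3306 --username root --password 123456
# the output should include the hive database created earlier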
5) The following uses Hive and Sqoop to clean and export the data:
Load the cleaned data on HDFS into Hive.
----------------------------------
1. Add a table partition:
alter table eshop.logs add partition(year=2018,month=08,day=29,hour=03,minute=21);
2. Load the data into the table:
$hive>load data inpath '/electricity/raw/2018/08/29/03/21' into table eshop.logs partition(year=2018,month=08,day=29,hour=03,minute=21);
3. Query the top N:
$hive>select * from logs ;
s201 192.168.231.1 - 02/Mar/2017:09:28:58 +0800 GET /eshop/phone/mi.html HTTP/1.0 200 213 -ApacheBench/2.3 - 2017 3 2 9 28
// top N in descending order
$>select request,count(*) as c from logs where year = 2018 and month = 08 and day = 29 and hour = 03 and minute = 21 group by request order by c desc ;
4. Create the results table:
$hive>create table stats(request string,c int) row format DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
$hive>insert into stats select request,count(*) as c from logs where year = 2018 and month = 3 and day = 2 and hour = 9 and minute = 28 group by request order by c desc ;
insert into stats select request,count(*) as c from logs where year = 2018 and month = 8 and day = 29 and hour = 3 and minute = 21 group by request order by c desc ;
5. Use Sqoop to export the Hive data to MySQL:
sqoop export --connect jdbc:mysql://192.168.215.1:3306/eshop --driver com.mysql.jdbc.Driver --username root --password 123456 --table stats --columns request,c --export-dir hdfs://ns1/user/hive/warehouse/eshop.db/stats
sqoop export --connect jdbc:mysql://192.168.215.1:3306/eshop --driver com.mysql.jdbc.Driver --username root --password 123456 --table stats --columns request,c --export-dir hdfs://niit00:9000/user/hive/warehouse/eshop.db/stats
Notes:
1) Sqoop must be installed on niit10 first.
2) The stats table must be created in MySQL first (see the sketch after these notes).
3) The command above is submitted to the cluster as a MapReduce job.
4) The --export-dir HDFS address uses the cluster nameservice here because this is an HA environment; if yours is not, use the NameNode's host instead.
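Regarding note 2), the stats table can be created from the command line; a sketch where the column types (varchar(255)/int) are an assumption chosen to match the Hive stats table:
mysql -h 192.168.215.1 -u root -p123456 -e "create database if not exists eshop; create table if not exists eshop.stats(request varchar(255), c int);"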
6. Turn steps 2-5 above into scripts scheduled by cron.
a. Description
At 2:00 a.m. every day, compute statistics over the previous day's logs.
b. Create the bash scripts
1. Create the preparation script, which generates the Hive QL file:
[/usr/local/bin/prestats.sh]
#!/bin/bash
y=`date +%Y`
m=`date +%m`
d=`date -d "-0 day" +%d` ### -0 means today, -1 yesterday, 1 tomorrow; adjust as needed
# normalize 03 to 3 (10# forces base 10, so 08 and 09 are not misread as invalid octal)
m=$(( 10#$m ))
d=$(( 10#$d ))
#
rm -rf stat.ql
# add the partition
echo "alter table eshop.logs add if not exists partition(year=${y},month=${m},day=${d},hour=10,minute=47);" >> stat.ql
# load the data into the partition
echo "load data inpath 'hdfs://ns1/eshop/raw/${y}/${m}/${d}/10/47' into table eshop.logs partition(year=${y},month=${m},day=${d},hour=10,minute=47);" >> stat.ql
# compute the statistics and insert the results into the stats table
echo "insert into eshop.stats select request,count(*) as c from eshop.logs where year = ${y} and month = ${m} and day = ${d} and hour=10 and minute = 47 group by request order by c desc ;" >> stat.ql
2. Create the execution script:
[/usr/local/bin/exestats.sh]
#!/bin/bash
prestats.sh    # /usr/local/bin is on PATH; generates stat.ql in the current directory
# run the generated Hive QL script
hive -f stat.ql
# run the Sqoop export
sqoop export --connect jdbc:mysql://192.168.215.1:3306/eshop --driver com.mysql.jdbc.Driver --username root --password 123456 --table stats --columns request,c --export-dir hdfs://ns1/user/hive/warehouse/eshop.db/stats
3. Make the scripts executable:
$>sudo chmod a+x /usr/local/bin/prestats.sh
$>sudo chmod a+x /usr/local/bin/exestats.sh
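To realize the 2:00 a.m. schedule from the description above, one crontab line is enough; a sketch for /etc/crontab:
# run the daily statistics pipeline at 02:00
0 2 * * * root exestats.sh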
4. Run and test.
5. Known issue: each run appends to the stats table, so duplicate rows accumulate:
mysql> select * from stats;
+----------------------------------------+------+
| request | c |
+----------------------------------------+------+
| GET /eshop/phone/iphone7.html HTTP/1.0 | 3 |
| GET /eshop/phone/mi.html HTTP/1.0 | 1133 |
| GET /eshop/phone/note7.html HTTP/1.0 | 6 |
| GET /eshop/phone/huawei.html HTTP/1.0 | 2 |
| GET /eshop/phone/iphone7.html HTTP/1.0 | 3 |
| GET /eshop/phone/mi.html HTTP/1.0 | 1133 |
| GET /eshop/phone/huawei.html HTTP/1.0 | 2 |
| GET /eshop/phone/note7.html HTTP/1.0 | 6 |
| GET /eshop/phone/note7.html HTTP/1.0 | 1611 |
| GET /eshop/images/huawei.png HTTP/1.0 | 102 |
| GET /eshop/phone/iphone7.html HTTP/1.0 | 3 |
| GET /eshop/phone/mi.html HTTP/1.0 | 1133 |
| GET /eshop/phone/huawei.html HTTP/1.0 | 2 |
| GET /eshop/phone/note7.html HTTP/1.0 | 6 |
+----------------------------------------+------+
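One way to avoid these duplicates, since every sqoop export appends to stats, is to clear the table at the start of exestats.sh before re-exporting; a minimal sketch:
# add at the top of /usr/local/bin/exestats.sh, before the sqoop export
mysql -h 192.168.215.1 -u root -p123456 -e "truncate table eshop.stats"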
Export the consumer jar and run it on CentOS:
1. Use mvn to download all the third-party runtime jars into ./lib:
mvn -DoutputDirectory=./lib -DgroupId=com.it18zhang -DartifactId=EshopConsumer -Dversion=1.0-SNAPSHOT dependency:copy-dependencies
2. Package the project with IDEA.
3. Copy the project jar next to lib.
4. On CentOS, use xargs to join the jar names into one line:
$>ls | xargs > a.txt
5. Replace every space with ':'.
6. java -cp ... com.it18zhang.kafkconsumer.HDFSRawConsumer (see the sketch below)
mvn -DoutputDirectory=./lib -DgroupId=com.niit -DartifactId=RawConsumer -Dversion=1.0-SNAPSHOT dependency:copy-dependencies
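Putting steps 4-6 together: the jar list can be joined into a classpath in one go with xargs and tr; a sketch assuming the project jar (name taken from the mvn coordinates above) sits next to the lib directory:
#!/bin/bash
cd ~/EshopConsumer                              # directory holding the project jar and lib/ (location assumed)
CP=$(ls lib/*.jar | xargs | tr ' ' ':')         # join the jar names, then turn spaces into ':'
java -cp EshopConsumer-1.0-SNAPSHOT.jar:$CP com.it18zhang.kafkconsumer.HDFSRawConsumer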
II. Nginx Installation and Configuration on Windows
Note: the Windows host acts as the reverse proxy.
1. Install Nginx on Windows 7 (download the Windows build of Nginx).
Just unzip it (any directory will do), e.g.:
c:\nginx-1.14.0
Double-click nginx.exe to start the Nginx server.
2. Test: open http://localhost:80 in a browser.
III. Setting Up the Web Environment on Linux
1. Nginx Installation (Nginx serves static assets and acts as the load balancer)
1) Upload the Nginx package nginx-1.14.0.tar.gz to the /tools directory.
2) Install the build prerequisites on CentOS:
a) Install gcc first:
yum install gcc
b) Install PCRE:
Option 1: install via yum (recommended):
yum install pcre-static.x86_64
Option 2: build from source (not used here):
$>tar -xzvf pcre-8.32.tar.gz -C ~
$>cd ~/pcre-8.32
$>sudo ./configure --prefix=/training/pcre-8.32
3) Install Nginx:
Option 1: install via yum (no yum repository available):
yum install nginx
Option 2: build from source (used here):
$>tar -xzvf nginx-1.14.0.tar.gz -C ~
$>cd ~/nginx-1.14.0
$>sudo ./configure --prefix=/training/nginx-1.14.0 --without-http_gzip_module
$>sudo make && make install
$>sudo ldconfig    ----> ldconfig normally runs at system boot; run it manually after installing a new shared library.
4) Configure environment variables:
vi ~/.bash_profile
Add the following:
export PATH=$PATH:/training/nginx-1.14.0/sbin
Apply it:
source ~/.bash_profile
5) Start the Nginx server:
cd /training/nginx-1.14.0/sbin
./nginx    // start the server (a helper script for this is given below)
6) Nginx start/stop commands:
./nginx -s stop    // stop the server
./nginx -s reload    // reload the configuration
./nginx -s reopen    // reopen the log files
./nginx -s quit    // graceful shutdown
7) Visit Nginx in a browser; the welcome page should appear:
http://niit10:80/
Note: steps 8) through 10) below wrap the work above into scripts for easier future installs.
8) Nginx install script (every host in the cluster needs Nginx; just run the script below; the sudo prefix applies if you are not root):
---------------------
In ~, create the script (vi nginx_install.sh) with the following content:
#!/bin/bash
yum install -y gcc
#install pcre
yum install -y pcre-static.x86_64
#install nginx
cp /tools/nginx-1.14.0.tar.gz ~
cd ~
tar -xzvf nginx-1.14.0.tar.gz -C .
cd nginx-1.14.0
./configure --prefix=/training/nginx-1.14.0 --without-http_gzip_module
make && make install
ldconfig
#set .bash_profile
echo '#nginx' >> ~/.bash_profile
echo 'export PATH=$PATH:/training/nginx-1.14.0/sbin' >> ~/.bash_profile
source ~/.bash_profile
9) Nginx start script:
vi start-nginx.sh with the following content:
#!/bin/bash
cd /training/nginx-1.14.0/sbin
./nginx
10) Nginx stop script:
vi stop-nginx.sh with the following content:
#!/bin/bash
cd /training/nginx-1.14.0/sbin
./nginx -s stop
2. Tomcat Installation (all hosts)
1) Upload Tomcat to the /tools directory.
2) Extract it: tar -zvxf apache-tomcat-7.0.90.tar.gz -C /training/
3) Configure environment variables:
#tomcat
export PATH=$PATH:/training/apache-tomcat-7.0.90/bin
4) Apply them:
source ~/.bash_profile
IV. Writing the Web Project
Note: this project simulates an e-commerce platform; it generates the log data that feeds the big data platform.
See the source code for the details.
1. Set up the Spring + Spring MVC + Hibernate environment.
2. Write the business logic.
3. Write the front end.
Note: the front end is written in HTML5 and fetches data via Ajax requests.
V. Deploying the Web Project to the Linux Cluster (after all of the steps above are complete)
1. Package the web project as a WAR and upload it to the webapps directory of Tomcat on every host (look up how to build the WAR if needed).
2. Upload the front-end project goods into the html directory under the Nginx installation directory.
VI. Writing the Linux crontab Scheduling Scripts (kept in the script directory)
1. Use the CentOS cron mechanism to roll the Nginx logs.
Note: adjust the Nginx paths in the script below to your own installation.
Create rolllog.sh under /usr/local/bin/:
#!/bin/bash
# timestamp used in the rolled file name
dataformat=`date +%Y-%m-%d-%H-%M`
# copy the live access log to a timestamped file
cp /training/nginx-1.14.0/logs/access.log /training/nginx-1.14.0/logs/access_$dataformat.log
# prefix every line with this host's name (keeps the logs comma-delimited per host)
host=`hostname`
sed -i 's/^/'${host}',&/g' /training/nginx-1.14.0/logs/access_$dataformat.log
# remember how many lines were rolled
lines=`wc -l < /training/nginx-1.14.0/logs/access_$dataformat.log`
# move access_xxx.log into flume's spooldir
mv /training/nginx-1.14.0/logs/access_$dataformat.log /training/nginx-1.14.0/logs/flume
# delete the rolled rows from the live log
sed -i '1,'${lines}'d' /training/nginx-1.14.0/logs/access.log
# signal nginx to reopen the log file, otherwise the log cannot roll
kill -USR1 `cat /training/nginx-1.14.0/logs/nginx.pid`
2. Set the log roll interval to one minute (for testing):
vi /etc/crontab
* * * * * root rolllog.sh
3. Write the Hive scripts.
VII. Writing the Native Kafka Consumer
1. The code (see the project source; a quick stand-in sketch follows below).
2. Write the partitioned-table scripts.
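The real consumer is the Java HDFSRawConsumer program packaged above; purely as a stand-in for testing the Kafka-to-HDFS data path, the console consumer can be piped straight into HDFS. A rough sketch (the target path reuses the /eshop_logs/raw directory mentioned earlier; the file name is arbitrary):
kafka-console-consumer.sh --bootstrap-server niit11:9092 --topic eshop | hdfs dfs -appendToFile - /eshop_logs/raw/console_test.log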
VIII. Using Sqoop to Import the Hive-Cleaned Data into MySQL
1. Write the scripts.
2. Add cron scheduling (once a day).
IX. Visualization in the Web Project
1. See the code.