I. Building a Hadoop big data platform
Distributed: a distributed architecture splits one service into n smaller services, which makes the project easier to develop and maintain.
0. Hadoop overview
Hadoop is a distributed big-data infrastructure whose core components are HDFS and MapReduce. It is used for offline and (near-)real-time computation over massive data sets (as a rule of thumb, anything above roughly 500 GB is worth considering).
0.1 Basic Hadoop daemons
1) NameNode: the master server of Hadoop. It manages the file system namespace and access to the files stored in the cluster, and holds the metadata.
2) SecondaryNameNode: not a redundant standby for the NameNode; it performs periodic checkpoints and cleanup, helping the NameNode merge the edits log and shortening NameNode startup time.
3) DataNode: manages the storage attached to its node (a cluster can have many nodes). Every node that stores data runs a DataNode daemon.
4) ResourceManager (the Hadoop 1.x JobTracker): the JobTracker scheduled work onto the DataNodes; each DataNode ran a TaskTracker that did the actual work. In Hadoop 2+ the ResourceManager fills this scheduling role under YARN.
5) NodeManager (the Hadoop 1.x TaskTracker): executes the tasks.
6) DFSZKFailoverController: in an HA setup it monitors the NameNode's state and promptly writes that state into ZooKeeper. A dedicated thread periodically calls a health-check interface on the NameNode. The FailoverController also takes part in choosing the active NameNode; since there are at most two candidates, the current election policy is simple (first come, first served, with rotation).
7) JournalNode: in an HA setup, stores the NameNode's edit log files.
1. Environment preparation
CentOS 7
1.1 Cluster plan
Hostname IP Daemons
hd1 192.168.174.20 DataNode, NameNode
hd2 192.168.174.21 DataNode, SecondaryNameNode
hd3 192.168.174.22 DataNode
1.2 Initializing the cluster servers
1 Install the JDK
2 Permanently disable the firewall
3 Disable SELinux
4 Synchronize the clocks
5 Raise the maximum number of open files
1.2.1 Server initialization script
#!/bin/bash
echo "Installing the JDK..."
num=`rpm -qa | grep jdk | wc -l`
if [ $num -ge 1 ]
then
java -version
else
rpm -ivh /root/jdk-8u371-linux-x64.rpm
java -version
fi
echo "*********Checking time synchronization*********"
sleep 5
yum -y install ntpdate >/dev/null
num1=`crontab -l | grep ntpdate | wc -l`
if [ ${num1} -eq 0 ]
then
echo "Syncing against the Aliyun NTP server every ten minutes"
(crontab -l;echo "*/10 * * * * /usr/sbin/ntpdate ntp1.aliyun.com") | crontab
fi
crontab -l
echo "*********Checking selinux*********"
sleep 5
num2=`cat /etc/selinux/config | grep SELINUX=enforcing | wc -l`
if [ ${num2} -eq 1 ]
then
echo "selinux is enforcing, disabling it...."
sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config
fi
cat /etc/selinux/config
echo "*********Checking the firewall*********"
sleep 5
num3=`systemctl status firewalld.service | grep running | wc -l`
if [ ${num3} -eq 1 ]
then
echo "Firewall is running, stopping it...."
systemctl stop firewalld.service
echo "Disabling firewall autostart on boot...."
systemctl disable firewalld.service
fi
systemctl status firewalld.service
echo "*********Checking the max open files limit*********"
sleep 5
num4=`cat /etc/security/limits.conf | grep 65535 | wc -l`
if [ ${num4} -eq 0 ]
then
echo "Setting the max open files limit to 65535; please reboot"
echo "* soft nofile 65535" >> /etc/security/limits.conf
echo "* hard nofile 65535" >> /etc/security/limits.conf
fi
echo "Max open files limit is set to 65535"
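The "append only if missing" check the script applies to /etc/security/limits.conf can be exercised safely; this is just a sketch of the idiom against a throwaway file, not part of the original script:

```shell
# Demo of the idempotent-append idiom used for limits.conf,
# run against a temporary file instead of the real one.
limits_file=$(mktemp)
append_limits() {
  if [ "$(grep -c 65535 "$1")" -eq 0 ]; then
    echo "* soft nofile 65535" >> "$1"
    echo "* hard nofile 65535" >> "$1"
  fi
}
append_limits "$limits_file"
append_limits "$limits_file"   # second run is a no-op
grep -c 65535 "$limits_file"   # prints 2: the pair was added exactly once
```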
1.3 Passwordless SSH trust
How it works:
1) Server A generates a key pair with ssh-keygen.
2) A's public key is appended to the ~/.ssh/authorized_keys file on B.
3) When A connects over ssh, B generates a random challenge and encrypts it with A's public key.
4) A decrypts the challenge with its private key and sends the response back.
5) B checks the response against the stored public key; if it matches, A is logged in without a password.
Run on hd1:
1. ssh-keygen
2. cd ~/.ssh
3. cat id_rsa.pub >> authorized_keys
4. scp -r ~/.ssh hd2:~/
5. scp -r ~/.ssh hd3:~/
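For reference, the key generation can be done non-interactively, and ssh-copy-id is the usual alternative to copying the whole ~/.ssh directory. This sketch writes to a scratch directory so it never touches the real ~/.ssh; the hd1/hd2/hd3 hosts in the comment are this cluster's names:

```shell
# Non-interactive RSA key generation (empty passphrase), written to a
# scratch directory here so it does not touch the real ~/.ssh.
keydir=$(mktemp -d)
ssh-keygen -q -t rsa -N "" -f "$keydir/id_rsa"
ls "$keydir"
# Typical distribution (prompts for each host's password once):
# for h in hd1 hd2 hd3; do ssh-copy-id -i "$keydir/id_rsa.pub" root@$h; done
```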
2. Installing ZooKeeper
1) Upload and unpack zookeeper
[root@hd1 ~]# tar -zxvf apache-zookeeper-3.7.1-bin.tar.gz
2) Configure environment variables (on all three machines, add zookeeper's bin directory to PATH and apply the change)
[root@hd1 ~]# vi ~/.bash_profile
[root@hd1 ~]# source ~/.bash_profile
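The exact profile lines were not shown; assuming the tarball was unpacked under /root as in step 1, they would look like the following (the variable name and the install path are assumptions, match them to your layout):

```shell
# Hypothetical ~/.bash_profile additions for ZooKeeper
export ZOOKEEPER_HOME=/root/apache-zookeeper-3.7.1-bin
export PATH=${PATH}:${ZOOKEEPER_HOME}/bin
```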
3) Edit zookeeper's configuration file (conf/zoo.cfg)
Append the server entries at the end of the configuration file.
Do not leave trailing spaces after 3888.
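The entries themselves were elided above; for this three-node plan they would typically be the following (2888 is ZooKeeper's quorum port and 3888 its leader-election port; dataDir matches the directory created in the next step):

```
dataDir=/root/local/zk
server.1=hd1:2888:3888
server.2=hd2:2888:3888
server.3=hd3:2888:3888
```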
4) Create the dataDir directory (on all three machines)
[root@hd1 ~]# mkdir -p /root/local/zk
5) Distribute zookeeper to machines 2 and 3
scp -r /root/apache-zookeeper-3.7.1-bin hd2:/root
scp -r /root/apache-zookeeper-3.7.1-bin hd3:/root
6) Create the myid files. (ZooKeeper itself is a cluster, three nodes here; the node in the leader role handles external coordination. The leader is elected by voting based on the number in each machine's myid file [election mechanism: https://www.cnblogs.com/shuaiandjun/p/9383655.html]. Note that the numbers in the myid files on hd1/hd2/hd3 must match the server.n entries in zoo.cfg.)
ssh hd1 'echo 1 > /root/local/zk/myid'
ssh hd2 'echo 2 > /root/local/zk/myid'
ssh hd3 'echo 3 > /root/local/zk/myid'
7) Start the zookeeper process on all 3 servers (zkServer.sh start; zkServer.sh status shows each node's role)
3. Installing Hadoop / HDFS
1. Upload hadoop-3.1.3.tar.gz to hd1 and unpack it
2. Edit hadoop's configuration files hdfs-site.xml, core-site.xml, workers, and hadoop-env.sh (all under /root/hadoop-3.1.3/etc/hadoop)
workers (the list of servers that run a DataNode process)
hd1
hd2
hd3
hadoop-env.sh
# find the line below, uncomment it, and point it at the JDK you actually installed
export JAVA_HOME=/root/jdk1.8.0_212
core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/root/local/tmp</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>hd1:2181,hd2:2181,hd3:2181</value>
</property>
</configuration>
hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<!-- Set the HDFS nameservice to mycluster; must match core-site.xml -->
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
<!-- mycluster has two NameNodes: nn1 and nn2 -->
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
<!-- RPC address of nn1 -->
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>hd1:8020</value>
</property>
<!-- RPC address of nn2 -->
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>hd2:8020</value>
</property>
<!-- HTTP address of nn1 -->
<property>
<name>dfs.namenode.http-address.mycluster.nn1</name>
<value>hd1:50070</value>
</property>
<!-- HTTP address of nn2 -->
<property>
<name>dfs.namenode.http-address.mycluster.nn2</name>
<value>hd2:50070</value>
</property>
<!-- Where the NameNode's edits metadata is stored on the JournalNodes -->
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://hd1:8485;hd2:8485;hd3:8485/mycluster</value>
</property>
<!-- Where each JournalNode stores its data on local disk -->
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/root/local/jn</value>
</property>
<!-- Proxy provider that implements automatic failover on the client side -->
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<!-- Fencing method: ensures only one NameNode serves clients at a time -->
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<!-- The sshfence mechanism requires passwordless ssh -->
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/root/.ssh/id_rsa</value>
</property>
<!-- Enable automatic NameNode failover -->
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
</configuration>
3 Edit /root/hadoop-3.1.3/sbin/start-dfs.sh and /root/hadoop-3.1.3/sbin/stop-dfs.sh, adding the following at the top of each file:
HDFS_NAMENODE_USER=root
HDFS_DATANODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
HDFS_JOURNALNODE_USER=root
HDFS_ZKFC_USER=root
4 Distribute the entire hadoop directory from hd1 to hd2 and hd3
scp -r hadoop-3.1.3/ hd2:/root
scp -r hadoop-3.1.3/ hd3:/root
5 Create the hadoop.tmp.dir directory on each of the three machines
mkdir -p /root/local/tmp
6 Configure the hadoop environment variables (on all three machines if you can)
vi ~/.bash_profile
#HADOOP
export HADOOP_HOME=/root/hadoop-3.1.3
export PATH=${PATH}:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin
Apply it:
source ~/.bash_profile
7 Format the NameNode (run on hd1). Because dfs.namenode.shared.edits.dir points at the JournalNodes, they must be running before the format: on all three nodes run hdfs --daemon start journalnode, then on hd1:
hdfs namenode -format
8 Start the zookeeper cluster
excute.sh hdfs "zkServer.sh start"
9 Initialize the HA state in ZooKeeper (run on hd1)
hdfs zkfc -formatZK
10 Start hdfs (run on hd1). On the very first start, bring up the NameNode on hd1 (hdfs --daemon start namenode), run hdfs namenode -bootstrapStandby on hd2 so the standby copies the formatted metadata, then:
start-dfs.sh
11 Check the daemons with jps
4. Installing Hive and YARN
1) Edit the yarn-site.xml configuration file (under /root/hadoop-3.1.3/etc/hadoop; change the hostnames to your own)
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>cluster1</value>
</property>
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>hadoop1</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>hadoop2</value>
</property>
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>hadoop1:2181,hadoop2:2181,hadoop3:2181</value>
</property>
<property>
<name>yarn.resourcemanager.recovery.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm1</name>
<value>hadoop1:8088</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address.rm2</name>
<value>hadoop2:8030</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address.rm2</name>
<value>hadoop2:8088</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
</configuration>
2) Edit mapred-site.xml (adjust the paths to your own install; in Hadoop 3.x the file ships directly, so there is no .template to copy)
[root@node1 hadoop]# vi mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/root/hadoop-3.1.3</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=/root/hadoop-3.1.3</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=/root/hadoop-3.1.3</value>
</property>
</configuration>
3) Distribute the configuration files yarn-site.xml and mapred-site.xml from node1 to node2 and node3
[root@hadoop1 hadoop]# scp -r mapred-site.xml hadoop2:/root/hadoop-3.1.3/etc/hadoop
mapred-site.xml
100% 1202 1.8MB/s 00:00
[root@hadoop1 hadoop]# scp -r mapred-site.xml hadoop3:/root/hadoop-3.1.3/etc/hadoop
mapred-site.xml
100% 1202 1.4MB/s 00:00
[root@hadoop1 hadoop]# scp -r yarn-site.xml hadoop2:/root/hadoop-3.1.3/etc/hadoop
yarn-site.xml
100% 2034 1.9MB/s 00:00
[root@hadoop1 hadoop]# scp -r yarn-site.xml hadoop3:/root/hadoop-3.1.3/etc/hadoop
yarn-site.xml
4) Start yarn (the whole hdfs cluster must be running before starting yarn)
[root@hadoop1 sbin]# vi /root/hadoop-3.1.3/sbin/start-yarn.sh
Add the following at the top of the file (and do the same in stop-yarn.sh):
YARN_RESOURCEMANAGER_USER=root
YARN_NODEMANAGER_USER=root
[root@node1 hadoop]# start-yarn.sh
Deploying Hive
1) Upload the hive package and the mysql driver jar (mysql-connector-java-5.1.48.jar)
2) Unpack the hive package
3) Move the mysql driver jar (mysql-connector-java-5.1.48.jar) into hive's lib directory
4) Configure hive's environment variables
[root@hadoop1 ~]# vi ~/.bash_profile
Add the following:
#hive
export HIVE_HOME=/root/apache-hive-3.1.2-bin
export PATH=$PATH:${HIVE_HOME}/bin
source ~/.bash_profile
5) Edit hive's configuration file (change the hostname and password to your own)
[root@hadoop1 ~]# cd apache-hive-3.1.2-bin/conf/
[root@hadoop1 conf]# touch hive-site.xml
[root@hadoop1 conf]# vi hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://hadoop1:3306/metastore?useSSL=false</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
<property>
<name>datanucleus.schema.autoCreateAll</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://hadoop1:9083</value>
</property>
<property>
<name>hive.server2.thrift.port</name>
<value>10000</value>
</property>
<property>
<name>hive.server2.thrift.bind.host</name>
<value>hadoop1</value>
</property>
<property>
<name>hive.metastore.event.db.notification.api.auth</name>
<value>false</value>
</property>
</configuration>
6) In MySQL, create the metastore database that holds Hive's metadata (Hive's core data)
[root@node1 conf]# mysql -uroot -p'123456'
mysql> create database metastore charset utf8;
Query OK, 1 row affected (0.00 sec)
Grant privileges so that hive can log in to the database on node1 as root with password 123456:
mysql> grant all privileges on *.* to root@'%' identified by '123456';
Query OK, 0 rows affected, 1 warning (0.04 sec)
mysql> flush privileges;
Query OK, 0 rows affected (0.01 sec)
7) Initialize the metastore schema
[root@node1 conf]# schematool -initSchema -dbType mysql -verbose
Note: if this fails with
com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)
the guava jar bundled with Hive is older than Hadoop's; replace the guava jar under $HIVE_HOME/lib with the newer one from $HADOOP_HOME/share/hadoop/common/lib.
8) Start the metastore service
[root@node1 ~]# nohup hive --service metastore &   (just press Enter to get the prompt back)
9) Start hive (the whole cluster must be running first)
[root@node1 ~]# hive
which: no hbase in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin:/root/jdk1.8.0_212/bin:/root/hadoop-2.7.7/bin:/root/hadoop-2.7.7/sbin:/root/apache-zookeeper-3.5.7-bin/bin:/root/bin:/root/jdk1.8.0_212/bin:/root/hadoop-2.7.7/bin:/root/hadoop-2.7.7/sbin:/root/apache-zookeeper-3.5.7-bin/bin:/root/apache-hive-2.3.4-bin/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/apache-hive-2.3.4-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/hadoop-2.7.7/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/root/apache-hive-2.3.4-bin/lib/hive-common-2.3.4.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive>
Verify:
5. Installing the DataX data collection tool
6. Automated data collection
/root/datax/job/job.json
{
"job": {
"setting": {
"speed": {
"byte":10485760
},
"errorLimit": {
"record": 0,
"percentage": 0.02
}
},
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "root",
"password": "123456",
"column" : [
"id",
"order_id",
"order_status",
"operate_time"
],
"splitPk": "id",
"connection": [
{
"table": [
"order_status_log20200902"
],
"jdbcUrl": [
"jdbc:mysql://localhost:3306/wzy"
]
}
]
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"defaultFS": "hdfs://192.168.174.20:8020",
"fileType": "text",
"path": "/user/hive/warehouse/wzy.db/order_status_log20200902",
"fileName": "order_status_log20200902",
"column": [
{
"name": "id",
"type": "STRING"
},
{
"name": "order_id",
"type": "STRING"
},
{
"name": "order_status",
"type": "STRING"
},
{
"name": "operate_time",
"type": "STRING"
}
],
"writeMode": "append",
"fieldDelimiter": ",",
"compress": "GZIP"
}
}
}
]
}
}
Script for automated collection of database data
#!/bin/bash
start_date=`date -d "$1" +%s`
end_date=`date -d "$2" +%s`
if [ $start_date -gt $end_date ]
then
echo "Error: $1 > $2"
exit 1
else
for ((i=$start_date;i<=$end_date;i=i+86400))
do
postfix=`date -d @$i +%Y%m%d`
hive -e "use wzy;create table IF NOT EXISTS order_status_log${postfix}(id string,order_id string,order_status string,operate_time string) row format delimited fields terminated by ',';"
str=`grep log /root/datax/job/job.json|head -n 1|cut -d '_' -f 3 | cut -c 4-11`
sed -i "s/${str}/${postfix}/g" /root/datax/job/job.json
python /root/datax/bin/datax.py /root/datax/job/job.json
done
fi
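The two fiddly parts of this script, the day-by-day epoch loop and the grep/cut pipeline that recovers the current date suffix from job.json, can be dry-run without Hive or DataX; the fixed dates below are just sample inputs:

```shell
# Dry run of the script's core logic with fixed dates (no Hive/DataX needed).
start_date=$(TZ=UTC date -d "2020-09-01" +%s)
end_date=$(TZ=UTC date -d "2020-09-03" +%s)
days=""
for ((i=start_date; i<=end_date; i+=86400)); do
  days="$days $(TZ=UTC date -d @"$i" +%Y%m%d)"
done
days=${days# }
echo "$days"   # 20200901 20200902 20200903

# The table line in job.json looks like "order_status_log20200902":
# field 3 of the '_' split is log20200902, and characters 4-11 are the date.
line='"order_status_log20200902"'
suffix=$(echo "$line" | cut -d '_' -f 3 | cut -c 4-11)
echo "$suffix"   # 20200902
```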
Verify:
Scripts involved:
1 Server initialization script
(identical to the server initialization script in 1.2.1 above)
2 Cluster execution script
#!/bin/bash
server=${1}
cmd=${2}
conf=/root/cluster.conf
if [ -f ${conf} ]
then
for ip in `cat ${conf} | grep -i ${server} | cut -d "," -f 1`
do
echo "----------${ip}------------"
ssh ${ip} "source ~/.bash_profile;${cmd}"
done
else
echo "${conf} does not exist, please check"
fi
3 Cluster distribution script
cluster.conf file:
192.168.174.10,SERVER1,SERVER2,SERVER3
192.168.174.11,SERVER1,SERVER2,SERVER3
192.168.174.12,SERVER1,SERVER2,SERVER3
192.168.174.13,SERVER1,SERVER2
192.168.174.14,SERVER1
192.168.174.15,SERVER1
#!/bin/bash
server=${1}
src=${2}
url=${3}
conf=/root/cluster.conf
if [ -f ${conf} ]
then
for ip in `cat ${conf} | grep -i ${server} | cut -d "," -f 1`
do
echo "----------${ip}------------"
scp -r ${src} root@${ip}:${url}
done
else
echo "${conf} does not exist, please check"
fi
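Both scripts above select hosts with the same grep/cut pipeline over cluster.conf; it can be checked offline against a scratch copy of the file (the sample rows mirror the listing above):

```shell
# Offline check of the host-selection pipeline used by the exec and
# distribution scripts, against a temporary cluster.conf.
conf=$(mktemp)
cat > "$conf" <<'EOF'
192.168.174.10,SERVER1,SERVER2,SERVER3
192.168.174.11,SERVER1,SERVER2,SERVER3
192.168.174.13,SERVER1,SERVER2
192.168.174.14,SERVER1
EOF
ips=$(grep -i SERVER2 "$conf" | cut -d "," -f 1)
echo "$ips"   # the three hosts tagged SERVER2, one per line
```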
4 Database backup
#!/bin/bash
dblist=`mysql -uroot -p123456 -s -e "show databases;" | grep -v Database`
date=`date +%F`
for db in ${dblist}
do
tables=`mysql -uroot -p123456 -s -e "use ${db};show tables;" | grep -v Tables_in_`
for table in ${tables}
do
mkdir -p /root/${date}/${db}
mysqldump -uroot -p123456 ${db} ${table} > /root/${date}/${db}/${table}.sql
done
done
Run a backup every night at midnight (e.g. from cron).
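A crontab entry like the following would schedule it (the path /root/mysql_backup.sh is a hypothetical name for the script above; adjust it and the log path to your layout):

```
0 0 * * * /root/mysql_backup.sh >> /var/log/mysql_backup.log 2>&1
```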
5 Script for collecting text log data
#!/bin/bash
# Directory where the raw log files are written (create it by hand)
log_src_dir=/export/data/logs/log/
# Staging directory for files waiting to be uploaded (create it by hand)
log_toupload_dir=/export/data/logs/toupload/
date1=`date -d last-day +%Y_%m_%d`
hdfs_root_dir=/data/clickLog/$date1/
echo "envs:hadoop_home:$HADOOP_HOME"
echo "log_src_dir:"$log_src_dir
ls $log_src_dir | while read fileName
do
if [[ "${fileName}" == access.log.* ]]
then
date=`date +%Y_%m_%d_%H_%M_%S`
echo "moving $log_src_dir$fileName to $log_toupload_dir"xxxxx_click_log_$fileName"$date"
mv $log_src_dir$fileName $log_toupload_dir"xxxxx_click_log_$fileName"$date
echo $log_toupload_dir"xxxxx_click_log_$fileName"$date >> $log_toupload_dir"willDoing."$date
fi
done
ls $log_toupload_dir | grep will | grep -v "_COPY_" | grep -v "_DONE_" | while read line
do
echo "toupload is in file:"$line
mv $log_toupload_dir$line $log_toupload_dir$line"_COPY_"
# read into a differently named variable so the outer $line is not clobbered
cat $log_toupload_dir$line"_COPY_" | while read srcfile
do
echo "uploading $srcfile to hdfs path $hdfs_root_dir"
hadoop fs -mkdir -p $hdfs_root_dir
hadoop fs -put $srcfile $hdfs_root_dir
done
mv $log_toupload_dir$line"_COPY_" $log_toupload_dir$line"_DONE_"
done
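The script's selection and renaming rules can be dry-run on a hand-made file list, without real log directories or HDFS; the timestamp below is a fixed sample value:

```shell
# Dry run of the upload script's rules: only files matching access.log.*
# are picked up, and each is renamed with a prefix plus a timestamp.
date=2020_09_02_23_59_59
picked=""
for fileName in access.log.1 access.log.2 error.log; do
  if [[ "$fileName" == access.log.* ]]; then
    picked="$picked xxxxx_click_log_${fileName}${date}"
  fi
done
picked=${picked# }
echo "$picked"
```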
6 Script for automated collection of database data
(identical to the collection script in section 6 above)
7 Database (MySQL) installation script
#!/bin/bash
#put the mysql tar package under basedir:/usr/local
#datadir=/data/mysql change it at line 45
#basedir=/usr/local change it at line 25
echo -e "if this is not your first mysql install:\n
stop mysql first (service mysql stop or systemctl stop mysql),\n
run rm -rf /data/mysql /etc/my.cnf,\n
put the mysql tar package under basedir,\n
and execute this script again.\n"
#install libaio
echo "===============install libaio...=================="
yum list installed | grep libaio > /tmp/check.txt
if test -s /tmp/check.txt
then
echo -e "\nlibaio already installed"
echo -e "\ncomplete"
else
yum -y install libaio
echo -e "\ncomplete"
fi
#install mysql
echo "===============untar mysql...=================="
basedir=/usr/local
if test -d ${basedir}/mysql
then
echo -e "\nmysql already unpacked"
echo -e "\ncomplete"
else
tar -zxf ${basedir}/mysql-5.7.11-linux-glibc2.5-x86_64.tar.gz -C ${basedir}
mv ${basedir}/mysql-5.7.11-linux-glibc2.5-x86_64 ${basedir}/mysql
echo -e "\ncomplete"
fi
#create mysql user
echo "===============create mysql user=================="
if id mysql >/dev/null 2>&1; then
echo -e "\nmysql user already exists"
else
echo -e "\ncreating mysql user"
groupadd mysql
useradd -r -g mysql mysql
echo -e "\ncomplete"
fi
#make datadir
echo "===============make mysql datadir=================="
datadir=/data/mysql
if [ -d ${datadir} ];then
echo -e "/data/mysql already exists\nplease remove it or point to a new datadir"
exit 1
else
mkdir -p /data/mysql
chown mysql:mysql -R /data/mysql
echo -e "\ncomplete"
fi
ln -s ${basedir}/mysql/bin/* /usr/bin
#set server_id
echo "===============set server_id=================="
server_id=$RANDOM
echo -e "\ncomplete"
#generate my.cnf file
echo "===============generate my.cnf file=================="
cat >>/etc/my.cnf<<EOF
[mysqld]
bind-address=0.0.0.0
port=3306
user=mysql
skip-grant-tables
basedir=${basedir}/mysql
datadir=/data/mysql
socket=/tmp/mysql.sock
log-error=${datadir}/mysql.err
pid-file=${datadir}/mysql.pid
#character config
character_set_server=utf8mb4
symbolic-links=0
explicit_defaults_for_timestamp=true
EOF
#initialize data
echo "===============initialize data=================="
mysqld --defaults-file=/etc/my.cnf --basedir=${basedir}/mysql/ --datadir=${datadir} --user=mysql --initialize
echo -e "\ncomplete"
#start mysql
echo "===============start mysql...=================="
cp ${basedir}/mysql/support-files/mysql.server /etc/init.d/mysql
sleep 5
service mysql start
echo -e "\ncomplete"
#set passwd
echo "===============set passwd:123456=================="
mysql -e "use mysql;update user set host = '%' where user = 'root';FLUSH PRIVILEGES;"
mysql -e "SET PASSWORD = PASSWORD('123456');ALTER USER 'root'@'localhost' PASSWORD EXPIRE NEVER;FLUSH PRIVILEGES;"
service mysql restart
sed -i "s/skip-grant-tables/#skip-grant-tables/g" /etc/my.cnf
echo -e "\ncomplete"
chkconfig --add mysql
chkconfig mysql on
echo -e "\nMYSQL INSTALL SUCCESS!"