Hadoop Big Data Analysis Platform

I. Building the Hadoop Big Data Platform

Distributed: a distributed architecture breaks one service into n smaller services, which makes the project easier to develop and maintain.

0. Introduction to Hadoop

Hadoop is a distributed big data infrastructure whose core components are HDFS and MapReduce. It is used for offline and real-time computation over massive data sets (as a rule of thumb, anything above 500 GB is worth considering).

0.1 Basic Hadoop Daemons

1) NameNode: the master server in Hadoop. It manages the file system namespace and access to the files stored in the cluster, and holds the metadata.

2) SecondaryNameNode: not a redundant standby for the NameNode; it performs periodic checkpoints and housekeeping, helping the NameNode merge the edits log and shortening NameNode startup time.

3) DataNode: manages the storage attached to its node (a cluster can have many nodes). Every node that stores data runs a DataNode daemon.

4) ResourceManager (JobTracker): the JobTracker schedules work on the DataNodes. Each DataNode runs a TaskTracker, which carries out the actual work.

5) NodeManager (TaskTracker): executes tasks.

6) DFSZKFailoverController: in an HA setup it monitors the state of the NameNodes and promptly writes that state into ZooKeeper. A dedicated thread periodically calls a specific interface on the NameNode to check its health. The FailoverController also decides which NameNode becomes Active; since there are at most two NameNodes, the current election strategy is simple (first come first served, with rotation).

7) JournalNode: in an HA setup it stores the NameNode's edit log files.

1. Environment Preparation

CentOS 7

1.1 Cluster Plan

Hostname   IP                  Processes
hd1        192.168.174.20      DataNode, NameNode
hd2        192.168.174.21      DataNode, SecondaryNameNode
hd3        192.168.174.22      DataNode
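The guide assumes these hostnames resolve on every node. A minimal /etc/hosts addition based on the plan above (run on all three machines) could look like this:

cat >> /etc/hosts <<'EOF'
192.168.174.20 hd1
192.168.174.21 hd2
192.168.174.22 hd3
EOF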

1.2 Initialize the Cluster Servers

1 Install the JDK

2 Permanently disable the firewall

3 Disable SELinux

4 Synchronize the system clock

5 Raise the maximum number of open files

1.2.1 Server Initialization Script
#!/bin/bash
echo "Installing JDK..."
num=`rpm -qa | grep jdk | wc -l`
if [ $num -ge 1 ]
then
  java -version
else
  rpm -ivh /root/jdk-8u371-linux-x64.rpm
  java -version
fi
echo "********* Checking time synchronization *********"
sleep 5
yum -y install ntpdate >/dev/null
num1=`crontab -l | grep ntpdate | wc -l`
if [ ${num1} -eq 0 ]
then
echo "Syncing time from the Aliyun NTP server every 10 minutes"
(crontab -l;echo "*/10 * * * * /usr/sbin/ntpdate ntp1.aliyun.com") | crontab -
fi
crontab -l
echo "********* Checking SELinux *********"
sleep 5
num2=`cat /etc/selinux/config | grep SELINUX=enforcing | wc -l`
if [ ${num2} -eq 1 ]
then
echo "SELinux is not disabled, disabling it now..."
sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config
fi
cat /etc/selinux/config
echo "********* Checking the firewall *********"
sleep 5
num3=`systemctl status firewalld.service | grep running | wc -l`
if [ ${num3} -eq 1 ]
then
echo "The firewall is running, stopping it..."
systemctl stop firewalld.service
echo "Disabling the firewall at boot..."
systemctl disable firewalld.service
fi
systemctl status firewalld.service
echo "********* Checking the maximum number of open files *********"
sleep 5
num4=`cat /etc/security/limits.conf | grep 65535 | wc -l`
if [ ${num4} -eq 0 ]
then
echo "Setting the maximum number of open files to 65535, please reboot"
echo "* soft nofile 65535" >> /etc/security/limits.conf
echo "* hard nofile 65535" >> /etc/security/limits.conf
fi
echo "Maximum number of open files is set to 65535"

1.3 Passwordless SSH Trust

How it works:

(diagram: SSH public-key authentication flow)

1) Server A generates a key pair with ssh-keygen.

2) A appends its public key to the ~/.ssh/authorized_keys file on B.

3) When A connects to B over SSH, B generates a random challenge, encrypts it with A's public key and sends it to A.

4) A decrypts the challenge with its private key and sends back a proof derived from it.

5) B verifies the proof; if it matches, A is logged in without a password.

Run on hd1:

1. ssh-keygen -t rsa
2. cd ~/.ssh
3. cat id_rsa.pub >> authorized_keys
4. scp -r ~/.ssh hd2:~/
5. scp -r ~/.ssh hd3:~/
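A quick way to confirm the trust works from hd1 (BatchMode makes ssh fail instead of prompting for a password):

for h in hd1 hd2 hd3; do ssh -o BatchMode=yes "$h" hostname; done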

2. Install ZooKeeper

1) Upload and extract ZooKeeper

[root@hd1 ~]# tar -zxvf apache-zookeeper-3.7.1-bin.tar.gz

2) Configure the environment variables (add ZooKeeper's bin directory to PATH on all three machines and source the profile)

[root@hd1 ~]# vi ~/.bash_profile

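The screenshot is omitted here; a sketch of the intended lines, assuming ZooKeeper was unpacked to /root/apache-zookeeper-3.7.1-bin (the ZOOKEEPER_HOME variable name is illustrative):

#ZOOKEEPER
export ZOOKEEPER_HOME=/root/apache-zookeeper-3.7.1-bin
export PATH=${PATH}:${ZOOKEEPER_HOME}/bin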

[root@hd1 ~]# source ~/.bash_profile

3) Edit the ZooKeeper configuration file (copy conf/zoo_sample.cfg to conf/zoo.cfg and open it for editing)

Append the following at the end of the file (do not leave trailing spaces after 3888):
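The original screenshots are omitted; a sketch of the intended zoo.cfg settings, assuming the dataDir created in step 4 and the myid values assigned in step 6:

dataDir=/root/local/zk
clientPort=2181
server.1=hd1:2888:3888
server.2=hd2:2888:3888
server.3=hd3:2888:3888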

4) Create the dataDir directory (on all three machines)

[root@hd1 ~]# mkdir -p /root/local/zk

5) Distribute ZooKeeper to hd2 and hd3

scp -r /root/apache-zookeeper-3.7.1-bin hd2:/root

scp -r /root/apache-zookeeper-3.7.1-bin hd3:/root

6) Create the myid file. (ZooKeeper itself is a cluster, three nodes here, and clients talk to the node elected as leader; the leader is chosen by a vote based on the number in each node's myid file [election mechanism: https://www.cnblogs.com/shuaiandjun/p/9383655.html]. Note that the numbers on hd1, hd2 and hd3 must match the server.N entries in zoo.cfg.)

ssh hd1 'echo 1 > /root/local/zk/myid'

ssh hd2 'echo 2 > /root/local/zk/myid'

ssh hd3 'echo 3 > /root/local/zk/myid'

7) Start the ZooKeeper process (on all three servers)
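For example (the cluster execution script listed at the end of this post can also be used to run this on every node at once):

zkServer.sh start
zkServer.sh status    # one node should report Mode: leader, the other two Mode: follower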

3. Install Hadoop / HDFS

1. Upload and extract hadoop-3.1.3.tar.gz on hd1

2. Edit the Hadoop configuration files hdfs-site.xml, core-site.xml, workers and hadoop-env.sh (all under /root/hadoop-3.1.3/etc/hadoop)

workers (the list of servers that run the DataNode process)

hd1
hd2
hd3

hadoop-env.sh

# Find the following line, uncomment it, and set it to your JDK installation path
export JAVA_HOME=/root/jdk1.8.0_212

core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://mycluster</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/root/local/tmp</value>
  </property>
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>hd1:2181,hd2:2181,hd3:2181</value>
  </property>
</configuration>

hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <!-- The HDFS nameservice is mycluster; it must match the value in core-site.xml -->
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <!-- mycluster has two NameNodes, nn1 and nn2 -->
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <!-- RPC address of nn1 -->
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>hd1:8020</value>
  </property>
  <!-- RPC address of nn2 -->
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>hd2:8020</value>
  </property>
  <!-- HTTP address of nn1 -->
  <property>
    <name>dfs.namenode.http-address.mycluster.nn1</name>
    <value>hd1:50070</value>
  </property>
  <!-- HTTP address of nn2 -->
  <property>
    <name>dfs.namenode.http-address.mycluster.nn2</name>
    <value>hd2:50070</value>
  </property>
  <!-- Where the NameNode's edits metadata is stored on the JournalNodes -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://hd1:8485;hd2:8485;hd3:8485/mycluster</value>
  </property>
  <!-- Where the JournalNodes store their data on local disk -->
  <property>
    <name>dfs.journalnode.edits.dir</name>
    <value>/root/local/jn</value>
  </property>
  <!-- How clients locate the active NameNode after a failover -->
  <property>
    <name>dfs.client.failover.proxy.provider.mycluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  <!-- Fencing method, so that only one NameNode serves clients at any given time -->
  <property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence</value>
  </property>
  <!-- sshfence requires passwordless SSH; point it at the private key -->
  <property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
    <value>/root/.ssh/id_rsa</value>
  </property>
  <!-- Enable automatic NameNode failover -->
  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
</configuration>

3. Edit /root/hadoop-3.1.3/sbin/start-dfs.sh and /root/hadoop-3.1.3/sbin/stop-dfs.sh and add the following at the top of both files:

HDFS_NAMENODE_USER=root
HDFS_DATANODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
HDFS_JOURNALNODE_USER=root
HDFS_ZKFC_USER=root

4. Distribute the entire Hadoop directory from hd1 to hd2 and hd3

scp -r hadoop-3.1.3/ hd2:/root

scp -r hadoop-3.1.3/ hd3:/root

5. Create the hadoop.tmp.dir directory on all three machines

mkdir -p /root/local/tmp
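Once SSH trust is in place this can be done from hd1 in one line:

for h in hd1 hd2 hd3; do ssh "$h" "mkdir -p /root/local/tmp"; done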

6. Configure the Hadoop environment variables (preferably on all three machines)

vi ~/.bash_profile
#HADOOP
export HADOOP_HOME=/root/hadoop-3.1.3
export PATH=${PATH}:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin

Apply the changes:

source ~/.bash_profile

7. Format the NameNode (run on hd1)

hdfs namenode -format
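Note: with the QJM-based HA configuration above, formatting writes the shared edit log to the JournalNodes, so they must be reachable. If the format fails for that reason, start one on each node first and retry:

hdfs --daemon start journalnode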

8. Start the ZooKeeper cluster (here via the cluster execution script listed at the end of this post)

excute.sh hdfs "zkServer.sh start"

9. Initialize the HA state in ZooKeeper and restart the ZooKeeper cluster (run on hd1)

hdfs zkfc -formatZK

10. Start HDFS (run on hd1)

start-dfs.sh
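If the second NameNode on hd2 refuses to start because its metadata directory is empty, bootstrap it once from the active NameNode (run on hd2 while the NameNode on hd1 is up), then run start-dfs.sh again:

hdfs namenode -bootstrapStandby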

11. Check the processes

Run jps on each node; with the configuration above you should see NameNode (on hd1 and hd2), DataNode, JournalNode, DFSZKFailoverController and the ZooKeeper QuorumPeerMain process (screenshot omitted).

4. Install Hive and YARN

1) Edit the yarn-site.xml configuration file (under /root/hadoop-3.1.3/etc/hadoop; replace the hostnames with your own)

<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>cluster1</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>hadoop1</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>hadoop2</value>
  </property>
  <property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>hadoop1:2181,hadoop2:2181,hadoop3:2181</value>
  </property>
  <property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <!-- The rm1/rm2 addresses must point at the hosts that actually run rm1/rm2 -->
  <property>
    <name>yarn.resourcemanager.webapp.address.rm1</name>
    <value>hadoop1:8088</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address.rm2</name>
    <value>hadoop2:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address.rm2</name>
    <value>hadoop2:8088</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
</configuration>

2) Edit mapred-site.xml (adjust the paths to your install). In Hadoop 3.x the file already exists under etc/hadoop (there is no mapred-site.xml.template), so open it directly:

[root@node1 hadoop]# vi mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=/root/hadoop-3.1.3</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=/root/hadoop-3.1.3</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=/root/hadoop-3.1.3</value>
  </property>
</configuration>

3) Copy the updated yarn-site.xml and mapred-site.xml from node1 to node2 and node3

[root@hadoop1 hadoop]# scp -r mapred-site.xml hadoop2:/root/hadoop-3.1.3/etc/hadoop
mapred-site.xml
100% 1202 1.8MB/s 00:00
[root@hadoop1 hadoop]# scp -r mapred-site.xml hadoop3:/root/hadoop-3.1.3/etc/hadoop
mapred-site.xml
100% 1202 1.4MB/s 00:00
[root@hadoop1 hadoop]# scp -r yarn-site.xml hadoop2:/root/hadoop-3.1.3/etc/hadoop
yarn-site.xml
100% 2034 1.9MB/s 00:00
[root@hadoop1 hadoop]# scp -r yarn-site.xml hadoop3:/root/hadoop-3.1.3/etc/hadoop
yarn-site.xml

4) Start YARN (the HDFS cluster must already be running before YARN is started)

[root@hadoop1 sbin]# vi /root/hadoop-3.1.3/sbin/start-yarn.sh
Add the following lines at the top (do the same in stop-yarn.sh):
YARN_RESOURCEMANAGER_USER=root
YARN_NODEMANAGER_USER=root
[root@node1 hadoop]# start-yarn.sh
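A quick check that ResourceManager HA came up (rm1/rm2 are the ids configured in yarn-site.xml; one should report active, the other standby):

yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2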

Deploying Hive

1) Upload the Hive package and the MySQL driver (mysql-connector-java-5.1.48.jar)

2) Extract the Hive package

3) Move the MySQL driver (mysql-connector-java-5.1.48.jar) into Hive's lib directory

4) Configure the Hive environment variables

[root@hadoop1 ~]# vi ~/.bash_profile
Add the following:
#hive
export HIVE_HOME=/root/apache-hive-3.1.2-bin
export PATH=$PATH:${HIVE_HOME}/bin

source ~/.bash_profile

5) Create the Hive configuration file (adjust the hostname and password to your environment)

[root@hadoop1 ~]# cd apache-hive-3.1.2-bin/conf/
[root@hadoop1 conf]# touch hive-site.xml
[root@hadoop1 conf]# vi hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://hadoop1:3306/metastore?useSSL=false</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
  <property>
    <name>datanucleus.schema.autoCreateAll</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://hadoop1:9083</value>
  </property>
  <property>
    <name>hive.server2.thrift.port</name>
    <value>10000</value>
  </property>
  <property>
    <name>hive.server2.thrift.bind.host</name>
    <value>hadoop1</value>
  </property>
  <property>
    <name>hive.metastore.event.db.notification.api.auth</name>
    <value>false</value>
  </property>
</configuration>

6) In MySQL, create the metastore database that stores Hive's metadata

[root@node1 conf]# mysql -uroot -p'123456'
mysql> create database metastore charset utf8;
Query OK, 1 row affected (0.00 sec)
Grant privileges so that Hive can log in to the database on node1 as root with password 123456:
mysql> grant all privileges on *.* to root@'%' identified by '123456';
Query OK, 0 rows affected, 1 warning (0.04 sec)
mysql> flush privileges;
Query OK, 0 rows affected (0.01 sec)

7) Initialize the metastore schema

[root@node1 conf]# schematool -initSchema -dbType mysql -verbose
Note: if this step fails with an error mentioning
com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)
the cause is a guava version conflict between Hive and Hadoop; the fix is to replace Hive's guava jar with the newer one shipped with Hadoop.
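One common way to do that, assuming the directory layout used in this guide (the exact guava version numbers depend on your downloads):

rm /root/apache-hive-3.1.2-bin/lib/guava-*.jar
cp /root/hadoop-3.1.3/share/hadoop/common/lib/guava-*.jar /root/apache-hive-3.1.2-bin/lib/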

8) Start the metastore service

[root@node1 ~]# nohup hive --service metastore &    (press Enter to get the prompt back)

9) Start Hive (the whole cluster must be running before starting Hive)

[root@node1 ~]# hive
which: no hbase in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin:/root/jdk1.8.0_212/bin:/root/hadoop-2.7.7/bin:/root/hadoop-2.7.7/sbin:/root/apache-zookeeper-3.5.7-bin/bin:/root/bin:/root/jdk1.8.0_212/bin:/root/hadoop-2.7.7/bin:/root/hadoop-2.7.7/sbin:/root/apache-zookeeper-3.5.7-bin/bin:/root/apache-hive-2.3.4-bin/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/apache-hive-2.3.4-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/hadoop-2.7.7/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/root/apache-hive-2.3.4-bin/lib/hive-common-2.3.4.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions.
Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive>

Verification

(screenshot omitted)

5. Install the DataX Data Collection Tool
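The original post gives no commands for this step; a minimal sketch, assuming the DataX release tarball has been uploaded to /root (the later sections use /root/datax as the install path):

tar -zxvf /root/datax.tar.gz -C /root
python /root/datax/bin/datax.py /root/datax/job/job.json    # run the bundled self-check job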

6. Automated Data Collection

/root/datax/job/job.json

{
    "job": {
        "setting": {
            "speed": {
                "byte":10485760
            },
            "errorLimit": {
                "record": 0,
                "percentage": 0.02
            }
        },
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                    "username": "root",
                    "password": "123456",
                        "column" : [
                                        "id",
                                        "order_id",
                                        "order_status",
                                        "operate_time"
                        ],
                        "splitPk": "id",
                        "connection": [
                                        {
                        "table": [
                                "order_status_log20200902"
                        ],
                        "jdbcUrl": [
                                "jdbc:mysql://localhost:3306/wzy"
                                ]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "defaultFS": "hdfs://192.168.174.20:8020",
                        "fileType": "text",
                        "path": "/user/hive/warehouse/wzy.db/order_status_log20200902",
                        "fileName": "order_status_log20200902",
                        "column": [
                                {
                                        "name": "id",
                                        "type": "STRING"
                                },
                                {

                                        "name": "order_id",
                                        "type": "STRING"
                                },
                                {
                                        "name": "order_status",
                                        "type": "STRING"
                                },
                                {
                                        "name": "operate_time",
                                        "type": "STRING"
                                }
                        ],
                                        "writeMode": "append",
                                        "fieldDelimiter": ",",
                                        "compress": "GZIP"
                    }
                }
            }
        ]
    }
}

Script for automatically collecting the database data:

#!/bin/bash
# Usage: <script> <start-date> <end-date>, e.g. 2020-09-01 2020-09-03
start_date=`date -d "$1" +%s`
end_date=`date -d "$2" +%s`
if [ $start_date -gt $end_date ]
then
        echo "Error:$1>$2"
        exit 1
else
        # step one day (86400 seconds) at a time
        for ((i=$start_date;i<=$end_date;i=i+86400))
        do
          postfix=`date -d @$i +%Y%m%d`
          # create the Hive table for that day if it does not exist yet
          hive -e "use wzy;create table IF NOT EXISTS order_status_log${postfix}(id string,order_id string,order_status string,operate_time string) row format delimited fields terminated by ',';"
          # read the date suffix currently written in job.json and replace it with the new one
          str=`grep log /root/datax/job/job.json|head -n 1|cut -d '_' -f 3 | cut -c 4-11`
          sed -i "s/${str}/${postfix}/g"  /root/datax/job/job.json
          # run the DataX job for that day's table
          python /root/datax/bin/datax.py /root/datax/job/job.json
        done
fi
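Assuming the script above is saved as /root/collect_orders.sh (an illustrative name), collecting the tables from 2020-09-01 through 2020-09-03 would be:

bash /root/collect_orders.sh 2020-09-01 2020-09-03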

Verification:

(screenshots omitted)

Scripts referenced in this guide:

1. Server initialization script

(identical to the initialization script shown in section 1.2.1 above)

2. Cluster execution script

#!/bin/bash
# Run a command on every host in /root/cluster.conf whose line contains the given server tag.
server=${1}
cmd=${2}
conf=/root/cluster.conf
if [ -f ${conf} ]
then
  for ip in `cat ${conf} | grep -i ${server} | cut -d "," -f 1`
  do
    echo "----------${ip}------------"
    ssh ${ip} "source ~/.bash_profile;${cmd}"
  done
else
  echo "${conf} does not exist, please check"
fi
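Assuming it is saved as /root/excute.sh (the name used in step 8 of the HDFS installation) and /root/cluster.conf exists, running jps on every host tagged SERVER1 would look like:

bash /root/excute.sh SERVER1 "jps"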

3. Cluster distribution script

The cluster.conf file:

192.168.174.10,SERVER1,SERVER2,SERVER3
192.168.174.11,SERVER1,SERVER2,SERVER3
192.168.174.12,SERVER1,SERVER2,SERVER3
192.168.174.13,SERVER1,SERVER2
192.168.174.14,SERVER1
192.168.174.15,SERVER1
The distribution script:

#!/bin/bash
# Copy a file or directory to every host in /root/cluster.conf whose line contains the given server tag.
server=${1}
src=${2}
url=${3}
conf=/root/cluster.conf
if [ -f ${conf} ]
then
  for ip in `cat ${conf} | grep -i ${server} | cut -d "," -f 1`
  do
    echo "----------${ip}------------"
    scp -r ${src} root@${ip}:${url}
  done
else
  echo "${conf} does not exist, please check"
fi
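Assuming it is saved as /root/distribute.sh (an illustrative name), pushing the Hadoop directory to every host tagged SERVER1 would be:

bash /root/distribute.sh SERVER1 /root/hadoop-3.1.3 /root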

4. Database backup script

#!/bin/bash
# Dump every table of every user database into /root/<date>/<db>/<table>.sql
dblist=`mysql -uroot -p123456 -s -e "show databases;" | grep -v Database | grep -Ev "information_schema|performance_schema|sys"`
date=`date +%F`
for db in ${dblist}
do
	tables=`mysql -uroot -p123456 -s -e "use ${db};show tables;" | grep -v Tables_in_`
	for table in ${tables}
	do
		mkdir -p /root/${date}/${db}
		mysqldump -uroot -p123456 ${db} ${table} > /root/${date}/${db}/${table}.sql
	done
done

Back up once every night at midnight (00:00):

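The screenshot is omitted; a corresponding crontab entry, assuming the backup script is saved as /root/mysql_backup.sh (an illustrative name):

0 0 * * * /bin/bash /root/mysql_backup.sh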

5. Log file collection script (uploads collected text logs to HDFS)

#!/bin/bash
# Directory where the raw log files are written (create it by hand)
log_src_dir=/export/data/logs/log/
# Staging directory for files waiting to be uploaded (create it by hand)
log_toupload_dir=/export/data/logs/toupload/
date1=`date -d last-day +%Y_%m_%d`
hdfs_root_dir=/data/clickLog/$date1/
echo "envs:hadoop_home:$HADOOP_HOME"
echo "log_src_dir:"$log_src_dir
# Move finished access.log.* files into the staging directory and record them in a willDoing list
ls $log_src_dir | while read fileName
do
  if [[ "${fileName}" == access.log.* ]]
  then
    date=`date +%Y_%m_%d_%H_%M_%S`
    echo "moving $log_src_dir$fileName to $log_toupload_dir"xxxxx_click_log_$fileName"$date"
    mv $log_src_dir$fileName $log_toupload_dir"xxxxx_click_log_$fileName"$date
    echo $log_toupload_dir"xxxxx_click_log_$fileName"$date >> $log_toupload_dir"willDoing."$date
  fi
done
# Upload every file listed in the pending willDoing lists to HDFS
ls $log_toupload_dir | grep will | grep -v "_COPY_" | grep -v "_DONE_" | while read line
do
  echo "toupload is in file:"$line
  mv $log_toupload_dir$line $log_toupload_dir$line"_COPY_"
  cat $log_toupload_dir$line"_COPY_" | while read file
  do
    echo "putting..$file to hdfs path.....$hdfs_root_dir"
    hadoop fs -mkdir -p $hdfs_root_dir
    hadoop fs -put $file $hdfs_root_dir
  done
  mv $log_toupload_dir$line"_COPY_" $log_toupload_dir$line"_DONE_"
done

6. Database data collection script

(identical to the collection script shown in section 6 above)

7. MySQL installation script

#!/bin/bash
# Put the MySQL tar package under basedir: /usr/local
# datadir=/data/mysql  -- change it where datadir is set below
# basedir=/usr/local   -- change it where basedir is set below

echo -e "If MySQL has been installed on this machine before,\n
stop it and clean up first, put the MySQL tar package under basedir,\n
and then run this script again:\n
service mysql stop   (or: systemctl stop mysql)\n
rm -rf /data/mysql /etc/my.cnf\n"

#install libaio if it is missing
echo "===============install libaio...=================="
yum list installed | grep libaio > /tmp/check.txt
if test -s /tmp/check.txt
then
    echo -e "\nlibaio already installed"
    echo -e "\ncomplete"
else
    yum -y install libaio
    echo -e "\ncomplete"
fi 

#install mysql
echo "===============uzip mysql...=================="
basedir=/usr/local
if test -d ${basedir}/mysql
then
    echo -e "\nmysql already uzip"
    echo -e "\ncomplete"
else
	tar -zxf ${basedir}/mysql-5.7.11-linux-glibc2.5-x86_64.tar.gz -C ${basedir}
	mv ${basedir}/mysql-5.7.11-linux-glibc2.5-x86_64 ${basedir}/mysql
	echo -e "\ncomplete"
fi

#create mysql user
echo "===============create mysql user=================="
id mysql
if [ $? -eq 1 ];then
	echo -e "\ncreating mysql user"
	groupadd mysql
	useradd -r -g mysql mysql
	echo -e "\ncomplete"
else
	echo -e "\nmysql user has already been created by another one"
fi

#mk datadir
echo "===============make mysql datadir=================="
datadir=/data/mysql
if [ -d ${datadir} ];then
	echo "/data/mysql has already exited\nplease drop or point a new datadir"
	exit 0
else
	mkdir -p /data/mysql
	chown mysql:mysql -R /data/mysql
	echo -e "\ncomplete"
fi
ln -s  ${basedir}/mysql/bin/*    /usr/bin

#set server_id
echo "===============set server_id=================="
server_id=$RANDOM
echo -e "\ncomplete"

#generate my.cnf file
echo "===============generate my.cnf file=================="
cat >>/etc/my.cnf<<EOF
[mysqld]
bind-address=0.0.0.0
port=3306
server-id=${server_id}
user=mysql
skip-grant-tables
basedir=${basedir}/mysql
datadir=/data/mysql
socket=/tmp/mysql.sock
log-error=${datadir}/mysql.err
pid-file=${datadir}/mysql.pid
#character config
character_set_server=utf8mb4
symbolic-links=0
explicit_defaults_for_timestamp=true
EOF

#initialize data
echo "===============initialize data=================="
mysqld --defaults-file=/etc/my.cnf --basedir=${basedir}/mysql/ --datadir=${datadir} --user=mysql --initialize
echo -e "\ncomplete"

#start mysql 
echo "===============start mysql...=================="
cp ${basedir}/mysql/support-files/mysql.server /etc/init.d/mysql
sleep 5
service mysql start
echo -e "\ncomplete"

#set passwd
echo "===============set passwd:123456=================="
mysql -e "use mysql;update user set host = '%' where user = 'root';FLUSH PRIVILEGES;"
mysql -e "SET PASSWORD = PASSWORD('123456');ALTER USER 'root'@'localhost' PASSWORD EXPIRE NEVER;FLUSH PRIVILEGES;"
service mysql restart
sed -i "s/skip-grant-tables/#skip-grant-tables/g"/etc/my.cnf
echo -e "\ncomplete"
chkconfig --add mysql
chkconfig  mysql on
echo -e "\nMYSQL INSTALL SUCCESS!"
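A quick sanity check after the script finishes (the password 123456 is the one set above):

mysql -uroot -p123456 -e "select version();"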