
1.0 Hadoop Configuration

  • Extract the tar archive
  • Configure environment variables
  • Edit the configuration files

  • hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.8.0_212
  • core-site.xml
<property>
	<name>fs.default.name</name>
	<value>hdfs://master:9000</value>
</property>
  • hdfs-site.xml
<property>
	<name>dfs.replication</name>
	<value>2</value>
</property>
<property>
	<name>dfs.datanode.data.dir</name>
	<value>file:/usr/hadoop/hadoop-2.7.7/hdfs/data</value>
</property>
<property>
	<name>dfs.namenode.name.dir</name>
	<value>file:/usr/hadoop/hadoop-2.7.7/hdfs/name</value>
</property>
  • yarn-env.sh
export JAVA_HOME=/usr/java/jdk1.8.0_212
  • yarn-site.xml
<property>
	<name>yarn.resourcemanager.admin.address</name>
	<value>master:18141</value>
</property>
<property>
	<name>yarn.nodemanager.aux-services</name>
	<value>mapreduce_shuffle</value>
</property>

  • mapred-site.xml

<property>
	<name>mapreduce.framework.name</name>
	<value>yarn</value>
</property>
  • Set up the node files (see the sketch after this list)
    vim master
    vim slaves
  • Format the NameNode
    hadoop namenode -format
  • Start the cluster
    start-all.sh
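
A hedged sketch of the node files plus a quick startup check; the contents below assume the two-worker layout used throughout this guide, and jps daemon names vary slightly across versions:

    # master file holds the master hostname; slaves lists one worker per line
    echo "master" > /usr/hadoop/hadoop-2.7.7/etc/hadoop/master
    printf "slave1\nslave2\n" > /usr/hadoop/hadoop-2.7.7/etc/hadoop/slaves

    # after start-all.sh, verify the daemons
    jps    # on master: NameNode, SecondaryNameNode, ResourceManager
           # on each slave: DataNode, NodeManager

    # basic HDFS smoke test
    hdfs dfs -mkdir -p /tmp/smoke
    hdfs dfs -ls /tmp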

2.0 Spark Configuration

2.1 Scala Configuration

  • Extract the archive
  • Configure environment variables

2.2 Spark Configuration

  • Extract the archive
  • Configure environment variables
  • Edit the configuration files
    cd /usr/spark/spark/conf
    vim spark-env.sh
export SPARK_MASTER_IP=master
export SCALA_HOME=/usr/scala/scala
export JAVA_HOME=/usr/java/jdk1.8.0_212
export HADOOP_HOME=/usr/hadoop/hadoop-2.7.7
export HADOOP_CONF_DIR=/usr/hadoop/hadoop-2.7.7/etc/hadoop

Rename the slaves template into place (mv slaves.template slaves), then append slave1 and slave2 to it.

  • Start the cluster: sbin/start-all.sh (a quick verification sketch follows)
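
A minimal way to confirm the standalone cluster is up, as a hedged sketch; run-example ships with the Spark distribution:

    jps    # expect a Master process on master and a Worker on slave1/slave2

    # optional smoke test using the bundled SparkPi example
    /usr/spark/spark/bin/run-example SparkPi 10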

3.0 Flume Configuration

  • Extract the archive
  • Delete guava-11.0.2.jar from the lib directory (sketch below)
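
A minimal sketch of the removal step; the extraction path here is an assumption, so adjust it to wherever Flume was unpacked:

    # assumed Flume install path
    cd /usr/flume/apache-flume-1.9.0-bin/lib
    rm -f guava-11.0.2.jar   # avoids a Guava version clash on the Hadoop classpath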

4.0 Kafka Configuration

  • Extract the archive
  • Go to the configuration directory: cd /usr/kafka/config
  • Edit server.properties
# unique id of this broker
broker.id=0

delete.topic.enable=true

# directory where Kafka stores its log data
log.dirs=/usr/kafka/kafka/logs

# ZooKeeper cluster connection string
zookeeper.connect=master:2181,slave1:2181,slave2:2181

1. Edit the Kafka configuration file
vim /opt/kafka/kafka_2.11-2.0.0/config/server.properties

Add the following settings to the file:

listeners=PLAINTEXT://0.0.0.0:9092
advertised.host.name=master
advertised.listeners=PLAINTEXT://192.168.3.89:9092

2. Find the running Kafka process with jps, then kill it:
kill -9 <pid>

3. Start Kafka again (a verification sketch follows):
/opt/kafka/kafka_2.11-2.0.0/bin/kafka-server-start.sh -daemon /opt/kafka/kafka_2.11-2.0.0/config/server.properties
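
To verify the broker, a hedged sketch using the CLI tools bundled with Kafka 2.0 (the topic name `smoke` is arbitrary):

    cd /opt/kafka/kafka_2.11-2.0.0
    # create and list a test topic (Kafka 2.0 still addresses ZooKeeper directly here)
    bin/kafka-topics.sh --create --zookeeper master:2181 --replication-factor 1 --partitions 1 --topic smoke
    bin/kafka-topics.sh --list --zookeeper master:2181
    # produce in one terminal, consume in another
    bin/kafka-console-producer.sh --broker-list master:9092 --topic smoke
    bin/kafka-console-consumer.sh --bootstrap-server master:9092 --topic smoke --from-beginning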

5.0 MySQL

5.1 Check whether mysql-community-server is installed

rpm -aq|grep mysql-community-server

5.2 If it is not installed, install it
  • Install wget:

    yum install wget

  • Download the yum repository package

    wget http://repo.mysql.com/mysql57-community-release-el7.rpm

  • Check the CentOS version

    cat /etc/centos-release

  • Install the downloaded rpm

    rpm -ivh mysql57-community-release-el7.rpm

  • Install MySQL

    yum -y install mysql-community-server

  • Start the MySQL service

    systemctl start mysqld

  • Retrieve the initial temporary password

    grep 'temporary password' /var/log/mysqld.log

  • Change the password

    mysql_secure_installation

    When Enter password for user root: appears, enter the temporary password obtained above and press Enter.

    Answer y to every yes/no confirmation prompt.

    When New password: appears, enter your new password, press Enter, and enter it again to confirm.

  • Enable the MySQL service at boot

    systemctl enable mysqld

  • Open firewall port 3306

    If this port is not opened, the MySQL database cannot be reached from other hosts over the network, so it is usually opened to allow remote connections.

    firewall-cmd --zone=public --add-service=mysql --permanent
    firewall-cmd --reload

5.3 Check whether the MySQL service is running

systemctl status mysqld

5.4 View the temporary password generated in mysqld.log

grep "temporary password" /var/log/mysqld.log

5.5 Set the local MySQL root password to 123456

1. Log in to MySQL

mysql -uroot -p

2. Set the password policy to LOW

mysql> set global validate_password_policy=0;

3. Set the minimum password length to 6

mysql> set global validate_password_length=6;

4. Change the local password

mysql> alter user 'root'@'localhost' identified by '123456';

5. Grant root remote access from any host (otherwise Hive will not be able to connect to MySQL later)

mysql> grant all privileges on *.* to 'root'@'%' identified by '123456' with grant option;

6. Flush privileges (a verification sketch follows)

mysql> flush privileges;
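
To confirm the remote grant took effect, a quick hedged check from any other node, assuming port 3306 was opened as above:

    mysql -h master -uroot -p123456 -e "select host, user from mysql.user;"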

1.0 Extract the Hive Archive to /usr/hive

mkdir -p /usr/hive
cd /usr/package
tar -zxvf apache-hive-2.1.1-bin.tar.gz -C /usr/hive

2.0 Set the Hive Environment Variables

vi /etc/profile

# Add the following
export HIVE_HOME=/usr/hive/apache-hive-2.1.1-bin
export PATH=$PATH:$HIVE_HOME/bin

# Apply the changes
source /etc/profile

3.0 Set Up the Hive Runtime Environment

cd /usr/hive/apache-hive-2.1.1-bin/conf
mv hive-env.sh.template hive-env.sh

3.1 In hive-env.sh, set the Hadoop install path, the Hive config directory (conf), and the Hive auxiliary library path (lib)

vi hive-env.sh

Change it to the following:

export HADOOP_HOME=/usr/hadoop/hadoop-2.7.3
export HIVE_CONF_DIR=/usr/hive/apache-hive-2.1.1-bin/conf
export HIVE_AUX_JARS_PATH=/usr/hive/apache-hive-2.1.1-bin/lib

3.2 Create a New hive-site.xml

vi hive-site.xml

On slave1, create it with the following contents:

<configuration>
  <!--Warehouse directory where Hive stores table data-->
<property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive_remote/warehouse</value>
</property>
  <!--JDBC URL of the metastore database-->
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://slave2:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8&amp;useSSL=false</value>
	<description>JDBC connect string for a JDBC metastore</description>
</property>
  <!--JDBC driver for the metastore database, i.e. the MySQL driver-->
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
</property>
  <!--MySQL username-->
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
</property>
  <!--MySQL password-->
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
</property>
<property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
</property>
<property>
    <name>datanucleus.schema.autoCreateAll</name>
    <value>true</value>
</property>
</configuration>

On master, create it with the following contents:

<configuration>
<!--Warehouse directory where Hive stores table data-->
<property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive_remote/warehouse</value>
</property>

<!--Use a local metastore service (default true); false here because the metastore runs on slave1-->
<property>
    <name>hive.metastore.local</name>
    <value>false</value>
</property>

<!--URI of the remote metastore server-->
<property>
    <name>hive.metastore.uris</name>
    <value>thrift://slave1:9083</value>
</property>
</configuration>

3.3 Copy the MySQL JDBC Driver into $HIVE_HOME/lib

The metastore server has to communicate with MySQL, so copy the MySQL JDBC driver jar into $HIVE_HOME/lib. This is done only on the slave1 node.

cd /usr/package
cp mysql-connector-java-5.1.47-bin.jar /usr/hive/apache-hive-2.1.1-bin/lib

4.0 Resolve the jline Version Conflict

Copy the jline jar from Hive's lib directory into Hadoop's YARN lib directory:

cd /usr/hive/apache-hive-2.1.1-bin/lib
cp jline-2.12.jar /usr/hadoop/hadoop-2.7.3/share/hadoop/yarn/lib

5.0 Start Hive and Verify

Start the Hadoop cluster before starting Hive each time. On master:

cd /usr/hadoop/hadoop-2.7.3
start-dfs.sh
start-yarn.sh

On slave1, initialize the metastore database:

schematool -dbType mysql -initSchema


On slave1, start the metastore service:

cd /usr/hive/apache-hive-2.1.1-bin
bin/hive --service metastore


On master, start the Hive CLI:

cd /usr/hive/apache-hive-2.1.1-bin
bin/hive

After Hive starts, run show databases; if the database list is returned, the setup succeeded.

Error encountered:

org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:mapreduce_shuffle does not exist

Fix: add the following to yarn-site.xml and restart YARN; Hadoop does not need to be reformatted. A restart sketch follows the snippet.

<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
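
A minimal restart sketch, assuming HADOOP_HOME is set as in the environment-variable steps:

    $HADOOP_HOME/sbin/stop-yarn.sh
    $HADOOP_HOME/sbin/start-yarn.sh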

6.0 Common Hive Commands

  • Load data into a table (in Hive, field and line delimiters are declared in the table DDL rather than in the load statement)

    load data local inpath '/root/customer.txt'
    into table CUSTOMER;
    
  • LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename 
    [PARTITION (partcol1=val1, partcol2=val2 ...)]
    
  1. Create a partitioned table

    create [external] table employees(id int, name string, age int)
    comment 'employees home address'
    partitioned by (province string, city string);
    
  2. Insert data into the partitioned table

    insert into table employees partition(province='henan',city='zhumadian')
    values(1,'zhangsan',23),(2,'zhangsan',23);
    
  3. List a table's partitions

    $hive>show partitions tablename
    //show details of a specific partition
    $hive>show partitions tablename partition(province='henan')
    
  4. Dynamic partitioning

    //dynamic partitioning requires changing a default setting: enable non-strict mode
    set hive.exec.dynamic.partition.mode=nonstrict;
     
    //with dynamic partitioning, partitions are derived at insert time from the
    //last columns of the select list
    1. Create the partitioned table
    create [external] table employees2(id int, name string, age int)
    comment 'employees home address'
    partitioned by (province string, city string);
    2. Insert data; partitions are created automatically from the values of
       province and city, and the column names must match
    insert overwrite table employees2 partition(province,city)
    select id,name,age,province,city from employees;
    
  5. Export table data

    //the LOCAL keyword controls whether output goes to the local filesystem or HDFS
    INSERT OVERWRITE [LOCAL] DIRECTORY '/home/hadoop/output' 
    [ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY '\t']
    select * from employees2; 
    
  6. Print column headers in query results

    set hive.cli.print.header=true;
    

6.1 The over() Function in Hive

Window functions

Analytic functions take the form over(partition by <column> order by <column> rows between <start> and <end>).
  • Row_Number, Rank, and Dense_Rank are the most widely used window functions (a small demo follows this list)

  • row_number():

    Generates a sequence starting at 1 within each group, in order; its values never repeat, and when the ordering values tie, rows are numbered in table order. Typically used to pick the first record of a group, e.g. the first referrer in a session.

  • rank():

    Ranks each row within its group; tied rows share a rank and leave a gap after them.

  • dense_rank():

    Ranks each row within its group; tied rows share a rank and leave no gap.
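
A tiny self-contained demo of the difference; stack() is a built-in Hive UDTF, and the scores here are made up:

    select s,
           row_number() over (order by s desc) as rn,   -- 1, 2, 3, 4
           rank()       over (order by s desc) as rk,   -- 1, 2, 2, 4
           dense_rank() over (order by s desc) as drk   -- 1, 2, 2, 3
    from (select stack(4, 90, 85, 85, 80) as (s)) t;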

Sample data (one record per line; columns are id,user_id,day,time_stamp):

    1,1,20201210,1607558400
    2,2,20201210,1607558402
    3,3,20201210,1607558403
    4,4,20201210,1607558406
    5,1,20201210,1607558500
    6,2,20201210,1607558510
    7,3,20201210,1607558520
    8,5,20201210,1607558525
    9,6,20201210,1607558529
    10,4,20201210,1607558532
    11,2,20201210,1607558535
    12,5,20201210,1607558540
    13,1,20201210,1607558545
    14,6,20201210,1607558550
    15,2,20201210,1607558560
    16,1,20201210,1607558570
    17,1,20201211,1607644805
    18,2,20201211,1607644806
    19,3,20201211,1607644809
    20,4,20201211,1607644812
    21,1,20201211,1607644815
    22,2,20201211,1607644820
    23,3,20201211,1607644828
    24,5,20201211,1607644832
    25,6,20201211,1607644843
    26,4,20201211,1607644849
    27,2,20201211,1607644856
    28,5,20201211,1607644860
    29,1,20201211,1607644863
    30,6,20201211,1607644878
    31,2,20201211,1607644885
    32,1,20201211,1607644899

  • Create the table and load the data

    CREATE EXTERNAL TABLE IF NOT EXISTS douyin_maidian_log
    (
    	id string,
        user_id string,
        day string,
        time_stamp int
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    
    load data local inpath 'filepath' into table douyin_maidian_log;
    
  • Query user details together with the total row count in the table

    select *, count(user_id) over() as total from douyin_maidian_log;
    


  • Query user details together with each day's total event count

    select *, count(user_id) over(partition by day) from douyin_maidian_log;
    


  • Query user details together with each day's distinct user count

    First get the distinct user count per day:
    	select day, count(distinct user_id) from douyin_maidian_log group by day;
    Then join it back to the detail rows, aliasing the aggregate as t2:
    	select t1.*, t2.total from douyin_maidian_log t1,
       (
        select day, count(distinct user_id) as total from douyin_maidian_log group by day
        ) as t2  
        where t1.day=t2.day;
    
  • Query the duration of each of a user's online sessions

    Number each user's events by timestamp. Assuming events alternate login/logout, the odd-numbered rows are session starts and the even-numbered rows are session ends, so shifting the even rows' numbers down by one lets each end be joined to its matching start.

    -- Step 1: number each user's events by timestamp
    select user_id,time_stamp,row_number() over(partition by user_id order by time_stamp) as rn from douyin_maidian_log;
    --------------------------------------------------------------------------------------
    -- Step 2: keep the even-numbered rows (session ends) and shift rn down by 1
    select t1.user_id user_id,t1.time_stamp time_stamp,t1.rn-1 rn from ( select user_id,time_stamp,row_number() over(partition by user_id order by time_stamp) as rn from douyin_maidian_log ) t1 where t1.rn%2=0;
    --------------------------------------------------------------------------------------

    row_number() returns a sequential number for every row within its group.


    -- Step 3: join the odd rows (logins, t12) to the even rows (logouts, t11) on the
    -- shifted row number; the difference of the two timestamps is the session duration
    select t12.user_id, t11.time_stamp-t12.time_stamp, row_number() over(partition by t12.user_id) from 
    (select t1.user_id user_id, t1.time_stamp time_stamp, t1.rn rn from
    (select user_id, time_stamp, row_number() over(partition by user_id order by time_stamp) as rn from douyin_maidian_log) t1 where t1.rn%2=1) t12,
    (select t1.user_id user_id, t1.time_stamp time_stamp, t1.rn-1 rn from
    (select user_id, time_stamp, row_number() over(partition by user_id order by time_stamp) as rn from douyin_maidian_log) t1 where t1.rn%2=0) t11
    where t12.user_id=t11.user_id and t12.rn=t11.rn;
    
    
  • Query each user's total online time (summing the per-session durations)

    select tb.user_id, sum(tb.total) from
    (
        select t12.user_id user_id, t11.time_stamp-t12.time_stamp total, row_number() over(partition by t12.user_id) rn from 
    (
        select t1.user_id user_id, t1.time_stamp time_stamp, t1.rn rn from
    (
        select user_id, time_stamp, row_number() over(partition by user_id order by time_stamp) as rn 
        from douyin_maidian_log) t1 where t1.rn%2=1) t12,
    (select t1.user_id user_id, t1.time_stamp time_stamp, t1.rn-1 rn 
     from
    (
        select user_id, time_stamp, row_number() over(partition by user_id order by time_stamp) as rn 
        from douyin_maidian_log) t1 where t1.rn%2=0) t11
     where t12.user_id=t11.user_id and t12.rn=t11.rn) tb group by tb.user_id;
    
6.1.2 Hive Window Function Example 2
  • Test data

    20191020,11111,85
    20191020,22222,83
    20191020,33333,86
    20191021,11111,87
    20191021,22222,65
    20191021,33333,98
    20191022,11111,67
    20191022,22222,34
    20191022,33333,88
    20191023,11111,99
    20191023,22222,33
    
  • Create the test table and load the data

    CREATE EXTERNAL TABLE IF NOT EXISTS user_score
    (day string,
    userid string,
    score int)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    
    load data local inpath '/home/test/user_score.log' into table user_score;
    
    
  • For each day, count the users whose score is above 80

    select t.day,max(t.total) from
    (select  day,userid,score,count(userid) over(partition by day rows between unbounded preceding and current row) total 
    from  user_score
    where score>80) t group by t.day;
    --------------------------------------------------------------------------------------
    select day,count(*) from user_score where score > 80 group by day order by day ;
    

    The rows between unbounded preceding and current row clause is optional here; with the outer max(), the result is the same either way.

  • For each user, the running count of days up to the current date with a score above 80

    select *, count(userid) over(partition by userid order by day) as total
    from user_score where score>80 order by day, userid;
    
6.1.3 Hive Window Function Example 3
  • Test data

    jack,2017-01-01,10
    tony,2017-01-02,15
    jack,2017-02-03,23
    tony,2017-01-04,29
    jack,2017-01-05,46
    jack,2017-04-06,42
    tony,2017-01-07,50
    jack,2017-01-08,55
    mart,2017-04-08,62
    mart,2017-04-09,68
    neil,2017-05-10,12
    mart,2017-04-11,75
    neil,2017-06-12,80
    mart,2017-04-13,94
    
  • Create the test table and load the data

    create table business
    (
    name string, 
    orderdate string,
    cost int  
    )ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    
    #load the data
    load data local inpath "/home/test/business.log" into table business;
    
  • Customers who purchased in April 2017, and the total number of such customers

    select *,count(name) over() as total from business where substr(orderdate,1,7)='2017-04';
    

    Syntax: substr(string A, int start) and substring(string A, int start) behave identically; both take two arguments.

    Return type: string

    Returns the substring of A from position start through the end of the string.

  • Each customer's purchase details plus their total purchases per month

    select *,sum(cost) over(partition by name,substr(orderdate,1,7)) total_amount from business;
    
  • Each customer's previous purchase date (null for a first purchase)

    select name,orderdate,cost,lag(orderdate,1) over(partition by name order by orderdate) last_date from business;
    
6.1.4 Hive Window Function Example 4
  • Test data

    孙悟空,语文,87
    孙悟空,数学,95
    孙悟空,英语,68
    大海,语文,94
    大海,数学,56
    大海,英语,84
    宋宋,语文,64
    宋宋,数学,86
    宋宋,英语,84
    婷婷,语文,65
    婷婷,数学,85
    婷婷,英语,78
    
  • Create the table and load the data

    create table score
    (
    name string,
    subject string, 
    score int
    ) row format delimited fields terminated by ",";
    
    #load the data
    load data local inpath '/usr/data/test4.log' into table score;
    
    
  • Rank students' scores within each subject

    select *,
    row_number() over(partition by subject order by score desc), --sequential, no tied ranks
    rank() over(partition by subject order by score desc),       --ties share a rank, leaving gaps
    dense_rank() over(partition by subject order by score desc)  --ties share a rank, no gaps
    from score;
    
    Math only (subject = "数学"):
    select t1.name,t1.subject,t1.score,t1.ranking from(select name,subject,score,row_number() over(partition by subject order by score desc)as ranking from score) t1 where t1.subject="数学";
    
  • Query the top-n students per subject

    select * from 
    (select *,row_number() over(partition by subject order by score desc) rmp from score
    ) t where t.rmp<=3;
    

6.4 Hive Data Deduplication

Case 1: all columns duplicated

| name     | score |
|----------|-------|
| Computer | 1600  |
| Phone    | 12    |
| Phone    | 12    |

Steps:

  1. Copy the table structure

    CREATE TABLE <new_table> LIKE <old_table>;
    
  2. Insert the deduplicated data

    insert overwrite table <new_table> 
    select distinct * from <old_table>;
    

Case 2: some columns duplicated

| name     | score | type |
|----------|-------|------|
| Computer | 1600  | 2    |
| Phone    | 12    | 1    |
| Phone    | 15    | 1    |

Steps:

  1. Copy the table structure

    CREATE TABLE <new_table> LIKE <old_table>;
    
  2. Insert the deduplicated data

    insert overwrite table <new_table>
    select t.name, t.score, t.type
    from(
    	select
        	name, score, type, row_number() over(distribute by name sort by score) as rn
        from <old_table>
      ) t where t.rn = 1;
    
  3. General pattern (a worked sketch follows)

    insert overwrite table <new_table>
    select <columns>
    from (
    select <columns>, row_number() over(distribute by <duplicated column> sort by <ordering column>) as rn
    from <old_table>
    ) t where t.rn = 1;
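
Below is a hedged, self-contained sketch of the pattern with the sample rows above inlined via the stack() UDTF (the table and values are illustrative only):

    select name, score, type
    from (
        select name, score, type,
               row_number() over (distribute by name sort by score) as rn
        from (select stack(3,
                           'Computer', 1600, 2,
                           'Phone',    12,   1,
                           'Phone',    15,   1) as (name, score, type)) src
    ) t
    where t.rn = 1;
    -- returns: Computer 1600 2 and Phone 12 1 (the Phone row with score 15 is dropped)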
    

6.5 Hive: Filling Missing Fields

NVL(col, default) substitutes default when col is NULL:

    select t1.name, t1.rk from(
    select *, NVL(comment, -1) rk from region ) t1;
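
A tiny self-contained demo of NVL behavior (values made up; the cast just gives the NULL an int type):

    select v, NVL(v, -1) as filled
    from (select stack(3, 10, cast(null as int), 20) as (v)) t;
    -- filled: 10, -1, 20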

6.6 Computing User Online Time

  • Test data

    'aa','2020-04-28 09:30:00','login'
    'aa','2020-04-28 10:15:00','logout'
    'aa','2020-04-28 13:30:00','login'
    'aa','2020-04-28 14:15:00','logout'
    'bb','2020-04-28 06:30:00','login'
    'bb','2020-04-28 09:15:00','logout'
    'bb','2020-04-28 12:30:00','login'
    'cc','2020-04-28 11:25:00','logout'
    'cc','2020-04-28 16:30:00','login'
    'cc','2020-04-28 20:10:00','logout'
    'dd','2020-04-28 05:25:00','logout'
    'dd','2020-04-28 10:30:00','login'
    'dd','2020-04-28 11:10:00','logout'
    'dd','2020-04-28 21:15:00','login'
    
  • Create the table (the third column holds the login/logout flag, matching the data above)

    CREATE TABLE user_test(
    name string,
    ts string,
    action string
    )row format delimited fields terminated by ',';
    
    #load the data
    load data local inpath '/usr/data/test7.txt' into table user_test;
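
The notes end after loading the data. A minimal sketch of the duration query, assuming the corrected (name, ts, action) schema above, strictly alternating login/logout rows per user, and that the quotes in the raw file are stripped on ingest:

    select name,
           ts as login_ts,
           next_ts as logout_ts,
           (unix_timestamp(next_ts) - unix_timestamp(ts)) / 60 as minutes_online
    from (
        select name, ts, action,
               lead(ts)     over (partition by name order by ts) as next_ts,
               lead(action) over (partition by name order by ts) as next_action
        from user_test
    ) t
    where action = 'login' and next_action = 'logout';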
    