[Filling the Pits: Hadoop 06] Hadoop 2.10.1 (on HDFS/MapReduce) with Hive 1.2.2 / JDK 1.8 / MySQL 8.0.23: installation, the metastore (MySQL), HQL, UDFs, and Pig

Hive overview

(Hive overview diagrams omitted.)

Hive installation

Hive only needs to be installed on a single node.

1. Upload the tar package

2. Extract it

tar -zxvf apache-hive-1.2.2-bin.tar.gz -C /cloud/

3. Configure the MySQL metastore (switch to the root user)

Configure the HIVE_HOME environment variable (see the sketch at the end of this step).
rpm -qa | grep mysql
rpm -e mysql-libs-5.1.66-2.el6_3.i686 --nodeps
rpm -ivh MySQL-server-5.1.73-1.glibc23.i386.rpm 
rpm -ivh MySQL-client-5.1.73-1.glibc23.i386.rpm 
(The rpm commands above are from the old MySQL 5.x RPM install; for MySQL 8.0.23 install the MySQL 8 community packages instead, then continue as below.)
Set the MySQL root password:
/usr/bin/mysql_secure_installation
(Note: delete the anonymous users and allow remote connections.)
Log in to MySQL:
mysql -u root -p
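Setting HIVE_HOME is plain shell environment setup; a minimal sketch, assuming Hive was extracted to /cloud/apache-hive-1.2.2-bin (adjust the path to your own layout):

# append to /etc/profile (or ~/.bashrc), then run: source /etc/profile
export HIVE_HOME=/cloud/apache-hive-1.2.2-bin
export PATH=$PATH:$HIVE_HOME/bin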

4. Configure Hive: hive-site.xml

cp hive-default.xml.template hive-site.xml 
Edit hive-site.xml (delete all of the generated content, leaving only the <property></property> entries).
Add the following:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://weekend01:3306/hive?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
  <description>username to use against metastore database</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>root</value>
  <description>password to use against metastore database</description>
</property>

The JDBC connection URL may need these extra parameters:
createDatabaseIfNotExist=true
nullCatalogMeansCurrent=true

A complete hive-site.xml for this setup (MySQL 8 with Connector/J 8, hence com.mysql.cj.jdbc.Driver) looks like:

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://192.168.50.56:3306/hive?createDatabaseIfNotExist=true&amp;nullCatalogMeansCurrent=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
  </property>
</configuration>

5. After Hive and MySQL are installed, copy the MySQL JDBC connector jar into $HIVE_HOME/lib
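For example (the jar name here is an assumption matching MySQL 8.0.23; use whichever Connector/J jar you actually downloaded):

# copy the MySQL Connector/J jar into Hive's lib directory
cp mysql-connector-java-8.0.23.jar $HIVE_HOME/lib/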

If you hit a permissions error, grant access in MySQL (run on the machine where MySQL is installed):
mysql -uroot -p
# (Run the statement below.  *.* = every table in every database;  % = connections from any IP/host are allowed)
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY '123' WITH GRANT OPTION;
FLUSH PRIVILEGES;
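Note that GRANT ... IDENTIFIED BY is MySQL 5.x syntax. On MySQL 8.0.23 the account has to be created first; a sketch of the equivalent (account name, host and password are placeholders):

-- MySQL 8.x: create the account, then grant
CREATE USER 'root'@'%' IDENTIFIED BY '123456';
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' WITH GRANT OPTION;
FLUSH PRIVILEGES;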

Sample data file contents:

001001 iphone13 64 5000
001002 iphone13pro 128 6000
001003 s21 64 4000 

6. Create tables: managed (the default), partitioned, and external

create table trade_detail(id bigint, account string, income double, expenses double, time string) row format delimited fields terminated by '\t';

Create a partitioned table (partitioned by):
create table td_part(id bigint, account string, income double, expenses double, time string) partitioned by (logdate string) row format delimited fields terminated by '\t';
Create an external table (external ... location):
create external table td_ext(id bigint, account string, income double, expenses double, time string) row format delimited fields terminated by '\t' location '/td_ext';

7. Create a partitioned table: partitioned by (pubdate string)

Difference between a regular table and a partitioned table: when data keeps arriving in large volumes, use a partitioned table.
create table book (id bigint, name string) partitioned by (pubdate string) row format delimited fields terminated by '\t'; 

Loading data into a partitioned table: load data local inpath 'xxx' [overwrite] into table xxx partition (col='value')

load data local inpath './book.txt' overwrite into table book partition (pubdate='2010-08-22');

load data local inpath '/root/data.am' into table beauty partition (nation="USA");

HQL aggregates: select avg(size) from xxx

select nation, avg(size) from beauties group by nation order by avg(size);
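The load above writes into a table named beauty while the query reads beauties; presumably they are the same table. A hypothetical DDL that would make the example reproducible (column names are assumptions):

create table beauties (id bigint, name string, size double)
partitioned by (nation string)
row format delimited fields terminated by '\t';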

MySQL metastore database

Key metastore tables (screenshots omitted):

DBS: the Hive databases
TBLS: the tables
COLUMNS_V2: the columns of each table
PARTITION_KEYS: the partition columns
SDS: storage descriptors (input/output format and HDFS location of each table)

The table data itself lives in HDFS under hdfs://master:9000/user/hive/warehouse (directory screenshots omitted).
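Because the metastore is an ordinary MySQL database, it can be inspected with plain SQL; a small sketch (table and column names follow the standard Hive 1.x metastore schema):

-- list every Hive table with its database and HDFS location
SELECT d.NAME AS db_name, t.TBL_NAME, t.TBL_TYPE, s.LOCATION
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID
JOIN SDS s ON t.SD_ID = s.SD_ID;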

HQL data types

Primitive types: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL, STRING, VARCHAR, CHAR, BOOLEAN, TIMESTAMP, DATE, BINARY
Complex types: ARRAY, MAP, STRUCT, UNIONTYPE
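A small sketch showing several of these types in one table (a hypothetical table, just to illustrate the syntax):

create table t_types_demo (
  id        bigint,
  name      string,
  price     double,
  in_stock  boolean,
  tags      array<string>,
  attrs     map<string,string>,
  addr      struct<city:string,zip:string>
);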

HQL syntax: DDL and DML

(Syntax overview figure omitted.)

show tables;

drop table t_phone_1;

set hive.cli.print.header=true;

update t_phone set capity=128 where id =1001;
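Note that in Hive 1.2, UPDATE/DELETE only work on transactional (ACID) tables, which must be bucketed and stored as ORC. A sketch of such a table plus the required session settings (the t_phone columns mirror the statement above and are otherwise assumptions):

set hive.support.concurrency=true;
set hive.enforce.bucketing=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

create table t_phone (id bigint, name string, capity int, price double)
clustered by (id) into 2 buckets
stored as orc
tblproperties ('transactional'='true');

update t_phone set capity=128 where id=1001;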

CREATE TABLE

CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS SEQUENCEFILE;   -- or TEXTFILE

//stored as sequencefile

create table tab_ip_seq(id int,name string,ip string,country string)
row format delimited
fields terminated by ','
stored as sequencefile;
insert overwrite table tab_ip_seq select * from tab_ext;

The essential difference between the TEXTFILE, SEQUENCEFILE and RCFILE storage formats

TEXTFILE is plain row-oriented text (the default); SEQUENCEFILE is Hadoop's binary row-oriented key/value format; RCFILE stores data column-group by column-group, which compresses better and lets queries skip columns they do not need. (Comparison figure omitted.)
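For comparison with the SEQUENCEFILE example above, a sketch of an RCFILE table with the same columns (tab_ext is assumed to exist, as in the load example below):

create table tab_ip_rc(id int,name string,ip string,country string)
stored as rcfile;
insert overwrite table tab_ip_rc select * from tab_ext;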

//create & load

create table tab_ip(id int,name string,ip string,country string)
row format delimited
fields terminated by ','
stored as textfile;
load data local inpath '/home/hadoop/ip.txt' into table tab_ext;

//external LOCATION

CREATE EXTERNAL TABLE tab_ip_ext(id int, name string,
ip STRING,
country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/external/hive';

// CREATE TABLE AS: used to create temporary tables that hold intermediate results

CREATE TABLE tab_ip_ctas
AS
SELECT id new_id, name new_name, ip new_ip,country new_country
FROM tab_ip_ext
SORT BY new_id;

//insert from select: used to write intermediate result data into another table

create table tab_ip_like like tab_ip;

insert overwrite table tab_ip_like
select * from tab_ip;

//CLUSTER (bucketing) <- a bit more advanced; come back to it when you have the time

create table tab_ip_cluster(id int,name string,ip string,country string)
clustered by(id) into 3 buckets;

load data local inpath '/home/hadoop/ip.txt' overwrite into table tab_ip_cluster;
set hive.enforce.bucketing=true;
insert into table tab_ip_cluster select * from tab_ip;

select * from tab_ip_cluster tablesample(bucket 2 out of 3 on id);

//PARTITION create load select

create table tab_ip_part(id int,name string,ip string,country string)
partitioned by (part_flag string)
row format delimited fields terminated by ',';

load data local inpath '/home/hadoop/ip.txt' overwrite into table tab_ip_part
partition(part_flag='part1');

load data local inpath '/home/hadoop/ip_part2.txt' overwrite into table tab_ip_part
partition(part_flag='part2');

select * from tab_ip_part;

select * from tab_ip_part where part_flag='part2';
select count(*) from tab_ip_part where part_flag='part2';

alter table tab_ip change id id_alter string;
ALTER TABLE tab_cts ADD PARTITION (partCol = 'dt') location '/external/hive/dt';

show partitions tab_ip_part;

//write to hdfs / insert (overwrite ) local directory

insert overwrite local directory '/home/hadoop/hivetemp/test.txt' select * from tab_ip_part where part_flag='part1';
insert overwrite directory '/hiveout.txt' select * from tab_ip_part where part_flag='part1';


//array fields / collection items

create table tab_array(a array<string>,b array<string>)
row format delimited
fields terminated by '\t'
collection items terminated by ',';

Sample data:
tobenbrone,laihama,woshishui 13866987898,13287654321
abc,iloveyou,itcast 13866987898,13287654321

select a[0] from tab_array;
select * from tab_array where array_contains(b,'word');
insert into table tab_array select array(0),array(name,ip) from tab_ext t;

//map fields / collection items /map keys

create table tab_map(name string,info map<string,string>)
row format delimited
fields terminated by '\t'
collection items terminated by ';'
map keys terminated by ':';

Sample data:
fengjie age:18;size:36A;addr:usa
furong age:28;size:39C;addr:beijing;weight:180KG

load data local inpath '/home/hadoop/hivetemp/tab_map.txt' overwrite into table tab_map;
insert into table tab_map select name,map('name',name,'ip',ip) from tab_ext;

//struct

create table tab_struct(name string,info struct<age:int,tel:string,addr:string>)
row format delimited
fields terminated by '\t'
collection items terminated by ',';

load data local inpath '/home/hadoop/hivetemp/tab_st.txt' overwrite into table tab_struct;
insert into table tab_struct select name,named_struct('age',id,'tel',name,'addr',country) from tab_ext;

//cli shell

hive -S -e 'select country,count(*) from tab_ext group by country' > /home/hadoop/hivetemp/e.txt
With this execution mode we can batch-run HQL statements from scripting languages (bash shell, Python); a minimal sketch follows.
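A minimal bash sketch of that idea (the script file name and output paths are assumptions):

#!/bin/bash
# run a whole file of HQL statements non-interactively (-S silent, -f script file)
hive -S -f /home/hadoop/stat.hql > /home/hadoop/hivetemp/stat_result.txt

# or build the statement in the script and pass it with -e
tbl=tab_ext
hive -S -e "select country,count(*) from ${tbl} group by country" > /home/hadoop/hivetemp/${tbl}_by_country.txt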

select * from tab_ext sort by id desc limit 5;

select a.ip,b.book from tab_ext a join tab_ip_book b on(a.name=b.name);

//UDF 自定义函数

select if(id=1,'first','no-first'),name from tab_ext;

hive> add jar /home/hadoop/myudf.jar;
hive> CREATE TEMPORARY FUNCTION my_lower AS 'org.dht.Lower';
select my_lower(name) from tab_ext;

add jar /home/hadoop/hibevearea.jar

CREATE TEMPORARY FUNCTION getarea AS 'cn.itcast.bigdata.PhoneNbrToArea';

select getarea(phonenum) area,col1,col2 from t_connect_big;    
PhoneNbrToArea.java
package cn.itcast.bigdata;

import java.util.HashMap;

import org.apache.hadoop.hive.ql.exec.UDF;


public class PhoneNbrToArea extends UDF{

	private static HashMap<String, String> areaMap = new HashMap<>();
	static {
		areaMap.put("1388", "beijing");
		areaMap.put("1399", "tianjin");
		areaMap.put("1366", "nanjing");
	}
	
	// must be public, otherwise Hive cannot invoke it
	public String evaluate(String pnb) {
		
		String result  = areaMap.get(pnb.substring(0,4))==null? (pnb+"    huoxing"):(pnb+"  "+areaMap.get(pnb.substring(0,4)));		
		
		return result;
	}
	
}
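To get from this class to the jar used with add jar, something along these lines works (paths and the hive-exec jar version are assumptions):

# compile against the Hive exec jar, then package the class into a jar
mkdir -p classes
javac -cp $HIVE_HOME/lib/hive-exec-1.2.2.jar -d classes PhoneNbrToArea.java
jar -cf phonearea-udf.jar -C classes .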

select count(*) runs as a MapReduce job

(Console screenshots of the MapReduce job omitted.)
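For reference, the kind of statement whose MapReduce console output the screenshots showed:

select count(*) from tab_ext;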

Metastore schema version verification is controlled by this hive-site.xml property:

<property>
   <name>hive.metastore.schema.verification</name>
   <value>true</value>
 </property>

Hive and Spark

(Figure omitted.)

Hive internals

(Figure omitted.)

Hive architecture design

(Figure omitted.)

PIG

Pig overview

About Pig:
Pig is an abstraction over MapReduce. It is a platform/tool for analysing large datasets and expressing them as data flows.
Pig has two parts:
(1) A language for describing data flows, called Pig Latin.
(2) An execution environment for running Pig Latin programs. There are currently two: a local environment in a single JVM, and a distributed environment on a Hadoop cluster.
Why Pig is needed:
Programmers who are not comfortable with Java often find Hadoop (MapReduce) hard to use; Pig is mainly meant to solve that problem.

Pig execution modes:
pig -x local        local mode
pig -x mapreduce    MapReduce mode

In local mode the Grunt shell can only run Linux commands, not Hadoop commands;
in MapReduce mode both Linux and Hadoop commands are available.

Run a Linux command: sh ...   e.g. sh ls /
Run an HDFS command: fs ...   e.g. fs -ls /

The Pig Latin language:
When writing Pig Latin, separate the equals sign from relation names and operators with spaces:
A = LOAD ...   rather than A=LOAD

Data types: int, long, float, double, chararray, boolean, datetime
Identifiers: names (aliases) for relations, fields, variables, and so on

Schema on read: data types are strictly checked while the data is being read.
Schema on write: data types are checked while the data is being written.
Atom: any single value in Pig Latin, regardless of its data type, is called an atom.
Tuple: a record formed from an ordered set of fields (similar to a row in a relational database).
Syntax: (field, [field, ...])

Bag: a bag is an unordered collection of tuples (similar to a table in a relational database).
Syntax: {tuple, [tuple, ...]}

Map: a set of key-value pairs; the key is a chararray and must be unique, the value can be any type.
Syntax: [key#value]
Relations in Pig Latin are unordered; there is no guarantee that tuples are processed in any particular order.

In Pig Latin, user-defined names are case-sensitive, built-in keywords are case-insensitive, and user-defined function names are case-sensitive.
Relational operators (a small sketch using several of them follows this list):
LOAD: load data from the file system into a relation
A = LOAD 'datafile' [USING PigStorage('delimiter')] [AS (name:type, ...)]
STORE: save a relation from the system into a file
FILTER: remove unwanted rows from a relation
DISTINCT: remove duplicate rows from a relation
FOREACH ... GENERATE: generate data transformations based on columns of data
DUMP: display the contents of a relation
JOIN: join two or more relations
Inner join:
B = JOIN RelationA BY A_field, RelationB BY B_field;
Left outer join (LEFT OUTER): keep left-side rows even when the right side has no matching data
Right outer join (RIGHT OUTER): keep right-side rows even when the left side has no matching data
Full outer join (FULL OUTER): keep unmatched rows from both sides
GROUP: group the data within a single relation
ORDER: sort a relation
LIMIT: take a limited number of tuples from a relation

Built-in functions: AVG() MAX() MIN() COUNT() SUM()
TOKENIZE(): split a field (on spaces, double quotes, commas, parentheses, asterisks)
FLATTEN(): un-nest a bag or tuple to get at its contents

Referencing relations: relations and fields can be referenced by name or by position.
Positional reference: $0 means the first field
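A small Pig Latin sketch tying several of these operators together (the file path and field names are assumptions):

-- hypothetical input file /data/emp_demo with lines: id,name,dept,salary
A = LOAD '/data/emp_demo' USING PigStorage(',') AS (id:int, name:chararray, dept:chararray, salary:double);
B = FILTER A BY salary > 10000.0;
C = DISTINCT B;
D = ORDER C BY salary DESC;
E = LIMIT D 5;
DUMP E;
STORE E INTO '/output/top_salary' USING PigStorage(',');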


Pig installation and usage

1. Install Pig
Add pig to the environment variables (PATH).

2. Using Pig
First import the data from the database into HDFS with Sqoop:
sqoop import --connect jdbc:mysql://192.168.1.10:3306/itcast --username root --password 123 --table trade_detail --target-dir '/sqoop/td'
sqoop import --connect jdbc:mysql://192.168.1.10:3306/itcast --username root --password 123 --table user_info --target-dir '/sqoop/ui'

td = load '/sqoop/td' using PigStorage(',') as (id:long, account:chararray, income:double, expenses:double, time:chararray);
ui = load '/sqoop/ui' using PigStorage(',') as (id:long, account:chararray, name:chararray, age:int);

td1 = foreach td generate account, income, expenses, income-expenses as surplus;

td2 = group td1 by account;

td3 = foreach td2 generate group as account, SUM(td1.income) as income, SUM(td1.expenses) as expenses, SUM(td1.surplus) as surplus;

tu = join td3 by account, ui by account;

result = foreach tu generate td3::account as account, ui::name, td3::income, td3::expenses, td3::surplus;

store result into '/result' using PigStorage(',');

Using Pig Latin

Example 1: split name and age, then compute age+5
 A = LOAD 'data' USING PigStorage('#') AS (name:chararray,age:int);
 DUMP A;
 B = foreach A generate age+5;
 DUMP B;
Example 2:
A = LOAD 'data' AS(t1:tuple(t1a:int,t1b:int,t1c:int),t2:tuple(t2a:int,t2b:int,t2c:int));
Example 3: sum

Input text:
21/32
12/12
11/11
22/22
33/111

 A = LOAD 'data' USING PigStorage('/') AS (s1:int,s2:int);
 B = foreach A generate s1+s2;
Example 4: average

Input text:
984815151/1950/05118/0000
984815151/1950/05118/0022
984815151/1950/05118/0011
984815151/1949/05118/0111
984815151/1949/05118/0078
984815151/1937/05118/0001
984815151/1937/05118/0002
Pig Latin:

A = LOAD 'temp' USING PigStorage('/') AS (id:chararray,year:int,flag:chararray,temp:int); 	 
B = GROUP A by year; 	 
C = FOREACH B GENERATE group,AVG(A.temp); 	 
D = FOREACH B GENERATE group,MAX(A.temp);   -- maximum
Example 5:

Input text 1 (emp):
1001,甘宁,文员,1013,2010-12-17,8000.00,20
1002,黛绮丝,销售员,1006,2011-02-20,16000.00,30
1003,殷天正,销售员,1006,2011-02-22,12500.00,30
1004,刘备,经理,1009,2011-04-02,29750.00,20
1005,谢逊,销售员,1006,2011-09-28,12500.00,30
1006,关羽,经理,1009,2011-05-11,28500.00,30
1007,张飞,经理,1009,2011-09-12,24500.00,10
1008,诸葛亮,分析师,1004,2017-04-19,30000.00,20
1009,增阿牛,董事长,2009-11-17,50000.00,10
1010,韦一笑,销售员,1006,2011-06-09,15000.00,30
1011,周泰,文员,1008,2017-05-23,11000.00,20
1012,程普,文员,1006,2017-09-12,9500.00,30
1013,庞统,分析师,1004,2011-12-03,30000.00,20
1014,黄盖,文员,1007,2012-01-23,13000.00,10
1015,张三,保洁员,1001,2013-05-01,80000.00,50

Query each employee's direct manager:
Pig Latin:

emp = LOAD 'emp' USING PigStorage(',') AS(id:int,name:chararray,work:chararray,pid:int,time:chararray,sal:double,dept:int); 	 
emp1 = LOAD 'emp' USING PigStorage(',') AS(id:int,name:chararray,work:chararray,pid:int,time:chararray,sal:double,dept:int); 	
test1 = JOIN emp BY id, emp1 BY pid;
result = foreach test1 generate $8,$1;

Input text 2 (dept):
10,教研部,北京
20,学工部,上海
30,销售部,广州
40,财务部,武汉

Query each employee's department:
Pig Latin:

 dept = LOAD 'dept' USING PigStorage(',') AS (did:int,dname:chararray,loc:chararray);
 test2 = JOIN emp by dept,dept by did;
 result2 = foreach test2 generate $1,$8;
Example 6: word count

Input text:
hadoop c c++
hello hadoop java hello
c c++

Pig Latin:

 A = LOAD 'count' AS(line:chararray);
 B = FOREACH A GENERATE TOKENIZE(line);  /* split the line into words */
 C = FOREACH B GENERATE FLATTEN($0);     /* flatten the bag to get all the words */
 D = GROUP C by $0;                      /* group by word */
 E = FOREACH D GENERATE group,COUNT(C);