Ways to create and manage Hive tables
- Create a brand-new table with create:
create table if not exists db_web_data.track_log( columns ) partitioned by (date string, hour string) row format delimited fields terminated by '\t';
Note: the partitioned by clause makes this a partitioned table.
- Pull certain fields out of an existing table into a new table (create table ... as select):
create table backup_track_log as select * from db_web_data.track_log;
Note: this copies both the column definitions and the data into the new table, and it runs a MapReduce job.
- Copy only the table structure:
Example: create table like_track_log like db_web_data.track_log;
Note: this copies only the table structure, not the data, and it does not run a MapReduce job.
Ways to load data into Hive tables
1. Load from the local filesystem
load data local inpath 'local_path/file' into table table_name;
2. Load from HDFS
load data inpath 'hdfs_path/file' into table table_name;
3. Load with overwrite
load data local inpath 'path/file' overwrite into table table_name;
load data inpath 'path/file' overwrite into table table_name;
4. Load via a query (create table ... as select)
create table track_log_bak as select * from db_web_data.track_log;
5. Load with insert
** Append (the default):
insert into table table_name select * from track_log;
** Overwrite (must be stated explicitly; the more commonly used form):
insert overwrite table table_name select * from track_log;
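Note that db_web_data.track_log above is partitioned, so loading into it must name the target partition. A minimal sketch, where the local file path is only illustrative:
load data local inpath '/opt/modules/weblog/20170724/2017072418' overwrite into table db_web_data.track_log partition (date='20170724', hour='18');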
Ways to export data from Hive tables
1. Export to the local filesystem
Example: insert overwrite local directory "/home/admin/Desktop/1/2" row format delimited fields terminated by '\t' select * from db_hive_demo.emp;
Note: the target directory is created recursively.
2. Export to HDFS
Example: insert overwrite directory "path/" select * from staff;
3. Export from the shell (> overwrites, >> appends)
Example: $ bin/hive -e "select * from staff;" > /home/z/backup.log
4. Sqoop
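Sqoop can export warehouse files directly into a relational database. A sketch of exporting the staff table's warehouse directory to MySQL; the connection string, credentials, and paths are placeholders:
$ bin/sqoop export \
  --connect jdbc:mysql://hostname:3306/db_demo \
  --username root \
  --password 123456 \
  --table staff \
  --export-dir /user/hive/warehouse/staff \
  --input-fields-terminated-by '\t'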
Hive data cleaning: the approach
Requirement: compute PV and UV grouped by province and date.
SELECT date, provinceId, count(url) pv, count(distinct guid) uv from track_log group by date, provinceId;
Cleaning the website log data
Mapper class
package bh.shy.demo.etl;

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogCleanMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] splits = line.split("\t");
        // Check whether this record is missing fields
        if (splits.length < 32) {
            context.getCounter("Web Data ETL", "Data Length is too short").increment(1L);
            // This record is discarded
            return;
        }
        // Filter out records with an empty url
        String url = splits[1];
        if (StringUtils.isBlank(url)) {
            context.getCounter("Web Data ETL", "url is blank").increment(1L);
            return;
        }
        // Filter out records whose provinceId is missing or not a valid number
        String provinceId = splits[20];
        if (StringUtils.isBlank(provinceId)) {
            context.getCounter("Web Data ETL", "provinceId is UNKNOWN").increment(1L);
            return;
        }
        try {
            Integer.parseInt(provinceId);
        } catch (NumberFormatException e) {
            context.getCounter("Web Data ETL", "provinceId is UNKNOWN").increment(1L);
            return;
        }
        context.write(key, value);
    }
}
Driver class
package bh.shy.demo.etl;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogCleanMapReduce {
    public static void main(String[] args) throws Exception {
        System.exit(run(args));
    }

    public static int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(LogCleanMapReduce.class);
        // Input path: first command-line argument
        Path inPath = new Path(args[0]);
        FileInputFormat.addInputPath(job, inPath);
        job.setMapperClass(LogCleanMapper.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setReducerClass(LogCleanReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        // Output path: second command-line argument (must not already exist)
        Path outPath = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, outPath);
        boolean isSuccess = job.waitForCompletion(true);
        return isSuccess ? 0 : 1;
    }
}
Reducer class
package bh.shy.demo.etl;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LogCleanReduce extends Reducer<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void reduce(LongWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Emit each surviving log line as-is; the offset key is dropped
        for (Text value : values) {
            context.write(value, NullWritable.get());
        }
    }
}
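To run the cleaning job, package the three classes into a jar and submit it to the cluster. A sketch, where the jar name and the input/output paths are placeholders:
$ bin/yarn jar logclean.jar bh.shy.demo.etl.LogCleanMapReduce \
  /input/tracklogs /output/cleaned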
Automating log processing with Hive
Requirement: the site's logs are stored in folders named by date; analyze them periodically and produce results.
/opt/modules/weblog
    - 20170724 (folder)
        - 2017072418 (log file)
        - 2017072419
    - 20170725 (folder)
        - 2017072518 (log file)
        - 2017072519
#!/bin/bash
. /etc/profile
# Hive installation directory
HIVE_DIR=/apps/hive-0.13.1-cdh5.3.6
# Root directory where the logs are stored
WEB_LOG=/apps/weblog
# Yesterday's date, used to locate that day's folder
YESTERDAY=$(date --date="1 day ago" +%Y%m%d)
# Iterate over the log files in yesterday's folder
for i in $(ls $WEB_LOG/$YESTERDAY)
do
    DATE=${i:0:8}
    HOUR=${i:8:2}
    $HIVE_DIR/bin/hive \
    --hiveconf LOADFILE_NEW=$WEB_LOG/$YESTERDAY/$i \
    --hiveconf DATE_NEW=$DATE \
    --hiveconf HOUR_NEW=$HOUR \
    -f $HIVE_DIR/hql/auto.hql
done
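The auto.hql file referenced above is not shown here. A minimal sketch, assuming it simply loads each hourly file into the partitioned db_web_data.track_log table from earlier:
load data local inpath '${hiveconf:LOADFILE_NEW}'
overwrite into table db_web_data.track_log
partition (date='${hiveconf:DATE_NEW}', hour='${hiveconf:HOUR_NEW}');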
Business case walkthrough
Requirement: every day at 6 pm, run an automated script that loads yesterday's log files into HDFS, analyzes the site's multidimensional data (PV and UV grouped by province and hour), stores the query results in a temporary Hive table (columns: date, hour, provinceId, pv, uv), and finally copies all rows of that temporary table into MySQL so the backend developers can query and display them the next day.
- Load the local data into HDFS on a schedule; this involves auto.sh and crontab (see the crontab sketch after this list).
- Clean the data, package the job as a jar, and run it on a schedule. The cleaned output lands under the warehouse partition directories:
/user/hive/warehouse/db_web_data.db/track_log/date=20150828/hour=18
part-000001
/user/hive/warehouse/db_web_data.db/track_log/date=20150828/hour=19
part-000001
- Create the table; the partitions do not need to exist up front. Instead, point each partition at the cleaned data as the warehouse source:
alter table track_log add partition(date='20150828', hour=18) location "...";
alter table track_log add partition(date='20150828', hour=19) location "...";
- Analyze the data you want and store the results in a temporary Hive table.
Create the temporary table:
create table if not exists temp_track_log(date string, hour string, provinceId string, pv string, uv string) row format delimited fields terminated by '\t';
Insert the results into the temporary table:
insert overwrite table temp_track_log select date, hour, provinceId, count(url) pv, count(distinct guid) uv from track_log where date='20150828' group by date, hour, provinceId;
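A crontab entry for the 6 pm schedule could look like the following; the script and log paths are placeholders:
# every day at 18:00, run the loading script and append its output to a log file
0 18 * * * /bin/bash /apps/auto.sh >> /apps/auto.log 2>&1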
Several kinds of sorting in Hive
order by
Global sort; the job runs with a single reducer.
sort by
Sorts the data within each reducer; not a global sort.
distribute by
Similar to a MapReduce partitioner: it decides which reducer each row goes to, and is usually combined with sort by.
cluster by
When the distribute by and sort by columns are the same, the pair is equivalent to cluster by (see the sketch below).
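A sketch contrasting the four, using the track_log table from earlier; the column choices are only illustrative:
-- order by: a single reducer produces a totally ordered result
select * from track_log order by guid;
-- sort by: each reducer's output is sorted, but there is no global order
set mapred.reduce.tasks=3;
select * from track_log sort by guid;
-- distribute by + sort by: rows with the same provinceId go to the same reducer,
-- which then sorts its rows by guid
select * from track_log distribute by provinceId sort by guid;
-- cluster by: shorthand for distribute by and sort by on the same column
select * from track_log cluster by provinceId;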