Big Data in Practice

cf_wu95

Category: Big Data

Article link: https://blog.csdn.net/cf_wu95/article/details/88942479

Uploading the local dataset to the Hive data warehouse

1. Remove the header row (the field names).

sed -i '1d' small_user
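
As a small illustration of what this command does, the sketch below runs it on a throwaway file. The file name sample_user and its three columns are made up for the example, and the -i form shown is GNU sed syntax (BSD/macOS sed expects -i '' instead).

```shell
# Create a tiny stand-in for small_user; the name and columns are hypothetical.
workdir=$(mktemp -d)
sample="$workdir/sample_user"
printf 'user_id,item_id,behavior_type\n10001,2001,1\n10002,2002,4\n' > "$sample"

# Delete line 1 (the header row) in place, as in the step above (GNU sed syntax).
sed -i '1d' "$sample"

# The file now starts with the first data row instead of the header.
head -1 "$sample"
```

Because the edit is done in place, it is worth keeping a copy of the raw file if the header may be needed again later.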

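2. Split the fields (preprocessing) to produce user_table.txt. Note: the file is large, so do not open it directly in an editor, which may fail; inspect the first few lines instead.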

head -10 user_table.txt

3. Before the data can be loaded into Hive, it must first be uploaded to HDFS.
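
The original does not show the upload commands. A minimal sketch, assuming the preprocessed file lives at /usr/local/bigdatacase/dataset/user_table.txt (an assumed local path) and that the HDFS target is the /bigdatacase/dataset directory used as the table LOCATION in this post, might look like this. The helper name upload_to_hdfs and the DRY_RUN switch are hypothetical conveniences; by default the script only prints the commands.

```shell
# Assumed locations: adjust them to your own setup.
local_file=/usr/local/bigdatacase/dataset/user_table.txt
hdfs_dir=/bigdatacase/dataset

# Hypothetical helper: prints the commands unless DRY_RUN=0 is set.
upload_to_hdfs() {
    for cmd in \
        "hdfs dfs -mkdir -p $hdfs_dir" \
        "hdfs dfs -put $local_file $hdfs_dir" \
        "hdfs dfs -ls $hdfs_dir"
    do
        if [ "${DRY_RUN:-1}" = "0" ]; then
            $cmd            # run against the cluster
        else
            echo "$cmd"     # dry run: only show what would be executed
        fi
    done
}

upload_to_hdfs
```

Running the script with DRY_RUN=0 set executes the three commands; the final listing is a quick check that the file actually arrived in HDFS.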

4. Start Hive; the MySQL service must be started first.

5. Create the database and the external table. Note: the LOCATION must be the HDFS directory /bigdatacase/dataset, written exactly as shown.

CREATE DATABASE IF NOT EXISTS dblab;
CREATE EXTERNAL TABLE dblab.bigdata_user(id INT,uid STRING,item_id STRING,behavior_type INT,item_category STRING,visit_date DATE,province STRING) COMMENT 'Welcome to xmu dblab!' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/bigdatacase/dataset';

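To summarize, the differences between managed (internal) tables and external tables in Hive are:

When data is loaded into an external table, it is not moved into Hive's own warehouse directory; in other words, Hive does not manage the data of an external table itself. A managed table is different: its data is moved into the warehouse directory.
When a managed table is dropped, Hive deletes both its metadata and its data; when an external table is dropped, Hive deletes only the metadata and leaves the data in place.
So which kind of table should you use? In most cases there is little difference, so the choice is largely a matter of personal preference. As a rule of thumb, if all processing will be done by Hive, create a managed table; otherwise use an external table.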

Data analysis with Hive

1. Count the number of users.

select count(distinct uid) from bigdata_user;

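2. Count the records that are not duplicated (to exclude order brushing, i.e. fake repeat orders). Note: give the subquery an alias (here, a); without one Hive is likely to report an error. Also note that having count(*)=1 keeps only the combinations that occur exactly once, so a record that appears several times is excluded altogether rather than counted once.

select count(*) from (select uid,item_id,behavior_type,item_category,visit_date,province from bigdata_user group by uid,item_id,behavior_type,item_category,visit_date,province having count(*)=1) a;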
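
To see what the having count(*)=1 filter keeps, the following sketch reproduces the logic on a small tab-separated sample with standard shell tools. The sample rows are invented; the column order follows the bigdata_user table definition (id first, so columns 2 to 7 are the ones being grouped). A combination that occurs twice is dropped entirely, whereas counting distinct combinations would count it once.

```shell
sample=$(mktemp)
# Four invented rows: rows 1 and 2 are the same action recorded twice.
printf '1\tu1\ti1\t4\tc1\t2014-12-11\tShanghai\n'  > "$sample"
printf '2\tu1\ti1\t4\tc1\t2014-12-11\tShanghai\n' >> "$sample"
printf '3\tu2\ti2\t1\tc1\t2014-12-11\tBeijing\n'  >> "$sample"
printf '4\tu3\ti3\t1\tc2\t2014-12-12\tFujian\n'   >> "$sample"

# Counterpart of GROUP BY ... HAVING count(*)=1: combinations seen exactly once.
seen_once=$(cut -f2-7 "$sample" | sort | uniq -u | wc -l)

# For comparison: number of distinct combinations (duplicates counted once).
distinct=$(cut -f2-7 "$sample" | sort -u | wc -l)

echo "seen exactly once: $seen_once, distinct: $distinct"
```

Which of the two numbers is wanted depends on whether repeated records should be discarded completely or collapsed into one.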

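3. Count how many people viewed items between 2014-12-10 and 2014-12-13 (more precisely, how many view records there are). Because the query uses strict comparisons, records dated exactly 2014-12-10 or 2014-12-13 are not counted; use >= and <= to include both end dates.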

select count(*) from bigdata_user where behavior_type='1' and visit_date<'2014-12-13' and visit_date>'2014-12-10';
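
The effect of the strict comparisons can be checked on a small local sample. The sketch below uses invented rows in the bigdata_user column order (behavior_type in column 4, visit_date in column 6) and awk in place of Hive; dates in YYYY-MM-DD form compare correctly as plain strings.

```shell
sample=$(mktemp)
# Invented view records (behavior_type 1), one per day from Dec 10 to Dec 13,
# plus one purchase (behavior_type 4) on Dec 11 that must not be counted.
printf '1\tu1\ti1\t1\tc1\t2014-12-10\tShanghai\n'  > "$sample"
printf '2\tu2\ti2\t1\tc1\t2014-12-11\tBeijing\n'  >> "$sample"
printf '3\tu3\ti3\t1\tc2\t2014-12-12\tFujian\n'   >> "$sample"
printf '4\tu4\ti4\t1\tc2\t2014-12-13\tFujian\n'   >> "$sample"
printf '5\tu5\ti5\t4\tc2\t2014-12-11\tFujian\n'   >> "$sample"

# Same conditions as the Hive query: both end dates are excluded.
exclusive=$(awk -F'\t' '$4 == 1 && $6 > "2014-12-10" && $6 < "2014-12-13" { n++ }
    END { print n + 0 }' "$sample")

# Variant with >= and <=: both end dates are included.
inclusive=$(awk -F'\t' '$4 == 1 && $6 >= "2014-12-10" && $6 <= "2014-12-13" { n++ }
    END { print n + 0 }' "$sample")

echo "exclusive: $exclusive, inclusive: $inclusive"
```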

4. Using the day of the month as the unit, show the number of items sold on each day n.

select count(*), day(visit_date) from bigdata_user where behavior_type='4' group by day(visit_date);

User behavior analysis

1. Count the purchase records on 2014-12-11.

select count(*) from bigdata_user where visit_date='2014-12-11' and behavior_type='4';

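2. Count the visits to the site on 2014-12-11. Strictly speaking, the query counts every behavior record on that day rather than distinct users.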

select count(*) from bigdata_user where visit_date ='2014-12-11';

Dividing the first result by the second gives the site's purchase ratio for that day.
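
As a worked example of this division, the sketch below computes both counts and their ratio in one pass over a tiny invented sample (bigdata_user column order, with behavior_type in column 4 and visit_date in column 6). It uses awk locally rather than Hive, so it only illustrates the arithmetic.

```shell
sample=$(mktemp)
# Invented records: four on 2014-12-11 (one of them a purchase, type 4)
# and one on another day that must be ignored.
printf '1\tu1\ti1\t1\tc1\t2014-12-11\tShanghai\n'  > "$sample"
printf '2\tu1\ti1\t4\tc1\t2014-12-11\tShanghai\n' >> "$sample"
printf '3\tu2\ti2\t1\tc1\t2014-12-11\tBeijing\n'  >> "$sample"
printf '4\tu3\ti3\t3\tc2\t2014-12-11\tFujian\n'   >> "$sample"
printf '5\tu3\ti3\t4\tc2\t2014-12-12\tFujian\n'   >> "$sample"

# purchases on the day / all records on the day
ratio=$(awk -F'\t' '$6 == "2014-12-11" { total++; if ($4 == 4) bought++ }
    END { if (total > 0) printf "%.2f\n", bought / total }' "$sample")

echo "purchase ratio: $ratio"
```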

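3. Given a threshold on the number of purchases, find the ids of users who made more than 5 purchases on the site on a given day.

Note: an aggregate function operates on a group of rows and returns a single value, so every expression in the SELECT list must either be an argument of an aggregate function or appear in the GROUP BY clause; otherwise the query fails.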

select uid from bigdata_user where behavior_type='4' and visit_date='2014-12-12' group by uid having count(*)>5;
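
The division of labor between WHERE (filtering rows before grouping) and HAVING (filtering the per-group counts) can be sketched locally with awk. The data below is invented: user u1 makes 6 purchases and user u2 makes 5 on 2014-12-12, so only u1 exceeds the threshold.

```shell
sample=$(mktemp)
# Invented data in bigdata_user column order: rows 1-6 belong to u1, rows 7-11 to u2.
i=1
while [ "$i" -le 6 ]; do
    printf '%s\tu1\titem%s\t4\tc1\t2014-12-12\tShanghai\n' "$i" "$i" >> "$sample"
    i=$((i + 1))
done
while [ "$i" -le 11 ]; do
    printf '%s\tu2\titem%s\t4\tc1\t2014-12-12\tBeijing\n' "$i" "$i" >> "$sample"
    i=$((i + 1))
done

# The pattern plays the role of WHERE; the check in END plays the role of HAVING.
heavy_buyers=$(awk -F'\t' '$4 == 4 && $6 == "2014-12-12" { n[$2]++ }
    END { for (u in n) if (n[u] > 5) print u }' "$sample" | sort)

echo "$heavy_buyers"
```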

4. Count how many times users from each province viewed the site. Note: HAVING exists because WHERE cannot be combined with aggregate functions (a WHERE clause may still be present; it just cannot contain a condition such as where count(*)>5).

-- create a new table to store the result
create table scan(province STRING,scan INT) COMMENT 'This is the search of bigdataday' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
-- load the aggregated data
insert overwrite table scan select province,count(behavior_type) from bigdata_user where behavior_type='1' group by province;
-- show the result
select * from scan;

Importing and exporting data

1. Export data from Hive to MySQL with Sqoop. Note: the export command moves data from HDFS to the SQL database.

./bin/sqoop export --connect jdbc:mysql://localhost:3306/dblab --username root --password hadoop --table user_action --export-dir '/user/hive/warehouse/dblab.db/user_action' --fields-terminated-by '\t';

./bin/sqoop export  ## copy data from Hive to MySQL
--connect jdbc:mysql://localhost:3306/dblab  # the MySQL database to connect to (dblab)
--username root  # MySQL login user
--password hadoop  # login password
--table user_action  # the MySQL table that receives the data
--export-dir '/user/hive/warehouse/dblab.db/user_action'  # the Hive files to export
--fields-terminated-by '\t'   # field delimiter of the exported Hive files

2. Import data from MySQL into HBase with Sqoop. Note: the import command moves data from the SQL database into the Hadoop side (here, HBase).

./bin/sqoop  import  --connect jdbc:mysql://localhost:3306/dblab --username root --password hadoop --table user_action --hbase-table user_action --column-family f1 --hbase-row-key id --hbase-create-table -m 1

./bin/sqoop import --connect jdbc:mysql://localhost:3306/dblab
--username root
--password hadoop
--table user_action
--hbase-table user_action # name of the HBase table
--column-family f1 # column family name
--hbase-row-key id # HBase row key, taken from the id column in MySQL
--hbase-create-table # create the HBase table if it does not exist
-m 1 # number of map tasks to launch

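Note: HBase stores values as raw bytes, and the HBase shell displays non-ASCII bytes as hexadecimal escapes, so Chinese text shows up like this:

value=\xE4\xB8\x8A\xE6\xB5\xB7\xE5\xB8\x82 # the province (上海市)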
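
The escaped value above is the UTF-8 encoding of the province name 上海市 (Shanghai). One way to check this, assuming a UTF-8 environment and the standard od and tr tools, is to dump the bytes of the string and compare them with what the HBase shell printed:

```shell
# Dump the UTF-8 bytes of the province name as lowercase hex, without separators.
hex=$(printf '%s' '上海市' | od -An -tx1 | tr -d ' \n')

echo "$hex"
```

The nine bytes printed correspond one to one to the \x escapes in the HBase shell output, so the data itself is intact; only its display is escaped.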

3. Import local data into HBase.

hadoop jar /usr/local/bigdatacase/hbase/ImportHBase.jar HBaseImportTest /usr/local/bigdatacase/dataset/user_action.output
