Implementing WordCount in Hive

This article shows how to implement WordCount and a call drop-rate analysis in Hive: an external table reads data from HDFS, the split and explode functions tokenize the words for grouping and counting, and the same approach is then applied to base-station records to analyze call quality.
Word count in Hive

Our raw data:

hello world hi
hi hell hadoop
hive hbase spark
hello hi

Upload the data to the /usr/ directory in HDFS:

hdfs dfs -put wc /usr/
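
To confirm the upload, you can read the file back from HDFS (this assumes the local file is named wc, as in the put command above):

hdfs dfs -cat /usr/wc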

Create an external table in Hive to read the data above:

create external table wc
(
line string
) 
location '/usr/';
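
A quick sanity check: selecting from the external table should return the four raw lines, one per row:

select line from wc;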

Create a new table, wc_result, to hold the word counts computed from wc:

create table wc_result
(
word string,
ct int
); 

Use Hive's split and explode functions: split turns each line into an array of words, and explode flattens each array into one word per row.

hive> select split(line, ' ') from wc;
OK
["hello","world","hi"]
["hi","hell","hadoop"]
["hive","hbase","spark"]
["hello","hi"]
Time taken: 0.123 seconds, Fetched: 4 row(s)
hive> select explode(split(line, ' ')) from wc;
OK
hello
world
hi
hi
hell
hadoop
hive
hbase
spark
hello
hi
Time taken: 0.071 seconds, Fetched: 11 row(s)
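
Note that split's second argument is a regular expression, so if the raw data may contain runs of multiple spaces, a whitespace pattern is more robust (a sketch; for this single-space data the output is identical):

select explode(split(line, '\\s+')) from wc;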

Group and count the words, then write the results into wc_result:

hive> 
    > from (select explode(split(line, ' ')) word from wc) t 
    > insert into wc_result
    > select word, count(word) group by word;
Query ID = root_20191121104910_4ab6976a-16df-413b-a20d-906bb933feb5
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1574264220363_0002, Tracking URL = http://node003:8088/proxy/application_1574264220363_0002/
Kill Command = /opt/hdp/hadoop-2.6.5/bin/hadoop job  -kill job_1574264220363_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-11-21 10:49:19,658 Stage-1 map = 0%,  reduce = 0%
2019-11-21 10:49:29,010 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.44 sec
2019-11-21 10:49:37,294 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.9 sec
MapReduce Total cumulative CPU time: 2 seconds 900 msec
Ended Job = job_1574264220363_0002
Loading data to table default.wc_result
Table default.wc_result stats: [numFiles=1, numRows=8, totalSize=60, rawDataSize=52]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.9 sec   HDFS Read: 7973 HDFS Write: 133 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 900 msec
OK
Time taken: 29.623 seconds

This statement kicks off a MapReduce computation; once it completes, the results are ready.
Query the wc_result table:

hive> select word, ct from wc_result;
OK
hadoop	1
hbase	1
hell	1
hello	2
hi	3
hive	1
spark	1
world	1
Time taken: 0.085 seconds, Fetched: 8 row(s)
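
As an aside, the separate create-then-insert steps can be collapsed into a single CTAS statement (a sketch; wc_result2 is a hypothetical table name):

create table wc_result2 as
select word, count(word) ct
from (select explode(split(line, ' ')) word from wc) t
group by word;
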
Call drop-rate statistics with Hive


The method is similar to the above.
Create the base table:

create table jizhan
(
record_time string,
imei string,
cell string,
ph_num int,
call_num int,
drop_num int,
duration int,
drop_rate int,
net_type string,
erl int)
row format delimited fields terminated by ',';
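
Each CSV line must supply all ten comma-separated fields in the column order above. A made-up sample row (hypothetical values, for illustration only):

2019-11-21 10:30:00,460010000000001,cell_0001,10,8,1,3600,0,G,5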

Load the data:

load data local inpath '/opt/csv' into table jizhan;
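
To verify the load, sample a few rows (a sketch):

select * from jizhan limit 3;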

Create the result table:

create table jizhan_result
(
imei string,
drop_num int,
duration int,
drop_rate double
);

Process the base table and insert the aggregated results into the result table:

from jizhan 
insert into jizhan_result
select imei,sum(drop_num) sdrop, sum(duration) sdur, sum(drop_num)/sum(duration) srate group by imei 
order by srate desc;
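
The FROM-first form above is Hive-specific syntax; an equivalent statement in the more common INSERT ... SELECT order would be (a sketch):

insert into jizhan_result
select imei, sum(drop_num) sdrop, sum(duration) sdur, sum(drop_num)/sum(duration) srate
from jizhan
group by imei
order by srate desc;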

This statement also triggers a MapReduce job.
Check the results:

select * from jizhan_result limit 10;

