Implementing word count in Hive
Our raw data:
hello world hi
hi hell hadoop
hive hbase spark
hello hi
Upload the data file to the /usr/ directory in HDFS:
hdfs dfs -put wc /usr/
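To verify the upload (an optional sanity check, assuming the local file is named wc as in the put command above):
hdfs dfs -cat /usr/wc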
Create an external table in Hive to read the data above:
create external table wc
(
line string
)
location '/usr/';
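As an optional sanity check, selecting from the external table should return the four raw lines shown above:
select * from wc;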
Create a new table wc_result to hold the word-count results computed from the wc table:
create table wc_result
(
word string,
ct int
);
Use Hive's built-in functions split and explode:
hive> select split(line, ' ') from wc;
OK
["hello","world","hi"]
["hi","hell","hadoop"]
["hive","hbase","spark"]
["hello","hi"]
Time taken: 0.123 seconds, Fetched: 4 row(s)
hive> select explode(split(line, ' ')) from wc;
OK
hello
world
hi
hi
hell
hadoop
hive
hbase
spark
hello
hi
Time taken: 0.071 seconds, Fetched: 11 row(s)
Group and count the words, then insert the results into wc_result:
hive>
> from (select explode(split(line, ' ')) word from wc) t
> insert into wc_result
> select word, count(word) group by word;
Query ID = root_20191121104910_4ab6976a-16df-413b-a20d-906bb933feb5
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1574264220363_0002, Tracking URL = http://node003:8088/proxy/application_1574264220363_0002/
Kill Command = /opt/hdp/hadoop-2.6.5/bin/hadoop job -kill job_1574264220363_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-11-21 10:49:19,658 Stage-1 map = 0%, reduce = 0%
2019-11-21 10:49:29,010 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.44 sec
2019-11-21 10:49:37,294 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.9 sec
MapReduce Total cumulative CPU time: 2 seconds 900 msec
Ended Job = job_1574264220363_0002
Loading data to table default.wc_result
Table default.wc_result stats: [numFiles=1, numRows=8, totalSize=60, rawDataSize=52]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.9 sec HDFS Read: 7973 HDFS Write: 133 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 900 msec
OK
Time taken: 29.623 seconds
Running this statement launches a MapReduce job; once it finishes, the results are available.
Query the wc_result table:
hive> select word, ct from wc_result;
OK
hadoop 1
hbase 1
hell 1
hello 2
hi 3
hive 1
spark 1
world 1
Time taken: 0.085 seconds, Fetched: 8 row(s)
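As a side note, the same counts can be computed in a single statement using LATERAL VIEW instead of a subquery; a sketch that should match the table above:
select word, count(*) as ct
from wc lateral view explode(split(line, ' ')) t as word
group by word;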
Call drop rate statistics in Hive

The approach is similar to the word count above.
Create the base table:
create table jizhan
(
record_time string,
imei string,
cell string,
ph_num int,
call_num int,
drop_num int,
duration int,
drop_rate int,
net_type string,
erl int)
row format delimited fields terminated by ',';
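The table expects comma-delimited rows matching the ten columns above. A hypothetical example row (illustrative values only, not taken from the real dataset):
2019-11-21 10:00:00,460011234567890,cell_001,5,10,1,3600,0,G,0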
Load the data:
load data local inpath '/opt/csv' into table jizhan;
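A quick look at a few rows to confirm the load and the field mapping (optional):
select * from jizhan limit 3;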
Create the result table:
create table jizhan_result
(
imei string,
drop_num int,
duration int,
drop_rate double
);
Process the base table and insert the aggregated results into the result table:
from jizhan
insert into jizhan_result
select imei, sum(drop_num) sdrop, sum(duration) sdur, sum(drop_num)/sum(duration) srate group by imei
order by srate desc;
This statement also triggers a MapReduce job.
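For reference, the same logic can be written in the more common INSERT ... SELECT order; this sketch is equivalent to the FROM-first form above:
insert into jizhan_result
select imei, sum(drop_num) sdrop, sum(duration) sdur, sum(drop_num)/sum(duration) srate
from jizhan
group by imei
order by srate desc;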
View the results:
select * from jizhan_result limit 10;
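If the rate is easier to read as a percentage, it can be formatted at query time with round (a presentation sketch; drop_rate here is the fraction sum(drop_num)/sum(duration) computed above):
select imei, drop_num, duration, round(drop_rate * 100, 4) as drop_pct
from jizhan_result
order by drop_pct desc
limit 10;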


This article has shown how to implement word count and call drop rate analysis with Hive: creating an external table over HDFS data, splitting and counting words with the built-in split and explode functions, and analyzing call quality from base station records.