While learning Hive today I found that a Hive query can invoke Python or shell programs to post-process its results, so I wrote a few test cases that pipe the result set through Linux commands. It works quite nicely.
Create the table:

CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

Register the Python script with ADD FILE:

add FILE weekday_mapper.py;

Insert the data (invoking the Python script):

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;

weekday_mapper.py contains:
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    # Convert the unix timestamp to an ISO weekday (Monday=1 .. Sunday=7)
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print '\t'.join([userid, movieid, rating, str(weekday)])
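TRANSFORM streams each selected row to the script's stdin as one tab-separated line, and every tab-separated line the script prints becomes a row of the AS (...) columns. That means the mapper can be sanity-checked locally before touching Hive at all. A minimal check (the sample row values here are made up for illustration):

import subprocess

# Feed one fake tab-separated row (userid, movieid, rating, unixtime)
# to the script the same way Hive's TRANSFORM would.
sample = '196\t242\t3\t881250949\n'
p = subprocess.Popen(['python', 'weekday_mapper.py'],
                     stdin=subprocess.PIPE, stdout=subprocess.PIPE)
out, _ = p.communicate(sample)
print out  # expected: the first three fields echoed back, plus a weekday 1-7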
A simple example of calling shell commands from Hive:
Create a test table:
CREATE TABLE IF NOT EXISTS cjhtest
(
id STRING,
name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE;
Load the data:
LOAD DATA LOCAL INPATH '/home/hadoop/cjh/test' INTO TABLE cjhtest;
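Judging from the '|' delimiter and the query results below, the local file /home/hadoop/cjh/test should look roughly like this (reconstructed from the output, not copied from the original file):

12345|chengjihui
32444|duanbin
39479|luobao
42947|gaojian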
Check the data:

hive> select * from cjhtest;
OK
12345	chengjihui
32444	duanbin
39479	luobao
42947	gaojian
Time taken: 0.075 seconds, Fetched: 4 row(s)

A simple test with the cat command (cat -n numbers each input line, which is why the AS clause gains an extra num column):
hive> SELECT TRANSFORM(a.id,a.name) USING '/bin/cat -n' AS (num,id,name) FROM cjhtest a;
Query ID = hadoop_20150915011212_ab2c282c-85cc-4c23-9f3b-618b4c8d9127
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1442286555409_0018, Tracking URL = http://cjh1:8088/proxy/application_1442286555409_0018/
Kill Command = /home/hadoop/hadoop/bin/hadoop job -kill job_1442286555409_0018
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-09-15 01:12:26,427 Stage-1 map = 0%, reduce = 0%
2015-09-15 01:12:30,622 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.68 sec
MapReduce Total cumulative CPU time: 680 msec
Ended Job = job_1442286555409_0018
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 0.68 sec HDFS Read: 261 HDFS Write: 86 SUCCESS
Total MapReduce CPU Time Spent: 680 msec
OK
1	12345	chengjihui
2	32444	duanbin
3	39479	luobao
4	42947	gaojian
Time taken: 11.19 seconds, Fetched: 4 row(s)

You can also write the Linux command directly, dropping the /bin path (USING 'cat -n'), presumably because the command is resolved through the task's PATH; the job output is the same:
Query ID = hadoop_20150915011414_fa583fb9-c3fa-46b2-8c36-50d70bbd14cd
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1442286555409_0019, Tracking URL = http://cjh1:8088/proxy/application_1442286555409_0019/
Kill Command = /home/hadoop/hadoop/bin/hadoop job -kill job_1442286555409_0019
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-09-15 01:14:08,728 Stage-1 map = 0%, reduce = 0%
2015-09-15 01:14:13,954 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.7 sec
MapReduce Total cumulative CPU time: 700 msec
Ended Job = job_1442286555409_0019
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 0.7 sec HDFS Read: 261 HDFS Write: 86 SUCCESS
Total MapReduce CPU Time Spent: 700 msec
OK
1 12345 chengjihui
2 32444 duanbin
3 39479 luobao
4 42947 gaojian
Time taken: 11.383 seconds, Fetched: 4 row(s)
Below are two simple tests with awk and wc:
awk (note: I wrapped the awk pattern and action in double quotes here; single quotes raise an error, since the whole USING command is itself a single-quoted HiveQL string literal and cannot contain nested unescaped single quotes):
hive> SELECT TRANSFORM(a.id,a.name) USING '/bin/awk "{print $2}"' AS (name) FROM cjhtest a;
Query ID = hadoop_20150915011919_059f6261-10eb-4cd6-85d7-77fdb17a06f3
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1442286555409_0023, Tracking URL = http://cjh1:8088/proxy/application_1442286555409_0023/
Kill Command = /home/hadoop/hadoop/bin/hadoop job -kill job_1442286555409_0023
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-09-15 01:19:58,051 Stage-1 map = 0%, reduce = 0%
2015-09-15 01:20:02,347 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.7 sec
MapReduce Total cumulative CPU time: 700 msec
Ended Job = job_1442286555409_0023
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 0.7 sec HDFS Read: 261 HDFS Write: 34 SUCCESS
Total MapReduce CPU Time Spent: 700 msec
OK
chengjihui
duanbin
luobao
gaojian
Time taken: 11.426 seconds, Fetched: 4 row(s)
hive> SELECT TRANSFORM(a.id,a.name) USING '/bin/awk "/cheng/ {print $2}"' AS (name) FROM cjhtest a;
Query ID = hadoop_20150915012121_7577c0e6-965c-42a7-810b-cc032889bfd8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1442286555409_0024, Tracking URL = http://cjh1:8088/proxy/application_1442286555409_0024/
Kill Command = /home/hadoop/hadoop/bin/hadoop job -kill job_1442286555409_0024
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-09-15 01:21:30,298 Stage-1 map = 0%, reduce = 0%
2015-09-15 01:21:35,546 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.71 sec
MapReduce Total cumulative CPU time: 710 msec
Ended Job = job_1442286555409_0024
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 0.71 sec HDFS Read: 261 HDFS Write: 11 SUCCESS
Total MapReduce CPU Time Spent: 710 msec
OK
chengjihui
Time taken: 11.34 seconds, Fetched: 1 row(s)
Writing the awk command directly, with the /bin prefix dropped, also executes successfully:
hive> SELECT TRANSFORM(a.id,a.name) USING 'awk "/cheng/ {print $2}"' AS (name) FROM cjhtest a;
Query ID = hadoop_20150915012323_4ae115f1-8501-4197-8627-bb2341826f16
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1442286555409_0025, Tracking URL = http://cjh1:8088/proxy/application_1442286555409_0025/
Kill Command = /home/hadoop/hadoop/bin/hadoop job -kill job_1442286555409_0025
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-09-15 01:23:13,054 Stage-1 map = 0%, reduce = 0%
2015-09-15 01:23:18,286 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.69 sec
MapReduce Total cumulative CPU time: 690 msec
Ended Job = job_1442286555409_0025
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 0.69 sec HDFS Read: 261 HDFS Write: 11 SUCCESS
Total MapReduce CPU Time Spent: 690 msec
OK
chengjihui
Time taken: 11.452 seconds, Fetched: 1 row(s)
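For comparison, the same filter can be written as a streaming Python script instead of awk. A minimal sketch (filter_cheng.py is a hypothetical name; it would be registered with add FILE and invoked with USING 'python filter_cheng.py', just like weekday_mapper.py above):

import sys

# Print the name field of every tab-separated (id, name) row that
# contains 'cheng' -- the counterpart of awk '/cheng/ {print $2}'.
for line in sys.stdin:
    fields = line.strip().split('\t')
    if len(fields) >= 2 and 'cheng' in line:
        print fields[1]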
The wc -l command:
hive> SELECT TRANSFORM(a.id,a.name) USING 'wc -l' AS (cc) FROM cjhtest a;
Query ID = hadoop_20150915012424_e214d72d-87a6-4762-92ac-0b86e58c149c
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1442286555409_0026, Tracking URL = http://cjh1:8088/proxy/application_1442286555409_0026/
Kill Command = /home/hadoop/hadoop/bin/hadoop job -kill job_1442286555409_0026
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-09-15 01:24:59,995 Stage-1 map = 0%, reduce = 0%
2015-09-15 01:25:05,256 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.85 sec
MapReduce Total cumulative CPU time: 850 msec
Ended Job = job_1442286555409_0026
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 0.85 sec HDFS Read: 261 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 850 msec
OK
4
Time taken: 11.486 seconds, Fetched: 1 row(s)

Note that the transform command runs once per map task, so wc -l counts the rows seen by each mapper; the result here is the full table count only because the job ran with a single mapper.
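The same count could also be produced by a tiny streaming Python script instead of wc (count_rows.py is a hypothetical name), with the identical per-mapper behavior:

import sys

# Count the rows streamed to this transform task -- the equivalent of 'wc -l'.
# Like wc -l, it emits one count per map task, not a global total.
count = 0
for line in sys.stdin:
    count += 1
print count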