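- Here is a simple example I wrote myself to deduplicate topic descriptions: the desc field in the table holds values such as "a-b-a-b-b-c", and the repeated items need to be removed.
- The Python code is as follows: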
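- #!/usr/bin/python
- # Python 2 script intended to be run through Hive TRANSFORM.
- import sys
- reload(sys)  # Python 2 only: re-expose setdefaultencoding
- sys.setdefaultencoding('utf8')
-
- def quchong(desc):
-     # "quchong" = deduplicate. set() drops duplicates but does not
-     # preserve the original order of the items.
-     a = desc.split('-')
-     return '-'.join(set(a))
-
- while True:
-     line = sys.stdin.readline()
-     if line == "":
-         break  # end of input
-     line = line.rstrip('\n')
-     # your process code here
-     parts = line.split('\t')
-     parts[2] = quchong(parts[2])  # the third column is desc
-     print "\t".join(parts)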
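Because `set()` does not keep the items in their original order, the script above may return the deduplicated description in any order. If the order matters, the following is a minimal sketch of an order-preserving variant. It is written for Python 3 (an assumption, since the script above targets Python 2), and `dedupe_keep_order` is a hypothetical name, not part of the original script.

```python
def dedupe_keep_order(desc, sep="-"):
    """Drop repeated items from a separated string, keeping first occurrences."""
    seen = set()      # items already emitted
    unique = []       # items in first-seen order
    for item in desc.split(sep):
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return sep.join(unique)


print(dedupe_keep_order("a-b-a-b-b-c"))  # a-b-c
```

The function could replace `quchong` in the loop above if the Hive nodes run Python 3 and the `print` statement is adapted accordingly.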
- The following part is reposted from elsewhere and goes into more detail.
- II. Incrementing fields in a Hive map (reposted)
- 1. Create the table
- hive> CREATE TABLE t3 (foo STRING, bar MAP<STRING,INT>)
- > ROW FORMAT DELIMITED
- > FIELDS TERMINATED BY '\t'
- > COLLECTION ITEMS TERMINATED BY ','
- > MAP KEYS TERMINATED BY ':'
- > STORED AS TEXTFILE;
- OK
- 2. The resulting table
- hive> describe t3;
- OK
- foo string
- bar map<string,int>
- 3. Create test.txt
- jeffgeng click:13,uid:15
- 4. Load test.txt into the table
- hive> LOAD DATA LOCAL INPATH 'test.txt' OVERWRITE INTO TABLE t3;
- Copying data from file:/root/src/hadoop/hadoop-0.20.2/contrib/hive-0.5.0-bin/bin/test.txt
- Loading data to table t3
- OK
- After loading, the table looks like this:
- hive> select * from t3;
- OK
- jeffgeng {"click":13,"uid":15}
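To make the role of the three delimiters in the table definition concrete, here is a small sketch of how one line of test.txt could be split into the `foo` string and the `bar` map. It assumes the field delimiter is a tab, the item delimiter is a comma and the key delimiter is a colon, as declared for t3. `parse_row` is a hypothetical helper written in Python 3 for illustration only; it is not how Hive itself is implemented.

```python
def parse_row(line, field_sep="\t", item_sep=",", key_sep=":"):
    """Split one text row into (foo, bar) using the t3 delimiters."""
    foo, raw_map = line.rstrip("\n").split(field_sep, 1)
    bar = {}
    for item in raw_map.split(item_sep):
        key, value = item.split(key_sep, 1)
        bar[key] = int(value)   # the map is declared as MAP<STRING,INT>
    return foo, bar


print(parse_row("jeffgeng\tclick:13,uid:15\n"))
# ('jeffgeng', {'click': 13, 'uid': 15})
```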
- 5. A value in the map can be queried like this
- hive> select bar['click'] from t3;
- ...a series of MapReduce jobs...
- OK
- 13
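- 6. Write add_mapper.py
- #!/usr/bin/python
- # Python 2 mapper: add 1 to the 'click' entry of the map column.
- import sys
- for line in sys.stdin:
-     line = line.strip()
-     foo, bar = line.split('\t')
-     # bar is expected to look like {"click":13,"uid":15}; eval turns it into a dict
-     d = eval(bar)
-     d['click'] += 1
-     print '\t'.join([foo, str(d)])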
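The mapper above uses `eval` to turn the map column into a dictionary, which executes whatever text arrives on stdin. If the column really arrives as JSON-like text such as `{"click":13,"uid":15}` (this is what the script above assumes; it is not verified here for every Hive version), `json.loads` is a safer way to parse it. The sketch below is Python 3, and `increment_click` is a hypothetical helper name.

```python
import json


def increment_click(line):
    """Parse one 'foo<TAB>{json map}' row and add 1 to its 'click' entry."""
    foo, bar = line.rstrip("\n").split("\t", 1)
    counts = json.loads(bar)   # parses data only, never runs code
    counts["click"] += 1
    return foo, counts


print(increment_click('jeffgeng\t{"click":13,"uid":15}\n'))
# ('jeffgeng', {'click': 14, 'uid': 15})
```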
- 7. Run it in Hive
- hive> CREATE TABLE t4 (foo STRING, bar MAP<STRING,INT>)
- > ROW FORMAT DELIMITED
- > FIELDS TERMINATED BY '\t'
- > COLLECTION ITEMS TERMINATED BY ','
- > MAP KEYS TERMINATED BY ':'
- > STORED AS TEXTFILE;
- hive> add FILE add_mapper.py;
- hive> INSERT OVERWRITE TABLE t4
- > SELECT
- > TRANSFORM (foo, bar)
- > USING 'python add_mapper.py'
- > AS (foo, bar)
- > FROM t3;
- FAILED: Error in semantic analysis: line 1:23 Cannot insert into target table because column number/types are different t4: Cannot convert column 1 from string to map<string,int>.
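- 8. Why does this error occur? Apparently the output of add_mapper.py is a plain string, and Hive cannot recognize a map written in that form. It later turned out that the AS clause can declare an explicit type for each column: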
- INSERT OVERWRITE TABLE t4
- SELECT
- TRANSFORM (foo, bar)
- USING 'python add_mapper.py'
- AS (foo string, bar map<string,int>)
- FROM t3;
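- 9. The Python script also has to strip the spaces, quotes, braces and so on that are left over when the dictionary is converted to a string
- #!/usr/bin/python
- # Python 2 mapper: increment both counters and emit the map as text.
- import sys
- for line in sys.stdin:
-     line = line.strip()
-     foo, bar = line.split('\t')
-     d = eval(bar)
-     d['click'] += 1
-     d['uid'] += 1
-     strmap = ''
-     for x in str(d):
-         # skip spaces and single quotes; note that '{' and '}' are not
-         # in this tuple, so the braces are kept in the output
-         if x in (' ', "'"):
-             continue
-         strmap += x
-     print '\t'.join([foo, strmap])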
- 10. Result after execution
- hive> select * from t4;
- OK
- jeffgeng {"click":14,"uid":null}
- Time taken: 0.146 seconds
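In the result above, uid comes back as null. One plausible explanation (an assumption, not confirmed by the source) is that the script in step 9 does not strip the braces, so the last value reaches Hive as something like `16}` and cannot be parsed as INT. A minimal sketch that avoids post-processing `str(d)` altogether is to build the map text directly from the delimiters declared for t4 (comma between items, colon between key and value). `to_hive_map` is a hypothetical helper, written for Python 3.

```python
def to_hive_map(d, item_sep=",", key_sep=":"):
    """Serialize a dict as key:value,key:value text with no braces or quotes."""
    # sorted() only makes the output order deterministic
    return item_sep.join(
        "%s%s%s" % (key, key_sep, value) for key, value in sorted(d.items())
    )


print(to_hive_map({"click": 14, "uid": 16}))  # click:14,uid:16
```

In the mapper, the last line would then join `foo` and `to_hive_map(d)` with a tab instead of filtering characters out of `str(d)`.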
Processing data with Python through Hive TRANSFORM