Data processing on Hadoop mostly means operating on whole data sets, so the need to join one data file with another comes up all the time.
Many people have asked how to do this, so here is a worked example that should let newcomers pick up the pattern:
[hdfs@server1]$ more clean_item_new
100002303,3368
100002865,11991
100003592,7995
100004955,7033
100006838,12630
100006855,6648
100006966,8561
100007628,6648
100008089,7791
100009560,11406
100010089,3064
... ...
Fields: (item_id, item_category)
[hdfs@server1]$ more clean_user_new
10001082,285259775,1,08
10001082,4368907,1,12
10001082,4368907,1,12
10001082,53616768,1,02
10001082,151466952,1,12
10001082,53616768,4,02
10001082,290088061,1,12
10001082,298397524,1,12
10001082,32104252,1,12
10001082,323339743,1,12
10001082,396795886,1,12
10001082,9947871,1,28
10001082,150720867,1,15
... ...
Fields: (user_id, item_id, behavior_type, time)
The mapper and reducer are written in Python:
[hdfs@server1]$ cat clean_data_user_item_join.py
#!/usr/bin/python
# -*- coding:utf-8 -*-
import sys
import gc
import os
gc.disable()
def mapper():
    # filename = os.environ.get('mapreduce_map_input_file').split('/')
    filepath = os.environ["map_input_file"]  # path of the input file this map task is reading
    filename = os.path.split(filepath)[-1]
    # print '---------------%s--------------------' % (filename)
    for line in sys.stdin:
        if line.strip() == "":
            continue
        fields = line[:-1].split(",")
        if filename == 'clean_item_new':  # branch on which input file the record came from
            item_id = fields[0]
            item_category = fields[1]
            # tag item records with '0' so they sort ahead of the user records,
            # which makes the merge in the reducer easier
            print ','.join((item_id, '0', item_category))
        if filename == 'clean_user_new':
            user_id = fields[0]
            item_id = fields[1]
            behavior_type = fields[2]
            time = fields[3]
            print ','.join((item_id, '1', user_id, behavior_type, time))

def reducer():
    last_item_id = ""
    for line in sys.stdin:
        if line.strip() == '':
            continue
        fields = line[:-1].split(",")
        item_id = fields[0]
        # the map output arrives sorted, so all records sharing an item_id are
        # adjacent; merge them by comparing this line's item_id with the previous one
        if item_id != last_item_id:
            item_category = ""
            if fields[1] == "0":
                item_category = fields[2].strip()
        elif item_id == last_item_id:
            if fields[1] == "1":
                user_id = fields[2].strip()
                behavior_type = fields[3].strip()
                time = fields[4].strip()
                if item_category:
                    print ','.join((last_item_id, user_id, item_category, behavior_type, time))
        last_item_id = item_id  # update last_item_id for the next line

d = {'mapper': mapper, 'reducer': reducer}

if sys.argv[1] in d:
    d[sys.argv[1]]()
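Before submitting anything to the cluster, the whole pipeline can be dry-run locally with shell pipes. The lines below are only a rough local simulation (they assume both sample files sit in the current directory): map_input_file is set by hand because the script reads it to tell the two inputs apart, and a plain sort stands in for the shuffle.
(map_input_file=clean_item_new python clean_data_user_item_join.py mapper < clean_item_new; \
 map_input_file=clean_user_new python clean_data_user_item_join.py mapper < clean_user_new) \
| sort \
| python clean_data_user_item_join.py reducer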
[hdfs@server1]$ cat clean_data_user_item_join.sh
#!/bin/bash
hdfs_input_path=/user/hdfs/tchi/input/*
hdfs_output_path=/user/hdfs/tchi/output
hdfs dfs -rm -r ${hdfs_output_path}
hadoop jar /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D mapreduce.job.queuename=root \
-D mapred.job.name='Clean Data' \
-D stream.num.map.output.key.fields=1 \
-D mapred.map.tasks=10 \
-D mapred.reduce.tasks=6 \
-input ${hdfs_input_path} \
-output ${hdfs_output_path} \
-file clean_data_user_item_join.py \
-mapper "python clean_data_user_item_join.py mapper" \
-reducer "python clean_data_user_item_join.py reducer"
Writing Hadoop streaming jobs is fairly simple; once you have the mapper and the reducer down, you are all set.
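One caveat worth keeping in mind: the mapper emits comma-separated lines with no tab, so streaming treats the whole line as the key, and with mapred.reduce.tasks=6 the '0' and '1' records for the same item_id are not guaranteed to land on the same reducer. If the joined output looks incomplete, one possible direction (a hedged sketch, not tested against this exact script) is to declare the comma as the field separator and partition on the first field only, by adding options like these to the streaming command; the reducer side may then also need stream.reduce.input.field.separator set to a comma so that it keeps splitting on commas.
-D stream.map.output.field.separator=, \
-D map.output.key.field.separator=, \
-D mapred.text.key.partitioner.options=-k1,1 \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \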