[Hadoop]Python MRJob实现将两个文件合并

问题一:通过一张用户表和一张订单表生成每个用户的订单

输入:

uid uname
01 user1 
02 user2
03 user3
uid orderid order_price
01   01     80
01   02     90
02   03    82
02   04    95

输出:

"01:user1"	"01:80,02:90"
"02:user2"	"03:82,04:95"

1. 存储数据

用两个csv文件两张表

# user_table.csv
uid uname
01 user1 
02 user2
03 user3
# order_table.csv
uid orderid order_price
01   01     80
01   02     90
02   03    82
02   04    95

2. Mapper

请添加图片描述

	SORT_VALUES = True
	def mapper(self, _, line):
		fields = line.strip().split('\t')
		if len(fields) == 2:
			source = 'A'
			user_id = fields[0]
			user_name = fields[1]
			if not user_id.isdigit():
				return
			yield user_id, [source, user_name]
		elif len(fields) == 3:
			source = 'B'
			user_id = fields[0]
			order_id = fields[1]
			price = fields[2]
			if not user_id.isdigit():
				return
			yield user_id, [source, order_id, price]
		else:
			pass

结合图片和代码段:
Mapper中,为了能够在下一步Sort排序相同user_id的行时,用户表的行在前,订单表的行在后(下一张图中有详细说明),这里用source = Asource = B来区分用户表和订单表。
并且加了 SORT_VALUES = True

Secondary sort 1
MRJob.SORT_VALUES = None
Set this to True if you would like reducers to receive the values associated with any key in sorted order (sorted by their encoded value). Also known as secondary sort.

如果希望reducers按排序顺序(按编码值排序)接收与任何键相关联的值,请将其设置为True。也称为辅助排序。

根据文档的解释,把SORT_VALUES设置为True后,在排序时,会按照与键相关联的值编码值(本例中的source=A)的顺序作为辅助排序。

图为Sort的结果

2. Reducer

请添加图片描述

	def reducer(self, user_id, values):
		values = [v for v in values]
		if len(values) > 1:
			user_name = values[0][1]
			order_info = [':'.join([v[1],v[2]]) for v in values[1:]]
			yield ':'.join([user_id, user_name]), ','.join(order_info)

Reducer部分,结合图和代码段,目的是输出如下格式:

"01:user1"	"01:80,02:90"
"02:user2"	"03:82,04:95"

比较复杂的是order_info = [’:’.join([v[1],v[2]]) for v in values[1:]]
图中粉色笔迹部分解释了这段代码。

3. 完整代码

# MergeUserOrderTable.py
#!/miniconda2/bin/python
# encoding=utf-8

from mrjob.job import MRJob

class MergeUserOrderTable(MRJob):
	SORT_VALUES = True
	def mapper(self, _, line):
		fields = line.strip().split('\t')
		if len(fields) == 2:
			source = 'A'
			user_id = fields[0]
			user_name = fields[1]
			if not user_id.isdigit():
				return
			yield user_id, [source, user_name]
		elif len(fields) == 3:
			source = 'B'
			user_id = fields[0]
			order_id = fields[1]
			price = fields[2]
			if not user_id.isdigit():
				return
			yield user_id, [source, order_id, price]
		else:
			pass
	
	def reducer(self, user_id, values):
		values = [v for v in values]
		if len(values) > 1:
			user_name = values[0][1]
			order_info = [':'.join([v[1],v[2]]) for v in values[1:]]
			yield ':'.join([user_id, user_name]), ','.join(order_info)

def main():
	MergeUserOrderTable.run()

if __name__ == '__main__':
	main()

4. 运行

python MergeUserOrderTable.py ./user_table.csv ./order_table.csv -o output_order

output_order是接受运行结果的文件夹

5. 结果

# part-00000
"01:user1"	"01:80,02:90"
# part-00001
"02:user2"	"03:82,04:95"
# part-00002

问题二:还是合并两个表,显示效果为每个用户所下订单的订单总量和累计消费金额

输入:

uid uname
01 user1 
02 user2
03 user3
uid orderid order_price
01   01     80
01   02     90
02   03    82
02   04    95

输出:

"01:user1"	[2, 170]
"02:user2"	[2, 177]

1. 数据存储,Mapper部分同上

2. Reducer

	def reducer(self, user_id, values):
		values = [v for v in values]
		if len(values) > 1:
			user_name = values[0][1]
			order_count = len(values) - 1
			order_info = [v[2] for v in values[1:]]
			# price_count = sum(order_info)
			price_count = 0
			for price in order_info:
				price_count = price_count + int(price)
			yield ':'.join([user_id, user_name]), [order_count, price_count]

Reducer部分,目的是输出如下格式:

"01:user1"	[2, 170]
"02:user2"	[2, 177]

3. 完整代码

# UserOrderCount.py
#!/miniconda2/bin/python
# encoding=utf-8

from mrjob.job import MRJob

class UserOrderCount(MRJob):
	SORT_VALUES = True
	def mapper(self, _, line):
		fields = line.strip().split('\t')
		if len(fields) == 2:
			source = 'A'
			user_id = fields[0]
			user_name = fields[1]
			if not user_id.isdigit():
				return
			yield user_id, [source, user_name]
		elif len(fields) == 3:
			source = 'B'
			user_id = fields[0]
			order_id = fields[1]
			price = fields[2]
			if not user_id.isdigit():
				return
			yield user_id, [source, order_id, price]
		else:
			pass
	
	def reducer(self, user_id, values):
		values = [v for v in values]
		if len(values) > 1:
			user_name = values[0][1]
			order_count = len(values) - 1
			order_info = [v[2] for v in values[1:]]
			# price_count = sum(order_info)
			price_count = 0
			for price in order_info:
				price_count = price_count + int(price)
			yield ':'.join([user_id, user_name]), [order_count, price_count]

def main():
	UserOrderCount.run()

if __name__ == '__main__':
	main()

4. 运行

python UserOrderCount.py ./user_table.csv ./order_table.csv -o ./output_price

output_price是接受运行结果的文件夹

5. 结果

# part-00000
"01:user1"	[2, 170]
# part-00001
"02:user2"	[2, 177]
# part-00002

以上


  1. https://mrjob.readthedocs.io/en/latest/job.html?highlight=SORT_VALUES#mrjob.job.MRJob.SORT_VALUES ↩︎

  • 0
    点赞
  • 3
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值