A Pattern for Writing Big-Data Mappers (HDFS)

雨_刃

Last modified 2024-08-07 12:04:00

Tags: python

First published 2024-08-07 10:56:28

Article link: https://blog.csdn.net/YENTERTAINR/article/details/140985670

1. A Pattern for Writing Big-Data Mappers (HDFS)

import json
import sys

def read_input(input_stream):
    for line in input_stream:
        yield line.rstrip('\n')
        
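def load_json_data(json_line):
    # Parse one JSON line into (id, "title text"); return (None, None) if unusable.
    try:
        data = json.loads(json_line)
        unique_id = data.get('id')
        # "or ''" also covers fields that are present but null.
        combined_content = ' '.join([data.get('title') or '', data.get('text') or ''])
        return unique_id, combined_content
    except (json.JSONDecodeError, AttributeError, TypeError):
        # Malformed JSON, a non-object value (e.g. a list), or non-string fields.
        return None, None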

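def keep_record(unique_id, text, processed_ids):
    # Placeholder filter: the original called filter() with no arguments, which fails.
    # Assumption: skip unparsable lines and IDs this mapper has already emitted.
    return unique_id is not None and unique_id not in processed_ids

def mapper(input_stream, output_stream=sys.stdout):
    processed_ids = set()  # IDs already written to the output
    for json_line in read_input(input_stream):
        unique_id, text = load_json_data(json_line)
        if keep_record(unique_id, text, processed_ids):
            output_stream.write(json_line + "\n")
            processed_ids.add(unique_id)

def getKeywords():
    # Stub in the original; left unimplemented.
    pass

if __name__ == "__main__":
    mapper(sys.stdin)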
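
A mapper of this shape can be checked locally before it is submitted to a cluster, by feeding it an in-memory stream instead of `sys.stdin`. The sketch below is a self-contained, minimal version of the same pattern. The names `parse_record` and `run_mapper`, the sample records, and the filter rule (drop unparsable lines and repeated IDs) are assumptions made for illustration, not something the original post specifies.

```python
import io
import json


def parse_record(json_line):
    """Return (id, "title text") for one JSON line, or (None, None) if unusable."""
    try:
        data = json.loads(json_line)
        combined = " ".join([data.get("title") or "", data.get("text") or ""])
        return data.get("id"), combined
    except (json.JSONDecodeError, AttributeError, TypeError):
        return None, None


def run_mapper(input_stream, output_stream):
    """Echo each parsable line whose id has not been seen yet (assumed filter rule)."""
    seen_ids = set()
    for raw_line in input_stream:
        json_line = raw_line.rstrip("\n")
        unique_id, _text = parse_record(json_line)
        if unique_id is None or unique_id in seen_ids:
            continue  # skip malformed lines and duplicate ids
        output_stream.write(json_line + "\n")
        seen_ids.add(unique_id)


# Made-up sample input: one duplicate id and one malformed line.
sample_input = io.StringIO(
    '{"id": 1, "title": "a", "text": "first"}\n'
    '{"id": 1, "title": "a", "text": "duplicate"}\n'
    'not json\n'
    '{"id": 2, "title": "b", "text": "second"}\n'
)
captured = io.StringIO()
run_mapper(sample_input, captured)
kept_lines = captured.getvalue().splitlines()
print(kept_lines)
```

Passing the streams as parameters is what makes this kind of check possible: the same function runs against `sys.stdin` and `sys.stdout` in a streaming job and against `io.StringIO` objects in a test. Note that the set of seen IDs lives in a single mapper process, so it only removes duplicates within the input split that process receives.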