MapReduce Features (3): Joins - Joining Datasets

This article describes how joins are implemented in big-data processing, including the higher-level frameworks (Pig, Hive, Cascading, Crunch, Spark) that have joins as a core part of their implementation. Depending on the size of the datasets and how they are partitioned, a join is done either by distributing a small dataset to every node, or, when both datasets are large, as a map-side or reduce-side join. A map-side join requires the inputs to be pre-partitioned and sorted, while a reduce-side join is more flexible but less efficient.


Higher-level frameworks that have joins as a core part of their implementation: Pig, Hive, Cascading, Crunch, Spark

How a join is performed depends on the size of the datasets and how they are partitioned.

1 If one of the datasets is small, it can be distributed (replicated) to every MR node for the join.

2 If both datasets are large, you need to consider a map-side join or a reduce-side join.

A join performed by the mapper is called a map-side join.

A join performed by the reducer is called a reduce-side join.
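The small-dataset case above can be sketched in plain Python (this is not Hadoop code; in Hadoop the small dataset would be shipped to each mapper via the distributed cache, and the names `replicated_join`, `stations`, and `readings` here are illustrative):

```python
def replicated_join(small, large):
    """Join a small dataset (held fully in memory, as each mapper would hold it)
    against a large dataset that is only streamed, never shuffled."""
    lookup = dict(small)            # key -> value, loaded once per "mapper"
    for key, value in large:        # stream the large dataset record by record
        if key in lookup:           # inner join: drop keys missing from the small side
            yield (key, lookup[key], value)

# Hypothetical sample data: station metadata (small) joined to readings (large).
stations = [("011990", "Station A"), ("012650", "Station B")]
readings = [("011990", 22), ("012650", 18), ("999999", 30)]
print(list(replicated_join(stations, readings)))
```

Because the large side never moves across the network, this approach avoids the shuffle entirely, which is why it is preferred whenever one input fits in memory.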


Map-side join

The join is performed before the data reaches the map function.

    > The input data must be partitioned in advance and sorted in a particular way.

    > Each input dataset must be divided into the same number of partitions and sorted by the same key.

    > All records for a given key must fall into the same partition.

Use CompositeInputFormat to run a map-side join.

There is a sample program in org.apache.hadoop.examples.Join.
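The merge that a map-side join performs over one pair of matching partitions can be sketched in plain Python. `merge_join` is an illustrative name, and for simplicity the sketch assumes each key appears at most once per input:

```python
def merge_join(left, right):
    """Merge-join two lists of (key, value) pairs, each already sorted by key,
    in a single linear pass with no shuffle (assumes keys are unique per input)."""
    results, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk == rk:                 # matching keys: emit the joined record
            results.append((lk, left[i][1], right[j][1]))
            i += 1
            j += 1
        elif lk < rk:                # advance whichever side is behind
            i += 1
        else:
            j += 1
    return results

# Both inputs sorted by key, as the map-side join requires.
print(merge_join([("a", 1), ("c", 3)], [("a", "x"), ("b", "y"), ("c", "z")]))
```

This is why the preconditions above matter: the single-pass merge only works because both inputs for a partition are sorted by the same key.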


Reduce-side join

Does not require the inputs to have any particular structure, so it is more commonly used.

    > Both datasets must go through the shuffle, so it is less efficient.

    > The mapper tags each record with its source and uses the join key as the map output key; records with the same key are sent to the same reducer.

1 Multiple inputs

    Use MultipleInputs so that each source can have its own mapper and input format.

2 Secondary sort

    The reducer receives all records sharing a key from the different sources, but their order is not guaranteed; the join typically needs the records from one source to arrive before those from the other.

    A secondary sort is used to enforce this ordering.
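The tagging and secondary-sort mechanics above can be sketched in plain Python (not Hadoop code; the shuffle and sort are simulated with an in-memory sort, and names such as `map_phase` and `reduce_phase` are illustrative):

```python
from itertools import groupby

def map_phase(stations, readings):
    """Tag each record with its source; tag 0 sorts before tag 1, so the
    station metadata record arrives first within each key group."""
    for key, name in stations:
        yield ((key, 0), name)
    for key, temp in readings:
        yield ((key, 1), temp)

def reduce_phase(mapped):
    """Simulate shuffle + secondary sort: sort by (join key, source tag),
    then group by the join key alone, as a grouping comparator would."""
    shuffled = sorted(mapped, key=lambda kv: kv[0])
    for key, group in groupby(shuffled, key=lambda kv: kv[0][0]):
        records = list(group)
        name = records[0][1]          # secondary sort guarantees this is the station record
        for (_, tag), value in records[1:]:
            yield (key, name, value)

# Hypothetical sample data: one station joined to two of its readings.
mapped = list(map_phase([("011990", "Station A")], [("011990", 22), ("011990", 15)]))
print(list(reduce_phase(mapped)))
```

In real Hadoop the same effect is achieved with a composite key class, a partitioner and grouping comparator that consider only the join key, and a sort comparator that also considers the source tag.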



