Please credit the source when reposting: https://blog.csdn.net/l1028386804/article/details/88624013
Hive provides several syntax forms for streaming: MAP(), REDUCE(), and TRANSFORM(). Note that MAP() does not actually force the streaming to run in the map phase, just as REDUCE() does not force it to run in the reduce phase. For this reason, TRANSFORM() is generally recommended for the same functionality; it avoids misleading the reader about what the query does.
Here we create a table a to run our streaming examples against:
hive> create table a (a int, b int) row format delimited fields terminated by '\t';
hive> load data local inpath '/root/a.txt' into table a;
hive> select * from a;
OK
4 5
3 2
hive> desc a;
OK
a int
b int
Identity Transformation
The most basic streaming job is the identity operation. The shell command /bin/cat echoes its input unchanged, so it serves as an identity transformation.
hive> select transform(a, b) using '/bin/cat' as newA, newB from a;
OK
4 5
3 2
hive> select transform(a,b) using '/bin/cat' as newA from a;
OK
4
3
hive> select transform(a) using '/bin/cat' as newA, newB from a;
4 NULL
3 NULL
Time taken: 12.955 seconds, Fetched: 2 row(s)
Changing Types
The fields returned by transform default to string type. The following syntax casts them to other data types:
hive> select transform(a, b) using '/bin/cat' as (newA int, newB double) from a;
OK
4 5.0
3 2.0
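Hive parses the tab-delimited strings the script emits and casts each one to the declared column type; to the best of my understanding, a value that cannot be cast becomes NULL rather than an error. A rough Python sketch of that rule (the function name `cast_column` is ours, not Hive's):

```python
def cast_column(value, hive_type):
    """Mimic casting a streamed string column to a declared Hive type.

    A value that cannot be converted becomes None (Hive's NULL).
    """
    converters = {"int": int, "double": float, "string": str}
    try:
        return converters[hive_type](value)
    except ValueError:
        return None

# The script emitted "4\t5"; the query declared (newA int, newB double).
row = "4\t5".split("\t")
print([cast_column(v, t) for v, t in zip(row, ["int", "double"])])
```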
Projection Transformation
Streaming can use the cut command to extract or project specific fields. In other words, it can achieve the same behavior as a SELECT statement:
hive> select transform(a,b) using '/bin/cut -f1' as newA, newB from a;
OK
4 NULL
3 NULL
hive> select transform(a,b) using '/bin/cut -f2' as newA, newB from a;
OK
5 NULL
2 NULL
As you can see, the query expected two fields but the external process returned only one, so newB is always NULL. The transform here declares two output fields by default, but it may in fact declare any smaller number of fields:
hive> select transform(a,b) using '/bin/cut -f1' as newA from a;
OK
4
3
hive> select transform(a,b) using '/bin/cut -f2' as newA from a;
OK
5
2
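As the transcripts above suggest, Hive appears to assign the script's tab-separated output fields to the declared columns positionally: missing trailing fields become NULL, and surplus fields are dropped. A minimal Python sketch of that assignment (the helper name is ours):

```python
def assign_columns(line, n_cols):
    """Positionally map one tab-delimited output line to n_cols columns.

    Short rows are padded with None (NULL), mirroring the newB column
    in the cut -f1 queries; extra fields beyond n_cols are ignored.
    """
    fields = line.rstrip("\n").split("\t")
    fields = fields[:n_cols]                   # surplus fields are dropped
    fields += [None] * (n_cols - len(fields))  # short rows padded with NULL
    return fields

print(assign_columns("4", 2))  # cut -f1 returned one field, two declared
```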
Manipulative Transformations
The /bin/sed program (/usr/bin/sed on Mac OS X) is a stream editor: it reads an input stream, edits it according to the user's instructions, and writes the result to the output stream. The following example replaces the string "4" with the string "10":
hive> select transform(a, b) using '/bin/sed s/4/10/' as newA, newB from a;
OK
10 5
3 2
Using the Distributed Cache
When a file a query needs is not already installed on every TaskTracker, the distributed cache can ship data or program files to the cluster; they are cleaned up after the job completes. The cache keeps each job's files separate, so jobs do not interfere with one another.
The following example is a bash shell script, ctof.sh, that converts Celsius to Fahrenheit:
#!/bin/bash
while read LINE
do
  res=$(echo "scale=2; ((9/5) * $LINE) + 32" | bc)
  echo $res
done
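The same formula can be checked outside Hive; here is a Python equivalent of ctof.sh's conversion, rounded to two decimal places like `scale=2` in bc:

```python
def c_to_f(celsius):
    """Celsius to Fahrenheit, as in ctof.sh: ((9/5) * C) + 32."""
    return round((9 / 5) * celsius + 32, 2)

for c in (4, 3):  # the values in column a of table a
    print(c_to_f(c))
```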
Hive's ADD FILE feature adds a file to the distributed cache, and each added file is placed in the current working directory of every task node. This lets the transform task invoke the script directly without needing to know where to find it:
hive> add file ${env:HOME}/prog_hive/ctof.sh;
Added resource: /home/binghe/prog_hive/ctof.sh
hive> select transform(a) using 'ctof.sh' as convert from a;
39.20
37.40
Producing Multiple Rows from a Single Row
Streaming can also emit multiple output rows for each input row, as the following example shows.
Given an input file, kv_data.txt, with the following content:
k1=v1,k2=v2,k3=v3
k4=v4,k5=v5,k6=v6
k7=v7,k8=v8,k9=v9
We would like the data laid out as a table, so that ordinary HiveQL operators can process these rows:
k1 v1
k2 v2
k3 v3
k4 v4
k5 v5
k6 v6
k7 v7
k8 v8
k9 v9
Next we create a Perl script and save it as /usr/local/src/split_kv.pl:
#!/usr/bin/perl
while (<STDIN>) {
    my $line = $_;
    chomp($line);
    my @kvs = split(/,/, $line);
    foreach my $p (@kvs) {
        my @kv = split(/=/, $p);
        print $kv[0] . "\t" . $kv[1] . "\n";
    }
}
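For readers more comfortable with Python, the same one-row-to-many-rows expansion can be sketched as follows (equivalent logic only; the script Hive runs below is the Perl one):

```python
def split_kv(line):
    """Expand one 'k1=v1,k2=v2,...' line into (key, value) pairs."""
    pairs = []
    for item in line.strip().split(","):
        key, value = item.split("=", 1)
        pairs.append((key, value))
    return pairs

for k, v in split_kv("k1=v1,k2=v2,k3=v3"):
    print("%s\t%s" % (k, v))
```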
Next we create the table kv_data and load the data into it:
hive> create table kv_data(line string);
hive> load data local inpath '/usr/local/src/kv_data.txt' into table kv_data;
hive> select * from kv_data;
OK
k1=v1,k2=v2,k3=v3
k4=v4,k5=v5,k6=v6
k7=v7,k8=v8,k9=v9
Next we add split_kv.pl to Hive's distributed cache and run the query:
hive> add file /usr/local/src/split_kv.pl;
Added resources: [/usr/local/src/split_kv.pl]
hive> select transform(line) using 'perl split_kv.pl' as (key, value) from kv_data;
OK
k1 v1
k2 v2
k3 v3
k4 v4
k5 v5
k6 v6
k7 v7
k8 v8
k9 v9
Aggregation with Streaming
Streaming can perform aggregations like Hive's built-in SUM function. This works because a streaming process may return zero or more output rows for each input row.
To aggregate in an external program, the script defines an accumulator outside the loop, reads input from the stream inside the loop and accumulates it, then finally prints the sum:
The script, aggregate.pl:
#!/usr/bin/perl
my $sum = 0;
while (<STDIN>) {
    my $line = $_;
    chomp($line);
    $sum = ${sum} + ${line};
}
print $sum . "\n";
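For comparison, the same many-rows-to-one aggregation in Python (equivalent logic to aggregate.pl, with a helper name of our choosing):

```python
def stream_sum(lines):
    """Sum one integer per input line, emitting a single total."""
    total = 0
    for line in lines:
        line = line.strip()
        if line:
            total += int(line)
    return total

print(stream_sum(["1", "2", "3", "4", "5", "6", "7", "8", "9"]))
```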
Next we create a table named sum and load integer data, one number per line, for testing:
The data file, int.txt:
1
2
3
4
5
6
7
8
9
hive> create table sum(number int);
hive> load data local inpath '/usr/local/src/int.txt' into table sum;
hive> select * from sum;
OK
1
2
3
4
5
6
7
8
9
Add the streaming program to the distributed cache and use it in a transform query:
hive> add file /usr/local/src/aggregate.pl;
Added resources: [/usr/local/src/aggregate.pl]
hive> select transform(number) using 'perl aggregate.pl' as total from sum;
OK
45
Note: you cannot run multiple transform processes in a single query the way you can invoke the UDAF SUM() twice, as in:
hive> select sum(number) as one, sum(number) as two from sum;
OK
45 45
cluster by, distribute by, sort by
Sometimes rows with the same key must be routed to the same processing node, or the data must be sorted by a specified column or function.
Hive provides the cluster by clause for this: it ensures that like data is distributed to the same reduce task and that the data arrives sorted.
As an example, let's implement the word count algorithm via transform.
We need two Python scripts: one splits each line of input text into words; the other consumes the stream of words with their intermediate counts (mostly the number "1") and sums the counts for each word.
The first script, mapper.py, splits each line into words on whitespace:
import sys

for line in sys.stdin:
    words = line.strip().split()
    for word in words:
        print "%s\t1" % (word.lower())
The second script, reducer.py, caches the word currently being processed along with its running count; when the next word appears, the script emits the previous word's count and resets the cache:
import sys

(last_key, last_count) = (None, 0)
for line in sys.stdin:
    (key, count) = line.strip().split("\t")
    if last_key and last_key != key:
        print "%s\t%d" % (last_key, last_count)
        (last_key, last_count) = (key, int(count))
    else:
        last_key = key
        last_count += int(count)
if last_key:
    print "%s\t%d" % (last_key, last_count)
Create the test data set, test.txt:
hello world word
hello test
test liuyazhuang
hello binghe
test binghe
word count
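Before running in Hive, the mapper/reducer pair can be dry-run locally. The sketch below simulates the map → cluster-by (sort) → reduce pipeline in plain Python on the test data (the helper names are ours; the grouping logic mirrors reducer.py):

```python
def map_lines(lines):
    # mapper.py logic: one (word, 1) pair per whitespace-separated token
    for line in lines:
        for word in line.strip().split():
            yield (word.lower(), 1)

def reduce_pairs(pairs):
    # reducer.py logic: pairs must arrive sorted/grouped by key
    result, last_key, last_count = [], None, 0
    for key, count in pairs:
        if last_key is not None and last_key != key:
            result.append((last_key, last_count))
            last_key, last_count = key, count
        else:
            last_key = key
            last_count += count
    if last_key is not None:
        result.append((last_key, last_count))
    return result

docs = ["hello world word", "hello test", "test liuyazhuang",
        "hello binghe", "test binghe", "word count"]
# sorted() plays the role of cluster by word
counts = dict(reduce_pairs(sorted(map_lines(docs))))
print(counts)
```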
Next, we process it in Hive:
hive> create table docs(line string);
hive> load data local inpath '/usr/local/src/test.txt' into table docs;
hive> create table word_count(word string, count int) row format delimited fields terminated by '\t';
hive> add file /usr/local/hive-2.3.4/mapper.py;
Added resources: [/usr/local/hive-2.3.4/mapper.py]
hive> add file /usr/local/hive-2.3.4/reducer.py;
Added resources: [/usr/local/hive-2.3.4/reducer.py]
The mapper stage can be tested on its own first:
hive> select transform(line) using 'python mapper.py' as word, count from docs;
The full job clusters the mapper output and streams it through the reducer:
from (
  from docs
  select transform(line) using 'python mapper.py'
  as word, count
  cluster by word) wc
insert overwrite table word_count
select transform(wc.word, wc.count) using 'python reducer.py'
as word, count;
The handiest alternative to cluster by is the combination of distribute by and sort by, for the case where you want to partition the data by one field but sort it by another.
In fact, cluster by word is equivalent to distribute by word sort by word asc.
The following transform query emits its word count output sorted by word in descending order:
from (
  from docs
  select transform(line) using 'python mapper.py'
  as word, count
  distribute by word sort by word desc) wc
insert overwrite table word_count
select transform(wc.word, wc.count) using 'python reducer.py'
as word, count;
Using cluster by, or distribute by combined with sort by, is important: without these directives Hive may be unable to parallelize the job sensibly, and all of the data may be routed to a single reducer, lengthening the overall job runtime.
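Clustering matters for correctness as well: the streaming reducer assumes identical keys arrive contiguously, so feeding it unclustered pairs splits a key's count across multiple output rows. A small Python illustration of that failure mode (same grouping logic as reducer.py; the function name is ours):

```python
def streaming_reduce(pairs):
    # Same logic as reducer.py: assumes identical keys arrive contiguously.
    out, last_key, last_count = [], None, 0
    for key, count in pairs:
        if last_key is not None and last_key != key:
            out.append((last_key, last_count))
            last_key, last_count = key, count
        else:
            last_key = key
            last_count += count
    if last_key is not None:
        out.append((last_key, last_count))
    return out

pairs = [("hello", 1), ("test", 1), ("hello", 1)]
print(streaming_reduce(pairs))          # unclustered: "hello" is split in two
print(streaming_reduce(sorted(pairs)))  # clustered: one row per word
```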
GenericMR Tools for Streaming to Java
Streaming can also be written in Java, and Hive ships with the GenericMR API, which attempts to give streaming an interface resembling the Hadoop API:
from (
  from src
  map value, key
  using 'java -cp hive-contrib-3.1.1.jar org.apache.hadoop.hive.contrib.mr.example.IdentityMapper'
  as k, v
  cluster by k) map_output
reduce k, v
using 'java -cp hive-contrib-3.1.1.jar org.apache.hadoop.hive.contrib.mr.example.WordCountReduce'
as k, v;
Let's first look at the interfaces GenericMR provides. The Mapper interface is used for typical mapper implementations; it exposes a map method whose input is the row's column values as a String[] array:
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.hive.contrib.mr;
/**
* Mapper.
*/
public interface Mapper {
  /**
   * Maps a single row into an intermediate rows.
   *
   * @param record
   *          input record
   * @param output
   *          collect mapped rows.
   * @throws Exception
   *           on error
   */
  void map(String[] record, Output output) throws Exception;
}
IdentityMapper passes its input unchanged to the collector. It is functionally equivalent to /bin/cat.
package org.apache.hadoop.hive.contrib.mr.example;
import org.apache.hadoop.hive.contrib.mr.GenericMR;
import org.apache.hadoop.hive.contrib.mr.Mapper;
import org.apache.hadoop.hive.contrib.mr.Output;
/**
* Example Mapper (Identity).
*/
public final class IdentityMapper {
  public static void main(final String[] args) throws Exception {
    new GenericMR().map(System.in, System.out, new Mapper() {
      @Override
      public void map(final String[] record, final Output output) throws Exception {
        output.collect(record);
      }
    });
  }

  private IdentityMapper() {
    // prevent instantiation
  }
}
The Reducer interface receives the key as its first argument, while the remaining columns are obtained through the record iterator. Each iteration returns a pair of strings, where element 0 is the key repeated and the next element is the value. The output object collects the results:
package org.apache.hadoop.hive.contrib.mr;
import java.util.Iterator;
/**
* Simple reducer interface.
*/
public interface Reducer {
  /**
   * Reduce.
   *
   * Note that it is assumed that the key is the first column. Additionally, the
   * key will be repeated as the first column in the records[] array.
   *
   * @param key
   *          key (first column) for this set of records.
   * @param records
   *          Iterator of records for this key. Note that the first column of
   *          record will also be the key.
   * @param output
   * @throws Exception
   */
  void reduce(String key, Iterator<String[]> records, Output output)
      throws Exception;
}
The WordCountReduce class uses an accumulator to count each element from the record iterator. Once all records have been counted, it emits an array consisting of the key and its count:
package org.apache.hadoop.hive.contrib.mr.example;
import java.util.Iterator;
import org.apache.hadoop.hive.contrib.mr.GenericMR;
import org.apache.hadoop.hive.contrib.mr.Output;
import org.apache.hadoop.hive.contrib.mr.Reducer;
/**
* Example Reducer (WordCount).
*/
public final class WordCountReduce {
  private WordCountReduce() {
    // prevent instantiation
  }

  public static void main(final String[] args) throws Exception {
    new GenericMR().reduce(System.in, System.out, new Reducer() {
      public void reduce(String key, Iterator<String[]> records, Output output)
          throws Exception {
        int count = 0;
        while (records.hasNext()) {
          // note we use col[1] -- the key is provided again as col[0]
          count += Integer.parseInt(records.next()[1]);
        }
        output.collect(new String[] {key, String.valueOf(count)});
      }
    });
  }
}
Calculating Cogroups
MapReduce programs commonly cogroup data from multiple datasets via a JOIN and then process it with TRANSFORM. Using UNION ALL and CLUSTER BY, we can achieve this generalization of the GROUP BY operation.
Suppose we have log files from several different sources that share the same schema. We want to combine them and then analyze them with a reduce_script:
from (
  from (
    from order_log ol
    -- User id, order id, and timestamp:
    select ol.userid as uid, ol.orderid as id, ol.ts as ts
    union all
    from clicks_log cl
    select cl.userid as uid, cl.id as id, cl.ts as ts
  ) union_msgs
  select union_msgs.uid, union_msgs.id, union_msgs.ts
  cluster by union_msgs.uid, union_msgs.ts
) map
insert overwrite table log_analysis
select transform(map.uid, map.id, map.ts) using 'reduce_script'
as (uid, id, ...);
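The effect of UNION ALL plus CLUSTER BY can be sketched in Python: rows from both logs are concatenated, then sorting by (uid, ts) delivers each user's interleaved events contiguously and in time order, which is what the downstream reduce_script relies on. The sample rows below are hypothetical; the column layout follows the query above:

```python
# Hypothetical sample rows: (uid, id, ts) from the two logs in the query.
order_log = [("u1", "o1", 100), ("u2", "o2", 105)]
clicks_log = [("u1", "c1", 99), ("u1", "c2", 101), ("u2", "c3", 103)]

# UNION ALL: concatenate both sources, which share a schema.
union_msgs = order_log + clicks_log

# CLUSTER BY uid, ts: each uid's rows become contiguous and time-ordered.
clustered = sorted(union_msgs, key=lambda row: (row[0], row[2]))

for uid, id_, ts in clustered:
    print(uid, id_, ts)
```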