Big Data Framework Basics: Getting Started with Hadoop Streaming

为人三会

Article link: https://blog.csdn.net/okmjsayu/article/details/95795362

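Hadoop Streaming is a utility that ships with the Hadoop distribution. It lets you create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.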

Example Using Python

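To illustrate Hadoop Streaming, we use the word count problem. Every job in Hadoop has two phases: a mapper and a reducer. Here both are written as Python scripts and run under Hadoop. The same approach works with Perl or Ruby.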

Mapper Phase Code

  #!/usr/bin/python
  import sys

  # Read input from standard input
  for myline in sys.stdin:
      # Remove whitespace on either side
      myline = myline.strip()
      # Break the line into words
      words = myline.split()
      # Iterate over the list of words
      for myword in words:
          # Write word<TAB>1 to standard output
          print('%s\t%s' % (myword, 1))

Make sure this file has execute permission (chmod +x /home/expert/hadoop-1.2.1/mapper.py).

Reducer Phase Code

  #!/usr/bin/python
  import sys

  current_word = ""
  current_count = 0
  word = ""

  # Read input from standard input
  for myline in sys.stdin:
      # Remove whitespace on either side
      myline = myline.strip()
      # Split the input we got from mapper.py at the first tab
      word, count = myline.split('\t', 1)
      # Convert count to an integer
      try:
          count = int(count)
      except ValueError:
          # Count was not a number, so silently ignore this line
          continue
      if current_word == word:
          current_count += count
      else:
          if current_word:
              # Write the finished word and its total to standard output
              print('%s\t%s' % (current_word, current_count))
          current_count = count
          current_word = word

  # Do not forget to output the last word, if there is one
  if current_word:
      print('%s\t%s' % (current_word, current_count))

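Save the mapper and reducer code as mapper.py and reducer.py in the Hadoop home directory. Make sure both files have execute permission (chmod +x mapper.py and chmod +x reducer.py). Python is sensitive to indentation, so take care to keep it intact when copying the code.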
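Before submitting anything to a cluster, it can help to check the word count logic in a single process. The sketch below is not part of the scripts above: it restates the map and reduce steps as functions (the names map_words, reduce_counts, and run_local are made up for illustration) and uses a plain sort to stand in for the sorting Hadoop performs between the two phases.

```python
from itertools import groupby


def map_words(lines):
    """Map phase: emit one 'word<TAB>1' line per word, as mapper.py does."""
    for line in lines:
        for word in line.strip().split():
            yield '%s\t%s' % (word, 1)


def reduce_counts(sorted_lines):
    """Reduce phase: sum the counts of consecutive lines that share a key."""
    pairs = (line.split('\t', 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield '%s\t%s' % (word, sum(int(count) for _, count in group))


def run_local(lines):
    """Simulate map -> sort -> reduce in one process (no Hadoop involved)."""
    return list(reduce_counts(sorted(map_words(lines))))


if __name__ == '__main__':
    sample = ['hello world', 'hello hadoop']  # made-up input
    for result in run_local(sample):
        print(result)
```

This only exercises the logic. It says nothing about permissions, paths, or the Python version on the cluster nodes, which still need to be checked with a real job.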

Executing the WordCount Program

  $ $HADOOP_HOME/bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar \
      -input input_dirs \
      -output output_dir \
      -mapper <path>/mapper.py \
      -reducer <path>/reducer.py

Here, "\" is used for line continuation to make the command easier to read.

For example,

  ./bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar -input myinput -output myoutput -mapper /home/expert/hadoop-1.2.1/mapper.py -reducer /home/expert/hadoop-1.2.1/reducer.py
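If the job is launched from a script rather than typed by hand, the same command can be assembled as an argument list. The following is a minimal sketch that only builds the list and prints it. The helper name build_streaming_command is hypothetical, and the paths are the ones from the example above.

```python
def build_streaming_command(hadoop_bin, streaming_jar, input_path,
                            output_path, mapper, reducer):
    """Return the argument list for a Hadoop Streaming job (not executed)."""
    return [
        hadoop_bin, 'jar', streaming_jar,
        '-input', input_path,
        '-output', output_path,
        '-mapper', mapper,
        '-reducer', reducer,
    ]


if __name__ == '__main__':
    command = build_streaming_command(
        './bin/hadoop',
        'contrib/streaming/hadoop-streaming-1.2.1.jar',
        'myinput', 'myoutput',
        '/home/expert/hadoop-1.2.1/mapper.py',
        '/home/expert/hadoop-1.2.1/reducer.py',
    )
    print(' '.join(command))
```

On a machine where Hadoop is installed, the list could be passed to subprocess.call to run the job. Keeping the arguments as a list avoids shell quoting problems with paths.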

How Streaming Works

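In the example above, both the mapper and the reducer are Python scripts that read their input from standard input and write their output to standard output. The utility creates a Map/Reduce job, submits it to an appropriate cluster, and monitors its progress until it completes.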

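When a script is specified for the mapper, each mapper task launches the script as a separate process when the mapper is initialized. As the mapper task runs, it converts its input into lines and feeds them to the standard input (STDIN) of the process. Meanwhile, the mapper collects the line-oriented output from the standard output (STDOUT) of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key, and the rest of the line (excluding the tab character) is the value. If a line contains no tab character, the entire line is taken as the key and the value is null. This behavior can be customized as needed.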
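The default key/value rule described above can be sketched in a few lines of Python. This is an illustration of the rule, not Hadoop's own code. The function name split_key_value is made up, and None is used here to represent the null value.

```python
def split_key_value(line, separator='\t'):
    """Split one output line into (key, value) at the first separator.

    The text before the first tab is the key and the rest is the value.
    A line with no tab becomes a key with a null value (None here).
    """
    line = line.rstrip('\n')
    key, found, value = line.partition(separator)
    if not found:
        return line, None
    return key, value


if __name__ == '__main__':
    print(split_key_value('hello\t1\n'))
    print(split_key_value('a\tb\tc'))
    print(split_key_value('no-tab-here'))
```

Only the first tab separates key from value, so any later tabs stay inside the value.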

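When a script is specified for the reducer, each reducer task launches the script as a separate process when the reducer is initialized. As the reducer task runs, it converts its input key/value pairs into lines and feeds them to the standard input (STDIN) of the process. Meanwhile, the reducer collects the line-oriented output from the standard output (STDOUT) of the process and converts each line into a key/value pair, which is collected as the output of the reducer. By default, the prefix of a line up to the first tab character is the key, and the rest of the line (excluding the tab character) is the value. This behavior can be customized for specific requirements.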
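One consequence is worth spelling out: a streaming reducer script does not receive a key together with a list of its values. Each pair arrives as its own line, and lines with the same key are adjacent because the framework sorts by key before the reduce phase. The sketch below shows what that input looks like. The dictionary shape and the name reducer_stdin_lines are assumptions made for illustration.

```python
def reducer_stdin_lines(grouped):
    """Flatten {key: [values]} into the 'key<TAB>value' lines a reducer reads.

    Keys are emitted in sorted order, one line per value, so all lines
    for a given key are adjacent.
    """
    for key in sorted(grouped):
        for value in grouped[key]:
            yield '%s\t%s' % (key, value)


if __name__ == '__main__':
    grouped = {'world': [1], 'hello': [1, 1]}  # made-up map output
    for line in reducer_stdin_lines(grouped):
        print(line)
```

This is why a word count reducer has to track the current word and detect when the key changes, instead of summing a ready-made list.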

Important Commands

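Parameter	Description
-input directory/file-name	Input location for the mapper. (Required)
-output directory-name	Output location for the reducer. (Required)
-mapper executable or script or JavaClassName	Mapper executable. (Required)
-reducer executable or script or JavaClassName	Reducer executable. (Required)
-file file-name	Makes the mapper, reducer, or combiner executable available locally on the compute nodes.
-inputformat JavaClassName	The class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default.
-outputformat JavaClassName	The class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default.
-partitioner JavaClassName	Class that determines which reducer a key is sent to.
-combiner streamingCommand or JavaClassName	Combiner executable for map output.
-cmdenv name=value	Passes an environment variable to the streaming commands.
-inputreader	For backward compatibility: specifies a record reader class (instead of an input format class).
-verbose	Verbose output.
-lazyOutput	Creates output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect (or Context.write).
-numReduceTasks	Specifies the number of reducers.
-mapdebug	Script to call when a map task fails.
-reducedebug	Script to call when a reduce task fails.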
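As an illustration of -cmdenv, a mapper can read a setting from its environment instead of hard-coding it. The sketch below assumes a hypothetical variable named WORDCOUNT_MIN_LENGTH, which would be supplied by adding -cmdenv WORDCOUNT_MIN_LENGTH=4 to the streaming command. The variable and both function names are invented for this example.

```python
import os


def map_long_words(lines, min_length):
    """Emit 'word<TAB>1' only for words with at least min_length characters."""
    for line in lines:
        for word in line.strip().split():
            if len(word) >= min_length:
                yield '%s\t%s' % (word, 1)


def read_min_length(environ=os.environ, default=1):
    """Read the hypothetical WORDCOUNT_MIN_LENGTH variable, with a fallback."""
    try:
        return int(environ.get('WORDCOUNT_MIN_LENGTH', default))
    except ValueError:
        return default


if __name__ == '__main__':
    # In a real streaming job the lines would come from sys.stdin and the
    # setting from os.environ; fixed sample values are used here instead.
    sample = ['a big data framework']  # made-up input
    min_length = read_min_length({'WORDCOUNT_MIN_LENGTH': '4'})
    for out in map_long_words(sample, min_length):
        print(out)
```

Falling back to a default when the variable is missing or malformed keeps the mapper usable when the job is run without the -cmdenv option.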

Recommended reading: MapReduce Introduction and Getting Started
