Getting Started with Hadoop (the images are all gone, sigh; I'll try to add them back later)

alex_mianmian

Article link: https://blog.csdn.net/alex_mianmian/article/details/49893025

I recently started learning Hadoop, so I am recording the process here.

My environment: Ubuntu 12.04 (32-bit) in a virtual machine, Hadoop 1.2.1, JDK 1.6.

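1. Installation

I followed two articles found online, both titled "Ubuntu上搭建Hadoop环境（单机模式+伪分布模式）" (Setting up a Hadoop environment on Ubuntu: standalone mode + pseudo-distributed mode).

My Hadoop is installed under /usr/local/hadoop. Following those two articles, both standalone mode and pseudo-distributed mode work.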

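2. Running an example

Hadoop ships with some examples; wordcount is one that can be run right away.

Here are the steps, listed so I can remember them later.

a. Initialize the Hadoop environment variables.

b. Start the Hadoop services.

c. Check whether Hadoop has started.

d. Run wordcount.

e. Check the results.

Notes: 1. In standalone mode, the input and output directories must be placed under /usr/local/hadoop. In pseudo-distributed mode, both input and output live in the HDFS file system. 2. Hadoop refuses to write into an output directory that already exists, so before running the same example a second time, either delete the output directory or give a different one on the command line, such as output2.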
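To make clear what the wordcount job in step d computes, here is a minimal local sketch in Python. It is not Hadoop's own code: it only imitates the map, shuffle, and reduce phases in memory, and the function names (`map_words`, `reduce_counts`, `word_count`) are made up for this illustration.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every whitespace-separated token."""
    return [(word, 1) for word in line.split()]


def reduce_counts(pairs):
    """Shuffle + reduce step: group the pairs by word and sum their counts."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(lines):
    """Run the map step over every line, then reduce all emitted pairs."""
    pairs = []
    for line in lines:
        pairs.extend(map_words(line))
    return reduce_counts(pairs)


if __name__ == "__main__":
    sample = ["hello hadoop", "hello world"]
    # Print one "word<TAB>count" line per word, sorted by word.
    for word, total in sorted(word_count(sample).items()):
        print("%s\t%d" % (word, total))
```

The real job does the same thing, except that the input lines come from the input directory and the counts are written as files in the output directory.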

3. Compiling an example

I use the max temperature example.

The code I modified (marked in red in the original post) is flagged below with "modified" comments:

// cc MaxTemperature Application to find the maximum temperature in the weather dataset
// vv MaxTemperature
import org.apache.hadoop.conf.Configuration;  // modified: added
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;  // modified: added

public class MaxTemperature {

  public static void main(String[] args) throws Exception {
    // modified: let GenericOptionsParser consume the generic Hadoop options,
    // then require exactly two remaining arguments (input and output paths)
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: maxtemperature <in> <out>");
      System.exit(2);
    }

    // modified: create the job from conf, as the wordcount example does
    Job job = new Job(conf, "max temperature");
    job.setJarByClass(MaxTemperature.class);
    //job.setJobName("Max temperature");  // modified: the name is now set in the constructor

    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
// ^^ MaxTemperature

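3.1 Preparation:

a. Download "Hadoop: The Definitive Guide, 3rd Edition" and its companion source code; the example is in the ch02 directory.

b. Install Eclipse. It can be installed directly from the Ubuntu Software Center.

3.2 Create a new Java project

Notes: 1. Select the JDK as the Java environment. 2. Add the Hadoop JARs to the project; they are in the hadoop directory and in hadoop/lib. Use "Add External JARs" (shown in the original screenshot).

3.3 Package the compiled classes with jar, because Hadoop runs JAR files.

jar -cvf creates the archive; jar -tvf lists its contents.

3.4 Run the example

The data file sample.txt must first be put into the input directory in HDFS. Also note that the MaxTemperature class used here has been modified: I rewrote the job-creation code to follow Hadoop's wordcount example.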

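4. Hadoop streaming

4.1 Python

Ubuntu 12.04 ships with Python 2.7.3, so the Python scripts can be run directly.

Note that in pseudo-distributed mode you need to pass -file mapper.py -file reducer.py so that the mapper and reducer files are shipped to the cluster.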
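Since the screenshots are gone, here is a minimal sketch of what a streaming mapper and reducer for the max temperature example can look like. It is modelled on the book's ch02 Python scripts, not copied from them: the fixed-width offsets (year at columns 15-19, temperature at 87-92, quality code at 92) and the "+9999" missing-value marker are taken from the book's NCDC record format as I remember it, so treat them as assumptions. Both roles are bundled into one file to keep the sketch short; with separate mapper.py and reducer.py files, each file would hold one of the two functions.

```python
import re
import sys

MISSING = "+9999"  # assumed NCDC marker for a missing temperature reading


def map_records(lines):
    """Mapper: extract "year<TAB>temperature" from fixed-width records."""
    for line in lines:
        record = line.rstrip("\n")
        # Assumed offsets of the NCDC format used in the book's example.
        year, temp, quality = record[15:19], record[87:92], record[92:93]
        if temp != MISSING and re.match("[01459]", quality):
            yield "%s\t%s" % (year, temp)


def reduce_records(lines):
    """Reducer: input arrives sorted by key, so keep a running max per key."""
    last_key, max_val = None, None
    for line in lines:
        key, val = line.rstrip("\n").split("\t")
        val = int(val)
        if last_key is not None and key != last_key:
            # The key changed: emit the finished group and start a new one.
            yield "%s\t%d" % (last_key, max_val)
            max_val = val
        elif max_val is None or val > max_val:
            max_val = val
        last_key = key
    if last_key is not None:
        yield "%s\t%d" % (last_key, max_val)


# Runs only when called with an explicit role: "map" or "reduce".
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
    step = map_records if sys.argv[1] == "map" else reduce_records
    for out in step(sys.stdin):
        print(out)
```

Saved under a hypothetical name such as max_temp.py, the pair can be checked locally without Hadoop by piping sample.txt through `python max_temp.py map`, then `sort`, then `python max_temp.py reduce`; the `sort` stands in for the shuffle that Hadoop performs between the two phases.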

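4.2 C++

1. First install the build-essential package, which includes g++.

2. The example's Makefile needs to be modified (as in the original screenshot): the link step must add the libraries -lcrypto -lssl, which requires installing libssl-dev. Also set the Hadoop path and the platform type.

3. Running the program fails with a permission error. According to what I found online, the fix is almost always to rebuild the pipes and utils libraries. The run command is:
hadoop pipes \
-D hadoop.pipes.java.recordreader=true \
-D hadoop.pipes.java.recordwriter=true \
-input sample.txt \
-output output \
-program bin/max_temperature
Note that bin/max_temperature is a path inside HDFS.

4. Rebuild the utils library. Build utils first, because pipes uses the utils header files.
./configure
make install

5. Rebuild the pipes library. The pipes configure script fails because it cannot find libssl.so; in configure, change -lssl $LIBS to -lssl -lcrypto $LIBS (as in the original screenshot). There are two occurrences, and both must be changed.

6. Replace the libraries and header files that came with the installation (hadoop/c++/Linux-i386-32/) with the newly built utils and pipes, then rebuild the C++ example.

7. Run it again.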
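The edit to configure in step 5 can also be scripted instead of done by hand. The following is a minimal sketch that only performs the textual substitution described above; the helper names `patch_configure` and `patch_file`, and the ".orig" backup suffix, are my own choices for illustration.

```python
OLD = "-lssl $LIBS"
NEW = "-lssl -lcrypto $LIBS"


def patch_configure(text):
    """Return (patched_text, number_of_occurrences_replaced)."""
    count = text.count(OLD)
    return text.replace(OLD, NEW), count


def patch_file(path):
    """Patch a configure script in place, keeping a backup at path + '.orig'."""
    with open(path) as f:
        original = f.read()
    patched, count = patch_configure(original)
    if count:
        with open(path + ".orig", "w") as f:
            f.write(original)
        with open(path, "w") as f:
            f.write(patched)
    return count
```

Calling `patch_file("configure")` from the pipes source directory should report 2 if the script matches the one described in step 5. Running it a second time changes nothing, because the patched text no longer contains the old string.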