Configuring Hadoop on macOS and running and debugging simple code in Eclipse

phnfn

Article link: https://blog.csdn.net/phnfn/article/details/79080958

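My machine is a 2016 MacBook Pro with 16 GB of RAM running macOS High Sierra. I want to learn the basics of big data; my references include "Hadoop: The Definitive Guide" and the lecture videos by 于博. Unlike other tutorials, I want to set up Hadoop on the Mac in the simplest way possible.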

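I. Installing Homebrew on macOS
Homebrew is the package manager for macOS. It is simple to install and convenient to use.
It helps resolve many compatibility problems when installing Hadoop, Maven, Eclipse, and the JDK.
Homepage: https://brew.sh/index.html
Install command: /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
Homebrew has since replaced this Ruby installer with a bash script, so check the homepage for the current command.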
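II. Installing the JDK
This depends on your machine: some older macOS releases ship with JDK 1.6, but JDK 1.8 has to be installed yourself. Oracle publishes a JDK build for macOS in .dmg format. If what you download turns out to be a tar.gz archive, extract it first, then double-click the installer.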
http://www.oracle.com/technetwork/java/javase/downloads/jdk-netbeans-jsp-142931.html
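III. Installing Hadoop
I still recommend Hadoop 2.6.0, mainly because so many people use it, which makes compatibility problems with the Eclipse plugin easier to solve.
Online documentation: http://hadoop.apache.org/docs/r2.6.0/index.html
Download: http://archive.apache.org/dist/hadoop/core/hadoop-2.6.0/hadoop-2.6.0.tar.gz
Because of macOS permission restrictions, I do not recommend placing it under system directories. I put it directly in the home folder of the logged-in user (replace XXX with your user name):
cp hadoop-2.6.0.tar.gz /Users/XXX/
cd /Users/XXX/
tar -zxvf hadoop-2.6.0.tar.gz
Then edit the configuration files under etc/hadoop in the Hadoop installation directory.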

core-site.xml
    <configuration>
        <property>
            <name>fs.default.name</name>
            <value>hdfs://localhost:9000</value>
        </property>
    </configuration>

mapred-site.xml (Hadoop 2.6.0 ships only mapred-site.xml.template, so copy it to mapred-site.xml first)
    <configuration>
        <property>
            <name>mapred.job.tracker</name>
            <value>localhost:9001</value>
        </property>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
    </configuration>
hdfs-site.xml
    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
        <property>
            <name>dfs.namenode.name.dir</name>
            <value>/Users/peihuaining/hadoop-2.6.0/dfs/name</value>
        </property>
        <property>
            <name>dfs.datanode.data.dir</name>
            <value>/Users/peihuaining/hadoop-2.6.0/dfs/data</value>
        </property>
    </configuration>
yarn-site.xml
    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
    </configuration>
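
Typos in these files are easy to miss (a missing closing tag or a misspelled property name does not always produce an obvious error), so a quick check before starting the daemons can save time. The following is only a sketch using the JDK's built-in XML parser, not a Hadoop tool; the class name SiteXmlCheck is my own invention. It parses a *-site.xml file, fails if the XML is not well-formed, and prints the name/value pairs so they can be compared against the listings above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper (not part of Hadoop): reads a Hadoop-style *-site.xml.
final class SiteXmlCheck {

    /**
     * Parses the stream into name -> value pairs, in file order.
     * Throws IllegalArgumentException if the XML is not well-formed,
     * for example when a closing tag is missing.
     */
    static Map<String, String> readProperties(InputStream in) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(null); // do not print parser errors to stderr
            Document doc = builder.parse(in);
            NodeList props = doc.getElementsByTagName("property");
            Map<String, String> result = new LinkedHashMap<>();
            for (int i = 0; i < props.getLength(); i++) {
                Element property = (Element) props.item(i);
                String name = childText(property, "name");
                if (name != null) {
                    result.put(name, childText(property, "value"));
                }
            }
            return result;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("configuration is not well-formed XML", e);
        }
    }

    // Returns the trimmed text of the first <tag> child, or null if absent.
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws IOException {
        // Example: java SiteXmlCheck.java etc/hadoop/core-site.xml etc/hadoop/hdfs-site.xml
        for (String file : args) {
            try (InputStream in = Files.newInputStream(Paths.get(file))) {
                System.out.println(file + " -> " + readProperties(in));
            } catch (IllegalArgumentException e) {
                System.out.println(file + " -> BROKEN: " + e.getCause().getMessage());
            }
        }
    }
}
```

It only checks well-formedness and echoes what it finds; whether a property name is one Hadoop actually reads still has to be checked by eye against the documentation.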

Format the HDFS file system:
bin/hadoop namenode -format
Start the Hadoop daemons:
sbin/start-all.sh
Check HDFS, the ResourceManager, and the NodeManager in a local browser:
HDFS (NameNode) http://localhost:50070/
ResourceManager http://localhost:8088/
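
If a page does not load, it helps to know whether the daemon is listening at all. Below is a small sketch, assuming the default ports used in this post (9000 from core-site.xml, 50070 for the NameNode web page, 8088 for the ResourceManager web page). The class name PortProbe is made up for illustration, and it only tests whether a TCP connection can be opened, nothing Hadoop-specific.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: checks whether local daemons accept TCP connections.
final class PortProbe {

    /** Returns true if host:port accepts a TCP connection within timeoutMillis. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused or timed out: nothing usable on that port.
            return false;
        }
    }

    public static void main(String[] args) {
        // Ports assumed from the configuration and URLs in this post.
        int[] ports = {9000, 50070, 8088};
        String[] labels = {"HDFS RPC (core-site.xml)", "NameNode web UI", "ResourceManager web UI"};
        for (int i = 0; i < ports.length; i++) {
            String state = isListening("localhost", ports[i], 1000) ? "listening" : "NOT listening";
            System.out.println(labels[i] + " on port " + ports[i] + ": " + state);
        }
    }
}
```

A port that is not listening usually means the corresponding daemon failed to start, and the files under the logs directory of the Hadoop installation are the place to look next.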
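IV. Installing Eclipse and the Hadoop plugin
I do not recommend installing the latest Eclipse; for now I suggest version 4.5 (Mars), which has to be downloaded from the web.
After installing Eclipse, download the plugin: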
https://raw.githubusercontent.com/winghc/hadoop2x-eclipse-plugin/master/release/hadoop-eclipse-plugin-2.6.0.jar
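Put the plugin into the plugins folder under the Eclipse directory. If Eclipse was dragged straight into Applications, right-click Eclipse and choose Show Package Contents to reach that folder.
Start Eclipse and you will see the DFS Locations entry. Enter 9001 in the Map/Reduce Master block and 9000 in the DFS Master block, then refresh, and the files in HDFS become visible.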
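V. Hadoop example
I borrowed from the post "eclipse配置hadoop2.7.2开发环境并本地跑起来" (configuring a Hadoop 2.7.2 development environment in Eclipse and running it locally) by the blogger "不想下火车的人" (https://www.cnblogs.com/wuxun1997/p/6849878.html). In that example he uses word segmentation and frequency counting to analyze the terms in the novel "人民的名义" (In the Name of the People). I think it is more representative than the usual hello-world version that simply counts letters.
Two issues need attention.
1. He uses the word segmentation tool IKAnalyzer, and version 6.5.0 is required (http://dl.download.csdn.net/down11/20170328/0865c3272e945b7501865cb7f8b3ee40.jar?response-content-disposition=attachment%3Bfilename%2A%3D%22utf8%27%27IKAnalyzer6.5.0.jar%22&OSSAccessKeyId=9q6nvzoJGowBj4q1&Expires=1517303060&Signature=eG09niUTLwFPPvY3%2FdRDv9oRtKQ%3D&user=phnfn&sourceid=9796612&sourcescore=5&isvip=1&phnfn&9796612). Lower versions lack the constructor that takes the smart-segmentation flag, so the program fails.
2. Runtime error: the jar cannot be found and job initialization fails. I looked through other people's posts and it seems to come down to luck. The more general fix is to export the project as a jar and add it to the project manually.
The source code that I debugged successfully: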

package com.wulinfeng.hadoop.wordsplit;
import java.io.IOException;
import java.io.StringReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.InverseMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class WordSplit {

/**
 * Mapper: splits each line into words
 * @author Administrator
 *
 */
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Mapper<Object, Text, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        StringReader input = new StringReader(value.toString());
        IKSegmenter ikSeg = new IKSegmenter(input, true); // true = smart segmentation mode
        //IKSegmenter ikSeg = new IKSegmenter(input, null);
        for (Lexeme lexeme = ikSeg.next(); lexeme != null; lexeme = ikSeg.next()) {
            this.word.set(lexeme.getLexemeText());
            context.write(this.word, one);
        }
    }
}

/**
 * Reducer: sums the counts for each word
 * @author Administrator
 *
 */
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
            Reducer<Text, IntWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        this.result.set(sum);
        context.write(key, this.result);
    }
}

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("mapred.jar", "WordSplit.jar"); // jar exported from this project
    String inputFile = "hdfs://localhost:9000/input/people.txt"; // input file
    Path outDir = new Path("hdfs://localhost:9000/out"); // output directory
    Path tempDir = new Path("hdfs://localhost:9000/tmp" + System.currentTimeMillis()); // temporary directory

    // First job: word segmentation and counting
    System.out.println("start task...");
    Job job = Job.getInstance(conf, "word split");
    job.setJarByClass(WordSplit.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(inputFile));
    FileOutputFormat.setOutputPath(job, tempDir);
    System.out.println("all is ok?");
    // The first job's output is the input of the second job (sorting), so write it as a sequence file
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    System.out.println("setoutformat ok");
    if (job.waitForCompletion(true)) {
        System.out.println("start sort...");
        Job sortJob = Job.getInstance(conf, "word sort");
        sortJob.setJarByClass(WordSplit.class);
        sortJob.setMapperClass(InverseMapper.class);
        sortJob.setInputFormatClass(SequenceFileInputFormat.class);

        // Swap key and value in the map step so the count becomes the key, then sort in descending order
        sortJob.setMapOutputKeyClass(IntWritable.class);
        sortJob.setMapOutputValueClass(Text.class);
        sortJob.setSortComparatorClass(IntWritableDecreasingComparator.class);
        sortJob.setNumReduceTasks(1);

        // Write the result to the out directory
        sortJob.setOutputKeyClass(IntWritable.class);
        sortJob.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(sortJob, tempDir);

        // If the out directory already exists, delete it so the job can recreate it
        FileSystem fileSystem = outDir.getFileSystem(conf);
        if (fileSystem.exists(outDir)) {
            fileSystem.delete(outDir, true);
        }
        FileOutputFormat.setOutputPath(sortJob, outDir);

        if (sortJob.waitForCompletion(true)) {
            System.out.println("finish and quit....");
            // Delete the temporary directory
            fileSystem = tempDir.getFileSystem(conf);
            if (fileSystem.exists(tempDir)) {
                fileSystem.delete(tempDir, true);
            }
            System.exit(0);
        }
    }
    else{
        System.out.println("job init errors!");
    }
}

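/**
 * Comparator that sorts IntWritable keys in descending order
 *
 * @author Administrator
 *
 */
private static class IntWritableDecreasingComparator extends IntWritable.Comparator {
    // Negate the natural (ascending) comparison of deserialized keys
    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b);
    }

    // Same negation for the raw-bytes comparison used during the shuffle sort
    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return -super.compare(b1, s1, l1, b2, s2, l2);
    }
}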
}
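
To see what the two jobs compute without starting a cluster, here is a plain-Java sketch of the same idea. It is not the MapReduce code above and uses no Hadoop classes; the class name WordCountSketch and the sample tokens are invented for illustration. Stage 1 corresponds to the first job (count each token), stage 2 to the sort job (order by count, largest first).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory illustration of the two-stage pipeline.
final class WordCountSketch {

    /** Stage 1: count how often each token occurs (what the first job produces). */
    static Map<String, Integer> count(List<String> tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    /**
     * Stage 2: order the entries by count, largest first (what the sort job produces).
     * List.sort is stable, so tokens with equal counts keep their first-seen order.
     */
    static List<Map.Entry<String, Integer>> sortByCountDescending(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return entries;
    }

    public static void main(String[] args) {
        // Sample tokens standing in for the segmenter's output.
        List<String> tokens = Arrays.asList(
                "secretary", "teacher", "secretary", "one", "secretary", "teacher");
        for (Map.Entry<String, Integer> entry : sortByCountDescending(count(tokens))) {
            // Same layout as the job output: count, then word.
            System.out.println(entry.getValue() + "\t" + entry.getKey());
        }
    }
}
```

On the cluster the counting is spread over mappers and reducers, and the sorting falls out of the shuffle once the count is the key, which is why the sort job needs only InverseMapper and a comparator.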

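IKAnalyzer can be set to smart segmentation mode, but it still cuts things like personal names into single characters, so you need to define your own dictionary by adding the configuration file IKAnalyzer.cfg.xml to the project. See the blog by "不想下火车的人" for details; I will not reproduce it here.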
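
For orientation only, the sketch below shows the general shape of such a file and reads it back with the JDK's own java.util.Properties. The entry keys ext_dict and ext_stopwords are what I recall IKAnalyzer looking for, so treat them as an assumption and compare with the referenced blog. The two dictionary file names are the ones that appear in my run log, and the class name IkConfigSketch is made up.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper: shows a sample IKAnalyzer.cfg.xml and parses it with the JDK.
final class IkConfigSketch {

    // Assumed layout: the standard Java XML properties format.
    static final String SAMPLE_CFG =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
          + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
          + "<properties>\n"
          + "  <comment>IK Analyzer extension configuration</comment>\n"
          + "  <entry key=\"ext_dict\">myextutf8.dic;</entry>\n"
          + "  <entry key=\"ext_stopwords\">mystopwordutf8.dic;</entry>\n"
          + "</properties>\n";

    /** Reads an XML properties document; fails if the document is malformed. */
    static Properties load(InputStream in) {
        Properties properties = new Properties();
        try {
            properties.loadFromXML(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties;
    }

    public static void main(String[] args) {
        Properties cfg = load(new ByteArrayInputStream(SAMPLE_CFG.getBytes(StandardCharsets.UTF_8)));
        System.out.println("extension dictionaries: " + cfg.getProperty("ext_dict"));
        System.out.println("stop-word dictionaries: " + cfg.getProperty("ext_stopwords"));
    }
}
```

The dictionary files themselves are plain UTF-8 text with one word per line, for example one character name from the novel on each line.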
VI. Running

start task...
2018-02-02 11:48:20,088 WARN [org.apache.hadoop.util.NativeCodeLoader] - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
all is ok?
setoutformat ok
2018-02-02 11:48:20,957 WARN [org.apache.hadoop.mapreduce.JobSubmitter] - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
加载扩展词典：myextutf8.dic
加载扩展停止词典：mystopwordutf8.dic
start sort...
2018-02-02 11:48:26,340 WARN [org.apache.hadoop.mapreduce.JobSubmitter] - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
finish and quit....

Result:

1357 侯亮平
781 李达康
627 高育良
553 祁同伟
525 蔡成功
518 赵
493 老师
492 有
460 地
449 这
420 让
399 书记
393 一个
390 呢
379 把
365 不
......

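VII. Problems
1. Each debugging session requires starting the Hadoop environment first and shutting it down at the end; otherwise the file system is easily corrupted. If the NameNode gets stuck in safe mode, fix it as follows:

Step 1: leave safe mode:

hdfs dfsadmin -safemode leave

Step 2: run a health check and delete the corrupted blocks:

hdfs fsck / -delete

Note: this causes data loss, because files containing corrupted blocks are deleted.
2. For the encoding of the input and output text, I recommend converting to UTF-8; iconv can do this.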
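
If iconv is not at hand, the same conversion can be sketched in Java with the standard charset classes. This is an illustration rather than a replacement for iconv: the class name ToUtf8 and the file names in the usage comment are hypothetical, and it assumes you know the source encoding (GBK is a common one for Chinese text).

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: re-encodes a text file as UTF-8.
final class ToUtf8 {

    /** Decodes the bytes using the given source charset and re-encodes them as UTF-8. */
    static byte[] transcode(byte[] input, Charset from) {
        // Bytes that are invalid in the source charset become replacement characters.
        return new String(input, from).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            // Example: java ToUtf8.java GBK people_gbk.txt people.txt
            System.err.println("usage: ToUtf8 <source-charset> <input-file> <output-file>");
            return;
        }
        byte[] original = Files.readAllBytes(Paths.get(args[1]));
        Files.write(Paths.get(args[2]), transcode(original, Charset.forName(args[0])));
    }
}
```

It reads the whole file into memory, which is fine for a single novel; for much larger inputs a streaming reader and writer would be the better choice.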
