Submitting jobs remotely from IntelliJ IDEA to a Hadoop cluster on virtual machines (lots of pitfalls)

iammsw

Article link: https://blog.csdn.net/weixin_42741271/article/details/102539760

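I have been learning data analysis lately, which involves Hadoop and Spark. I had already set up a Hadoop cluster on virtual machines, and today I wanted to try submitting a job to that cluster remotely from IDEA on Win10, using WordCount as the example.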

1: Environment and preparation:

win10 + IntelliJ IDEA 2017.1.6 + hadoop 2.8.0
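Note: Hadoop has to be installed both on the virtual machines and locally. The installation steps are almost identical, so I won't repeat them here; search online if you get stuck. After installing Hadoop on Win10 you also need to configure the environment variables (typically HADOOP_HOME, plus its bin directory on PATH).
For the Hadoop cluster on the virtual machines, add the host mappings to C:\Windows\System32\drivers\etc\hosts, adjusted to your own setup. These are my three lines:

# This is my own configuration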
192.168.253.100 centos
192.168.253.101 server1
192.168.253.102 server2
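
A quick way to confirm that Windows actually picks up these entries is to resolve the names from Java. The following is only a minimal sketch and not part of the original project: the class name HostsCheck and its two helper methods are made up for illustration, and the sample entries are simply the three lines above.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the WordCount project.
class HostsCheck {

    /** Parses hosts-file text into hostname -> IP, skipping comments and blank lines. */
    public static Map<String, String> parseHosts(String text) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String line : text.split("\\R")) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash);   // drop trailing comment
            }
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] parts = line.split("\\s+");
            // parts[0] is the IP, every following token is a name for it
            for (int i = 1; i < parts.length; i++) {
                result.put(parts[i], parts[0]);
            }
        }
        return result;
    }

    /** Returns true if the local resolver maps the host to the expected IP. */
    public static boolean resolvesTo(String host, String expectedIp) {
        try {
            return InetAddress.getByName(host).getHostAddress().equals(expectedIp);
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String entries = "# This is my own configuration\n"
                + "192.168.253.100 centos\n"
                + "192.168.253.101 server1\n"
                + "192.168.253.102 server2\n";
        parseHosts(entries).forEach((host, ip) ->
                System.out.println(host + " -> " + ip + " : "
                        + (resolvesTo(host, ip) ? "OK" : "NOT RESOLVED")));
    }
}
```

If a name shows up as NOT RESOLVED, the hosts file was probably not saved (it needs administrator rights) or the IP does not match your virtual machine.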

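2: Creating the project in IDEA

Create a new Maven project

Pick your Java version, then click Next
Fill in the GroupId and ArtifactId however you like

Click Next again, choose your project directory, then Finish
Add the dependencies
Open pom.xml. Note the Hadoop-related dependencies: set the version to match your own installation; mine is 2.8.0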

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.msw</groupId>
    <artifactId>test1</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.8.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.8.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.8.0</version>
        </dependency>
    </dependencies>
</project>

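Add the configuration files
These are the same files you used when configuring the cluster on the virtual machines: copy core-site.xml, mapred-site.xml, and yarn-site.xml from the Hadoop installation on your virtual machine into the resources folder of the IDEA project. Mine look like this:
core-site.xml: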

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://centos:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/root/hadoop/tmp</value>
    </property>

</configuration>
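
To double-check that a copied file really carries the address you expect, you can read its name/value pairs with nothing but the JDK's XML parser. This is a rough sketch under my own assumptions: SiteXmlReader and readProperties are invented names, the XML is inlined as a string to keep the example self-contained, and Hadoop's own Configuration class does considerably more than this.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Hadoop or of the WordCount project.
class SiteXmlReader {

    /** Returns the name -> value pairs of every <property> element in a *-site.xml document. */
    public static Map<String, String> readProperties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element property = (Element) nodes.item(i);
                String name = childText(property, "name");
                if (name != null) {
                    props.put(name, childText(property, "value"));
                }
            }
            return props;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a readable site XML document", e);
        }
    }

    /** Text of the first child element with the given tag, or null if there is none. */
    private static String childText(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? null : list.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml = "<configuration>"
                + "<property><name>fs.defaultFS</name><value>hdfs://centos:9000</value></property>"
                + "<property><name>hadoop.tmp.dir</name><value>/root/hadoop/tmp</value></property>"
                + "</configuration>";
        readProperties(xml).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

The value of fs.defaultFS is the one that matters most here: its host and port must be reachable from Windows, and they reappear later in the program arguments.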

mapred-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>centos:49001</value>
    </property>
    <property>
        <name>mapred.local.dir</name>
        <value>/root/hadoop/var</value>
    </property>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.app-submission.cross-platform</name>
        <value>true</value>
    </property>
</configuration>

yarn-site.xml:

<?xml version="1.0"?>
<configuration>
    <!-- Site specific YARN configuration properties -->
       <property>
           <name>yarn.resourcemanager.hostname</name>
           <value>centos</value>
       </property>
       <property>
            <description>The address of the applications manager interface in the RM.</description>
            <name>yarn.resourcemanager.address</name>
            <value>${yarn.resourcemanager.hostname}:8032</value>
       </property>
       <property>
            <description>The address of the scheduler interface.</description>
            <name>yarn.resourcemanager.scheduler.address</name>
            <value>${yarn.resourcemanager.hostname}:8030</value>
       </property>
       <property>
            <description>The http address of the RM web application.</description>
            <name>yarn.resourcemanager.webapp.address</name>
            <value>${yarn.resourcemanager.hostname}:8088</value>
       </property>
       <property>
            <description>The https address of the RM web application.</description>
            <name>yarn.resourcemanager.webapp.https.address</name>
            <value>${yarn.resourcemanager.hostname}:8090</value>
       </property>
       <property>
            <name>yarn.resourcemanager.resource-tracker.address</name>
            <value>${yarn.resourcemanager.hostname}:8031</value>
       </property>
       <property>
            <description>The address of the RM admin interface.</description>
            <name>yarn.resourcemanager.admin.address</name>
            <value>${yarn.resourcemanager.hostname}:8033</value>
       </property>
       <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
       </property>
       <property>
            <name>yarn.scheduler.maximum-allocation-mb</name>
            <value>2048</value>
            <description>Maximum memory that can be allocated to a single container, in MB (default 8192)</description>
       </property>
       <property>
            <name>yarn.nodemanager.vmem-pmem-ratio</name>
            <value>2.1</value>
       </property>
       <property>
            <name>yarn.nodemanager.resource.memory-mb</name>
            <value>2048</value>
       </property>
       <property>
            <name>yarn.nodemanager.vmem-check-enabled</name>
            <value>false</value>
       </property>
</configuration>
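
Most of the addresses in yarn-site.xml are written as ${yarn.resourcemanager.hostname}:port, so changing the hostname property changes all of them at once. The sketch below imitates that substitution in plain Java so you can see what the final values look like. It is a simplification of my own (a single pass over the string, with the invented names VarExpansionSketch and expand), not Hadoop's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of ${name} substitution; Hadoop's own logic is more elaborate.
class VarExpansionSketch {

    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces each ${name} with props.get(name); unknown names are left untouched. */
    public static String expand(String value, Map<String, String> props) {
        Matcher matcher = VAR.matcher(value);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String replacement = props.get(matcher.group(1));
            if (replacement == null) {
                replacement = matcher.group();   // keep the original ${...} text
            }
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("yarn.resourcemanager.hostname", "centos");
        // prints centos:8032
        System.out.println(expand("${yarn.resourcemanager.hostname}:8032", props));
    }
}
```

With my configuration the ResourceManager address therefore ends up as centos:8032, which is the port the Windows client has to reach when it submits the job.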

Then log4j.properties:

log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{ABSOLUTE} | %-5.5p | %-16.16t | %-32.32c{1} | %-32.32C %4L | %m%n
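
Hadoop's configuration classes look for the *-site.xml files on the classpath, which is why they go into the resources folder. If the job later behaves as if it were running locally, a likely cause is that the files never made it onto the classpath. A small check, assuming it is run from inside the same IDEA module (ClasspathCheck is a made-up name):

```java
// Hypothetical helper; run it from the same module as WordCount.
class ClasspathCheck {

    /** Returns true if the named resource is visible to this class's class loader. */
    public static boolean onClasspath(String resourceName) {
        return ClasspathCheck.class.getClassLoader().getResource(resourceName) != null;
    }

    public static void main(String[] args) {
        String[] expected = {"core-site.xml", "mapred-site.xml", "yarn-site.xml", "log4j.properties"};
        for (String name : expected) {
            System.out.println(name + (onClasspath(name) ? " found" : " MISSING"));
        }
    }
}
```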

Write WordCount.java

package com.msw;
/*
 * File: WordCount.java
 * Date: 2019/10/13-20:40
 * Author: msw.
 * PS ...
*/

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static class TokenizerMapper extends
            Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends
            Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args)
                .getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }

        Job job = Job.getInstance(conf, "word count");  // new Job(conf, name) is deprecated
        job.setJarByClass(WordCount.class);
        job.setJar("WordCount.jar");    // note: this is the jar you will package in the next step
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);

        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);

        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        job.setNumReduceTasks(2);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
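
Before involving the cluster at all, it can help to see what the mapper and reducer compute together. The sketch below reproduces the same logic in plain Java without any Hadoop classes: it tokenizes on whitespace with StringTokenizer, as TokenizerMapper does, and sums a 1 per token, as IntSumReducer does. LocalWordCount is a name I made up, and the sample sentence is arbitrary.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical local stand-in for the map + reduce logic of WordCount.
class LocalWordCount {

    /** Counts whitespace-separated tokens, returning them in sorted order. */
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            // same effect as emitting (word, 1) and summing per key
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints: hadoop 1, hello 2, idea 1 (one tab-separated pair per line)
        count("hello hadoop hello idea")
                .forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

On the cluster the result has the same word/count shape, but because the job sets two reduce tasks, the words are split across two output files under the output directory.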

Package and test

Build WordCount.jar:
Press Ctrl+Alt+Shift+S to open the Project Structure dialog
Click Artifacts => + => JAR => Empty
Enter WordCount as the name

Then click the green + button in the middle, choose the Module Output option, and confirm

Click Apply, then OK, and close the dialog
From the top menu choose Build => Build Artifacts, and a small popup appears

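Choose Build. An out folder then appears under your project; open it to find the WordCount.jar you just packaged
Copy this WordCount.jar into the project root directory. job.setJar("WordCount.jar") uses a relative path, and by default IDEA runs the program with the project root as its working directory.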
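
Since the job refers to the jar by the relative name WordCount.jar, a missing or empty jar only shows up as a confusing failure at submit time. The following check is a sketch under my own assumptions: JarCheck is an invented helper, and the entry name com/msw/WordCount.class follows from the package declaration in WordCount.java.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.jar.JarFile;

// Hypothetical helper to verify the packaged artifact before submitting.
class JarCheck {

    /** Returns true if the jar exists and contains the given entry. */
    public static boolean containsEntry(Path jar, String entryName) {
        if (!Files.isRegularFile(jar)) {
            return false;
        }
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            return jarFile.getEntry(entryName) != null;
        } catch (IOException e) {
            return false;   // unreadable or not a valid jar
        }
    }

    public static void main(String[] args) {
        // Resolved against the working directory, just like job.setJar("WordCount.jar")
        Path jar = Paths.get("WordCount.jar");
        System.out.println(jar.toAbsolutePath() + " contains WordCount.class: "
                + containsEntry(jar, "com/msw/WordCount.class"));
    }
}
```

The printed absolute path also tells you which directory the run configuration is really using as its working directory.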
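Once that is done, the project root contains pom.xml, WordCount.jar, the out folder, and src, with the three site XML files and log4j.properties in the resources folder.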

OK, almost done
Run => Edit Configurations

Add an Application configuration and set it up as follows:
The fields you need to change are Main class and Program arguments

Pay attention to the program arguments:

hdfs://centos:9000/input/word.txt
hdfs://centos:9000/output

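centos is my hostname, i.e. the hostname of the Hadoop cluster's master node;

/input/word.txt is a file in the distributed file system HDFS that you need to upload beforehand, for example with hdfs dfs -mkdir -p /input followed by hdfs dfs -put word.txt /input/word.txt. It is only for testing, so any txt file will do;

/output is also a directory in HDFS, and it must not exist yet. If the job fails with an error saying the output directory already exists, delete it with hdfs dfs -rm -r /output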
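
To see how the two arguments break down, you can feed them to java.net.URI. The host and port should match fs.defaultFS in core-site.xml, and the path is what you would pass to hdfs dfs commands on the cluster. A short sketch (HdfsPathParts is a made-up name):

```java
import java.net.URI;

// Hypothetical helper that splits an hdfs:// argument into its parts.
class HdfsPathParts {

    public static String describe(String arg) {
        URI uri = URI.create(arg);
        return "scheme=" + uri.getScheme()
                + " host=" + uri.getHost()
                + " port=" + uri.getPort()
                + " path=" + uri.getPath();
    }

    public static void main(String[] args) {
        // prints scheme=hdfs host=centos port=9000 path=/input/word.txt
        System.out.println(describe("hdfs://centos:9000/input/word.txt"));
        // prints scheme=hdfs host=centos port=9000 path=/output
        System.out.println(describe("hdfs://centos:9000/output"));
    }
}
```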

OK! All done. Click Run to test it, and remember to start the Hadoop cluster first~
