First Steps with Hadoop: Pitfalls Along the Way

weixin_33717298

Original article: https://my.oschina.net/u/2474629/blog/3058593

Download

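Download the version you need from the links below. On Windows, first look at the Hadoop Windows tools package (winutils) to see which versions it currently supports, then pick the matching Hadoop release.

Hadoop download | Hadoop Windows tools package

After downloading, extract both archives: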

[Figure: versions available in the Hadoop Windows tools package]

[Figure: bin directory of the Hadoop Windows tools package]

[Figure: Hadoop root directory]

[Figure: Hadoop bin directory]

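Overwrite Hadoop's bin directory with the bin directory from the tools package. Make sure the versions correspond; matching the major version is usually enough. Back up the original bin directory before overwriting it.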

Configuration files

core-site.xml

<configuration>
    <property>       
        <name>fs.defaultFS</name>       
        <value>hdfs://localhost:9000</value>   
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/G:/datacenter/data30/tmp</value>
    </property>
</configuration>

This mainly sets the access address of the HDFS file system and the temporary directory.
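The fs.defaultFS value is an ordinary URI, so it can help to see how it breaks down into scheme, host, and port. The following is a minimal sketch using only java.net.URI from the JDK; the class name DefaultFsUri and the method describe are hypothetical names chosen for illustration and are not part of Hadoop.

```java
import java.net.URI;

// Hypothetical helper (not a Hadoop class): splits an fs.defaultFS-style
// URI into its parts using only the JDK.
class DefaultFsUri {

    /** Returns a one-line description of the scheme, host and port of the URI. */
    static String describe(String defaultFs) {
        URI uri = URI.create(defaultFs);
        return "scheme=" + uri.getScheme()
                + ", host=" + uri.getHost()
                + ", port=" + uri.getPort();
    }

    public static void main(String[] args) {
        // Same value as in the core-site.xml above.
        System.out.println(describe("hdfs://localhost:9000"));
        // prints: scheme=hdfs, host=localhost, port=9000
    }
}
```

The port printed here is the NameNode RPC port that clients connect to. It is separate from the web UI port.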

hdfs-site.xml

<configuration>
    <property>       
        <name>dfs.replication</name>       
        <value>1</value>   
    </property>   
    <property>       
        <name>dfs.namenode.name.dir</name>       
        <value>/G:/datacenter/data30/namenode</value>   
    </property>   
    <property>       
        <name>dfs.datanode.data.dir</name>     
        <value>/G:/datacenter/data30/datanode</value>   
    </property>
</configuration>

This sets the NameNode and DataNode storage directories for HDFS.

mapred-site.xml

<configuration>   
    <property>       
        <name>mapreduce.framework.name</name>       
        <value>yarn</value>   
    </property>
</configuration>

yarn-site.xml

<configuration>   
    <property>       
        <name>yarn.nodemanager.aux-services</name>       
        <value>mapreduce_shuffle</value>   
    </property>   
    <property>       
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>   
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>   
    </property>
</configuration>

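The YARN here is not the package manager of the same name but a resource scheduler. YARN only schedules compute resources: a user program requests resources from YARN, and YARN allocates them.

The master role in YARN is the ResourceManager; the role that actually provides compute resources is the NodeManager.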

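Startup

Copy the bin directory of winutils into Hadoop's bin directory, overwriting the existing files (back them up first). Then copy the hadoop.dll dynamic link library into C:\Windows\System32.

After the replacement, set the environment variables. Assuming the Java environment variables are already configured, you only need to add HADOOP_HOME, pointing to the directory where Hadoop was extracted, and append Hadoop's bin and sbin directories to PATH.

With the directories configured earlier in place, format the NameNode before the first start: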

hadoop namenode -format
hdfs namenode -format

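Either of the two commands works; the second is recommended, since the first is the older, deprecated form. After formatting succeeds, a current folder appears under the namenode directory configured earlier.

Now you can run the start-all.cmd script. Alternatively, run start-dfs.cmd first and then start-yarn.cmd.

start-dfs.cmd starts the NameNode and DataNode; start-yarn.cmd starts the ResourceManager and NodeManager.

Once everything has started, run the jps command. You may see the following: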

JobTracker
SecondaryNameNode
NodeManager
ResourceManager
NameNode
DataNode

Because this is a single machine rather than a cluster, you may see only these four:

NodeManager
ResourceManager
NameNode
DataNode

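Open http://localhost:50070 to view the HDFS web UI.

Starting with Hadoop 3.0, the address is http://localhost:9870.

The address can also be changed with the following property (placed here in core-site.xml, although it is an HDFS property that conventionally lives in hdfs-site.xml):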

<property>
  <name>dfs.namenode.http-address</name>
  <value>127.0.0.1:50070</value>
</property>
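The two default web UI ports mentioned above depend only on the Hadoop major version. As a small sketch, the rule can be written down in plain Java. The class NameNodeWebUi and its methods are hypothetical helpers for illustration, and the sketch assumes dfs.namenode.http-address has not been overridden.

```java
// Hypothetical helper (not part of Hadoop): picks the default NameNode
// web UI port from a version string such as "2.7.7" or "3.1.2".
class NameNodeWebUi {

    /** 50070 before Hadoop 3.0, 9870 from 3.0 onwards. */
    static int defaultHttpPort(String hadoopVersion) {
        int major = Integer.parseInt(hadoopVersion.split("\\.")[0]);
        return major >= 3 ? 9870 : 50070;
    }

    /** Builds the default web UI URL for the given host and version. */
    static String defaultUrl(String host, String hadoopVersion) {
        return "http://" + host + ":" + defaultHttpPort(hadoopVersion);
    }

    public static void main(String[] args) {
        System.out.println(defaultUrl("localhost", "2.7.7")); // http://localhost:50070
        System.out.println(defaultUrl("localhost", "3.1.2")); // http://localhost:9870
    }
}
```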

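[Figure: HDFS web UI]

Open http://localhost:8088 to view the YARN web UI. To get up and running quickly it is not covered in detail here; a little more is added further on.

[Figure: YARN web UI]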

HDFS commands and the Java API

Basic HDFS commands

hadoop fs -rm -r /dir # delete recursively
hadoop fs -ls -R /dir # list recursively
hadoop fs -mkdir /dir # create a directory
hadoop fs -put in.txt /tmp # upload
hadoop fs -get /tmp/in.txt out.txt # download

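As you can see, the hadoop fs commands closely mirror their Linux counterparts. The points to watch are upload and download: the target directory of an upload must already exist.

For example, the upload command above copies in.txt from the current local directory into the /tmp directory on HDFS; listing that directory afterwards shows /tmp/in.txt.

Download works the same way: the first argument is the HDFS path, the second is the local path.

HDFS Java API

This example uses Maven. First add the dependencies: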

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>${hadoop.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>${hadoop.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>${hadoop.version}</version>
</dependency>

Set ${hadoop.version} to the version of your Hadoop installation.

Below is a simple example of HDFS operations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.IOUtils;
import org.junit.Test;

import java.io.FileInputStream;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

public class HDFSUtilTest {

    private static final String PATH = "hdfs://localhost:9000/";

    private static final String DIR = "/dir";

    private static final String FILE = "/dir/hello2";

    public static final String IN = "G:\\tmp\\in2.txt";


    /**
     * Delete a directory recursively, like: hadoop fs -rm -r /dir
     * @throws IOException
     * @throws URISyntaxException
     */
    @Test
    public void remove() throws IOException, URISyntaxException {
        FileSystem fileSystem = getFileSystem();
        Path path = new Path(DIR);
        fileSystem.delete(path, true);
    }

    /**
     * List the entries of the root directory, like: hadoop fs -ls /
     * (this listing is not recursive)
     * @throws IOException
     * @throws URISyntaxException
     */
    @Test
    public void list() throws IOException, URISyntaxException {
        FileSystem fileSystem = getFileSystem();
        Path root = new Path("/");
        FileStatus[] listStatus = fileSystem.listStatus(root);
        for (FileStatus fileStatus : listStatus) {
            String isDir = fileStatus.isDirectory() ? "directory" : "file";
            String permission = fileStatus.getPermission().toString();
            int replication = fileStatus.getReplication();
            long len = fileStatus.getLen();
            String path = fileStatus.getPath().toString();
            System.out.println(isDir + "\t" + permission + "\t" + replication
                    + "\t" + len + "\t" + path);
        }
    }

    /**
     * Read a file from HDFS and print it to stdout, like: hadoop fs -cat path
     * @throws IOException
     * @throws URISyntaxException
     */
    @Test
    public void getData() throws IOException, URISyntaxException {
        FileSystem fileSystem = getFileSystem();
//        String file = "/out/_SUCCESS";
        String file = "/out/part-r-00000";
        Path path = new Path(file);
        FSDataInputStream inputStream = fileSystem.open(path);
        IOUtils.copyBytes(inputStream, System.out, 1024, true);
    }

    /**
     * Upload a local file, like: hadoop fs -put src des
     * @throws IOException
     * @throws URISyntaxException
     */
    @Test
    public void putData() throws IOException, URISyntaxException {
        FileSystem fileSystem = getFileSystem();
        Path path = new Path(FILE);
        FSDataOutputStream out = fileSystem.create(path);
        FileInputStream in = new FileInputStream(IN);
        IOUtils.copyBytes(in, out, 1024, true);
    }

    /**
     * Create a directory, like: hadoop fs -mkdir /dir
     * @throws IOException
     * @throws URISyntaxException
     */
    @Test
    public void mkDir() throws IOException, URISyntaxException {
        FileSystem fileSystem = getFileSystem();
        Path path = new Path(DIR);
        fileSystem.mkdirs(path);
    }

    private static FileSystem getFileSystem() throws IOException, URISyntaxException {
        URI uri = new URI(PATH);
        Configuration conf = new Configuration();
        FileSystem fileSystem = FileSystem.get(uri, conf);
        return fileSystem;
    }

}

Hadoop MapReduce

Below is the classic introductory Hadoop MapReduce example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

import java.io.IOException;
import java.util.StringTokenizer;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text,Text, IntWritable>{

        private final static IntWritable one = new IntWritable(1);

        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer stringTokenizer = new StringTokenizer(value.toString());
            while (stringTokenizer.hasMoreTokens()){
                word.set(stringTokenizer.nextToken());
                context.write(word,one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for(IntWritable num : values){
                sum += num.get();
            }
            result.set(sum);
            context.write(key,result);
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration configuration = new Configuration();
        // fs.defaultFS replaces the deprecated key fs.default.name
        configuration.set("fs.defaultFS", "hdfs://localhost:9000");
        String[] remainingArgs = new GenericOptionsParser(configuration, args).getRemainingArgs();
        if(remainingArgs.length < 2){
            System.out.println("args error");
            System.exit(2);
        }
        Job job = Job.getInstance(configuration, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(remainingArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(remainingArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0:1);
    }
}

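When adding imports you will notice both a mapreduce package and a mapred package. I have not yet looked into the differences in detail; in short, org.apache.hadoop.mapred is the older API and org.apache.hadoop.mapreduce is the newer one, which this example uses.

main takes two arguments: the first is the directory containing the files to process, the second is the output directory for the results.

The files under the input directory are passed to the map method of TokenizerMapper, which splits each line into words with StringTokenizer and emits a count of 1 per word. The reduce method of IntSumReducer then sums the counts for each word.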
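To make the data flow concrete, here is a minimal in-memory sketch of the same two phases in plain Java, with no Hadoop classes involved. The class WordCountSketch and its methods are hypothetical and only imitate what TokenizerMapper and IntSumReducer do; a real job additionally shuffles and groups the pairs by key across machines between the two phases.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical in-memory imitation of the WordCount job (no Hadoop involved).
class WordCountSketch {

    /** Map phase: emit one (word, 1) pair per whitespace-separated token. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<String, Integer>(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    /** Reduce phase: sum the counts emitted for each distinct word. */
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>(); // sorted by word
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    /** Runs map over every line, then reduce over all emitted pairs. */
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("hello hadoop", "hello yarn")));
        // prints: {hadoop=1, hello=2, yarn=1}
    }
}
```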

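Note: this MapReduce job runs entirely against HDFS, so upload the files you want to analyze to HDFS with the commands above first.

Whether a path starts with / also matters. A path that does not start with / is relative and is resolved against the file system's working directory: for the local file system that is the value of the user.dir system property, while on HDFS it defaults to the user's home directory, /user/<username>. For example, with a working directory of /home/username, the path "in/data.txt" becomes /home/username/in/data.txt, whereas "/in/data.txt" is used as is.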
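The rule can be sketched in a few lines of plain Java. This is an illustration only, not Hadoop's actual Path implementation; the class PathResolutionSketch and the method resolve are hypothetical, and the working directory is passed in explicitly.

```java
// Hypothetical illustration of relative vs. absolute path handling.
// It mirrors the rule described in the text, not Hadoop's real code.
class PathResolutionSketch {

    /** Absolute paths are returned unchanged; relative ones are appended to workingDir. */
    static String resolve(String workingDir, String path) {
        if (path.startsWith("/")) {
            return path;
        }
        String base = workingDir.endsWith("/") ? workingDir : workingDir + "/";
        return base + path;
    }

    public static void main(String[] args) {
        System.out.println(resolve("/home/username", "in/data.txt"));  // /home/username/in/data.txt
        System.out.println(resolve("/home/username", "/in/data.txt")); // /in/data.txt
    }
}
```

Passing absolute input and output paths to the job avoids depending on whatever the working directory happens to be.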

YARN

Scheduler

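The Scheduler allocates resources according to the resource requirements of applications; it does not take part in running or monitoring them. The unit of allocation is the Container. The scheduler is a pluggable component, so users can implement their own to suit their needs. YARN itself ships several ready-to-use schedulers, such as the FIFO Scheduler, the Fair Scheduler, and the Capacity Scheduler.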
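As a rough illustration of the FIFO idea, the toy sketch below hands out a fixed number of containers to requests strictly in arrival order. It is not how YARN's schedulers are implemented and uses no YARN classes; FifoSchedulerSketch, its methods, and the application ids are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy FIFO allocator (hypothetical, not YARN code): each request asks for
// one container, and requests are served strictly in arrival order.
class FifoSchedulerSketch {

    private final Queue<String> pending = new ArrayDeque<>();
    private int freeContainers;

    FifoSchedulerSketch(int totalContainers) {
        this.freeContainers = totalContainers;
    }

    /** Queue a request for one container on behalf of the given application. */
    void request(String appId) {
        pending.add(appId);
    }

    /** Mark one container as free again. */
    void release() {
        freeContainers++;
    }

    /** Grant free containers to queued requests in order; returns the apps served. */
    List<String> allocate() {
        List<String> granted = new ArrayList<>();
        while (freeContainers > 0 && !pending.isEmpty()) {
            granted.add(pending.poll());
            freeContainers--;
        }
        return granted;
    }

    public static void main(String[] args) {
        FifoSchedulerSketch scheduler = new FifoSchedulerSketch(2);
        scheduler.request("app-1");
        scheduler.request("app-2");
        scheduler.request("app-3");
        System.out.println(scheduler.allocate()); // [app-1, app-2]
        scheduler.release();
        System.out.println(scheduler.allocate()); // [app-3]
    }
}
```

The third request waits until a container is released. That head-of-line waiting is the behavior the Fair and Capacity schedulers are meant to soften by sharing the cluster between applications or queues.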

ResourceManager

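The ResourceManager is the master node of a YARN cluster and schedules cluster resources based on application demand. It coordinates and manages the resources of the whole cluster (all NodeManagers) and handles the parsing, scheduling, and monitoring of the different kinds of applications that users submit. The ResourceManager starts one MRAppMaster for each application, and these MRAppMasters are spread across the NodeManager nodes.

Responsibilities of the ResourceManager:

Handling client requests
Starting and monitoring the MRAppMaster
Monitoring the NodeManagers
Allocating and scheduling resources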

NodeManager

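The NodeManager is the actual provider of resources in a YARN cluster and of the containers in which applications run. It monitors the resource usage of applications and reports to the ResourceManager through heartbeats to keep its health status up to date. It also manages the lifecycle of Containers, monitors each Container's resource usage, tracks node health, and manages logs and the auxiliary services used by different applications.

Responsibilities of the NodeManager: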