Hadoop01

Latest recommended article published on 2021-04-23 23:31:04

initializeliu

Category: Big Data Distributed Development. Tags: Hadoop

Article link: https://blog.csdn.net/weixin_42581821/article/details/94960874

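What is big data

Basic concepts

Now that Internet technology has matured, most of the data produced by everyday life and work is digitized, and the volume of data people generate has grown explosively. Traditional data-processing techniques can no longer keep up. Demand drives technology, so a family of software tools for processing massive data sets emerged, and that is what "big data" refers to.
Users do not need to build these core technologies from scratch: mature frameworks already exist for both storage and computation.

Core techniques for processing massive data:
Massive data storage: distributed
Massive data computation: distributed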

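Storage frameworks:
HDFS: distributed file storage system (the storage framework in Hadoop)
HBase: distributed database system
Kafka: distributed message buffering system (widely used in real-time stream processing)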

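Compute frameworks (the core problem they solve is running the user's processing logic in parallel on many machines):
MapReduce: offline batch processing (the compute framework in Hadoop)
Spark: offline batch processing and real-time stream computing
Storm: real-time stream computing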

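Auxiliary tools (they relieve big data engineers of tedious work):
Hive: data warehouse tool (accepts SQL and translates it into MapReduce or Spark jobs)
Flume: data collection (automatically collects log files)
Sqoop: data migration (moves data from a database into HBase or HDFS, and from HDFS back into a database)
Elasticsearch: distributed search engine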

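Put another way, big data means:
1. There is a massive amount of data
2. There is a need to mine that data
3. There are software tools for mining it (Hadoop, Spark, Storm, Flink, Tez, Impala, ...)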

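Application scenarios

The most typical application of data processing: analyzing how a company's products are performing
E-commerce recommendation systems: run large numbers of algorithmic models over massive browsing and purchasing data to produce recommendations that the site's pages then show to users
Targeted advertising systems: analyze massive amounts of data about Internet users to build user profiles (attribute tags for each user), so that ads can be delivered precisely to the audience an advertiser wants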

Hadoop

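What is Hadoop

Hadoop has 3 core components:
Distributed file system: HDFS, which stores files across many servers
Distributed computing framework: MapReduce, which runs computations in parallel on many machines
Distributed resource scheduling platform: YARN, which schedules large numbers of MapReduce programs and allocates computing resources sensibly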
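
To make the "parallel computation" idea a little more concrete, here is a minimal single-JVM sketch of the map and reduce phases applied to word counting. It does not use the Hadoop MapReduce API; the class `MiniWordCount` and its methods are hypothetical names chosen for illustration, and everything runs sequentially in one process.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy illustration of map -> shuffle -> reduce. NOT the Hadoop API.
class MiniWordCount {

    // "Map" phase: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // "Shuffle + reduce" phase: group the pairs by word and sum the counts.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    // Applies map to every line, then reduces. On a cluster the map calls
    // would run in parallel on many machines; here they run one by one.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String line : lines) {
            all.addAll(map(line));
        }
        return reduce(all);
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("hello hdfs", "hello yarn")));
    }
}
```

In a real job the framework, not the user, takes care of distributing the map calls and moving the intermediate pairs to the reducers.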

HDFS

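HDFS: distributed file system

HDFS shares the usual features of a file system:
1. It has a directory structure whose top-level directory is /
2. What it stores is files
3. It supports creating, deleting, modifying, viewing, and moving files

HDFS differs from an ordinary single-machine file system:
1. Files in a single-machine file system live within one machine's operating system
2. An HDFS file system spans many machines
3. Files in a single-machine file system are stored on one machine's disks
4. Files in HDFS ultimately land in the local file systems of many machines (HDFS is a file system built on top of the local Linux file systems)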

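How HDFS works:
[Figure: HDFS working mechanism]
1. When a client stores a file in HDFS, the file is split into blocks that are scattered across N Linux machines (the role that stores blocks is the DataNode). Strictly speaking, it is the client that decides how the file is split.
2. Once a file has been split and stored, HDFS needs a mechanism that records each file's block information and the machines holding each block (the role that records this is the NameNode).
3. To keep data safe, HDFS can store multiple replicas of each block in the cluster (the number of replicas is specified by the client that wrote the file).
In summary: an HDFS system consists of one server running the NameNode and N servers running DataNodes.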
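
The bookkeeping described above can be sketched in a few lines of plain Java. This is a toy model under stated assumptions, not HDFS code: `BlockPlanner`, `Block`, and `plan` are hypothetical names, and replicas are assigned round-robin, whereas real HDFS placement also takes racks and node load into account.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of "split a file into blocks and remember where each block lives".
class BlockPlanner {

    // One block of a file: its byte range and the DataNodes holding a replica.
    static class Block {
        final long offset;
        final long length;
        final List<String> hosts;

        Block(long offset, long length, List<String> hosts) {
            this.offset = offset;
            this.length = length;
            this.hosts = hosts;
        }
    }

    // Cuts a file of fileSize bytes into blocks of at most blockSize bytes and
    // assigns each block to `replication` distinct DataNodes (round-robin,
    // which is a simplification of what HDFS really does).
    static List<Block> plan(long fileSize, long blockSize, int replication, List<String> dataNodes) {
        if (blockSize <= 0 || replication < 1 || replication > dataNodes.size()) {
            throw new IllegalArgumentException("invalid block size or replication");
        }
        List<Block> blocks = new ArrayList<>();
        int next = 0;
        for (long offset = 0; offset < fileSize; offset += blockSize) {
            long length = Math.min(blockSize, fileSize - offset);
            List<String> hosts = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                hosts.add(dataNodes.get((next + r) % dataNodes.size()));
            }
            next = (next + 1) % dataNodes.size();
            blocks.add(new Block(offset, length, hosts));
        }
        return blocks;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // A 200 MB file, 64 MB blocks, 2 replicas, three DataNodes.
        for (Block b : plan(200 * mb, 64 * mb, 2, Arrays.asList("n1", "n2", "n3"))) {
            System.out.println("offset=" + b.offset + " length=" + b.length + " hosts=" + b.hosts);
        }
    }
}
```

The list returned by `plan` plays the role of the NameNode's metadata for one file; the byte ranges themselves would live on the DataNodes.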

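Setting up a distributed HDFS cluster

HDFS cluster structure:

[Figure: HDFS cluster structure]

Steps to install an HDFS cluster:

I. Prepare N Linux servers

For learning purposes, virtual machines are enough.
Prepare 3 virtual machines that together host 1 NameNode and 3 DataNodes (n1 runs both the NameNode and a DataNode; n2 and n3 each run a DataNode)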

II. Set each machine's hostname and IP address

Hostname: n1, IP address: 192.168.145.201
Hostname: n2, IP address: 192.168.145.202
Hostname: n3, IP address: 192.168.145.203

III. Connect remotely from Windows with CRT (SecureCRT)

On Windows, add the hostname of each Linux machine to the local hosts mapping file:
c:/windows/system32/drivers/etc/hosts
192.168.145.201 n1
192.168.145.202 n2
192.168.145.203 n3

Once connected with CRT, adjust its display settings (font size, and set the character encoding to UTF-8)

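IV. Configure the base software environment on the Linux servers

1. Disable the firewall
2. Install the JDK
3. Configure hostname mappings within the cluster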

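V. Install the HDFS cluster

1. Upload the Hadoop package to n1 and extract it
tar -zxvf hadoop-3.2.0.tar.gz
2. Edit the configuration files
Core configuration parameters:
1) Set Hadoop's default file system to HDFS
2) Specify which machine is the HDFS NameNode
3) Specify the local directory where the NameNode stores its metadata
4) Specify the local directory where DataNodes store file blocks
Hadoop's configuration files are in /appdata/hadoop/etc/hadoop/ (this assumes the extracted hadoop-3.2.0 directory is reachable as /appdata/hadoop, for example through a rename or a symlink)

1) Edit hadoop-env.sh
		export JAVA_HOME=/appdata/jdk
2) Edit core-site.xml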

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://n1:9000</value>
        </property>
        <property>
                <name>hadoop.tmp.dir</name>
                <value>/appdata/hadoop/data</value>
        </property>
</configuration>
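
For readers who want to see how a `<configuration>` file of this shape maps to name/value pairs, here is a minimal sketch that parses one with the JDK's built-in DOM parser. It is not how Hadoop itself loads configuration; `SiteXmlReader` and `read` are hypothetical names, and the XML string in `main` is a trimmed copy of the file above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Reads <property><name>..</name><value>..</value></property> pairs into a map.
class SiteXmlReader {

    static Map<String, String> read(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element property = (Element) nodes.item(i);
                String name = property.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = property.getElementsByTagName("value").item(0).getTextContent().trim();
                props.put(name, value);
            }
            return props;
        } catch (Exception e) {
            // Wrap parser and I/O errors so callers need no checked exceptions.
            throw new IllegalArgumentException("cannot parse configuration XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<configuration>"
                + "<property><name>fs.defaultFS</name><value>hdfs://n1:9000</value></property>"
                + "<property><name>hadoop.tmp.dir</name><value>/appdata/hadoop/data</value></property>"
                + "</configuration>";
        System.out.println(read(xml));
    }
}
```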

3) Edit hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
        <property>
                <name>dfs.namenode.name.dir</name>
                <value>/data/hdpdata/name</value>
        </property>
        <property>
                <name>dfs.datanode.data.dir</name>
                <value>/data/hdpdata/data</value>
        </property>
        <property>
                <name>dfs.namenode.secondary.http-address</name>
                <value>n2:50090</value>
        </property>
        <property>
                <name>dfs.namenode.http-address</name>
                <value>0.0.0.0:50070</value>
        </property>
</configuration>

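4) Copy the whole Hadoop installation directory to the other machines
	scp -r hadoop-3.2.0 n2:/appdata
	scp -r hadoop-3.2.0 n3:/appdata
5) Start HDFS
To run hadoop commands, configure the HADOOP_HOME and PATH environment variables on Linux
vi /etc/profile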
		export JAVA_HOME=/appdata/jdk
		export HADOOP_HOME=/appdata/hadoop
		export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
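After saving the file, reload it in the current shell with: source /etc/profile
6) Initialize the NameNode metadata directory
	Run the following command on n1 to initialize the NameNode's metadata storage directory
	hdfs namenode -format
	a. Creates a brand-new metadata storage directory
	b. Generates fsimage, the file that records the metadata
	c. Generates cluster identifiers such as the cluster ID (clusterID)
7) Use the batch startup scripts to start HDFS
	Edit etc/hadoop/workers under the Hadoop installation directory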
		n1
		n2
		n3
8) On n1, run the start-dfs.sh script to start the whole cluster
9) Check with jps

[Figure: jps output]
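
As a rough guide to what jps should list on each host, the sketch below derives the expected HDFS process names from the settings used in this walkthrough: the NameNode on n1, the SecondaryNameNode on n2 (from dfs.namenode.secondary.http-address), and a DataNode on every host in the workers file. `ExpectedDaemons` and `expected` are hypothetical names, and the result covers HDFS daemons only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Derives, per host, the HDFS daemon names that jps is expected to show.
class ExpectedDaemons {

    static Map<String, List<String>> expected(String nameNode, String secondary, List<String> workers) {
        Map<String, List<String>> byHost = new TreeMap<>();
        byHost.computeIfAbsent(nameNode, h -> new ArrayList<>()).add("NameNode");
        byHost.computeIfAbsent(secondary, h -> new ArrayList<>()).add("SecondaryNameNode");
        for (String worker : workers) {
            byHost.computeIfAbsent(worker, h -> new ArrayList<>()).add("DataNode");
        }
        return byHost;
    }

    public static void main(String[] args) {
        expected("n1", "n2", Arrays.asList("n1", "n2", "n3"))
                .forEach((host, daemons) -> System.out.println(host + ": " + daemons));
    }
}
```

If a host is missing one of these processes, the log files under the Hadoop installation's logs directory are usually the first place to look.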

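Both the block size and the number of replicas are decided by the client, and "decided by the client" means they are set through configuration parameters.
The HDFS client reads the following two parameters to determine the block size and the replica count:
Block size: dfs.blocksize
Replica count: dfs.replication

Both parameters should be set in hdfs-site.xml in the Hadoop directory on the client machine:

<property>
        <name>dfs.blocksize</name>
        <value>64m</value>
</property>
<property>
        <name>dfs.replication</name>
        <value>2</value>
</property>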
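
A value such as 64m is a byte count with a binary suffix. The following sketch shows one way to turn such a string into bytes and to work out how many blocks a file would occupy. It mimics the idea only and is not Hadoop's parser; `SizeParser`, `parseBytes`, and `blockCount` are hypothetical names, and only the k, m, g, and t suffixes are handled.

```java
import java.util.Locale;

// Converts size strings like "64m" to bytes and counts blocks for a file.
class SizeParser {

    // Accepts a plain number of bytes or a number followed by k, m, g, or t
    // (case-insensitive, powers of 1024).
    static long parseBytes(String text) {
        String s = text.trim().toLowerCase(Locale.ROOT);
        if (s.isEmpty()) {
            throw new IllegalArgumentException("empty size");
        }
        int shift;
        switch (s.charAt(s.length() - 1)) {
            case 'k': shift = 10; break;
            case 'm': shift = 20; break;
            case 'g': shift = 30; break;
            case 't': shift = 40; break;
            default:  shift = 0;  break;
        }
        String digits = (shift == 0) ? s : s.substring(0, s.length() - 1).trim();
        return Long.parseLong(digits) << shift;
    }

    // Number of blocks needed for a file: ceiling of fileSize / blockSize.
    static long blockCount(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize;
    }

    public static void main(String[] args) {
        long blockSize = parseBytes("64m");
        System.out.println("block size in bytes: " + blockSize);
        System.out.println("blocks for a 200 MB file: " + blockCount(200L * 1024 * 1024, blockSize));
    }
}
```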

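VI. Common HDFS client commands

1. Upload a file to HDFS
hadoop fs -put /local-file /aaa
2. Download a file to the client's local disk
hadoop fs -get /hdfs-path /local-directory
3. Create a directory in HDFS
hadoop fs -mkdir -p /aaa/xxx
4. Move (rename) a file in HDFS
hadoop fs -mv /hdfs-path-1 /hdfs-path-2
5. Copy a file in HDFS to another HDFS directory
hadoop fs -cp /hdfs-path-1 /hdfs-path-2
6. Delete a file or directory in HDFS
hadoop fs -rm -r /aaa
7. View the contents of a text file in HDFS
hadoop fs -cat /demo.txt
hadoop fs -tail -f /demo.txt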

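VII. Data collection scenario diagram

[Figure: data collection scenario]

Creating a Java project in IDEA

The project is created successfully.

The lib directory holds the JARs the project needs at run time.
Create the application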

package com.initialize;
import org.apache.log4j.Logger;
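public class LoggerWrite {
    public static void main(String[] args) throws InterruptedException {
        // Look up the logger once; "logRollingFile" matches the logger configured in log4j.properties
        Logger logger = Logger.getLogger("logRollingFile");
        // Emit one log line every 50 ms to simulate a service that keeps writing logs
        while (true) {
            logger.info("111111111111111111111111111110----"+ System.currentTimeMillis());
            Thread.sleep(50);
        }
    }
}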

# Settings (log4j.properties)
#log4j.rootLogger=debug,stdout,genlog
# The first value of rootLogger is the level; appender names follow it
log4j.rootLogger=DEBUG,stdout

# Console output configuration
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target=System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=[%-5p] %d{yyyy-MM-dd HH:mm:ss,SSS} method:%l%n%m%n

# Log file output configuration
log4j.logger.logRollingFile = DEBUG,test1
log4j.appender.test1 = org.apache.log4j.RollingFileAppender
log4j.appender.test1.layout = org.apache.log4j.PatternLayout
log4j.appender.test1.layout.ConversionPattern =%d{yyyy-MMM-dd HH:mm:ss}-[TS] %p %t %c - %m%n
log4j.appender.test1.Threshold = DEBUG
log4j.appender.test1.ImmediateFlush = TRUE
log4j.appender.test1.Append = TRUE
# Path of the log output file
log4j.appender.test1.File = D:/testlog/access.log
log4j.appender.test1.MaxFileSize = 64KB
log4j.appender.test1.MaxBackupIndex = 200
log4j.appender.test1.Encoding = UTF-8

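Because the program uses log4j, log4j.jar must be placed in the lib directory.
Running the program produces the following:
Console:
[Figure: console output]

Log directory:
[Figure: log directory listing]

In the log directory, access.log.5 is the oldest log file and access.log is the file currently being written. When access.log reaches the configured size, access.log.5 is renamed to access.log.6, the lower-numbered backups are renamed in turn, and access.log is renamed to access.log.1. A new access.log is then created to receive new log entries.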
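
The renaming order described above can be simulated on a set of file names. This is a sketch of the idea only, not log4j's implementation: `RollingSketch` and `rollover` are hypothetical names, no real files are touched, and the method modifies the set it is given.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Simulates one rollover of a size-based rolling log on a set of file names.
class RollingSketch {

    // Returns the steps performed, in order. `existing` is updated in place.
    static List<String> rollover(String base, Set<String> existing, int maxBackupIndex) {
        List<String> steps = new ArrayList<>();
        // The backup with the highest allowed index has nowhere to go and is dropped.
        String oldest = base + "." + maxBackupIndex;
        if (existing.remove(oldest)) {
            steps.add("delete " + oldest);
        }
        // Shift every remaining backup up by one, starting from the oldest.
        for (int i = maxBackupIndex - 1; i >= 1; i--) {
            String from = base + "." + i;
            if (existing.remove(from)) {
                String to = base + "." + (i + 1);
                existing.add(to);
                steps.add(from + " -> " + to);
            }
        }
        // The active file becomes backup number 1, then a fresh active file is created.
        if (existing.remove(base)) {
            existing.add(base + ".1");
            steps.add(base + " -> " + base + ".1");
        }
        existing.add(base);
        steps.add("create " + base);
        return steps;
    }

    public static void main(String[] args) {
        Set<String> files = new TreeSet<>(Arrays.asList(
                "access.log", "access.log.1", "access.log.2",
                "access.log.3", "access.log.4", "access.log.5"));
        rollover("access.log", files, 200).forEach(System.out::println);
    }
}
```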

Log file contents:
[Figure: log file contents]

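Using HDFS from a local Windows client

Environment setup

JAR dependencies

JARs under common
[Figure: JARs under common]
JARs under hdfs

Installing Hadoop locally on Windows

Add Hadoop to the environment variables.
hadoop/bin

The standard Hadoop package lacks some native (C) files needed on Windows; these are typically winutils.exe and hadoop.dll, placed in hadoop/bin.

The above are all the dependencies.
Running an HDFS client does not require every one of these files.

Program source code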

package com.initialize;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.junit.Before;
import org.junit.Test;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.Arrays;

public class HdfsClientDemo {

    public static void main(String[] args) throws Exception {
        /**
         * How the Configuration object resolves parameters:
         *    on construction it loads the default configuration bundled in the JARs (xx-default.xml),
         *    then loads the user configuration (xx-site.xml), which overrides the defaults;
         *    after construction, conf.set("p","v") overrides the values from the user configuration files.
         */
        // new Configuration() loads core-default.xml, hdfs-default.xml, core-site.xml, hdfs-site.xml, etc. from the project classpath
        Configuration conf = new Configuration();

        // Files uploaded by this client are stored with 2 replicas
        conf.set("dfs.replication", "2");
        // Files uploaded by this client are split into 64 MB blocks
        conf.set("dfs.blocksize", "64m");

        // Build a client for the given HDFS. Arg 1: the HDFS URI; arg 2: client-specific parameters; arg 3: the client identity (user name)
        FileSystem fs = FileSystem.get(new URI("hdfs://n1:9000/"), conf, "lys");

        // Upload a file to HDFS
        fs.copyFromLocalFile(new Path("C:\\Users\\3D Objects\\火车引擎.3mf"), new Path("/aaa/"));

        fs.close();
    }

    FileSystem fs = null;

    @Before
    public void init() throws Exception{
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "2");
        conf.set("dfs.blocksize", "64m");

        // n1 is the NameNode host configured in fs.defaultFS earlier in this article
        fs = FileSystem.get(new URI("hdfs://n1:9000/"), conf, "root");
    }


    /**
     * Download a file from HDFS to the client's local disk
     * @throws IOException
     * @throws IllegalArgumentException
     */
    @Test
    public void testGet() throws IllegalArgumentException, IOException {

        fs.copyToLocalFile(new Path("/hdp20-05.txt"), new Path("f:/"));
        fs.close();

    }


    /**
     * Move or rename a file within HDFS
     */
    @Test
    public void testRename() throws Exception{

        fs.rename(new Path("/install.log"), new Path("/aaa/in.log"));

        fs.close();

    }

    /**
     * Create a directory in HDFS
     */
    @Test
    public void testMkdir() throws Exception{

        fs.mkdirs(new Path("/xx/yy/zz"));

        fs.close();
    }


    /**
     * Delete a file or directory in HDFS
     */
    @Test
    public void testRm() throws Exception{

        fs.delete(new Path("/aaa"), true);

        fs.close();
    }



    /**
     * List information about the files under a given HDFS directory
     */
    @Test
    public void testLs() throws Exception{
        // Returns files only (recursively); directories are not included
        RemoteIterator<LocatedFileStatus> iter = fs.listFiles(new Path("/"), true);

        while(iter.hasNext()){
            LocatedFileStatus status = iter.next();
            System.out.println("Full path: "+status.getPath());
            System.out.println("Block size: "+status.getBlockSize());
            System.out.println("File length: "+status.getLen());
            System.out.println("Replication: "+status.getReplication());
            System.out.println("Block locations: "+ Arrays.toString(status.getBlockLocations()));

            System.out.println("--------------------------------");
        }
        fs.close();
    }

    /**
     * List information about the files and directories under a given HDFS directory
     */
    @Test
    public void testLs2() throws Exception{
        FileStatus[] listStatus = fs.listStatus(new Path("/"));

        for(FileStatus status:listStatus){
            System.out.println("Full path: "+status.getPath());
            System.out.println(status.isDirectory()?"This is a directory":"This is a file");
            System.out.println("Block size: "+status.getBlockSize());
            System.out.println("File length: "+status.getLen());
            System.out.println("Replication: "+status.getReplication());

            System.out.println("--------------------------------");
        }
        fs.close();
    }
    /**
     * Read the contents of a file in HDFS
     */
    @Test
    public void testReadData() throws Exception{
        FSDataInputStream in = fs.open(new Path("/test.txt"));

        BufferedReader br = new BufferedReader(new InputStreamReader(in, "utf-8"));

        String line = null;
        while((line = br.readLine()) != null){
            System.out.println(line);
        }

        br.close();
        in.close();
        fs.close();
    }

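    /**
     * Read a given offset range from a file in HDFS.
     * The same technique can be used to read one specific block of a text file.
     */
    @Test
    public void testRandomReadData() throws Exception{

        FSDataInputStream in = fs.open(new Path("/test.txt"));

        // Move the read position to byte offset 12
        in.seek(12);
        // Read up to 16 bytes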
        byte[] buf = new byte[16];
        in.read(buf);

        System.out.println(new String(buf));

        in.close();
        fs.close();
    }
    /**
     * Write data to a file in HDFS
     */
    @Test
    public void testWriteData() throws Exception{

        FSDataOutputStream out = fs.create(new Path("/zz.jpg"));

        FileInputStream in = new FileInputStream("C:\\Users\\Pictures\\Camera Roll\\b3119313b07eca8006930d9b922397dda1448333.jpg");

        byte[] buf = new byte[1024];
        int read = 0;
        while((read = in.read(buf)) != -1){
            out.write(buf, 0, read);
        }

        in.close();
        out.close();
        fs.close();
    }

}
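
The seek-then-read pattern in testRandomReadData, and the hint about reading one specific block, can be tried without a cluster. The sketch below uses a local temporary file and the JDK's RandomAccessFile as a stand-in for FSDataInputStream; `LocalRangeRead` and its methods are hypothetical names, and the block range arithmetic assumes fixed-size blocks with a shorter final block.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Local-file stand-in for "seek to an offset, then read a fixed range".
class LocalRangeRead {

    // Byte range {start, length} covered by block `index` of a file.
    static long[] blockRange(int index, long blockSize, long fileLength) {
        long start = index * blockSize;
        if (index < 0 || start >= fileLength) {
            throw new IllegalArgumentException("no such block: " + index);
        }
        return new long[] { start, Math.min(blockSize, fileLength - start) };
    }

    // Reads exactly `length` bytes starting at `offset`.
    static byte[] readRange(Path file, long offset, int length) {
        try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
            in.seek(offset);
            byte[] buf = new byte[length];
            in.readFully(buf); // fails with EOFException if the file is too short
            return buf;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes a small temp file, reads 16 bytes from offset 12, and cleans up.
    static String demo() {
        try {
            Path tmp = Files.createTempFile("range-demo", ".txt");
            try {
                Files.write(tmp, "0123456789abcdefghijklmnopqrstuvwxyz".getBytes(StandardCharsets.US_ASCII));
                return new String(readRange(tmp, 12, 16), StandardCharsets.US_ASCII);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
        long[] range = blockRange(3, 64, 200);
        System.out.println("block 3 starts at " + range[0] + " and has " + range[1] + " bytes");
    }
}
```

Against HDFS, the same range from `blockRange` would be passed to `in.seek(...)` followed by a bounded read on the FSDataInputStream.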

Local configuration file on Windows (an hdfs-site.xml on the project classpath)

<configuration>
<property>
    <name>dfs.replication</name>
    <value>4</value>
</property>
<property>
    <name>dfs.blocksize</name>
    <value>16m</value>
</property>
</configuration>
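
The override order described in the comments of HdfsClientDemo (built-in defaults, then the site file, then conf.set) can be illustrated with a small layered map. This is a sketch of the precedence rule only, not Hadoop's Configuration class; `LayeredConf` is a hypothetical name, and the default values in `main` (3 replicas, 128 MB blocks) are assumed to match the usual Hadoop defaults.

```java
import java.util.HashMap;
import java.util.Map;

// Later layers win: defaults < site file < explicit set() calls.
class LayeredConf {
    private final Map<String, String> values = new HashMap<>();

    LayeredConf(Map<String, String> defaults, Map<String, String> site) {
        values.putAll(defaults); // lowest priority
        values.putAll(site);     // overrides the defaults
    }

    // An explicit set() overrides whatever the files provided.
    void set(String name, String value) {
        values.put(name, value);
    }

    String get(String name) {
        return values.get(name);
    }

    public static void main(String[] args) {
        Map<String, String> defaults = new HashMap<>();
        defaults.put("dfs.replication", "3");
        defaults.put("dfs.blocksize", "134217728");

        Map<String, String> site = new HashMap<>();
        site.put("dfs.replication", "4");
        site.put("dfs.blocksize", "16m");

        LayeredConf conf = new LayeredConf(defaults, site);
        System.out.println("after loading files: " + conf.get("dfs.replication"));
        conf.set("dfs.replication", "2");
        System.out.println("after conf.set: " + conf.get("dfs.replication"));
    }
}
```

With the local file above and the conf.set calls in the client code, the values set in code are the ones that take effect for uploads from that client.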