Hadoop (Part 2)

This article walks through starting and configuring HDFS in a Hadoop 3.x environment: starting the NameNode, DataNode, and SecondaryNameNode, and relocating the pid files and data storage directories. It then runs a word-count example, covering data preparation, the MapReduce execution, and an analysis of the result.

Hadoop 2.x HDFS web UI: port 50070

Hadoop 3.x HDFS web UI: http://localhost:9870/

If the service is deployed on an Alibaba Cloud Linux machine, open it from Windows at http://<the machine's public IP>:9870/

Port 9870 must first be opened in the Alibaba Cloud [Security Group].

bin/hdfs dfs -mkdir /user

Note: with bin/hdfs dfs, a path that starts with / (an absolute HDFS path) must be distinguished from one that does not (which is relative to the user's HDFS home directory).

The three HDFS processes start as localhost (goal: have them start with the machine's hostname)

bin/hdfs dfs -mkdir input

By default this is created under the current user's HDFS home directory, /user/ssn.

This step was actually already done during installation; the point here is just to note the difference between the two path forms.
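As an illustration (commands only, not from the original session), the relative and absolute forms point at the same HDFS directory:

bin/hdfs dfs -ls input              # relative: resolves to /user/ssn/input
bin/hdfs dfs -ls /user/ssn/input    # absolute: the same directory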

ifconfig    # check the current machine's IP address

vi /etc/hosts
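For example, assuming the IP reported by ifconfig is 192.168.1.100 and the chosen hostname is hadoop001 (both placeholders), the entry added to /etc/hosts would look like:

192.168.1.100  hadoop001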

NameNode starts as localhost:

[ssn@localhost ~]$ cd app/hadoop/etc/hadoop
[ssn@localhost hadoop]$ vi core-site.xml 

Change hdfs://localhost:9000 so that localhost becomes the current hostname.
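For example, with the placeholder hostname hadoop001 used above, fs.defaultFS in core-site.xml becomes:

    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop001:9000</value>
    </property>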

DataNode starts with the machine name localhost:

[ssn@localhost hadoop]$ cat workers 

In earlier versions this file was called slaves; "slaves" became "workers".

Change it to the current hostname.
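So the workers file ends up containing only the hostname, e.g. with the placeholder name used above:

hadoop001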

SecondaryNameNode starts as localhost:

Default configuration from the official documentation (screenshot).

  [ssn@localhost hadoop]$ vi hdfs-site.xml 
Append the following:

    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>localhost:9868</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.https-address</name>
        <value>localhost:9869</value>
    </property>

After the configuration is done, restart the services:

[ssn@localhost hadoop]$ cd ../../
[ssn@localhost hadoop]$ sbin/stop-dfs.sh 

 

[ssn@localhost hadoop]$ sbin/start-dfs.sh

 

pid files in the /tmp directory

When a process starts, it writes a pid file there.

When a process is stopped, the process id is read from the pid file and then killed with kill -9 <pid>.

If the pid file is lost, the Hadoop process cannot be killed when you try to stop it.

Consequence: during overnight maintenance you update a config or a jar and think the DN restart took effect, when in fact the DN never restarted at all.

[ssn@localhost hadoop]$ cd /tmp
[ssn@localhost tmp]$ cat hadoop-ssn-datanode.pid
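As an illustrative check (not part of the original output), the number in the pid file should match the DataNode process shown by jps:

[ssn@localhost tmp]$ cat hadoop-ssn-datanode.pid    # prints the DataNode process id
[ssn@localhost tmp]$ jps | grep DataNode            # the same pid should appear here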

[ssn@localhost tmp]$ cd hadoop-ssn

The data storage directories are also under /tmp, which is just as risky (the OS may periodically clean up files under /tmp).

[ssn@localhost tmp]$ ll hadoop-ssn
total 0
drwxrwxr-x. 5 ssn ssn 51 Nov 26 20:01 dfs
drwxr-xr-x. 5 ssn ssn 57 Dec 13 19:11 nm-local-dir

Goal: move these directories out of the system /tmp and into the tmp directory we created ourselves under the home directory; the location is shown in the figure:

Default hadoop.tmp.dir configuration from the official docs (screenshot), and the actual directory on the system (screenshot).

NameNode data directory: default configuration (screenshot) and actual directory on the system (screenshot).

DataNode data directory: default configuration (screenshot) and actual directory on the system (screenshot).

SecondaryNameNode data directory: default configuration (screenshot) and actual directory on the system (screenshot).
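For reference, the defaults behind those screenshots (as I recall them from core-default.xml and hdfs-default.xml; double-check against your Hadoop version) are roughly:

    hadoop.tmp.dir               /tmp/hadoop-${user.name}
    dfs.namenode.name.dir        file://${hadoop.tmp.dir}/dfs/name
    dfs.datanode.data.dir        file://${hadoop.tmp.dir}/dfs/data
    dfs.namenode.checkpoint.dir  file://${hadoop.tmp.dir}/dfs/namesecondary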

How do we change this?

Modify the configuration file.

[ssn@localhost ~]$ cd app/hadoop/etc/hadoop/
[ssn@localhost hadoop]$ vi core-site.xml 
Append:
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/ssn/tmp/hadoop-${user.name}</value>
    </property>

Stop the processes

[ssn@localhost ~]$ cd app/hadoop
[ssn@localhost hadoop]$ sbin/stop-dfs.sh

Move the current user's hadoop-ssn directory from /tmp to the tmp directory we created:

[ssn@localhost tmp]$ cd ~
[ssn@localhost ~]$ ls
!  app  data  lib  log  software  source  tmp
[ssn@localhost ~]$ cd ~
[ssn@localhost ~]$ cd tmp
[ssn@localhost tmp]$ mv /tmp/hadoop-ssn ./
[ssn@localhost tmp]$ ls
hadoop-ssn  hosts.swp

 

A format operation is required, otherwise the NameNode will not start.

Line of thought:

Note: after the mv, a soft link would normally have to be created, because reads would still go through the old /tmp path while the data now actually lives under /home/ssn/tmp.

But since the configuration file has already been changed, the cluster starts normally after formatting.
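A sketch of what that soft link would look like (paths as used in this setup):

[ssn@localhost ~]$ ln -s /home/ssn/tmp/hadoop-ssn /tmp/hadoop-ssn   # old path keeps working, data lives in the new location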

Format


[ssn@localhost hadoop]$ cd /tmp
[ssn@localhost tmp]$ rm -rf *.pid
[ssn@localhost tmp]$ cd
[ssn@localhost ~]$ cd app/hadoop
[ssn@localhost hadoop]$ bin/hdfs namenode -format    # format

Next, change where the pid files are written. As before, stop the processes first, then make the configuration change.

pid file modification

[ssn@localhost ~]$ cd app/hadoop/etc/hadoop/
[ssn@localhost hadoop]$ vi hadoop-env.sh 

 

Change it to:
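A minimal sketch of the change, assuming the pid files are kept under the same ~/tmp directory created above: set HADOOP_PID_DIR in hadoop-env.sh:

export HADOOP_PID_DIR=/home/ssn/tmp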

 

After restarting, the change has taken effect.

YARN deployment

ResourceManager RM

NodeManager   NM
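The YARN-related configuration itself is not captured above; for reference, the minimal single-node setup from the official guide is roughly the following (verify against the docs for your version):

etc/hadoop/mapred-site.xml:
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>

etc/hadoop/yarn-site.xml:
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>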

To open the YARN web UI in a browser, first open port 8088 in the Alibaba Cloud [Security Group].

A publicly exposed 8088 is easily hit by crypto-mining malware or viruses; the typical symptoms are that logging in and running commands becomes very sluggish, and one process pins the CPU at 100%.

Default parameters from the official configuration documentation (screenshot).

The port can be changed via the configuration file; substitute your own port number.

[ssn@localhost ~]$ cd app/hadoop/etc/hadoop/
[ssn@localhost hadoop]$ vi yarn-site.xml
Append:
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>${yarn.resourcemanager.hostname}:8123</value>
    </property>

Start YARN
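The start command, for reference:

[ssn@localhost hadoop]$ sbin/start-yarn.sh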

YARN web UI (screenshot).

All processes are now running, but the environment variables have not yet been configured.

[ssn@localhost hadoop]$ jps
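With HDFS and YARN both up, jps should show something like the following process names (pids omitted, they will differ):

NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
Jps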

Configure environment variables

For convenience, this can be configured in user ssn's personal profile under /home/ssn:

[ssn@localhost ~]$ cd
[ssn@localhost ~]$ vi .bashrc
Append:
    export HADOOP_HOME=/home/ssn/app/hadoop
    export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

[ssn@localhost ~]$ . .bashrc   # source it so the change takes effect

Leave safe mode

[ssn@localhost ~]$ hadoop dfsadmin -safemode leave

Example: word count

Data preparation:
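The upload itself is not shown above; assuming a local file 1.log with the contents listed in the case analysis at the end, the preparation would look roughly like:

[ssn@localhost hadoop]$ hdfs dfs -mkdir /test
[ssn@localhost hadoop]$ hdfs dfs -put 1.log /test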

Find the examples jar and run it:

[ssn@localhost hadoop]$ find ./ -name '*example*'

./share/hadoop/tools/lib/hadoop-fs2img-3.2.2.jar
./share/hadoop/tools/lib/aliyun-sdk-oss-3.4.1.jar
./share/hadoop/tools/lib/hadoop-resourceestimator-3.2.2.jar
./share/hadoop/tools/lib/wildfly-openssl-1.0.7.Final.jar
./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar
./share/hadoop/mapreduce/hadoop-mapreduce-client-hs-plugins-3.2.2.jar
 

[ssn@localhost hadoop]$ yarn jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar wordcount /test /output1

2021-12-16 19:28:10,103 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2021-12-16 19:28:10,843 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/ssn/.staging/job_1639649149849_0001
2021-12-16 19:28:11,111 INFO input.FileInputFormat: Total input files to process : 1
2021-12-16 19:28:12,018 INFO mapreduce.JobSubmitter: number of splits:1  [1 split, determined by the split rules]
2021-12-16 19:28:12,678 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1639649149849_0001
2021-12-16 19:28:12,679 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-12-16 19:28:12,898 INFO conf.Configuration: resource-types.xml not found
2021-12-16 19:28:12,898 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2021-12-16 19:28:13,403 INFO impl.YarnClientImpl: Submitted application application_1639649149849_0001
2021-12-16 19:28:13,440 INFO mapreduce.Job: The url to track the job: http://localhost:8123/proxy/application_1639649149849_0001/
2021-12-16 19:28:13,441 INFO mapreduce.Job: Running job: job_1639649149849_0001
2021-12-16 19:28:23,840 INFO mapreduce.Job: Job job_1639649149849_0001 running in uber mode : false
2021-12-16 19:28:23,845 INFO mapreduce.Job:  map 0% reduce 0%
2021-12-16 19:28:30,027 INFO mapreduce.Job:  map 100% reduce 0%
2021-12-16 19:28:35,065 INFO mapreduce.Job:  map 100% reduce 100%
2021-12-16 19:28:36,082 INFO mapreduce.Job: Job job_1639649149849_0001 completed successfully
2021-12-16 19:28:36,230 INFO mapreduce.Job: Counters: 54
    File System Counters
        FILE: Number of bytes read=123
        FILE: Number of bytes written=469331
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=148
        HDFS: Number of bytes written=73
        HDFS: Number of read operations=8
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
        HDFS: Number of bytes read erasure-coded=0
    Job Counters 
        Launched map tasks=1  [1 map task]
        Launched reduce tasks=1  [1 reduce task]
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=3581
        Total time spent by all reduces in occupied slots (ms)=2920
        Total time spent by all map tasks (ms)=3581
        Total time spent by all reduce tasks (ms)=2920
        Total vcore-milliseconds taken by all map tasks=3581
        Total vcore-milliseconds taken by all reduce tasks=2920
        Total megabyte-milliseconds taken by all map tasks=3666944
        Total megabyte-milliseconds taken by all reduce tasks=2990080
    Map-Reduce Framework
        Map input records=7
        Map output records=11
        Map output bytes=95
        Map output materialized bytes=123
        Input split bytes=97
        Combine input records=11
        Combine output records=11
        Reduce input groups=11
        Reduce shuffle bytes=123
        Reduce input records=11
        Reduce output records=11
        Spilled Records=22
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=153
        CPU time spent (ms)=780
        Physical memory (bytes) snapshot=274964480
        Virtual memory (bytes) snapshot=5519527936
        Total committed heap usage (bytes)=137498624
        Peak Map Physical memory (bytes)=182628352
        Peak Map Virtual memory (bytes)=2756386816
        Peak Reduce Physical memory (bytes)=92336128
        Peak Reduce Virtual memory (bytes)=2763141120
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=51
    File Output Format Counters 
        Bytes Written=73

[ssn@localhost hadoop]$ hdfs dfs -ls /output1
Found 2 items
-rw-r--r--   1 ssn supergroup          0 2021-12-16 19:28 /output1/_SUCCESS
-rw-r--r--   1 ssn supergroup         73 2021-12-16 19:28 /output1/part-r-00000
[ssn@localhost hadoop]$ hdfs dfs -cat /output1/part-r-00000
13	1
18lianwu	1
19zx	1
Eason	1
a	1
b	1
c	1
d	1
e	1
ssn	1
www.baidu.com	1
[ssn@localhost hadoop]$ 

Case analysis:

[ssn@localhost hadoop]$ hdfs dfs -cat /test/1.log
Eason
ssn
13
19zx
18lianwu
a b c d e
www.baidu.com
[ssn@localhost hadoop]$ 
Step 1, map: each line is split into words on whitespace, and each word is emitted with an initial count of 1

(Eason,1)

(ssn,1)

(13,1)

(19zx,1)

(18lianwu,1)

(a,1)(b,1)(c,1)(d,1)(e,1)

......

Step 2, reduce: group by word and count how many times each word appears.

For example, if a appeared twice:

a:1+1  ==>  a  2

......

Translated into SQL: select word, sum(value) from t group by word;

   
