Hadoop 2.x HDFS web UI: port 50070
Hadoop 3.x HDFS web UI: http://localhost:9870/
If the service is deployed on an Alibaba Cloud Linux machine, open it from a Windows browser as http://<public IP of the machine>:9870/
Port 9870 must first be opened in the Alibaba Cloud [Security Group].
bin/hdfs dfs -mkdir /user
Note: the HDFS "/" used by bin/hdfs dfs must not be confused with the local Linux filesystem "/".
The three HDFS processes start as localhost (next, make each of them start with the machine's hostname instead)
bin/hdfs dfs -mkdir input
By default it is created under the current user's HDFS home directory, /user/ssn.
This step was actually already done during installation; the point here is just to note the difference between the two kinds of paths.
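A quick sketch of the difference (same user ssn as above):
bin/hdfs dfs -ls /            // the HDFS root (the NameNode namespace)
ls /                          // the local Linux root -- a different "/"
bin/hdfs dfs -mkdir input     // no leading "/": created as /user/ssn/input
bin/hdfs dfs -ls              // with no path, lists the HDFS home directory /user/ssn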
ifconfig // check the current machine's IP address
vi /etc/hosts
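An example entry (the IP and the hostname hadoop001 are placeholders; use the machine's real private IP and hostname):
192.168.1.100  hadoop001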
NameNode starts as localhost
[ssn@localhost ~]$ cd app/hadoop/etc/hadoop
[ssn@localhost hadoop]$ vi core-site.xml
Change hdfs://localhost:9000 so that localhost is replaced by the current hostname.
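With the placeholder hostname hadoop001, the fs.defaultFS property in core-site.xml would read:
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop001:9000</value>
</property>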
DataNode starts with the machine name localhost
[ssn@localhost hadoop]$ cat workers
In earlier versions this file was called slaves; the slaves were renamed workers.
Change its content to the current hostname.
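So the entire content of workers becomes a single line holding the hostname (placeholder again):
hadoop001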
SecondaryNameNode starts as localhost
Screenshot of the default configuration from the official docs
[ssn@localhost hadoop]$ vi hdfs-site.xml
Append the following (substituting the current hostname for localhost, as was done for the NameNode and DataNode):
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>localhost:9868</value>
</property>
<property>
<name>dfs.namenode.secondary.https-address</name>
<value>localhost:9869</value>
</property>
Once configured, restart the services:
[ssn@localhost hadoop]$ cd ../../
[ssn@localhost hadoop]$ sbin/stop-dfs.sh
[ssn@localhost hadoop]$ sbin/start-dfs.sh
The /tmp directory and pid files
When a process starts, it writes a pid file there.
When a process is stopped, the pid is read from that file and then kill -9 <pid> is issued.
If the pid file is lost, the stop scripts have no way to kill that Hadoop process.
Consequence: during overnight maintenance you update a config or a jar, believe the DataNode restart took effect, when in fact the DN never restarted at all.
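In other words, the stop script does roughly the following for each daemon (a simplified sketch of the mechanism, not the real sbin/ script):
PID_FILE=/tmp/hadoop-ssn-datanode.pid
kill -9 $(cat $PID_FILE)    // if the pid file was cleaned out of /tmp, there is nothing to kill and the old DataNode keeps running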
[ssn@localhost hadoop]$ cd /tmp
[ssn@localhost tmp]$ cat hadoop-ssn-datanode.pid
[ssn@localhost tmp]$ cd hadoop-ssn
The data storage directories also sit under /tmp, which is just as risky (the system may clean /tmp).
[ssn@localhost tmp]$ ll hadoop-ssn
total 0
drwxrwxr-x. 5 ssn ssn 51 Nov 26 20:01 dfs
drwxr-xr-x. 5 ssn ssn 57 Dec 13 19:11 nm-local-dir
Goal: move the data above out of the system /tmp directory into a tmp directory we created ourselves, as shown in the figures below:
Default tmp directory configuration from the official docs (screenshot)
Actual directory shown on the system (screenshot)
NameNode data directory, default configuration (screenshot)
Actual data directory on the system (screenshot)
DataNode data directory, default configuration (screenshot)
Actual data directory on the system (screenshot)
SecondaryNameNode data directory, default configuration (screenshot)
Actual data directory on the system (screenshot)
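For reference, the default values behind those screenshots (Hadoop 3.x defaults; double-check core-default.xml / hdfs-default.xml for your exact version):
hadoop.tmp.dir              = /tmp/hadoop-${user.name}
dfs.namenode.name.dir       = file://${hadoop.tmp.dir}/dfs/name
dfs.datanode.data.dir       = file://${hadoop.tmp.dir}/dfs/data
dfs.namenode.checkpoint.dir = file://${hadoop.tmp.dir}/dfs/namesecondary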
How do we change this?
Modify the configuration file
[ssn@localhost ~]$ cd app/hadoop/etc/hadoop/
[ssn@localhost hadoop]$ vi core-site.xml
Append:
<property>
<name>hadoop.tmp.dir</name>
<value>/home/ssn/tmp/hadoop-${user.name}</value>
</property>
Stop the processes
[ssn@localhost ~]$ cd app/hadoop
[ssn@localhost hadoop]$ sbin/stop-dfs.sh
Move the current user's hadoop-ssn directory from /tmp into the tmp directory created under the home directory:
[ssn@localhost tmp]$ cd ~
[ssn@localhost ~]$ ls
! app data lib log software source tmp
[ssn@localhost ~]$ cd ~
[ssn@localhost ~]$ cd tmp
[ssn@localhost tmp]$ mv /tmp/hadoop-ssn
resourcemanager.pid
[ssn@localhost tmp]$ mv /tmp/hadoop-ssn ./
[ssn@localhost tmp]$ ls
hadoop-ssn hosts.swp
A format is required, otherwise the NameNode will not start.
Line of thought:
Note: if you only mv the data, you must also create a symlink, so reads still resolve through the old /tmp path while the data actually lives in /home/ssn/tmp.
But since the configuration file has already been changed, after formatting everything starts normally.
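For completeness, the symlink variant mentioned above would look roughly like this (paths taken from this setup):
[ssn@localhost ~]$ mv /tmp/hadoop-ssn /home/ssn/tmp/
[ssn@localhost ~]$ ln -s /home/ssn/tmp/hadoop-ssn /tmp/hadoop-ssn    // the old /tmp path still resolves, while the data actually lives under /home/ssn/tmp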
Format
[ssn@localhost hadoop]$ cd /tmp
[ssn@localhost tmp]$ rm -rf *.pid
[ssn@localhost tmp]$ cd
[ssn@localhost ~]$ cd app/hadoop
[ssn@localhost hadoop]$ bin/hdfs namenode -format // format the NameNode
Next, change where the pid files are written; as before, stop the processes first, then change the configuration.
pid file location change
[ssn@localhost ~]$ cd app/hadoop/etc/hadoop/
[ssn@localhost hadoop]$ vi hadoop-env.sh
Change it to:
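Assuming the pid files should also live under /home/ssn/tmp (an assumption, consistent with the hadoop.tmp.dir change above), the relevant line in hadoop-env.sh becomes:
export HADOOP_PID_DIR=/home/ssn/tmp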
After restarting, the change has taken effect.
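A quick check after the restart (same assumed directory):
[ssn@localhost ~]$ ls /home/ssn/tmp/*.pid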
YARN deployment
ResourceManager RM
NodeManager NM
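Before starting RM and NM, MapReduce has to be pointed at YARN and the NodeManager shuffle service enabled. A minimal sketch, following the official Hadoop 3.x single-node guide:
mapred-site.xml:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
yarn-site.xml:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
The official guide additionally sets mapreduce.application.classpath and yarn.nodemanager.env-whitelist; see the single-node setup page for the exact values.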
To open the web UI in a browser, first open port 8088 in the Alibaba Cloud [Security Group].
A publicly exposed 8088 is an easy target for crypto-mining and other malware; the symptoms are that logging in and running commands become very sluggish, and one process pins the CPU at 100%.
Default parameters from the official configuration docs (screenshot)
The port can be changed through the config file; substitute a port number of your own.
[ssn@localhost ~]$ cd app/hadoop/etc/hadoop/
[ssn@localhost hadoop]$ vi yarn-site.xml
Append:
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>${yarn.resourcemanager.hostname}:8123</value>
</property>
Start YARN
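Presumably via the bundled script:
[ssn@localhost hadoop]$ sbin/start-yarn.sh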
Web UI (screenshot)
All processes are now running, but the environment variables have not been configured yet.
[ssn@localhost hadoop]$ jps
Configure environment variables
These can be set in user ssn's own profile under the home directory (per-user variables), which makes day-to-day operation easier.
[ssn@localhost ~]$ cd
[ssn@localhost ~]$ vi .bashrc
Append:
export HADOOP_HOME=/home/ssn/app/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
[ssn@localhost ~]$ . .bashrc // source it so the changes take effect
Leave safe mode
[ssn@localhost ~]$ hadoop dfsadmin -safemode leave // note: hadoop dfsadmin is a deprecated alias in 3.x; hdfs dfsadmin -safemode leave is the current form
Example: word count (WordCount)
Data preparation:
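A sketch of the preparation, consistent with the /test/1.log file read back in the case analysis below:
[ssn@localhost hadoop]$ vi 1.log                  // a handful of test words, one or more per line
[ssn@localhost hadoop]$ hdfs dfs -mkdir /test
[ssn@localhost hadoop]$ hdfs dfs -put 1.log /test/
[ssn@localhost hadoop]$ hdfs dfs -cat /test/1.log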
Find the example jar and give it a run:
[ssn@localhost hadoop]$ find ./ -name '*example*'
./share/hadoop/tools/lib/hadoop-fs2img-3.2.2.jar
./share/hadoop/tools/lib/aliyun-sdk-oss-3.4.1.jar
./share/hadoop/tools/lib/hadoop-resourceestimator-3.2.2.jar
./share/hadoop/tools/lib/wildfly-openssl-1.0.7.Final.jar
./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar
./share/hadoop/mapreduce/hadoop-mapreduce-client-hs-plugins-3.2.2.jar
[ssn@localhost hadoop]$ yarn jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar wordcount /test /output1
2021-12-16 19:28:10,103 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2021-12-16 19:28:10,843 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/ssn/.staging/job_1639649149849_0001
2021-12-16 19:28:11,111 INFO input.FileInputFormat: Total input files to process : 1
2021-12-16 19:28:12,018 INFO mapreduce.JobSubmitter: number of splits:1 [number of splits is 1, determined by the split rule]
2021-12-16 19:28:12,678 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1639649149849_0001
2021-12-16 19:28:12,679 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-12-16 19:28:12,898 INFO conf.Configuration: resource-types.xml not found
2021-12-16 19:28:12,898 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2021-12-16 19:28:13,403 INFO impl.YarnClientImpl: Submitted application application_1639649149849_0001
2021-12-16 19:28:13,440 INFO mapreduce.Job: The url to track the job: http://localhost:8123/proxy/application_1639649149849_0001/
2021-12-16 19:28:13,441 INFO mapreduce.Job: Running job: job_1639649149849_0001
2021-12-16 19:28:23,840 INFO mapreduce.Job: Job job_1639649149849_0001 running in uber mode : false
2021-12-16 19:28:23,845 INFO mapreduce.Job: map 0% reduce 0%
2021-12-16 19:28:30,027 INFO mapreduce.Job: map 100% reduce 0%
2021-12-16 19:28:35,065 INFO mapreduce.Job: map 100% reduce 100%
2021-12-16 19:28:36,082 INFO mapreduce.Job: Job job_1639649149849_0001 completed successfully
2021-12-16 19:28:36,230 INFO mapreduce.Job: Counters: 54
File System Counters
FILE: Number of bytes read=123
FILE: Number of bytes written=469331
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=148
HDFS: Number of bytes written=73
HDFS: Number of read operations=8
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=1 [1 map task]
Launched reduce tasks=1 [1 reduce task]
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=3581
Total time spent by all reduces in occupied slots (ms)=2920
Total time spent by all map tasks (ms)=3581
Total time spent by all reduce tasks (ms)=2920
Total vcore-milliseconds taken by all map tasks=3581
Total vcore-milliseconds taken by all reduce tasks=2920
Total megabyte-milliseconds taken by all map tasks=3666944
Total megabyte-milliseconds taken by all reduce tasks=2990080
Map-Reduce Framework
Map input records=7
Map output records=11
Map output bytes=95
Map output materialized bytes=123
Input split bytes=97
Combine input records=11
Combine output records=11
Reduce input groups=11
Reduce shuffle bytes=123
Reduce input records=11
Reduce output records=11
Spilled Records=22
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=153
CPU time spent (ms)=780
Physical memory (bytes) snapshot=274964480
Virtual memory (bytes) snapshot=5519527936
Total committed heap usage (bytes)=137498624
Peak Map Physical memory (bytes)=182628352
Peak Map Virtual memory (bytes)=2756386816
Peak Reduce Physical memory (bytes)=92336128
Peak Reduce Virtual memory (bytes)=2763141120
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=51
File Output Format Counters
Bytes Written=73
[ssn@localhost hadoop]$ hdfs dfs -ls /output1
Found 2 items
-rw-r--r-- 1 ssn supergroup 0 2021-12-16 19:28 /output1/_SUCCESS
-rw-r--r-- 1 ssn supergroup 73 2021-12-16 19:28 /output1/part-r-00000
[ssn@localhost hadoop]$ hdfs dfs -cat /output1/part-r-00000
13 1
18lianwu 1
19zx 1
Eason 1
a 1
b 1
c 1
d 1
e 1
ssn 1
www.baidu.com 1
[ssn@localhost hadoop]$
Case analysis:
[ssn@localhost hadoop]$ hdfs dfs -cat /test/1.log
Eason
ssn
13
19zx
18lianwu
a b c d e
www.baidu.com
[ssn@localhost hadoop]$
Step 1 (map): split each line into words on spaces, and give every word an initial count of 1
(Eason,1)
(ssn,1)
(13,1)
(19zx,1)
(18lianwu,1)
(a,1)(b,1)(c,1)(d,1)(e,1)
...
Step 2 (reduce): group by word and count how many times each word appears.
If a had appeared twice:
a:1+1 ==> a 2
......
Translated into SQL: select word, sum(value) from t group by word;
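The same group-and-count logic can also be sketched as an ordinary shell pipeline over this tiny input (just to illustrate the idea; this is not how MapReduce executes):
[ssn@localhost hadoop]$ hdfs dfs -cat /test/1.log | tr -s ' ' '\n' | sort | uniq -c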