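Disk space shortage caused by the HDFS data directory
- First, look at how the HDFS data directory is configured:
volumes:
  - hadoop_datanode1:/hadoop/dfs/data
- The hadoop_datanode1 volume above is declared at the very bottom of docker-compose.yml with default settings, as follows:
volumes:
  hadoop_namenode:
  hadoop_datanode1:
  hadoop_datanode2:
  hadoop_datanode3:
  hadoop_historyserver:
- With the container running, run docker inspect datanode1 to view the container details. The volume-related part looks like this: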
"Mounts": [
    {
        "Type": "volume",
        "Name": "temp_hadoop_datanode1",
        "Source": "/var/lib/docker/volumes/temp_hadoop_datanode1/_data",
        "Destination": "/hadoop/dfs/data",
        "Driver": "local",
        "Mode": "rw",
        "RW": true,
        "Propagation": ""
    }
]
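So the data directory of the HDFS container maps to a path under /var/lib/docker/volumes on the host.
- Check disk usage with df -m. As shown below, the device /dev/nvme0n1p3, which holds /var/lib/docker/volumes, has just under 30 GB available (29561 MB). That is clearly not enough for storing a large number of files, especially since HDFS keeps 3 replicas of each block by default: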
root@willzhao-deepin:/data/work/spark/temp# df -m
Filesystem     1M-blocks   Used Available Use% Mounted on
udev                7893      0      7893   0% /dev
tmpfs               1584      4      1581   1% /run
/dev/nvme0n1p3     43927  12107     29561  30% /
tmpfs               7918      0      7918   0% /dev/shm
tmpfs                  5      1         5   1% /run/lock
tmpfs               7918      0      7918   0% /sys/fs/cgroup
/dev/nvme0n1p4     87854    181     83169   1% /home
/dev/nvme0n1p1       300      7       293   3% /boot/efi
/dev/sda1         468428 109152    335430  25% /data
tmpfs               1584      1      1584   1% /run/user/108
tmpfs               1584      0      1584   0% /run/user/0
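To make the replication arithmetic concrete, here is a minimal shell sketch. The function names available_mb and usable_hdfs_mb are made up for this illustration, and the estimate assumes that all replicas end up on the same host filesystem, which is the case here because the three datanodes run on one machine.

```shell
# Pull the "Available" column for one mount point out of `df -m` output read from stdin.
# $1: mount point, for example / or /data
available_mb() {
    # Data rows have six fields; the mount point is the last one and
    # "Available" sits two fields before it.
    awk -v mp="$1" '$NF == mp { print $(NF-2) }'
}

# Rough estimate of how much unique data HDFS can hold (integer MB).
# $1: available space in MB
# $2: replication factor (3 is the HDFS default)
usable_hdfs_mb() {
    echo $(( $1 / $2 ))
}
```

For example, usable_hdfs_mb "$(df -m | available_mb /)" 3 divides whatever is free on the root filesystem by three. With the 29561 MB shown above, that leaves a little under 10 GB for unique data.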
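- The df output above shows that /dev/sda1 still has more than 300 GB available, so mapping the HDFS data directories onto /dev/sda1 eases the disk space problem. The three HDFS datanodes in docker-compose.yml are therefore changed to use directories relative to the compose file, which sits in /data/work/spark/temp (see the prompt above) and so lives on /dev/sda1. After the change they look like this: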
  datanode1:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
    container_name: datanode1
    depends_on:
      - namenode
    volumes:
      - ./hadoop/datanode1:/hadoop/dfs/data
    env_file:
      - ./hadoop.env
  datanode2:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
    container_name: datanode2
    depends_on:
      - namenode
    volumes:
      - ./hadoop/datanode2:/hadoop/dfs/data
    env_file:
      - ./hadoop.env
  datanode3:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
    container_name: datanode3
    depends_on:
      - namenode
    volumes:
      - ./hadoop/datanode3:/hadoop/dfs/data
    env_file:
      - ./hadoop.env
Then delete the following block, since the named volumes are no longer used:
volumes:
  hadoop_namenode:
  hadoop_datanode1:
  hadoop_datanode2:
  hadoop_datanode3:
  hadoop_historyserver:
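Before starting the containers, it can be worth confirming that the new directories really sit on the large disk. The following is only a sketch: mount_point_of is a hypothetical helper name, and it assumes mount points without spaces.

```shell
# Print the mount point of the filesystem that holds the given path.
# `df -P` selects the POSIX output format, which keeps each filesystem
# on a single line, so the mount point is the sixth field of line 2.
mount_point_of() {
    df -P "$1" | awk 'NR == 2 { print $6 }'
}
```

Run from the directory containing docker-compose.yml, something like mkdir -p hadoop/datanode1 hadoop/datanode2 hadoop/datanode3 followed by mount_point_of ./hadoop should report the mount point backed by /dev/sda1 on the machine described above.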
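Opening port 4040 on the master and the web UI ports on the workers
- While a job is running, a UI page showing its details gives a fuller and more intuitive view of what is happening, so the configuration needs to be changed to open the relevant ports.
- As shown below, 4040 is added to expose, which makes port 4040 reachable by linked services, and 4040:4040 is added to ports, which maps the container's port 4040 to port 4040 on the host: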
  master:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: master
    command: bin/spark-class org.apache.spark.deploy.master.Master -h master
    hostname: master
    environment:
      MASTER: spark://master:7077
      SPARK_CONF_DIR: /conf
      SPARK_PUBLIC_DNS: localhost
    links:
      - namenode
    expose:
      - 4040
      - 7001
      - 7002
      - 7003
      - 7004
      - 7005
      - 7077
      - 6066
    ports:
      - 4040:4040
      - 6066:6066
      - 7077:7077
      - 8080:8080
    volumes:
      - ./conf/master:/conf
      - ./data:/tmp/data
      - ./jars:/root/jars
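- The workers' web ports need to be opened as well. A worker's web page shows the worker's status and, more importantly, the task logs. Because there are several workers, each one has to be mapped to a different host port. In the configuration below, worker1 sets environment.SPARK_WORKER_WEBUI_PORT to 8081, exposes 8081, and maps the container's 8081 to the host's 8081; worker2 sets environment.SPARK_WORKER_WEBUI_PORT to 8082, exposes 8082, and maps the container's 8082 to the host's 8082: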
  worker1:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker1
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker1
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8081
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8081
    ports:
      - 8081:8081
    volumes:
      - ./conf/worker1:/conf
      - ./data/worker1:/tmp/data
  worker2:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker2
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker2
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8082
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8082
    ports:
      - 8082:8082
    volumes:
      - ./conf/worker2:/conf
      - ./data/worker2:/tmp/data
worker3 through worker6 are configured the same way; just give each one a different port number.
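Since the worker stanzas differ only in the worker number and the web UI port, they can be generated rather than copied by hand. This is a sketch, not part of the original setup: gen_worker is a hypothetical helper, and the rule "web UI port = 8080 + worker number" is inferred from worker1 (8081) and worker2 (8082).

```shell
# Print a compose service stanza for worker number $1.
# The web UI port is derived as 8080 + N (assumed naming rule).
gen_worker() {
    n=$1
    port=$((8080 + n))
    cat <<EOF
  worker${n}:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker${n}
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker${n}
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: ${port}
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - ${port}
    ports:
      - ${port}:${port}
    volumes:
      - ./conf/worker${n}:/conf
      - ./data/worker${n}:/tmp/data
EOF
}
```

A loop such as for n in 3 4 5 6; do gen_worker "$n"; done prints the four remaining stanzas, which can then be pasted under services in docker-compose.yml.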
That completes the changes. The final docker-compose.yml is as follows:
version: "2.2"
services:
  namenode:
    image: bde2020/hadoop-namenode:1.1.0-hadoop2.7.1-java8
    container_name: namenode
    volumes:
      - ./hadoop/namenode:/hadoop/dfs/name
      - ./input_files:/input_files
    environment:
      - CLUSTER_NAME=test
    env_file:
      - ./hadoop.env
    ports:
      - 50070:50070
  resourcemanager:
    image: bde2020/hadoop-resourcemanager:1.1.0-hadoop2.7.1-java8
    container_name: resourcemanager
    depends_on:
      - namenode
      - datanode1
      - datanode2
    env_file:
      - ./hadoop.env
  historyserver:
    image: bde2020/hadoop-historyserver:1.1.0-hadoop2.7.1-java8
    container_name: historyserver
    depends_on:
      - namenode
      - datanode1
      - datanode2
    volumes:
      - ./hadoop/historyserver:/hadoop/yarn/timeline
    env_file:
      - ./hadoop.env
  nodemanager1:
    image: bde2020/hadoop-nodemanager:1.1.0-hadoop2.7.1-java8
    container_name: nodemanager1
    depends_on:
      - namenode
      - datanode1
      - datanode2
    env_file:
      - ./hadoop.env
  datanode1:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
    container_name: datanode1
    depends_on:
      - namenode
    volumes:
      - ./hadoop/datanode1:/hadoop/dfs/data
    env_file:
      - ./hadoop.env
  datanode2:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
    container_name: datanode2
    depends_on:
      - namenode
    volumes:
      - ./hadoop/datanode2:/hadoop/dfs/data
    env_file:
      - ./hadoop.env
  datanode3:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
    container_name: datanode3
    depends_on:
      - namenode
    volumes:
      - ./hadoop/datanode3:/hadoop/dfs/data
    env_file:
      - ./hadoop.env
  master:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: master
    command: bin/spark-class org.apache.spark.deploy.master.Master -h master
    hostname: master
    environment:
      MASTER: spark://master:7077
      SPARK_CONF_DIR: /conf
      SPARK_PUBLIC_DNS: localhost
    links:
      - namenode
    expose:
      - 4040
      - 7001
      - 7002
      - 7003
      - 7004
      - 7005
      - 7077
      - 6066
    ports:
      - 4040:4040
      - 6066:6066
      - 7077:7077
      - 8080:8080
    volumes:
      - ./conf/master:/conf
      - ./data:/tmp/data
      - ./jars:/root/jars
  worker1:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker1
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker1
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8081
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8081
    ports:
      - 8081:8081
    volumes:
      - ./conf/worker1:/conf
      - ./data/worker1:/tmp/data
  worker2:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker2
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker2
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8082
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8082
    ports:
      - 8082:8082
    volumes:
      - ./conf/worker2:/conf
      - ./data/worker2:/tmp/data
  worker3:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker3
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker3
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8083
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8083
    ports:
      - 8083:8083
    volumes:
      - ./conf/worker3:/conf
      - ./data/worker3:/tmp/data
  worker4:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker4
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker4
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8084
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8084
    ports:
      - 8084:8084
    volumes:
      - ./conf/worker4:/conf
      - ./data/worker4:/tmp/data
  worker5:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker5
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker5
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8085
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8085
    ports:
      - 8085:8085
    volumes:
      - ./conf/worker5:/conf
      - ./data/worker5:/tmp/data
  worker6:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker6
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker6
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8086
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8086
    ports:
      - 8086:8086
    volumes:
      - ./conf/worker6:/conf
      - ./data/worker6:/tmp/data
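Next, let's run an example to verify the setup.
Verification
- In the directory containing docker-compose.yml, create a file named hadoop.env with the following content: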
CORE_CONF_fs_defaultFS=hdfs://namenode:8020
CORE_CONF_hadoop_http_staticuser_user=root
CORE_CONF_hadoop_proxyuser_hue_hosts=*
CORE_CONF_hadoop_proxyuser_hue_groups=*
HDFS_CONF_dfs_webhdfs_enabled=true
HDFS_CONF_dfs_permissions_enabled=false
YARN_CONF_yarn_log___aggregation___enable=true
YARN_CONF_yarn_resourcemanager_recovery_enabled=true
YARN_CONF_yarn_resourcemanager_store_class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore
YARN_CONF_yarn_resourcemanager_fs_state___store_uri=/rmstate
YARN_CONF_yarn_nodemanager_remote___app___log___dir=/app-logs
YARN_CONF_yarn_log_server_url=http://historyserver:8188/applicationhistory/logs/
YARN_CONF_yarn_timeline___service_enabled=true
YARN_CONF_yarn_timeline___service_generic___application___history_enabled=true
YARN_CONF_yarn_resourcemanager_system___metrics___publisher_enabled=true
YARN_CONF_yarn_resourcemanager_hostname=resourcemanager
YARN_CONF_yarn_timeline___service_hostname=historyserver
YARN_CONF_yarn_resourcemanager_address=resourcemanager:8032
YARN_CONF_yarn_resourcemanager_scheduler_address=resourcemanager:8030
YARN_CONF_yarn_resourcemanager_resource___tracker_address=resourcemanager:8031
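The variable names in hadoop.env follow a naming convention that, as far as I understand the entrypoint script of the bde2020 images, is turned into Hadoop property names like this: the prefix (CORE_CONF_, HDFS_CONF_, YARN_CONF_) selects the configuration file, a triple underscore becomes a hyphen, a double underscore becomes a single underscore, and a single underscore becomes a dot. The sketch below mimics that rule so a name can be checked before the containers start. env_to_property is a hypothetical helper, and the mapping itself is an assumption about the image, not something stated in this article.

```shell
# Convert one hadoop.env line into the Hadoop property name it is assumed to set.
# $1: a line such as YARN_CONF_yarn_log___aggregation___enable=true
env_to_property() {
    name=${1%%=*}          # drop "=value"
    name=${name#*_CONF_}   # drop the CORE_CONF_/HDFS_CONF_/YARN_CONF_ prefix
    # ___ -> "-", __ -> "_" (via a temporary "@"), _ -> "."
    printf '%s\n' "$name" |
        sed -e 's/___/-/g' -e 's/__/@/g' -e 's/_/./g' -e 's/@/_/g'
}

env_to_property 'CORE_CONF_fs_defaultFS=hdfs://namenode:8020'
# prints fs.defaultFS
env_to_property 'YARN_CONF_yarn_log___aggregation___enable=true'
# prints yarn.log-aggregation-enable
```

Read this way, the triple underscores in the file above are not typos: they are how hyphenated property names such as yarn.log-aggregation-enable are written in an environment variable.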
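- Once docker-compose.yml has been modified, start the containers with the following command:
docker-compose up -d
- The Spark application used for this verification analyzes Wikipedia page-view statistics and finds the most visited pages. This walkthrough uses a prebuilt jar and involves no coding; for the application's source code and development details, see 《spark实战之:分析维基百科网站统计数据(java版)》.
- Download the prebuilt Spark application jar from GitHub:
wget https://raw.githubusercontent.com/zq2599/blog_demos/master/files/sparkdemo-1.0-SNAPSHOT.jar
- Download the Wikipedia page-view statistics data set from GitHub. Only one file is downloaded here; to practice with more files, see 《寻找海量数据集用于大数据开发实战(维基百科网站统计数据)》:
wget https://raw.githubusercontent.com/zq2599/blog_demos/master/files/pagecounts-20160801-000000
- Put the downloaded sparkdemo-1.0-SNAPSHOT.jar in the jars directory under the directory containing docker-compose.yml.
- In the input_files directory under the directory containing docker-compose.yml, create an input directory, then put the downloaded pagecounts-20160801-000000 file in it.
- Run the following command to put the whole input directory into HDFS:
docker exec namenode hdfs dfs -put /input_files/input /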
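The two file-placement steps above can be wrapped in a small helper. This is a sketch under the directory layout described in this article; stage_inputs is a hypothetical name.

```shell
# Copy the jar and the data file into the directories that docker-compose.yml mounts.
# $1: directory containing docker-compose.yml
# $2: path to the downloaded jar
# $3: path to the downloaded pagecounts file
stage_inputs() {
    # ./jars is mounted into the master, ./input_files into the namenode.
    mkdir -p "$1/jars" "$1/input_files/input" || return 1
    cp "$2" "$1/jars/" || return 1
    cp "$3" "$1/input_files/input/"
}
```

For example, stage_inputs . sparkdemo-1.0-SNAPSHOT.jar pagecounts-20160801-000000 run from the compose directory prepares both files. After the hdfs dfs -put command, docker exec namenode hdfs dfs -ls /input can be used to confirm that the file arrived in HDFS.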