Hadoop Pseudo-Distributed Setup Exercise

1. Environment Versions

Linux version

CentOS Linux release 7.6.1810 (Core) 

Kernel version

3.10.0-957.el7.x86_64

Hadoop version

2. Hadoop Configuration

All of these settings can be found in the official Hadoop documentation; what follows is a brief walkthrough of my own configuration.

core-site.xml configuration

The core configuration file; it mainly determines whether the cluster runs in distributed mode or locally on a single machine.

Property descriptions:

fs.defaultFS

The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.

Specifies that we are using the distributed file system, and determines which machine the NameNode runs on.

hadoop.tmp.dir

A base for other temporary directories.

The base directory for Hadoop's temporary files.

hadoop.proxyuser.root.hosts

hadoop.proxyuser.$superuser.hosts: the hosts from which the superuser is allowed to act as a proxy
hadoop.proxyuser.$superuser.groups: the groups of users the superuser is allowed to impersonate
hadoop.proxyuser.$superuser.users: the users the superuser is allowed to impersonate

A detailed introduction to proxyuser:

https://www.jianshu.com/p/a27bc8651533


Some other important parameters:

io.file.buffer.size

The size of buffer for use in sequence files. The size of this buffer should probably be a multiple of hardware page size (4096 on Intel x86), and it determines how much data is buffered during read and write operations.

The buffer size; in practice, tune it to the server's capabilities.

fs.trash.interval

Number of minutes after which the checkpoint gets deleted. If zero, the trash feature is disabled. This option may be configured both on the server and the client. If trash is disabled server side then the client side configuration is checked. If trash is enabled on the server side then the value configured on the server is used and the client configuration value is ignored.

Enables the HDFS trash mechanism, so deleted data can be recovered from the trash; the value is in minutes.
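Putting these together, a minimal core-site.xml sketch for a pseudo-distributed setup might look like the following. The hostname node01, the port 8020, and the /export/servers paths are illustrative placeholders, not values recorded in this exercise.

<configuration>
  <!-- NameNode address; determines which machine the NameNode runs on -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://node01:8020</value>
  </property>
  <!-- Base directory for Hadoop's temporary files -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/export/servers/hadoop/tmp</value>
  </property>
  <!-- Let the root superuser proxy from any host and impersonate any group -->
  <property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>*</value>
  </property>
  <!-- I/O buffer size in bytes; tune to the hardware -->
  <property>
    <name>io.file.buffer.size</name>
    <value>4096</value>
  </property>
  <!-- Keep deleted files in the trash for 7 days (10080 minutes) -->
  <property>
    <name>fs.trash.interval</name>
    <value>10080</value>
  </property>
</configuration>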

hdfs-site.xml

The core configuration of the distributed file system.

It determines the paths where data is stored, the number of replicas, and the block size.

dfs.replication

Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.

The number of replicas per file.

dfs.namenode.secondary.http-address

The secondary namenode http server address and port.

Defines the SecondaryNameNode's communication address; the SecondaryNameNode assists the NameNode in managing metadata.

dfs.namenode.http-address

The address and the base port where the dfs namenode web ui will listen on.

Defines the address and port we use to access HDFS from a browser.

dfs.namenode.name.dir

Determines where on the local filesystem the DFS name node should store the name table(fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.

Defines the storage path for the fsimage metadata.

dfs.datanode.data.dir

Determines where on the local filesystem an DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. The directories should be tagged with corresponding storage types ([SSD]/[DISK]/[ARCHIVE]/[RAM_DISK]) for HDFS storage policies. The default storage type will be DISK if the directory does not have a storage type tagged explicitly. Directories that do not exist will be created if local filesystem permission allows.

Defines the DataNode's storage path(s).

dfs.namenode.edits.dir

Determines where on the local filesystem the DFS name node should store the transaction (edits) file. If this is a comma-delimited list of directories then the transaction file is replicated in all of the directories, for redundancy. Default value is same as dfs.namenode.name.dir

The storage path for the edits files, which record the metadata edits from the most recent period.

dfs.namenode.checkpoint.dir

Determines where on the local filesystem the DFS secondary name node should store the temporary images to merge. If this is a comma-delimited list of directories then the image is replicated in all of the directories for redundancy.

Where metadata checkpoints are stored.

dfs.permissions.enabled

If "true", enable permission checking in HDFS. If "false", permission checking is turned off, but all other behavior is unchanged. Switching from one parameter value to the other does not change the mode, owner or group of files or directories.

Whether HDFS permission checking is enabled.

dfs.blocksize

The default block size for new files, in bytes. You can use the following suffix (case insensitive): k(kilo), m(mega), g(giga), t(tera), p(peta), e(exa) to specify the size (such as 128k, 512m, 1g, etc.), Or provide complete size in bytes (such as 134217728 for 128 MB).

The size of a file's blocks.
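Likewise, a sketch of an hdfs-site.xml combining the properties above; the hostname, ports, and /export/servers paths are again assumptions for illustration, and the values shown (single replica, permissions off) are typical for a practice cluster rather than prescriptions. Note that the NameNode web UI defaults to port 50070 on Hadoop 2.x and 9870 on 3.x.

<configuration>
  <!-- A single replica is enough on a one-node, pseudo-distributed cluster -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <!-- SecondaryNameNode HTTP address -->
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>node01:50090</value>
  </property>
  <!-- NameNode web UI (50070 on Hadoop 2.x, 9870 on 3.x) -->
  <property>
    <name>dfs.namenode.http-address</name>
    <value>node01:50070</value>
  </property>
  <!-- Where the fsimage metadata is stored -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///export/servers/hadoop/datas/namenode</value>
  </property>
  <!-- Where DataNode blocks are stored -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///export/servers/hadoop/datas/datanode</value>
  </property>
  <!-- Where the edits log is stored -->
  <property>
    <name>dfs.namenode.edits.dir</name>
    <value>file:///export/servers/hadoop/datas/edits</value>
  </property>
  <!-- Where checkpoints are stored -->
  <property>
    <name>dfs.namenode.checkpoint.dir</name>
    <value>file:///export/servers/hadoop/datas/checkpoint</value>
  </property>
  <!-- Disable permission checking for this practice setup -->
  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>
  <!-- 128 MB block size -->
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
</configuration>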

hadoop-env.sh

Only the JAVA_HOME variable needs to be changed:

vi hadoop-env.sh

# Point Hadoop at the locally installed JDK
export JAVA_HOME=/usr/local/jdk1.8.0_211

mapred-site.xml

Defines the parameters MapReduce runs with.

mapreduce.framework.name

The runtime framework for executing MapReduce jobs. Can be one of local, classic or yarn.

Specifies the framework on which MapReduce runs.

mapreduce.job.ubertask.enable

Whether to enable the small-jobs "ubertask" optimization, which runs "sufficiently small" jobs sequentially within a single JVM. "Small" is defined by the following maxmaps, maxreduces, and maxbytes settings. Note that configurations for application masters also affect the "Small" definition - yarn.app.mapreduce.am.resource.mb must be larger than both mapreduce.map.memory.mb and mapreduce.reduce.memory.mb, and yarn.app.mapreduce.am.resource.cpu-vcores must be larger than both mapreduce.map.cpu.vcores and mapreduce.reduce.cpu.vcores to enable ubertask. Users may override this value.

Whether MapReduce's small-job (uber) mode is enabled.

mapreduce.jobhistory.address

MapReduce JobHistory Server IPC host:port

Defines the JobHistory server's communication address, used to look up information about completed jobs.

mapreduce.jobhistory.webapp.address

MapReduce JobHistory Server Web UI host:port

The address for viewing job history from a browser.

mapreduce.application.classpath

CLASSPATH for MR applications. A comma-separated list of CLASSPATH entries. If mapreduce.application.framework is set then this must specify the appropriate classpath for that archive, and the name of the archive must be present in the classpath. If mapreduce.app-submission.cross-platform is false, platform-specific environment variable expansion syntax would be used to construct the default CLASSPATH entries. For Linux: $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*, $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*. For Windows: %HADOOP_MAPRED_HOME%/share/hadoop/mapreduce/*, %HADOOP_MAPRED_HOME%/share/hadoop/mapreduce/lib/*. If mapreduce.app-submission.cross-platform is true, platform-agnostic default CLASSPATH for MR applications would be used: {{HADOOP_MAPRED_HOME}}/share/hadoop/mapreduce/*, {{HADOOP_MAPRED_HOME}}/share/hadoop/mapreduce/lib/* Parameter expansion marker will be replaced by NodeManager on container launch based on the underlying OS accordingly.

A setting that needs to be added on Hadoop 3.x.x.
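A corresponding mapred-site.xml sketch; node01 and the port numbers are placeholders (10020 and 19888 are the usual JobHistory defaults), and the classpath value assumes HADOOP_MAPRED_HOME is set in the environment.

<configuration>
  <!-- Run MapReduce on YARN instead of locally -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <!-- Enable the small-job (uber) optimization -->
  <property>
    <name>mapreduce.job.ubertask.enable</name>
    <value>true</value>
  </property>
  <!-- JobHistory server IPC and web addresses -->
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>node01:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>node01:19888</value>
  </property>
  <!-- Needed on Hadoop 3.x so MR jobs can find the framework classes -->
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>
</configuration>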

Mainly, this means setting the JAVA_HOME, HDFS_NAMENODE_USER, HDFS_DATANODE_USER, HDFS_SECONDARYNAMENODE_USER, YARN_RESOURCEMANAGER_USER, and YARN_NODEMANAGER_USER variables, as in the sketch below.
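For example, appended to hadoop-env.sh (a sketch; running every daemon as root is simply what this exercise assumes, not a recommendation):

export JAVA_HOME=/usr/local/jdk1.8.0_211
# Hadoop 3.x refuses to start daemons as root unless these users are declared
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root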

yarn-site.xml

YARN's resource-scheduling configuration.

yarn.resourcemanager.hostname

The hostname of the RM.

Defines the machine on which the ResourceManager runs.

yarn.nodemanager.aux-services

A comma separated list of services where service name should only contain a-zA-Z0-9_ and can not start with numbers

For MapReduce on YARN this is set to mapreduce_shuffle, which enables the shuffle auxiliary service on each NodeManager.

yarn.log-aggregation-enable

Whether to enable log aggregation. Log aggregation collects each container's logs and moves these logs onto a file-system, for e.g. HDFS, after the application completes. Users can configure the "yarn.nodemanager.remote-app-log-dir" and "yarn.nodemanager.remote-app-log-dir-suffix" properties to determine where these logs are moved to. Users can access the logs via the Application Timeline Server.

Enables log aggregation, which lets us view run logs from the JobHistory web UI.

yarn.log-aggregation.retain-seconds

How long to keep aggregation logs before deleting them. -1 disables. Be careful set this too small and you will spam the name node.

yarn.nodemanager.env-whitelist

Environment variables that containers may override rather than use NodeManager's default.

The whitelist of environment variables that YARN containers may override.

yarn.application.classpath

CLASSPATH for YARN applications. A comma-separated list of CLASSPATH entries. When this value is empty, the following default CLASSPATH for YARN applications would be used. For Linux: $HADOOP_CONF_DIR, $HADOOP_COMMON_HOME/share/hadoop/common/*, $HADOOP_COMMON_HOME/share/hadoop/common/lib/*, $HADOOP_HDFS_HOME/share/hadoop/hdfs/*, $HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*, $HADOOP_YARN_HOME/share/hadoop/yarn/*, $HADOOP_YARN_HOME/share/hadoop/yarn/lib/* For Windows: %HADOOP_CONF_DIR%, %HADOOP_COMMON_HOME%/share/hadoop/common/*, %HADOOP_COMMON_HOME%/share/hadoop/common/lib/*, %HADOOP_HDFS_HOME%/share/hadoop/hdfs/*, %HADOOP_HDFS_HOME%/share/hadoop/hdfs/lib/*, %HADOOP_YARN_HOME%/share/hadoop/yarn/*, %HADOOP_YARN_HOME%/share/hadoop/yarn/lib/*

The CLASSPATH configuration for YARN applications.
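A yarn-site.xml sketch along the same lines; node01 is a placeholder, and the env-whitelist value mirrors the one used in the Apache single-node guide.

<configuration>
  <!-- The machine that runs the ResourceManager -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>node01</value>
  </property>
  <!-- The shuffle auxiliary service required by MapReduce -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- Aggregate container logs and keep them for 7 days (604800 seconds) -->
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
  </property>
  <!-- Environment variables containers may inherit from the NodeManager -->
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>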

slaves

Defines which machines the worker nodes are. (In Hadoop 3.x this file is named workers instead.)
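For a pseudo-distributed cluster the file typically contains a single line:

localhost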

3. Initializing and Starting the Cluster

To start the Hadoop cluster, two modules must be started: HDFS and YARN.

Note: the first time HDFS is started, it must be formatted. This is essentially cleanup and preparation work, because at that point HDFS does not yet physically exist.

Format command (either of the following works; the second is the older, deprecated form):

hdfs namenode -format

hadoop namenode -format

Start the cluster:

sbin/start-dfs.sh

sbin/start-yarn.sh

sbin/mr-jobhistory-daemon.sh start historyserver
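On Hadoop 3.x, where the per-daemon scripts are deprecated, the equivalent is:

bin/mapred --daemon start historyserver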

After startup, jps should list processes along the lines of NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager, and JobHistoryServer.

[Screenshot: jps output after startup]

Web UI:

[Screenshot: NodeManager page]

[Screenshot: Hadoop (HDFS NameNode) web page]

[Screenshot: job run page]
