1. Environment Versions
Linux version
CentOS Linux release 7.6.1810 (Core)
Kernel version
3.10.0-957.el7.x86_64
Hadoop version
2. Hadoop Configuration
All of these settings can be found on the official Hadoop site; below is a brief description of my own configuration.
core-site.xml
The core configuration file; it mainly determines whether the cluster runs in distributed mode or locally on a single machine.
Explanation of the settings:
fs.defaultFS
The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.
Indicates that we use the distributed file system (HDFS) and determines which machine the NameNode runs on.
hadoop.tmp.dir
A base for other temporary directories.
Hadoop's temporary directory.
hadoop.proxyuser.root.hosts
hadoop.proxyuser.$superuser.hosts: the hosts from which this superuser is allowed to make proxy (impersonated) requests
hadoop.proxyuser.$superuser.groups: the groups that users impersonated by this superuser must belong to
hadoop.proxyuser.$superuser.users: the users this superuser is allowed to impersonate
A detailed introduction to proxyuser:
https://www.jianshu.com/p/a27bc8651533
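As a quick sketch (the wildcard values below are illustrative assumptions, not values taken from this cluster), a proxyuser block for the root user in core-site.xml could look like this:
<!-- hypothetical example: allow root to impersonate users from any host and any group -->
<property>
  <name>hadoop.proxyuser.root.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.root.groups</name>
  <value>*</value>
</property>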
Some other important parameters:
io.file.buffer.size
The size of buffer for use in sequence files. The size of this buffer should probably be a multiple of hardware page size (4096 on Intel x86), and it determines how much data is buffered during read and write operations.
Buffer size; in practice, tune it according to server performance.
fs.trash.interval
Number of minutes after which the checkpoint gets deleted. If zero, the trash feature is disabled. This option may be configured both on the server and the client. If trash is disabled server side then the client side configuration is checked. If trash is enabled on the server side then the value configured on the server is used and the client configuration value is ignored.
Enables the HDFS trash mechanism so that deleted data can be recovered from the trash; the value is in minutes.
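Putting the parameters above together, a minimal core-site.xml could look like the sketch below; the hostname node01, port 8020, the tmp path and the trash interval are illustrative assumptions rather than this cluster's actual values.
<configuration>
  <!-- which machine the NameNode runs on; the hdfs:// scheme selects the distributed file system -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://node01:8020</value>
  </property>
  <!-- base for other temporary directories -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/export/servers/hadoop/tmp</value>
  </property>
  <!-- buffer size for sequence files; tune to server performance -->
  <property>
    <name>io.file.buffer.size</name>
    <value>4096</value>
  </property>
  <!-- keep deleted files in the trash for 7 days (value is in minutes) -->
  <property>
    <name>fs.trash.interval</name>
    <value>10080</value>
  </property>
</configuration>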
hdfs-site.xml
The core configuration of the distributed file system.
It determines where data is stored on disk, the replication factor, and the block size.
dfs.replication
Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.
The number of replicas for each file.
dfs.namenode.secondary.http-address
The secondary namenode http server address and port.
Defines the HTTP address of the SecondaryNameNode, which assists the NameNode in managing metadata.
dfs.namenode.http-address
The address and the base port where the dfs namenode web ui will listen on.
Defines the address and port used to access the HDFS web UI from a browser.
dfs.namenode.name.dir
Determines where on the local filesystem the DFS name node should store the name table(fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.
Defines where the metadata fsimage is stored.
dfs.datanode.data.dir
Determines where on the local filesystem an DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. The directories should be tagged with corresponding storage types ([SSD]/[DISK]/[ARCHIVE]/[RAM_DISK]) for HDFS storage policies. The default storage type will be DISK if the directory does not have a storage type tagged explicitly. Directories that do not exist will be created if local filesystem permission allows.
Defines where the DataNode stores its data blocks.
dfs.namenode.edits.dir
Determines where on the local filesystem the DFS name node should store the transaction (edits) file. If this is a comma-delimited list of directories then the transaction file is replicated in all of the directories, for redundancy. Default value is same as dfs.namenode.name.dir
Path where the edits files are stored; they hold the most recent metadata changes.
dfs.namenode.checkpoint.dir
Determines where on the local filesystem the DFS secondary name node should store the temporary images to merge. If this is a comma-delimited list of directories then the image is replicated in all of the directories for redundancy.
Location where metadata checkpoints are stored.
dfs.permissions.enabled
If "true", enable permission checking in HDFS. If "false", permission checking is turned off, but all other behavior is unchanged. Switching from one parameter value to the other does not change the mode, owner or group of files or directories.
Whether HDFS permission checking is enabled.
dfs.blocksize
The default block size for new files, in bytes. You can use the following suffix (case insensitive): k(kilo), m(mega), g(giga), t(tera), p(peta), e(exa) to specify the size (such as 128k, 512m, 1g, etc.), Or provide complete size in bytes (such as 134217728 for 128 MB).
Block size for files.
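A minimal hdfs-site.xml covering the properties above might look like this; all hostnames, ports and local paths are illustrative assumptions (the web UI ports shown are the Hadoop 2.x defaults; they are 9870/9868 in 3.x).
<configuration>
  <!-- HDFS NameNode web UI address -->
  <property>
    <name>dfs.namenode.http-address</name>
    <value>node01:50070</value>
  </property>
  <!-- SecondaryNameNode address -->
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>node02:50090</value>
  </property>
  <!-- where the fsimage metadata is stored -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///export/servers/hadoop/namenode</value>
  </property>
  <!-- where the DataNode stores its blocks -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///export/servers/hadoop/datanode</value>
  </property>
  <!-- number of replicas per block -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <!-- 128 MB block size -->
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
  <!-- disable permission checking (convenient on test clusters only) -->
  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>
</configuration>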
hadoop-env.sh
Only the JAVA_HOME variable needs to be changed:
vi hadoop-env.sh
export JAVA_HOME=/usr/local/jdk1.8.0_211
mapred-site.xml
Defines parameters for running MapReduce.
mapreduce.framework.name
The runtime framework for executing MapReduce jobs. Can be one of local, classic or yarn.
Specifies the framework MapReduce runs on (here, yarn).
mapreduce.job.ubertask.enable
Whether to enable the small-jobs "ubertask" optimization, which runs "sufficiently small" jobs sequentially within a single JVM. "Small" is defined by the following maxmaps, maxreduces, and maxbytes settings. Note that configurations for application masters also affect the "Small" definition - yarn.app.mapreduce.am.resource.mb must be larger than both mapreduce.map.memory.mb and mapreduce.reduce.memory.mb, and yarn.app.mapreduce.am.resource.cpu-vcores must be larger than both mapreduce.map.cpu.vcores and mapreduce.reduce.cpu.vcores to enable ubertask. Users may override this value.
Whether MapReduce's small-job (uber) optimization is enabled.
mapreduce.jobhistory.address
MapReduce JobHistory Server IPC host:port
Defines the JobHistory server's communication (IPC) address.
Used to view information about completed jobs.
mapreduce.jobhistory.webapp.address
MapReduce JobHistory Server Web UI host:port
Address of the JobHistory web UI, viewed in a browser.
mapreduce.application.classpath
CLASSPATH for MR applications. A comma-separated list of CLASSPATH entries. If mapreduce.application.framework is set then this must specify the appropriate classpath for that archive, and the name of the archive must be present in the classpath. If mapreduce.app-submission.cross-platform is false, platform-specific environment variable expansion syntax would be used to construct the default CLASSPATH entries. For Linux: $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*, $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*. For Windows: %HADOOP_MAPRED_HOME%/share/hadoop/mapreduce/*, %HADOOP_MAPRED_HOME%/share/hadoop/mapreduce/lib/*. If mapreduce.app-submission.cross-platform is true, platform-agnostic default CLASSPATH for MR applications would be used: {{HADOOP_MAPRED_HOME}}/share/hadoop/mapreduce/*, {{HADOOP_MAPRED_HOME}}/share/hadoop/mapreduce/lib/* Parameter expansion marker will be replaced by NodeManager on container launch based on the underlying OS accordingly.
This property needs to be set explicitly on Hadoop 3.x.x.
In addition, for Hadoop 3.x you mainly need to set the JAVA_HOME, HDFS_NAMENODE_USER, HDFS_DATANODE_USER, HDFS_SECONDARYNAMENODE_USER, YARN_RESOURCEMANAGER_USER and YARN_NODEMANAGER_USER variables.
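For reference, a mapred-site.xml sketch built from the properties above; node01 is an assumed hostname, while 10020 and 19888 are the usual JobHistory defaults.
<configuration>
  <!-- run MapReduce on YARN -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <!-- enable the small-job (uber) optimization -->
  <property>
    <name>mapreduce.job.ubertask.enable</name>
    <value>true</value>
  </property>
  <!-- JobHistory server IPC address -->
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>node01:10020</value>
  </property>
  <!-- JobHistory web UI address -->
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>node01:19888</value>
  </property>
  <!-- needed on Hadoop 3.x so MapReduce jobs can locate their jars -->
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>
</configuration>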
yarn-site.xml
YARN resource-scheduling configuration.
yarn.resourcemanager.hostname
The hostname of the RM.
Defines which machine the ResourceManager runs on.
yarn.nodemanager.aux-services
A comma separated list of services where service name should only contain a-zA-Z0-9_ and can not start with numbers
Usually set to mapreduce_shuffle so that MapReduce jobs can shuffle data on YARN.
yarn.log-aggregation-enable
Whether to enable log aggregation. Log aggregation collects each container's logs and moves these logs onto a file-system, for e.g. HDFS, after the application completes. Users can configure the "yarn.nodemanager.remote-app-log-dir" and "yarn.nodemanager.remote-app-log-dir-suffix" properties to determine where these logs are moved to. Users can access the logs via the Application Timeline Server.
Enables log aggregation so that run logs can be viewed from the JobHistory web UI.
yarn.log-aggregation.retain-seconds
How long to keep aggregation logs before deleting them. -1 disables. Be careful set this too small and you will spam the name node.
yarn.nodemanager.env-whitelist
Environment variables that containers may override rather than use NodeManager's default.
Whitelist of environment variables that YARN containers may inherit or override.
yarn.application.classpath
CLASSPATH for YARN applications. A comma-separated list of CLASSPATH entries. When this value is empty, the following default CLASSPATH for YARN applications would be used. For Linux: $HADOOP_CONF_DIR, $HADOOP_COMMON_HOME/share/hadoop/common/*, $HADOOP_COMMON_HOME/share/hadoop/common/lib/*, $HADOOP_HDFS_HOME/share/hadoop/hdfs/*, $HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*, $HADOOP_YARN_HOME/share/hadoop/yarn/*, $HADOOP_YARN_HOME/share/hadoop/yarn/lib/* For Windows: %HADOOP_CONF_DIR%, %HADOOP_COMMON_HOME%/share/hadoop/common/*, %HADOOP_COMMON_HOME%/share/hadoop/common/lib/*, %HADOOP_HDFS_HOME%/share/hadoop/hdfs/*, %HADOOP_HDFS_HOME%/share/hadoop/hdfs/lib/*, %HADOOP_YARN_HOME%/share/hadoop/yarn/*, %HADOOP_YARN_HOME%/share/hadoop/yarn/lib/*
Classpath configuration for YARN applications.
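A matching yarn-site.xml sketch; node01 and the 7-day retention value are illustrative assumptions, and the env-whitelist value is the one suggested in the Hadoop 3.x documentation.
<configuration>
  <!-- machine running the ResourceManager -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>node01</value>
  </property>
  <!-- auxiliary shuffle service required by MapReduce -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- collect container logs onto HDFS after an application finishes -->
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <!-- keep aggregated logs for 7 days -->
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
  </property>
  <!-- environment variables containers may inherit -->
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>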
slaves
Defines which machines are the worker (slave) nodes; in Hadoop 3.x this file is named workers instead.
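For example, with three hypothetical worker hosts the file would simply list one hostname per line:
node01
node02
node03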
3. Initializing and Starting the Cluster
To start a Hadoop cluster, both the HDFS and YARN modules must be started.
Note: the first time HDFS is started it must be formatted. This is essentially cleanup and preparation work, because at that point HDFS does not yet physically exist.
Format command (either of the two forms below; the second is the older one):
hdfs namenode -format
hadoop namenode -format
Start the cluster:
sbin/start-dfs.sh
sbin/start-yarn.sh
sbin/mr-jobhistory-daemon.sh start historyserver
After startup, jps lists the running daemons (screenshot omitted).
Web UI screenshots (omitted): the NodeManager page, the Hadoop (HDFS NameNode) page, and the job/application page.