Setup JDK
tar -zxvf jdk-8u112-linux-x64.tar.gz
ln -s jdk1.8.0_112 jdk
vim ~/.bash_profile
export JAVA_HOME=/home/jdk
PATH=${JAVA_HOME}/bin:$PATH
export PATH
source ~/.bash_profile
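After reloading the profile, a quick sanity check confirms the JDK is visible (a minimal sketch; /home/jdk is the symlink created above, adjust if yours differs):

```shell
# Sanity-check the JDK install; /home/jdk follows this guide's layout.
JAVA_HOME=${JAVA_HOME:-/home/jdk}
if [ -x "$JAVA_HOME/bin/java" ]; then
  "$JAVA_HOME/bin/java" -version
  echo "JAVA_HOME OK: $JAVA_HOME"
else
  echo "JAVA_HOME invalid: $JAVA_HOME" >&2
fi
```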
Standalone Operation
By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.
The following example copies the unpacked conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory.
$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
$ cat output/*

1 dfsadmin
Setup passphraseless ssh
Now check that you can ssh to the localhost without a passphrase:
$ ssh localhost
If you cannot ssh to localhost without a passphrase, execute the following commands:
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
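A non-interactive probe makes this check scriptable; BatchMode forbids password prompts, so success here means key-based login really works:

```shell
# Prints "OK" only if ssh to localhost succeeds without any prompt.
if ssh -o BatchMode=yes -o ConnectTimeout=5 localhost true 2>/dev/null; then
  echo "passphraseless ssh: OK"
else
  echo "passphraseless ssh: NOT configured"
fi
```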
Pseudo-Distributed Operation
Hadoop can also be run on a single-node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process.
Configuration
etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/data/</value>
<description>A base for other temporary directories.</description>
</property>
</configuration>
etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
./etc/hadoop/hadoop-env.sh
export JAVA_HOME=/home/jdk
If JAVA_HOME is not set in hadoop-env.sh, Hadoop will not pick it up from the shell environment and will fail with an error; note that the value must be an absolute path.
start hdfs
$ bin/hdfs namenode -format
$ sbin/start-dfs.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/packages/hadoop-2.7.3/logs/hadoop-root-namenode-vsr264.out
localhost: starting datanode, logging to /home/packages/hadoop-2.7.3/logs/hadoop-root-datanode-vsr264.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /home/packages/hadoop-2.7.3/logs/hadoop-root-secondarynamenode-vsr264.out
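Whether the daemons actually came up can be verified with jps (shipped with the JDK); a minimal sketch:

```shell
# After start-dfs.sh, these three JVM processes should be listed by jps.
for proc in NameNode DataNode SecondaryNameNode; do
  if jps 2>/dev/null | grep -q "$proc"; then
    echo "$proc: running"
  else
    echo "$proc: NOT running"
  fi
done
```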
Browse the web interface for the NameNode; by default it is available at:
Note: the corresponding default ports change in Hadoop 3.x.
NameNode - http://localhost:50070/
Common errors
$ bin/hadoop fs -ls .
ls: `.': No such file or directory
This error means the user's home directory does not yet exist in HDFS; create it:
hadoop fs -mkdir -p /user/{user}
execute MapReduce jobs
Make the HDFS directories required to execute MapReduce jobs:
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/hello
Copy the input files into the distributed filesystem
bin/hdfs dfs -put etc/hadoop /input
Run some of the examples provided:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep /input/* output 'dfs[a-z.]+'
For Hadoop 3.2:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar grep /input/* output 'dfs[a-z.]+'
check result
# bin/hdfs dfs -cat output/*
or
# bin/hdfs dfs -cat hdfs://localhost:9000/user/root/output/*
6 dfs.audit.logger
4 dfs.class
3 dfs.server.namenode.
2 dfs.period
2 dfs.audit.log.maxfilesize
2 dfs.audit.log.maxbackupindex
1 dfsmetrics.log
1 dfsadmin
1 dfs.servers
1 dfs.replication
1 dfs.file
[root@vsr264 hadoop]# bin/hdfs dfs -ls hdfs://localhost:9000/user/root/output/*
-rw-r--r-- 1 root supergroup 0 2020-09-17 19:39 hdfs://localhost:9000/user/root/output/_SUCCESS
-rw-r--r-- 1 root supergroup 197 2020-09-17 19:39 hdfs://localhost:9000/user/root/output/part-r-00000
List the directories:
# hadoop fs -ls hdfs://localhost:9000/
Found 2 items
drwxr-xr-x - root supergroup 0 2023-05-17 17:26 hdfs://localhost:9000/input
drwxr-xr-x - root supergroup 0 2023-05-17 17:28 hdfs://localhost:9000/user
YARN on a Single Node
Configuration
etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Start ResourceManager daemon and NodeManager daemon
sbin/start-yarn.sh
Browse the web interface for the ResourceManager; by default it is available at:
ResourceManager - http://localhost:8088/
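The same jps check works for YARN; the ResourceManager additionally serves a REST API on port 8088, which gives a scriptable probe:

```shell
# After start-yarn.sh, both YARN daemons should appear in jps.
for proc in ResourceManager NodeManager; do
  if jps 2>/dev/null | grep -q "$proc"; then
    echo "$proc: running"
  else
    echo "$proc: NOT running"
  fi
done
# Scriptable alternative: the RM web services API returns cluster info as JSON.
# curl -s http://localhost:8088/ws/v1/cluster/info
```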
Reconfigure for running Spark
The following YARN settings must be configured, otherwise Spark on YARN will run into problems.
The configuration file is etc/hadoop/yarn-site.xml:
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>863304</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>96</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>776973</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<property>
<description>The maximum allocation for every container request at the RM
in terms of virtual CPU cores. Requests higher than this will throw an
InvalidResourceRequestException.</description>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>4</value>
</property>
Their defaults (from yarn-default.xml) are as follows:
<property>
<description>A comma separated list of services where service name should only
contain a-zA-Z0-9_ and can not start with numbers</description>
<name>yarn.nodemanager.aux-services</name>
<value></value>
<!--<value>mapreduce_shuffle</value>-->
</property>
<property>
<description>Amount of physical memory, in MB, that can be allocated
for containers. If set to -1 and
yarn.nodemanager.resource.detect-hardware-capabilities is true, it is
automatically calculated(in case of Windows and Linux).
In other cases, the default is 8192MB.
</description>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>-1</value>
</property>
<property>
<description>Number of vcores that can be allocated
for containers. This is used by the RM scheduler when allocating
resources for containers. This is not used to limit the number of
CPUs used by YARN containers. If it is set to -1 and
yarn.nodemanager.resource.detect-hardware-capabilities is true, it is
automatically determined from the hardware in case of Windows and Linux.
In other cases, number of vcores is 8 by default.</description>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>-1</value>
</property>
<property>
<description>The maximum allocation for every container request at the RM
in MBs. Memory requests higher than this will throw an
InvalidResourceRequestException.</description>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>8192</value>
</property>
<property>
<description>The maximum allocation for every container request at the RM
in terms of virtual CPU cores. Requests higher than this will throw an
InvalidResourceRequestException.</description>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>4</value>
</property>
<property>
<description>Whether virtual memory limits will be enforced for
containers.</description>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>true</value>
</property>
Spark verification example
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
examples/jars/spark-examples*.jar \
10
Hadoop 3.x default port changes
Namenode ports: 50470 --> 9871, 50070 --> 9870, 8020 --> 9820
Secondary NN ports: 50091 --> 9869, 50090 --> 9868
Datanode ports: 50020 --> 9867, 50010 --> 9866, 50475 --> 9865, 50075 --> 9864
quick setup
Setup passphraseless ssh
See the earlier section.
Replace the configuration files
grep "sr242" ./ -rl
grep "10.0.0.142" ./ -rl
grep "DP_disk" ./ -rl
sed -i "s/sr242/sr250/g" `grep "sr242" ./ -rl `
sed -i "s/10.0.0.142/10.0.0.150/g" `grep "10.0.0.142" ./ -rl `
grep "sr250" ./ -r
grep "10.0.0.150" ./ -r
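The grep/sed pair above can be wrapped in a small helper (hypothetical; the function name is mine) so every occurrence of an old hostname or IP is swapped in one pass:

```shell
# replace_host DIR OLD NEW -- rewrite OLD to NEW in every file under DIR
# that mentions it, printing each file it touches.
replace_host() {
  dir=$1; old=$2; new=$3
  grep -rl "$old" "$dir" 2>/dev/null | while read -r f; do
    sed -i "s/$old/$new/g" "$f"
    echo "updated: $f"
  done
}
# Example calls (hostnames/IPs from this guide):
#   replace_host . sr242 sr250
#   replace_host . 10.0.0.142 10.0.0.150
```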
The dfs.namenode.name.dir directory must be created in advance:
<property>
<name>dfs.namenode.name.dir</name>
<value>{nn}</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>{dn}</value>
</property>
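Creating those directories up front can be sketched as follows (in this guide they live under hadoop.tmp.dir, i.e. /opt/data; /tmp/hadoop-data is used here only so the sketch runs anywhere):

```shell
# Create the name/data directories before formatting the namenode.
# DATA_ROOT would be /opt/data in this guide's layout.
DATA_ROOT=${DATA_ROOT:-/tmp/hadoop-data}
mkdir -p "$DATA_ROOT/dfs/name" "$DATA_ROOT/dfs/data"
ls -d "$DATA_ROOT/dfs/name" "$DATA_ROOT/dfs/data"
```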
hdfs-default
The defaults are in hdfs-default.xml:
<property>
<name>dfs.namenode.name.dir</name>
<value>file://${hadoop.tmp.dir}/dfs/name</value>
<description>Determines where on the local filesystem the DFS name node
should store the name table(fsimage). If this is a comma-delimited list
of directories then the name table is replicated in all of the
directories, for redundancy. </description>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file://${hadoop.tmp.dir}/dfs/data</value>
<description>Determines where on the local filesystem an DFS data node
should store its blocks. If this is a comma-delimited
list of directories, then data will be stored in all named
directories, typically on different devices. The directories should be tagged
with corresponding storage types ([SSD]/[DISK]/[ARCHIVE]/[RAM_DISK]) for HDFS
storage policies. The default storage type will be DISK if the directory does
not have a storage type tagged explicitly. Directories that do not exist will
be created if local filesystem permission allows.
</description>
</property>
Alternatively, modify hadoop.tmp.dir in core-site.xml once and for all:
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/data/</value>
<description>A base for other temporary directories.</description>
</property>
Its default, in core-default.xml, is:
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>
This setting must be changed: /tmp is wiped on a Linux reboot, so the directories under it disappear and errors like the following appear:
2023-05-19 14:33:39,374 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/hadoop-root/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible.
The only recovery is bin/hdfs namenode -format, but formatting wipes the existing metadata and data, so this is an expensive mistake to make.
After fixing the configuration, format the namenode:
$ bin/hdfs namenode -format
start
cd sbin
bash start-all.sh
If errors like the following are reported:
Starting namenodes on [sr250]
ERROR: JAVA_HOME is not set and could not be found.
Starting datanodes
ERROR: JAVA_HOME is not set and could not be found.
Starting secondary namenodes [sr250]
ERROR: JAVA_HOME is not set and could not be found.
Starting resourcemanager
ERROR: JAVA_HOME is not set and could not be found.
Starting nodemanagers
ERROR: JAVA_HOME is not set and could not be found.
Fix this by editing hadoop-env.sh under etc/hadoop: change the line "export JAVA_HOME=$JAVA_HOME" to "export JAVA_HOME=<the actual absolute Java path>".
Set the related environment variables
vim ~/.bashrc
INSTALL={}
export format_clean_hadoop=FALSE
export HADOOP_CONF_DIR=${INSTALL}/hadoop/etc/hadoop
export HADOOP_HOME=${INSTALL}/hadoop
export PATH=${INSTALL}/hadoop/bin:${INSTALL}/hadoop/sbin:$PATH
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_JOURNALNODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
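After reloading ~/.bashrc, a quick loop confirms the variables took effect (names taken from the list above):

```shell
# source ~/.bashrc   # reload first
# Print each variable, or flag it as unset.
for v in HADOOP_HOME HADOOP_CONF_DIR HDFS_NAMENODE_USER YARN_RESOURCEMANAGER_USER; do
  eval "val=\${$v:-}"
  if [ -n "$val" ]; then
    echo "$v=$val"
  else
    echo "$v is UNSET"
  fi
done
```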
Start the Spark history server
Update the relevant IP:PORT settings:
./spark-defaults.conf:spark.eventLog.dir hdfs://{IP}:9000/spark-history-server
./spark-defaults.conf:spark.history.fs.logDirectory hdfs://{IP}:9000/spark-history-server
./spark-defaults.conf:spark.yarn.historyServer.address {IP}:18080
./spark-defaults.conf:spark.sql.warehouse.dir hdfs://{IP}:9000/spark-warehouse
Create the event log directory:
hadoop fs -mkdir hdfs://{IP}:9000/spark-history-server
Typical spark-defaults.conf settings:
spark.master yarn
spark.eventLog.enabled true
spark.eventLog.dir hdfs://localhost:9000/spark-history-server
spark.history.fs.logDirectory hdfs://localhost:9000/spark-history-server
spark.yarn.historyServer.address localhost:18080
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 1g
spark.executor.memory 1g
spark.executor.instances 2
spark.executor.cores 1
spark.executor.extraJavaOptions -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
For versions 3.2 and above, it must be started with bash:
bash start-history-server.sh
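Once up, the history server answers on its REST API (port 18080 by default), which makes for a quick liveness probe:

```shell
# Lists completed applications as JSON when the server is up;
# falls back to a message when it is not reachable.
curl -s --max-time 5 http://localhost:18080/api/v1/applications \
  || echo "history server not reachable on :18080"
```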
Spark
The Spark history server can also use the local filesystem.
# these two are client-side settings
spark.eventLog.enabled true
spark.eventLog.dir file:///home/op/spark-3.2.2-bin-hadoop3.2/spark-events/
# this one is the server-side setting
spark.history.fs.logDirectory file:///home/op/spark-3.2.2-bin-hadoop3.2/spark-events/
Other Hive-related configuration changes
For details, see my other blog post: https://blog.csdn.net/zhixingheyi_tian/article/details/131186733
Hadoop Pseudo-Distributed deployment, start to finish
Modify the various configurations
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/data/</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>7168</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>2</value>
</property>
<property>
<description>The maximum allocation for every container request at the RM
in terms of virtual CPU cores. Requests higher than this will throw an
InvalidResourceRequestException.</description>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>5120</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/opt/Beaver/hadoop/</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=/opt/Beaver/hadoop/</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=/opt/Beaver/hadoop/</value>
</property>
</configuration>
Modify the environment variables
The general environment variables are covered in the sections above.
Note that if the local ssh port has been changed, the following option must be added to hadoop-env.sh:
export HADOOP_SSH_OPTS="-p 10000"
Start the daemons
bin/hdfs namenode -format
bash sbin/start-dfs.sh
bash sbin/start-yarn.sh
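A final sanity pass after everything is started (the HDFS commands are sketched as comments since they need the cluster up):

```shell
# With HDFS and YARN both running, jps should list all five daemons:
# NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager.
if command -v jps >/dev/null 2>&1; then
  jps
else
  echo "jps not found: is \$JAVA_HOME/bin on PATH?"
fi
# A trivial HDFS round-trip then proves the filesystem works:
#   bin/hdfs dfs -mkdir -p /user/$(whoami)
#   echo hello | bin/hdfs dfs -put - /user/$(whoami)/probe.txt
#   bin/hdfs dfs -cat /user/$(whoami)/probe.txt
```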