Setup JDK
tar -zxvf jdk-8u112-linux-x64.tar.gz
ln -s jdk1.8.0_112 jdk
vim ~/.bash_profile
export JAVA_HOME=/home/jdk
PATH=${JAVA_HOME}/bin:$PATH
export PATH
source ~/.bash_profile
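After reloading the profile, a quick sanity check confirms the JDK is visible (a minimal sketch; /home/jdk is the symlink created above, adjust if yours differs):

```shell
# Sanity-check the JDK install; /home/jdk follows this guide's layout.
JAVA_HOME=${JAVA_HOME:-/home/jdk}
if [ -x "$JAVA_HOME/bin/java" ]; then
  "$JAVA_HOME/bin/java" -version
  echo "JAVA_HOME OK: $JAVA_HOME"
else
  echo "JAVA_HOME invalid: $JAVA_HOME" >&2
fi
```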
Standalone Operation
By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.
The following example copies the unpacked conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory.
$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
$ cat output/*

1 dfsadmin
Setup passphraseless ssh
Now check that you can ssh to the localhost without a passphrase:
$ ssh localhost
If you cannot ssh to localhost without a passphrase, execute the following commands:
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
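A non-interactive probe makes this check scriptable; BatchMode forbids password prompts, so success here means key-based login really works:

```shell
# Prints "OK" only if ssh to localhost succeeds without any prompt.
if ssh -o BatchMode=yes -o ConnectTimeout=5 localhost true 2>/dev/null; then
  echo "passphraseless ssh: OK"
else
  echo "passphraseless ssh: NOT configured"
fi
```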
Pseudo-Distributed Operation
Hadoop can also be run on a single-node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process.
Configuration
etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/data/</value>
<description>A base for other temporary directories.</description>
</property>
</configuration>
etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
./etc/hadoop/hadoop-env.sh
export JAVA_HOME=/home/jdk
If JAVA_HOME is not set in hadoop-env.sh, Hadoop will not pick it up from the shell environment and will fail with an error; note that the value must be an absolute path.
start hdfs
$ bin/hdfs namenode -format
$ sbin/start-dfs.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/packages/hadoop-2.7.3/logs/hadoop-root-namenode-vsr264.out
localhost: starting datanode, logging to /home/packages/hadoop-2.7.3/logs/hadoop-root-datanode-vsr264.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /home/packages/hadoop-2.7.3/logs/hadoop-root-secondarynamenode-vsr264.out
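Whether the daemons actually came up can be verified with jps (shipped with the JDK); a minimal sketch:

```shell
# After start-dfs.sh, these three JVM processes should be listed by jps.
for proc in NameNode DataNode SecondaryNameNode; do
  if jps 2>/dev/null | grep -q "$proc"; then
    echo "$proc: running"
  else
    echo "$proc: NOT running"
  fi
done
```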
Browse the web interface for the NameNode; by default it is available at:
Note: the corresponding default ports change in Hadoop 3.x.
NameNode - http://localhost:50070/
Common errors
$ bin/hadoop fs -ls .
ls: `.': No such file or directory
This error means the user's home directory does not yet exist in HDFS; create it:
hadoop fs -mkdir -p /user/{user}
execute MapReduce jobs
Make the HDFS directories required to execute MapReduce jobs:
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/hello
Copy the input files into the distributed filesystem
bin/hdfs dfs -put etc/hadoop /input
Run some of the examples provided:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep /input/* output 'dfs[a-z.]+'
For Hadoop 3.2:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar grep /input/* output 'dfs[a-z.]+'
check result
# bin/hdfs dfs -cat output/*
or
# bin/hdfs dfs -cat hdfs://localhost:9000/user/root/output/*
6 dfs.audit.logger
4 dfs.class
3 dfs.server.namenode.
2 dfs.period
2 dfs.audit.log.maxfilesize
2 dfs.audit.log.maxbackupindex
1 dfsmetrics.log
1 dfsadmin
1 dfs.servers
1 dfs.replication
1 dfs.file
[root@vsr264 hadoop]# bin/hdfs dfs -ls hdfs://localhost:9000/user/root/output/*
-rw-r--r-- 1 root supergroup 0 2020-09-17 19:39 hdfs://localhost:9000/user/root/output/_SUCCESS
-rw-r--r-- 1 root supergroup 197 2020-09-17 19:39 hdfs://localhost:9000/user/root/output/part-r-00000
List the directories:
# hadoop fs -ls hdfs://localhost:9000/
Found 2 items
drwxr-xr-x - root supergroup 0 2023-05-17 17:26 hdfs://localhost:9000/input
drwxr-xr-x - root supergroup 0 2023-05-17 17:28 hdfs://localhost:9000/user
YARN on a Single Node
Configuration
etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Start ResourceManager daemon and NodeManager daemon
sbin/start-yarn.sh
Browse the web interface for the ResourceManager; by default it is available at:
ResourceManager - http://localhost:8088/
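The same jps check works for YARN; the ResourceManager additionally serves a REST API on port 8088, which gives a scriptable probe:

```shell
# After start-yarn.sh, both YARN daemons should appear in jps.
for proc in ResourceManager NodeManager; do
  if jps 2>/dev/null | grep -q "$proc"; then
    echo "$proc: running"
  else
    echo "$proc: NOT running"
  fi
done
# Scriptable alternative: the RM web services API returns cluster info as JSON.
# curl -s http://localhost:8088/ws/v1/cluster/info
```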
Reconfigure for running Spark
The following YARN settings must be configured, otherwise Spark on YARN will run into problems.
The configuration file is etc/hadoop/yarn-site.xml:
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>863304</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>96</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>776973</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<property>
<description>The maximum allocation for every container request at the RM
in terms of virtual CPU cores. Requests higher than this will throw an
InvalidResourceRequestException.</description>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>4</value>
</property>
Their defaults (from yarn-default.xml) are as follows:
<property>
<description>A comma separated list of services where service name should only
contain a-zA-Z0-9_ and can not start with numbers</description>
<name>yarn.nodemanager.aux-services</name>
<value></value>
<!--<value>mapreduce_shuffle</value>-->
</property>
<property>
<description>Amount of physical memory, in MB, that can be allocated
for containers. If set to -1 and
yarn.nodemanager.resource.detect-hardware-capabilities is true, it is
automatically calculated(in case of Windows and Linux).
In other cases, the default is 8192MB.
</description>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>-1</value>
</property>
<property>
<description>Number of vcores that can be allocated
for containers. This is used by the RM scheduler when allocating
resources for containers. This is not used to limit the number of
CPUs used by YARN containers. If it is set to -1 and
yarn.nodemanager.resource.detect-hardware-capabilities is true, it is
automatically determined from the hardware in case of Windows and Linux.
In other cases, number of vcores is 8 by default.</description>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>-1</value>
</property>
<property>
<description>The maximum allocation for every container request at the RM
in MBs. Memory requests higher than this will throw an
InvalidResourceRequestException.</description>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>8192</value>
</property>
<property>
<description>The maximum allocation for every container request at the RM
in terms of virtual CPU cores. Requests higher than this will throw an
InvalidResourceRequestException.</description>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>4</value>
</property>
<property>
<description>Whether virtual memory limits will be enforced for
containers.</description>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>true</value>
</property>
Spark verification example
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
examples/jars/spark-examples*.jar \
10
Hadoop 3.x default port changes
Namenode ports: 50470 --> 9871, 50070 --> 9870, 8020 --> 9820
Secondary NN ports: 50091 --> 9869, 50090 --> 9868
Datanode ports: 50020 --> 9867, 50010 --> 9866, 50475 --> 9865, 50075 --> 9864
quick setup
Setup passphraseless ssh
See the earlier section.
Replace the configuration files
grep "sr242" ./ -rl
grep "10.0.0.142" ./ -rl
grep "DP_disk" ./ -rl
sed -i "s/sr242/sr250/g" `grep "sr242" ./ -rl `
sed -i "s/10.0.0.142/10.0.0.150/g" `grep "10.0.0.142" ./ -rl `
grep "sr250" ./ -r
grep "10.0.0.150" ./ -r
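The grep/sed pair above can be wrapped in a small helper (hypothetical; the function name is mine) so every occurrence of an old hostname or IP is swapped in one pass:

```shell
# replace_host DIR OLD NEW -- rewrite OLD to NEW in every file under DIR
# that mentions it, printing each file it touches.
replace_host() {
  dir=$1; old=$2; new=$3
  grep -rl "$old" "$dir" 2>/dev/null | while read -r f; do
    sed -i "s/$old/$new/g" "$f"
    echo "updated: $f"
  done
}
# Example calls (hostnames/IPs from this guide):
#   replace_host . sr242 sr250
#   replace_host . 10.0.0.142 10.0.0.150
```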
The dfs.namenode.name.dir directory must be created in advance:
<property>
<name>dfs.namenode.name.dir</name>
<value>{nn}</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>{dn}</value>
</property>
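Creating those directories up front can be sketched as follows (in this guide they live under hadoop.tmp.dir, i.e. /opt/data; /tmp/hadoop-data is used here only so the sketch runs anywhere):

```shell
# Create the name/data directories before formatting the namenode.
# DATA_ROOT would be /opt/data in this guide's layout.
DATA_ROOT=${DATA_ROOT:-/tmp/hadoop-data}
mkdir -p "$DATA_ROOT/dfs/name" "$DATA_ROOT/dfs/data"
ls -d "$DATA_ROOT/dfs/name" "$DATA_ROOT/dfs/data"
```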
hdfs-default
The defaults are in hdfs-default.xml:
<property>
<name>dfs.namenode.name.dir</name>
<value>file://${hadoop.tmp.dir}/dfs/name</value>
<description>Determines where on the local filesystem the DFS name node
should store the name table(fsimage). If this is a comma-delimited list
of directories then the name table is replicated in all of the
directories, for redundancy. </description>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file://${hadoop.tmp.dir}/dfs/data</value>
<description>Determines where on the local filesystem an DFS data node
should store its blocks. If this is a comma-delimited
list of directories, then data will be stored in all named
directories, typically on different devices. The directories should be tagged
with corresponding storage types ([SSD]/[DISK]/[ARCHIVE]/[RAM_DISK]) for HDFS
storage policies. The default storage type will be DISK if the directory does
not have a storage type tagged explicitly. Directories that do not exist will
be created if local filesystem permission allows.
</description>
</property>
Alternatively, modify hadoop.tmp.dir in core-site.xml once and for all:
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/data/</value>
<description>A base for other temporary directories.</description>
</property>
Its default, in core-default.xml, is:
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>
This setting must be changed: /tmp is wiped on a Linux reboot, so the directories under it disappear and errors like the following appear:
2023-05-19 14:33:39,374 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/hadoop-root/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible.
The only recovery is bin/hdfs namenode -format, but formatting wipes the existing metadata and data, so this is an expensive mistake to make.
After fixing the configuration, format the namenode:
$ bin/hdfs namenode -format
start
cd sbin
bash start-all.sh
If errors like the following are reported:
Starting namenodes on [sr250]
ERROR: JAVA_HOME is not set and could not be found.
Starting datanodes
ERROR: JAVA_HOME is not set and could not be found.
Starting secondary namenodes [sr250]
ERROR: JAVA_HOME is not set and could not be found.
Starting resourcemanager
ERROR: JAVA_HOME is not set and could not be found.
Starting nodemanagers
ERROR: JAVA_HOME is not set and could not be found.
Fix this by editing hadoop-env.sh under etc/hadoop: change the line "export JAVA_HOME=$JAVA_HOME" to "export JAVA_HOME=<the actual absolute Java path>".
Set the related environment variables
vim ~/.bashrc
INSTALL={}
export format_clean_hadoop=FALSE
export HADOOP_CONF_DIR=${INSTALL}/hadoop/etc/hadoop
export HADOOP_HOME=${INSTALL}/hadoop
export PATH=${INSTALL}/hadoop/bin:${INSTALL}/hadoop/sbin:$PATH
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_JOURNALNODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
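After reloading ~/.bashrc, a quick loop confirms the variables took effect (names taken from the list above):

```shell
# source ~/.bashrc   # reload first
# Print each variable, or flag it as unset.
for v in HADOOP_HOME HADOOP_CONF_DIR HDFS_NAMENODE_USER YARN_RESOURCEMANAGER_USER; do
  eval "val=\${$v:-}"
  if [ -n "$val" ]; then
    echo "$v=$val"
  else
    echo "$v is UNSET"
  fi
done
```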
Start the Spark history server
Update the relevant IP:PORT settings:
./spark-defaults.conf:spark.eventLog.dir hdfs://{IP}:9000/spark-history-server
./spark-defaults.conf:spark.history.fs.logDirectory hdfs://{IP}:9000/spark-history-server
./spark-defaults.conf:spark.yarn.historyServer.address {IP}:18080
./spark-defaults.conf:spark.sql.warehouse.dir hdfs://{IP}:9000/spark-warehouse
Create the event log directory:
hadoop fs -mkdir hdfs://{IP}:9000/spark-history-server
Typical spark-defaults.conf settings:
spark.master yarn
spark.eventLog.enabled true
spark.eventLog.dir hdfs://localhost:9000/spark-history-server
spark.history.fs.logDirectory hdfs://localhost:9000/spark-history-server
spark.yarn.historyServer.address localhost:18080
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 1g
spark.executor.memory 1g
spark.executor.instances 2
spark.executor.cores 1
spark.executor.extraJavaOptions -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
For versions 3.2 and above, it must be started with bash:
bash start-history-server.sh
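Once up, the history server answers on its REST API (port 18080 by default), which makes for a quick liveness probe:

```shell
# Lists completed applications as JSON when the server is up;
# falls back to a message when it is not reachable.
curl -s --max-time 5 http://localhost:18080/api/v1/applications \
  || echo "history server not reachable on :18080"
```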
Spark
The Spark history server can also use the local filesystem.
# these two are client-side settings
spark.eventLog.enabled true
spark.eventLog.dir file:///home/op/spark-3.2.2-bin-hadoop3.2/spark-events/
# this one is the server-side setting
spark.history.fs.logDirectory file:///home/op/spark-3.2.2-bin-hadoop3.2/spark-events/
Other Hive-related configuration changes
For details, see my other blog post: https://blog.csdn.net/zhixingheyi_tian/article/details/131186733
Hadoop Pseudo-Distributed deployment, start to finish
Modify the various configurations
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/data/</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>7168</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>2</value>
</property>
<property>
<description>The maximum allocation for every container request at the RM
in terms of virtual CPU cores. Requests higher than this will throw an
InvalidResourceRequestException.</description>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>5120</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/opt/Beaver/hadoop/</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=/opt/Beaver/hadoop/</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=/opt/Beaver/hadoop/</value>
</property>
</configuration>
Modify the environment variables
The general environment variables are covered in the sections above.
Note that if the local ssh port has been changed, the following option must be added to hadoop-env.sh:
export HADOOP_SSH_OPTS="-p 10000"
Start the daemons
bin/hdfs namenode -format
bash sbin/start-dfs.sh
bash sbin/start-yarn.sh
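A final sanity pass after everything is started (the HDFS commands are sketched as comments since they need the cluster up):

```shell
# With HDFS and YARN both running, jps should list all five daemons:
# NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager.
if command -v jps >/dev/null 2>&1; then
  jps
else
  echo "jps not found: is \$JAVA_HOME/bin on PATH?"
fi
# A trivial HDFS round-trip then proves the filesystem works:
#   bin/hdfs dfs -mkdir -p /user/$(whoami)
#   echo hello | bin/hdfs dfs -put - /user/$(whoami)/probe.txt
#   bin/hdfs dfs -cat /user/$(whoami)/probe.txt
```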