Hadoop Project Study Notes

1. Using a Huawei Cloud server
  • Purchase a VPS on Huawei Cloud and connect to it with XShell (a Hong Kong (China) region was chosen so that GitHub is reachable).


2. Compiling and installing Hadoop
  • Install GCC

    • Steps 2 and 3 of the guide being followed can be skipped; in practice the system's default yum repositories worked without errors.
  • In step 5 of installing Protobuf, patch -p1 < protoc.patch fails with: -bash: patch: command not found

    Fix: yum -y install patch

  • In step 7 of installing Protobuf, an error is reported (unresolved; so far it has caused no visible problems).


  • Build succeeded; a sketch of the typical build command follows the Maven summary below.

    ...
    [INFO] Apache Hadoop Azure support ........................ SUCCESS [ 10.346 s]
    [INFO] Apache Hadoop Aliyun OSS support ................... SUCCESS [  5.318 s]
    [INFO] Apache Hadoop Client Aggregator .................... SUCCESS [  2.450 s]
    [INFO] Apache Hadoop Scheduler Load Simulator ............. SUCCESS [  1.571 s]
    [INFO] Apache Hadoop Resource Estimator Service ........... SUCCESS [  6.384 s]
    [INFO] Apache Hadoop Azure Data Lake support .............. SUCCESS [ 11.636 s]
    [INFO] Apache Hadoop Image Generation Tool ................ SUCCESS [  0.393 s]
    [INFO] Apache Hadoop Tools Dist ........................... SUCCESS [  8.816 s]
    [INFO] Apache Hadoop Tools ................................ SUCCESS [  0.024 s]
    [INFO] Apache Hadoop Client API ........................... SUCCESS [01:40 min]
    [INFO] Apache Hadoop Client Runtime ....................... SUCCESS [01:13 min]
    [INFO] Apache Hadoop Client Packaging Invariants .......... SUCCESS [  0.982 s]
    [INFO] Apache Hadoop Client Test Minicluster .............. SUCCESS [02:25 min]
    [INFO] Apache Hadoop Client Packaging Invariants for Test . SUCCESS [  0.149 s]
    [INFO] Apache Hadoop Client Packaging Integration Tests ... SUCCESS [  0.105 s]
    [INFO] Apache Hadoop Distribution ......................... SUCCESS [ 21.004 s]
    [INFO] Apache Hadoop Client Modules ....................... SUCCESS [  0.025 s]
    [INFO] Apache Hadoop Cloud Storage ........................ SUCCESS [  0.697 s]
    [INFO] Apache Hadoop Cloud Storage Project 3.1.1 .......... SUCCESS [  0.025 s]
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD SUCCESS
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 33:13 min
    [INFO] Finished at: 2022-05-27T22:35:15+08:00
    [INFO] ------------------------------------------------------------------------
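    The exact Maven command used for this build was not recorded in these notes; assuming it followed Hadoop's standard BUILDING.txt instructions, the native build would look roughly like this (a sketch, not the verified command):

    # build the binary distribution with native code, skipping tests (per BUILDING.txt)
    cd /root/hadoop-3.1.1-src
    mvn clean package -Pdist,native -DskipTests -Dtar
    # the packaged distribution ends up under hadoop-dist/target/hadoop-3.1.1/ (used in section 3)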
    
    
3. Testing Hadoop
  • Trial run of the hadoop command:

    [root@hadoop02 hadoop-3.1.1]# pwd
    /root/hadoop-3.1.1-src/hadoop-dist/target/hadoop-3.1.1
    [root@hadoop02 hadoop-3.1.1]# bin/hadoop
    Usage: hadoop [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
     or    hadoop [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]
      where CLASSNAME is a user-provided Java class
    
      OPTIONS is none or any of:
    
    buildpaths                       attempt to add class files from build tree
    --config dir                     Hadoop config directory
    --debug                          turn on shell script debug mode
    --help                           usage information
    hostnames list[,of,host,names]   hosts to use in slave mode
    hosts filename                   list of hosts to use in slave mode
    loglevel level                   set the log4j level for this command
    workers                          turn on worker mode
    
      SUBCOMMAND is one of:
    
    
        Admin Commands:
    
    daemonlog     get/set the log level for each daemon
    
        Client Commands:
    
    archive       create a Hadoop archive
    checknative   check native Hadoop and compression libraries availability
    classpath     prints the class path needed to get the Hadoop jar and the required libraries
    conftest      validate configuration 
    ...
    key           manage keys via the KeyProvider
    rumenfolder   scale a rumen input trace
    rumentrace    convert logs into a rumen trace
    s3guard       manage metadata on S3
    trace         view and modify Hadoop tracing settings
    version       print the version
    
        Daemon Commands:
    
    kms           run KMS, the Key Management Server
    
    SUBCOMMAND may print help when invoked w/o parameters or with -h.
    [root@hadoop02 hadoop-3.1.1]# 
    
  • Single-node setup

    https://hadoop.apache.org/docs/r3.1.1/hadoop-project-dist/hadoop-common/SingleCluster.html
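    Per the linked guide, Java must be installed and JAVA_HOME defined in etc/hadoop/hadoop-env.sh before the examples will run; a minimal sketch (the JDK path below is an assumption for this host):

      # etc/hadoop/hadoop-env.sh
      export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk   # hypothetical path; point it at the JDK actually installed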

    • Standalone Operation:

      [root@hadoop02 hadoop-3.1.1]# ls
      bin  etc  include  lib  libexec  LICENSE.txt  NOTICE.txt  README.txt  sbin  share
      [root@hadoop02 hadoop-3.1.1]# mkdir input
      [root@hadoop02 hadoop-3.1.1]# cp etc/hadoop/*.xml input
      [root@hadoop02 hadoop-3.1.1]# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar grep input output 'dfs[a-z.]+'
      2022-05-28 00:02:27,848 INFO impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
      2022-05-28 00:02:27,896 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
      2022-05-28 00:02:27,896 INFO impl.MetricsSystemImpl: JobTracker metrics system started
      2022-05-28 00:02:28,002 INFO input.FileInputFormat: Total input files to process : 9
      2022-05-28 00:02:28,022 INFO mapreduce.JobSubmitter: number of splits:9
      2022-05-28 00:02:28,116 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local146364211_0001
      2022-05-28 00:02:28,117 INFO mapreduce.JobSubmitter: Executing with tokens: []
      2022-05-28 00:02:28,212 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
      2022-05-28 00:02:28,214 INFO mapreduce.Job: Running job: job_local146364211_0001
      2022-05-28 00:02:28,215 INFO mapred.LocalJobRunner: OutputCommitter set in config null
      2022-05-28 00:02:28,221 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 2
      ...
      2022-05-28 00:02:30,322 INFO mapreduce.Job: Counters: 30
      	File System Counters
      		FILE: Number of bytes read=1336966
      		FILE: Number of bytes written=3249308
      		FILE: Number of read operations=0
      		FILE: Number of large read operations=0
      		FILE: Number of write operations=0
      	Map-Reduce Framework
      		Map input records=1
      		Map output records=1
      		Map output bytes=17
      		Map output materialized bytes=25
      		Input split bytes=157
      		Combine input records=0
      		Combine output records=0
      		Reduce input groups=1
      		Reduce shuffle bytes=25
      		Reduce input records=1
      		Reduce output records=1
      		Spilled Records=2
      		Shuffled Maps =1
      		Failed Shuffles=0
      		Merged Map outputs=1
      		GC time elapsed (ms)=0
      		Total committed heap usage (bytes)=2527068160
      	Shuffle Errors
      		BAD_ID=0
      		CONNECTION=0
      		IO_ERROR=0
      		WRONG_LENGTH=0
      		WRONG_MAP=0
      		WRONG_REDUCE=0
      	File Input Format Counters 
      		Bytes Read=123
      	File Output Format Counters 
      		Bytes Written=23
      [root@hadoop02 hadoop-3.1.1]# cat output/*
      1	dfsadmin
      [root@hadoop02 hadoop-3.1.1]#
      
    • Pseudo-Distributed Operation (errors encountered):

      [root@hadoop02 hadoop-3.1.1]# sbin/start-dfs.sh    # step 2 of the guide
      Starting namenodes on [localhost]
      ERROR: Attempting to operate on hdfs namenode as root
      ERROR: but there is no HDFS_NAMENODE_USER defined. Aborting operation.
      Starting datanodes
      ERROR: Attempting to operate on hdfs datanode as root
      ERROR: but there is no HDFS_DATANODE_USER defined. Aborting operation.
      Starting secondary namenodes [hadoop02]
      ERROR: Attempting to operate on hdfs secondarynamenode as root
      ERROR: but there is no HDFS_SECONDARYNAMENODE_USER defined. Aborting operation.
      [root@hadoop02 hadoop-3.1.1]#
      

    Fix: see the CSDN post “搭建虚拟机hadoop时,输入./sbin/start-dfs.sh启动hadoop,解决ERROR: Attempting to operate on hdfs namenode” (小亮泽的博客, CSDN).
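    The usual workaround (a sketch; it assumes the daemons are deliberately started as root, as here) is to declare the daemon users, for example at the top of etc/hadoop/hadoop-env.sh:

    # etc/hadoop/hadoop-env.sh (or at the top of sbin/start-dfs.sh and sbin/stop-dfs.sh)
    export HDFS_NAMENODE_USER=root
    export HDFS_DATANODE_USER=root
    export HDFS_SECONDARYNAMENODE_USER=root

    With these defined, start-dfs.sh gets past the user check, but the retry below still fails because password-less SSH to localhost is not set up: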

    [root@hadoop02 hadoop-3.1.1]# sbin/start-dfs.sh
    Starting namenodes on [localhost]
    Last login: Mon May 30 17:58:02 CST 2022 from ::1 on pts/2
    localhost: 
    localhost: Authorized users only. All activities may be monitored and reported.
    localhost: root@localhost: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
    Starting datanodes
    Last login: Mon May 30 18:00:15 CST 2022 on pts/2
    localhost: 
    localhost: Authorized users only. All activities may be monitored and reported.
    localhost: root@localhost: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
    Starting secondary namenodes [hadoop02]
    Last login: Mon May 30 18:00:15 CST 2022 on pts/2
    hadoop02: Warning: Permanently added 'hadoop02' (ECDSA) to the list of known hosts.
    hadoop02: 
    hadoop02: Authorized users only. All activities may be monitored and reported.
    hadoop02: root@hadoop02: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
    [root@hadoop02 hadoop-3.1.1]#
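    The remaining "Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password)" means root cannot ssh to localhost without a password. The linked SingleCluster guide covers this; a minimal sketch of the setup (key type and paths are the guide's defaults):

    # enable password-less ssh for the user that starts the daemons
    ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    chmod 0600 ~/.ssh/authorized_keys
    ssh localhost    # should log in without a prompt; then re-run sbin/start-dfs.sh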
    
  • Cluster installation (multi-node deployment):

4. Performance tuning
  • Kunpeng Porting Advisor (code migration tool)

    Quick start: Kunpeng Porting Advisor, Kunpeng DevKit documentation, Kunpeng community (hikunpeng.com)

    • Download the package

      [root@hadoop02 ~]# wget https://mirror.iscas.ac.cn/kunpeng/archive/Porting_Dependency/Packages/Porting-advisor_2.3.0_linux-Kunpeng.tar.gz
      
    • Extract the archive

      [root@hadoop02 ~]# tar --no-same-owner -zxvf Porting-advisor_2.3.0_linux-Kunpeng.tar.gz
      
    • Run the runtime_env_check script to check the tool's dependency files.

      [root@hadoop02 Porting-advisor_2.3.0_linux-Kunpeng]# bash runtime_env_check.sh
      
    • Install

      [root@hadoop02 Porting-advisor_2.3.0_linux-Kunpeng]# ./install web
      Porting Web console is now running, go to: https://192.168.0.250:8084/porting/#/login
      Successfully installed the Kunpeng Porting Advisor in /opt/portadv/.
      
    • The web console could not be opened

      Fix: see the Huawei Cloud forum post “【软件迁移】鲲鹏代码迁移工具无法打开Web页面” (huaweicloud.com)

      [root@hadoop02 package]# chmod 777 hadoop-3.1.1.tar.gz # give all users read/write/execute permission on the file (r=4, w=2, x=1)
      


      Login username for the porting tool: portadmin

  • Kunpeng performance analysis tool

    • Get the package: from the Kunpeng official website or via the prompt in the VSCode plugin

    • Extract the archive

    • Install:

      ./install.sh
      
    • Login username for the performance analysis tool: tunadmin

5. Pushing the source code to Gitee with Git
git config --global user.name "yjr"
git config --global user.email "yjr2609538400@163.com" # global Git identity settings
mkdir xxx # create a working directory
cd xxx # enter it
git init # turn the directory into a local git repository; a .git folder is created inside it
git add xxx # stage the files to commit; "git add ." stages everything under the directory
git commit -m main # commit the staged files to the local repository (-m supplies the commit message, here simply "main"; it does not name a branch)
git remote add origin https://gitee.com/Marches7/hadoop_test.git # remote repository URL, copied from the Clone/Download button
git push -u origin main # push the local main branch to the remote main branch

git pull origin main # pull (and merge) the remote main branch into the local main branch
# GitHub requires logging in with a token instead of a password; still to be sorted out (a common approach is sketched below)
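A common way to handle GitHub's token requirement (not verified in these notes; the user/repo below are placeholders): generate a personal access token under GitHub Settings -> Developer settings and paste it in place of the password when pushing over HTTPS, or switch the remote to SSH:

# hypothetical sketch: use an SSH remote so no token prompt appears
ssh-keygen -t ed25519 -C "yjr2609538400@163.com"            # add ~/.ssh/id_ed25519.pub to GitHub -> Settings -> SSH and GPG keys
git remote set-url origin git@github.com:<user>/<repo>.git  # placeholder remote
git push -u origin main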
6. Hadoop hands-on practice


7. Notes

During compilation, in the Protobuf installation's "step 6: return to the protobuf-2.5.0 root directory, then build and install to the system default location", configure with -fsigned-char (on AArch64/Kunpeng, plain char is unsigned by default, unlike on x86, so this flag keeps protobuf's char handling consistent):

./configure CFLAGS='-fsigned-char'
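
A sketch of that whole step with the flag applied (standard autotools targets; the directory is the protobuf-2.5.0 source root mentioned above):

cd protobuf-2.5.0
./configure CFLAGS='-fsigned-char'
make -j"$(nproc)"
make install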

8. Generating and applying patches

For open-source packages on Linux, after modifying the source we can capture the changes as a patch. This makes later updates and rollbacks easier: anyone who receives the patch can apply it directly instead of repeating the same edits to the source.

  • Generating a patch
diff -uNr src_path modify_path > patch_file

What it does: diff compares two files line by line and prints the differences.

Parameters:

-u : produce unified-format output, which is more compact than the default format
-N : treat missing files as empty, so newly created and deleted files are handled correctly
-r : recurse into directories, comparing every corresponding file in the two source trees, including those in subdirectories
src_path : path to the original source
modify_path : path to the modified source
patch_file : name of the generated patch file, conventionally given a .patch suffix
">" is the shell redirection operator: it writes diff's output into patch_file (without it, the diff is printed to standard output)
  • Applying a patch

    patch -pN < xxx.patch


    Parameters:

    -pN : strip the first N leading directory levels from the file paths in the patch when applying it
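
    A small end-to-end sketch (directory and file names below are made up for illustration):

    # keep a pristine copy and a modified copy of the source
    cp -r hadoop-3.1.1-src hadoop-3.1.1-src-mod
    # ... edit files under hadoop-3.1.1-src-mod ...

    # generate the patch from the two trees
    diff -uNr hadoop-3.1.1-src hadoop-3.1.1-src-mod > my-change.patch

    # apply it inside the original tree; -p1 strips the leading "hadoop-3.1.1-src-mod/" path component
    cd hadoop-3.1.1-src
    patch -p1 < ../my-change.patch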
    