1. Environment Background
1. The company's big data platform is built on CDH
There are many Hadoop distributions: Huawei's, Intel's, Cloudera's (CDH), MapR's, Hortonworks's, and so on. All of them derive from Apache Hadoop; they exist because Apache Hadoop's open-source license allows anyone to modify it and release or sell the result as an open-source or commercial product.
There are three main free distributions (all from foreign vendors):
- The Cloudera distribution (Cloudera's Distribution Including Apache Hadoop), "CDH" for short.
- The Apache Foundation's Hadoop.
- The Hortonworks distribution (Hortonworks Data Platform), "HDP" for short.
Our company's platform is built on CDH, with Hadoop version 2.6:
[deploy@hbase03 ~]$ hadoop version
Hadoop 2.6.0-cdh5.15.1
The server OS is CentOS 6.8:
[deploy@hbase03 ~]$ lsb_release -a
LSB Version: :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch
Distributor ID: CentOS
Description: CentOS release 6.8 (Final)
Release: 6.8
Codename: Final
2. Kerberos is used for access control
I had not looked into Kerberos much before; I only roughly knew it resembles a private/public key scheme, so I won't go deep into it here.
For details, see here and here.
Here I only cover the settings needed on the client side:
- the xxx.keytab file (stores user information and keys)
- the krb5.conf file (defines basic Kerberos settings such as the realm, encryption types, and so on)
xxx.keytab conventionally goes to /etc/keytab/xxx.keytab
krb5.conf must go to /etc/krb5.conf
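The two placement steps above can be sketched as a small helper. This is only an illustrative sketch: the destination paths are parameterized so the copy logic can be dry-run outside /etc, and the defaults match the convention above (run as root for the real /etc targets).

```shell
# Sketch of the client-side Kerberos file placement (destinations
# parameterized for dry-runs; defaults match the article's convention).
install_krb_files() {
  local keytab="$1" conf="$2"
  local keytab_dir="${3:-/etc/keytab}" conf_dst="${4:-/etc/krb5.conf}"
  mkdir -p "$keytab_dir"
  cp "$keytab" "$keytab_dir/$(basename "$keytab")"
  chmod 600 "$keytab_dir/$(basename "$keytab")"  # keytab holds key material: restrict it
  cp "$conf" "$conf_dst"                         # libkrb5 reads this fixed path
}

# Real usage (as root):
#   install_krb_files deploy.keytab krb5.conf
```

Afterwards, `klist -kt /etc/keytab/deploy.keytab` lists the principals stored in the keytab, which is a quick sanity check that the right file landed in the right place.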
2. Installation
My goal was to install a Hadoop client locally (on macOS) that could reach the staging Hadoop cluster. I never expected it to be this hard to pull off; it practically reduced me to tears.
1. A test installation on another CentOS server
In fact, before this step I had already been struggling on the Mac with no luck whatsoever, so I fell back to trying an install on a CentOS server first.
The user on this CentOS server is deploy.
1. Install and configure Kerberos
Install:
yum install krb5-workstation krb5-libs krb5-auth-dialog -y
Copy the xxx.keytab and krb5.conf files from one of the Hadoop machines into /etc/keytab and /etc respectively:
mv deploy.keytab /etc/keytab/deploy.keytab
mv krb5.conf /etc/krb5.conf
Run kinit to obtain a ticket:
[deploy@server-02 ~]$ kinit -kt /etc/keytab/deploy.keytab deploy
[deploy@server-02 ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_501
Default principal: deploy@SAMPLE.COM
Valid starting Expires Service principal
07/31/20 10:22:32 08/01/20 10:22:32 krbtgt/SAMPLE.COM@SAMPLE.COM
renew until 07/31/20 10:22:32
2. Install the client package
The client installed via CDH looks fairly complex, wrapped in a lot of shell scripts. A quick look at the structure:
The directory is /home/deploy/hadoop_client_test/hadoop2-client/
[deploy@server-02 hadoop_client_test]$ pwd
/home/deploy/hadoop_client_test
[deploy@server-02 hadoop_client_test]$ tree -L 3 hadoop2-client/
hadoop2-client/
│
├── bin
│ ├── hadoop
│ ├── hdfs
│ └── yarn.cmd
├── conf -> hadoop2-conf
├── hadoop1-conf
│ ...
│ └── topology.py
├── hadoop2-conf
│ ├── core-site.xml
│ ├── hadoop-env.sh
│ ├── hdfs-site.xml
│ ├── log4j.properties
│ ├── ssl-client.xml
│ ├── topology.map
│ └── topology.py
├── lib
│
├── libexec
│ ├── hadoop-config.sh
│ ├── hdfs-config.sh
│ ├── httpfs-config.sh
│ ├── kms-config.sh
│ ├── mapred-config.sh
│ └── yarn-config.sh
├── sbin
└── share
To make this easier to follow, I stripped out everything unrelated to how the shell scripts work and kept only the parts needed for the explanation.
The goal of the walkthrough below is to figure out how the hdfs command ends up choosing the Hadoop configuration.
The overall call chain, with the current directory at hadoop2-client, is as follows.
When I run ./bin/hdfs
hdfs is a shell script, and it does the following:
- sets BIN_DIR=${./bin}
- sets HADOOP_LIBEXEC_DIR=${./libexec} and exports it (export HADOOP_LIBEXEC_DIR=xxx) so that child processes can see it
- runs ${HADOOP_LIBEXEC_DIR}/hdfs-config.sh, i.e. ./libexec/hdfs-config.sh above
- hdfs-config.sh has no logic of its own and just runs ${HADOOP_LIBEXEC_DIR}/hadoop-config.sh, i.e. ./libexec/hadoop-config.sh above
  - this sets DEFAULT_CONF_DIR, the directory holding the cluster-related configuration:
    - if ./conf/hadoop-env.sh exists, then DEFAULT_CONF_DIR=conf, which via the symlink above points to hadoop2-conf; with the current layout this branch is taken
    - otherwise, if ./conf/hadoop-env.sh does not exist, ./etc/hadoop is used; this is the branch taken on the CDH cluster nodes
- after that it is just a matter of running the hdfs command according to the arguments
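The conf-dir selection above can be condensed into a few lines of shell. This is a simplified sketch of what hadoop-config.sh does, not the script verbatim, and here it is exercised against a throwaway directory mimicking the client layout:

```shell
# Simplified sketch of the DEFAULT_CONF_DIR selection in
# libexec/hadoop-config.sh: prefer ./conf when it carries hadoop-env.sh,
# otherwise fall back to ./etc/hadoop (the branch taken on CDH cluster nodes).
select_conf_dir() {
  local base="$1"
  if [ -f "$base/conf/hadoop-env.sh" ]; then
    echo "$base/conf"
  else
    echo "$base/etc/hadoop"
  fi
}

# Mimic the hadoop2-client layout in a throwaway directory:
base=$(mktemp -d)
mkdir -p "$base/hadoop2-conf" "$base/etc/hadoop"
ln -s "$base/hadoop2-conf" "$base/conf"   # conf -> hadoop2-conf, as in the tree above
touch "$base/conf/hadoop-env.sh"
select_conf_dir "$base"                   # the ./conf branch is taken
```

Deleting hadoop-env.sh from conf/ flips the same call to the ./etc/hadoop fallback, which is exactly the behavior described above.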
Then run the following command:
[deploy@server-02 hadoop2-client]$ ./bin/hdfs dfs -ls /
Found 5 items
drwxrwxrwx - deploy supergroup 0 2020-05-29 19:16 /flink
drwx------ - hbase hbase 0 2020-06-30 10:23 /hbase
drwxrwx--- - mapred supergroup 0 2020-07-15 10:40 /home
drwxrwxrwt - hdfs supergroup 0 2020-07-28 10:04 /tmp
drwxrwxrwx - hdfs supergroup 0 2020-07-28 10:04 /user
With that, the hadoop2 client is successfully installed on CentOS and can operate on HDFS.
2. Installing the hadoop2 client on the Mac
The user on the Mac is admin.
1. Install and configure Kerberos
brew install krb5
The rest of the configuration is identical to CentOS.
Run kinit:
➜ ~ kinit -kt /etc/keytab/deploy.keytab deploy
➜ ~ klist
Ticket cache: KCM:501
Default principal: deploy@SAMPLE.COM
Valid starting Expires Service principal
07/31/20 15:50:54 08/01/20 15:50:54 krbtgt/SAMPLE.COM@SAMPLE.COM
renew until 07/31/20 15:50:54
As you can see, a ticket is generated normally here as well.
2. Install the hadoop client
Same routine as before, so I won't repeat it.
3. The nightmare begins
The nightmare started as soon as I ran the hdfs command:
./bin/hdfs dfs -ls /
It fails with:
20/07/30 19:04:38 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/07/30 19:04:38 WARN security.UserGroupInformation: PriviledgedActionException as:admin (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
20/07/30 19:04:38 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
20/07/30 19:04:38 WARN security.UserGroupInformation: PriviledgedActionException as:admin (auth:KERBEROS) cause:java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
20/07/30 19:04:38 WARN security.UserGroupInformation: PriviledgedActionException as:admin (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
20/07/30 19:04:38 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
20/07/30 19:04:38 WARN security.UserGroupInformation: PriviledgedActionException as:admin (auth:KERBEROS) cause:java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
20/07/30 19:04:38 INFO retry.RetryInvocationHandler: Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB over bj3-stag-all-hbase02.tencn/10.76.0.100:8020 after 1 fail over attempts. Trying to fail over immediately.
java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "localhost/127.0.0.1"; destination host is: "bj3-stag-all-hbase02.tencn":8020;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
at org.apache.hadoop.ipc.Client.call(Client.java:1472)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
...
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:340)
Caused by: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:680)
at java.security.AccessController.doPrivileged(Native Method)
at org.apache.hadoop.ipc.Client.call(Client.java:1438)
... 28 more
Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
... 31 more
Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:122)
at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:224)
at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192)
... 40 more
This is only an excerpt of the exceptions, with part of the stack traces removed; otherwise it would be far too long.
The error clearly says there is no usable Kerberos TGT, i.e. authentication failed. The message also contains as:admin, which made me wonder for a while: is it because my local account is admin rather than deploy?
From some earlier reading it seemed Hadoop does special handling of the users in a keytab. My understanding was shaky, so I suspected this might be the problem; see here for an explanation of how Hadoop uses keytabs. Honestly I still don't fully understand it, and from the later configuration the keytab user seems unrelated to the local user, unless I'm still misunderstanding something.
At that point I gritted my teeth and created a local deploy user, fought through a pile of permission problems, and finally got permissions sorted, only to find the hdfs command failed with exactly the same error, just with admin replaced by deploy. Despair.
I didn't give up, and after a lot of googling I finally found an approach that solved a similar problem.
4. Turning on debugging
It turns out Hadoop lets you enable Kerberos debugging at run time, which gives a much closer look at where the problem lies:
export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true -Dsun.security.krb5.debug=true ${HADOOP_OPTS}"
Run the same command again:
./bin/hdfs dfs -ls /
The errors were basically the same as before, but this appeared at the very top:
Java config name: null
Native config name: /etc/krb5.conf
Loaded from native config
>>>KinitOptions cache name is /tmp/krb5cc_501
>> Acquire default native Credentials
default etypes for default_tkt_enctypes: 17 16 23.
>>> Found no TGT's in LSA
20/07/30 20:16:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
...
...
It says KinitOptions cache name is /tmp/krb5cc_501,
i.e. the ticket is looked up in the file /tmp/krb5cc_501. I remembered seeing something like this when running klist.
5. It turns out macOS and CentOS store tickets in different locations
Running klist on the Mac:
➜ ~ klist
Ticket cache: KCM:501
Default principal: deploy@SAMPLE.COM
Valid starting Expires Service principal
07/31/20 15:50:54 08/01/20 15:50:54 krbtgt/SAMPLE.COM@SAMPLE.COM
renew until 07/31/20 15:50:54
The cache location is KCM:501.
On CentOS, by contrast:
[deploy@server-02 ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_501
Default principal: deploy@SAMPLE.COM
Valid starting Expires Service principal
07/31/20 10:22:32 08/01/20 10:22:32 krbtgt/SAMPLE.COM@SAMPLE.COM
renew until 07/31/20 10:22:32
[deploy@server-02 ~]$
Here it is FILE:/tmp/krb5cc_501. So this mismatch was the cause.
Combined with the debug log saying it wants /tmp/krb5cc_501, we have to generate the ticket into that specific cache:
kinit -c FILE:/tmp/krb5cc_501 -kt /etc/keytab/deploy.keytab deploy
➜ klist -c FILE:/tmp/krb5cc_501
Ticket cache: FILE:/tmp/krb5cc_501
Default principal: deploy@SAMPLE.COM
Valid starting Expires Service principal
07/30/20 20:50:08 07/31/20 20:50:08 krbtgt/SAMPLE.COM@SAMPLE.COM
renew until 07/30/20 20:50:08
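In other words, the JDK reads a FILE-type cache named /tmp/krb5cc_<uid> (501 is the uid of the user here), while macOS's kinit defaults to the KCM cache the JDK cannot see. Besides passing -c as above, pointing KRB5CCNAME at the same path should also work, since both the krb5 tools and the JDK honor that variable; a small helper avoids hard-coding the uid:

```shell
# The JDK looks for a FILE ccache at /tmp/krb5cc_<uid>; build that
# name explicitly instead of hard-coding 501:
jdk_ccache() { echo "FILE:/tmp/krb5cc_$(id -u)"; }
jdk_ccache

# Option 1: write the ticket directly into that cache (as above):
#   kinit -c "$(jdk_ccache)" -kt /etc/keytab/deploy.keytab deploy
# Option 2: export KRB5CCNAME so every krb5 tool defaults to it:
#   export KRB5CCNAME="$(jdk_ccache)"
#   kinit -kt /etc/keytab/deploy.keytab deploy
```

Option 2 has the advantage that klist, kdestroy, and the JDK all agree on the same cache without repeating -c everywhere.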
Great, finally some traction on the problem.
Carefully run:
./bin/hdfs dfs -ls /
And sure enough, what greeted me was still a screenful of errors 😞
6. It turns out there was an encryption problem too
Debug mode was still in effect, and this is what appeared at the very top of the errors:
Java config name: null
Native config name: /etc/krb5.conf
Loaded from native config
>>>KinitOptions cache name is /tmp/krb5cc_501
>>>DEBUG <CCacheInputStream> client principal is deploy@SAMPLE.COM
>>>DEBUG <CCacheInputStream> server principal is krbtgt/SAMPLE.COM@SAMPLE.COM
>>>DEBUG <CCacheInputStream> key type: 18
>>>DEBUG <CCacheInputStream> auth time: Fri Jul 31 16:51:50 CST 2020
>>>DEBUG <CCacheInputStream> start time: Fri Jul 31 16:51:50 CST 2020
>>>DEBUG <CCacheInputStream> end time: Sat Aug 01 16:51:50 CST 2020
>>>DEBUG <CCacheInputStream> renew_till time: Fri Jul 31 16:51:50 CST 2020
>>> CCacheInputStream: readFlags() FORWARDABLE; RENEWABLE; INITIAL;
>>>DEBUG <CCacheInputStream> client principal is deploy@SAMPLE.COM
>>>DEBUG <CCacheInputStream> server principal is X-CACHECONF:/krb5_ccache_conf_data/fast_avail/krbtgt/SAMPLE.COM@SAMPLE.COM@SAMPLE.COM
>>>DEBUG <CCacheInputStream> key type: 0
>>>DEBUG <CCacheInputStream> auth time: Thu Jan 01 08:00:00 CST 1970
>>>DEBUG <CCacheInputStream> start time: null
>>>DEBUG <CCacheInputStream> end time: Thu Jan 01 08:00:00 CST 1970
>>>DEBUG <CCacheInputStream> renew_till time: null
>>> CCacheInputStream: readFlags()
>>> KrbCreds found the default ticket granting ticket in credential cache.
>>> unsupported key type found the default TGT: 18
>> Acquire default native Credentials
default etypes for default_tkt_enctypes: 17 16 23.
>>> Found no TGT's in LSA
20/07/31 16:51:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
...
...
The debug output shows it did initially find the deploy@SAMPLE.COM principal; the error only occurs further along:
>>> unsupported key type found the default TGT: 18
>> Acquire default native Credentials
default etypes for default_tkt_enctypes: 17 16 23.
>>> Found no TGT's in LSA
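For reference, the numeric key types in this debug output are standard Kerberos encryption-type IDs (RFC 3961 / RFC 4757); a small lookup table makes the log readable:

```shell
# Kerberos encryption-type numbers (RFC 3961 / RFC 4757) seen in the log:
# 17 16 23 are what the JDK offers (default_tkt_enctypes: 17 16 23), while
# the cached TGT was issued with type 18, which this JDK cannot decrypt.
enctype_name() {
  case "$1" in
    16) echo "des3-cbc-sha1" ;;
    17) echo "aes128-cts-hmac-sha1-96" ;;
    18) echo "aes256-cts-hmac-sha1-96" ;;
    23) echo "rc4-hmac (arcfour-hmac)" ;;
    *)  echo "unknown" ;;
  esac
}
enctype_name 18   # the unsupported TGT key type from the log
```

(Aside: on older Oracle JDK 8 builds, AES-256 support typically requires installing the JCE Unlimited Strength policy files; the fix below instead adjusts the enctype lists so the ticket is issued with a type the JDK already supports.)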
Googling unsupported key type found the default TGT: 18 revealed that this can happen when the local JDK does not support an encryption type configured via krb5.conf; see here.
According to that explanation, key type 18 is an AES-256 encryption algorithm, which the current JDK does not support, so the ticket cannot be decrypted.
Adjust the encryption settings in /etc/krb5.conf:
[libdefaults]
default_realm = SAMPLE.COM
dns_lookup_kdc = false
dns_lookup_realm = false
ticket_lifetime = 604800
renew_lifetime = 604800
forwardable = true
#default_tgs_enctypes = aes256-cts aes128-cts des3-hmac-sha1 arcfour-hmac des-hmac-sha1 des-cbc-md5 des-cbc-crc
#default_tkt_enctypes = aes256-cts aes128-cts des3-hmac-sha1 arcfour-hmac des-hmac-sha1 des-cbc-md5 des-cbc-crc
#permitted_enctypes = aes256-cts aes128-cts des3-hmac-sha1 arcfour-hmac des-hmac-sha1 des-cbc-md5 des-cbc-crc
default_tgs_enctypes = aes256-cts aes128-cts arcfour-hmac-md5 des-cbc-md5 des-cbc-crc
default_tkt_enctypes = arcfour-hmac-md5 aes256-cts aes128-cts des-cbc-md5 des-cbc-crc
permitted_enctypes = aes256-cts aes128-cts arcfour-hmac-md5 des-cbc-md5 des-cbc-crc
udp_preference_limit = 1
kdc_timeout = 3000
[realms]
SAMPLE.COM = {
kdc = kdc1.service.kk.srv
admin_server = kdc1.service.kk.srv
}
[domain_realm]
Regenerate the ticket:
kdestroy -c FILE:/tmp/krb5cc_501
kinit -c FILE:/tmp/krb5cc_501 -kt /etc/keytab/deploy.keytab deploy
Carefully run:
./bin/hdfs dfs -ls /
Output:
Java config name: null
Native config name: /etc/krb5.conf
Loaded from native config
>>>KinitOptions cache name is /tmp/krb5cc_501
>>>DEBUG <CCacheInputStream> client principal is deploy@SAMPLE.COM
>>>DEBUG <CCacheInputStream> server principal is krbtgt/SAMPLE.COM@SAMPLE.COM
>>>DEBUG <CCacheInputStream> key type: 23
>>>DEBUG <CCacheInputStream> auth time: Thu Jul 30 20:50:08 CST 2020
>>>DEBUG <CCacheInputStream> start time: Thu Jul 30 20:50:08 CST 2020
>>>DEBUG <CCacheInputStream> end time: Fri Jul 31 20:50:08 CST 2020
>>>DEBUG <CCacheInputStream> renew_till time: Thu Jul 30 20:50:08 CST 2020
>>> CCacheInputStream: readFlags() FORWARDABLE; RENEWABLE; INITIAL;
>>>DEBUG <CCacheInputStream> client principal is deploy@SAMPLE.COM
>>>DEBUG <CCacheInputStream> server principal is X-CACHECONF:/krb5_ccache_conf_data/fast_avail/krbtgt/SAMPLE.COM@SAMPLE.COM@SAMPLE.COM
>>>DEBUG <CCacheInputStream> key type: 0
>>>DEBUG <CCacheInputStream> auth time: Thu Jan 01 08:00:00 CST 1970
>>>DEBUG <CCacheInputStream> start time: null
>>>DEBUG <CCacheInputStream> end time: Thu Jan 01 08:00:00 CST 1970
>>>DEBUG <CCacheInputStream> renew_till time: null
>>> CCacheInputStream: readFlags()
>>> KrbCreds found the default ticket granting ticket in credential cache.
>>> Obtained TGT from LSA: Credentials:
client=deploy@SAMPLE.COM
server=krbtgt/SAMPLE.COM@SAMPLE.COM
authTime=20200730125008Z
startTime=20200730125008Z
endTime=20200731125008Z
renewTill=20200730125008Z
flags=FORWARDABLE;RENEWABLE;INITIAL
EType (skey)=23
(tkt key)=18
20/07/30 20:51:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found ticket for deploy@SAMPLE.COM to go to krbtgt/SAMPLE.COM@SAMPLE.COM expiring on Fri Jul 31 20:50:08 CST 2020
Entered Krb5Context.initSecContext with state=STATE_NEW
Found ticket for deploy@SAMPLE.COM to go to krbtgt/SAMPLE.COM@SAMPLE.COM expiring on Fri Jul 31 20:50:08 CST 2020
Service ticket not found in the subject
>>> Credentials acquireServiceCreds: same realm
default etypes for default_tgs_enctypes: 17 23.
>>> CksumType: sun.security.krb5.internal.crypto.RsaMd5CksumType
>>> EType: sun.security.krb5.internal.crypto.ArcFourHmacEType
>>> KdcAccessibility: reset
>>> KrbKdcReq send: kdc=kdc1.service.kk.srv TCP:88, timeout=3000, number of retries =3, #bytes=655
>>> KDCCommunication: kdc=kdc1.service.kk.srv TCP:88, timeout=3000,Attempt =1, #bytes=655
>>>DEBUG: TCPClient reading 642 bytes
>>> KrbKdcReq send: #bytes read=642
>>> KdcAccessibility: remove kdc1.service.kk.srv
>>> EType: sun.security.krb5.internal.crypto.ArcFourHmacEType
>>> KrbApReq: APOptions are 00100000 00000000 00000000 00000000
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
Krb5Context setting mySeqNumber to: 919666512
Created InitSecContextToken:
0000: 01 00 6E 82 02 32 30 82 02 2E A0 03 02
...
...
Entered Krb5Context.initSecContext with state=STATE_IN_PROCESS
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
Krb5Context setting peerSeqNumber to: 860175268
Krb5Context.unwrap: token=[05 04 01 ff 00 0c 00 00 00 00 00 00 33 45 3b a4 01 01 00 00 c6 aa 28 08 f7 4a 07 3a 76 ca 47 e7 ]
Krb5Context.unwrap: data=[01 01 00 00 ]
Krb5Context.wrap: data=[01 01 00 00 ]
Krb5Context.wrap: token=[05 04 00 ff 00 0c 00 00 00 00 00 00 16 b8 57 a2 01 01 00 00 fe 2c 1e ba 43 fc 1d 9f 9d 84 22 12 ]
Found 5 items
drwxrwxrwx - deploy supergroup 0 2020-05-29 19:16 /flink
drwx------ - hbase hbase 0 2020-06-30 10:23 /hbase
drwxrwx--- - mapred supergroup 0 2020-07-15 10:40 /home
drwxrwxrwt - hdfs supergroup 0 2020-07-28 10:04 /tmp
drwxrwxrwx - hdfs supergroup 0 2020-07-28 10:04 /user
I almost wept with joy 🤦♂️
References:
https://community.cloudera.com/t5/Community-Articles/Connect-Hadoop-client-on-Mac-OS-X-to-Kerberized-HDP-cluster/ta-p/248917
https://www.jianshu.com/p/cc523d5a715d