Troubleshooting common problems when setting up Hadoop with Kerberos and related components

  1. /datapool/loongoop/tmp/hadoop-yarn/staging/yarntest/.staging/job_1537338393901_0001/job.jar does not exist

Problem: 18/09/19 14:27:38 INFO mapreduce.Job: Task Id : attempt_1537338393901_0001_r_000000_2, Status : FAILED

File /datapool/loongoop/tmp/hadoop-yarn/staging/yarntest/.staging/job_1537338393901_0001/job.jar does not exist

java.io.FileNotFoundException: File /datapool/loongoop/tmp/hadoop-yarn/staging/yarntest/.staging/job_1537338393901_0001/job.jar does not exist

Solution:

This is likely caused by the hadoop user's uid/gid being inconsistent across nodes; a quick check follows.
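
A quick way to confirm, assuming example hostnames for the cluster nodes: compare the hadoop user's uid/gid everywhere; all lines should match.

    # Hostnames are examples; substitute your cluster's nodes.
    for host in master slaver1 slaver2; do
        ssh "$host" id hadoop
    done

If they differ, recreate the user/group with matching ids (groupadd -g / useradd -u) on the offending nodes.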

 

  2. Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)

 

Solution: authentication of the user most likely failed.

Run klist to check the state of the authenticated user:

Also note that the ticket cache must exist as a local file.

In /etc/krb5.conf, comment out: # default_ccache_name = KEYRING:persistent:%{uid}
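
A minimal sketch of re-authenticating, assuming the all.keytab path used elsewhere in this document and an example principal:

    # Show the current ticket cache; an empty list means no valid TGT.
    klist
    # Re-authenticate from a keytab (principal and keytab path are examples).
    kinit -kt /opt/loongoop-0.0.8/all.keytab yarntest@LOONGOOP.COM
    klist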

  3. Cannot create input directory QuasiMonteCarlo_1537322397847_320594172/in

 

Solution:

1. First confirm that the storage path contains the user directory (/datapool/user) and that its permissions are 777.

2. Check the permissions of the other directories; a sketch follows.
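
A minimal sketch of the check, assuming /datapool/user lives on the cluster filesystem (if it is a local mount, use plain ls/chmod instead):

    # Inspect the user directory and, if needed, open up its permissions.
    ./bin/hadoop fs -ls /datapool
    ./bin/hadoop fs -chmod 777 /datapool/user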

 

 

  4. exit code: 1 ExitCodeException exitCode=1: org.apache.hadoop.util.Shell.runCommand(Shell.java:538)

 

Solution: this is almost always a permissions problem.

Try adding the user to the appropriate permission group.

See also the solution for problem 3 above.

 

  5. ERROR org.apache.hadoop.mapred.ShuffleHandler: Shuffle error in populating headers: java.io.IOException: Error Reading IndexFile

 

 

Caused by: java.lang.ExceptionInInitializerError

Caused by: java.lang.RuntimeException: Secure IO is not possible without native code extensions.

Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.io.SecureIOUtils

Solution: either

1. the lib/native package is missing, or

2. the lib/native path is set incorrectly, so the libraries cannot be found.

Edit etc/hadoop/hadoop-env.sh (no extra settings should be needed); the usual lines are shown below.
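
For reference, these are the lines typically used in etc/hadoop/hadoop-env.sh to point Hadoop at its native libraries (assuming HADOOP_PREFIX is already set for your install):

    # Make sure lib/native is on the JVM's library search path.
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_PREFIX/lib/native
    export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_PREFIX/lib/native"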

  6. Caused by: javax.security.auth.login.LoginException: Unable to obtain password from user

 

Login failure for root/_HOST@LOONGOOP.COM from keytab /opt/loongoop-0.0.8/all.keytab: javax.security.auth.login.LoginException: Unable to obtain password from user

 

Solution: usually the keytab path is wrong or its permissions are insufficient.
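
Two quick checks, using the keytab path from the error above:

    # Confirm the keytab exists and is readable by the service user.
    ls -l /opt/loongoop-0.0.8/all.keytab
    # List the principals stored in the keytab; the failing principal must appear.
    klist -kt /opt/loongoop-0.0.8/all.keytab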

  7. Caused by: java.lang.ClassNotFoundException: org.apache.commons.codec.binary.Base64

 

java.io.IOException: Failed on local exception: java.io.IOException: Couldn't set up IO streams;

 

Solution:

Check that commons-codec-1.x.jar exists under /opt/aa/loongoop-0.0.8/share/hadoop/common/lib

and that other users have read permission on it.
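
A quick check (the install path is the one above; the jar's exact version will differ):

    # Verify the jar exists and is world-readable; add read permission if not.
    ls -l /opt/aa/loongoop-0.0.8/share/hadoop/common/lib/commons-codec-*.jar
    chmod o+r /opt/aa/loongoop-0.0.8/share/hadoop/common/lib/commons-codec-*.jar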

 

  8. org.apache.hadoop.ipc.Server: IPC Server handler 36 on 8031

Solution:

Most likely the accounts are inconsistent across nodes,

or possibly a problem with authentication of the HTTP/ principal.

  9. java.io.IOException: The ownership on the staging directory /tmp/hadoop-yarn/staging/test7/.staging is not as expected. It is owned by omm. The directory must be owned by the submitter test7 or by test7

Change the ownership of leofs:///tmp/hadoop-yarn/staging/test7/.staging to the user that owns the current ticket.

Usually it is enough to run a pi example once as that user:

         ./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar pi 10 100
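
Alternatively, change the ownership directly (a sketch; the leofs:// scheme and the test7 user come from the error above):

    # Re-own the staging directory so it matches the submitting user.
    ./bin/hadoop fs -chown -R test7:test7 leofs:///tmp/hadoop-yarn/staging/test7/.staging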

 

  10. Caused by: java.io.IOException: Running in secure mode, but config doesn't have a keytab

 

After starting HBase, jps shows the service did not come up.

Cause: Kerberos authentication was not configured, so the keytab file cannot be found.

 

  11. Caused by: java.io.IOException: Login failure for hbase/slaver2@LOONGOOP.COM from keytab /opt/hbase-1.2.2/conf/hbase.keytab: javax.security.auth.login.LoginException: Unable to obtain password from user

 

HRegionServer failed to start. Cause: the keytab contains no entry (ticket) for this node's principal; or the file permissions are insufficient, the path is written incorrectly, etc. (leading or trailing whitespace in the path is also not allowed).

 

  12. java.io.IOException: Directory /datapool/hbase/WALs/master,16020,1561520698609-splitting is not empty

 

 

Cause: leftover files in the WALs directory.

Fix: delete the WAL data for the affected node, then restart HBase. A sketch follows the link below.

https://blog.csdn.net/liu16659/article/details/80960323
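
A sketch of the cleanup (the path comes from the error above; if in doubt, move the directory aside with -mv instead of deleting):

    # Remove the stale -splitting directory, then restart HBase.
    ./bin/hadoop fs -rm -r /datapool/hbase/WALs/master,16020,1561520698609-splitting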

 

 

 

  13. org.apache.hadoop.hbase.security.AccessDeniedException: org.apache.hadoop.hbase.security.AccessDeniedException: Insufficient permissions (user=aa@LOONGOOP.COM, scope=default, params=[namespace=default,table=default:tab4,family=info],action=CREATE)

 

Possible causes:

The most common Linux-style permissions problem:

the current account does not have permission to create tables.

Fix: grant privileges to this user from a more privileged account.

Another possibility is a namespace restriction: tables must be created in namespace:table form, for example: create 'nc_table:table','info'

Note: nc_table must be an already existing namespace. A sketch of both fixes follows.
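
A minimal hbase shell sketch of both fixes, run as a sufficiently privileged user (the names aa and nc_table are the examples above):

    # Grant Create and Admin rights to the user, then create the namespaced table.
    grant 'aa', 'CA'
    create_namespace 'nc_table'
    create 'nc_table:table', 'info'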

  14. Running grant in the hbase shell fails with an error mentioning hbase:acl

hbase(main):001:0> grant 'aa','CA'

 

ERROR: hbase:acl

 

Here is some help for this command:

Grant users specific rights.

The hbase:acl table actually does not exist. The link below walks through a fix; roughly, delete the data and let it be regenerated.

https://www.linuxidc.com/Linux/2018-03/151441.htm

 

  15. The HBase HMaster log reports errors.

zookeeper.MetaTableLocator: Failed verification of hbase:meta,,1 at address=master,16020,1561549488917, exception=org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1 is not online on master,16020,1561555716564

The table hbase:acl does not exist in meta but has a znode. run hbck to fix inconsistencies

 

Stale ZooKeeper data. Delete the version-2 folder under the ZooKeeper data directory, and also the version-2 folder under logs.
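
A sketch of the cleanup, assuming example dataDir/dataLogDir locations from zoo.cfg:

    # Stop ZooKeeper, clear the stale snapshots and transaction logs, restart.
    ./bin/zkServer.sh stop
    rm -rf /opt/zookeeper/data/version-2 /opt/zookeeper/logs/version-2
    ./bin/zkServer.sh start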

 

  16. WARN [WorkerSender[myid=1]:QuorumCnxManager@588] - Cannot open channel to 3 at election address slaver2/13.10.20.14:3888 java.net.ConnectException: Connection refused

 

The port may be occupied, or there is stale data; delete the version-2 folders under data and logs as in problem 15. A port check is sketched below.
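
A quick check for the port conflict (3888 is the election port from the warning above):

    # See whether another process already holds the election port on this node.
    netstat -tlnp | grep 3888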

 

  17. Hive initialization fails with: [ERROR] Terminal initialization failed; falling back to unsupported java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected

 

Cause: the jline jar versions shipped with Hadoop and Hive differ.

Fix: find jline-xxx.jar under share/hadoop/yarn/lib in the Hadoop install,

then find jline-xxx.jar under lib in the Hive install; the two versions will differ. Make them match by replacing the old version with the new one (or the other way around), as sketched below.
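
A sketch of the swap, assuming example install paths and jline versions (check the actual filenames in your installs first):

    # Copy Hive's newer jline into Hadoop's yarn lib and remove the old one.
    cp hive/lib/jline-2.12.jar hadoop-2.7.2/share/hadoop/yarn/lib/
    rm hadoop-2.7.2/share/hadoop/yarn/lib/jline-0.9.94.jar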

 

  18. MetaException(message:Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused

Hive's metastore has not been started; start it first.

  19. java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User root is not allowed to impersonate anonymous

  • Modify the Hadoop configuration file etc/hadoop/core-site.xml and add the following properties:

<property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>*</value>
</property>

 

    In the property name hadoop.proxyuser.root.hosts, the root part is the username from the "User: ..." part of the error message.

    For example, for "User: zhaoshb is not allowed to impersonate anonymous", change the XML to:

<property>
    <name>hadoop.proxyuser.zhaoshb.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.zhaoshb.groups</name>
    <value>*</value>
</property>

 

Restart Hadoop.

Test: ./beeline -u 'jdbc:hive2://localhost:10000/userdb' -n username (replace username with the user from the error message)

 

  20. ZooKeeper fails to start with org.apache.zookeeper.server.quorum.QuorumPeerConfig$ConfigException: Address unresolved: master:388

 

The zoo.cfg configuration file contains stray whitespace (for example a trailing space after a server entry).
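
A quick way to spot the stray whitespace (the zoo.cfg path is an example):

    # Show line numbers with trailing whitespace, then dump invisible characters.
    grep -nE '[[:space:]]+$' /opt/zookeeper/conf/zoo.cfg
    cat -A /opt/zookeeper/conf/zoo.cfg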

 

  21. Starting spark-sql fails: Caused by: java.sql.SQLException: Unable to open a test connection to the given database. JDBC url = jdbc:mysql://13.10.20.12:3306/hive?createDatabaseIfNotExist=true, username = root. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------java.sql.SQLException: No suitable driver found for jdbc:mysql://13.10.20.12:3306/hive?createDatabaseIfNotExist=true

 

  1. Check the database privileges.
  2. Check the MySQL driver: is the driver jar present under spark-x.x.x/jars?
  3. Check whether the jdbc:mysql URL is written correctly.

If Hive errors when starting Hue, investigate with the same steps; a sketch of steps 1 and 2 follows.
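
A sketch of steps 1 and 2 (hostname, credentials, and the driver filename are examples; spark-x.x.x stands for your Spark install):

    # 1. Confirm the metastore database is reachable and the account has privileges.
    mysql -h 13.10.20.12 -u root -p -e 'show grants;'
    # 2. Put the MySQL JDBC driver where Spark can find it.
    cp mysql-connector-java-5.1.47.jar spark-x.x.x/jars/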

 

  22. Hive queries in Hue fail

 

Bad status for request TFetchResultsReq(fetchType=1, operationHandle=TOperationHandle(hasResultSet=False, modifiedRowCount=None, operationType=0, operationId=THandleIdentifier(secret='%\x1d\x0f\x11\xce\xf4B\xd6\xb8\xb5\x8ddv\xceKV', guid="\xbc5\xd9|\xfa\x13J'\x86\x8e4\xb3\x87\xa0\xf4\x05")), orientation=4, maxRows=-1): TFetchResultsResp(status=TStatus(errorCode=0, errorMessage="Couldn't find log associated with operation handle: OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=bc35d97c-fa13-4a27-868e-34b387a0f405]", sqlState=None, infoMessages=["*org.apache.hive.service.cli.HiveSQLException:Couldn't find log associated with operation handle: OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=bc35d97c-fa13-4a27-868e-34b387a0f405]:25:24",

 

 

Add the following property to hive-site.xml:

<property>
    <name>hive.server2.logging.operation.enabled</name>
    <value>true</value>
</property>

 

Restart hiveserver2, restart Hue, and run the query again.

 

  23. An Oozie Hive example hangs and never actually runs.

The Hue page shows the first MapReduce process repeatedly logging:

INFO [communication thread] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1458755526820_9216_m_000000_0 is : 1.0

 

There are two ways to solve this: configure multiple queues, or configure the FairScheduler.

Of the two, adding queues is the slower route, while switching the scheduler to FairScheduler is the ready-made, quicker fix:

Modify yarn-site.xml and add the following:

<!-- scheduler configuration, for multi-tasks run in queue, avoid mapreduce-run & pyspark ACCEPTED not run problem -->
<property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<property>
    <name>yarn.scheduler.fair.preemption</name>
    <value>true</value>
</property>
<!-- Sets the cluster-utilization threshold for preemption; the default is 0.8f, i.e. at most 80% of cluster resources can be preempted -->
<property>
    <name>yarn.scheduler.fair.preemption.cluster-utilization-threshold</name>
    <value>1.0</value>
</property>

Restart YARN, then run the Oozie Hive example again.

With the fair scheduler enabled, when you check the job on port 8088 you will see that although it still starts in the ACCEPTED state, it is now assigned to a machine running a NodeManager and initialized; after a while it moves to RUNNING, and the hang no longer occurs.

 

 

  24. ERROR, reason: Main class [org.apache.oozie.action.hadoop.HiveMain], exit code [10001]  Main class [org.apache.oozie.action.hadoop.HiveMain], exit code [1]

 

Check the SQL statement: copy it and run it directly on the Hive command line to see whether it works. This usually happens because something is wrong with the table; try a different table.

 

  25. Submitting an Oozie example fails with a "Wrong FS"-style message

Fix:

1. Check whether leofs-hadoop-2.6.0.jar under oozie-4.2.0/libext and oozie-4.2.0/oozie-server/webapps/oozie/WEB-INF/lib/ has been replaced with the new version.

2. Check that every configuration file under hadoop-2.7.2/etc/hadoop that contains the default FS setting has been changed to leofs://master; a grep sketch follows.
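
A quick way to find any config file still pointing at the wrong filesystem (a sketch covering both the fs.defaultFS key and the legacy fs.default.name spelling):

    # Every match should show leofs://master as the value.
    grep -rnE 'fs\.(defaultFS|default\.name)' hadoop-2.7.2/etc/hadoop/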

  26. Submitting an Oozie example fails with "file does not exist"

Fix:

The nameNode entry in job.properties is wrong; it should be leofs://master (hdfs://master on a stock HDFS cluster).

 

 
