Analyzing an Exception When Reading Hive on a Cloud Host from Local Spark Development in IDEA



1. Problem Background

1. Hive runs pseudo-distributed on a cloud host

Public IP: 47.101.xxx.xxx
Private (intranet) IP: 172.19.35.154
Hostname: hadoop001

2. Spark development is done locally in IDEA

The local hive-site.xml is configured as follows:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>

        <!-- hive_basic is the metastore database to be created; note the character-set settings -->
        <value>jdbc:mysql://47.101.xxx.xxx:3306/hive_basic?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8&amp;useSSL=false</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <!-- MySQL username -->
        <value>root</value>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <!-- MySQL password -->
        <value>123456</value>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>

    <property>
        <name>hive.server2.thrift.port</name>
        <value>10000</value>
    </property>
    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>hadoop001</value>
    </property>
    <property>
        <name>hive.server2.long.polling.timeout</name>
        <value>5000</value>
    </property>
    <property>
        <name>hive.server2.authentication</name>
        <value>NONE</value>
    </property>

    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>

</configuration>

The local core-site.xml is configured as follows:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop001:9001</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>hdfs://hadoop001:9001/hadoop/tmp</value>
    </property>
</configuration>

The local hdfs-site.xml is configured as follows:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

3. The cloud host's hosts file is configured as follows:

127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6

172.19.35.154 hadoop001

4. The local hosts file is configured as follows:

47.101.xxx.xxx hadoop001
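
With the hosts entries in place, it helps to confirm from the development machine that hadoop001 really resolves to the public IP and that the relevant ports are reachable. Below is a minimal sketch; the hostname and ports (9000/9001 for the NameNode RPC, 50010 for the DataNode transfer service) are taken from the configs and logs in this post and may differ in other setups.

import java.net.{InetAddress, InetSocketAddress, Socket}

// Quick reachability check from the development machine (sketch)
object ConnCheck {
  def main(args: Array[String]): Unit = {
    val host = "hadoop001"
    println(s"$host resolves to " + InetAddress.getByName(host).getHostAddress)

    // NameNode RPC port(s) and DataNode data-transfer port
    Seq(9000, 9001, 50010).foreach { port =>
      val socket = new Socket()
      try {
        socket.connect(new InetSocketAddress(host, port), 3000)
        println(s"$host:$port reachable")
      } catch {
        case e: Exception => println(s"$host:$port NOT reachable: ${e.getMessage}")
      } finally socket.close()
    }
  }
}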

2. Problem Symptoms

The code being run:

package com.ruozedata.sparksql.sparksql

import org.apache.spark.sql.SparkSession

object SparkSqlApp {

  def main(args: Array[String]): Unit = {

    // Local SparkSession with Hive support; hive-site.xml, core-site.xml and
    // hdfs-site.xml are picked up from the classpath
    val spark = SparkSession.builder()
      .appName("SparkSqlApp")
      .master("local[*]")
      .config("dfs.client.use.datanode.hostname", "true")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("show databases")
    spark.sql("use homework")
    spark.sql("show tables").show(false)
    spark.sql("select * from user_click").show(10, false)

    spark.stop()
  }

}
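
For this to run from IDEA, the project also needs the Spark Hive module on the classpath (required by enableHiveSupport), the MySQL JDBC driver (since hive-site.xml points the metastore directly at MySQL), and the three XML files above under src/main/resources. A rough build.sbt sketch; the versions are assumptions and should be aligned with the cluster:

// build.sbt (sketch): versions are assumptions, match them to the cluster
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"  % "2.4.3",
  "org.apache.spark" %% "spark-hive" % "2.4.3",            // needed for enableHiveSupport()
  "mysql"             % "mysql-connector-java" % "5.1.47"  // metastore JDBC driver from hive-site.xml
)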

Running the same statements through spark-shell on the cloud host works fine; both HDFS and Hive behave normally:

scala> spark.sql("show databases").show()
19/05/27 16:44:56 ERROR metastore.ObjectStore: Version information found in metastore differs 1.1.0 from expected schema version 1.2.0. Schema verififcation is disabled hive.metastore.schema.verification so setting version.
+------------+
|databaseName|
+------------+
|     default|
|    homework|
|        test|
+------------+

scala> spark.sql("show tables").show()
+--------+----------+-----------+
|database| tableName|isTemporary|
+--------+----------+-----------+
|homework|user_click|      false|
+--------+----------+-----------+

scala> spark.sql("select * from user_click").show(10,false)
+-------+--------------------------------+-------------------+-------+----------+----------+
|user_id|session_id                      |action_time        |city_id|product_id|day       |
+-------+--------------------------------+-------------------+-------+----------+----------+
|95     |2bf501a7637549c89cf55342331b15db|2016-05-05 21:01:56|1      |72        |2016-05-05|
|95     |2bf501a7637549c89cf55342331b15db|2016-05-05 21:52:26|1      |68        |2016-05-05|
|95     |2bf501a7637549c89cf55342331b15db|2016-05-05 21:17:03|1      |40        |2016-05-05|
|95     |2bf501a7637549c89cf55342331b15db|2016-05-05 21:32:07|1      |21        |2016-05-05|
|95     |2bf501a7637549c89cf55342331b15db|2016-05-05 21:26:06|1      |63        |2016-05-05|
|95     |2bf501a7637549c89cf55342331b15db|2016-05-05 21:03:11|1      |60        |2016-05-05|
|95     |2bf501a7637549c89cf55342331b15db|2016-05-05 21:43:43|1      |30        |2016-05-05|
|95     |2bf501a7637549c89cf55342331b15db|2016-05-05 21:09:58|1      |96        |2016-05-05|
|95     |2bf501a7637549c89cf55342331b15db|2016-05-05 21:18:45|1      |71        |2016-05-05|
|95     |2bf501a7637549c89cf55342331b15db|2016-05-05 21:42:39|1      |8         |2016-05-05|
+-------+--------------------------------+-------------------+-------+----------+----------+
only showing top 10 rows

However, running it from IDEA on the local machine fails.

The error output is as follows:

19/05/27 17:23:44 INFO HadoopRDD: Input split: hdfs://hadoop001:9000/user/hive/warehouse/homework.db/user_click/day=2016-05-05/user_click.txt:0+690923
19/05/27 17:23:44 INFO CodeGenerator: Code generated in 12.8685 ms
19/05/27 17:24:05 WARN BlockReaderFactory: I/O error constructing remote block reader.
java.net.ConnectException: Connection timed out: no further information
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
	at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3090)
	at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:778)
	at 
19/05/27 17:24:05 INFO DFSClient: Could not obtain BP-773342264-47.103.47.252-1558318800816:blk_1073741892_1068 from any node: java.io.IOException: No live nodes contain block BP-773342264-47.103.47.252-1558318800816:blk_1073741892_1068 after checking nodes = [172.19.35.154:50010], ignoredNodes = null No live nodes contain current block Block locations: 172.19.35.154:50010 Dead nodes:  172.19.35.154:50010. Will get new block locations from namenode and retry...
19/05/27 17:24:05 WARN DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for 758.8464188191888 msec.
19/05/27 17:24:27 WARN BlockReaderFactory: I/O error constructing remote block reader.
java.net.ConnectException: Connection timed out: no further information
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
	at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3090)
	at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:778)
	at 

3. Problem Analysis

1. spark-shell on the cloud host works fine, which rules out problems with the cluster setup or processes not being started.

2. The cloud host has no firewall configured, which rules out the firewall blocking the connection.

3. The cloud server's firewall has the DataNode data-transfer port open (default 50010).

Because the local development machine and the cloud host are not on the same LAN, and the Hadoop configuration uses the intranet IP for communication between machines, the following happens: the client can reach the NameNode, and the NameNode returns the address of the machine holding the data so that the client can talk to its data-transfer service. But the NameNode and the DataNode communicate with each other over the intranet, so the address returned is the DataNode's intranet IP, which a client outside the cloud network cannot reach.

This part of the error output shows the problem:

Block locations: 172.19.35.154:50010 Dead nodes:  172.19.35.154:50010. Will get new block locations from namenode and retry...

The error shows that the client cannot connect to 172.19.35.154:50010, the DataNode's intranet address; from the outside, the DataNode can only be reached at 47.101.xxx.xxx:50010.
To let the development machine access HDFS, we can address HDFS by hostname, so that the NameNode effectively gives us the DataNode's hostname instead of its intranet IP.
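
This can be checked directly with the Hadoop client API: the block locations the NameNode hands back carry the DataNode's intranet address, while the hostname is something the local hosts file can resolve. A minimal sketch, using the file path from the input split in the log above:

import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Ask the NameNode where the blocks of the table file live (sketch)
object BlockLocations {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new URI("hdfs://hadoop001:9000"), new Configuration())
    val path = new Path("/user/hive/warehouse/homework.db/user_click/day=2016-05-05/user_click.txt")
    val status = fs.getFileStatus(path)
    fs.getFileBlockLocations(status, 0, status.getLen).foreach { loc =>
      println("names (ip:port): " + loc.getNames.mkString(", ")) // 172.19.35.154:50010 -- unreachable from outside
      println("hosts          : " + loc.getHosts.mkString(", ")) // hadoop001 -- resolvable via the local hosts file
    }
    fs.close()
  }
}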

4. Solution

After some research, the fix is to add config("dfs.client.use.datanode.hostname", "true") to the SparkSession configuration, which tells the HDFS client to connect to DataNodes by hostname instead of by the intranet IP returned by the NameNode.

val spark = SparkSession.builder()
    .appName("SparkSqlApp")
    .master("local[*]")
    .config("dfs.client.use.datanode.hostname", "true")  // connect to DataNodes by hostname
    .enableHiveSupport()
    .getOrCreate()
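
For reference, the same property can also be passed through Spark's generic spark.hadoop. prefix, or set directly on the running context's Hadoop configuration. Both variants below are sketches of the same idea, not something required in addition to the builder call above:

// Alternative ways to pass the same setting (sketches)

// 1. Via the spark.hadoop. prefix, which Spark copies into the Hadoop Configuration
val spark = SparkSession.builder()
  .appName("SparkSqlApp")
  .master("local[*]")
  .config("spark.hadoop.dfs.client.use.datanode.hostname", "true")
  .enableHiveSupport()
  .getOrCreate()

// 2. Directly on the Hadoop Configuration after the session exists
spark.sparkContext.hadoopConfiguration.set("dfs.client.use.datanode.hostname", "true")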

While troubleshooting I also came across another approach for the same error. It did not solve the problem in my case, but I include it here anyway.

Add the following to hdfs-site.xml:

<property>
    <name>dfs.client.use.datanode.hostname</name>
    <value>true</value>
</property>

This makes it simple and convenient to change intranet IPs, and makes data exchange between specific DataNodes easier. The side effect is that if DNS resolution fails, the whole Hadoop cluster stops working properly, so DNS must be reliable.

Summary: the fix is to switch from the default IP-based access to hostname-based access.

Reference: https://www.jianshu.com/p/1fae94132427
