Spark exception when reading a Hive table

盛源_01

Categories: spark, hive. Tags: spark, hive

Article link: https://blog.csdn.net/weixin_40829577/article/details/109001268

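01 Operations that may cause the exception

1) Repairing table partitions with the Hive command msck repair table table_name

2) Adding a new column: alter table tb add columns(col1 string) cascade;

02 Cause

To improve performance, Spark caches table metadata. If an external system updates the metadata, Spark must refresh its cached copy for that table before using it.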

/**
* Invalidates and refreshes all the cached data and metadata of the given table. For performance
* reasons, Spark SQL or the external data source library it uses might cache certain metadata
* about a table, such as the location of blocks. When those change outside of Spark SQL, users
* should call this function to invalidate the cache.
*
* If this table is cached as an InMemoryRelation, drop the original cached version and make the
* new version cached lazily.
*
* @param tableName is either a qualified or unqualified name that designates a table/view.
*                  If no database identifier is provided, it refers to a temporary view or
*                  a table/view in the current database.
* @since 2.0.0
*/
def refreshTable(tableName: String): Unit

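03 Solution

1. Start the spark-shell client

1) Allocate enough executor-memory/driver-memory, otherwise the job runs out of memory;

2) Keep the parallelism modest, otherwise it exceeds the allowed number of concurrent accesses;

3) Avoid too many partitions, otherwise the job runs out of memory;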

spark-shell \
	--name ShyTestError \
	--master yarn \
	--deploy-mode client \
	--num-executors 2 \
	--executor-memory 12G \
	--executor-cores 2 \
	--driver-memory 8G \
	--conf spark.driver.maxResultSize=4G \
	--conf spark.dynamicAllocation.enabled=false \
	--conf spark.executor.memoryOverhead=8G \
	--conf spark.default.parallelism=8 \
	--conf spark.sql.shuffle.partitions=8

2. Refresh the metadata of the affected table

spark.catalog.refreshTable("table_name")
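
When several tables were changed outside Spark (for example by msck repair table or alter table ... add columns in Hive), the same call can be applied to each of them. A minimal PySpark-style sketch, assuming `spark` is an existing SparkSession; the helper name `refresh_tables` is hypothetical and not part of Spark:

```python
def refresh_tables(spark, table_names):
    """Invalidate Spark's cached metadata for each table in table_names.

    `spark` is assumed to be an existing SparkSession (or any object that
    exposes catalog.refreshTable); this helper is illustrative only.
    """
    refreshed = []
    for name in table_names:
        # Same call as above: drops the cached data/metadata so the next
        # read picks up the current state from the Hive metastore.
        spark.catalog.refreshTable(name)
        refreshed.append(name)
    return refreshed
```

In a pyspark session this would be called as refresh_tables(spark, ["db.table_a", "db.table_b"]), where the table names are placeholders.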

20 Fixing the Spark read failure after adding a column

1 Problem

1) Run the add-column command

alter table table_name add columns(ekey string);

2) Spark fails when reading the table

select * from table_name where dt = '20211118' limit 20;

Inferring case-sensitive schema for table  (inference mode: INFER_AND_SAVE)
WARN HiveExternalCatalog: The table schema given by Hive metastore(struct<server_time:bigint,ip:string>) is different from the schema when this table was created by Spark SQL(struct<server_time:bigint>). We have to fall back to the table schema from Hive metastore which is not case preserving.
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 23554"...

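What the error means:

After a column was added with a Hive command, the Hive table schema and the schema stored by Spark SQL no longer matched;

Spark's automatic schema inference was enabled (inference mode INFER_AND_SAVE), and the large number of partitions caused it to run out of memory;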
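
The mismatch reported in the warning can be checked by hand by comparing the column list from the Hive metastore with the schema JSON that Spark stores in the table properties. A rough sketch in plain Python, assuming flat (non-nested) columns; the function names are hypothetical, and the sample values are modeled on the warning and table properties shown in this article:

```python
import json
import re


def hive_struct_columns(struct_str):
    """Parse column names from a simple 'struct<name:type,...>' string.

    Only flat primitive columns are handled; nested types are out of scope.
    """
    match = re.fullmatch(r"struct<(.*)>", struct_str.strip())
    if not match:
        raise ValueError("not a struct<...> string: %r" % struct_str)
    body = match.group(1)
    return [field.split(":", 1)[0] for field in body.split(",") if field]


def spark_schema_columns(schema_json):
    """Return the field names from a Spark schema JSON string."""
    return [f["name"] for f in json.loads(schema_json)["fields"]]


def columns_missing_in_spark(hive_struct, spark_schema_json):
    """Columns known to the Hive metastore but absent from Spark's stored schema."""
    spark_cols = {c.lower() for c in spark_schema_columns(spark_schema_json)}
    return [c for c in hive_struct_columns(hive_struct)
            if c.lower() not in spark_cols]


hive_struct = "struct<server_time:bigint,ip:string>"
spark_schema = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "server_time", "type": "long", "nullable": True, "metadata": {}},
        {"name": "dt", "type": "string", "nullable": True, "metadata": {}},
    ],
})
print(columns_missing_in_spark(hive_struct, spark_schema))  # ['ip']
```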

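3) Other observations

a. The read failure occurs only in Spark; Hive and Superset read the table normally;

b. It occurs only when the table has many partitions and a large volume of data;

2 What show create table revealed

1) Run the show create table command

show create table table_name;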

CREATE EXTERNAL TABLE `table_name`(
`server_time` bigint COMMENT '服务器接收时间',
`ip` string COMMENT '客户端ip'
)
COMMENT '测试athena通用dwd'
PARTITIONED BY (
`dt` string COMMENT '')
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3_path'
TBLPROPERTIES (
'last_modified_by'='houyuan.sheng',
'last_modified_time'='1637478001',
'lifecycle'='-1',
'owner'='houyuan.sheng',
'parquet.compress'='snappy',
'spark.sql.create.version'='2.2 or prior',
'spark.sql.sources.schema.numPartCols'='1',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"server_time","type":"long","nullable":true,"metadata":{"comment":"服务器接收时间"}},{"name":"dt","type":"string","nullable":true,"metadata":{"comment":""}}]}',
'spark.sql.sources.schema.partCol.0'='dt'

)

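Note: the spark.sql.* table properties (highlighted in red in the original post) appear only after Spark has written data to the Hive table or refreshed the table's metadata;

2) New finding

The Hive table properties contain some Spark information. Could changing those table properties fix the read failure?

3) Change the table property

-- Manually make the Spark SQL schema consistent with the Hive schema
alter table `table_name` set tblproperties('spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"server_time","type":"long","nullable":true,"metadata":{"comment":"服务器接收时间"}},{"name":"ip","type":"string","nullable":true,"metadata":{"comment":"客户端ip"}},{"name":"dt","type":"string","nullable":true,"metadata":{"comment":""}}]}');

-- Queries work normally now
select * from table_name where dt = '20211116' limit 200;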
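
Writing the schema JSON by hand is error-prone (a missing comma or bracket makes it invalid), so it may be easier to generate the statement. A small sketch in plain Python, assuming the schema fits in a single part (spark.sql.sources.schema.numParts = 1, as in the table above) and that partition columns come last; the function names and the English column comments are illustrative assumptions:

```python
import json


def build_schema_property(columns):
    """Build the JSON value for 'spark.sql.sources.schema.part.0'.

    `columns` is a list of (name, spark_type, comment) tuples, data columns
    first and partition columns last, matching the layout shown above.
    """
    fields = [
        {"name": name, "type": spark_type, "nullable": True,
         "metadata": {"comment": comment}}
        for name, spark_type, comment in columns
    ]
    # Compact separators and ensure_ascii=False keep the value in the same
    # shape as the property shown above (non-ASCII comments stay readable).
    return json.dumps({"type": "struct", "fields": fields},
                      separators=(",", ":"), ensure_ascii=False)


def build_alter_statement(table, columns):
    """Return an alter table statement that sets the schema property."""
    schema_json = build_schema_property(columns)
    # A single quote would end the SQL string literal early; this sketch
    # refuses such input instead of guessing at escaping rules.
    if "'" in schema_json:
        raise ValueError("single quotes in names/comments are not supported")
    return ("alter table `%s` set tblproperties("
            "'spark.sql.sources.schema.part.0'='%s');" % (table, schema_json))


columns = [
    ("server_time", "long", "server receive time"),
    ("ip", "string", "client ip"),
    ("dt", "string", ""),
]
print(build_alter_statement("table_name", columns))
```

The printed statement can then be reviewed and run in Hive. Because json.dumps produces the value, it is always well-formed JSON.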

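3 Solutions

1) When the table has many partitions

Manually edit the schema table property that Spark wrote into the Hive table; this fixes the Spark read failure caused by adding a column to the Hive table;

2) When the table has few partitions

a. Run a select or insert from Spark, which corrects the schema stored in Hive

b. Refresh the table in Spark

3) Disable schema inference (must be set in every Spark program, so not practical)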

set spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER;
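
From a PySpark program the same setting can be applied to the session before the table is read. A minimal sketch, assuming `spark` is an existing SparkSession; the helper name `set_inference_mode` is hypothetical. With NEVER_INFER, Spark falls back to the case-insensitive schema from the Hive metastore, so this is only suitable when column names in the data files do not rely on case:

```python
VALID_INFERENCE_MODES = ("INFER_AND_SAVE", "INFER_ONLY", "NEVER_INFER")


def set_inference_mode(spark, mode="NEVER_INFER"):
    """Set spark.sql.hive.caseSensitiveInferenceMode on an existing session.

    `spark` is assumed to be a SparkSession; the mode is validated first so
    that a typo fails here rather than later inside Spark.
    """
    if mode not in VALID_INFERENCE_MODES:
        raise ValueError("unknown inference mode: %r" % mode)
    spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", mode)
    return mode
```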

99 Keywords

1) Partitions become invalid;

2) Spark fails to read the table;
