Spark throws java.lang.NegativeArraySizeException when reading Parquet data

Background: The Spark job materializes intermediate results to disk several times, writing the data to HDFS in Parquet format and then reading it back to continue processing. Partway through, the job failed with the following error:

 

	[spark] Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://master1:8020/user/xxx/part-00512-0462dbf5-98b2-41fa-925c-3ab53f9f060b-c000.snappy.parquet
	[spark]         at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:223)
	[spark]         at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:215)
	[spark]         at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
	[spark]         at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
	[spark]         at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:182)
	[spark]         at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
	[spark]         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	[spark]         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	[spark]         at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
	[spark]         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	[spark]         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	[spark]         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	[spark]         at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1126)
	[spark]         at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1132)
	[spark]         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	[spark]         at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	[spark]         at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	[spark]         at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:204)
	[spark]         at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.writeIteratorToStream(PythonUDFRunner.scala:52)
	[spark]         at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:215)
	[spark]         at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988)
	[spark]         at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:170)
	[spark] Caused by: java.lang.NegativeArraySizeException
	[spark]         at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1064)
	[spark]         at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:698)
	[spark]         at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126)
	[spark]         at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:194)
	[spark]         ... 21 more

The Parquet output that triggered the error had a particular characteristic: when it was written, the data went through an aggregation in which collect_set was applied to a string column whose values are Chinese text. The operations were as follows:

    # Aggregate the data
    spark.sql("""
        select contact_info,
               count(id) as qy_num,
               collect_set(company_name) as qy_con_name  -- this column contains Chinese text
          from contact_v
         group by contact_info
    """).filter("qy_num >= 2").write.parquet('agg_path', mode='overwrite')
    spark\
        .read\
        .parquet('agg_path')\
        .createOrReplaceTempView("contact_agg_v")


    # Join the aggregated result back to the original data
    spark.sql("""
        select a.id, 
               b.qy_num,
               b.qy_con_name, -- the aggregated array column is carried through here
               a.company_name
          from contact_v a join contact_agg_v b
            on a.contact_info = b.contact_info
    """).write\
        .parquet('combine_path', mode='overwrite')

    # Read the combined result back for further processing; this is where the error above is thrown
    spark.read\
        .parquet('combine_path')\
        .createOrReplaceTempView("contact_full_v")

 

After some analysis, it turned out that the Parquet files read at the failing step were very unevenly sized: some were only a few KB while others exceeded 10 GB. Reading a single Parquet file that is too large can trigger this exception, presumably because the reader allocates a byte buffer for a row group's consecutive column chunks, and when that length exceeds the int range the computed array size overflows to a negative value, hence the NegativeArraySizeException in ParquetFileReader.
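One way to confirm the skew is to list the size of each file in the output directory. Below is a minimal sketch that goes through the Hadoop FileSystem API via PySpark's JVM gateway; the path reuses the combine_path from the code above, and relying on the internal _jsc/_jvm handles is just one convenient option, not something from the original job.

    # Hedged sketch: list per-file sizes of the Parquet output directory to spot skew.
    # _jsc/_jvm are PySpark's internal handles to the JVM, used here only for convenience.
    hadoop_conf = spark._jsc.hadoopConfiguration()
    Path = spark._jvm.org.apache.hadoop.fs.Path
    fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
    for status in fs.listStatus(Path('combine_path')):
        # file name and size in bytes
        print(status.getPath().getName(), status.getLen())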

Adding a repartition on a column that spreads the data roughly evenly, before writing the result, made the error stop appearing.
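A minimal sketch of that fix, assuming contact_info distributes the records reasonably evenly; the partition count 512 is only an illustrative value, not taken from the original job:

    # Repartition by a roughly evenly distributed column before writing, so that
    # no single output file grows too large. 512 is an assumed partition count.
    spark.sql("""
        select a.id,
               b.qy_num,
               b.qy_con_name,
               a.company_name
          from contact_v a join contact_agg_v b
            on a.contact_info = b.contact_info
    """).repartition(512, "contact_info")\
        .write\
        .parquet('combine_path', mode='overwrite')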

Postscript: be careful when aggregating or joining on columns whose content is Chinese text; they can lead to performance issues or puzzling read/write problems like this one.

 
