Background: while running a Spark job, intermediate results are persisted to HDFS several times: the data is written out as Parquet and then read back to continue processing. Partway through, the job fails with the following error:
[spark] Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://master1:8020/user/xxx/part-00512-0462dbf5-98b2-41fa-925c-3ab53f9f060b-c000.snappy.parquet
[spark] at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:223)
[spark] at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:215)
[spark] at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
[spark] at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
[spark] at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:182)
[spark] at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
[spark] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
[spark] at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[spark] at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
[spark] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
[spark] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
[spark] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
[spark] at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1126)
[spark] at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1132)
[spark] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
[spark] at scala.collection.Iterator$class.foreach(Iterator.scala:893)
[spark] at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
[spark] at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:204)
[spark] at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.writeIteratorToStream(PythonUDFRunner.scala:52)
[spark] at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:215)
[spark] at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988)
[spark] at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:170)
[spark] Caused by: java.lang.NegativeArraySizeException
[spark] at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1064)
[spark] at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:698)
[spark] at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126)
[spark] at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:194)
[spark] ... 21 more
The failing Parquet output has something unusual about it: when it was written, the data had gone through an aggregation that applied collect_set to a string column whose values are Chinese text. The steps executed were as follows:
# Aggregate the data
spark.sql("""
    select contact_info,
           count(id) as qy_num,
           collect_set(company_name) as qy_con_name  -- this column contains Chinese text
    from contact_v
    group by contact_info
""").filter("qy_num >= 2").write.parquet('agg_path', mode='overwrite')
spark\
.read\
.parquet('agg_path')\
.createOrReplaceTempView("contact_agg_v")
# Join the aggregated result back to the corresponding original rows
spark.sql("""
select a.id,
b.qy_num,
b.qy_con_name, --这个地方会继续带上聚合的数组字段
a.company_name
from contact_v a join contact_agg_v b
on a.contact_info = b.contact_info
""").write\
.parquet('combine_path', mode='overwrite')
# Read the combined result back for further processing; this read raises the error shown above
spark.read\
.parquet('combine_path')\
.createOrReplaceTempView("contact_full_v")
Analysis showed that the Parquet files being read at the point of failure were very unevenly sized: some were only a few KB while others were over 10 GB. When a single Parquet file gets that large, reading it can fail with this error.
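As a rough way to confirm the skew, the per-file sizes of the output directory can be listed through the Hadoop FileSystem API via Spark's JVM gateway. This is only a sketch: 'combine_path' is the path from above, and spark._jsc / spark._jvm are internal handles that may differ between Spark versions.

# Rough check of file-size skew in a Parquet output directory.
# Note: spark._jsc and spark._jvm are internal handles, used here only for a quick look.
hadoop_conf = spark._jsc.hadoopConfiguration()
Path = spark._jvm.org.apache.hadoop.fs.Path
fs = Path('combine_path').getFileSystem(hadoop_conf)
for status in fs.listStatus(Path('combine_path')):
    print(status.getPath().getName(), status.getLen())  # file name, size in bytes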
Adding a repartition on a column whose values are roughly evenly distributed, right before that write, made the error stop appearing.
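A minimal sketch of the fix, assuming the join result is repartitioned before the write to combine_path; the key column id and the partition count 200 are illustrative choices, any key with a roughly even distribution works:

# Repartition before writing so the output Parquet files end up roughly even in size
spark.sql("""
    select a.id,
           b.qy_num,
           b.qy_con_name,
           a.company_name
    from contact_v a join contact_agg_v b
    on a.contact_info = b.contact_info
""").repartition(200, 'id')\
    .write\
    .parquet('combine_path', mode='overwrite')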
Postscript: be careful when running aggregations or joins on columns whose content is Chinese text; they may bring performance problems or puzzling read/write errors like this one.