Converting JSON to Rows and persisting them as Parquet:
# assumes: from pyspark.sql import Row; `lines` is an RDD of parsed JSON dicts
# and `types` is a broadcast variable holding the distinct type names
for type_name in types.value:
    print(type_name)
    # keep only the records of the current type
    type_data_set = lines.filter(lambda line: line['type'] == type_name)
    # build a Row per record; createDataFrame then infers the schema from these Rows
    type_row = type_data_set.map(lambda line: Row(**line))
    schema_row = self.sqlContext.createDataFrame(type_row)

    schema_row.write.mode('overwrite').parquet(
        'hdfs://ip:port/parquet/%s/year=%s/month=%s/day=%s/hour=%s' %
        (type_name, self.year, self.month, self.day, self.hour)
    )
Exception:
Caused by: java.lang.IndexOutOfBoundsException: Trying to write more fields than contained in row (15 > 12)
    at org.apache.spark.sql.execution.datasources.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:261)
    at org.apache.spark.sql.execution.datasources.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:257)
    at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121)
    at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:123)
    at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:42)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.writeInternal(ParquetRelation.scala:99)
    at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:242)
    ... 8 more
The cause: two records of the same type 'zi', one with 12 fields and the other with 15:
- {"time":"2016-06-06 17:25:14","message":{"channel":3,"containerId":"16","sendUserId":"2611","objectName":"RC:TxtMsg","count":49,"type":"zi","uuid":"-1","appId":"100000","nodeId":"GRM_NODE_0","userId":"2611","time":1465205114814,"ipAddress":"0","sdkVersion":"2.6.2","osName":"Android","deviceId":"0"}}
- {"time":"2016-06-06 17:41:31","message":{"channel":0,"count":0,"type":"zi","uuid":"","appId":"100000","nodeId":"MSG_NODE_2","userId":"2626","time":1465206091272,"ipAddress":"0","sdkVersion":"2.6.1","osName":"0","deviceId":"1"}}
Source: ITPUB blog, http://blog.itpub.net/29754888/viewspace-2119617/