Spark Parquet Columnar Storage: File Storage Details: RowWriteSupport and RecordWriter

RowWriteSupport

RowWriteSupport extends WriteSupport (a class in the parquet-hadoop jar) and provides the WriteContext needed for writing data.

Fields:

writer: RecordConsumer
The consumer of the data, responsible for the actual write. It has three subclasses:
ValidatingRecordConsumer, which validates the record and then passes it on to the next consumer for the same operation
RecordConsumerLoggingWrapper, which logs the operation and then passes it on to the next consumer for the same operation
MessageColumnIORecordConsumer, which performs the actual write
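The wrapper chain described above is essentially a decorator pattern. A minimal Scala sketch with simplified stand-in types (not the real parquet-mr API; the validation rule is hypothetical):

```scala
// Simplified stand-ins for the parquet-mr RecordConsumer hierarchy.
trait RecordConsumer {
  def addInteger(value: Int): Unit
}

// End of the chain: performs the actual "write" (here: into a buffer).
class MessageColumnIORecordConsumer extends RecordConsumer {
  val written = scala.collection.mutable.ArrayBuffer[Int]()
  def addInteger(value: Int): Unit = written += value
}

// Validates, then delegates the same operation to the next consumer.
class ValidatingRecordConsumer(next: RecordConsumer) extends RecordConsumer {
  def addInteger(value: Int): Unit = {
    require(value >= 0, s"invalid value: $value") // hypothetical validation rule
    next.addInteger(value)
  }
}

// Logs, then delegates the same operation to the next consumer.
class RecordConsumerLoggingWrapper(next: RecordConsumer) extends RecordConsumer {
  def addInteger(value: Int): Unit = {
    println(s"addInteger($value)")
    next.addInteger(value)
  }
}

val sink = new MessageColumnIORecordConsumer
val chain = new RecordConsumerLoggingWrapper(new ValidatingRecordConsumer(sink))
chain.addInteger(42)
```

Each wrapper only adds its own concern (logging, validation) and forwards to the next consumer, which is why the three subclasses can be freely composed.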

attributes: Seq[Attribute]

Holds the schema information: the field name, the data type, and whether the field is nullable (isNullable).

Functions

The initialization process:

init(configuration: Configuration): WriteSupport.WriteContext


    val origAttributesStr: String = configuration.get(RowWriteSupport.SPARK_ROW_SCHEMA)
    val metadata = new JHashMap[String, String]()
    metadata.put(RowReadSupport.SPARK_METADATA_KEY, origAttributesStr)

    if (attributes == null) {
      attributes = ParquetTypesConverter.convertFromString(origAttributesStr)
    }

    log.debug(s"write support initialized for requested schema $attributes")
    ParquetRelation.enableLogForwarding()
    new WriteSupport.WriteContext(ParquetTypesConverter.convertFromAttributes(attributes), metadata)

This obtains the schema information for the data and passes it to the WriteContext.
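The essence of init can be sketched as follows: the serialized attribute string is stored in the file metadata so readers can later recover the schema, and is parsed back into attributes when they are not yet set. The metadata key and the string format below are assumed stand-ins for RowReadSupport.SPARK_METADATA_KEY and Spark's actual schema serialization, not the real values:

```scala
// Assumed stand-in for RowReadSupport.SPARK_METADATA_KEY.
val SPARK_METADATA_KEY = "org.apache.spark.sql.parquet.row.metadata"
// Toy serialized schema standing in for the real attribute string.
val origAttributesStr = "id:int,name:string"

// Store the serialized schema in the file metadata, as init does.
val metadata = new java.util.HashMap[String, String]()
metadata.put(SPARK_METADATA_KEY, origAttributesStr)

// Stand-in for ParquetTypesConverter.convertFromString:
// split the string back into (fieldName, dataType) pairs.
val attributes: Seq[(String, String)] =
  origAttributesStr.split(",").toSeq.map { f =>
    val idx = f.indexOf(':')
    (f.substring(0, idx), f.substring(idx + 1))
  }
```

The point is the round trip: whatever format the schema is serialized in, the same string is both written into the footer metadata and parsed back into in-memory attributes.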
Obtaining the ParquetTypeInfo proceeds as follows:
  def fromPrimitiveDataType(ctype: DataType): Option[ParquetTypeInfo] = ctype match {
    case StringType => Some(ParquetTypeInfo(
      ParquetPrimitiveTypeName.BINARY, Some(ParquetOriginalType.UTF8)))
    case BinaryType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.BINARY))
    case BooleanType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.BOOLEAN))
    case DoubleType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.DOUBLE))
    case FloatType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.FLOAT))
    case IntegerType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT32))
    // There is no type for Byte or Short so we promote them to INT32.
    case ShortType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT32))
    case ByteType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT32))
    case LongType => Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT64))
    case DecimalType.Fixed(precision, scale) if precision <= 18 =>
      // TODO: for now, our writer only supports decimals that fit in a Long
      Some(ParquetTypeInfo(ParquetPrimitiveTypeName.FIXED_LEN_BYTE_ARRAY,
        Some(ParquetOriginalType.DECIMAL),
        Some(new DecimalMetadata(precision, scale)),
        Some(BYTES_FOR_PRECISION(precision))))
    case _ => None
  }
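The mapping above can be exercised with a reduced model; the string values here stand in for the real ParquetPrimitiveTypeName and ParquetOriginalType enums, and only a few of the Catalyst types are modeled:

```scala
// Reduced model of the Catalyst types handled above.
sealed trait SparkType
case object StringType  extends SparkType
case object IntegerType extends SparkType
case object ShortType   extends SparkType
case object LongType    extends SparkType
case object MapType     extends SparkType // a non-primitive type, maps to None

case class ParquetTypeInfo(primitive: String, original: Option[String] = None)

def fromPrimitiveDataType(t: SparkType): Option[ParquetTypeInfo] = t match {
  case StringType  => Some(ParquetTypeInfo("BINARY", Some("UTF8")))
  case IntegerType => Some(ParquetTypeInfo("INT32"))
  case ShortType   => Some(ParquetTypeInfo("INT32")) // promoted: Parquet has no 16-bit type
  case LongType    => Some(ParquetTypeInfo("INT64"))
  case _           => None
}
```

As in the original, strings become BINARY annotated with UTF8, Short is promoted to INT32, and types without a primitive mapping yield None.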

The ParquetTypeInfo values are then wrapped into a MessageType:
  def convertFromAttributes(attributes: Seq[Attribute]): MessageType = {
    val fields = attributes.map(
      attribute =>
        fromDataType(attribute.dataType, attribute.name, attribute.nullable))
    new MessageType("root", fields)
  }
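How a Seq[Attribute] becomes the "root" MessageType can be sketched with simplified stand-in types for the Catalyst and parquet-mr classes; in the real converter, a nullable attribute maps to Parquet's OPTIONAL repetition and a non-nullable one to REQUIRED:

```scala
// Simplified stand-ins for Catalyst's Attribute and parquet-mr's schema types.
case class Attribute(name: String, dataType: String, nullable: Boolean)
case class ParquetField(name: String, primitive: String, repetition: String)
case class MessageType(name: String, fields: Seq[ParquetField])

// Stand-in for fromDataType: nullable decides the repetition level.
def fromDataType(dataType: String, name: String, nullable: Boolean): ParquetField =
  ParquetField(name, dataType, if (nullable) "OPTIONAL" else "REQUIRED")

def convertFromAttributes(attributes: Seq[Attribute]): MessageType = {
  val fields = attributes.map(a => fromDataType(a.dataType, a.name, a.nullable))
  MessageType("root", fields)
}

val schema = convertFromAttributes(Seq(
  Attribute("id", "INT32", nullable = false),
  Attribute("name", "BINARY", nullable = true)))
```

Every attribute becomes one top-level field of the single "root" message, which is exactly the shape Parquet expects for a flat row schema.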




RecordWriter
