Java's String Cannot Handle Chinese UTF-8 Encoding

ahugeduck

Tags: java string utf-8 camus sequencefile

Article link: https://blog.csdn.net/demo_zj/article/details/46692591

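The SequenceFile format supports file splitting, which makes it well suited to MapReduce jobs. I recently worked on a project that writes protobuf data from Kafka to HDFS so that downstream offline jobs can analyze it.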

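In Kafka, the protobuf messages are serialized as byte arrays (each message is simply a byte array). We configured a job in LinkedIn's Camus (an open-source tool from LinkedIn for writing Kafka data to HDFS) to write the Kafka messages to HDFS in SequenceFile format. The SequenceFile key is org.apache.hadoop.io.LongWritable and the value is org.apache.hadoop.io.Text.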

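Everything went smoothly: we wrote the data to HDFS, and I then wrote my own Pig UDF to parse the protobuf data. Pig already ships a loader for reading sequence files: org.apache.pig.piggybank.storage.SequenceFileLoader. Because the values are stored as Text, Pig converts them to chararray when reading. I then ran into the following error: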

bad record, bad formatcom.google.protobuf.InvalidProtocolBufferException: While parsing a protocol message, the input ended unexpectedly in the middle of a field. This could mean either than the input has been truncated or that an embedded message misreported its own length.
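One way an error like this can arise is if the serialized bytes are decoded as UTF-8 text somewhere along the way, as the Text-to-chararray conversion suggests. This is an assumption about the cause. The sketch below is not the actual Camus or Pig code. It only shows, with plain Java, that valid UTF-8 such as Chinese text survives a round trip through String while arbitrary binary does not. The class name Utf8RoundTrip and the three-byte payload are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8RoundTrip {
    // Decode the bytes as UTF-8 into a String, then encode the String back to bytes.
    // Malformed sequences are replaced with U+FFFD during decoding, so the
    // original bytes cannot be recovered afterwards.
    static byte[] roundTrip(byte[] raw) {
        String asText = new String(raw, StandardCharsets.UTF_8);
        return asText.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Valid UTF-8: the two Chinese characters "\u4e2d\u6587" survive unchanged.
        byte[] chinese = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(chinese, roundTrip(chinese))); // true

        // Hypothetical binary payload: 0x96 is a UTF-8 continuation byte with
        // no lead byte before it, so it is not valid UTF-8 on its own.
        byte[] binary = {0x08, (byte) 0x96, 0x01};
        byte[] back = roundTrip(binary);
        System.out.println(Arrays.equals(binary, back));          // false
        System.out.println(binary.length + " -> " + back.length); // 3 -> 5
    }
}
```

The invalid byte becomes U+FFFD, which encodes back as three bytes, so the payload changes both in content and in length. A length-prefixed binary format cannot be parsed reliably after that.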