Java Text.copyBytes Method: Code Example

Latest recommended article published on 2024-07-07 03:34:09

一只萌皮皮

Tags: java bytes copy

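Copyright notice: This is an original article by the author, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.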

Article link: https://blog.csdn.net/weixin_34932795/article/details/115067322

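This post shows how Apache Hadoop's Text class is used, while reading a file, to detect and skip the UTF-8 byte order mark (BOM) at the start of the text stream, so that a leading BOM does not end up in the first line that is read. The example method, skipUtfByteOrderMark, removes the three BOM bytes from the first line and reduces the returned size by the same amount.

Summary generated automatically by CSDN.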

import java.io.IOException;

import org.apache.hadoop.io.Text; // the class this method depends on

// This method is an excerpt: in, pos, maxLineLength, LOG and
// maxBytesToConsume(...) are members of the enclosing class, which is not shown.
private int skipUtfByteOrderMark(Text value) throws IOException {
    // Strip the BOM (Byte Order Mark).
    // Text only supports UTF-8, so we only need to check for the UTF-8 BOM
    // (0xEF, 0xBB, 0xBF) at the start of the text stream.
    int newMaxLineLength = (int) Math.min(3L + (long) maxLineLength,
        Integer.MAX_VALUE);
    int newSize = in.readLine(value, newMaxLineLength, maxBytesToConsume(pos));
    // Even though we read 3 extra bytes for the first line,
    // we won't alter existing behavior (no backwards-incompatibility issue),
    // because newSize is less than maxLineLength and
    // the number of bytes copied to Text is always no more than newSize.
    // If the size returned from readLine is not less than maxLineLength,
    // we will discard the current line and read the next line.
    pos += newSize;
    int textLength = value.getLength();
    // getBytes() returns the backing array; only the first getLength() bytes are valid.
    byte[] textBytes = value.getBytes();
    if ((textLength >= 3) && (textBytes[0] == (byte) 0xEF) &&
        (textBytes[1] == (byte) 0xBB) && (textBytes[2] == (byte) 0xBF)) {
        // Found the UTF-8 BOM, strip it.
        LOG.info("Found UTF-8 BOM and skipped it");
        textLength -= 3;
        newSize -= 3;
        if (textLength > 0) {
            // It may work to use the same buffer and not do the copyBytes.
            // copyBytes() returns a copy that is exactly getLength() bytes long,
            // so set() below does not read from the array it is overwriting.
            textBytes = value.copyBytes();
            value.set(textBytes, 3, textLength);
        } else {
            value.clear();
        }
    }
    return newSize;
}
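
The method above only compiles inside its enclosing reader class, so the BOM check is easier to try out in isolation. The following is a minimal sketch that uses plain byte arrays instead of Hadoop's Text. The class name BomDemo and the methods hasUtf8Bom and stripUtf8Bom are hypothetical names chosen for illustration, not part of Hadoop. The oversized buffer in main stands in for the array returned by Text.getBytes(), where only the first getLength() bytes are valid.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class for illustration; not part of Hadoop.
final class BomDemo {
    // UTF-8 encoding of U+FEFF, the byte order mark.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Returns true if the first `length` valid bytes of `buf` start with the UTF-8 BOM. */
    static boolean hasUtf8Bom(byte[] buf, int length) {
        return length >= 3
                && buf[0] == UTF8_BOM[0]
                && buf[1] == UTF8_BOM[1]
                && buf[2] == UTF8_BOM[2];
    }

    /** Returns a fresh array with the valid bytes of `buf`, minus a leading BOM if present. */
    static byte[] stripUtf8Bom(byte[] buf, int length) {
        int start = hasUtf8Bom(buf, length) ? 3 : 0;
        return Arrays.copyOfRange(buf, start, length);
    }

    public static void main(String[] args) {
        // A backing buffer larger than the valid data, similar to Text.getBytes().
        byte[] buf = new byte[16];
        byte[] line = "\uFEFFhello".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(line, 0, buf, 0, line.length);

        byte[] stripped = stripUtf8Bom(buf, line.length);
        System.out.println(new String(stripped, StandardCharsets.UTF_8)); // prints "hello"
    }
}
```

Passing the valid length separately mirrors the textLength check in the original method: without it, stale bytes past the end of the current line could be mistaken for content.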
