Java Text.copyBytes Method Code Example

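This post shows how to use Apache Hadoop's Text class when reading a file. The method skipUtfByteOrderMark reads the first line, checks whether it starts with the UTF-8 byte order mark (BOM), and strips the mark if present, so that the text stream is parsed correctly and files written with a BOM remain compatible.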

import java.io.IOException;

import org.apache.hadoop.io.Text; // the class this method depends on

// Relies on members of the enclosing record reader that are not shown here:
// in, maxLineLength, pos, LOG and maxBytesToConsume.
private int skipUtfByteOrderMark(Text value) throws IOException {
  // Strip the BOM (Byte Order Mark).
  // Text only supports UTF-8, so we only need to check for the UTF-8 BOM
  // (0xEF, 0xBB, 0xBF) at the start of the text stream.
  int newMaxLineLength = (int) Math.min(3L + (long) maxLineLength,
      Integer.MAX_VALUE);
  int newSize = in.readLine(value, newMaxLineLength, maxBytesToConsume(pos));
  // Even though we read 3 extra bytes for the first line,
  // we won't alter existing behavior (no backwards-incompatibility issue),
  // because newSize is less than maxLineLength and
  // the number of bytes copied to Text is always no more than newSize.
  // If the size returned by readLine is not less than maxLineLength,
  // we will discard the current line and read the next line.
  pos += newSize;
  int textLength = value.getLength();
  // getBytes() returns the backing array; only the first textLength bytes are valid.
  byte[] textBytes = value.getBytes();
  if ((textLength >= 3) && (textBytes[0] == (byte) 0xEF) &&
      (textBytes[1] == (byte) 0xBB) && (textBytes[2] == (byte) 0xBF)) {
    // Found the UTF-8 BOM, strip it.
    LOG.info("Found UTF-8 BOM and skipped it");
    textLength -= 3;
    newSize -= 3;
    if (textLength > 0) {
      // It may work to use the same buffer and not do the copyBytes.
      textBytes = value.copyBytes();
      value.set(textBytes, 3, textLength);
    } else {
      value.clear();
    }
  }
  return newSize;
}
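
The same BOM check can be tried without a Hadoop dependency. The following is a minimal sketch on plain byte arrays, not the Hadoop implementation: the class BomSketch and its methods hasUtf8Bom, stripCopy and stripInPlace are hypothetical names introduced only for illustration. stripCopy mirrors the copy-then-set approach of the method above, while stripInPlace illustrates the remark in the source comment that reusing the same buffer may also work. Whether Text.set behaves the same way internally is not verified here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper (not part of Hadoop): UTF-8 BOM handling on plain byte arrays. */
final class BomSketch {
    private static final int BOM_LENGTH = 3;

    /** Returns true if the first {@code length} valid bytes of buf start with EF BB BF. */
    static boolean hasUtf8Bom(byte[] buf, int length) {
        return length >= BOM_LENGTH
                && buf[0] == (byte) 0xEF
                && buf[1] == (byte) 0xBB
                && buf[2] == (byte) 0xBF;
    }

    /** Copying variant: returns a new array holding the valid bytes without the BOM. */
    static byte[] stripCopy(byte[] buf, int length) {
        int start = hasUtf8Bom(buf, length) ? BOM_LENGTH : 0;
        return Arrays.copyOfRange(buf, start, length);
    }

    /** In-place variant: shifts the payload to the front of buf and returns the new valid length. */
    static int stripInPlace(byte[] buf, int length) {
        if (!hasUtf8Bom(buf, length)) {
            return length;
        }
        int remaining = length - BOM_LENGTH;
        // System.arraycopy is specified to behave as if it copied through a
        // temporary array when source and destination are the same array,
        // so the overlapping shift is safe.
        System.arraycopy(buf, BOM_LENGTH, buf, 0, remaining);
        return remaining;
    }

    public static void main(String[] args) {
        // The buffer is larger than the valid data, as with a reused backing array.
        byte[] line = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'e', 'l', 'l', 'o'};
        byte[] buf = new byte[16];
        System.arraycopy(line, 0, buf, 0, line.length);

        byte[] copied = stripCopy(buf, line.length);
        System.out.println(new String(copied, StandardCharsets.UTF_8));

        int newLength = stripInPlace(buf, line.length);
        System.out.println(new String(buf, 0, newLength, StandardCharsets.UTF_8));
    }
}
```

Both variants take an explicit length instead of relying on the array size, for the same reason the method above pairs getBytes() with getLength(): a reused buffer can be longer than the data it currently holds.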
