Garbled Characters When Cleaning Data with MapReduce

吃提子要吐皮

Tags: MapReduce, garbled text, hadoop

Article link: https://blog.csdn.net/weixin_43679675/article/details/84593021

Garbled Chinese characters when Hadoop MapReduce reads and processes a file

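MapReduce splits and parses its input in the map method on the map side, where each line arrives as a Text object. Hadoop's Text class is hard-coded to UTF-8, so input in any other encoding (GBK here) has to be converted explicitly as soon as the map method receives the line.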
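The mismatch can be reproduced without a cluster. The following is a minimal sketch in plain Java, not Hadoop code: the class name GbkDecodeDemo and its helper methods are made up for illustration, and it assumes the JDK in use ships the GBK charset (standard JDK builds do). It encodes two Chinese characters as GBK, then decodes the same bytes once as UTF-8 and once as GBK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GbkDecodeDemo {
    // Assumption: the GBK charset is available in this JDK.
    static final Charset GBK = Charset.forName("GBK");

    // Decoding with the wrong charset: GBK bytes interpreted as UTF-8.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Decoding with the charset the bytes were actually written in.
    static String decodeAsGbk(byte[] raw) {
        return new String(raw, GBK);
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the two-character word for "Chinese";
        // Unicode escapes keep this source file encoding-independent.
        String original = "\u4e2d\u6587";
        byte[] raw = original.getBytes(GBK); // 2 bytes per character in GBK

        System.out.println("bytes: " + raw.length);
        System.out.println("decoded as GBK matches original:   "
                + decodeAsGbk(raw).equals(original));
        System.out.println("decoded as UTF-8 matches original: "
                + decodeAsUtf8(raw).equals(original));
    }
}
```

The GBK byte pairs are not valid UTF-8 sequences, so the UTF-8 decoder substitutes replacement characters, which is the garbling seen when such bytes are decoded as UTF-8.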
The fix is:
Replace String s = lineValue.toString(); with String s = new String(lineValue.getBytes(), 0, lineValue.getLength(), "GBK"); (lineValue is the Text object holding the line of text that the map method reads).
A detailed explanation follows:

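import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Map extends Mapper<LongWritable, Text, Text, Text> {
	@Override
	public void map(LongWritable lineNum, Text lineValue, Context context) throws IOException, InterruptedException {
		String s = new String(lineValue.getBytes(), 0, lineValue.getLength(), "GBK");
		// String s = lineValue.toString();
		/* Calling toString() directly produces garbled text. Text is a Writable that stores text as UTF-8,
		   while a Java String holds Unicode characters, so toString() always decodes the bytes as UTF-8.
		   GBK-encoded data read into a Text therefore becomes garbled when toString() is called on it. */
		// ... (the rest of the method is not shown in the source)
	}
}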
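The replacement line passes 0 and lineValue.getLength() rather than decoding the whole array. The Text documentation states that only the bytes up to getLength() are valid, because the backing array can be reused across records. The sketch below illustrates that effect in plain Java. ReusableBuffer is a hypothetical stand-in written for this example and is not Hadoop's Text implementation; the class and method names are assumptions.

```java
import java.nio.charset.Charset;

public class ValidLengthDemo {
    // Assumption: the GBK charset is available in this JDK.
    static final Charset GBK = Charset.forName("GBK");

    /** Hypothetical stand-in for a reused record buffer (NOT Hadoop's Text). */
    static final class ReusableBuffer {
        private byte[] bytes = new byte[0];
        private int length = 0;

        void set(byte[] src) {
            if (bytes.length < src.length) {
                bytes = new byte[src.length]; // grows when needed, never shrinks
            }
            System.arraycopy(src, 0, bytes, 0, src.length);
            length = src.length;              // only this many bytes are valid
        }

        byte[] getBytes() { return bytes; }
        int getLength() { return length; }
    }

    // Decodes the entire backing array, including stale bytes from earlier records.
    static String decodeWholeArray(ReusableBuffer b) {
        return new String(b.getBytes(), GBK);
    }

    // Decodes only the valid prefix, mirroring the (bytes, 0, length, charset) call.
    static String decodeValidPrefix(ReusableBuffer b) {
        return new String(b.getBytes(), 0, b.getLength(), GBK);
    }

    public static void main(String[] args) {
        ReusableBuffer line = new ReusableBuffer();
        line.set("\u4e2d\u6587\u6570\u636e".getBytes(GBK)); // first record: 4 characters
        line.set("\u4e2d\u6587".getBytes(GBK));             // second record: 2 characters

        System.out.println("valid prefix length: " + decodeValidPrefix(line).length());
        System.out.println("whole array length:  " + decodeWholeArray(line).length());
    }
}
```

Decoding the whole array returns the second record followed by leftover characters from the first, while decoding the valid prefix returns only the second record.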

/* This class stores text using standard UTF8 encoding. */ public class Text extends BinaryComparable implements WritableComparable<BinaryComparable> { private static ThreadLocal<CharsetEncoder…