Chinese Text Issues in Hadoop

Latest recommended article published on 2021-12-08 15:07:53

biaorger


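Chinese text is parsed correctly out of a URL, yet it still prints as garbled characters in Hadoop? We once assumed that Hadoop did not support Chinese at all. After reading the source code, we found that Hadoop simply does not support writing Chinese output in GBK encoding.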
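Before blaming Hadoop, it helps to confirm that the URL value itself was decoded with the right charset. The following is a minimal sketch using only the JDK, assuming the URL was percent-encoded as UTF-8 and that the JVM ships the GBK charset; the class name UrlDecodeSketch and its helper methods are hypothetical and not part of Hadoop.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper class for illustration only; not part of Hadoop.
class UrlDecodeSketch {
    // Percent-encode a value as UTF-8, the way most browsers and clients do.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("can't find UTF-8 encoding", e);
        }
    }

    // Decode a percent-encoded value using an explicitly named charset.
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("can't find " + charsetName + " encoding", e);
        }
    }

    public static void main(String[] args) {
        // The two characters meaning "Chinese", written as Unicode escapes
        // so the example does not depend on the source file's encoding.
        String original = "\u4e2d\u6587";
        String encoded = encode(original);
        System.out.println(encoded);

        // Decoding with the charset used for encoding restores the text.
        System.out.println(decode(encoded, "UTF-8").equals(original));
        // Decoding with a different charset silently yields garbled text.
        System.out.println(decode(encoded, "GBK").equals(original));
    }
}
```

If the second check fails in a real job, the garbling happened before Hadoop wrote anything, and changing the output format will not help.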

The code below is from TextOutputFormat.class. Hadoop's default output formats all inherit from FileOutputFormat. Among FileOutputFormat's subclasses, one writes binary streams and another, TextOutputFormat, writes text.

public class TextOutputFormat<K, V> extends FileOutputFormat<K, V> {
  protected static class LineRecordWriter<K, V>
      implements RecordWriter<K, V> {
    private static final String utf8 = "UTF-8"; // the encoding is hard-coded to UTF-8 here
    private static final byte[] newline;
    static {
      try {
        newline = "\n".getBytes(utf8);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + utf8 + " encoding");
      }
    }
    …
    public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
      this.out = out;
      try {
        this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + utf8 + " encoding");
      }
    }
    …
    private void writeObject(Object o) throws IOException {
      if (o instanceof Text) {
        Text to = (Text) o;
        // this also has to change: Text holds UTF-8 bytes and they are written as-is
        out.write(to.getBytes(), 0, to.getLength());
      } else {
        out.write(o.toString().getBytes(utf8));
      }
    }
    …
  }
}
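As the code shows, Hadoop's default text output is hard-coded to UTF-8. So as long as the Chinese text was decoded correctly upstream, setting the Linux client's character set to UTF-8 is enough to display it, because Hadoop has already written it out as UTF-8.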
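To see why a client that is not set to UTF-8 shows garbage, compare the bytes each encoding produces for the same string. This is a small JDK-only sketch, assuming the JVM provides the GBK charset; EncodingMismatchSketch and hex are hypothetical names used for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
class EncodingMismatchSketch {
    static final Charset GBK = Charset.forName("GBK");

    // Render bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two characters meaning "Chinese", as Unicode escapes.
        String text = "\u4e2d\u6587";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] gbk = text.getBytes(GBK);

        System.out.println("UTF-8: " + hex(utf8)); // UTF-8: E4 B8 AD E6 96 87
        System.out.println("GBK:   " + hex(gbk));  // GBK:   D6 D0 CE C4

        // Reading Hadoop's UTF-8 output as if it were GBK does not restore the text.
        System.out.println(new String(utf8, GBK).equals(text));                    // false
        // Reading it as UTF-8 does.
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(text)); // true
    }
}
```

The same characters take six bytes in UTF-8 and four in GBK, so a reader that assumes the wrong encoding cannot even find the character boundaries.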
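Many databases, however, define their fields in GBK. What if Hadoop has to write Chinese in GBK to stay compatible with such a database?
We can define a new class: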
public class GbkOutputFormat<K, V> extends FileOutputFormat<K, V> {
  protected static class LineRecordWriter<K, V>
      implements RecordWriter<K, V> {
    // just switch the charset name to GBK
    private static final String gbk = "gbk";
    private static final byte[] newline;
    static {
      try {
        newline = "\n".getBytes(gbk);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + gbk + " encoding");
      }
    }
    …
    public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
      this.out = out;
      try {
        this.keyValueSeparator = keyValueSeparator.getBytes(gbk);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + gbk + " encoding");
      }
    }
    …
    private void writeObject(Object o) throws IOException {
      // No special case for Text any more: the old branch wrote Text's raw
      // UTF-8 bytes (to.getBytes(), 0, to.getLength()). Going through
      // toString() decodes them, and getBytes(gbk) re-encodes them as GBK.
      // This applies to every object, not only to Text.
      out.write(o.toString().getBytes(gbk));
    }
    …
  }
}
Then add conf1.setOutputFormat(GbkOutputFormat.class) to the MapReduce job code, and the job will write Chinese output in GBK.
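The re-encoding step at the heart of GbkOutputFormat can be tried without a cluster. The sketch below is a JDK-only stand-in, not Hadoop code: GbkLineWriterSketch is a hypothetical class, plain byte arrays play the role of the UTF-8 buffer that Text holds, and the JVM is assumed to provide the GBK charset.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the record writer; it does not use any Hadoop classes.
class GbkLineWriterSketch {
    private static final Charset GBK = Charset.forName("GBK");

    private final DataOutputStream out;
    private final byte[] keyValueSeparator;
    private final byte[] newline = "\n".getBytes(GBK);

    GbkLineWriterSketch(DataOutputStream out, String keyValueSeparator) {
        this.out = out;
        this.keyValueSeparator = keyValueSeparator.getBytes(GBK);
    }

    // Decode the UTF-8 bytes to a String, then re-encode that String as GBK.
    private void writeUtf8AsGbk(byte[] utf8Bytes) throws IOException {
        String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);
        out.write(decoded.getBytes(GBK));
    }

    // Write one "key<separator>value\n" record in GBK.
    void write(byte[] utf8Key, byte[] utf8Value) throws IOException {
        writeUtf8AsGbk(utf8Key);
        out.write(keyValueSeparator);
        writeUtf8AsGbk(utf8Value);
        out.write(newline);
    }

    // Write a single record into memory and return the raw output bytes.
    static byte[] demo() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        GbkLineWriterSketch writer =
            new GbkLineWriterSketch(new DataOutputStream(buffer), "\t");
        try {
            writer.write("k".getBytes(StandardCharsets.UTF_8),
                         "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] record = demo();
        // 1 byte key + 1 byte tab + 4 bytes of GBK text + 1 byte newline
        System.out.println(record.length); // 7
        System.out.println(new String(record, GBK).equals("k\t\u4e2d\u6587\n")); // true
    }
}
```

Decoding to a String first is what makes the conversion safe. Copying the UTF-8 bytes straight to the stream, as the default writer does for Text, would leave UTF-8 data in a file that readers expect to be GBK.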

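Note: in later Hadoop versions this problem comes up much less often. Still, pay attention to encoding while programming: keep the encoding consistent across your Hadoop development, preferably UTF-8.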
