Solving the Chinese Garbled-Text Problem in Hadoop

When doing analysis you sometimes run into Chinese text, and Hadoop defaults to UTF-8 throughout. After reading the source code and combining it with suggestions found online, I reworked FileOutputFormat.

In fact, Hadoop itself always operates on raw byte arrays internally. The problem lies in the hand-written parts that call Text.toString(): Text stores its contents as UTF-8 bytes, so the conversion defaults to UTF-8.
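Before the fix itself, here is a quick standalone illustration of the mismatch (plain Java, not Hadoop code; the literal "中文" and the class name are just for demonstration):

public class EncodingDemo {
  public static void main(String[] args) throws Exception {
    // Bytes as they would sit in a GBK-encoded input file.
    byte[] gbkBytes = "中文".getBytes("GBK");
    // Decoding them as UTF-8 -- which is what Text.toString() effectively
    // does -- garbles them.
    System.out.println(new String(gbkBytes, "UTF-8")); // mojibake
    // Decoding them as GBK recovers the original characters.
    System.out.println(new String(gbkBytes, "GBK"));   // prints 中文
  }
}

The GbkOutputFormat below (adapted from TextOutputFormat) fixes the output side by re-encoding every record as GBK before it reaches the file: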

import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UnsupportedEncodingException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.ReflectionUtils;

/** A TextOutputFormat variant that writes its output as GBK instead of UTF-8. */
public class GbkOutputFormat<K, V> extends FileOutputFormat<K, V> {

  protected static class LineRecordWriter<K, V>
      extends RecordWriter<K, V> {
    private static final String gbk = "gbk";
    private static final byte[] newline;
    static {
      try {
        newline = "\n".getBytes(gbk);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + gbk + " encoding");
      }
    }

    protected DataOutputStream out;
    private final byte[] keyValueSeparator;

    public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
      this.out = out;
      try {
        this.keyValueSeparator = keyValueSeparator.getBytes(gbk);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + gbk + " encoding");
      }
    }

    public LineRecordWriter(DataOutputStream out) {
      this(out, "\t");
    }

    /**
     * Write the object to the byte stream, re-encoded as GBK.
     * The stock TextOutputFormat writes Text's internal UTF-8 bytes
     * directly; here every value goes through toString() and is then
     * encoded as GBK, so the bytes that reach the file are GBK.
     * @param o the object to print
     * @throws IOException if the write throws, we pass it on
     */
    private void writeObject(Object o) throws IOException {
      out.write(o.toString().getBytes(gbk));
    }

    public synchronized void write(K key, V value)
        throws IOException {

      boolean nullKey = key == null || key instanceof NullWritable;
      boolean nullValue = value == null || value instanceof NullWritable;
      if (nullKey && nullValue) {
        return;
      }
      if (!nullKey) {
        writeObject(key);
      }
      if (!(nullKey || nullValue)) {
        out.write(keyValueSeparator);
      }
      if (!nullValue) {
        writeObject(value);
      }
      out.write(newline);
    }

    public synchronized
    void close(TaskAttemptContext context) throws IOException {
      out.close();
    }
  }

  public RecordWriter<K, V>
         getRecordWriter(TaskAttemptContext job
                         ) throws IOException, InterruptedException {
    Configuration conf = job.getConfiguration();
    boolean isCompressed = getCompressOutput(job);
    // Separator key used by this era's Hadoop; on newer versions it is
    // "mapreduce.output.textoutputformat.separator".
    String keyValueSeparator = conf.get("mapred.textoutputformat.separator",
                                        "\t");
    CompressionCodec codec = null;
    String extension = "";
    if (isCompressed) {
      Class<? extends CompressionCodec> codecClass =
        getOutputCompressorClass(job, GzipCodec.class);
      codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
      extension = codec.getDefaultExtension();
    }
    Path file = getDefaultWorkFile(job, extension);
    FileSystem fs = file.getFileSystem(conf);
    if (!isCompressed) {
      FSDataOutputStream fileOut = fs.create(file, false);
      return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
    } else {
      FSDataOutputStream fileOut = fs.create(file, false);
      return new LineRecordWriter<K, V>(new DataOutputStream
                                        (codec.createOutputStream(fileOut)),
                                        keyValueSeparator);
    }
  }
}
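To plug the class into a job, a minimal driver sketch could look like the following (GbkJobDriver is a hypothetical name, GbkMapper is sketched further below, and the Hadoop 2.x Job API is assumed):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GbkJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "gbk-output-demo"); // Hadoop 2.x API
    job.setJarByClass(GbkJobDriver.class);
    job.setMapperClass(GbkMapper.class);             // hypothetical mapper, sketched below
    job.setNumReduceTasks(0);                        // map-only, for simplicity
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    job.setOutputFormatClass(GbkOutputFormat.class); // write GBK instead of UTF-8
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}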

 

In your hand-written mapper, be careful to decode the input as GBK as well, so that the data is treated as GBK from end to end:

String str = new String(value.getBytes(), 0, value.getLength(), "GBK");
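Spelled out as a complete (map-only) mapper, a sketch might look like this (GbkMapper is a hypothetical name; it assumes TextInputFormat input whose underlying file bytes are GBK-encoded):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class GbkMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // value carries the raw line bytes from the (GBK-encoded) input file.
    // Decode only the valid region: getBytes() returns the backing array,
    // which may be longer than the actual content.
    String line = new String(value.getBytes(), 0, value.getLength(), "GBK");
    // ... do the actual analysis on the correctly decoded String here ...
    // new Text(line) re-encodes to UTF-8 internally; GbkOutputFormat's
    // writeObject() turns it back into GBK bytes on output.
    context.write(NullWritable.get(), new Text(line));
  }
}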

 

With that, the Chinese garbled-text problem is solved.

 

Reposted from: https://www.cnblogs.com/surongyou/archive/2013/03/09/2952083.html
