Fixed a problem with garbled Chinese characters in Hadoop output. In short: watch the encoding, and don't let Java meddle when you write a String out.
Rough flow:
- Map: readFields(ResultSet result) — read from MySQL;
- Map: write(DataOutput out) — write out the Map result;
- Reduce: readFields(DataInput in) — read back the intermediate result written by Map;
The logs showed that the string Map read from MySQL was correct, but it came back garbled when Reduce read it. The original code:
```java
public void readFields(DataInput in) throws IOException {
    super.readFields(in);
    this.id = in.readLong();
    int l1 = in.readInt();
    byte b1[] = new byte[l1];
    in.readFully(b1);
    name = new String(b1);
}

public void write(DataOutput out) throws IOException {
    super.write(out);
    out.writeLong(this.id);
    out.writeInt(name.length());
    out.writeBytes(name);
}
```
Changing it to this fixed it:
```java
public void write(DataOutput out) throws IOException {
    super.write(out);
    out.writeLong(this.id);
    out.writeInt(name.getBytes("utf8").length);  // byte length under UTF-8, not char count
    out.write(name.getBytes("utf8"));            // raw bytes, written untouched
}
```
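For symmetry, the read side should also decode with an explicit charset rather than rely on the platform default. Here is a self-contained round-trip sketch (class and helper names are hypothetical, only the write/read pattern mirrors the fix above):

```java
import java.io.*;

public class Utf8RoundTrip {
    // Hypothetical helper mirroring the corrected write(): length in bytes, then raw bytes.
    static void writeName(DataOutput out, String name) throws IOException {
        byte[] b = name.getBytes("utf8");
        out.writeInt(b.length);   // byte length, not name.length()
        out.write(b);             // raw bytes, untouched
    }

    // Hypothetical helper mirroring readFields(): decode with an explicit charset.
    static String readName(DataInput in) throws IOException {
        int len = in.readInt();
        byte[] b = new byte[len];
        in.readFully(b);
        return new String(b, "utf8");  // don't trust the platform default charset
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeName(new DataOutputStream(buf), "中文测试");
        String back = readName(new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(back.equals("中文测试"));  // round-trip preserves the Chinese text
    }
}
```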
A look at the Java docs explains why: write(byte[]) writes each byte out untouched, while writeBytes(String) writes one byte per character — only the low-order eight bits of each char — so multi-byte characters are mangled before they ever hit the wire.
write(byte[]):

> Writes to the output stream all the bytes in array `b`. If `b` is `null`, a `NullPointerException` is thrown. If `b.length` is zero, then no bytes are written. Otherwise, the byte `b[0]` is written first, then `b[1]`, and so on; the last byte written is `b[b.length-1]`.
writeBytes(String s):

> Writes a string to the output stream. For every character in the string `s`, taken in order, one byte is written to the output stream. If `s` is `null`, a `NullPointerException` is thrown.
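The difference is easy to see with a tiny standalone experiment (not from the original post): write the same character both ways and compare the byte counts. `中` is U+4E2D, three bytes in UTF-8; writeBytes keeps only the low byte 0x2D.

```java
import java.io.*;

public class WriteBytesDemo {
    public static void main(String[] args) throws IOException {
        String s = "中";  // U+4E2D — encodes to 3 bytes in UTF-8

        ByteArrayOutputStream a = new ByteArrayOutputStream();
        new DataOutputStream(a).writeBytes(s);          // low byte of each char only
        ByteArrayOutputStream b = new ByteArrayOutputStream();
        new DataOutputStream(b).write(s.getBytes("utf8"));  // full UTF-8 bytes

        System.out.println(a.size());  // 1 — truncated to 0x2D, data destroyed
        System.out.println(b.size());  // 3 — complete UTF-8 encoding preserved
    }
}
```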