Garbled text when downloading CSV files

Latest recommended article published on 2024-05-14 14:38:59

清澈的泉水

Tags: csv encoding byte dreamweaver exception html

Article link: https://blog.csdn.net/fwch1982/article/details/7848495

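Recently I ran into a problem: a downloaded CSV file showed garbled text when opened in Excel. The cause was that the file was encoded in GBK. Switching to UTF-8, as described further below, did not fix it either. I do not know why, though it may be because the primary language of my Office installation is set to Simplified Chinese. Writing the file like this: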

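// Requires java.io.File, java.io.FileOutputStream, java.io.OutputStreamWriter.
OutputStreamWriter fos = new OutputStreamWriter(
        new FileOutputStream(new File("c:/2.csv")), "UTF-16LE");
fos.write(0xFEFF);  // the BOM character; UTF-16LE encodes it as the bytes FF FE
fos.write("你好 ");
fos.flush();
fos.close();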

works correctly, with no garbled text.
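To see what that writer actually emits, here is a minimal sketch that runs the same two writes against an in-memory buffer instead of a file. The class and method names (Utf16LeBomDemo, encodeWithBom, toHex) are made up for illustration, and the text is written with Unicode escapes so the result does not depend on the source file's own encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class Utf16LeBomDemo {
    // Encodes text as UTF-16LE, preceded by U+FEFF, and returns the raw bytes.
    static byte[] encodeWithBom(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(buffer, StandardCharsets.UTF_16LE)) {
            writer.write(0xFEFF); // the BOM character; UTF-16LE serializes it as FF FE
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Formats bytes as space-separated upper-case hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "\u4F60\u597D" is the two-character greeting used in the snippet above
        System.out.println(toHex(encodeWithBom("\u4F60\u597D"))); // FF FE 60 4F 7D 59
    }
}
```

The first two bytes are the BOM; each following pair is one character with its low byte first.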

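The BOM (Byte Order Mark) is the standard marker that the UTF encoding schemes use to identify the encoding. It is the character U+FEFF, which is serialized as FE FF in big-endian UTF-16, FF FE in little-endian UTF-16, and EF BB BF in UTF-8. In UTF-8 the marker is optional: UTF-8 has no byte-order issue, so the BOM serves only as a signature for detecting whether a byte stream is UTF-8. Microsoft software performs this detection, but some software does not and treats the BOM as an ordinary character.

Microsoft puts the three bytes EF BB BF at the start of its own UTF-8 text files, and Notepad and other Windows programs use these three bytes to decide whether a text file is ASCII or UTF-8. This is largely a Microsoft convention, however; other platforms generally do not mark UTF-8 text files this way.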

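In other words, a UTF-8 file may or may not carry a BOM. How can you tell? There are three ways:
1. Open the file in UltraEdit-32, switch to hex editing mode, and check whether the file begins with EF BB BF.
2. Open it in Dreamweaver, view the page properties, and check whether "Include Unicode Signature (BOM)" is ticked.
3. Open it in Windows Notepad, choose "Save As", and see whether the default encoding is UTF-8 or ANSI; if it is ANSI, the file has no BOM.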

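Below is part of an introduction to the BOM.

UCS defines a character called "ZERO WIDTH NO-BREAK SPACE", whose code is FEFF. FFFE, by contrast, is not a character in UCS, so it should never appear in a real transmission. The UCS specification recommends transmitting "ZERO WIDTH NO-BREAK SPACE" before the byte stream itself. If the receiver then sees FEFF, the stream is big-endian; if it sees FFFE, the stream is little-endian. This is why "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.

UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding. The UTF-8 encoding of "ZERO WIDTH NO-BREAK SPACE" is EF BB BF, so a receiver that sees a byte stream starting with EF BB BF knows that it is UTF-8.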
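The byte sequences quoted above can be checked directly by encoding the single character U+FEFF with each charset. This is only a small sketch; the class and method names (BomBytes, hexOf) are invented for the example. It deliberately uses UTF_16BE and UTF_16LE rather than the plain UTF_16 charset, since Java's UTF_16 encoder writes a BOM of its own.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomBytes {
    // Encodes s with the given charset and formats the bytes as hex pairs.
    static String hexOf(String s, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(charset)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bom = "\uFEFF"; // ZERO WIDTH NO-BREAK SPACE
        System.out.println("UTF-8    " + hexOf(bom, StandardCharsets.UTF_8));    // EF BB BF
        System.out.println("UTF-16BE " + hexOf(bom, StandardCharsets.UTF_16BE)); // FE FF
        System.out.println("UTF-16LE " + hexOf(bom, StandardCharsets.UTF_16LE)); // FF FE
    }
}
```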

This is exactly how Windows uses the BOM to mark the encoding of text files.

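Recently I was processing an XML file. It looked perfectly fine in EditPlus, but my program failed on it with the following error (translated):

The character '<' cannot be used in an attribute value. Error processing resource 'file:///C:/Documents and Settings/renyang/桌面/BOM实例_utf-8.xml'. Line 28, position 13

Yet there was no '<' in any attribute value, which was very strange. Suspecting a character problem, I rewrote the XML file with exactly the same content, ran the program again, and it passed. Comparing the two files showed that the failing one began with the bytes "EF BB BF", the UTF-8 BOM, while its first line was <?xml version="1.0" encoding="gbk"?>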

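This reveals the cause:
The file was actually UTF-8 with a BOM, but browsers, the XML-reading JAR, and similar consumers all read it as GBK, following the declaration <?xml version="1.0" encoding="gbk"?> on its first line. That mismatch can cause unpredictable problems.

So why did EditPlus read the file without trouble?
Editors such as EditPlus treat every file as plain text and do not apply a different charset-detection method just because a file is XML, so EditPlus follows the BOM and reads the file as UTF-8.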
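A mismatch like the one in this XML file can be caught before parsing. The following is a rough sketch, not a full XML encoding detector: it assumes the declaration is written in an ASCII-compatible encoding (true for both UTF-8 and GBK) and looks only at the first 200 bytes. XmlBomCheck and its method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XmlBomCheck {
    // Matches the encoding pseudo-attribute inside an XML declaration.
    private static final Pattern ENCODING = Pattern.compile(
            "<\\?xml[^>]*encoding\\s*=\\s*[\"']([A-Za-z0-9._-]+)[\"']");

    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    // Returns the encoding named in the XML declaration, or null if none is found.
    static String declaredEncoding(byte[] data) {
        int offset = hasUtf8Bom(data) ? 3 : 0;
        int length = Math.min(data.length - offset, 200);
        // The declaration is ASCII, so ISO-8859-1 maps it byte for byte
        String head = new String(data, offset, length, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    // True when the file starts with a UTF-8 BOM but declares another encoding.
    static boolean bomConflictsWithDeclaration(byte[] data) {
        String declared = declaredEncoding(data);
        return hasUtf8Bom(data) && declared != null && !declared.equalsIgnoreCase("UTF-8");
    }

    public static void main(String[] args) {
        byte[] sample = "\uFEFF<?xml version=\"1.0\" encoding=\"gbk\"?><a/>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(bomConflictsWithDeclaration(sample)); // true
    }
}
```

When the check reports a conflict, either the BOM or the declaration has to change so that the two agree.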

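utf8 or utf-8?
Most software treats the two names as equivalent, but there are exceptions, such as Ant and IE8, which accept only utf-8 and report utf8 as an unknown encoding. To avoid this kind of trouble, it is best to standardize on utf-8 (PS: either upper or lower case works).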
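Java is one of the tools that accepts both spellings. As a small sketch, Charset.forName resolves a label to its canonical name; the lookup is case-insensitive, and "UTF8" is registered as an alias of UTF-8 in the JDKs I am aware of, although the set of aliases is implementation-specific, so treat that part as an assumption. The class name CharsetNameDemo is made up.

```java
import java.nio.charset.Charset;

final class CharsetNameDemo {
    // Resolves a charset label (canonical name or alias) to its canonical name.
    static String canonicalName(String label) {
        return Charset.forName(label).name();
    }

    public static void main(String[] args) {
        for (String label : new String[] {"utf8", "UTF8", "utf-8", "UTF-8"}) {
            System.out.println(label + " -> " + canonicalName(label));
        }
    }
}
```

Normalizing a label this way before writing it into a file or header gives the hyphenated form that stricter tools expect.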

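How do you check whether a file contains a BOM?
Read the first few bytes of the file in a program and check whether they are a BOM. An easier way is to use a capable editor: view the file in UltraEdit's hex mode, or use Save As in EmEditor as described in the blog post reposted below.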
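Reading the first few bytes in a program can be sketched as follows. BomSniffer and its methods are hypothetical names, the file path comes from the first command-line argument, and InputStream.readNBytes needs Java 11 or later. Note that the bytes FF FE 00 00 are ambiguous in principle (UTF-32LE BOM, or a UTF-16LE BOM followed by U+0000); the sketch reports UTF-32LE.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

final class BomSniffer {
    // Returns the encoding implied by a leading BOM, or "none" if there is no BOM.
    static String detectBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        // The UTF-32LE test must come before UTF-16LE: both begin with FF FE
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        return "none";
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            // Four bytes are enough to recognize every BOM listed above
            System.out.println(detectBom(in.readNBytes(4)));
        }
    }
}
```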

Below is a reposted blog post (http://blog.sina.com.cn/s/blog_3e9d2b350100as0b.html) that covers the BOM in more detail.

————————————————————————————————————————————

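For work I needed to generate an HTML file programmatically.
The server uses Apache + Tomcat to serve HTML and JSP files.
At first, when I placed the generated HTML file in the Apache directory, the browser did not pick up the encoding declared in my page by default.

I had to manually select Encoding -> Simplified Chinese (GB2312) in the browser before the page displayed correctly.
That is obviously unacceptable.
We already had a page that displayed Chinese correctly. I checked it and found that it was in UTF-8, so I changed my program as well.
a. Changed the page's encoding declaration:

b. Changed the method that writes the byte stream: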
public void htmlWrite(String charsetName) {
    // outFileName and res are fields of the enclosing class. charsetName is
    // not used: the encoding is fixed to UTF-8 here.
    // try-with-resources closes the writer even if a write fails.
    try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
            new FileOutputStream(outFileName), "UTF-8"))) {
        out.write(res);
        out.flush();
    } catch (IOException e) {
        System.out.print("write errors!" + e);
    }
}
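With this I generated another HTML file and put it on the server, but the problem came back: the page still did not display correctly, that is, the browser did not default to UTF-8. Strange. I opened the file in EmEditor and compared it with the page that worked. Nothing looked wrong; the only difference was:
    EmEditor identified the working HTML file as UTF-8 with Signature, and my generated HTML file as UTF-8 without Signature.
    Looking into how to convert between these two UTF-8 formats, I found online that when you choose Save As in Notepad, EmEditor, or a similar text editor and select the UTF-8 encoding, the option Add a Unicode Signature (BOM) becomes available; ticking it saves the file as UTF-8 with Signature. The question, though, was how to make Java generate a file in UTF-8 with Signature format directly.
    I started searching Google for keywords such as UTF-8 with Signature, BOM, and Add a Unicode Signature.
http://www.unicode.org/unicode/faq/utf_bom.html#BOM
This gave me a rough understanding of the difference between the two.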
Q: What is a BOM?

A: A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files. Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol.
http://mindprod.com/jgloss/bom.html
BOM
Byte Order Marks are special characters at the beginning of a Unicode file to indicate whether it is big or little endian, in other words does the high or low order byte come first. These codes also tell whether the encoding is 8, 16 or 32 bit. You can recognise Unicode files by their starting byte order marks, and by the way Unicode-16 files are half zeroes and Unicode-32 files are three-quarters zeros.
Unicode Endian Markers
Byte-order mark    Description
EF BB BF           UTF-8
FF FE              UTF-16 aka UCS-2, little endian
FE FF              UTF-16 aka UCS-2, big endian
FF FE 00 00        UTF-32 aka UCS-4, little endian
00 00 FE FF        UTF-32 aka UCS-4, big endian
There are also variants of these encodings that have an implied endian marker.
Unfortunately, often applications, even Javac.exe, choke on these byte order marks. Java Readers don't automatically filter them out. There is not much you can do but manually remove them.
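Removing the mark by hand, as the quoted text suggests, takes only a few lines. The sketch below wraps any Reader in a PushbackReader, reads one character, and pushes it back unless it is U+FEFF. The names BomSkippingReader, skipBom, and decodeUtf8WithoutBom are invented for the example; it handles only a BOM that the decoder has already turned into the character U+FEFF.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class BomSkippingReader {
    // Returns a reader positioned after a leading U+FEFF, if one is present.
    static Reader skipBom(Reader in) throws IOException {
        PushbackReader reader = new PushbackReader(in, 1);
        int first = reader.read();
        if (first != -1 && first != 0xFEFF) {
            reader.unread(first); // not a BOM: put the character back
        }
        return reader;
    }

    // Decodes UTF-8 bytes to a String, dropping a leading BOM.
    static String decodeUtf8WithoutBom(byte[] data) {
        try (Reader reader = skipBom(new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8))) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(decodeUtf8WithoutBom(withBom)); // hi
    }
}
```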

http://cache.baidu.com/c?word=java%2Cbom&url=http%3A//tgdem530%2Eblogchina%2Ecom/&b=0&a=1&user=baidu
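c. UTF byte order and the BOM
UTF-8 uses a single byte as its code unit, so it has no byte-order problem. UTF-16 uses two bytes as its code unit, so before interpreting UTF-16 text you must know the byte order of each code unit. For example, the Unicode code of "奎" is 594E and that of "乙" is 4E59. If we receive the UTF-16 byte stream "594E", is it "奎" or "乙"?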
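The ambiguity is easy to reproduce: decoding the same two bytes with each byte order yields the two different characters from the example. A minimal sketch, with ByteOrderDemo as a made-up class name:

```java
import java.nio.charset.StandardCharsets;

final class ByteOrderDemo {
    static String asBigEndian(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    static String asLittleEndian(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] wire = {0x59, 0x4E}; // the byte stream "594E" from the example
        // Big-endian reads the code unit as 594E, little-endian as 4E59
        System.out.printf("BE: U+%04X%n", (int) asBigEndian(wire).charAt(0));
        System.out.printf("LE: U+%04X%n", (int) asLittleEndian(wire).charAt(0));
    }
}
```

Without some out-of-band hint, a receiver has no way to choose between the two readings, which is the gap the BOM fills.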

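The method that the Unicode specification recommends for marking byte order is the BOM. Here BOM does not mean "Bill Of Material" but Byte Order Mark. The BOM is a rather clever idea:

UCS defines a character called "ZERO WIDTH NO-BREAK SPACE", whose code is FEFF. FFFE, by contrast, is not a character in UCS, so it should never appear in a real transmission. The UCS specification recommends transmitting "ZERO WIDTH NO-BREAK SPACE" before the byte stream itself.

If the receiver then sees FEFF, the stream is big-endian; if it sees FFFE, the stream is little-endian. This is why "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.

UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding. The UTF-8 encoding of "ZERO WIDTH NO-BREAK SPACE" is EF BB BF (readers can verify this with the encoding method introduced earlier). So a receiver that sees a byte stream starting with EF BB BF knows that it is UTF-8.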
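The verification can be done by hand with the three-byte UTF-8 pattern 1110xxxx 10xxxxxx 10xxxxxx, which covers code points U+0800 through U+FFFF. FEFF in binary is 1111 1110 1111 1111; splitting it into groups of 4, 6, and 6 bits gives 1111, 111011, and 111111, that is, EF BB BF. The sketch below does the same arithmetic in code; Utf8ThreeByte and encode are hypothetical names.

```java
final class Utf8ThreeByte {
    // Encodes a code point in the range U+0800..U+FFFF (excluding surrogates)
    // using the three-byte UTF-8 form 1110xxxx 10xxxxxx 10xxxxxx.
    static int[] encode(int codePoint) {
        if (codePoint < 0x0800 || codePoint > 0xFFFF
                || (codePoint >= 0xD800 && codePoint <= 0xDFFF)) {
            throw new IllegalArgumentException("not a three-byte code point");
        }
        return new int[] {
            0xE0 | (codePoint >> 12),         // top 4 bits
            0x80 | ((codePoint >> 6) & 0x3F), // middle 6 bits
            0x80 | (codePoint & 0x3F)         // low 6 bits
        };
    }

    public static void main(String[] args) {
        int[] bytes = encode(0xFEFF);
        System.out.printf("%02X %02X %02X%n", bytes[0], bytes[1], bytes[2]); // EF BB BF
    }
}
```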

This is exactly how Windows uses the BOM to mark the encoding of text files.

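So the BOM is a few bytes added at the start of a file as a marker. With this marker in place, certain protocols and systems can identify the encoding. Now, how do I add these bytes?
I finally found the answer here: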
http://mindprod.com/jgloss/encoding.html
UTF-8
8-bit encoded Unicode. neé UTF8. Optional marker on front of file: EF BB BF for reading. Unfortunately, OutputStreamWriter does not automatically insert the marker on writing. Notepad can't read the file without this marker. Now the question is, how do you get that marker in there? You can't just emit the bytes EF BB BF since they will be encoded and changed. However, the solution is quite simple. prw.write( '\ufeff' ); at the head of the file. This will be encoded as EF BB BF.
DataOutputStreams have a binary length count in front of each string. Endianness does not apply to 8-bit encodings. Java DataOutputStream and ObjectOutputStream uses a slight variant of kosher UTF-8. To aid with compatibility with C in JNI, the null byte '\u0000' is encoded in 2-byte format rather than 1-byte, so that the encoded strings never have embedded nulls. Only the 1-byte, 2-byte, and 3-byte formats are used. Supplementary characters, (above 0xffff), are represented in the form of surrogate pairs (a pair of encoded 16 bit characters in a special range), rather than directly encoding the character.
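The length prefix and the two-byte encoding of the null character described in the quoted text can be observed by calling DataOutputStream.writeUTF on a short string and dumping the bytes. This is just an illustrative sketch; ModifiedUtf8Demo and its helper names are made up.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class ModifiedUtf8Demo {
    // Returns the bytes that DataOutputStream.writeUTF produces for s.
    static byte[] writeUtf(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'A' followed by the null character
        // 00 03 is the byte count, 41 is 'A', C0 80 is the two-byte null
        System.out.println(toHex(writeUtf("A\0"))); // 00 03 41 C0 80
    }
}
```

Because of these differences, writeUTF output is not a drop-in replacement for a plain UTF-8 text file.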

prw.write( '\ufeff' ); was exactly what I needed.
So my code became:
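public void htmlWrite(String charsetName) {
    // outFileName and res are fields of the enclosing class.
    try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
            new FileOutputStream(outFileName), "UTF-8"))) {
        out.write('\ufeff');  // U+FEFF is encoded as EF BB BF, the UTF-8 BOM
        out.write(res);
        out.flush();
    } catch (IOException e) {
        System.out.print("write errors!" + e);
    }
}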
Problem solved.
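For anyone who wants to confirm the result without a hex editor, here is a self-contained sketch that writes a page with the BOM and then reads the bytes back. It is a variation on the method above, not the original class: HtmlBomWriter, writeWithBom, and writeToTempAndReadBack are hypothetical names, and it writes to a temporary file rather than the real output path.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class HtmlBomWriter {
    // Writes html to target as UTF-8, preceded by the BOM character.
    static void writeWithBom(Path target, String html) {
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write('\uFEFF'); // encoded as EF BB BF by the UTF-8 writer
            out.write(html);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes html to a temporary file and returns the bytes that ended up on disk.
    static byte[] writeToTempAndReadBack(String html) {
        try {
            Path tmp = Files.createTempFile("bom-demo", ".html");
            try {
                writeWithBom(tmp, html);
                return Files.readAllBytes(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = writeToTempAndReadBack("<html></html>");
        System.out.printf("%02X %02X %02X%n",
                bytes[0] & 0xFF, bytes[1] & 0xFF, bytes[2] & 0xFF); // EF BB BF
    }
}
```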
