Reading a file on Windows and on Linux produces different results (character encoding?)

I'm trying to read a file in MIME format that contains the binary data of a PNG image.

In Windows, reading the file gives me the proper binary string, meaning I just copy the string over and change the extension to png and I see the picture.

An example after reading the file in Windows is below:

--fh-mms-multipart-next-part-1308191573195-0-53229

Content-Type: image/png;name=app_icon.png

Content-ID: ""

content-location: app_icon.png

‰PNG

etc...etc...

An example after reading the file in Linux is below:

--fh-mms-multipart-next-part-1308191573195-0-53229

Content-Type: image/png;name=app_icon.png

Content-ID: ""

content-location: app_icon.png

�PNG

etc...etc...

I am not able to convert the Linux version into a picture as it all becomes some funky symbols with a lot of upside down "?" and "1/2" symbols.

Can anyone enlighten me on what is going on and maybe provide a solution? Been playing with the code for a week and more now.

Solution

� is the Unicode replacement character, codepoint U+FFFD, whose UTF-8 representation is the three-byte sequence 0xEF 0xBF 0xBD. A UTF-8 decoder substitutes this codepoint for any byte sequence that is not valid UTF-8.

Apparently, some routine in your code path on Linux is mishandling the PNG header. The PNG header starts with the byte 0x89 (followed by 0x50, 0x4E, 0x47, which are "PNG" in ASCII). This byte is handled correctly on Windows, which might be treating the file as a sequence of CP1252 bytes. In CP1252, the byte 0x89 is displayed as ‰.
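To make the CP1252 point concrete, here is a minimal sketch that decodes the single byte 0x89 with that charset. The class and method names are made up for illustration, and it assumes your JDK ships the "windows-1252" charset (common, but not one of the charsets Java guarantees).

```java
import java.nio.charset.Charset;

public class Cp1252Demo {
    // Decodes one byte as windows-1252 (CP1252).
    // Assumption: the charset is available in this JDK.
    static String decodeCp1252(byte b) {
        return new String(new byte[] { b }, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        String s = decodeCp1252((byte) 0x89);
        // Print the codepoint rather than the glyph, so the result does not
        // depend on the console's own encoding.
        System.out.printf("U+%04X%n", (int) s.charAt(0)); // prints U+2030 (per mille sign)
    }
}
```

CP1252 assigns a character to 0x89, so the byte can be decoded and encoded back unchanged, which would explain why the Windows output still looks like a PNG.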

On Linux, however, this byte is being decoded by a UTF-8 routine (or by a library that decided to process the file as UTF-8). Since 0x89 has its high bit set, it is not a single-byte character in the 7-bit ASCII range 0x00-0x7F. It cannot start a multi-byte UTF-8 sequence either, because every lead byte begins with at least two bits set to 1 (11......). Its bit pattern (10001001) is that of a continuation byte, but in your example the bytes before it are plain ASCII header text, so there is no lead byte for it to continue. The resulting behavior is that the UTF-8 decoder replaces 0x89 with the replacement character U+FFFD, which is then written out in UTF-8 as 0xEF 0xBF 0xBD (a lossy step, considering that the file is not UTF-8 to begin with). Displayed as ISO-8859-1, those three bytes appear as ï¿½, which matches the upside-down "?" and "1/2" symbols you describe.
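The damage can be reproduced in a few lines. This is only a sketch of what probably happens somewhere in your code, not your actual code path; the class and helper names are invented for the example.

```java
import java.nio.charset.StandardCharsets;

public class LossyDecodeDemo {
    // Decodes raw bytes as UTF-8 and encodes the resulting String back to UTF-8.
    // Invalid input bytes are replaced by U+FFFD during decoding.
    static byte[] roundTripUtf8(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    // Formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        byte[] pngStart = { (byte) 0x89, 0x50, 0x4E, 0x47 };
        byte[] damaged = roundTripUtf8(pngStart);
        System.out.println(hex(damaged)); // prints EF BF BD 50 4E 47
    }
}
```

Four bytes go in and six come out, and the original 0x89 cannot be recovered from the result, so the fix has to happen before the bytes are ever turned into characters.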

To resolve this problem, you'll need to ensure the following on Linux:

Read the bytes of the PNG using a suitable encoding for the file (i.e. not UTF-8). This is necessary if you are reading the file as a sequence of characters*, and not necessary if you are reading bytes alone. You might already be doing this correctly, so it is worth verifying the subsequent steps as well.

When you view the contents of the file, use an editor or viewer that does not internally decode the file as UTF-8. Using a suitable font will also help, to avoid the unexpected scenario where the glyph (for U+FFFD it is actually the diamond character �) cannot be represented and results in further changes (unlikely, but you never know how the editor or viewer has been written).

It is also a good idea to write the files out (if you are doing so) in a suitable encoding, ISO-8859-1 perhaps, instead of UTF-8. If you process and store the file contents in memory as bytes instead of characters, then writing these to an output stream (without involving any String or character references) is sufficient.

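* Apparently, the Java Runtime decodes the byte sequence into UTF-16 code units whenever you convert a sequence of bytes to a character or a String object. If you do not name a charset, it uses the platform default, which may differ between your Windows and Linux machines.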
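As a rough sketch of the bytes-only approach: read the whole MIME file as bytes, locate the PNG header bytes mentioned above, and write everything from there to a new file without ever creating a String. The class name, method names and command-line arguments are hypothetical, and this simplified version does not stop at the closing multipart boundary, which a fuller implementation would need to handle.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class PngExtractSketch {
    // First four bytes of the PNG header: 0x89 'P' 'N' 'G'.
    static final byte[] PNG_MAGIC = { (byte) 0x89, 0x50, 0x4E, 0x47 };

    // Returns the index of the first occurrence of needle in haystack, or -1.
    static int indexOf(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    // Copies everything from the PNG header to the end of the input.
    // Simplification: trailing boundary data, if any, is not stripped.
    static byte[] extractFromPngStart(byte[] mime) {
        int start = indexOf(mime, PNG_MAGIC);
        if (start < 0) {
            throw new IllegalArgumentException("no PNG header found");
        }
        return Arrays.copyOfRange(mime, start, mime.length);
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: PngExtractSketch <mime-file> <output.png>");
            return;
        }
        // Bytes in, bytes out: no charset is involved at any point.
        byte[] mime = Files.readAllBytes(Paths.get(args[0]));
        Files.write(Paths.get(args[1]), extractFromPngStart(mime));
    }
}
```

Because no decoding takes place, the result should be the same on Windows and Linux regardless of the platform default charset.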

