Java ID3v2 garbled text: text encoding in ID3v2.3 tags

Thanks to this site and a few others, I've created some simple code to read ID3v2.3 tags from MP3 files. Doing so has been a great learning experience as I previously had no knowledge of hex / byte / binary etc.

I can successfully read data, but have come across an issue that I believe is to do with encoding used. I've realized that Text frames have a byte at the beginning of the 'text' that describes encoding used, and potentially more information in the next 2 bytes...

Example:

Data from frame TIT2 starts with the byte $03 (hex) before the actual text. This text displays correctly, albeit with an additional character at the beginning, using Encoding.ASCII.GetString

In another MP3, data from TIT2 starts $01 and is followed by $FF $FE, which I believe is to do with Unicode? The text itself is broken up though, there are $00 between every text character, and this stops the data from being displayed in Windows Forms (as soon as a $00 is encountered, the text just stops, so I get the first character and that's it). I've tried using Encoding.Unicode.GetString, but that just seems to return gibberish.

Printing this data to a console seems to work, with spaces between each char, so the reading of the data is working properly.

I've been reading the official documentation for ID3v2.3 but I guess I'm just not clued-up enough to understand the text encoding section.

Any replies or links to articles that may be of help would be much appreciated!

Regards

Ross

Solution

Data from frame TIT2 starts with the byte $03 (hex) before the actual text. This text displays correctly, albeit with an additional character at the beginning, using Encoding.ASCII.GetString

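Encoding 0x03 is UTF-8, so you should use Encoding.UTF8.GetString. The character at the beginning may be a U+FEFF Byte Order Mark, which is used to distinguish between UTF-16LE and UTF-16BE... it's no use for UTF-8, but Windows tools love to put it there anyway. It may also simply be the 0x03 encoding byte itself, if that byte is still in the buffer you pass to GetString; skip it before decoding.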

UTF-8 is an ID3v2.4 feature not present in 2.3, which may be why you can't find it in the spec. In the real world you will find all sorts of total nonsense in ID3 tags regardless of version.
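
To make the stray-character point concrete, here is a minimal sketch in Python rather than .NET. The frame bodies are made-up sample data and `decode_utf8_text` is a hypothetical helper name, not part of any library. It assumes the frame body has already been cut out of the tag and starts with the encoding byte.

```python
def decode_utf8_text(frame_body: bytes) -> str:
    """Decode a text frame body whose encoding byte is 0x03 (UTF-8)."""
    if frame_body[:1] != b"\x03":
        raise ValueError("not a UTF-8 text frame")
    # Skip the encoding byte; "utf-8-sig" also drops a leading U+FEFF if present.
    return frame_body[1:].decode("utf-8-sig")


# Hypothetical TIT2 bodies, with and without a UTF-8 BOM (EF BB BF).
plain = b"\x03" + "Caf\u00e9 del Mar".encode("utf-8")
with_bom = b"\x03\xef\xbb\xbf" + "Caf\u00e9 del Mar".encode("utf-8")

# Decoding the whole body one byte per character keeps the encoding byte
# as a stray U+0003 at the front and mangles the non-ASCII character.
naive = plain.decode("latin-1")

print(repr(naive))
print(repr(decode_utf8_text(plain)))
print(repr(decode_utf8_text(with_bom)))
```

In .NET the same idea would be to pass an offset of 1 to Encoding.UTF8.GetString and trim a leading U+FEFF from the result.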

data from TIT2 starts $01 and is followed by $FF $FE, which I believe is to do with Unicode? The text itself is broken up though, there are $00 between every text character,

That's UTF-16LE, the text-to-byte encoding that Windows misleadingly calls “Unicode”. It is made up of two-byte code units, so the characters in the range U+0000–U+00FF come out as the low byte of the same number, followed by a zero high byte. The 0xFF 0xFE prefix is a Byte Order Mark, correctly used. Encoding.Unicode.GetString should return a correct string from this, provided decoding starts after the 0x01 encoding byte; starting one byte early misaligns every two-byte unit. Post some code?
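
Without the asker's code this is only a guess, but one plausible cause of the gibberish is decoding from the encoding byte instead of from the BOM. The sketch below (Python, with a made-up frame body) shows both the misaligned decode and the aligned one. It also cuts the text at the first NUL, assuming the frame carries a $00 $00 terminator.

```python
# Hypothetical TIT2 body: 01 FF FE 48 00 69 00 00 00
# (encoding byte, BOM, "Hi" in UTF-16LE, two-byte terminator).
frame_body = b"\x01\xff\xfe" + "Hi".encode("utf-16-le") + b"\x00\x00"

# Off by one: the encoding byte is paired with the first BOM byte, so every
# following code unit is built from the wrong two bytes.
misaligned = frame_body.decode("utf-16-le", errors="replace")

# Aligned: skip the encoding byte. Python's "utf-16" codec reads the BOM,
# picks the byte order from it, and removes it from the result.
text = frame_body[1:].decode("utf-16")

# Keep only what precedes the terminator.
title = text.split("\x00", 1)[0]

print(repr(misaligned))
print(repr(title))
```

The zero bytes between characters are part of the encoding, not separators, so they disappear once the bytes are decoded as UTF-16 rather than displayed one byte at a time.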

Printing this data to a console seems to work

Getting non-ASCII characters to print on the Windows console can be a trial, so if you hit problems bear in mind they may be caused by the print operation itself.

For completeness, encoding 0x02 is UTF-16BE without a BOM (like UTF-8, an ID3v2.4 addition; there is little reason for this to exist and I have never met it in the wild), and encoding 0x00 is supposed to be ISO-8859-1, but in reality could be pretty much any ASCII-superset encoding, more likely a Windows ‘ANSI’ code page like Encoding.GetEncoding(1252) than a standard like 8859-1.
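
Putting the four encoding bytes together, a tolerant decoder might look like the sketch below. It is written in Python, and `CODECS` and `decode_text_frame` are hypothetical names chosen for illustration. Mapping 0x00 to Windows-1252 is a pragmatic guess following the reasoning above, not something the spec mandates.

```python
# Encoding byte -> Python codec name.
CODECS = {
    0x00: "cp1252",     # nominally ISO-8859-1; cp1252 is an assumption
    0x01: "utf-16",     # UTF-16 with BOM; the codec consumes the BOM
    0x02: "utf-16-be",  # ID3v2.4 only, no BOM expected
    0x03: "utf-8",      # ID3v2.4 only
}


def decode_text_frame(body: bytes) -> str:
    """Decode a text frame body: one encoding byte followed by the text."""
    if not body:
        return ""
    codec = CODECS.get(body[0])
    if codec is None:
        raise ValueError(f"unknown text encoding byte: {body[0]:#04x}")
    # Real-world tags are messy, so replace undecodable bytes instead of failing.
    text = body[1:].decode(codec, errors="replace")
    # Drop a stray BOM, then ignore everything from the first NUL onwards.
    return text.lstrip("\ufeff").split("\x00", 1)[0]


print(decode_text_frame(b"\x00Caf\xe9"))
print(decode_text_frame(b"\x01\xff\xfeH\x00i\x00"))
print(decode_text_frame(b"\x03\xef\xbb\xbfHi\x00"))
```

Note that if a 0x01 frame has no BOM at all, Python's "utf-16" codec falls back to the platform's native byte order, so a stricter reader may want to handle that case explicitly.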
