
UTF-8 to ANSI garbled text
UPDATE: The contractor/vendor that made the software commented on Hacker News with more technical information. They're a very classy shop and have handled this REALLY minor gaffe very well, to their credit. I mean, let's put this into perspective: it's a fun nit, it's a weird thing that only we programmers understand, but ultimately what we can all agree on is Obama should outlaw Smart Quotes immediately.
The Speaker of the House of Representatives John Boehner tweeted this a few days ago. Note that this is not a political blog post.
After embarking on a record spending binge that’s left us deeper in debt, where are the jobs? #AskObama
During the #AskObama Live Twitter event, the Tweets then came up on a big Plasma screen. This tweet came up "garbled" and said:
After embarking on a record spending binge that’s left us deeper in debt, where are the jobs? #AskObama
And a million programmers, regardless of political party, groaned in unison. First, because someone screwed up their UTF-8 decoding, by not doing it, and second, because our President doesn't recognize a text encoding bug when he sees one! Well, maybe that second one was just me, but still. Tragic. The President then teased the Speaker for his typing while newspapers and news organizations struggled to get their minds around this "garbled tweet."
Well, Boehner could have tweeted "that's left us deeper..." but he tweeted "that’s." Note the "smart" apostrophe. He used Tweetdeck to tweet it, and it was likely on a Mac. It's also possible that he wrote the tweet in Microsoft Word and then copied and pasted it, as Word loves to change straight quotes and apostrophes like ' into smart quotes and smart apostrophes with direction, like this: ’.
I can get John Boehner's User ID (not his twitter name, but the number that represents John) with this online tool http://www.idfromuser.com. I see that it's 5357812, so I can get his timeline as RSS (Really Simple Syndication)/XML like this: http://twitter.com/statuses/user_timeline/5357812.rss or JSON (JavaScript Object Notation) like this http://twitter.com/statuses/user_timeline/5357812.json
When I ask for this timeline, the HTTP Headers say it's encoded as "UTF-8", see?
Content-Type: application/json; charset=utf-8
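If you'd rather let code pull the charset out of that header than eyeball it, here's a minimal sketch using Python's standard library. The header string is copied from above; the variable names are just mine for illustration, and no network request is made since those old Twitter URLs may not answer anymore.

```python
from email.message import Message

# Header value copied from the response shown above.
header_value = "application/json; charset=utf-8"

# email.message.Message knows how to parse MIME-style parameters.
msg = Message()
msg["Content-Type"] = header_value

content_type = msg.get_content_type()   # media type without parameters
charset = msg.get_content_charset()     # the charset parameter, lowercased

print(content_type, charset)
```

Whatever reads the body should decode the bytes with that charset, not with the machine's default code page.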
I blogged about the "Importance of being UTF-8" about five years ago. If you look at the JSON and find the tweet with the ID 88618213008621568, you can see the raw text encoded in JSON:
"text":"After embarking on a record spending binge that\u2019s left us deeper in debt, where are the jobs?"
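Here's a small sketch of what a JSON parser does with that escape. I've wrapped the fragment above in braces so it's a complete JSON object (that wrapping is my assumption; the real feed had many more fields).

```python
import json
import unicodedata

# Raw string so Python leaves \u2019 alone and the JSON parser sees the escape.
raw = r'{"text":"After embarking on a record spending binge that\u2019s left us deeper in debt, where are the jobs?"}'

tweet = json.loads(raw)
text = tweet["text"]

# Find the one non-ASCII character in the tweet text.
apostrophe = text[text.index("\u2019")]
code_point = f"U+{ord(apostrophe):04X}"
name = unicodedata.name(apostrophe)

print(code_point, name)
```

At this point the apostrophe is a single character in memory. The trouble only starts when it gets turned into bytes and read back with the wrong encoding.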
See that \u2019? In Windows (you have this program even if you aren't a developer) go to the Start Menu and run "Charmap." Look around and you can see U+2019 is Right Single Quotation Mark. Note that it's WAY down in the list of all the characters. It's not a basic character like A to Z or a to z. It's one of those special things that looks nice, but causes trouble later.

If I make a text file in Notepad that looks like this and name it text.txt, for example, and Save As, making sure to use UTF-8 as the encoding...
After embarking on a record spending binge that’s left us deeper in debt, where are the jobs?
...then load it into any free HEX editor (or even an online one!) I see this:

Note that the part where the ’ was is actually three full bytes! E2 80 99.
Well, UTF-8 is an encoding whose goal was to not only support a bajillion different characters but also to be backwards compatible with ASCII, the American Standard Code for Information Interchange. If it wasn't, we wouldn't be able to see MOST of the characters in this tweet! In this case, just the ’ is goofy.
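No hex editor handy? A rough Python sketch shows the same thing: the plain letters come out as the same single bytes ASCII would use, and only the ’ takes three. The variable names are mine.

```python
tweet_word = "that\u2019s"

# Print each character's code point next to its UTF-8 bytes.
for ch in tweet_word:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ').upper()}")

# Pure ASCII text is byte-for-byte identical in UTF-8.
ascii_part = "thats"
same_as_ascii = ascii_part.encode("utf-8") == ascii_part.encode("ascii")

# The smart apostrophe needs three bytes.
apostrophe_bytes = "\u2019".encode("utf-8")
print(same_as_ascii, len(apostrophe_bytes))
```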
The code point was U+2019, which is 0010 0000 0001 1001, says Windows Calculator in Programmer Mode. You have this too, Dear Reader. There's some variable width encoding going on, that you can read about on Wikipedia.
This value of U+2019 is 0010 0000 0001 1001 in binary, as I said, which then expands according to these rules:
zzzzyyyy yyxxxxxx ->
1110zzzz 10yyyyyy 10xxxxxx
Which gives us this
11100010 -> E2
10000000 -> 80
10011001 -> 99
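The bit shuffling above can be sketched by hand in a few lines. This is only the three-byte case of UTF-8 (code points U+0800 through U+FFFF), and the function name is made up for illustration; in real code you'd just call str.encode.

```python
def encode_three_byte(code_point: int) -> bytes:
    """Hand-rolled UTF-8 for the three-byte range only (illustrative helper)."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only handles U+0800..U+FFFF")
    z = code_point >> 12           # top 4 bits    -> 1110zzzz
    y = (code_point >> 6) & 0x3F   # middle 6 bits -> 10yyyyyy
    x = code_point & 0x3F          # low 6 bits    -> 10xxxxxx
    return bytes([0xE0 | z, 0x80 | y, 0x80 | x])


manual = encode_three_byte(0x2019)
builtin = "\u2019".encode("utf-8")

# Show each byte in binary and hex, matching the table above.
for b in manual:
    print(f"{b:08b} -> {b:02X}")
print(manual == builtin)
```

Checking the hand-rolled bytes against the built-in encoder is a cheap way to confirm the rules were applied correctly.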
hence, "that’s" is encoded as
74 68 61 74 E2 80 99 73
The E2 80 99 in the middle is the ’. Read those bytes back in, this time as Extended ASCII (the ANSI Windows-1252 code page), and the ’ expands into three characters:
that’s
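You can reproduce the whole bug in a few lines. This is a sketch of the mechanism, not the vendor's actual code: encode the word as UTF-8, then decode those same bytes with Python's cp1252 codec (its name for Windows-1252).

```python
original = "that\u2019s"

# What goes over the wire: UTF-8 bytes.
utf8_bytes = original.encode("utf-8")
hex_dump = utf8_bytes.hex(" ").upper()

# The mistake: reading those bytes as Windows-1252 instead of UTF-8.
garbled = utf8_bytes.decode("cp1252")

# Because every byte mapped to a character, the damage is reversible here.
repaired = garbled.encode("cp1252").decode("utf-8")

print(hex_dump)
print(garbled)
print(repaired)
```

The round trip back only works because no bytes were lost along the way. Once garbled text has been saved or re-encoded carelessly, repair gets much harder.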
Made it this far? Why didn't I just say "The software read in a UTF-8 encoded JSON stream of tweets and displayed it with an ANSI Windows Code Page 1252." Because that wouldn't be nearly as fun.
Either way, the company that did this for the White House definitely goofed up and should have tested this. This is SUCH a classic sloppy programmer mistake that I'm disappointed to see it showcased so blatantly. I hope they (the vendor) feel a little bad. The company appears to be called "Mass Relevance" and here are some news articles about Mass Relevance and their "Tweet Curation."
Testing, testing, testing, my friends. And not only testing, but KNOW this stuff. They don't always teach it in schools and no one will learn until they see their bug on national TV in front of the President of the United States. ;)
UPDATE: The vendor said this in the comments. Very well said.
"It was definitely a mistake on our part. The problem was not the encoding on our data feed, but the HTML document was sent with ISO-8859-1. The second we inserted the twitter text into the DOM, the browsers interpreted the UTF-8 string as ISO-8859-1. Our visualizations are hosted on other platforms, and in this case the server was not configured to send UTF-8 with text/html even though the HTML file was encoded as such. It was the only issue (albeit a pretty obvious one) during an otherwise flawless event. I apologize to President Obama, Speaker Boehner, and Jack Dorsey for the mistake. If the readers of the blog think it was stupid, imagine how we felt. dev environment != production environment. If we would have just included a <meta charset="utf-8"> in the HTML head, then this would not have occurred.
The big take away is don’t make assumptions about other platforms (especially when it comes to encoding), and always include charset meta tag."
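For what it's worth, here's a minimal sketch of the belt-and-suspenders fix the vendor describes, using Python's built-in http.server. The page content, handler name, and port are all hypothetical; their real stack was surely different. It declares UTF-8 twice: once in the Content-Type header and once in a meta tag. (Browsers treat the ISO-8859-1 label as Windows-1252, which is why the vendor's explanation and the ’ on screen agree.)

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical page: the meta tag declares the encoding inside the document.
PAGE = (
    "<!doctype html><html><head>"
    '<meta charset="utf-8"><title>Tweets</title>'
    "</head><body>that\u2019s left us deeper in debt</body></html>"
)


class Utf8Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = PAGE.encode("utf-8")
        self.send_response(200)
        # The header declares the encoding at the HTTP level as well.
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


# To try it locally (port 8000 is an arbitrary choice):
#   HTTPServer(("127.0.0.1", 8000), Utf8Handler).serve_forever()
```

If the header and the meta tag ever disagree, the header wins in browsers, so setting both to the same value is the safe habit.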
Text encoding is fun for all ages. Enjoy!
* Like this post? Put me on TV, folks. This is the kind of stuff that a real technology journalist *Pogue* would love to share with the people! ABC News? I'm available and I have Skype. Call my people. ;)