domdocument40_DOMDocument和UTF-8问题

最新推荐文章于 2021-03-19 00:28:19 发布

culuo8053

最新推荐文章于 2021-03-19 00:28:19 发布

阅读量160

点赞数

文章标签： python java mysql linux 人工智能

原文链接：https://davidwalsh.name/domdocument-utf8-problem

版权

domdocument40

A few weeks back I shared how I used PHP DOMDocument to reliably update all image URLs from standard HTTP to HTTPS. DOMDocument made a difficult problem seem incredibly easy ... but with one side-effect that it took me a while to spot: UTF-8 characters were being mutated into another set of characters. I was seeing a bunch of odd characters like "ãç³" and"»ã®é" all over each blog post.

几周前，我分享了如何使用PHP DOMDocument可靠地将所有图像URL从标准HTTP更新为HTTPS。 DOMDocument使一个棘手的问题似乎变得异常容易……但是有一个副作用，我花了一段时间才发现：UTF-8字符被突变为另一组字符。我在每个博客文章中看到一堆奇怪的字符，例如“ãç³”和“»ã®é”。

I knew the problem was happening during the DOMDocument parsing and that I need to find a fix quickly. The solution was just a tiny bit of code:

我知道在DOMDocument解析期间会发生问题，因此我需要快速找到修复程序。解决方案只是一小段代码：


// Create a DOMDocument instance 
$doc = new DOMDocument();

// The fix: mb_convert_encoding conversion
$doc->loadHTML(mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8'));

After setting the character set with mb_convert_encoding, the odd characters vanished and the desired characters were back in place. Phew!

用mb_convert_encoding设置字符集mb_convert_encoding ，奇数字符消失了，所需的字符又恢复了原位。！