domdocument40
A few weeks back I shared how I used PHP's DOMDocument to reliably update all image URLs from HTTP to HTTPS. DOMDocument made a difficult problem seem incredibly easy, but with one side effect that took me a while to spot: UTF-8 characters were being mangled into a different set of characters. I was seeing odd sequences like "ãç³" and "»ã®é" all over each blog post.
I knew the problem was happening during DOMDocument parsing and that I needed to find a fix quickly. The solution was just a tiny bit of code:
// Create a DOMDocument instance
$doc = new DOMDocument();
// The fix: mb_convert_encoding conversion
$doc->loadHTML(mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8'));
After converting the content with mb_convert_encoding, which turns each non-ASCII character into an HTML entity before DOMDocument parses it, the odd characters vanished and the desired characters were back in place. Phew!
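One caveat worth noting: the 'HTML-ENTITIES' target for mb_convert_encoding is deprecated as of PHP 8.2. The sketch below shows one possible substitute, assuming the mbstring and DOM extensions are available. It uses mb_encode_numericentity to turn every character above the ASCII range into a numeric entity before parsing. The helper name loadUtf8Html and the sample markup are hypothetical and only there for illustration; this is not the approach from the original post.

```php
<?php
// Hypothetical helper (not from the original post): parse UTF-8 HTML
// without relying on the deprecated 'HTML-ENTITIES' conversion.
function loadUtf8Html(string $html): DOMDocument
{
    // Conversion map: every code point from 0x80 to 0x10FFFF becomes a
    // numeric entity; ASCII, including the markup itself, is left untouched.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    $doc = new DOMDocument();
    $doc->loadHTML(mb_encode_numericentity($html, $convmap, 'UTF-8'));

    return $doc;
}

// Sample content, made up for this example.
$content = '<p>日本語 café</p>';

$doc = loadUtf8Html($content);
$text = $doc->getElementsByTagName('p')->item(0)->textContent;

// The text node should round-trip as the original UTF-8 string.
var_dump($text === '日本語 café');
```

Because the parser only ever sees ASCII plus entities, it no longer has to guess the input encoding, which is the same idea as the mb_convert_encoding fix.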