Detect encoding and make everything UTF-8

p15097962069

Tags: php, encoding, utf-8, character-encoding

Original link: https://oldbug.net/q/3owD/Detect-encoding-and-make-everything-UTF-8

Source question: Detect encoding and make everything UTF-8

I'm reading lots of texts from various RSS feeds and inserting them into my database.

Of course, the feeds use several different character encodings, e.g. UTF-8 and ISO 8859-1.

Unfortunately, there are sometimes problems with the encodings of the texts. Example:

1. The "ß" in "Fußball" should look like this in my database: "Ÿ". If it is a "Ÿ", it is displayed correctly.
2. Sometimes, the "ß" in "Fußball" looks like this in my database: "ÃŸ". Then it is displayed wrongly, of course.
3. In other cases, the "ß" is saved as a "ß", so without any change. Then it is also displayed wrongly.

What can I do to avoid cases 2 and 3?

How can I make everything the same encoding, preferably UTF-8? When must I use utf8_encode(), when must I use utf8_decode() (it's clear what the effect is, but when must I use the functions?), and when must I do nothing with the input?

How do I make everything the same encoding? Perhaps with the function mb_detect_encoding()? Can I write a function for this? So my problems are:

1. How do I find out what encoding the text uses?
2. How do I convert it to UTF-8, whatever the old encoding is?

Would a function like this work?

function correct_encoding($text) {
    $current_encoding = mb_detect_encoding($text, 'auto');
    $text = iconv($current_encoding, 'UTF-8', $text);
    return $text;
}

I've tested it, but it doesn't work. What's wrong with it?
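One plausible reason the function above misbehaves is that 'auto' expands to a detection order that depends on the mbstring configuration and may not include ISO 8859-1 at all. A minimal sketch of a narrower approach, assuming the feeds only ever use UTF-8 or ISO 8859-1 (the function name to_utf8_sketch and that two-encoding assumption are illustrative, not part of the original question):

```php
<?php
// Hypothetical helper: assumes the input is either UTF-8 or ISO-8859-1.
function to_utf8_sketch(string $text): string
{
    // Well-formed UTF-8 is returned unchanged to avoid double encoding.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Anything else is treated as ISO-8859-1 (an assumption; detection is guesswork).
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

echo bin2hex(to_utf8_sketch("Fu\xDFball")), "\n";     // ISO-8859-1 input
echo bin2hex(to_utf8_sketch("Fu\xC3\x9Fball")), "\n"; // UTF-8 input
```

Both calls should produce the same byte sequence, since the second input is already UTF-8 and is passed through untouched. Text that happens to be valid UTF-8 but was meant as something else cannot be told apart this way.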

Reply #1

Reference: https://stackoom.com/question/3owD/检测编码并制作一切UTF

Reply #2

I had the same issue with phpQuery (ISO-8859-1 instead of UTF-8) and this hack helped me:

$html = '<?xml version="1.0" encoding="UTF-8" ?>' . $html;

mb_internal_encoding('UTF-8'), phpQuery::newDocumentHTML($html, 'utf-8'), mbstring.internal_encoding and other manipulations didn't have any effect.

Reply #3

Get the encoding from the HTTP headers and convert the content to UTF-8.

$post_url = 'http://website.domain';

// Fetch only the response headers for a URL (no body).
function get_headers_curl($url)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_URL,            $url);
    curl_setopt($ch, CURLOPT_HEADER,         true);
    curl_setopt($ch, CURLOPT_NOBODY,         true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT,        15);

    $r = curl_exec($ch);
    // curl_exec() returns false on failure; normalize to an empty string.
    return $r === false ? '' : $r;
}

$the_header = get_headers_curl($post_url);

// Check for a redirect and follow it once.
if (preg_match('/^Location:\s*(\S+)/mi', $the_header, $m)) {
    $the_header = get_headers_curl($m[1]);
}

// Get the charset from the Content-Type header, if one is declared.
$charset = null;
if (preg_match('/charset=["\']?([\w.:-]+)/i', $the_header, $m)) {
    $charset = $m[1];
}

// echo $charset;

// $html is assumed to hold the page body, fetched separately.
if ($charset && strcasecmp($charset, 'UTF-8') !== 0) {
    $html = iconv($charset, 'UTF-8', $html);
}
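The charset extraction step can be tried without any network access. The sketch below is only an illustration: the helper name charset_from_headers, the sample header text and the sample body are all made up here, and it assumes the charset is declared on the Content-Type line.

```php
<?php
// Hypothetical helper: pull the charset out of a raw HTTP header block.
// Returns null when no Content-Type charset is declared.
function charset_from_headers(string $headers): ?string
{
    $pattern = '/^Content-Type:.*?charset\s*=\s*"?([A-Za-z0-9._:-]+)/mi';
    if (preg_match($pattern, $headers, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Sample header text, invented for this illustration.
$sample = "HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=iso-8859-1\r\nContent-Length: 7\r\n";
$charset = charset_from_headers($sample);

// Body bytes in ISO-8859-1 ("Fußball"), converted only if the charset is not UTF-8.
$html = "Fu\xDFball";
if ($charset !== null && $charset !== 'UTF-8') {
    $html = iconv($charset, 'UTF-8', $html);
}
echo $html, "\n";
```

Anchoring the pattern to the Content-Type line avoids picking up a stray "charset=" from an unrelated header, and normalizing the case makes the comparison with 'UTF-8' reliable. Keep in mind that servers sometimes declare a charset that does not match the actual bytes.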

Reply #4

I know this is an older question, but I figure a useful answer never hurts. I was having issues with my encoding between a desktop application, SQLite, and GET/POST variables. Some would be in UTF-8, some would be in ASCII, and basically everything would get screwed up when foreign characters got involved.

Here is my solution. It scrubs your GET/POST/REQUEST (I omitted cookies, but you could add them if desired) on each page load before processing. It works well in a header. PHP will throw warnings if it can't detect the source encoding automatically, so these warnings are suppressed with @'s.

// Convert everything in our vars to UTF-8 for playing nice with the database...
// Use some auto detection here to help us not double-encode...
// Suppress possible warnings with @'s for when encoding cannot be detected
// Note: each() was removed in PHP 8, so this version recurses instead.
// The helper name to_utf8_deep is illustrative.
function to_utf8_deep($value)
{
    if (is_array($value)) {
        $out = array();
        foreach ($value as $k => $v) {
            // Convert string keys as well as values; integer keys stay as they are.
            $key = is_string($k) ? @mb_convert_encoding($k, 'UTF-8', 'auto') : $k;
            $out[$key] = to_utf8_deep($v);
        }
        return $out;
    }
    return is_string($value) ? @mb_convert_encoding($value, 'UTF-8', 'auto') : $value;
}

$_GET     = to_utf8_deep($_GET);
$_POST    = to_utf8_deep($_POST);
$_REQUEST = to_utf8_deep($_REQUEST);

Reply #5

A really nice way to implement an isUTF8 function can be found on php.net:

function isUTF8($string) {
    return (utf8_encode(utf8_decode($string)) == $string);
}
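The round trip above has a caveat: utf8_decode() replaces any character outside ISO 8859-1 with "?", so valid UTF-8 such as the euro sign is reported as not UTF-8, and both functions are deprecated as of PHP 8.2. A sketch of an alternative, assuming the mbstring extension is available (the function name is_utf8_sketch is illustrative):

```php
<?php
// Hypothetical helper: checks whether the bytes form well-formed UTF-8.
function is_utf8_sketch(string $string): bool
{
    return mb_check_encoding($string, 'UTF-8');
}

var_dump(is_utf8_sketch("Fu\xC3\x9Fball")); // well-formed UTF-8
var_dump(is_utf8_sketch("Fu\xDFball"));     // lone ISO-8859-1 byte, not valid UTF-8
var_dump(is_utf8_sketch("\xE2\x82\xAC"));   // euro sign: outside ISO-8859-1 but valid UTF-8
```

This only says whether the bytes are well-formed UTF-8. It cannot tell whether a well-formed string has already been double-encoded.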

Reply #6

If you apply utf8_encode() to an already UTF-8 string, it will return garbled UTF-8 output.
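The effect can be seen at the byte level. This sketch uses mb_convert_encoding() from ISO-8859-1 as a stand-in for utf8_encode(), an assumption made here because utf8_encode() is deprecated as of PHP 8.2; the sample string is illustrative.

```php
<?php
$utf8 = "Fu\xC3\x9Fball"; // "Fußball", already UTF-8

// Treating UTF-8 bytes as ISO-8859-1 and converting again double-encodes them.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";   // the sharp s takes 2 bytes
echo bin2hex($double), "\n"; // the sharp s now takes 4 bytes

// Reversing exactly one layer restores the original bytes.
$restored = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
var_dump($restored === $utf8);
```

Each extra pass adds another layer, which is why the same text can show up in several differently garbled forms.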

I made a function that addresses all these issues. It's called Encoding::toUTF8().

You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.

I did it because a service was giving me a feed of data all messed up, mixing UTF-8 and Latin1 in the same string.

Usage:

require_once('Encoding.php');
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);

$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);

Download:

https://github.com/neitanod/forceutf8

I've included another function, Encoding::fixUTF8(), which will fix every UTF-8 string that looks garbled.

Usage:

require_once('Encoding.php');
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂÃÂ©dÃÂÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dération Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football

I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().
