Bringing Unicode to PHP with Portable UTF-8

PHP allows multibyte variable names like $a∩b, $Ʃxy and $Δx, mbstring and other extensions work with Unicode strings, and the utf8_encode() and utf8_decode() functions translate strings between the UTF-8 and ISO-8859-1 encodings. Yet it’s widely acknowledged that PHP lacks Unicode support.

This article covers what the lack of Unicode support means, and demonstrates Portable UTF-8, a library that brings Unicode support to your PHP application.

Unicode Support in PHP

PHP’s lack of Unicode/multibyte support means that the standard string handling functions treat strings as a sequence of single-byte characters. In fact, the official manual defines a string in PHP as a “series of characters, where a character is the same as a byte.” PHP supports only 8-bit characters, while Unicode (and many other character sets) may require more than one byte to represent a character. This limitation of PHP affects almost all aspects of string manipulation, including (but not limited to) substring extraction, determining string lengths, string splitting, shuffling etc.
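
To make the byte-versus-character distinction concrete, here is a minimal sketch using only built-in functions. It assumes PHP 7 or later (for the `\u{...}` string escape) and relies on PCRE's `/u` modifier to count characters; the variable names are ours, chosen for illustration.

```php
<?php
// "Δx" written with an escape so the example does not depend on the file's encoding.
$s = "\u{0394}x";

// strlen() counts bytes: Δ takes two bytes in UTF-8, x takes one.
$byteLength = strlen($s);

// With the /u modifier, "." matches one whole UTF-8 character,
// so the number of matches is the number of characters.
$charLength = preg_match_all('/./su', $s);

// substr() cuts at a byte offset and can split a character in half.
$firstByte = substr($s, 0, 1);

echo $byteLength, "\n";         // 3
echo $charLength, "\n";         // 2
echo bin2hex($firstByte), "\n"; // ce (only the first half of Δ)
```

The last line shows the practical danger: the one-byte substring is no longer valid UTF-8.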

Efforts to solve the problem started in early 2005, but the work on bringing native Unicode support to PHP was stopped and shelved in 2010 for several reasons. Since native Unicode support in PHP may take years to arrive, if it ever does, developers must rely on extensions like mbstring and iconv, which fill the gap but provide only limited Unicode support. These extensions are not Unicode-centric; they can also translate between non-Unicode encodings. Even so, they make working with Unicode strings easier.

But these extensions have shortcomings of their own. They provide only limited functionality for Unicode string handling, and none of them are enabled by default. The server administrator must explicitly enable the extensions to make them available to PHP applications. Shared hosting providers often make the situation worse by installing only one or two of them, making it difficult for developers to rely on a consistently available API for their Unicode needs.

Despite all of this, the good thing is that PHP can output Unicode text. This is because PHP doesn’t really care whether we are sending out English text encoded in ASCII or some other text that belongs to a language whose characters are encoded in multiple bytes. Knowing this, what PHP developers now need is only an API that provides comfortable Unicode-based string manipulation.

Portable UTF-8

A recent solution is the creation of user-space libraries written in PHP. These libraries can be easily bundled with an application to ensure the presence of Unicode support, even if support at the server/language level is missing. Many open-source applications already include their own such libraries and many more use freely available third-party libraries; one such library is Portable UTF-8.

Portable UTF-8 is a free, lightweight library built over mbstring and iconv. It extends the capabilities of the two extensions to provide about 60 functions for Unicode-based string manipulation, testing, and validation; it offers UTF-8 aware counterparts for almost all of PHP’s common string-handling functions. As its name suggests, Portable UTF-8 uses UTF-8 as its primary character encoding scheme.

The library uses the available extensions (mbstring and iconv) for speed, and smooths over some of the inconsistencies of working with them directly, but falls back to UTF-8 routines written in pure PHP if the extensions aren't available on the server. Portable UTF-8 is fully portable and works with any installation of PHP version 4.2 or higher.
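
The "use the extension if present, otherwise fall back to pure PHP" pattern can be sketched as follows. This is not the library's code: `sketch_strlen()` is a hypothetical helper, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper: counts characters with mbstring when it is loaded,
// otherwise with a pure-PHP routine based on PCRE's UTF-8 mode.
function sketch_strlen(string $s): int
{
    if (function_exists('mb_strlen')) {
        return mb_strlen($s, 'UTF-8');
    }
    // Fallback: each "." match under /u is one UTF-8 character.
    // preg_match_all() returns false for malformed input, which casts to 0.
    return (int) preg_match_all('/./su', $s);
}

echo sketch_strlen("\u{0394}x"), "\n"; // 2
```

Callers see the same function either way, which is what lets an application behave consistently across differently configured servers.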

String Handling with Portable UTF-8

A text editor with poor Unicode support may corrupt text when reading it, and text copied from such an editor and posted into a web form might be a source of invalid UTF-8 in your application. When dealing with user-submitted input, it's important to make sure the input is exactly what the application expects. To detect whether the text is valid UTF-8, we can use the library's is_utf8() function.

if (is_utf8($_POST['title'])) {
    // do something...
}
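
For a sense of what such a check involves, one common pure-PHP technique is to let PCRE validate the string. The sketch below is an illustration under that assumption, not the library's implementation; `sketch_is_valid_utf8()` is a hypothetical name.

```php
<?php
// Hypothetical helper: true when $s is well-formed UTF-8.
function sketch_is_valid_utf8(string $s): bool
{
    // An empty pattern with /u matches any well-formed UTF-8 subject.
    // preg_match() returns false when the subject is malformed.
    return preg_match('//u', $s) === 1;
}

var_dump(sketch_is_valid_utf8("\u{0394}x")); // bool(true)
var_dump(sketch_is_valid_utf8("\xFF"));      // bool(false)
```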

Recovering characters from invalid bytes is an impossible exercise, so stripping out the bytes that cannot be recognized as valid UTF-8 characters might be your only option. We can strip invalid bytes with the utf8_clean() function.

$title = utf8_clean($_POST['title']);
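
One way to strip invalid bytes in pure PHP is a byte-level regular expression that keeps only well-formed UTF-8 sequences. The sketch below follows that approach; it is our own illustration (the helper name `sketch_clean()` is hypothetical) and may differ from what utf8_clean() does internally.

```php
<?php
// Hypothetical helper: removes bytes that are not part of a well-formed UTF-8 sequence.
function sketch_clean(string $s): string
{
    // Group 1 matches one well-formed UTF-8 character (no /u: we work on raw bytes).
    // The trailing "." consumes any single byte that group 1 could not match.
    $pattern = '/(
          [\x00-\x7F]
        | [\xC2-\xDF][\x80-\xBF]
        | \xE0[\xA0-\xBF][\x80-\xBF]
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
        | \xED[\x80-\x9F][\x80-\xBF]
        | \xF0[\x90-\xBF][\x80-\xBF]{2}
        | [\xF1-\xF3][\x80-\xBF]{3}
        | \xF4[\x80-\x8F][\x80-\xBF]{2}
    )|./xs';

    // Valid characters are replaced by themselves; stray bytes by nothing.
    return (string) preg_replace($pattern, '$1', $s);
}

echo sketch_clean("abc\xFFdef"), "\n"; // abcdef
```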

Each Unicode character can be encoded as a corresponding HTML entity, and you may want to encode text this way to help prevent XSS attacks before outputting it to the browser.

echo utf8_html_encode($title);
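
For comparison, PHP's built-in htmlspecialchars() escapes only the characters that are special in HTML and leaves other UTF-8 characters untouched, provided it is told the encoding. The wrapper name `sketch_escape()` below is hypothetical, and the flags assume PHP 5.4 or later.

```php
<?php
// Hypothetical wrapper around the built-in escaping function.
function sketch_escape(string $s): string
{
    // ENT_QUOTES escapes both quote styles; ENT_SUBSTITUTE replaces
    // malformed byte sequences instead of returning an empty string.
    return htmlspecialchars($s, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

echo sketch_escape('<b>"x"</b>'), "\n"; // &lt;b&gt;&quot;x&quot;&lt;/b&gt;
```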

It’s common to trim whitespace at the start and the end of a string. Unicode lists about 20 whitespace characters, and there are some ASCII-based control characters that should be considered as well for such trimming.

$title = utf8_trim($title);
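
A Unicode-aware trim can be approximated with PCRE character properties. The following is a sketch under our own choice of character classes (separators plus control characters), which may not match the exact set utf8_trim() uses; `sketch_trim()` is a hypothetical name.

```php
<?php
// Hypothetical helper: trims Unicode whitespace and control characters from both ends.
function sketch_trim(string $s): string
{
    // \s = ASCII whitespace, \p{Z} = Unicode separators, \p{Cc} = control characters.
    // \A and \z anchor to the very start and very end of the string.
    return (string) preg_replace('/\A[\s\p{Z}\p{Cc}]+|[\s\p{Z}\p{Cc}]+\z/u', '', $s);
}

// U+3000 is the ideographic space, U+00A0 the no-break space.
echo sketch_trim("\u{3000} title\u{00A0}\n"), "\n"; // title
```

The built-in trim() would leave both U+3000 and U+00A0 in place, because by default it only knows a handful of single-byte characters.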

On the other hand, there may be duplicated whitespace characters in the middle of the string that should be removed. The following shows how utf8_remove_duplicates() and utf8_ws() can be used together:

$title = utf8_remove_duplicates($title, utf8_ws());
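
The same idea can be sketched with a back-reference: capture one whitespace character and drop its immediate repeats. This is our own approximation, with the hypothetical name `sketch_collapse_whitespace()`, and it only collapses runs of the same character.

```php
<?php
// Hypothetical helper: collapses consecutive repeats of the same whitespace character.
function sketch_collapse_whitespace(string $s): string
{
    // ([\s\p{Z}]) captures one whitespace character; \1+ matches its immediate repeats.
    return (string) preg_replace('/([\s\p{Z}])\1+/u', '$1', $s);
}

echo sketch_collapse_whitespace('a   b'), "\n"; // a b
```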

Traditional solutions for creating URL slugs for SEO reasons use transliteration and strip all non-ASCII characters from the slug. That makes a URL less valuable than it could be. Since URLs can carry UTF-8 encoded characters, there is no need for such stripping or transliteration, and we can create rich slugs containing characters of any language:

$slug = utf8_url_slug($title, 30); // char length 30
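
To illustrate what a Unicode-preserving slug function might do, here is a rough sketch. The rules it applies (hyphenate every run of non-letter, non-digit characters, then cut to a character limit) are our assumptions, not the documented behavior of utf8_url_slug(); `sketch_slug()` is a hypothetical name.

```php
<?php
// Hypothetical helper: builds a slug that keeps letters and digits of any script.
function sketch_slug(string $title, int $maxChars): string
{
    // Replace every run of characters that are not letters, marks, or digits with a hyphen.
    $slug = (string) preg_replace('/[^\p{L}\p{M}\p{N}]+/u', '-', $title);
    $slug = trim($slug, '-');

    // Keep at most $maxChars characters (not bytes).
    preg_match('/^.{0,' . max(0, $maxChars) . '}/us', $slug, $m);

    // Truncation may leave a trailing hyphen; remove it.
    return rtrim($m[0] ?? '', '-');
}

echo sketch_slug("\u{0394}x and \u{01A9}xy!", 30), "\n"; // Δx-and-Ʃxy
```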

From input validation through to saving the data in a database, a Unicode-aware application focuses on characters and character length rather than bytes and byte length. This shift of focus requires an interface that understands the difference. It's common to enforce a limit on the character length of input, so here we create a substring if the input exceeds 60 characters.

if (utf8_strlen($title) > 60) {
    $title = utf8_substr($title, 0, 60);
}

Alternatively:

if (!utf8_fits_inside($title, 60)) {
    $title = utf8_substr($title, 0, 60);
}
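
To see why the limit has to be applied to characters rather than bytes, the sketch below truncates a string both ways. `sketch_truncate()` is a hypothetical stand-in built on PCRE, not the library's utf8_substr().

```php
<?php
// Hypothetical helper: returns at most $maxChars characters from the start of $s.
function sketch_truncate(string $s, int $maxChars): string
{
    // Under /u, "." is one character, so {0,N} limits characters, not bytes.
    preg_match('/^.{0,' . max(0, $maxChars) . '}/us', $s, $m);
    return $m[0] ?? '';
}

$s = "\u{0394}\u{0394}\u{0394}";               // three characters, six bytes
$byChars = sketch_truncate($s, 2);             // two whole characters
$byBytes = substr($s, 0, 3);                   // one character plus half of the next

echo strlen($byChars), "\n";                   // 4
var_dump(preg_match('//u', $byBytes) === 1);   // bool(false): no longer valid UTF-8
```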

There are three different ways to access an individual character with the Portable UTF-8 library. We can use utf8_access() to reach a character by its position.

echo 'The sixth character is: ' . utf8_access($string, 5);

utf8_chr_map() allows individual character access iteratively using a callback function.

utf8_chr_map('some_callback', $string);

And we can split a string into an array of characters using utf8_split() and work with the array elements as individual characters.

array_map('some_callback', utf8_split($string));
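
Splitting into characters is also easy to reproduce with built-ins, which may help when reasoning about what the array contains. The sketch below uses preg_split() with an empty `/u` pattern; it assumes valid UTF-8 input and is not the library's implementation.

```php
<?php
// An empty pattern under /u matches between every pair of characters.
$string = "\u{0394}x\u{01A9}";
$chars  = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);

// Each element is one complete character, whatever its byte length.
echo count($chars), "\n";           // 3
echo bin2hex($chars[2]), "\n";      // c6a9 (the two bytes of Ʃ)

// The array can then be processed per character with a callback.
$byteLengths = array_map('strlen', $chars);
```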

Working with Unicode may also require finding the minimum or maximum code point in a string, splitting strings, handling the byte order mark, converting string case, randomizing or shuffling, replacing, and so on. All of that is supported by Portable UTF-8.
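
As an example of one of these operations, the following sketch finds the minimum and maximum code point in a string by decoding each UTF-8 character by hand. Both function names are hypothetical, and the code assumes a non-empty, valid UTF-8 string.

```php
<?php
// Hypothetical helper: decodes one UTF-8 character into its code point.
function sketch_code_point(string $char): int
{
    $b = array_values(unpack('C*', $char)); // bytes as integers, 0-indexed
    switch (count($b)) {
        case 1:
            return $b[0];
        case 2:
            return (($b[0] & 0x1F) << 6) | ($b[1] & 0x3F);
        case 3:
            return (($b[0] & 0x0F) << 12) | (($b[1] & 0x3F) << 6) | ($b[2] & 0x3F);
        default:
            return (($b[0] & 0x07) << 18) | (($b[1] & 0x3F) << 12)
                 | (($b[2] & 0x3F) << 6) | ($b[3] & 0x3F);
    }
}

// Hypothetical helper: returns [min, max] code points of a non-empty valid UTF-8 string.
function sketch_code_point_range(string $s): array
{
    $chars  = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    $points = array_map('sketch_code_point', $chars);
    return [min($points), max($points)];
}

[$min, $max] = sketch_code_point_range("\u{0394}x");
printf("U+%04X to U+%04X\n", $min, $max); // U+0078 to U+0394
```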

Conclusion

Development of PHP 6 has been stopped, delaying the much-needed native Unicode support for developing multilingual applications. In the meantime, server-side extensions and user-space libraries like Portable UTF-8 play an important role in helping developers build a better, more standardized web that meets local needs.