Working with Multibyte Strings

culi3182

Original article: https://www.sitepoint.com/working-with-multibyte-strings/

A written language, whether it’s English, Japanese, or whatever else, consists of a number of characters, so an essential problem when working with a language digitally is to find a way to represent each character in a digital manner. Back in the day we only needed to represent English characters, but it’s a whole different ball game today and the result is a bewildering number of character encoding schemes used to represent the characters of many different languages. How does PHP relate to and deal with these different schemes?

The Basics

We all know that a ‘bit’ is a thing that can be either a 0 or 1 and nothing else, and a ‘byte’ is a grouping of eight consecutive bits. Since there are eight of these dual value spots in a byte, one byte can be configured in a total of 256 distinct patterns (2 to the power of 8). It’s possible to associate a different character with each possible 8-bit pattern.
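
As a quick illustration of that arithmetic, here is a minimal sketch in PHP. It assumes nothing beyond the core functions ord(), chr() and decbin(), and the letter 'A' is just an arbitrary sample character.

```php
<?php
// One byte holds 8 bits, so it can take 2^8 distinct values.
$patterns = 2 ** 8;
echo $patterns, PHP_EOL; // 256

// ord() maps a single-byte character to its numeric value; chr() maps it back.
$code = ord('A');
echo $code, PHP_EOL;      // 65
echo chr($code), PHP_EOL; // A

// The same value written out as its 8-bit pattern.
$bits = str_pad(decbin($code), 8, '0', STR_PAD_LEFT);
echo $bits, PHP_EOL;      // 01000001
```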

Put these bytes together in different orders and you have yourself some communication. It’s not necessarily intelligent, that depends on who is at each end, but it is communication. As long as we can express a language’s characters in 256 unique characters or fewer, we’re set.

But what if we can’t express a language with just 256 characters? Or what if we need to express multiple languages in the same document? Today, as we digitize everything we can find, 256 characters is nowhere near enough. Luckily character schemes that are more up to the challenge have been devised. These new, super character sets use anywhere from one to four bytes to define characters.

The big dog in the character encoding scene today is Unicode, a scheme that uses multiple bytes to represent characters. It was developed by the Unicode Consortium and there are several versions of it: UTF-32 which is used on the Dreadnaught class of starships, UTF-16 which is used on the Star Trek: Into Darkness Enterprise, and UTF-8 which is what most of us in the real world should use for our web applications.

As I said, Unicode (including UTF-8) uses multiple byte configurations to represent characters. UTF-8 uses anywhere from one to four bytes to produce the 1,112,064 patterns to represent different characters. These ‘wide characters’ take up more space, but UTF-8 does have a tendency to be faster to process than some other encoding schemes.
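
To make the one-to-four-byte claim concrete, the sketch below measures one sample character from each width class. The four characters are arbitrary picks, and the "\u{...}" escape syntax assumes PHP 7.0 or later. The last lines show where the 1,112,064 figure comes from: every code point up to U+10FFFF, minus the 2,048 surrogates.

```php
<?php
// One sample character per UTF-8 width class (labels are only for display).
$samples = [
    'A'       => "A",          // U+0041, plain ASCII
    'e-acute' => "\u{00E9}",   // U+00E9
    'ni'      => "\u{4F60}",   // U+4F60
    'emoji'   => "\u{1F600}",  // U+1F600
];

$byteLengths = [];
foreach ($samples as $label => $char) {
    // strlen() counts bytes, which is exactly what we want to see here.
    $byteLengths[$label] = strlen($char);
    printf("%-8s %d byte(s)\n", $label, $byteLengths[$label]);
}

// Code points U+0000..U+10FFFF, minus the surrogate range U+D800..U+DFFF.
$encodable = 0x110000 - 0x800;
echo number_format($encodable), PHP_EOL; // 1,112,064
```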

Why is everyone ooh-ing and aah-ing about UTF-8? Partly it’s the hot models that have been spotlighted in the Support UTF-8 commercials seen on ESPN and TCM, but mostly it’s because UTF-8 mimics ASCII and if you don’t have any special characters involved, it tracks ASCII exactly.
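
Here is a small sketch of what "tracks ASCII exactly" means in practice, assuming the mbstring extension is loaded. The sample string is arbitrary; the point is that pure ASCII text is already valid UTF-8 and converting it changes nothing.

```php
<?php
$ascii = 'Hello, PHP!';

// Pure ASCII text is already well-formed UTF-8 ...
$validUtf8 = mb_check_encoding($ascii, 'UTF-8');

// ... and converting it from ASCII to UTF-8 leaves every byte unchanged.
$converted = mb_convert_encoding($ascii, 'UTF-8', 'ASCII');

var_dump($validUtf8);            // bool(true)
var_dump($converted === $ascii); // bool(true)

// Byte count and character count agree as long as the text stays in ASCII.
$sameLength = strlen($ascii) === mb_strlen($ascii, 'UTF-8');
var_dump($sameLength);           // bool(true)
```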

And This Affects PHP How?

I know what you’re thinking. I just have to set the character set in my meta tags to ‘UTF-8’ and everything will be okay. But that’s not true.

First, the simple truth is that PHP is not really designed to deal with multibyte characters and so doing things to these characters using the standard string functions may produce uncertain results. When we need to work with these multibyte characters, we need to use a special set of functions: the mbstring functions.
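
A minimal sketch of the kind of trouble this causes, assuming the mbstring extension is available and the text is UTF-8. The sample strings are arbitrary; the comparison between substr() and mb_substr() is the point.

```php
<?php
// Four characters, twelve bytes in UTF-8.
$text = "\u{4F60}\u{597D}\u{4E16}\u{754C}";

// substr() counts bytes, so a length of 2 cuts the first character apart.
$byteCut = substr($text, 0, 2);
// mb_substr() counts characters in the given encoding.
$charCut = mb_substr($text, 0, 2, 'UTF-8');

var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false): broken sequence
var_dump($charCut === "\u{4F60}\u{597D}");       // bool(true): first two characters

// mb_strtoupper() also handles letters outside ASCII.
$word  = "caf\u{00E9}";
$upper = mb_strtoupper($word, 'UTF-8');
var_dump($upper === "CAF\u{00C9}");              // bool(true)
```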

And second, even if you have PHP under control, there can still be problems. The HTTP headers covering your communication also contain a character set identification and that will override what’s in the meta tag of your page.
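
One way to keep the header and the page in agreement is to send the charset yourself. This is only a sketch: contentTypeHeader() is a hypothetical helper made up for this example, and it assumes you are serving HTML as UTF-8.

```php
<?php
// Hypothetical helper: builds the Content-Type header line for an HTML page.
function contentTypeHeader(string $charset = 'UTF-8'): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// header() only works before any output has been sent.
if (!headers_sent()) {
    header(contentTypeHeader());
}
```

With the header set explicitly, the browser no longer has to choose between what the server says and what the meta tag says.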

So, how does PHP deal with multibyte characters? There are two function groups that affect multibyte strings.

The first is iconv. With 5.0, this has become a default part of the language, a way to convert one character set into another character set representation. This is not what we are going to talk about in this article.

The second is multibyte support, a series of functions prefixed with “mb_”. There are a number of these functions and a quick review shows that some of them check whether a string is valid for a given encoding scheme, while others are search-oriented functions, similar to PHP’s regular expression functions but built to work on multibyte strings.

Turning on Multibyte Support for PHP

Multibyte support is not a default feature of PHP, but neither does it require that we download any extra libraries or extensions; it just requires some reconfiguration. Unfortunately, if you’re using a hosted version of PHP, this might not be something you can do.

Take a look at your configuration using the phpinfo() function. Scroll about half-way down the output and there will be a section labeled “mbstring”. This will show you whether the basic functionality is enabled. For information on how to enable this, you can refer to the manual. In short, you enable the mb functions with the --enable-mbstring compile-time option; run-time behavior is then controlled by php.ini settings such as mbstring.encoding_translation.
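
If you would rather check from code than scroll through phpinfo(), a sketch like the following does the job. It uses only extension_loaded(), mb_internal_encoding() and ini_get(); what it prints depends entirely on your own installation.

```php
<?php
// True when the mbstring extension is compiled in or loaded.
$hasMbstring = extension_loaded('mbstring');
echo $hasMbstring ? 'mbstring is loaded' : 'mbstring is missing', PHP_EOL;

if ($hasMbstring) {
    // Default encoding used by mb_* functions when none is passed.
    echo 'internal encoding: ', mb_internal_encoding(), PHP_EOL;
    // Run-time setting from php.ini; ini_get() returns it as a string.
    echo 'encoding_translation: ',
        var_export(ini_get('mbstring.encoding_translation'), true), PHP_EOL;
}
```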

The ultimate solution was supposed to be PHP 6, which planned to use the ICU libraries (originally from IBM; please, everyone remove their ball caps) to provide native support for multibyte character sets. That release never shipped, so sitting back and waiting won’t get us anywhere, eh buddy roe? Instead, check out the multibyte support that is available now.

Multibyte String Commands

It’s possible that there are 53 different multibyte string commands. It’s also possible that there are 54. I sort of lost count at one point, but you get the idea. Needless to say we’re not going to go through each one, but just for kicks let’s take a quick look at a few.

mb_check_encoding

The mb_check_encoding() function checks to determine if a specific encoding sequence is valid for an encoding scheme. The function does not tell you what the string is encoded as (or what schemes it will work for), but it does tell you if it will work or not for the specified scheme.

<?php
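// The \uXXXX sequences are JSON escapes; json_decode() turns them into UTF-8 text.
$string = '\u4F60\u597D\u4E16\u754C';
$string = json_decode('"' . $string . '"');
// Check whether the byte sequence is well-formed UTF-8.
$valid = mb_check_encoding($string, 'UTF-8');
echo ($valid) ? 'valid' : 'invalid';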

You can find a list of the supported encodings in the PHP manual.
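
For contrast, here is a sketch of a check that fails, plus a way to ask your own PHP build which encodings it knows. The two-byte invalid sequence is a hand-picked example, and the code assumes mbstring is loaded.

```php
<?php
$valid   = "\u{4F60}\u{597D}"; // well-formed UTF-8
// 0xC3 opens a two-byte sequence, but 0x28 is not a continuation byte.
$invalid = "\xC3\x28";

var_dump(mb_check_encoding($valid, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($invalid, 'UTF-8')); // bool(false)

// mb_list_encodings() returns the encoding names this PHP build supports.
$encodings = mb_list_encodings();
var_dump(in_array('UTF-8', $encodings, true));  // bool(true)
```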

mb_strlen

The strlen() function returns the number of bytes in a string. For ASCII where each character is a single byte, this works fine to find the number of characters. With multibyte strings you need to use the mb_strlen() function.

<?php
$string = '\u4F60\u597D\u4E16\u754C';
$string = json_decode('"' . $string . '"');

echo strlen($string), PHP_EOL; // outputs 12: the byte count, not the character count
echo mb_strlen($string, 'UTF-8'), PHP_EOL; // outputs 4

mb_ereg_search

The mb_ereg_search() function performs a multibyte version of the traditional character search. But there are a few caveats – you need to specify the encoding scheme using the mb_regex_encoding() function, the regular expression doesn’t have delimiters (it’s just the pattern part), and both the regex and string are specified using mb_ereg_search_init().

<?php
// specify the encoding scheme
mb_regex_encoding('UTF-8');

// specify haystack and search (the \uXXXX escapes are decoded by json_decode())
$string = '\u4F60\u597D\u4E16\u754C';
$string = json_decode('"' . $string . '"');

$pattern = '\u754C';
$pattern = json_decode('"' . $pattern . '"');

mb_ereg_search_init($string, $pattern);

// finally we can perform the search
$result = mb_ereg_search();
echo ($result) ? "found" : "not found";

Had Enough?

I don’t know about you but I think the world really needs more simple things. Unfortunately, multibyte processing is not going to fill that need. But for now it’s something you can’t ignore. There are times when you won’t be able to rely on normal PHP string processing, because you are working with characters outside the ASCII range (U+0000 to U+007F). And that means you have to use the mb_ oriented functions.
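
If you are not sure whether a given string falls into that category, a rough test is to look for any byte above the 7-bit ASCII range (0x00 to 0x7F). The helper below is hypothetical, written just for this example, and is a sketch rather than a full encoding detector.

```php
<?php
// Hypothetical helper: true when the string holds any byte outside 7-bit ASCII.
function hasNonAscii(string $s): bool
{
    // Without the /u modifier, PCRE matches byte by byte.
    return preg_match('/[^\x00-\x7F]/', $s) === 1;
}

var_dump(hasNonAscii('plain text'));  // bool(false)
var_dump(hasNonAscii("caf\u{00E9}")); // bool(true)
```

When it returns true, that is the cue to reach for the mb_ functions instead of their single-byte cousins.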

Want to know more? Seriously, you do? I honestly thought that would scare you away. I was not prepared for that. And my time is up. Your best bet? Check out the PHP manual. Oh, and try stuff. There’s no substitute for actual experience using something.
