mb_detect_encoding php,mb_detect_encoding

Latest recommended article published 2021-03-17 08:09:29

大漠荒城史己


User comments:

[#1]

thinkxf at gmail dot com [2014-10-08 05:32:15]

To use this function, you must enable the mbstring extension (php_mbstring) in php.ini.
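A small sketch of how a script might check for the extension before calling the function. The helper name `has_mbstring` is hypothetical and not from the original comment; the php.ini line mentioned in the code comment is an assumption that depends on the platform and PHP version.

```php
<?php
// Hypothetical helper (not from the original comment): reports whether
// the mbstring extension is loaded and mb_detect_encoding() is callable.
function has_mbstring(): bool
{
    return extension_loaded('mbstring') && function_exists('mb_detect_encoding');
}

if (!has_mbstring()) {
    // Assumption: on most builds the fix is a line such as
    // "extension=mbstring" in php.ini, followed by a restart.
    error_log('mbstring is not available; mb_detect_encoding() cannot be used');
}
```

Checking up front lets the script fall back to one of the iconv-based substitutes in the comments below rather than failing with an undefined-function error.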

[#2]

emoebel at web dot de [2013-12-25 09:29:57]

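If the function mb_detect_encoding() does not exist, try:

<?php
// ----------------------------------------------------
if (!function_exists('mb_detect_encoding')) {

    // Fallback built on iconv: reports the first encoding in the list
    // for which an identity conversion leaves $string unchanged.
    function mb_detect_encoding($string, $enc = null, $ret = null) {

        static $enclist = array(
            'UTF-8', 'ASCII',
            'ISO-8859-1', 'ISO-8859-2', 'ISO-8859-3', 'ISO-8859-4', 'ISO-8859-5',
            'ISO-8859-6', 'ISO-8859-7', 'ISO-8859-8', 'ISO-8859-9', 'ISO-8859-10',
            'ISO-8859-13', 'ISO-8859-14', 'ISO-8859-15', 'ISO-8859-16',
            'Windows-1251', 'Windows-1252', 'Windows-1254',
        );

        $result = false;

        foreach ($enclist as $item) {
            $sample = iconv($item, $item, $string);
            if (md5($sample) == md5($string)) {
                // Return the encoding name, or true when $ret is given.
                if ($ret === NULL) { $result = $item; } else { $result = true; }
                break;
            }
        }

        return $result;
    }
}
// ----------------------------------------------------
?>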

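Example usage of mb_detect_encoding():

<?php
// ------------------------------------------------------
function str_to_utf8($str) {

    // Strict mode: the result is false unless $str is valid UTF-8.
    if (mb_detect_encoding($str, 'UTF-8', true) === false) {
        $str = utf8_encode($str); // treats the input as ISO-8859-1
    }

    return $str;
}
// ------------------------------------------------------

$txtstr = str_to_utf8($txtstr);
?>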

[#3]

Anonymous [2013-10-08 21:17:06]

// -----------------------------------------------------------
if (!function_exists('mb_detect_encoding')) {

    function mb_detect_encoding($string, $enc = null, $ret = true) {
        $out = $enc;
        static $list = array('utf-8', 'iso-8859-1', 'iso-8859-15', 'windows-1251');

        foreach ($list as $item) {
            // An identity conversion leaves the string unchanged
            // only if it is valid in that encoding.
            $sample = iconv($item, $item, $string);
            if (md5($sample) == md5($string)) { $out = ($ret !== false) ? true : $item; }
        }

        return $out;
    }
}
// -----------------------------------------------------------

[#4]

eyecatchup at gmail dot com [2013-06-11 10:41:41]

Just a note: Instead of using the often recommended (rather complex) regular expression by W3C (http://www.w3.org/International/questions/qa-forms-utf-8.en.php), you can simply use the 'u' modifier to test a string for UTF-8 validity:

if (preg_match("//u", $string)) {

// $string is valid UTF-8

}
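A minimal sketch of that check wrapped in a helper. The name `is_valid_utf8` is hypothetical and not part of the original comment. It relies on preg_match() returning false, rather than 1, when the subject is not well-formed UTF-8 under the 'u' modifier.

```php
<?php
// Hypothetical helper: true when $string is well-formed UTF-8.
function is_valid_utf8(string $string): bool
{
    // With the 'u' modifier PCRE validates the subject first; on malformed
    // UTF-8 preg_match() returns false instead of 1, so compare strictly.
    return preg_match('//u', $string) === 1;
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // bool(true): "é" encoded as UTF-8
var_dump(is_valid_utf8("caf\xE9"));     // bool(false): "é" encoded as ISO-8859-1
```

The strict comparison with 1 matters: the empty pattern always matches a valid subject, so any other return value signals an encoding error.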

[#5]

bmrkbyet at web dot de [2013-03-24 14:04:36]

a) If the function mb_detect_encoding is not available:

### mb_detect_encoding ... iconv ###

<?php
// -------------------------------------------
if (!function_exists('mb_detect_encoding')) {

    function mb_detect_encoding($string, $enc = null) {

        static $list = array('utf-8', 'iso-8859-1', 'windows-1251');

        foreach ($list as $item) {
            $sample = iconv($item, $item, $string);
            if (md5($sample) == md5($string)) {
                // true if the match is the requested encoding, else its name
                if ($enc == $item) { return true; } else { return $item; }
            }
        }

        return null;
    }
}
// -------------------------------------------
?>

b) If the function mb_convert_encoding is not available:

### mb_convert_encoding ... iconv ###

<?php
// -------------------------------------------
if (!function_exists('mb_convert_encoding')) {

    function mb_convert_encoding($string, $target_encoding, $source_encoding) {
        $string = iconv($source_encoding, $target_encoding, $string);
        return $string;
    }
}
// -------------------------------------------
?>

[#6]

Gerg Tisza [2011-02-18 03:43:45]

If you use mb_detect_encoding to check whether a string is valid UTF-8, use strict mode; it is pretty worthless otherwise.

<?php
$str = '????????'; // ISO-8859-1
mb_detect_encoding($str, 'UTF-8');       // 'UTF-8'
mb_detect_encoding($str, 'UTF-8', true); // false
?>
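The strict-mode behavior can be reproduced with explicit bytes. This is a sketch under an assumption: the sample string below is the ISO-8859-1 byte sequence for "áéóú", chosen for illustration. It requires the mbstring extension.

```php
<?php
// Assumed sample: the ISO-8859-1 bytes for "áéóú". Read as UTF-8 they are
// malformed, because 0xE1 starts a 3-byte sequence that never completes.
$latin1 = "\xE1\xE9\xF3\xFA";

// Strict mode rejects the string because it is not well-formed UTF-8.
$strict = mb_detect_encoding($latin1, 'UTF-8', true);
var_dump($strict); // bool(false)

// mb_check_encoding() answers the validity question directly.
var_dump(mb_check_encoding($latin1, 'UTF-8'));     // bool(false)
var_dump(mb_check_encoding("\xC3\xA1", 'UTF-8'));  // bool(true): "á" as UTF-8
```

When the only question is "is this valid UTF-8?", mb_check_encoding() states the intent more directly than detection against a one-item list.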

[#7]

nat3738 at gmail dot com [2009-05-22 03:58:04]

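A simple way to detect the UTF-8/16/32 encoding of a file by its BOM (does not work for strings, or for files without a BOM):

<?php
// The BOM is U+FEFF; these are its byte sequences in each encoding.
define('UTF32_BIG_ENDIAN_BOM',    "\x00\x00\xFE\xFF");
define('UTF32_LITTLE_ENDIAN_BOM', "\xFF\xFE\x00\x00");
define('UTF16_BIG_ENDIAN_BOM',    "\xFE\xFF");
define('UTF16_LITTLE_ENDIAN_BOM', "\xFF\xFE");
define('UTF8_BOM',                "\xEF\xBB\xBF");

function detect_utf_encoding($filename) {
    $text = file_get_contents($filename);
    $first2 = substr($text, 0, 2);
    $first3 = substr($text, 0, 3);
    $first4 = substr($text, 0, 4); // four bytes are needed for the UTF-32 BOMs

    // UTF-32LE must be tested before UTF-16LE: both start with FF FE.
    if ($first3 == UTF8_BOM) return 'UTF-8';
    elseif ($first4 == UTF32_BIG_ENDIAN_BOM) return 'UTF-32BE';
    elseif ($first4 == UTF32_LITTLE_ENDIAN_BOM) return 'UTF-32LE';
    elseif ($first2 == UTF16_BIG_ENDIAN_BOM) return 'UTF-16BE';
    elseif ($first2 == UTF16_LITTLE_ENDIAN_BOM) return 'UTF-16LE';
}
?>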

[#8]

prgss at bk dot ru [2009-03-30 02:16:22]

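Another lightweight way to detect character encoding:

<?php
// Function name assumed; the header is missing from the source.
function detect_encoding($string) {
    static $list = array('utf-8', 'windows-1251');

    foreach ($list as $item) {
        $sample = iconv($item, $item, $string);
        if (md5($sample) == md5($string))
            return $item;
    }

    return null;
}
?>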

[#9]

matthijs at ischen dot nl [2009-03-28 10:33:20]

I seriously underestimated the importance of setlocale...

<?php
$strings = array(
    "mais coisas a pensar sobre diário ou dois!",
    "plus de choses à penser à journalier ou à deux !",
    "¡más cosas a pensar en diario o dos!",
    "più cose da pensare circa giornaliere o due!",
    "flere ting å tenke på hver dag eller to!",
    "Další věcí, přemýšlet o každý den nebo dva!",
    "mehr über Spaß spät schönen",
    "më vonë gjatë fun bukur",
    "több mint szórakozás késő csodálatos kenyér",
);

$convert = array();
setlocale(LC_CTYPE, 'de_DE.UTF-8');

foreach ($strings as $string)
    $convert[] = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
?>

Produces the following:

Array

(

[0] => mais coisas a pensar sobre diario ou dois!

[1] => plus de choses a penser a journalier ou a deux !

[2] => ?mas cosas a pensar en diario o dos!

[3] => piu cose da pensare circa giornaliere o due!

[4] => flere ting aa tenke paa hver dag eller to!

[5] => Dalsi veci, premyslet o kazdy den nebo dva!

[6] => mehr ueber Spass spaet schoenen

[7] => me vone gjate fun bukur

[8] => toebb mint szorakozas keso csodalatos kenyer

)

whereas

<?php
$convert = array();
setlocale(LC_CTYPE, 'nl_NL.UTF-8');

foreach ($strings as $string)
    $convert[] = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
?>

produces:

Array

(

[0] => mais coisas a pensar sobre di?rio ou dois!

[1] => plus de choses ? penser ? journalier ou ? deux !

[2] => ?m?s cosas a pensar en diario o dos!

[3] => pi? cose da pensare circa giornaliere o due!

[4] => flere ting ? tenke p? hver dag eller to!

[5] => Dal?? v?c?, p?em??let o ka?d? den nebo dva!

[6] => mehr ?ber Spass sp?t sch?nen

[7] => m? von? gjat? fun bukur

[8] => t?bb mint sz?rakoz?s k?s? csod?latos keny?r

)

This might be of interest when converting UTF-8 strings into ASCII suitable for URLs and the like. It was never obvious to me, since I had only used the US and NL locales.

[#10]

dennis at nikolaenko dot ru [2008-10-06 09:18:05]

Beware of a bug when detecting Russian encodings:

http://bugs.php.net/bug.php?id=38138

[#11]

hmdker at gmail dot com [2008-08-23 21:58:28]

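Function to detect UTF-8; it may be useful when mb_detect_encoding is not available.

<?php
// Function name assumed; the header is missing from the source.
function is_utf8($str) {
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $c = ord($str[$i]);
        if ($c >= 128) {
            // Work out the sequence length from the lead byte.
            if (($c >= 254)) return false;
            elseif ($c >= 252) $bits = 6;
            elseif ($c >= 248) $bits = 5;
            elseif ($c >= 240) $bits = 4;
            elseif ($c >= 224) $bits = 3;
            elseif ($c >= 192) $bits = 2;
            else return false; // continuation byte without a lead byte
            if (($i + $bits) > $len) return false;
            // Every following byte must be 10xxxxxx.
            while ($bits > 1) {
                $i++;
                $b = ord($str[$i]);
                if ($b < 128 || $b > 191) return false;
                $bits--;
            }
        }
    }
    return true;
}
?>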

[#12]

yaqy at qq dot com [2008-07-20 22:14:56]

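<?php
// Function name assumed; the header is missing from the source.
function to_utf8($str)
{
    if (mb_detect_encoding($str, "UTF-8, ISO-8859-1, GBK") != "UTF-8")
    {
        return iconv("gbk", "utf-8", $str);
    }
    else
    {
        return $str;
    }
}
?>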

[#13]

rl at itfigures dot nl [2007-09-04 14:00:15]

I used Chris's function "detectUTF8" to detect the need for conversion from UTF-8 to 8859-1, which works fine. I did have a problem with the iconv conversion that follows it.

The problem is that the iconv conversion to 8859-1 (with //TRANSLIT) replaces the euro sign with EUR, although it is common practice to use \x80 as the euro sign in the 8859-1 charset (strictly speaking, that byte is the euro sign in Windows-1252, which is often treated as ISO-8859-1).

I could not use 8859-15 since that mangled some other characters, so I added two str_replace calls:

if(detectUTF8($str)){

$str=str_replace("\xE2\x82\xAC","&euro;",$str);

$str=iconv("UTF-8","ISO-8859-1//TRANSLIT",$str);

$str=str_replace("&euro;","\x80",$str);

}

If HTML output is needed, the last line is not necessary (and is even unwanted).
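An alternative sketch, assuming the target really is Windows-1252 rather than strict ISO-8859-1: converting to that charset maps the euro sign to \x80 without any str_replace step. The helper name is hypothetical, and the example swaps in mbstring's mb_convert_encoding() instead of iconv.

```php
<?php
// Sketch: convert UTF-8 text to Windows-1252, where "€" is the single byte 0x80.
// The helper name is hypothetical; it uses mbstring instead of iconv.
function utf8_to_cp1252(string $utf8): string
{
    return mb_convert_encoding($utf8, 'Windows-1252', 'UTF-8');
}

$price = "10 \xE2\x82\xAC"; // "10 €" in UTF-8
var_dump(utf8_to_cp1252($price) === "10 \x80"); // bool(true)
```

This only helps when the consumer of the output interprets the bytes as Windows-1252; a strict ISO-8859-1 consumer has no euro sign at all.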

[#14]

sunggsun [2006-08-15 00:26:19]

From PHPDIG:

function isUTF8($str) {

// Valid UTF-8 survives a round trip through UTF-32 unchanged.

return $str === mb_convert_encoding(mb_convert_encoding($str, "UTF-32", "UTF-8"), "UTF-8", "UTF-32");

}

[#15]

chris AT w3style.co DOT uk [2006-08-03 02:22:16]

Based on the preg_match() snippet below, I needed something faster and less specific. That function works and is brilliant, but it scans the entire string and checks that it conforms to UTF-8. I wanted something that only checks whether a string contains UTF-8 characters, so that I could switch the character encoding from iso-8859-1 to utf-8.

I modified the pattern to look only for non-ASCII multibyte sequences in the UTF-8 range and to stop once it finds at least one multibyte sequence. This is quite a lot faster.

<?php
function detectUTF8($string)
{
    return preg_match('%(?:
        [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |\xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
        |\xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |\xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        |[\xF1-\xF3][\x80-\xBF]{3}         # planes 4-15
        |\xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
        )+%xs', $string);
}
?>
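A sketch of the distinction this comment draws between "contains a UTF-8 multibyte sequence" and "is valid UTF-8 throughout". The helper name `contains_utf8_multibyte` is hypothetical; it reuses the alternatives from the pattern above, and the whole-string validity check uses the 'u' modifier as an assumed stand-in for a full validator.

```php
<?php
// Hypothetical helper: true if $string contains at least one well-formed
// UTF-8 multibyte sequence. Same alternatives as the pattern above, without
// the trailing "+", so matching stops at the first hit.
function contains_utf8_multibyte(string $string): bool
{
    return preg_match('%(?:
        [\xC2-\xDF][\x80-\xBF]
        |\xE0[\xA0-\xBF][\x80-\xBF]
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
        |\xED[\x80-\x9F][\x80-\xBF]
        |\xF0[\x90-\xBF][\x80-\xBF]{2}
        |[\xF1-\xF3][\x80-\xBF]{3}
        |\xF4[\x80-\x8F][\x80-\xBF]{2}
    )%xs', $string) === 1;
}

$mixed = "caf\xC3\xA9 \xE9"; // one UTF-8 "é" followed by a stray ISO-8859-1 "é"
var_dump(contains_utf8_multibyte($mixed));        // bool(true)
var_dump(preg_match('//u', $mixed) === 1);        // bool(false): not valid UTF-8 overall
var_dump(contains_utf8_multibyte("plain ASCII")); // bool(false)
```

The mixed string shows the trade-off: the fast check is a heuristic for "probably UTF-8", not a guarantee that every byte is well-formed.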

[#16]

telemach [2005-07-27 18:48:52]

[#17]

Chrigu [2005-03-29 07:32:23]

If you need to distinguish between UTF-8 and ISO-8859-1 encoding, list UTF-8 first in your encoding_list:

mb_detect_encoding($string, 'UTF-8, ISO-8859-1');

If you list ISO-8859-1 first, mb_detect_encoding() will always return ISO-8859-1, because any byte sequence is valid ISO-8859-1.
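A short sketch of the UTF-8-first ordering combined with strict mode. The sample strings are assumptions chosen for illustration (the word "café" in two encodings), and the example requires the mbstring extension.

```php
<?php
// Assumed samples: the word "café" in two encodings.
$latin1 = "caf\xE9";     // ISO-8859-1
$utf8   = "caf\xC3\xA9"; // UTF-8

// UTF-8 listed first, strict mode on: the ISO-8859-1 bytes are not valid
// UTF-8, so detection falls through to the next candidate.
var_dump(mb_detect_encoding($latin1, 'UTF-8, ISO-8859-1', true)); // string(10) "ISO-8859-1"

// The valid UTF-8 string, checked against the same list.
var_dump(mb_detect_encoding($utf8, 'UTF-8, ISO-8859-1', true));
```

Keeping the permissive single-byte encoding last means it only acts as a fallback once the stricter candidate has been ruled out.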

[#18]

php-note-2005 at ryandesign dot com [2005-02-17 07:57:14]

Much simpler UTF-8-ness checker using a regular expression created by the W3C:

<?php
function is_utf8($string) {
    return preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*$%xs', $string);
} // function is_utf8
?>

[#19]

jaaks at playtech dot com [2005-01-14 00:27:05]

The last example for verifying UTF-8 has one small bug: if a 10xxxxxx byte occurs alone, i.e. not as part of a multibyte character, it is accepted, although this violates the UTF-8 rules. Make the following replacement to fix it.

Replace

} // goto next char

with

} else {

return false; // 10xxxxxx occurring alone

} // goto next char

[#20]

maarten [2005-01-12 15:55:40]

Sometimes mb_detect_encoding is not what you need. When using pdflib, for example, you want to VERIFY the correctness of UTF-8, and mb_detect_encoding reports some ISO-8859-1 encoded text as UTF-8.

To verify UTF-8, use the following:

//utf8 encoding validation developed based on Wikipedia entry at:

//http://en.wikipedia.org/wiki/UTF-8

//Implemented as a recursive descent parser based on a simple state machine

//This cries out for a C-implementation to be included in PHP core

function valid_1byte($char) {

if(!is_int($char)) return false;

return ($char & 0x80) == 0x00;

}

function valid_2byte($char) {

if(!is_int($char)) return false;

return ($char & 0xE0) == 0xC0;

}

function valid_3byte($char) {

if(!is_int($char)) return false;

return ($char & 0xF0) == 0xE0;

}

function valid_4byte($char) {

if(!is_int($char)) return false;

return ($char & 0xF8) == 0xF0;

}

function valid_nextbyte($char) {

if(!is_int($char)) return false;

return ($char & 0xC0) == 0x80;

}

function valid_utf8($string) {

$len = strlen($string);

$i = 0;

while( $i < $len ) {

$char = ord(substr($string, $i++, 1));

if(valid_1byte($char)) {// continue

continue;

} else if(valid_2byte($char)) { // check 1 byte