I found some examples online of how to detect a file's encoding in PHP.
They go roughly as follows:
define ('UTF32_BIG_ENDIAN_BOM' , chr(0x00) . chr(0x00) . chr(0xFE) . chr(0xFF));
define ('UTF32_LITTLE_ENDIAN_BOM', chr(0xFF) . chr(0xFE) . chr(0x00) . chr(0x00));
define ('UTF16_BIG_ENDIAN_BOM' , chr(0xFE) . chr(0xFF));
define ('UTF16_LITTLE_ENDIAN_BOM', chr(0xFF) . chr(0xFE));
define ('UTF8_BOM' , chr(0xEF) . chr(0xBB) . chr(0xBF));
function detect_utf_encoding($text) {
$first2 = substr($text, 0, 2);
$first3 = substr($text, 0, 3);
$first4 = substr($text, 0, 4); // the UTF-32 BOMs are 4 bytes long
if ($first3 == UTF8_BOM) return 'UTF-8';
elseif ($first4 == UTF32_BIG_ENDIAN_BOM) return 'UTF-32BE';
elseif ($first4 == UTF32_LITTLE_ENDIAN_BOM) return 'UTF-32LE';
elseif ($first2 == UTF16_BIG_ENDIAN_BOM) return 'UTF-16BE';
elseif ($first2 == UTF16_LITTLE_ENDIAN_BOM) return 'UTF-16LE';
}
function getFileEncoding($str){
$encoding=mb_detect_encoding($str);
if(empty($encoding)){
$encoding=detect_utf_encoding($str);
}
return $encoding;
}
$file = 'text1.txt';
echo getFileEncoding(file_get_contents($file)); // prints ASCII
echo '';
$file = 'text2.txt';
echo getFileEncoding(file_get_contents($file)); // prints UTF-8
echo '';
$file = 'text3.txt';
echo getFileEncoding(file_get_contents($file)); // prints UTF-16LE
echo '';
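However, I found that this example detects some of my files incorrectly.
The sample files in the attachment are one such case.
The code is as follows: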
<?php
define ('UTF32_BIG_ENDIAN_BOM' , chr(0x00) . chr(0x00) . chr(0xFE) . chr(0xFF));
define ('UTF32_LITTLE_ENDIAN_BOM', chr(0xFF) . chr(0xFE) . chr(0x00) . chr(0x00));
define ('UTF16_BIG_ENDIAN_BOM' , chr(0xFE) . chr(0xFF));
define ('UTF16_LITTLE_ENDIAN_BOM', chr(0xFF) . chr(0xFE));
define ('UTF8_BOM' , chr(0xEF) . chr(0xBB) . chr(0xBF));
function detect_utf_encoding($text) {
$first2 = substr($text, 0, 2);
$first3 = substr($text, 0, 3);
$first4 = substr($text, 0, 4); // the UTF-32 BOMs are 4 bytes long
if ($first3 == UTF8_BOM) return 'UTF-8';
elseif ($first4 == UTF32_BIG_ENDIAN_BOM) return 'UTF-32BE';
elseif ($first4 == UTF32_LITTLE_ENDIAN_BOM) return 'UTF-32LE';
elseif ($first2 == UTF16_BIG_ENDIAN_BOM) return 'UTF-16BE';
elseif ($first2 == UTF16_LITTLE_ENDIAN_BOM) return 'UTF-16LE';
}
function getFileEncoding($str){
$encoding=mb_detect_encoding($str);
if(empty($encoding)){
$encoding=detect_utf_encoding($str);
}
return $encoding;
}
$gbkFileContent = file_get_contents('txt/test_gbk.txt');
$utf8FileContent = file_get_contents('txt/test_utf-8.txt');
echo 'func----test_gbk_encoding:'.getFileEncoding($gbkFileContent).'<br>';
echo 'func----test_utf8_encoding:'.getFileEncoding($utf8FileContent).'<br>';
echo '<br><br>The functions above do not seem to detect it correctly<br>Let us try the following<br>';
echo 'mb_detect_encoding-----gbk:';
echo mb_detect_encoding($gbkFileContent, "gb2312, UTF-8").'<br>';
echo '<br>mb_detect_encoding-----utf8:';
echo mb_detect_encoding($utf8FileContent, "gb2312, UTF-8").'<br>';
echo iconv("UTF-8", "gb2312//IGNORE", $utf8FileContent);
?>
The output is as follows (the last line is the content of test_utf-8.txt, which reads "I am utf-8"):
func----test_gbk_encoding:UTF-8
func----test_utf8_encoding:UTF-8
The functions above do not seem to detect it correctly
Let us try the following
mb_detect_encoding-----gbk:EUC-CN
mb_detect_encoding-----utf8:UTF-8
我是utf-8
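One possible explanation for the first two output lines, offered as an assumption rather than a confirmed diagnosis: when mb_detect_encoding is called without a candidate list, it only considers the configured detect order (commonly just ASCII and UTF-8), so it has no way to report a GBK file as GBK. Below is a minimal sketch of passing an explicit candidate list in strict mode. The helper name detect_among and the default candidate list are hypothetical and only for illustration.

```php
<?php
// Hypothetical helper (not from the original post): detect the encoding
// among an explicit, ordered candidate list, using strict mode so that
// candidates containing invalid byte sequences are rejected.
function detect_among(string $bytes, array $candidates = ['UTF-8', 'EUC-CN']): ?string
{
    $found = mb_detect_encoding($bytes, $candidates, true);

    // mb_detect_encoding returns false when no candidate fits.
    return $found === false ? null : $found;
}
```

With the files from the post this could be called as detect_among(file_get_contents('txt/test_gbk.txt')). Text that is valid in more than one candidate encoding (plain ASCII, for instance) remains ambiguous, so the result is a best guess and not a proof.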
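Calling mb_detect_encoding directly also has its problems, so the issue is still not fully solved. My requirement here is to convert UTF-8 files to other encodings, so it is enough to check whether a file is UTF-8, process it if so, and leave everything else untouched. But if files in other encodings are misdetected, the problem is still not solved completely.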
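For the narrower requirement of converting only when the input is UTF-8, validating the bytes may be more reliable than guessing among encodings. The following is a minimal sketch under that assumption. convert_if_utf8 is a hypothetical helper name, and the default target CP936 (mbstring's name for GBK) is only an example.

```php
<?php
// Hypothetical helper (not from the original post): convert the bytes to
// $target only when they are well-formed UTF-8; otherwise return them as-is.
function convert_if_utf8(string $bytes, string $target = 'CP936'): string
{
    // Validate instead of guessing: malformed UTF-8 is left untouched.
    if (!mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }

    // Drop a leading UTF-8 BOM so it does not end up in the converted text.
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        $bytes = substr($bytes, 3);
    }

    return mb_convert_encoding($bytes, $target, 'UTF-8');
}
```

Note that plain ASCII text always passes the UTF-8 check, and short non-UTF-8 inputs can occasionally pass it by coincidence, so this reduces misdetection without eliminating it.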
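I am posting this for discussion. Could it be that the TXT files themselves are the problem, for example that they are not standard?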
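One way to check whether a text file is unusual is to look at its first few bytes and see whether it starts with a BOM at all. The sketch below does only that. leading_bytes_hex is a hypothetical helper name introduced for illustration.

```php
<?php
// Hypothetical helper (not from the original post): return the first
// $count bytes of a string as upper-case hex pairs separated by spaces,
// so that a BOM such as "EF BB BF" can be spotted by eye.
function leading_bytes_hex(string $bytes, int $count = 4): string
{
    $hex = bin2hex(substr($bytes, 0, $count));

    return strtoupper(implode(' ', str_split($hex, 2)));
}
```

For example, echo leading_bytes_hex(file_get_contents('txt/test_utf-8.txt')); would show whether the UTF-8 test file begins with EF BB BF. A file without a BOM is still perfectly valid, so BOM-based detection alone cannot identify it.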