Parsing UTF-8 and GBK RSS feeds with MagpieRSS (with notes on character-oriented programming in PHP)

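My first attempt at MagpieRSS failed because iconv and mbstring were not installed. Today I installed iconv and mbstring support on the server and took a close look at how lilina uses rss_fetch. The most important step is to specify the output encoding by defining MAGPIE_OUTPUT_ENCODING as 'UTF-8'.

Sample code: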
<?php
// $Id$
// including
require_once("rss_fetch.inc");

// specify the output encoding (the default is ISO-8859-1)
define('MAGPIE_OUTPUT_ENCODING', 'UTF-8');
define('MAGPIE_FETCH_TIME_OUT', 60 * 180);

$url = $_GET['url'];
$rss = fetch_rss($url);

print_r($rss);
?>

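While designing gRaSSland I looked at the print_r output and found that RSS 1.0, RSS 2.0 and Atom differ somewhat in how they expose the author and the entry summary.

The output for each format follows, in this order: RSS 1.0, RSS 2.0, Atom.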
magpierss Object
(
    [parser] => Resource id #10
    [current_item] => Array
        (
        )

    [items] => Array
        (
            [0] => Array
                (
                    [about] => http://blog.cnblog.org/demo/archives/2004/12/another_story.html
                    [title] => another story
                    [link] => http://blog.cnblog.org/demo/archives/2004/12/another_story.html
                    [description] => foo
                    [dc] => Array
                        (
                            [subject] => foo
                            [creator] => chedong
                            [date] => 2004-12-17T22:24:41+08:00
                        )

                    [summary] => foo
                    [date_timestamp] => 1103293440
                )

            [1] => Array
                (
                    [about] => http://blog.cnblog.org/demo/archives/2004/12/grassland_demo.html
                    [title] => GrassLand demo
                    [link] => http://blog.cnblog.org/demo/archives/2004/12/grassland_demo.html
                    [description] => body text with <b>html</b>
                    [dc] => Array
                        (
                            [subject] => foo
                            [creator] => chedong
                            [date] => 2004-12-17T22:18:17+08:00
                        )

                    [summary] => body text with <b>html</b>
                    [date_timestamp] => 1103293080
                )

        )

    [channel] => Array
        (
            [title] => demo
            [link] => http://blog.cnblog.org/demo/
            [dc] => Array
                (
                    [date] => 2004-12-17T22:24:41+08:00
                )

            [items] =>
            [items_seq] =>
            [tagline] =>
        )

    [textinput] => Array
        (
        )

    [image] => Array
        (
        )

    [feed_type] => RSS
    [feed_version] => 1.0
    [encoding] => UTF-8
    [_source_encoding] =>
    [ERROR] =>
    [WARNING] =>
    [_CONTENT_CONSTRUCTS] => Array
        (
            [0] => content
            [1] => summary
            [2] => info
            [3] => title
            [4] => tagline
            [5] => copyright
        )

    [_KNOWN_ENCODINGS] => Array
        (
            [0] => UTF-8
            [1] => US-ASCII
            [2] => ISO-8859-1
        )

    [stack] => Array
        (
        )

    [inchannel] =>
    [initem] =>
    [incontent] =>
    [intextinput] =>
    [inimage] =>
    [current_field] =>
    [current_namespace] =>
    [source_encoding] => GB2312
    [last_modified] => Fri, 17 Dec 2004 14:29:18 GMT
    [etag] => "c2ab4-615-41c2ed3e"
)

magpierss Object
(
    [parser] => Resource id #10
    [current_item] => Array
        (
        )

    [items] => Array
        (
            [0] => Array
                (
                    [title] => another story
                    [description] => foo
                    [link] => http://blog.cnblog.org/demo/archives/2004/12/another_story.html
                    [guid] => http://blog.cnblog.org/demo/archives/2004/12/another_story.html
                    [category] => foo
                    [pubdate] => Fri, 17 Dec 2004 22:24:41 +0800
                    [summary] => foo
                    [date_timestamp] => 1103293481
                )

            [1] => Array
                (
                    [title] => GrassLand demo
                    [description] => body text with <b>html</b>
                    [link] => http://blog.cnblog.org/demo/archives/2004/12/grassland_demo.html
                    [guid] => http://blog.cnblog.org/demo/archives/2004/12/grassland_demo.html
                    [category] => foo
                    [pubdate] => Fri, 17 Dec 2004 22:18:17 +0800
                    [summary] => body text with <b>html</b>
                    [date_timestamp] => 1103293097
                )

        )

    [channel] => Array
        (
            [title] => demo
            [link] => http://blog.cnblog.org/demo/
            [copyright] => Copyright 2004
            [lastbuilddate] => Fri, 17 Dec 2004 22:24:41 +0800
            [generator] => http://www.movabletype.org/?v=3.11
            [docs] => http://blogs.law.harvard.edu/tech/rss
            [tagline] =>
        )

    [textinput] => Array
        (
        )

    [image] => Array
        (
        )

    [feed_type] => RSS
    [feed_version] => 2.0
    [encoding] => UTF-8
    [_source_encoding] =>
    [ERROR] =>
    [WARNING] =>
    [_CONTENT_CONSTRUCTS] => Array
        (
            [0] => content
            [1] => summary
            [2] => info
            [3] => title
            [4] => tagline
            [5] => copyright
        )

    [_KNOWN_ENCODINGS] => Array
        (
            [0] => UTF-8
            [1] => US-ASCII
            [2] => ISO-8859-1
        )

    [stack] => Array
        (
        )

    [inchannel] =>
    [initem] =>
    [incontent] =>
    [intextinput] =>
    [inimage] =>
    [current_field] =>
    [current_namespace] =>
    [source_encoding] => GB2312
    [last_modified] => Fri, 17 Dec 2004 14:29:19 GMT
    [etag] => "c2ab3-40f-41c2ed3f"
)

magpierss Object
(
    [parser] => Resource id #10
    [current_item] => Array
        (
        )

    [items] => Array
        (
            [0] => Array
                (
                    [title] => another story
                    [link] => http://blog.cnblog.org/demo/archives/2004/12/another_story.html
                    [modified] => 2004-12-17T14:27:38Z
                    [issued] => 2004-12-17T14:24:41Z
                    [id] => tag:blog.cnblog.org,2004:/demo/6.3690
                    [created] => 2004-12-17T14:24:41Z
                    [summary] => foo
                    [author] =>
                    [author_name] => chedong
                    [author_url] => http://www.chedong.com
                    [author_email] => chedong@hotmail.com
                    [dc] => Array
                        (
                            [subject] => foo
                        )

                    [atom_content] =>
<p>foo</p>
<p>bar</p>

                    [description] => foo
                    [content] => Array
                        (
                            [encoded] =>
<p>foo</p>
<p>bar</p>

                        )

                    [date_timestamp] => 1103293620
                )

            [1] => Array
                (
                    [title] => GrassLand demo
                    [link] => http://blog.cnblog.org/demo/archives/2004/12/grassland_demo.html
                    [modified] => 2004-12-17T14:27:39Z
                    [issued] => 2004-12-17T14:18:17Z
                    [id] => tag:blog.cnblog.org,2004:/demo/6.3689
                    [created] => 2004-12-17T14:18:17Z
                    [summary] => body text with html
                    [author] =>
                    [author_name] => chedong
                    [author_url] => http://www.chedong.com
                    [author_email] => chedong@hotmail.com
                    [dc] => Array
                        (
                            [subject] => foo
                        )

                    [atom_content] =>
<p>body text with <b>html</b><br />
modified by Che Dong</p>
<p>extended entry with <b>html</b></p>

                    [description] => body text with html
                    [content] => Array
                        (
                            [encoded] =>
<p>body text with <b>html</b><br />
modified by Che Dong</p>
<p>extended entry with <b>html</b></p>

                        )

                    [date_timestamp] => 1103293620
                )

        )

    [channel] => Array
        (
            [title] => demo
            [link] => http://blog.cnblog.org/demo/
            [modified] => 2004-12-17T14:27:38Z
            [id] => tag:blog.cnblog.org,2004:/demo/6
            [generator] => Movable Type
            [copyright] => Copyright (c) 2004, chedong
            [description] =>
        )

    [textinput] => Array
        (
        )

    [image] => Array
        (
        )

    [feed_type] => Atom
    [feed_version] => 0.3
    [encoding] => UTF-8
    [_source_encoding] =>
    [ERROR] =>
    [WARNING] =>
    [_CONTENT_CONSTRUCTS] => Array
        (
            [0] => content
            [1] => summary
            [2] => info
            [3] => title
            [4] => tagline
            [5] => copyright
        )

    [_KNOWN_ENCODINGS] => Array
        (
            [0] => UTF-8
            [1] => US-ASCII
            [2] => ISO-8859-1
        )

    [stack] => Array
        (
        )

    [inchannel] =>
    [initem] =>
    [incontent] =>
    [intextinput] =>
    [inimage] =>
    [current_field] =>
    [current_namespace] =>
    [source_encoding] => GB2312
    [last_modified] => Fri, 17 Dec 2004 14:29:19 GMT
    [etag] => "c2ab0-776-41c2ed3f"
)


So when fetching feeds, the author field has to be mapped differently for each feed format:
foreach ($rss->items as $item) {
    if ($rss->feed_type == "RSS" && $rss->feed_version == "1.0") {
        // RSS 1.0 carries the author in dc:creator
        $item['author'] = $item['dc']['creator'];
    }
    else if ($rss->feed_type == "Atom") {
        // Atom exposes the author name as author_name
        $item['author'] = $item['author_name'];
    }
    else {
        // RSS 2.0 and others: fall back to the channel title
        $item['author'] = $rss->channel['title'];
    }
    // print_r($item);
    // Note: the mysql_* functions used here were removed in PHP 7;
    // on current PHP, use mysqli or PDO with prepared statements instead.
    $sql = "INSERT INTO `grassland` ( `url` , `title` , `author` , `content` , `pubdate` , `author_url` , `author_rss` ) VALUES ('" .
        mysql_escape_string($item['link']) .
        "' , '" . mysql_escape_string($item['title']) .
        "' , '" . mysql_escape_string($item['author']) .
        "' , '" . mysql_escape_string($item['description']) .
        "' , '" . mysql_escape_string($item['date_timestamp']) .
        "' , '" . mysql_escape_string($rss->channel['link']) .
        "' , '" . mysql_escape_string($url) . "') ";
    // print $sql;
    $result = mysql_query($sql);
}
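
The per-format rule can also be pulled out into a small function, which makes it easy to check without a database. This is only a sketch: the name resolve_author is hypothetical (it is not part of MagpieRSS), and the field names are taken from the print_r dumps shown earlier. Unlike the loop above, it also falls back to the channel title when the expected author field is missing.

```php
<?php
// Hypothetical helper (not part of MagpieRSS): choose an item's author
// according to the feed format, using the field layout seen in the dumps.
function resolve_author($feed_type, $feed_version, array $item, array $channel)
{
    // RSS 1.0: the author lives in the Dublin Core "creator" element.
    if ($feed_type == 'RSS' && $feed_version == '1.0' && !empty($item['dc']['creator'])) {
        return $item['dc']['creator'];
    }
    // Atom: MagpieRSS flattens <author><name> into author_name.
    if ($feed_type == 'Atom' && !empty($item['author_name'])) {
        return $item['author_name'];
    }
    // RSS 2.0 and anything else: fall back to the channel title.
    return isset($channel['title']) ? $channel['title'] : '';
}
```

Inside the loop it would be called as resolve_author($rss->feed_type, $rss->feed_version, $item, $rss->channel).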


Let us learn from how Steve solved the problem. Translated from:
http://minutillo.com/steve/weblog/2004/6/17/php-xml-and-character-encodings-a-tale-of-sadness-rage-and-data-loss

....
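Question: how do you know which character set an XML document uses? The only answer is to scan the XML declaration yourself and work out the character set, with code like this:

// the quotes inside the character class and the "?" after "<" must be escaped
$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*?>/m';

if (preg_match($rx, $xml, $m)) {
    $encoding = strtoupper($m[1]);
} else {
    $encoding = "UTF-8";
}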

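The regular expression finds the character set that the XML declares for itself. If one is found, it is stored in $encoding; if not, the document is treated as UTF-8 (which is also the XML default). The complete code:

$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*?>/m';

if (preg_match($rx, $xml, $m)) {
    $encoding = strtoupper($m[1]);
} else {
    $encoding = "UTF-8";
}

// parse in the declared encoding, but hand UTF-8 to the handlers
$parser = xml_parser_create($encoding);
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");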
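
The detection step can be wrapped in a function and tried against a few declarations. The following is a sketch under assumptions rather than Steve's code: the name detect_xml_encoding is hypothetical, and the pattern is deliberately stricter than the one above (it only accepts a declaration at the very start of the document, optionally after a UTF-8 byte order mark, and never looks past the first ">").

```php
<?php
// Hypothetical helper: return the encoding named in the XML declaration,
// upper-cased, or "UTF-8" (the XML default) when none is declared.
function detect_xml_encoding($xml)
{
    // Anchored at the start; tolerates a UTF-8 BOM and leading whitespace.
    $rx = '/^(?:\xEF\xBB\xBF)?\s*<\?xml[^>]*?encoding\s*=\s*[\'"]([^\'"]+)[\'"][^>]*\?>/';
    if (preg_match($rx, $xml, $m)) {
        return strtoupper($m[1]);
    }
    return 'UTF-8';
}

echo detect_xml_encoding('<?xml version="1.0" encoding="gb2312"?><rss/>'), "\n"; // GB2312
echo detect_xml_encoding('<?xml version="1.0"?><rss/>'), "\n";                   // UTF-8
```

A feed whose body happens to contain the text encoding="..." is not misread, because the match cannot extend beyond the declaration itself.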

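That worked: every feed now came out as UTF-8. Then it was Big5's turn. Another look at the PHP documentation and source code showed that PHP 4.x supports only UTF-8, ISO-8859-1 and US-ASCII here, so Big5 or Shift-JIS feeds still come out garbled. PHP 5 is no help either (translator's note: I tried PHP 5 as well). The final PHP 5 release is supposed to include Big5 and GB2312, the two main Chinese encodings. A search of the PHP documentation turned up a potential solution: mbstring. The mbstring family of functions supports a very long list of character sets and can convert between them.

Final solution: use the regex to detect the character set of the source; if PHP cannot handle it natively, convert the source to UTF-8 with mb_convert_encoding before parsing, then parse it as UTF-8. The chance that parsing still fails remains fairly high, though. The code I tried: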

$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*?>/m';

if (preg_match($rx, $source, $m)) {
    $encoding = strtoupper($m[1]);
} else {
    $encoding = "UTF-8";
}

if ($encoding == "UTF-8" || $encoding == "US-ASCII" || $encoding == "ISO-8859-1") {
    // encodings the PHP XML parser handles natively
    $parser = xml_parser_create($encoding);
} else {
    // anything else: convert to UTF-8 first, if mbstring is available
    $encoded_source = NULL;
    if (function_exists('mb_convert_encoding')) {
        $encoded_source = @mb_convert_encoding($source, "UTF-8", $encoding);
    }

    if ($encoded_source != NULL) {
        // rewrite the declaration so it matches the converted text
        $source = str_replace($m[0], '<?xml version="1.0" encoding="utf-8"?>', $encoded_source);
    }

    $parser = xml_parser_create("UTF-8");
}

xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
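
To see the convert-then-rewrite step in isolation, here is a sketch that folds it into one function and runs it on a tiny GB2312 document. It assumes the mbstring extension is loaded; the function name xml_source_to_utf8 is hypothetical, and the declaration pattern is a tightened variant of the one used above. When the conversion is not possible, the source is returned unchanged.

```php
<?php
// Hypothetical helper: return $source converted to UTF-8, with its XML
// declaration rewritten to say so. Assumes the mbstring extension.
function xml_source_to_utf8($source)
{
    $rx = '/<\?xml[^>]*?encoding\s*=\s*[\'"]([^\'"]+)[\'"][^>]*\?>/';
    if (!preg_match($rx, $source, $m)) {
        return $source; // no declared encoding: XML default is UTF-8
    }
    $encoding = strtoupper($m[1]);
    if ($encoding == 'UTF-8') {
        return $source;
    }
    try {
        // Older PHP warns and returns false for an unknown encoding,
        // PHP 8 throws; both cases end up returning the source as-is.
        $converted = @mb_convert_encoding($source, 'UTF-8', $encoding);
    } catch (\Throwable $e) {
        $converted = false;
    }
    if ($converted === false || $converted === null) {
        return $source;
    }
    // The declaration is plain ASCII, so it survives conversion unchanged
    // for ASCII-compatible encodings and can be replaced textually.
    return str_replace($m[0], '<?xml version="1.0" encoding="utf-8"?>', $converted);
}

// "中文" as GB2312 bytes inside a document that declares GB2312.
$gb = '<?xml version="1.0" encoding="GB2312"?>' . "<t>\xD6\xD0\xCE\xC4</t>";
$utf8 = xml_source_to_utf8($gb);
```

The result can then be handed to a parser created with xml_parser_create("UTF-8"), as in the code above.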

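:) The hack works: ISO-8859-15, Big5 and even GB2312 feeds all parse without problems, and all of them can be converted to UTF-8 and shown on the same page... I can declare this a very artful way of handling XML character set declarations in PHP, and it would be best if it were added to PHP 4.x.


Commercial-grade platforms such as Java handle multilingual character sets rather well; I ran some experiments earlier that illustrate how Java supports character sets. With iconv and mbstring available, PHP applications can finally be written against characters rather than against bytes...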
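
A short illustration of the difference between byte-oriented and character-oriented calls, assuming the mbstring extension is loaded. The sample string is written as explicit UTF-8 bytes so the result does not depend on how this file itself is encoded.

```php
<?php
// "中文" (two characters) as UTF-8 bytes: three bytes per character.
$text = "\xE4\xB8\xAD\xE6\x96\x87";

$byte_count = strlen($text);             // 6: strlen counts bytes
$char_count = mb_strlen($text, 'UTF-8'); // 2: mb_strlen counts characters

// Cutting by bytes splits a character; cutting by characters does not.
$first_byte = substr($text, 0, 1);             // one byte, not valid UTF-8 on its own
$first_char = mb_substr($text, 0, 1, 'UTF-8'); // the whole first character

echo $byte_count, " bytes, ", $char_count, " characters\n";
```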

I plan to use MagpieRSS as the RSS parser for GrassLand. The next steps are synchronising page fetching and adding data incrementally.

2004-12-24
Statistics on RSS versions and character sets
Publishing with MT (Movable Type) normally produces three feed files: RSS 1.0, RSS 2.0 and Atom 0.3
http://blog.cnblog.org/index.xml RSS 2.0 GBK
http://blog.cnblog.org/index.rdf RSS 1.0 GBK
http://blog.cnblog.org/atom.xml Atom 0.3 GBK

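Statistics over the several hundred feeds currently registered with gRaSSland show that RSS is overwhelmingly dominant:
357 RSS
7
The corresponding RSS versions are:
222 1.0
122 2.0
11 0.92
7
4 0.91
The encoding used for publishing, however, is mostly GB2312:
233 GB2312
79 UTF-8
54
2 ISO-8859-1

This makes fault tolerance in the RSS parser very important.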

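Author: Che Dong  Posted: 2004-12-12 22:12  Last updated: 2007-04-15 19:04
Copyright notice: this article may be reproduced, provided the reproduction links to the original source and includes the author information and this copyright notice.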
