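EDIT: now that I know better
Using regexp to solve this kind of problem is a bad idea and will likely lead to unmaintainable and unreliable code. Better use an HTML parser.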
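For the parser route, here is a minimal sketch that assumes PHP's bundled DOM extension is enabled. The markup in $html is made up for illustration, and the choice of DOMDocument is mine rather than something the rest of this answer depends on.

```php
<?php
// Hypothetical sample markup; any HTML string would do.
$html = '<p><img src="/a.png" alt="first"><img src="/b.png" alt="second" title="B"></p>';

// Collect parse warnings internally instead of printing them,
// since real-world HTML is rarely valid.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$images = array();
foreach ($doc->getElementsByTagName('img') as $node) {
    $images[] = array(
        'src'   => $node->getAttribute('src'),
        'alt'   => $node->getAttribute('alt'),
        'title' => $node->getAttribute('title'), // empty string when absent
    );
}
print_r($images);
```

The parser copes with attribute order, quoting style and odd whitespace by itself, which is exactly where the regexp approach below gets fragile.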
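Solution with regexp
In that case, it's better to split the process into two parts:
- get all the img tags
- extract their metadata
I will assume your doc is not strict xHTML, so you can't use an XML parser. E.g. with this web page's source code: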
/* preg_match_all matches the regexp against the whole $html string and
outputs everything as an array in $result. The "i" option makes the
match case insensitive */
preg_match_all('/<img[^>]+>/i', $html, $result);
print_r($result);
Array
(
    [0] => Array
        (
            [0] =>
            [1] =>
            [2] =>
            [3] =>
            [4] =>
            [...]
        )
)
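Then we get all the img tag attributes with a loop:
$img = array();
// $result[0] holds the full matches, i.e. the img tags themselves
foreach ($result[0] as $img_tag)
{
    preg_match_all('/(alt|title|src)=("[^"]*")/i', $img_tag, $img[$img_tag]);
}
print_r($img);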
Array
(
    [] => Array
        (
            [0] => Array
                (
                    [0] => src="/Content/Img/stackoverflow-logo-250.png"
                    [1] => alt="logo link to homepage"
                )
            [1] => Array
                (
                    [0] => src
                    [1] => alt
                )
            [2] => Array
                (
                    [0] => "/Content/Img/stackoverflow-logo-250.png"
                    [1] => "logo link to homepage"
                )
        )
    [] => Array
        (
            [0] => Array
                (
                    [0] => src="/content/img/vote-arrow-up.png"
                    [1] => alt="vote up"
                    [2] => title="This was helpful (click again to undo)"
                )
            [1] => Array
                (
                    [0] => src
                    [1] => alt
                    [2] => title
                )
            [2] => Array
                (
                    [0] => "/content/img/vote-arrow-up.png"
                    [1] => "vote up"
                    [2] => "This was helpful (click again to undo)"
                )
        )
    [] => Array
        (
            [0] => Array
                (
                    [0] => src="/content/img/vote-arrow-down.png"
                    [1] => alt="vote down"
                    [2] => title="This was not helpful (click again to undo)"
                )
            [1] => Array
                (
                    [0] => src
                    [1] => alt
                    [2] => title
                )
            [2] => Array
                (
                    [0] => "/content/img/vote-arrow-down.png"
                    [1] => "vote down"
                    [2] => "This was not helpful (click again to undo)"
                )
        )
    [] => Array
        (
            [0] => Array
                (
                    [0] => src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
                    [1] => alt="gravatar image"
                )
            [1] => Array
                (
                    [0] => src
                    [1] => alt
                )
            [2] => Array
                (
                    [0] => "http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
                    [1] => "gravatar image"
                )
        )
    [..]
        )
)
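The nested arrays above are a bit awkward to consume. As a possible follow-up step, not part of the original two-step process, the names and values can be paired into one associative array per tag. The tag in $img_tag is a stand-in for one of the matched img tags.

```php
<?php
// Hypothetical tag, standing in for one of the matched img tags.
$img_tag = '<img src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />';

preg_match_all('/(alt|title|src)=("[^"]*")/i', $img_tag, $matches);

// $matches[1] holds the attribute names, $matches[2] the quoted values.
$attributes = array();
foreach ($matches[1] as $i => $name) {
    // Normalise the name and strip the surrounding double quotes.
    $attributes[strtolower($name)] = trim($matches[2][$i], '"');
}
print_r($attributes);
```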
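Regexps are CPU intensive, so you may want to cache this page. If you have no cache system, you can roll your own by using ob_start and loading / saving from a text file.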
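A rough sketch of such a home-made cache could look like this. The file location, the one-hour lifetime and the render_report() helper are all assumptions made for the example, not anything prescribed here.

```php
<?php
// Hypothetical cache location and lifetime; adjust both to your setup.
$cache_file = sys_get_temp_dir() . '/img_report.cache';
$max_age    = 3600; // seconds

// Hypothetical stand-in for the expensive regexp work and print_r calls.
function render_report()
{
    echo "expensive output\n";
}

if (is_file($cache_file) && time() - filemtime($cache_file) < $max_age) {
    // Fresh enough: serve the saved copy and skip the regexps.
    $output = file_get_contents($cache_file);
} else {
    ob_start();               // capture everything echoed from here on
    render_report();
    $output = ob_get_clean(); // fetch the buffer and stop buffering
    file_put_contents($cache_file, $output);
}
echo $output;
```

A real setup would also want to invalidate the file whenever the source page changes; the age check above is only the simplest policy.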
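How does this stuff work?
First, we use preg_match_all, a function that gets every string matching the pattern and outputs it in its third parameter.
The regexps:
<img[^>]+>
We apply it to the whole HTML page. It can be read as: every string that starts with "<img", contains one or more non-">" chars and ends with a ">".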
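To see that first pattern on its own, here is a small sketch run against a made-up snippet (the markup in $html is hypothetical). It also shows the effect of the "i" option on an upper-case tag.

```php
<?php
// Hypothetical markup with mixed-case tags and some non-image tags.
$html = '<p>text</p><IMG SRC="/a.png" alt="a"><a href="#">x</a><img src="/b.png" />';

// preg_match_all returns the number of full matches it found.
$count = preg_match_all('/<img[^>]+>/i', $html, $result);

// $result[0] lists the full matches, in document order.
print_r($result[0]);
```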
(alt|title|src)=("[^"]*")
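We apply it successively to each img tag. It can be read as: every string starting with "alt", "title" or "src", then a "=", then a '"', a bunch of stuff that is not '"', and a closing '"'. The parentheses isolate the sub-strings: the attribute name and its quoted value.
Finally, every time you want to deal with regexps, it's handy to have good tools to quickly test them. Check this online regexp tester.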
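EDIT: answer to the first comment.
It's true that I did not think about the (hopefully few) people using single quotes.
Well, if you use only ', just replace every " with '.
If you mix both, first you should slap yourself :-), then try to use ("|') in place of each " and [^ø] to replace [^"].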
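One way to handle mixed quotes, sketched here with a backreference rather than the replacement suggested above, is to capture the opening quote and require the same character to close the value. The tag in $img_tag is invented for the example.

```php
<?php
// Hypothetical tag mixing both quote styles.
$img_tag = '<img src=\'/a.png\' alt="it\'s a logo" title=\'say "hi"\' />';

// Group 2 captures the opening quote; \2 requires the same quote to close.
// Group 3 is the value, matched lazily so it stops at the first closing quote.
preg_match_all('/(alt|title|src)=(["\'])(.*?)\2/i', $img_tag, $m);

// Pair each lower-cased attribute name with its unquoted value.
$attributes = array_combine(array_map('strtolower', $m[1]), $m[3]);
print_r($attributes);
```

Because the closing quote must equal the opening one, a ' inside a "..." value (or the reverse) no longer cuts the value short.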