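EDIT: now that I know better
Using regexp to solve this kind of problem is a bad idea and will likely lead to unmaintainable and unreliable code. Better to use an HTML parser.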
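For comparison, here is a minimal sketch of the parser route using PHP's bundled DOM extension. The sample markup in `$html` and the `$images` variable are made up for illustration; this is one possible way to do it, not the only one.

```php
<?php
// Hypothetical sample markup; in practice $html would hold the fetched page.
$html = '<p><img src="/a.png" alt="first"><img src=\'/b.png\' title="second one"></p>';

$doc = new DOMDocument();
// Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$images = [];
foreach ($doc->getElementsByTagName('img') as $node) {
    // getAttribute() returns an empty string when the attribute is missing.
    $images[] = [
        'src'   => $node->getAttribute('src'),
        'alt'   => $node->getAttribute('alt'),
        'title' => $node->getAttribute('title'),
    ];
}

print_r($images);
```

The parser copes with single quotes and missing attributes without any change to the code, which is exactly where the regexp approach needs extra care.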
Solution with regexp
In that case, it is better to split the process into two parts:
- get all the img tags
- extract their metadata
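I will assume your document is not strict xHTML, so you can't use an XML parser. For example, with this web page's source code:

    /* preg_match_all matches the regexp against the whole $html string and
       outputs everything as an array in $result. The "i" option makes it
       case-insensitive. */
    preg_match_all('/<img[^>]+>/i', $html, $result);

    print_r($result);

    Array
    (
        [0] => Array
            (
                [0] => <img ... src="/Content/Img/stackoverflow-logo-250.png" ... alt="logo link to homepage" />
                [1] => <img ... src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
                [2] => <img ... src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />
                [3] => <img ... src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" ... width=32 alt="gravatar image" />
                [4] => <img ... (click again to undo)" />
                [...]
            )
    )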
Then we get all the img tag attributes with a loop:

    $img = array();
    // $result[0] holds the full matches, i.e. the complete img tags.
    foreach ($result[0] as $img_tag) {
        preg_match_all('/(alt|title|src)=("[^"]*")/i', $img_tag, $img[$img_tag]);
    }

    print_r($img);

    Array
    (
        [<img ... src="/Content/Img/stackoverflow-logo-250.png" ... alt="logo link to homepage" />] => Array
            (
                [0] => Array
                    (
                        [0] => src="/Content/Img/stackoverflow-logo-250.png"
                        [1] => alt="logo link to homepage"
                    )
                [1] => Array
                    (
                        [0] => src
                        [1] => alt
                    )
                [2] => Array
                    (
                        [0] => "/Content/Img/stackoverflow-logo-250.png"
                        [1] => "logo link to homepage"
                    )
            )
        [<img ... src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />] => Array
            (
                [0] => Array
                    (
                        [0] => src="/content/img/vote-arrow-up.png"
                        [1] => alt="vote up"
                        [2] => title="This was helpful (click again to undo)"
                    )
                [1] => Array
                    (
                        [0] => src
                        [1] => alt
                        [2] => title
                    )
                [2] => Array
                    (
                        [0] => "/content/img/vote-arrow-up.png"
                        [1] => "vote up"
                        [2] => "This was helpful (click again to undo)"
                    )
            )
        [<img ... src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />] => Array
            (
                [0] => Array
                    (
                        [0] => src="/content/img/vote-arrow-down.png"
                        [1] => alt="vote down"
                        [2] => title="This was not helpful (click again to undo)"
                    )
                [1] => Array
                    (
                        [0] => src
                        [1] => alt
                        [2] => title
                    )
                [2] => Array
                    (
                        [0] => "/content/img/vote-arrow-down.png"
                        [1] => "vote down"
                        [2] => "This was not helpful (click again to undo)"
                    )
            )
        [<img ... src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" ... width=32 alt="gravatar image" />] => Array
            (
                [0] => Array
                    (
                        [0] => src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
                        [1] => alt="gravatar image"
                    )
                [1] => Array
                    (
                        [0] => src
                        [1] => alt
                    )
                [2] => Array
                    (
                        [0] => "http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
                        [1] => "gravatar image"
                    )
            )
        [..]
    )
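Regexps are CPU intensive, so you may want to cache this page. If you have no cache system, you can roll your own by using ob_start and loading/saving from a text file.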
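A rough sketch of such a home-made cache follows. The file name, the 300-second lifetime and the `render_page()` function are placeholders invented for this example; only `ob_start`, `ob_get_clean` and the file functions are standard PHP.

```php
<?php
// Hypothetical cache location and lifetime; adjust both for a real site.
$cacheFile = sys_get_temp_dir() . '/img_page_cache.txt';
$lifetime  = 300; // seconds

function render_page(): void
{
    // Stand-in for the expensive part (the preg_match_all calls and the output).
    echo "expensive page body\n";
}

if (is_file($cacheFile) && time() - filemtime($cacheFile) < $lifetime) {
    // Fresh cache: reuse the saved copy and skip the regexp work.
    $page = file_get_contents($cacheFile);
} else {
    ob_start();               // start capturing everything that is echoed
    render_page();
    $page = ob_get_clean();   // fetch the captured output and stop buffering
    file_put_contents($cacheFile, $page, LOCK_EX);
}

echo $page;
```

The regexps then only run when the cached copy is missing or older than the chosen lifetime.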
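How does this stuff work?
First, we use preg_match_all, a function that gets every string matching the pattern and outputs it in its third parameter.
The regexps:

    <img[^>]+>

We apply it to the whole HTML page. It can be read as: every string that starts with "<img", contains characters other than ">", and ends with a ">".

    (alt|title|src)=("[^"]*")

We apply it successively to each img tag. It can be read as: every string starting with "alt", "title" or "src", then an "=", then a '"', a bunch of stuff that is not '"', and ending with a '"'. The parentheses capture the attribute name and the quoted value separately, which is why the output has the sub-arrays [1] and [2].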
Finally, whenever you work with regexps, it is handy to have good tools to test them quickly. Check this online regexp tester.
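EDIT: answer to the first comment.
It's true that I did not think about the (hopefully few) people using single quotes.
Well, if you use only ', just replace every " with '.
If you mix both: first, you should slap yourself :-), then try using ("|') in place of " and adjust [^"] so that it excludes both quote characters.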
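For mixed quoting, one alternative to editing the pattern by hand is a backreference that forces the closing quote to match the opening one. This is a sketch of that variation, not the pattern from the answer above; the sample tag and the `$attributes` variable are invented for illustration.

```php
<?php
// Hypothetical tag mixing both quote styles.
$img_tag = '<img src=\'/a.png\' alt="it\'s a logo" title="home">';

// (["\']) captures the opening quote; \2 requires the same character to close the value.
preg_match_all('/(alt|title|src)=(["\'])(.*?)\2/i', $img_tag, $matches, PREG_SET_ORDER);

$attributes = [];
foreach ($matches as $m) {
    $attributes[strtolower($m[1])] = $m[3]; // attribute name => value without quotes
}

print_r($attributes);
```

Because the closing quote must equal the opening one, an apostrophe inside a double-quoted value no longer ends the match early.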