PHP: matching a keyword with a regular expression while skipping anchor tags

jacknrose

Published 2021-03-27 11:32:29


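I have written a regular expression that searches for a specific keyword and replaces that keyword with a link to a specific URL.

My current regular expression is: \b$keyword\b

One problem with it: if my data contains an anchor tag and that tag contains the keyword, the regex replaces the keyword inside the anchor tag as well.

I want to search the given data but skip anything inside anchor tags. Please help. Any help is appreciated.

For example, keyword: Disney

Input: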

This is Disney The disney should be replaceable

Expected output:

This is Disney The disney should be replaceable

Invalid output:

This is Disney The disney should be replaceable
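The failure described in the question can be reproduced with a short sketch. The sample string and the `/test.php` URL are assumptions made for illustration; only the `\b$keyword\b` pattern comes from the question.

```php
<?php
// Hypothetical sample data: the href value is an assumption, not from the question.
$keyword = 'disney';
$url     = '/test.php';
$input   = 'This is <a href="/test.php">Disney</a> The disney should be replaceable';

// Naive approach: wrap every whole-word occurrence of the keyword in a link.
$pattern = '#\b' . preg_quote($keyword, '#') . '\b#i';
$naive   = preg_replace($pattern, '<a href="' . $url . '">$0</a>', $input);

// The keyword inside the existing anchor is wrapped again, giving nested links.
echo $naive, PHP_EOL;
```

The first occurrence ends up as a link nested inside the existing link, which is the invalid output the question wants to avoid.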

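I can't help noticing that your expected string and your invalid string are the same.

@Grim: Thanks. I have corrected it.

This question is basically the same as yours: stackoverflow.com/questions/1315653/

I have adapted a function I use for highlighting search phrases on a page. Here you go: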

// Sample input. The markup was lost from this page, so the anchor and span tags below are assumed.
$html = 'This is <a href="any-url.php">Disney</a> The disney should be replaceable.'.PHP_EOL;
$html .= 'Let\'s test also use of keyword inside other tags, for example as class name:'.PHP_EOL;
$html .= '<span class="disney"></span> - this should not be replaced with link, and it isn\'t!'.PHP_EOL;

$result = ReplaceKeywordWithLink($html, "disney", "any-url.php");
echo nl2br(htmlspecialchars($result));

function ReplaceKeywordWithLink($html, $keyword, $link)
{
    // Only bother hiding tags if the text contains any markup at all.
    if (strpos($html, "<") !== false) {
        $id = 0;
        $unique_array = array();

        // Hide existing anchor tags (including their contents) with some unique string.
        preg_match_all('#<a\b[^>]*>[\s\S]*?</a>#i', $html, $matches);
        foreach ($matches[0] as $tag) {
            $id++;
            $unique_string = "@@@@@$id@@@@@";
            $unique_array[$unique_string] = $tag;
            $html = str_replace($tag, $unique_string, $html);
        }

        // Hide all remaining tags by replacing them with some unique string.
        preg_match_all('#<[^>]+>#', $html, $matches);
        foreach ($matches[0] as $tag) {
            $id++;
            $unique_string = "@@@@@$id@@@@@";
            $unique_array[$unique_string] = $tag;
            $html = str_replace($tag, $unique_string, $html);
        }
    }

    // Then we replace the keyword with link.
    // The second argument also escapes the "#" delimiter used in the pattern.
    $keyword = preg_quote($keyword, '#');
    assert(strpos($keyword, '$') === false);
    $html = preg_replace('#(\b)('.$keyword.')(\b)#i', '$1<a href="'.$link.'">$2</a>$3', $html);

    // We get back all the tags by replacing unique strings with their corresponding tag.
    if (isset($unique_array)) {
        foreach ($unique_array as $unique_string => $tag) {
            $html = str_replace($unique_string, $tag, $html);
        }
    }

    return $html;
}

Result:

This is Disney The disney should be replaceable.

Let's test also use of keyword inside other tags, for example as class name:

- this should not be replaced with link, and it isn't!

Add this to the end of your regex:

(?=[^

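This lookahead tries to match either the next opening <a> tag or the end of the input, but only if it does not see a closing </a> tag first. Assuming the HTML is minimally well-formed, the lookahead fails whenever the match starts after the beginning of an <a> tag and before the corresponding </a> tag.

To keep it from matching inside any other tag, you can also add this lookahead: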

(?![^<>]*+>)

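With this one I am assuming that there are no angle brackets in the attribute values of tags. Such brackets are legal according to the HTML 4 specification, but extremely rare in the real world.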
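To see what the `(?![^<>]*+>)` lookahead does on its own, here is a minimal sketch. The sample string and the square-bracket replacement are assumptions chosen only to make the effect visible.

```php
<?php
// Hypothetical sample: the keyword appears once as a class name and once as visible text.
$keyword = 'disney';
$input   = '<span class="disney">disney</span>';

// The negative lookahead rejects a match when a ">" follows with no "<" or ">"
// in between, which is the case when the match sits inside a tag.
$pattern = '#\b' . preg_quote($keyword, '#') . '\b(?![^<>]*+>)#i';
$result  = preg_replace($pattern, '[$0]', $input);

echo $result, PHP_EOL; // <span class="disney">[disney]</span>
```

Only the visible text is marked. The class name is left alone because the characters after it run straight into the closing `>` of the tag.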

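If you write the regex as a PHP double-quoted string (which you must do if you want the $keyword variable to be interpolated), it is safest to double all the backslashes. Neither \z nor \b is actually a problem: PHP's double-quoted strings do not define either as an escape sequence, so \b reaches the regex engine as a word-boundary assertion rather than as a backspace. Doubling just makes that explicit.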
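How PHP treats these backslashes can be checked directly. As far as I know, PHP's double-quoted strings leave unrecognized escapes such as `\b` untouched, so the single and doubled spellings give the same pattern. A quick sketch, with `$keyword` set to a sample value:

```php
<?php
$keyword = 'disney';

// Three spellings of the same pattern.
$interpolated = "#\b$keyword\b#i";          // double-quoted, single backslashes
$doubled      = "#\\b$keyword\\b#i";        // double-quoted, doubled backslashes
$single       = '#\b' . $keyword . '\b#i';  // single-quoted with concatenation

var_dump($interpolated === $doubled);   // bool(true)
var_dump($doubled === $single);         // bool(true)

// \b still acts as a word boundary: "disney" matches as a whole word.
var_dump(preg_match($interpolated, 'The disney store')); // int(1)
```

Doubling the backslashes is still a reasonable habit, because it keeps working if the pattern later gains a sequence that PHP does interpret, such as `\n` or `\$`.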

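Edit: On reflection, definitely add the second lookahead. I mean, why wouldn't you want to prevent matches inside tags? And put it first, because it tends to evaluate more quickly than the other one: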

(?![^<>]*+>)(?=[^
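The anchor lookahead is cut off in the text above, so the sketch below uses my own assumed version of it, written to behave as the answer describes (succeed at the next opening `<a>` tag or the end of input, fail if a closing `</a>` comes first). It is not necessarily the pattern the answerer posted, and the sample data and URL are also made up.

```php
<?php
// Hypothetical sample data and URL.
$keyword = 'disney';
$url     = '/test.php';
$input   = 'This is <a href="/test.php">Disney</a> The disney should be '
         . '<b class="disney">replaceable</b>';

$pattern = '#\b' . preg_quote($keyword, '#') . '\b'
         // From the answer: not inside any tag.
         . '(?![^<>]*+>)'
         // Assumed reconstruction: skip text and non-anchor tags, then require
         // an opening <a or the end of input. Meeting </a> first makes it fail.
         . '(?=[^<]*+(?:<(?!/?a\b)[^<]*+)*+(?:<a\b|\z))'
         . '#i';

$result = preg_replace($pattern, '<a href="' . $url . '">$0</a>', $input);
echo $result, PHP_EOL;
```

With this input only the plain-text "disney" is linked. The one inside the existing anchor and the one in the class attribute are left untouched. The same limitations the answer mentions (angle brackets in attribute values, malformed HTML) apply here too.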

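Have you tested these regexes? Your solution looks interesting, but I don't think it will actually work. There are probably many more exceptions, and a one-line regex cannot handle all of them.

@Czarek, there are many things that can cause this solution to fail, including CDATA sections, SGML comments, certain elements, angle brackets in attribute values, and invalid HTML. But that is true of almost every regex-based solution, yours included. Strictly speaking, it is impossible to process HTML with regular expressions, but we do it anyway. As long as you know its limitations, this technique is as safe as any other. I have used it many times.

@Czarek: ...but +1 for your answer too. I have never seen that technique implemented so thoroughly.

Strip the tags first, then search the stripped text.

That's no good if he wants to keep the tags, though.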

I want those anchor tags to stay in the data, so I can't use strip_tags for this.
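The objection is easy to confirm with a small check. The sample string is an assumption. `strip_tags` removes the existing anchor along with every other tag, so the original link cannot survive this approach.

```php
<?php
// Hypothetical sample input with one existing link.
$input    = 'This is <a href="/test.php">Disney</a> The disney should be replaceable';
$stripped = strip_tags($input);

// The anchor markup is gone; only the text content remains.
echo $stripped, PHP_EOL; // This is Disney The disney should be replaceable
```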
