HTML text-extraction regex: extracting text from HTML with regular expressions

Latest recommended article published on 2024-07-18 09:52:46

一代目


Article tags: html text extraction regex

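12 answers:

Answer 0 (score: 15)

First remove the JavaScript and CSS, that is, the script and style elements together with their contents.

Then remove the remaining tags.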
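The two steps above can be sketched in Python as follows. This is a minimal illustration only: the patterns and the function name strip_markup are assumptions chosen for this example, not taken from the answer, and like any regex approach they will misfire on unusual markup (for instance a ">" inside an attribute value).

```python
import re

# Assumed pattern: a <script> or <style> element together with its contents.
SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
# Assumed pattern: any remaining tag.
TAG_RE = re.compile(r"<[^>]+>")


def strip_markup(html):
    """Remove script/style elements first, then every remaining tag."""
    without_code = SCRIPT_STYLE_RE.sub("", html)
    return TAG_RE.sub("", without_code)
```

The order matters: if the tags were removed first, the JavaScript and CSS source would be left behind as ordinary text.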

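Answer 1 (score: 11)

You can't really parse HTML with regular expressions; it is too complex. A regex will not handle some constructs correctly at all, and some input that a browser displays as ordinary text may baffle a naive regex.

You will be happier and more successful using a proper HTML parser. Python users often use Beautiful Soup to parse HTML and strip out tags and scripts.

Also, browsers tolerate malformed HTML by design. So you will often find yourself trying to parse HTML that is clearly invalid but works fine in a browser.

You might be able to parse bad HTML with regular expressions. All it takes is patience and hard work. But it is usually simpler to use someone else's parser.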
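As a rough illustration of the parser-based route, here is a sketch that uses html.parser from the Python standard library. Note that this is a stand-in for Beautiful Soup (chosen so the example has no third-party dependency), and the names TextExtractor and extract_text are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The parser decodes character references such as &amp; and treats the body of a script element as raw data, so a "<" inside JavaScript does not confuse it the way it can confuse a simple pattern.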

Answer 2 (score: 6)

I needed a regex solution (in PHP) that would return plain text as well as (or better than) PHPSimpleDOM, only much faster. Here is the solution I came up with:

function plaintext($html)
{
    // remove comments and any content found in the comment area (strip_tags only removes the actual tags).
    $plaintext = preg_replace('#<!--.*?-->#s', '', $html);

    // put a space between list items (strip_tags just removes the tags).
    $plaintext = preg_replace('#</li>#', ' </li>', $plaintext);

    // remove all script and style tags, including their contents
    $plaintext = preg_replace('#<(script|style)\b[^>]*>(.*?)</(script|style)>#is', "", $plaintext);

    // remove br tags (missed by strip_tags)
    $plaintext = preg_replace("#<br[^>]*?>#", " ", $plaintext);

    // remove all remaining html
    $plaintext = strip_tags($plaintext);

    return $plaintext;
}

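When I tested this on some complicated sites (forums seem to contain some of the harder HTML to parse), this method returned the same result as the PHPSimpleDOM plaintext, only much, much faster. It also handled list items (li tags) correctly, which PHPSimpleDOM did not.

As for speed:

SimpleDom: 0.03248 sec.

RegEx: 0.00087 sec.

37 times faster!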

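Answer 3 (score: 4)

Contemplating doing this with regular expressions is daunting. Have you considered XSLT? The XPath expression to extract all of the text nodes in an XHTML document, minus script and style content, would be: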

//body//text()[not(ancestor::script)][not(ancestor::style)]
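To make the intent of that expression concrete, here is a sketch of the same selection in Python. It assumes well-formed XHTML input. The standard library's xml.etree.ElementTree supports only a subset of XPath (no text() node test and no ancestor axis), so the expression cannot be passed to it directly; the walk below imitates it by hand. The helper names local_name, text_nodes and body_text are invented for this example.

```python
import xml.etree.ElementTree as ET

SKIPPED = {"script", "style"}


def local_name(tag):
    """Strip a namespace such as {http://www.w3.org/1999/xhtml} from a tag."""
    return tag.rsplit("}", 1)[-1]


def text_nodes(element):
    """Yield text nodes under element, skipping script and style subtrees."""
    if local_name(element.tag) in SKIPPED:
        return
    if element.text:
        yield element.text
    for child in element:
        yield from text_nodes(child)
        # The tail follows the child but belongs to the current element,
        # so it is kept even when the child itself was skipped.
        if child.tail:
            yield child.tail


def body_text(xhtml):
    """Return the non-blank text nodes found inside the body element."""
    root = ET.fromstring(xhtml)
    body = next(el for el in root.iter() if local_name(el.tag) == "body")
    return [t.strip() for t in text_nodes(body) if t.strip()]
```

A full XPath engine can evaluate the original expression as written; the point of the sketch is only to show which nodes it selects.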

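Answer 4 (score: 2)

The simplest way for simple HTML (example in Python):

import re

# Sample input; the exact tags are illustrative.
text = "<p>This is my> <strong>example</strong>HTML,<br />containing tags</p>"

# Split the input into tags and text runs, then keep only the text runs.
" ".join([t.strip() for t in re.findall(r"<[^>]+>|[^<]+", text) if "<" not in t])

Returns:

'This is my> example HTML, containing tags'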

Answer 5 (score: 2)

Here is a function that strips even the most complex HTML tags.

function strip_html_tags( $text )
{
    $text = preg_replace(
        array(
            // Remove invisible content
            '@<head[^>]*?>.*?</head>@siu',
            '@<style[^>]*?>.*?</style>@siu',
            '@<script[^>]*?>.*?</script>@siu',
            '@<object[^>]*?>.*?</object>@siu',
            '@<embed[^>]*?>.*?</embed>@siu',
            '@<applet[^>]*?>.*?</applet>@siu',
            '@<noframes[^>]*?>.*?</noframes>@siu',
            '@<noscript[^>]*?>.*?</noscript>@siu',
            '@<noembed[^>]*?>.*?</noembed>@siu',
            // Add line breaks before & after blocks
            '@</?((address)|(blockquote)|(center)|(del))@iu',
            '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
            '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
            '@</?((table)|(th)|(td)|(caption))@iu',
            '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
            '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
            '@</?((frameset)|(frame)|(iframe))@iu',
        ),
        array(
            // One replacement per pattern: 9 spaces, then 7 line breaks.
            ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
            "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
        ),
        $text );

    // Remove all remaining tags and comments and return.
    return strip_tags( $text );
}
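The same three-stage idea (drop invisible elements, put a line break in front of block-level tags, strip what is left) can be sketched in Python. This is not a port of the function above: the element lists are a shortened, assumed subset chosen for illustration, and the final whitespace clean-up is an addition of this example.

```python
import re

# Elements whose contents are never shown (assumed subset).
INVISIBLE = ("head", "style", "script", "noscript")
# Block-level elements that should start a new line (assumed subset).
BLOCKS = ("div", "p", "h[1-6]", "li", "ul", "ol", "table",
          "tr", "td", "th", "blockquote", "pre")

INVISIBLE_RE = re.compile(
    r"<(%s)\b[^>]*>.*?</\1\s*>" % "|".join(INVISIBLE),
    re.IGNORECASE | re.DOTALL,
)
BLOCK_RE = re.compile(r"</?(?:%s)\b" % "|".join(BLOCKS), re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]*>")


def strip_html_tags(text):
    # 1. Replace invisible elements and their contents with a space.
    text = INVISIBLE_RE.sub(" ", text)
    # 2. Put a newline in front of every block-level opening or closing tag.
    text = BLOCK_RE.sub(lambda m: "\n" + m.group(0), text)
    # 3. Remove all remaining tags.
    text = TAG_RE.sub("", text)
    # Collapse runs of whitespace on each line and drop empty lines.
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

Inserting the line breaks before the tags are stripped is what keeps headings, paragraphs and list items from running together in the output.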

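Answer 6 (score: 1)

Can't you just use the WebBrowser control that ships with C#?

System.Windows.Forms.WebBrowser wc = new System.Windows.Forms.WebBrowser();
wc.DocumentText = "<html><body>blah blah <b>foo</b></body></html>";
// Note: the document loads asynchronously, so a real program should read it
// from the DocumentCompleted event handler rather than immediately.
System.Windows.Forms.HtmlDocument h = wc.Document;
Console.WriteLine(h.Body.InnerText);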

Answer 7 (score: 1)

Using Perl syntax to define the regexes, a start might be:

!<body.*?>(.*)</body>!smi
