Finding HTML nodes: using RegEx to find HTML/XML nodes

Latest recommended article published 2024-07-03 06:00:00

weixin_39853843

Article tags: finding HTML nodes

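When parsing HTML documents to find UK postal addresses, the problem was that the matched elements included not only the expected `<p>` element but also other, unrelated elements. The proposed solutions were: 1) use XPath to go directly to the element containing the matching postcode; 2) find the postcode with a regex first, then use XPath to get the node that contains it; 3) narrow the query, for example by checking only whether the text inside `<p>` tags matches. The approach finally chosen was to use the regex to find the exact matching value and then use XPath to retrieve the relevant node.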

Summary generated by CSDN using AI

I am parsing a number of HTML documents, and within each need to try and extract a UK postal address. In order to do so I am parsing the HTML with AngleSharp and then looking for nodes with TextContent that match my RegEx:

var parser = new HtmlParser();

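// The tags in this string were stripped from the post; html, body and p follow from the text below, title and h1 are assumed.
var source = "<html><head><title>Test Title</title></head><body><h1>Some example source</h1><p>This is a paragraph element and example postcode EC1A 4NP</p></body></html>";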

var document = parser.Parse(source);

Regex searchTerm = new Regex("([A-PR-UWYZ][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)");

var list = document.All.Where(m => searchTerm.IsMatch((m.TextContent ?? "").ToUpper()));

This returns 3 results, the html, body and p elements. The only element I want to return is the p element as that has the innerText matching the regex correctly. There may also be more than one match on a page so I can't just return the last result. I am looking to just return any elements where the text in that element (not in any child nodes) matches the regex.

Edit

I don't know the document structure in advance, or even the tag the postcode will be within, which is why I'm using regex. Once I have the result I am planning on traversing the DOM to obtain the rest of the address, so I don't just want to treat the document as a string.

Talk1:

Do you know in advance that it will be a "P", or do you need any node with text-only content that contains your info? (For just "P", the sample page github.com/AngleSharp/AngleSharp/wiki/Examples provides enough detail.)

Talk2:

I don't know what tag the address will be contained within - it could be P, DIV, DD etc

Solutions1

If you are looking to extract a particular node within a well-formed HTML/XML document then have a look at utilising XPath. There's some examples here on MSDN

You can use utility libraries such as HTML Tidy to "clean up" the HTML and make it well formed if it isn't already.

Talk1:

Absolutely not related to the question - OP already uses HtmlParser to read HTML - it will have exactly the same problem with any parser that produces a tree.

Talk2:

From the snippet it shows that he's running the regex against the complete document. Using XPath will take them straight to the element that contains the address they need to parse.

Talk3:

So provide an answer - what the post has so far is a semi-related comment. I can't see how one can easily build an XPath to an unknown node (which you seem to suggest, but I could be totally wrong).

Solutions2

OK, I took a different approach in the end. I searched the HTML document as a string with the RegEx, not to parse the HTML but simply to find the exact matching value. Once I had that value, it was simple enough to use an XPath expression to return the node. In the example above, the regex search returns EC1A 4NP, and the following XPath:

//*[contains(text(),'EC1A 4NP')]

returns the required node. For XPath ease, I switched from AngleSharp to HtmlAgilityPack for the HTML parsing
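The two steps described in this answer could be sketched roughly as follows. This is a minimal illustration, not the poster's actual code: it assumes the HtmlAgilityPack NuGet package, and the class and method names (`PostcodeNodeLookup`, `FindNodes`) are hypothetical. The regex is run case-sensitively over the raw markup, so a lower-case postcode, or one that happens to appear inside an attribute, would need extra handling.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// Hypothetical helper illustrating "regex for the value, XPath for the node".
public static class PostcodeNodeLookup
{
    // Same pattern as in the question.
    static readonly Regex Postcode = new Regex(
        "([A-PR-UWYZ][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)");

    public static List<HtmlNode> FindNodes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<HtmlNode>();

        // Step 1: treat the markup as a plain string and collect the distinct postcode values.
        var values = Postcode.Matches(html).Cast<Match>().Select(m => m.Value).Distinct();

        foreach (var value in values)
        {
            // Step 2: look up elements by the exact value, using the XPath from the answer.
            // A match contains only letters, digits and spaces, so it can sit inside the quotes.
            // Note: under XPath 1.0, contains(text(), ...) tests only the first text child.
            var nodes = doc.DocumentNode.SelectNodes($"//*[contains(text(),'{value}')]");

            // SelectNodes returns null, not an empty collection, when nothing matches.
            if (nodes != null)
            {
                result.AddRange(nodes);
            }
        }

        return result;
    }

    public static void Main()
    {
        var source = "<html><body><p>This is a paragraph element and example postcode EC1A 4NP</p></body></html>";
        foreach (var node in FindNodes(source))
        {
            Console.WriteLine(node.Name);
        }
    }
}
```

Because the lookup returns real nodes, the result can still be used as a starting point for walking the DOM to collect the rest of the address, which was the stated goal in the question.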

Solutions3

I've had a quick look at the doco of parser. Below is what you need to do if you want to check only the text in <p> tags.

var list = document.All.Where(m => m.LocalName.ToUpper() == "P" && searchTerm.IsMatch((m.TextContent ?? "").ToUpper()));

Talk1:

I don't know what tag the address will be contained within - it could be P, DIV, DD etc
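Since the tag is not known in advance, one tag-agnostic option is to stay with AngleSharp and test only the text nodes that are direct children of each element, which is what the question means by "the text in that element (not in any child nodes)". The sketch below is an illustration under assumptions rather than an answer from the thread: it assumes AngleSharp 0.10 or later (where the parser lives in `AngleSharp.Html.Parser` and the method is `ParseDocument`; the question's `Parse` is the older name), and `OwnTextSearch`, `OwnText` and `FindElements` are hypothetical names. It uses `RegexOptions.IgnoreCase` in place of the question's `ToUpper()` call.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using AngleSharp.Dom;
using AngleSharp.Html.Parser;

// Hypothetical helper: match the regex against an element's own text only.
public static class OwnTextSearch
{
    // Same pattern as in the question, made case-insensitive.
    static readonly Regex Postcode = new Regex(
        "([A-PR-UWYZ][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)",
        RegexOptions.IgnoreCase);

    // Concatenates only the text nodes that are direct children of the element,
    // so text belonging to descendant elements is ignored.
    static string OwnText(IElement element) =>
        string.Concat(element.ChildNodes
            .Where(n => n.NodeType == NodeType.Text)
            .Select(n => n.TextContent));

    public static List<IElement> FindElements(string html)
    {
        var document = new HtmlParser().ParseDocument(html);
        return document.All.Where(e => Postcode.IsMatch(OwnText(e))).ToList();
    }

    public static void Main()
    {
        var source = "<html><body><div><p>This is a paragraph element and example postcode EC1A 4NP</p></div></body></html>";
        foreach (var element in FindElements(source))
        {
            Console.WriteLine(element.LocalName);
        }
    }
}
```

With this filter, ancestors such as html and body should no longer match merely because a descendant contains the postcode. If the postcode sits in an inline child such as a span, that child is the element returned, and its parent can be reached through `ParentElement` when gathering the rest of the address.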
