Finding HTML nodes: using RegEx to find HTML/XML nodes

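When parsing HTML documents to find UK postal addresses, the problem was that the matched elements included not only the expected `<p>` element but also its ancestor elements, html and body. The proposed solutions were: 1) use XPath to go directly to the element containing the matching postcode; 2) first find the postcode with a regex, then use XPath to get the node containing it; 3) narrow the query, for example by checking only whether the text inside `<p>` tags matches. The approach finally chosen was to use the regex to find the exact match and then XPath to retrieve the corresponding node.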

I am parsing a number of HTML documents, and within each need to try and extract a UK postal address. In order to do so I am parsing the HTML with AngleSharp and then looking for nodes with TextContent that match my RegEx:

var parser = new HtmlParser();

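// The sample's markup was lost from this post. The html, body and p tags follow
// from the results described below; the title and h1 tags are assumed.
var source = @"<html>
<head><title>Test Title</title></head>
<body>
<h1>Some example source</h1>
<p>This is a paragraph element and example postcode EC1A 4NP</p>
</body>
</html>";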

var document = parser.Parse(source);

Regex searchTerm = new Regex("([A-PR-UWYZ][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)");

var list = document.All.Where(m => searchTerm.IsMatch((m.TextContent ?? "").ToUpper()));

This returns 3 results, the html, body and p elements. The only element I want to return is the p element as that has the innerText matching the regex correctly. There may also be more than one match on a page so I can't just return the last result. I am looking to just return any elements where the text in that element (not in any child nodes) matches the regex.

Edit

I don't know in advance the doc structure or even the tag that the postcode will be within, which is why I'm using regex. Once I have the result I am planning on traversing the DOM to obtain the rest of the address, so I don't just want to treat the doc as a string.

Talk1:

Do you know in advance that it will be "P", or do you need any node with text-only content that contains your info? (For just "P", the sample page github.com/AngleSharp/AngleSharp/wiki/Examples provides enough details.)

Talk2:

I don't know what tag the address will be contained within - it could be P, DIV, DD etc

Solutions1

If you are looking to extract a particular node within a well-formed HTML/XML document then have a look at utilising XPath. There are some examples on MSDN.

You can use utility libraries such as HTML Tidy to "clean up" the HTML and make it well formed if it isn't already.

Talk1:

Absolutely not related to the question - OP already uses HtmlParser to read HTML - it will have exactly the same problem with any parser that produces a tree.

Talk2:

From the snippet it shows that he's running the regex against the complete document. Using XPath will take them straight to the element that contains the address they need to parse.

Talk3:

So provide an answer - what the post has so far is a semi-related comment. I can't see how one can easily build an XPath to an unknown node (which you seem to suggest, but I could be totally wrong).

Talk4:

I don't know in advance the doc structure or even the tag that the postcode will be within, which is why I'm using regex. Once I have the result I am planning on traversing the DOM to obtain the rest of the address, so I don't just want to treat the doc as a string.

Solutions2

OK, I took a different approach in the end. I searched the HTML doc as a string with the RegEx, not to parse the HTML but simply to find the exact match value. Once I had that value, it was simple enough to use an XPath expression to return the node. In the example above, the regex search returns EC1A 4NP and the following XPath:

//*[contains(text(),'EC1A 4NP')]

returns the required node. For XPath ease, I switched from AngleSharp to HtmlAgilityPack for the HTML parsing.
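The two steps described in this answer can be sketched roughly as follows. This is only an illustration, assuming the HtmlAgilityPack package is referenced. The sample markup is reconstructed, and `RegexOptions.IgnoreCase` is an assumption that stands in for the question's `ToUpper()` call so that the matched value keeps the casing used in the document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// Reconstructed sample document (title and h1 tags are assumed).
var source = "<html><head><title>Test Title</title></head><body>"
           + "<h1>Some example source</h1>"
           + "<p>This is a paragraph element and example postcode EC1A 4NP</p>"
           + "</body></html>";

// Same pattern as in the question; IgnoreCase replaces the ToUpper() call.
var searchTerm = new Regex(
    "([A-PR-UWYZ][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)",
    RegexOptions.IgnoreCase);

// Step 1: scan the raw markup only to collect the exact postcode strings.
var postcodes = searchTerm.Matches(source)
    .Cast<Match>()
    .Select(m => m.Value)
    .Distinct()
    .ToList();

// Step 2: load the document and look up the elements containing each postcode.
var doc = new HtmlDocument();
doc.LoadHtml(source);

var found = new List<HtmlNode>();
foreach (var postcode in postcodes)
{
    // A postcode holds only letters, digits and spaces, so it is safe to quote.
    var xpath = $"//*[contains(text(),'{postcode}')]";
    var nodes = doc.DocumentNode.SelectNodes(xpath); // null when nothing matches
    if (nodes != null) found.AddRange(nodes);
}

foreach (var node in found)
    Console.WriteLine($"{node.Name}: {node.InnerText.Trim()}");
```

One caveat: in XPath 1.0, `contains(text(), ...)` converts the node-set to the string value of its first node, so only the first text child of each element is tested. If the postcode may follow an inline child such as a `<br>`, the variant `//*[text()[contains(., 'EC1A 4NP')]]` tests every text child instead.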

Solutions3

I've had a quick look at the docs of the parser. Below is what you need to do if you want to check only the text in `<p>` tags.

var list = document.All.Where(m => m.LocalName.ToUpper() == "P" && searchTerm.IsMatch((m.TextContent ?? "").ToUpper()));

Talk1:

I don't know what tag the address will be contained within - it could be P, DIV, DD etc
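Since the tag is not known in advance, one tag-agnostic option is to test each element's own text nodes rather than `TextContent`, which also includes the text of all descendants. A minimal sketch, assuming a current AngleSharp release (where the question's `parser.Parse` is spelled `ParseDocument`), reconstructed sample markup, and `RegexOptions.IgnoreCase` in place of `ToUpper()`:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using AngleSharp.Dom;
using AngleSharp.Html.Parser;

// Reconstructed sample document (title and h1 tags are assumed).
var source = "<html><head><title>Test Title</title></head><body>"
           + "<h1>Some example source</h1>"
           + "<p>This is a paragraph element and example postcode EC1A 4NP</p>"
           + "</body></html>";

var parser = new HtmlParser();
var document = parser.ParseDocument(source); // older releases: parser.Parse(source)

// Same pattern as in the question; IgnoreCase replaces the ToUpper() call.
var searchTerm = new Regex(
    "([A-PR-UWYZ][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)",
    RegexOptions.IgnoreCase);

// Keep an element only when one of its direct text-node children matches,
// so ancestors such as html and body are not reported.
var list = document.All
    .Where(e => e.ChildNodes
        .OfType<IText>()
        .Any(t => searchTerm.IsMatch(t.Data)))
    .ToList();

foreach (var element in list)
    Console.WriteLine($"{element.LocalName}: {element.TextContent.Trim()}");
```

For the sample this leaves only the p element, because html and body have no text node of their own that contains a postcode. A postcode split across inline markup, such as `EC1A <b>4NP</b>`, would not be found by this filter.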
