I am parsing a number of HTML documents and need to extract a UK postal address from each one. To do so, I parse the HTML with AngleSharp and then look for nodes whose TextContent matches my regex:
var parser = new HtmlParser();
var source = @"<!DOCTYPE html>
<html>
<head><title>Test Title</title></head>
<body>
<h1>Some example source</h1>
<p>This is a paragraph element and example postcode EC1A 4NP</p>
</body>
</html>";
var document = parser.Parse(source);
Regex searchTerm = new Regex("([A-PR-UWYZ][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)");
var list = document.All.Where(m => searchTerm.IsMatch((m.TextContent ?? "").ToUpper()));
This returns 3 results: the html, body and p elements. The only element I want is the p element, since it is the one whose own text matches the regex; html and body match only because TextContent includes the text of all descendants. There may also be more than one match on a page, so I can't just take the last result. I want to return only those elements whose own text (not the text of any child nodes) matches the regex.
Edit
I don't know the document structure in advance, or even the tag that the postcode will be in, which is why I'm using a regex. Once I have the result, I plan to traverse the DOM to obtain the rest of the address, so I don't want to treat the document as just a string.
Talk1:
Do you know in advance that it will be a "P", or do you need any node with text-only content that contains your info? (For the "P"-only case, the sample page github.com/AngleSharp/AngleSharp/wiki/Examples provides enough detail.)
Talk2:
I don't know what tag the address will be contained within - it could be P, DIV, DD etc
Solutions1
If you are looking to extract a particular node from a well-formed HTML/XML document, have a look at XPath. There are some examples on MSDN.
You can use a utility library such as HTML Tidy to clean up the HTML and make it well-formed if it isn't already.
Talk1:
Absolutely not related to the question - OP already uses HtmlParser to read HTML - it will have exactly the same problem with any parser that produces a tree.
Talk2:
From the snippet it shows that he's running the regex against the complete document. Using XPath will take them straight to the element that contains the address they need to parse.
Talk3:
So provide an answer - what the post has so far is a semi-related comment. I can't see how one can easily build an XPath to an unknown node (which you seem to suggest, but I could be totally wrong).
Talk4:
I don't know the document structure in advance, or even the tag that the postcode will be in, which is why I'm using a regex. Once I have the result, I plan to traverse the DOM to obtain the rest of the address, so I don't want to treat the document as just a string.
Solutions2
OK, I took a different approach in the end. I searched the HTML document as a string with the regex, not to parse the HTML but simply to find the exact matching value. Once I had that value, it was simple enough to use an XPath expression to return the node. In the example above, the regex search returns EC1A 4NP and the following XPath:
//*[contains(text(),'EC1A 4NP')]
returns the required node. To make the XPath queries easier, I switched from AngleSharp to HtmlAgilityPack for the HTML parsing.
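The two steps described above could be sketched roughly as follows. This is only an illustration, not the exact code used: the class and method names (PostcodeLocator, FindPostcodes, FindNodes) are made up for the example, the regex is the one from the question, and it assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// Hypothetical helper; names are illustrative only.
public static class PostcodeLocator
{
    // Postcode pattern copied from the question.
    private static readonly Regex Postcode = new Regex(
        "([A-PR-UWYZ][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)");

    // Step 1: treat the HTML as a plain string and collect the distinct matched values.
    public static List<string> FindPostcodes(string html)
    {
        return Postcode.Matches(html)
                       .Cast<Match>()
                       .Select(m => m.Value)
                       .Distinct()
                       .ToList();
    }

    // Step 2: for each matched value, ask HtmlAgilityPack for the elements containing it.
    public static List<HtmlNode> FindNodes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<HtmlNode>();
        foreach (var postcode in FindPostcodes(html))
        {
            // Values matched by the pattern contain only letters, digits and spaces,
            // so they can be placed inside the quoted XPath literal as-is.
            var xpath = "//*[contains(text(),'" + postcode + "')]";
            var nodes = doc.DocumentNode.SelectNodes(xpath); // null when nothing matches
            if (nodes != null)
            {
                result.AddRange(nodes);
            }
        }
        return result;
    }
}
```

Two caveats with this sketch. Unlike the code in the question, the string search here does not upper-case the input, so lower-case postcodes are not found. Also, in XPath 1.0 contains(text(), ...) only tests the first text node of each element; if the postcode may follow a child element, a commonly used variant is //*[text()[contains(., 'EC1A 4NP')]].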
Solutions3
I've had a quick look at the parser's documentation. Below is what you need to do if you want to check only the text in <p> tags.
var list = document.All.Where(m => m.LocalName.ToUpper() == "P" && searchTerm.IsMatch((m.TextContent ?? "").ToUpper()));
Talk1:
I don't know what tag the address will be contained within - it could be P, DIV, DD etc
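For the case where the tag is not known in advance, one possible variation is to drop the tag-name check and test only each element's direct text nodes rather than its TextContent. The following is a sketch under assumptions: the class and method names (OwnTextSearch, ElementsWithOwnTextMatch) are invented for the example, and it relies on AngleSharp's IText nodes being exposed through ChildNodes.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using AngleSharp.Dom;

// Hypothetical helper; names are illustrative only.
public static class OwnTextSearch
{
    // Returns the elements that have at least one direct child text node matching
    // the regex. Text that belongs to descendant elements is ignored, so ancestors
    // such as html and body are not reported just because a descendant matches.
    public static IEnumerable<IElement> ElementsWithOwnTextMatch(IDocument document, Regex searchTerm)
    {
        return document.All.Where(element =>
            element.ChildNodes
                   .OfType<IText>()
                   .Any(text => searchTerm.IsMatch((text.Data ?? "").ToUpper())));
    }
}
```

With the document and searchTerm from the question, this would be called as var list = OwnTextSearch.ElementsWithOwnTextMatch(document, searchTerm); and the matched elements can then be used as starting points for traversing the DOM. One limitation: a postcode split across inline children (for example, half of it inside a span) is not found, because each text node is tested on its own.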