Parsing the HTML Document
Now that we have a way to populate the class, let’s take a look at making sense of what’s there. If you’re not familiar with regular expressions, you will need to look them up to fully understand what follows. Numerous books have been written on regular expressions, and even a quick look at the basics is beyond what space permits here. However, we can take a look at a very simple example that will at least whet your appetite for more. We will use several regular expressions to parse the HTML document, but the simplest is the following:
<[^>]*>
This tells the regular expression engine to find a pattern that begins with a less-than sign, ends with a greater-than sign and includes any run of characters other than a greater-than sign in between. This regular expression will match any tag. Now let’s look at matching an entire block, such as a <HEAD></HEAD> entry. We can match everything in the HEAD block with the following regular expression:
<HEAD[^>]*>.*</HEAD\s*>
A special note is warranted here. You need to be aware of the conventions of the language in question when assigning strings such as the one above. In a regular C# string literal, the backslash in “\s” begins an escape sequence, and because “\s” is not an escape sequence C# recognizes, the literal will not compile as written. Embedded quotes are also a potential issue with either VB.Net or C#. With C#, it is possible to use an extra backslash (“\\s”), but it is generally easier to prefix the string with an at-sign (“@”), which tells C# to ignore escape sequences and is similar to how a quoted string is encoded by VB.Net. The following table illustrates the point:
Desired Regular Expression | \s"\n |
C# (Method 1) | "\\s\"\\n" |
C# (Method 2) | @"\s""\n" |
VB.Net | "\s""\n" |
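The two C# spellings in the table can be checked against each other with a short program. This is only a minimal sketch: the class name EscapeDemo and the sample input are made up for illustration and are not part of the parser class.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Regular string literal: every backslash and quote must be escaped.
    public const string Escaped = "\\s\"\\n";

    // Verbatim string literal: backslashes are literal, quotes are doubled.
    public const string Verbatim = @"\s""\n";

    public static void Main()
    {
        // Both literals yield the same five-character pattern: \s"\n
        Console.WriteLine(Escaped == Verbatim);
        Console.WriteLine(Verbatim.Length);

        // The pattern matches one whitespace character, a double quote,
        // and then a newline.
        Console.WriteLine(Regex.IsMatch("say \"\nhello", Verbatim));
    }
}
```

The verbatim form is usually easier to read because the pattern appears almost exactly as the regular expression engine will see it.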
Using the power of regular expressions, we can extract the raw text from our document using only a few lines of code. The type of regular expression shown for the HEAD tag will work for most, but not all, HTML tags. Some tags, such as <BR>, do not need an explicit closing tag. Others, such as the comment tag, close with a delimiter that is not in the standard form of a slash followed by the tag name. We will deal with the problem of non-closed blocks later, but for now we will write a function that accepts a tag name and returns the regular expression that will match that tag’s contents, assuming there is a corresponding closing tag:
private string GetExpressionForTagContents(string strTagName)
{
    string strPatternTag;

    if (strTagName == "!")
        // Comments open with <!-- and close with -->, not with a </tag>.
        strPatternTag = "<!--.*?-->";
    else if (string.Compare(strTagName, "!doctype", true) == 0)
        strPatternTag = "<!doctype.*?>";
    else if (string.Compare(strTagName, "br", true) == 0)
        // <br> has no closing tag; allow the self-closing form as well.
        strPatternTag = @"<br\s*/?\s*>";
    else
        // Opening tag, optional attributes, then the shortest run of text
        // up to a closing tag with the same name (backreference \1).
        strPatternTag = @"<(" + strTagName + @")(>|\s+[^>]*>).*?</\1\s*>";

    return strPatternTag;
}
If no closing tag is present, this match will fail. In that case, we will take everything that lies between the tag in question and the next less-than sign, or the end of the file. Let’s put all of this together and write a routine that will accept any tag name and return the contents irrespective of the existence of a closing tag.
private string GetTagByName(string strTagName, string strSource)
{
    string strPatternTag = GetExpressionForTagContents(strTagName);
    // Fallback for a tag with no closing tag: the opening tag plus
    // everything up to the next less-than sign or the end of the input.
    string strPatternTagNoClose = "<" + strTagName + @"(>|\s+[^>]*>)[^<]*";
    RegexOptions opts = RegexOptions.IgnoreCase | RegexOptions.Singleline;

    Match m = Regex.Match(strSource, strPatternTag, opts);
    if (!m.Success)
        m = Regex.Match(strSource, strPatternTagNoClose, opts);

    // Regex.Match never returns null; a failed match has an empty Value,
    // so the caller receives "" when the tag is absent.
    return m.Value;
}
We can now pull out any single tag, or the first of multiple tags, given its name. As you may have noticed, the above function was declared as private, so it is for in-class use only. Soon we will write a public function called GetTagsByName that returns an array of tag contents and is suitable for returning the data for any tag, whether it is found one time or many times. First, let’s add properties to return the HEAD, TITLE and BODY. For the HEAD and TITLE properties, we can just return the results of GetTagByName, passing the appropriate tag name. For the BODY, we need to be a bit more careful. It is possible that we could encounter a document with no HEAD or BODY tag at all. In such a case, GetTagByName will return an empty string, and the Body property will simply return the entire document:
public string Head
{
    get { return GetTagByName("Head", m_strSource); }
}

public string Title
{
    get { return GetTagByName("Title", m_strSource); }
}

public string Body
{
    get
    {
        string strBody = GetTagByName("Body", m_strSource);
        // No BODY tag found: fall back to the whole document.
        if (strBody == "")
            strBody = this.Source;
        return strBody;
    }
}
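To see how the closed-tag pattern and the no-closing-tag fallback behave together, the following standalone sketch applies the same two patterns to a small sample document. The class TagMatchDemo, the helper FirstTag and the sample HTML are hypothetical names introduced only for this illustration, and the use of Regex.Escape on the tag name is an added precaution rather than something the article's class does.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagMatchDemo
{
    const RegexOptions Opts = RegexOptions.IgnoreCase | RegexOptions.Singleline;

    // Hypothetical standalone helper: returns the first block for tagName,
    // falling back to "opening tag plus following text" when the document
    // contains no matching closing tag.
    public static string FirstTag(string tagName, string source)
    {
        string name = Regex.Escape(tagName);
        string closed = "<(" + name + @")(>|\s+[^>]*>).*?</\1\s*>";
        string unclosed = "<" + name + @"(>|\s+[^>]*>)[^<]*";

        Match m = Regex.Match(source, closed, Opts);
        if (!m.Success)
            m = Regex.Match(source, unclosed, Opts);
        return m.Value; // empty string when neither pattern matches
    }

    public static void Main()
    {
        string html = "<html><head><title>Demo</title></head>"
                    + "<body class=\"x\"><p>First<p>Second</body></html>";

        Console.WriteLine(FirstTag("title", html)); // <title>Demo</title>
        Console.WriteLine(FirstTag("p", html));     // <p>First
        Console.WriteLine(FirstTag("table", html)); // prints an empty line
    }
}
```

The unclosed P tags show why the fallback matters: the first pattern finds nothing because the document never closes a paragraph, so the second pattern returns the opening tag and the text up to the next less-than sign.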
Extracting the Text
We now have everything we need to pull out the raw text from the document. The text will be found in the BODY portion of the document, so we will start with this. Then any comments, plus anything found in SCRIPT blocks, will be stripped. Next we will remove all remaining tags. Everything left over should be the textual content of the page.
Having extracted the text from the surrounding tags, we still have the possibility of encoded ISO characters like the famous &nbsp;. We will write a routine (not listed) to search for each of these and replace the ISO character with its corresponding ASCII equivalent. Lastly, we use a while loop to remove any repeated blanks. The code listing for the Text property, as well as a helper function to strip the comments, appears below.
public string Text
{
    get
    {
        RegexOptions opts = RegexOptions.IgnoreCase | RegexOptions.Singleline;

        // Start from the body, then drop comments and SCRIPT blocks.
        string strText = this.Body;
        strText = StripComments(strText);
        string strPattern = GetExpressionForTagContents("SCRIPT");
        strText = Regex.Replace(strText, strPattern, "", opts);

        // Replace every remaining tag with a space.
        strPattern = @"<[^>]*>";
        strText = Regex.Replace(strText, strPattern, " ", opts);

        // Decode character entities, then restore literal ampersands.
        strText = ISOtoASCII(strText);
        strText = Regex.Replace(strText, "&amp;", "&", opts);

        // Collapse repeated whitespace until no adjacent pair remains.
        while (Regex.IsMatch(strText, @"\s\s"))
            strText = Regex.Replace(strText, @"\s\s", " ");

        return strText.Trim();
    }
}

string StripComments(string strText)
{
    // Singleline lets a comment span several lines.
    Regex r = new Regex(GetExpressionForTagContents("!"), RegexOptions.Singleline);
    return r.Replace(strText, "");
}
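Because the entity-replacement routine is not listed, the sketch below shows one way the final clean-up steps might look in isolation. It is an assumption, not the article's ISOtoASCII: it leans on the framework's System.Net.WebUtility.HtmlDecode instead of a hand-written table, and it collapses whitespace with a single \s+ replacement rather than a loop. The class TextCleanupDemo and the method CleanText are hypothetical names.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TextCleanupDemo
{
    // Hypothetical stand-in for the unlisted entity routine plus the
    // whitespace clean-up; not the article's own implementation.
    public static string CleanText(string fragment)
    {
        // Drop any remaining tags, leaving a space so words do not fuse.
        string text = Regex.Replace(fragment, @"<[^>]*>", " ");

        // Decode entities such as &amp; and &nbsp; into characters.
        text = WebUtility.HtmlDecode(text);

        // &nbsp; decodes to U+00A0; map it to a plain ASCII space.
        text = text.Replace('\u00A0', ' ');

        // Collapse each run of whitespace to one space in a single pass.
        text = Regex.Replace(text, @"\s+", " ");
        return text.Trim();
    }

    public static void Main()
    {
        string sample = "<p>Fish &amp;&nbsp;  chips</p>\n<p>Tea</p>";
        Console.WriteLine(CleanText(sample)); // Fish & chips Tea
    }
}
```

Matching \s+ handles a run of any length at once, so no loop is required; the loop in the listing above reaches the same result by repeatedly replacing pairs.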