The project I am posting today does not have much practical or functional value to anyone (although I think it is pretty cool for web designers and developers, I am sure there will be detractors out there). It is just meant to showcase some ASP.NET objects, how easy they are to use, and some simple string manipulation.

This project is an HTML Content Parser. It gets a stream of HTML content from the web page at a specified URL. It then goes through the whole extracted stream, picks out the HTML hyperlinks and images, and displays them in an HTML table as hyperlinks that users can click on directly. This is particularly useful for users who are interested in some images on a website and find it tedious to look through the page's source to extract the image sources.

A live version of this project is available on my website. Please feel free to post any comments or criticisms on my project and articles.

Let's make use of MS.NET's more intuitive OOP features to separate, encapsulate and group different functions into classes and assemblies for easier maintenance.
This code goes into a class file called HTMLContentParser.vb

'///
Imports System
Imports System.Collections
Imports System.IO
Imports System.Net
Imports System.Text
Imports System.Text.RegularExpressions

Public Class HTMLContentParser

    Function Return_HTMLContent(ByVal sURL As String) As String
        Dim sStream As Stream
        Dim URLReq As HttpWebRequest
        Dim URLRes As HttpWebResponse
        Try
            URLReq = CType(WebRequest.Create(sURL), HttpWebRequest)
            URLRes = CType(URLReq.GetResponse(), HttpWebResponse)
            sStream = URLRes.GetResponseStream()
            Return New StreamReader(sStream).ReadToEnd()
        Catch ex As Exception
            'On failure the error message is returned in place of the HTML
            Return ex.Message
        End Try
    End Function

    Function ParseHTMLLinks(ByVal sHTMLContent As String, ByVal sURL As String) As ArrayList
        Dim rRegEx As Regex
        Dim mMatch As Match
        Dim aMatch As New ArrayList()
        'Group 1 holds the href value, quoted or unquoted.
        '.*? is lazy so that each link on a line is matched, not only the last one
        rRegEx = New Regex("a.*?href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))", _
            RegexOptions.IgnoreCase Or RegexOptions.Compiled)
        mMatch = rRegEx.Match(sHTMLContent)
        While mMatch.Success
            Dim sMatch As String
            sMatch = ProcessURL(mMatch.Groups(1).ToString, sURL)
            aMatch.Add(sMatch)
            mMatch = mMatch.NextMatch()
        End While
        Return aMatch
    End Function

    Function ParseHTMLImages(ByVal sHTMLContent As String, ByVal sURL As String) As ArrayList
        Dim rRegEx As Regex
        Dim mMatch As Match
        Dim aMatch As New ArrayList()
        'Group 1 holds the src value, quoted or unquoted
        rRegEx = New Regex("img.*?src\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))", _
            RegexOptions.IgnoreCase Or RegexOptions.Compiled)
        mMatch = rRegEx.Match(sHTMLContent)
        While mMatch.Success
            Dim sMatch As String
            sMatch = ProcessURL(mMatch.Groups(1).ToString, sURL)
            aMatch.Add(sMatch)
            mMatch = mMatch.NextMatch()
        End While
        Return aMatch
    End Function

    Private Function ProcessURL(ByVal sInput As String, ByVal sURL As String) As String
        'Find out if the sURL has a "/" after the Domain Name
        'If not, give a "/" at the end
        'First, check out for any slash after the
        'Double Dashes of the http://
        'If there is NO slash, then end the sURL string with a SLASH
        If InStr(8, sURL, "/") = 0 Then
            sURL += "/"
        End If

        'FILTERING
        'Filter down to the Domain Name Directory from the Right
        Dim iCount As Integer
        For iCount = sURL.Length To 1 Step -1
            If Mid(sURL, iCount, 1) = "/" Then
                sURL = Left(sURL, iCount)
                Exit For
            End If
        Next

        'Filter out the "&gt;" from the Left
        For iCount = 1 To sInput.Length
            If Mid(sInput, iCount, 4) = "&gt;" Then
                sInput = Left(sInput, iCount - 1) 'Stop and Take the Char before
                Exit For
            End If
        Next

        'Filter out unnecessary Characters
        sInput = sInput.Replace("&lt;", Chr(39))
        sInput = sInput.Replace("&gt;", Chr(39))
        sInput = sInput.Replace("&quot;", "")
        sInput = sInput.Replace("'", "")

        If (sInput.IndexOf("http://") < 0) Then
            'Relative link: join it to the base URL with exactly one slash
            If (Not (sInput.StartsWith("/")) And Not (sURL.EndsWith("/"))) Then
                Return sURL & "/" & sInput
            Else
                If (sInput.StartsWith("/")) And (sURL.EndsWith("/")) Then
                    Return sURL.Substring(0, sURL.Length - 1) + sInput
                Else
                    Return sURL + sInput
                End If
            End If
        Else
            Return sInput
        End If
    End Function

End Class
'///

The function Return_HTMLContent takes a URL as a string parameter. From there we use the HttpWebRequest and HttpWebResponse objects to send a request to the specified URL and get its HTML content as a response. Note the structured error handling (Try ... Catch) implemented here; structured error handling is a different topic altogether and is not explained in this article. The returned value should be displayed in an HTML text box so that it can be retrieved later.

The ParseHTMLLinks and ParseHTMLImages functions make use of a Regex object, which should be very familiar to Java and C# developers and may look alien to VB developers. Regex is a pattern-matching object and is used together with the Match object. Both belong to the System.Text.RegularExpressions namespace, which therefore MUST be imported into the VB.NET class. Essentially, the Regex pattern is matched against the HTML content (retrieved from the HTML text box we used earlier to display it), and each result is held in a Match object.
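To make the Regex and Match pairing more concrete, here is a minimal console sketch of the same Match / NextMatch loop. The module name RegexLinkDemo and the sample markup are my own assumptions for illustration, and the pattern is a slightly tightened variant (it anchors on "<a" and stops an unquoted value at whitespace or ">"), not the exact pattern used in the class.

```vb
Imports System
Imports System.Collections
Imports System.Text.RegularExpressions

Module RegexLinkDemo
    Sub Main()
        ' Hypothetical sample markup standing in for the downloaded page.
        Dim sHTML As String = "<a href=""page1.html"">One</a> <a href=page2.html>Two</a>"

        ' A quoted value is captured by the first alternative;
        ' an unquoted value runs up to the next whitespace or ">".
        ' Both alternatives store their capture in group 1.
        Dim rRegEx As New Regex( _
            "<a\s[^>]*?href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>[^\s>]+))", _
            RegexOptions.IgnoreCase)

        Dim aMatch As New ArrayList()

        ' Match returns the first hit; NextMatch continues from where it ended.
        Dim mMatch As Match = rRegEx.Match(sHTML)
        While mMatch.Success
            aMatch.Add(mMatch.Groups(1).Value)
            mMatch = mMatch.NextMatch()
        End While

        ' Print every captured href value, one per line.
        For Each sLink As String In aMatch
            Console.WriteLine(sLink)
        Next
    End Sub
End Module
```

With the sample string above, the loop should collect the two href values, one quoted and one unquoted. Regular expressions are only an approximation of HTML parsing, so unusual markup may still slip through.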
Whenever it finds the pattern specified by the Regex, it returns the matching string, which is processed with the ProcessURL function and then added to an ArrayList. The ArrayList class is essentially the Collection class of classic VB. Its ability to add and remove items makes it far more intuitive and easier to use than a plain array. Both ParseHTMLLinks and ParseHTMLImages return an ArrayList, of links and of images respectively.

The ProcessURL function uses some intrinsic VB functions and some new VB.NET ones (I have been a VB developer for years and am therefore familiar with them). I realize that some detractors out there will propose using the StringBuilder class to manipulate the strings in this function. StringBuilder differs from String in that it is mutable: it changes its internal buffer in place instead of creating a new instance every time it is modified. It is therefore more efficient on the machine's resources. The String class is immutable, so every operation that changes a string creates a new String instance. You can imagine the strain on resources: manipulate the same string 10 times and it creates 10 new instances. Hardly efficient.

I use the String class here because, although it is less efficient, it is much more familiar to VB developers who are in transition to VB.NET, and this article focuses mainly on the ASP.NET HttpWebRequest and HttpWebResponse objects. I will save the StringBuilder class for later articles, and I am sure other developers and authors here have already explained it.
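For readers who want to try the StringBuilder route, the character clean-up step could be sketched roughly as follows. CleanInput and StringBuilderDemo are hypothetical names of my own, and the set of characters stripped here is an assumption for illustration rather than a copy of what ProcessURL filters.

```vb
Imports System
Imports System.Text

Module StringBuilderDemo
    ' Hypothetical helper: removes angle brackets and quotes from a captured value.
    Function CleanInput(ByVal sInput As String) As String
        ' One buffer is created and then edited in place by each Replace call.
        Dim sb As New StringBuilder(sInput)
        sb.Replace("<", "")
        sb.Replace(">", "")
        sb.Replace("""", "")
        sb.Replace("'", "")
        ' A single String is produced at the very end.
        Return sb.ToString()
    End Function

    Sub Main()
        Console.WriteLine(CleanInput("'images/logo.gif'>"))
    End Sub
End Module
```

For a handful of replacements on a short URL the difference is unlikely to be noticeable; the benefit grows when the same buffer is modified many times, for example inside a loop.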
This code goes into the code-behind of an ASP.NET ASPX page

'///
'The parser must be instantiated before its functions are called
Private objParser As New HTMLContentParser()

Private Sub cmdGetHTML_ServerClick(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles cmdGetHTML.ServerClick
    Dim sURL As String = "http://" & txtURL.Value
    txtHTMLContent.EnableViewState = False
    txtHTMLContent.Value = objParser.Return_HTMLContent(sURL)
End Sub

Private Sub cmdParse_ServerClick(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles cmdParse.ServerClick
    Call PopulatetblParsedContent()
End Sub

Private Sub PopulatetblParsedContent()
    'Populate Links Table
    Dim sURL As String = "http://" & txtURL.Value
    Dim myAnchor As HtmlAnchor
    Dim objRow As HtmlTableRow
    Dim objCell As HtmlTableCell
    Dim sLinks As String
    Dim sImage As String
    Dim lstLinks As ArrayList = objParser.ParseHTMLLinks(txtHTMLContent.Value, sURL)
    Dim lstImages As ArrayList = objParser.ParseHTMLImages(txtHTMLContent.Value, sURL)

    tblParsedContent.EnableViewState = False

    'One row per link: anchor -> cell -> row -> table
    For Each sLinks In lstLinks
        objRow = New HtmlTableRow()
        objCell = New HtmlTableCell()
        myAnchor = New HtmlAnchor()
        myAnchor.Target = "_blank"
        myAnchor.InnerText = "Link: " & sLinks
        myAnchor.HRef = sLinks
        objCell.NoWrap = False
        objCell.Controls.Add(myAnchor)
        objRow.Cells.Add(objCell)
        tblParsedContent.Rows.Add(objRow)
    Next

    'One row per image, built the same way
    For Each sImage In lstImages
        objRow = New HtmlTableRow()
        objCell = New HtmlTableCell()
        myAnchor = New HtmlAnchor()
        myAnchor.Target = "_blank"
        myAnchor.InnerText = "Img: " & sImage
        myAnchor.HRef = sImage
        objCell.NoWrap = False
        objCell.Controls.Add(myAnchor)
        objRow.Cells.Add(objCell)
        tblParsedContent.Rows.Add(objRow)
    Next
End Sub
'///

We now turn to how the page extracts the HTML content and parses it.
Design your ASPX page and have
1) An HTMLTextBox called txtURL for users to specify the URL for processing
2) An HTMLButton called cmdGetHTML with a ServerClick event handler to handle the click event. The event triggers a routine that uses the HTMLContentParser class we coded earlier and calls its Return_HTMLContent function to return a string of HTML content for display in the txtHTMLContent HTMLTextBox
3) An HTMLTextBox called txtHTMLContent to hold the returned HTML content
4) An HTMLButton called cmdParse with a ServerClick event handler that calls PopulatetblParsedContent
5) An HTMLTable called tblParsedContent
Personally, I think the HTMLTable server control is amazing. I have used this example to show how intuitive it is to add cells and rows to it. Again, my detractors out there may question the routine that populates the table. I say that this article just explains one of the ways to populate an HTMLTable. It is very intuitive, and I am sure most developers can understand it without much explanation. It makes use of the HTMLTableRow and HTMLTableCell to add cells into rows and rows into the HTMLTable. Note the use of For Each ... In over the ArrayList collection to extract each link and image, assign it to another server control, the HTMLAnchor, add this HTMLAnchor to an HTMLTableCell, add the HTMLTableCell to an HTMLTableRow and finally add the HTMLTableRow to the HTMLTable. Very intuitive to program and code!

In the world of software development, the Holy Grail is seldom achieved, as there is No One Right Way to do things; however, there are Many Wrong Ways. This article is more or less a tutorial on certain ASP.NET objects and on how intuitive it is to program the web server controls. Feel free to modify the code at your own pace and leisure to suit your learning curve in the transition from VB to VB.NET/ASP.NET. Use the StringBuilder class instead of the String class in the ProcessURL function of HTMLContentParser.vb, and once you are familiar with building the HTMLTable programmatically, try the DataSource and DataBind techniques of the ASP DataList and DataGrid server controls.

There is only one way to learn properly, and that is from SCRATCH. That way, you can fully appreciate why and how you do things. After all, isn't the whole world of MS.NET developed from SCRATCH, which is much better than patches and add-ons to the imperfections of yesterday?