HTMLContentParser ASP.NET Project using VB.NET

The project I am posting today does not have much practical or functional value to anyone (although I think it is pretty cool for web designers and developers, and I am sure there will be detractors out there). It simply showcases a few ASP.NET objects, how easy they are to use, and some simple string manipulation.

This project is an HTML content parser. It fetches the HTML content of the web page at a specified URL, goes through the whole extracted stream, picks out the hyperlinks and images, and displays them in an HTML table as hyperlinks that users can click to go there directly. This is particularly useful for users who are interested in the images on a website and find it tedious to dig through the page's view source to extract the image sources.

Please do check out the live version of this project on my website, and feel free to post any comments or criticisms on my project and articles.

Let's make use of MS.NET's more intuitive OOP features to separate, encapsulate and group different functions into classes and assemblies for easier maintenance.

This code here goes into a Class called
HTMLContentParser.vb
'///
Imports System.IO
Imports System.Net
Imports System
Imports System.Collections
Imports System.Text
Imports System.Text.RegularExpressions
Public Class HTMLContentParser
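    ' Downloads the page at sURL and returns its HTML as a string.
    ' If the request fails, the exception message is returned instead.
    Function Return_HTMLContent(ByVal sURL As String) As String
        Dim sStream As Stream
        Dim URLReq As HttpWebRequest
        Dim URLRes As HttpWebResponse
        Try
            URLReq = CType(WebRequest.Create(sURL), HttpWebRequest)
            URLRes = CType(URLReq.GetResponse(), HttpWebResponse)
            sStream = URLRes.GetResponseStream()
            Return New StreamReader(sStream).ReadToEnd()
        Catch ex As Exception
            Return ex.Message
        End Try
    End Function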
    ' Returns an ArrayList of absolute URLs for every <a href=...> in the HTML.
    Function ParseHTMLLinks(ByVal sHTMLContent As String, ByVal sURL As String) As ArrayList
        Dim rRegEx As Regex
        Dim mMatch As Match
        Dim aMatch As New ArrayList()
        ' Group 1 captures the href value, either double-quoted or unquoted.
        ' The lazy [^>]*? keeps each match inside a single <a ...> tag.
        rRegEx = New Regex("<a\b[^>]*?href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>[^\s>]+))", _
            RegexOptions.IgnoreCase Or RegexOptions.Compiled)
        mMatch = rRegEx.Match(sHTMLContent)
        While mMatch.Success
            Dim sMatch As String
            sMatch = ProcessURL(mMatch.Groups(1).ToString, sURL)
            aMatch.Add(sMatch)
            mMatch = mMatch.NextMatch()
        End While
        Return aMatch
    End Function
    ' Returns an ArrayList of absolute URLs for every <img src=...> in the HTML.
    Function ParseHTMLImages(ByVal sHTMLContent As String, ByVal sURL As String) As ArrayList
        Dim rRegEx As Regex
        Dim mMatch As Match
        Dim aMatch As New ArrayList()
        ' Group 1 captures the src value, either double-quoted or unquoted.
        ' The lazy [^>]*? keeps each match inside a single <img ...> tag.
        rRegEx = New Regex("<img\b[^>]*?src\s*=\s*(?:""(?<1>[^""]*)""|(?<1>[^\s>]+))", _
            RegexOptions.IgnoreCase Or RegexOptions.Compiled)
        mMatch = rRegEx.Match(sHTMLContent)
        While mMatch.Success
            Dim sMatch As String
            sMatch = ProcessURL(mMatch.Groups(1).ToString, sURL)
            aMatch.Add(sMatch)
            mMatch = mMatch.NextMatch()
        End While
        Return aMatch
    End Function
    ' Turns a raw href/src value into an absolute URL based on the page URL.
    Private Function ProcessURL(ByVal sInput As String, ByVal sURL As String) As String
        'Find out if the sURL has a "/" after the Domain Name
        'If not, give a "/" at the end
        'First, check out for any slash after the
        'Double Dashes of the http://
        'If there is NO slash, then end the sURL string with a SLASH
        If InStr(8, sURL, "/") = 0 Then
            sURL += "/"
        End If
        'FILTERING
        'Filter down to the Domain Name Directory from the Right
        Dim iCount As Integer
        For iCount = sURL.Length To 1 Step -1
            If Mid(sURL, iCount, 1) = "/" Then
                sURL = Left(sURL, iCount)
                Exit For
            End If
        Next
        'Filter out the "&gt;" from the Left
        For iCount = 1 To sInput.Length
            If Mid(sInput, iCount, 4) = "&gt;" Then
                sInput = Left(sInput, iCount - 1) 'Stop and Take the Char before
                Exit For
            End If
        Next
        'Filter out unnecessary Characters
        sInput = sInput.Replace("&lt;", Chr(39))
        sInput = sInput.Replace("&gt;", Chr(39))
        sInput = sInput.Replace("&quot;", "")
        sInput = sInput.Replace("'", "")
        'Relative links are joined to the page URL, absolute ones pass through
        If (sInput.IndexOf("http://") < 0) Then
            If (Not (sInput.StartsWith("/")) And Not (sURL.EndsWith("/"))) Then
                Return sURL & "/" & sInput
            Else
                If (sInput.StartsWith("/")) And (sURL.EndsWith("/")) Then
                    Return sURL.Substring(0, sURL.Length - 1) + sInput
                Else
                    Return sURL + sInput
                End If
            End If
        Else
            Return sInput
        End If
    End Function
End Class
'///

The Return_HTMLContent function requires a URL parameter in string format. From there we use the HttpWebRequest and HttpWebResponse objects to send a request to the specified URL and get its HTML content as the response. Note the structured error handling implemented here: if anything fails, the exception message is returned in place of the HTML. Structured error handling is a different topic altogether. The returned value should be placed and displayed in an HTML TextBox so that it can be retrieved later.
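As a small illustration of the Try/Catch structure, here is a minimal sketch of a variant that also uses a Finally block and tells the caller whether the request worked. TryGetHtml, FetchDemo and the ByRef parameters are my own hypothetical names for this sketch and are not part of the HTMLContentParser class.

```vbnet
Imports System
Imports System.IO
Imports System.Net

Module FetchDemo
    ' Hypothetical variant of Return_HTMLContent: returns True on success and
    ' hands back the page text or the error message through ByRef parameters.
    Function TryGetHtml(ByVal sURL As String, ByRef sHtml As String, ByRef sError As String) As Boolean
        Dim URLRes As WebResponse = Nothing
        sHtml = ""
        sError = ""
        Try
            URLRes = WebRequest.Create(sURL).GetResponse()
            Dim srReader As New StreamReader(URLRes.GetResponseStream())
            sHtml = srReader.ReadToEnd()
            Return True
        Catch ex As Exception
            sError = ex.Message
            Return False
        Finally
            ' Runs on both paths, so an opened response is always closed.
            If Not URLRes Is Nothing Then URLRes.Close()
        End Try
    End Function

    Sub Main()
        Dim sHtml As String = ""
        Dim sError As String = ""
        ' "not a url" is deliberately malformed, so no network access is needed.
        If TryGetHtml("not a url", sHtml, sError) Then
            Console.WriteLine(sHtml)
        Else
            Console.WriteLine("Request failed: " & sError)
        End If
    End Sub
End Module
```

With this shape the page could show the error in a separate label instead of mixing it into the HTML TextBox, but that is only one possible design.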

The ParseHTMLLinks and ParseHTMLImages functions make use of the Regex object, which should be very familiar to Java and C# developers but may look alien to VB developers. Regex is a pattern-matching object used together with the Match object. Both belong to the System.Text.RegularExpressions namespace, which therefore MUST be imported into the VB.NET class. Each function runs the Regex pattern against the HTML content (taken from the HTML TextBox we used earlier to display the retrieved HTML) and receives the result as a Match object. Whenever it finds text matching the pattern, it takes the captured string, processes it with the ProcessURL function and adds it to an ArrayList. The ArrayList class is essentially the Collection class of classic VB: it can add and remove items, which is far more intuitive and easier to use than a plain array. Both ParseHTMLLinks and ParseHTMLImages return an ArrayList, of links and of images respectively.
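To see the Regex and Match objects at work without a web page, here is a minimal sketch that runs a pattern in the same spirit as the one in ParseHTMLLinks over a hard-coded HTML fragment. ExtractHrefs, RegexLinkDemo and the sample HTML are hypothetical and exist only for this illustration; the raw values are not passed through ProcessURL.

```vbnet
Imports System
Imports System.Collections
Imports System.Text.RegularExpressions

Module RegexLinkDemo
    ' Hypothetical helper, not part of HTMLContentParser: returns the raw
    ' href values found in a fragment of HTML.
    Function ExtractHrefs(ByVal sHTML As String) As ArrayList
        Dim aFound As New ArrayList()
        ' Group 1 captures either a double-quoted value or an unquoted run.
        Dim rRegEx As New Regex("<a\b[^>]*?href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>[^\s>]+))", _
            RegexOptions.IgnoreCase)
        Dim mMatch As Match = rRegEx.Match(sHTML)
        ' Walk from match to match until the pattern stops matching.
        While mMatch.Success
            aFound.Add(mMatch.Groups(1).Value)
            mMatch = mMatch.NextMatch()
        End While
        Return aFound
    End Function

    Sub Main()
        Dim sSample As String = "<p><a href=""/about.html"">About</a> <A HREF=news.html>News</A></p>"
        Dim sLink As String
        For Each sLink In ExtractHrefs(sSample)
            Console.WriteLine(sLink)
        Next
    End Sub
End Module
```

For the sample fragment this prints /about.html and news.html, one per line: the first comes from the quoted alternative of the pattern and the second from the unquoted one.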

The ProcessURL function uses some intrinsic VB functions and some new VB.NET ones (I have been a VB developer for years and am therefore familiar with them). I also realize that some detractors out there will propose using the StringBuilder class to manipulate the strings in this function. The difference is that the String class is immutable: every operation that changes a string creates a new String instance, so when the same string is manipulated 10 times, 10 new instances are created. Hardly efficient. The StringBuilder class is mutable: it modifies its own internal buffer instead of creating a new instance each time, and is therefore easier on the machine's resources. I use the String class here because, although it is less efficient, it is much more familiar to VB developers who are in transition to VB.NET, and this article focuses more or less on the ASP.NET HttpWebRequest and HttpWebResponse objects. I will save the StringBuilder class for later articles, although I am sure other developers and authors here have already explained it.
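To make the String versus StringBuilder trade-off concrete, here is a minimal sketch that builds the same text both ways. JoinWithString, JoinWithBuilder and the sample URL parts are hypothetical names of my own and are not taken from ProcessURL.

```vbnet
Imports System
Imports System.Text

Module StringBuilderDemo
    ' Hypothetical helper: joins the parts with String concatenation.
    ' Each &= produces a brand-new String object.
    Function JoinWithString(ByVal aParts() As String) As String
        Dim sResult As String = ""
        Dim sPart As String
        For Each sPart In aParts
            sResult &= sPart
        Next
        Return sResult
    End Function

    ' Same result, but every Append writes into the one StringBuilder buffer.
    Function JoinWithBuilder(ByVal aParts() As String) As String
        Dim sbResult As New StringBuilder()
        Dim sPart As String
        For Each sPart In aParts
            sbResult.Append(sPart)
        Next
        Return sbResult.ToString()
    End Function

    Sub Main()
        Dim aParts() As String = {"http://", "example.com", "/", "images/logo.gif"}
        Console.WriteLine(JoinWithString(aParts))
        Console.WriteLine(JoinWithBuilder(aParts))
    End Sub
End Module
```

Both calls return the same text. For a handful of concatenations, as in ProcessURL, the difference is unlikely to be noticeable; it starts to matter when a string is built up inside a long loop.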

This code here goes into an ASP.NET ASPX page
'///
'The parser must be instantiated before its functions can be called
Private objParser As New HTMLContentParser()
Private Sub cmdGetHTML_ServerClick(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles cmdGetHTML.ServerClick
    Dim sURL As String = "http://" & txtURL.Value
    txtHTMLContent.EnableViewState = False
    txtHTMLContent.Value = objParser.Return_HTMLContent(sURL)
End Sub
Private Sub cmdParse_ServerClick(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles cmdParse.ServerClick
    Call PopulatetblParsedContent()
End Sub
Private Sub PopulatetblParsedContent()
    'Populate Links Table
    Dim sURL As String = "http://" & txtURL.Value
    Dim myAnchor As HtmlAnchor
    Dim objRow As HtmlTableRow
    Dim objCell As HtmlTableCell
    Dim sLinks As String
    Dim sImage As String
    Dim lstLinks As ArrayList = objParser.ParseHTMLLinks(txtHTMLContent.Value, sURL)
    Dim lstImages As ArrayList = objParser.ParseHTMLImages(txtHTMLContent.Value, sURL)
    tblParsedContent.EnableViewState = False
    'One row per link: anchor -> cell -> row -> table
    For Each sLinks In lstLinks
        objRow = New HtmlTableRow()
        objCell = New HtmlTableCell()
        myAnchor = New HtmlAnchor()
        myAnchor.Target = "_blank"
        myAnchor.InnerText = "Link: " & sLinks.ToString
        myAnchor.HRef = sLinks.ToString
        objCell.NoWrap = False
        objCell.Controls.Add(myAnchor)
        objRow.Cells.Add(objCell)
        tblParsedContent.Rows.Add(objRow)
    Next
    'One row per image, built the same way
    For Each sImage In lstImages
        objRow = New HtmlTableRow()
        objCell = New HtmlTableCell()
        myAnchor = New HtmlAnchor()
        myAnchor.Target = "_blank"
        myAnchor.InnerText = "Img: " & sImage.ToString
        myAnchor.HRef = sImage.ToString
        objCell.NoWrap = False
        objCell.Controls.Add(myAnchor)
        objRow.Cells.Add(objCell)
        tblParsedContent.Rows.Add(objRow)
    Next
End Sub
'///

Now let us focus on how to extract the HTML content and parse it. Design your ASPX page and have
1) An HTMLTextBox called txtURL for users to specify the URL for processing
2) An HTMLButton called cmdGetHTML with a ServerClick event handler to handle the Click event. The event triggers a routine that uses the HTMLContentParser class we coded earlier and calls its Return_HTMLContent function to return a string of HTML content for display in the txtHTMLContent HTMLTextBox.
3) An HTMLTextBox called txtHTMLContent to hold the returned HTML content
4) An HTMLButton called cmdParse with a ServerClick event handler that calls PopulatetblParsedContent
5) An HTMLTable called tblParsedContent

Personally, I think the HTMLTable server control is amazing. I have used this example to show how intuitive it is to add cells and rows to it. Again, my detractors out there may question the routine that populates the table. I say that this article just explains one of the ways to populate an HTMLTable. It is very intuitive, and I am sure most developers can understand it without much explanation. It makes use of HTMLTableRow and HTMLTableCell to add cells to rows and rows to the HTMLTable. Note the use of For Each ... In over the ArrayList collection to extract each link and image, assign it to another server control, HTMLAnchor, add this HTMLAnchor to an HTMLTableCell, add the HTMLTableCell to an HTMLTableRow and finally add the HTMLTableRow to the HTMLTable. Very intuitive to program and code!

In the world of software development, the Holy Grail is seldom achieved, as there is No One Right Way to do things; however, there are Many Wrong Ways. This article is more or less a tutorial on certain ASP.NET objects and on how intuitive the web server controls are to program. Feel free to modify the code at your own pace and leisure to suit your learning curve in the transition from VB to VB.NET/ASP.NET. Use the StringBuilder class instead of the String class in the ProcessURL function of the HTMLContentParser.vb class and, once you are familiar with the program structure of the HTMLTable, try the DataSource and DataBind techniques of the DataList and DataGrid server controls. There is only one way to learn properly, and that is from SCRATCH. That way you can fully appreciate why and how you do things. After all, isn't the whole world of MS.NET developed from SCRATCH, which is much better than patches and add-ons to the imperfections of yesterday?

 
        
