WebWagon - An HTML Container Class

 The Microsoft HTML Document class, which comes with the Web Browser Control or the MSHTML Object Library, provides a rich, full-featured set of classes for retrieving and manipulating HTML pages. However, there are a number of drawbacks to using this class when writing an Internet mining application such as a web crawler:

  • Overhead: Images are downloaded automatically. This is unwanted overhead when the task at hand is simply to examine or retrieve the contents of the page itself.
  • Dialog Boxes: Any page encountered with an unknown font, a required password or any similar problem will cause a dialog box to display, halting the application until it is closed.
  • Hanging: This has been especially bad since Windows 2000. I have had trouble with sessions that suddenly stop responding and have to be stopped and restarted.

In this article, we will develop a simple, lightweight HTML Container class that can be used for retrieving and processing web pages. We will look at how to grab an HTML document from the Web and how to parse the information once it has been retrieved.

Requesting a Page from the Web 

The basic functions provided by the HTML Container class will be the ability to load source from a URL and parse the loaded HTML. Since we need something to parse before we can parse it, a look at how to request a page from the Web would seem to be the logical first step.

We will use the HttpWebRequest and HttpWebResponse classes, which are part of the System.Net namespace. Requesting and loading a page from the Net is a multi-step process. First, a web request is initiated by invoking the Create method, which is passed a URL and returns the request object that we treat as an HttpWebRequest. We then invoke the GetResponse method of this object to obtain an HttpWebResponse object. Next, the GetResponseStream method is called to return a Stream object, which can be read using a StreamReader. Once all the appropriate classes have been instantiated, we begin a loop using the Read method of the StreamReader to load 256 characters at a time until we have reached the end of the stream. If that weren’t enough, we still have to convert the filled portion of the character array to a string, which is then appended to the accumulated source on each pass. The basic process is listed below:

VB.Net:  

 

 Dim hrqURL As System.Net.HttpWebRequest _
     = System.Net.HttpWebRequest.Create("http://www.bbc.co.uk")
 Dim hrspURL As System.Net.HttpWebResponse _
     = hrqURL.GetResponse()
 Dim srdrInput _
     As New System.IO.StreamReader(hrspURL.GetResponseStream)
 Dim chrBuff(255) As Char   ' 256-character buffer (indexes 0 to 255)
 Dim intLen As Integer
 Dim strSource As String = ""
 Do
     ' Read returns the number of characters read, or 0 at end of stream
     intLen = srdrInput.Read(chrBuff, 0, 256)
     Dim strBuff As New String(chrBuff, 0, intLen)
     strSource = strSource & strBuff
 Loop While intLen > 0

 m_strSource = strSource

C#:

 

 System.Net.HttpWebRequest hrqURL
     = (System.Net.HttpWebRequest)
       System.Net.HttpWebRequest.Create("http://www.bbc.co.uk");
 System.Net.HttpWebResponse hrspURL
     = (System.Net.HttpWebResponse) hrqURL.GetResponse();
 System.IO.StreamReader srdrInput
     = new System.IO.StreamReader(hrspURL.GetResponseStream());
 char[] chrBuff = new char[256];
 int intLen = 0;
 string strSource = "";
 do
 {
    // Read returns the number of characters read, or 0 at end of stream
    intLen = srdrInput.Read(chrBuff, 0, 256);
    string strBuff = new string(chrBuff, 0, intLen);
    strSource = strSource + strBuff;
 }
 while (intLen > 0);

 m_strSource = strSource;

 

You can save a little typing by using the “Imports” statement in VB.Net or the “using” directive in C# to declare the namespaces at the beginning of the source code. This allows you to skip the fully qualified names when using the classes. The later listings refer to HttpWebRequest and StreamReader without qualification, so System.Net and System.IO need to be imported. For this project, we will also want to import the System.Text and System.Text.RegularExpressions namespaces. Additionally, the C# version uses the System.Collections namespace, whose ArrayList class compensates for the ReDim statement, which C# does not support.
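The read loop shown earlier can be tried without a network connection by pointing it at any TextReader. The following is only a sketch: the ChunkedReader class and its ReadAll method are hypothetical names that are not part of the container class, and a StringBuilder is swapped in for the repeated string concatenation of the original listing.

```csharp
using System.IO;
using System.Text;

public static class ChunkedReader
{
    // Hypothetical helper (not part of the original class): reads any
    // TextReader to the end, 256 characters at a time.
    public static string ReadAll(TextReader reader)
    {
        char[] buffer = new char[256];
        StringBuilder source = new StringBuilder();
        int length;
        // Read returns 0 once the end of the stream has been reached.
        while ((length = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Append only the portion of the buffer that was filled.
            source.Append(buffer, 0, length);
        }
        return source.ToString();
    }
}
```

Passing a StreamReader built on GetResponseStream() would behave the same way as passing a StringReader in a test.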

 

After loading the source, we save it to the private variable m_strSource, which is exposed to the object owner via a Source property. This property is both readable and writable, which provides a way of loading the class directly from text. There are also some useful properties in the HttpWebResponse object, such as the host name and content type, that are populated upon a successful load, as we shall see later on. These need to be cleared when the Source is set directly:

 

VB.Net:

 

    Public Property Source() As String
        Get
            Source = m_strSource
        End Get
        Set(ByVal Value As String)
            m_strSource = Value
            m_strHost = ""
            m_strCharacterSet = ""
            m_strContentEncoding = ""
            m_lngContentLength = 0
            m_strContentType = ""
            m_strLastModified = ""
        End Set
    End Property

 

C#:

 

       public string Source
       {
           get
           { return(m_strSource); }
           set
           {
              m_strSource = value;
              m_strHost = "";
              m_strCharacterSet = "";
              m_strContentEncoding = "";
              m_lngContentLength = 0;
              m_strContentType = "";
              m_strLastModified = "";
           }
       }

  

Parsing the HTML Document:

 

       Now that we have a way to populate the class, let’s take a look at making sense of what’s there. If you are not familiar with regular expressions, you will have to operate a bit on faith here. Numerous books have been written on regular expressions, and even a quick look at the basics is beyond what space permits here. However, we can look at a very simple example that will at least whet your appetite for more. We will use several regular expressions to parse the HTML document, but the simplest is the following:

  <[^>]*>

 

This tells the regular expression engine to find a pattern that begins with a less-than sign, ends with a greater-than sign and includes everything in between. This regular expression will match any tag. Now let’s look at matching an entire block, such as a <HEAD></HEAD> entry. We can match everything in the HEAD block with the following regular expression:

 

  <HEAD[^>]*>.*</HEAD\s*>
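To see the two patterns at work, here is a small sketch in C#. The TagPatternDemo class and its method names are made up for illustration. The Singleline option is assumed so that the dot can cross line breaks, and IgnoreCase so that <head> and <HEAD> are treated alike.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagPatternDemo
{
    // Returns every tag found in the text, in document order.
    public static List<string> FindTags(string html)
    {
        List<string> tags = new List<string>();
        foreach (Match m in Regex.Matches(html, @"<[^>]*>"))
        {
            tags.Add(m.Value);
        }
        return tags;
    }

    // Returns the whole HEAD block, or an empty string if there is none.
    public static string FindHead(string html)
    {
        Match m = Regex.Match(html, @"<HEAD[^>]*>.*</HEAD\s*>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Value;
    }
}
```

For the input <p>Hi <b>there</b></p>, FindTags yields the four tags <p>, <b>, </b> and </p>.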

 

       A special note is warranted here. You need to be aware of the conventions of the language in question when assigning strings such as the one above. C# treats a backslash inside an ordinary string literal as the start of an escape sequence, so the “\s” portion will not survive as written. Embedded quotes are also a potential issue with either VB.Net or C#. With C#, it is possible to double each backslash (“\\s”), but it is generally easier to prefix the string with an at-sign (“@”), which tells C# to ignore escape sequences, similar to how VB.Net treats a quoted string. In a C# verbatim string and in a VB.Net string, an embedded quote is written as two quotes. The following table illustrates the point:

 

Example:

  Desired Regular Expression    \s"\n
  C# (Method 1)                 "\\s\"\\n"
  C# (Method 2)                 @"\s""\n"
  VB.Net                        "\s""\n"
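The equivalence of the two C# forms is easy to check. In this sketch (RegexLiteralDemo is a hypothetical name), both constants hold the same five characters: a backslash, s, a quote, a backslash and n.

```csharp
using System.Text.RegularExpressions;

public static class RegexLiteralDemo
{
    // Method 1: ordinary literal, each backslash doubled, quote escaped.
    public const string Escaped = "\\s\"\\n";

    // Method 2: verbatim literal, backslashes as-is, quote doubled.
    public const string Verbatim = @"\s""\n";

    public static bool SameText()
    {
        return Escaped == Verbatim;
    }

    // True when the input contains whitespace, a quote and a newline in a row.
    public static bool Matches(string input)
    {
        return Regex.IsMatch(input, Verbatim);
    }
}
```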

 

Using the power of regular expressions, we can extract the raw text from our document using only a few lines of code. The type of regular expression used for the HEAD tag will work for most but not all HTML tags. Some tags, such as <BR>, do not need a closing tag. Other tags, such as the comment tag, use a closing tag that is not in the standard form of a slash followed by the tag name. We will deal with the problem of non-closed blocks later, but for now we will write a function that accepts a tag name and returns the regular expression that will match that tag’s contents, assuming there is a corresponding closing tag:

 

VB.Net:

 

    Private Function GetExpressionForTagContents _
      (ByVal strTagName As String) As String
        Dim strPatternTag As String
        If strTagName = "!" Then
            ' Comment: runs to the first "-->"
            strPatternTag = "<!.*?-->"
        ElseIf StrComp(strTagName, "!doctype", CompareMethod.Text) _
            = 0 Then
            strPatternTag = "<!doctype.*?>"
        ElseIf StrComp(strTagName, "br", CompareMethod.Text) = 0 Then
            ' <br> has no closing tag; also accept <br/> and <br />
            strPatternTag = "<br\s*/?\s*>"
        Else
            ' Opening tag, lazy contents, then the closing tag;
            ' \1 refers back to the captured tag name
            strPatternTag _
              = "<(" & strTagName & ")(>|\s+[^>]*>).*?</\1\s*>"
        End If
        GetExpressionForTagContents = strPatternTag
    End Function

 

C#:

 

private string GetExpressionForTagContents
    (string strTagName)
{
    string strPatternTag;
    if (strTagName == "!")
       // Comment: runs to the first "-->"
       strPatternTag = "<!.*?-->";
    else if (string.Compare(strTagName, "!doctype", true) == 0)
       strPatternTag = "<!doctype.*?>";
    else if (string.Compare(strTagName, "br", true) == 0)
       // <br> has no closing tag; also accept <br/> and <br />
       strPatternTag = @"<br\s*/?\s*>";
    else
       // Opening tag, lazy contents, then the closing tag;
       // \1 refers back to the captured tag name
       strPatternTag
           = @"<(" + strTagName + @")(>|\s+[^>]*>).*?</\1\s*>";
    return(strPatternTag);
}

 

       If no closing tag is present, this match will fail. In that case, we will take everything that lies between the tag in question and the next less-than sign, or the end of the file. Let’s put all of this together and write a routine that will accept any tag name and return the contents, closing tag or not:

 

VB.Net:

 

  Private Function GetTagByName _
    (ByVal strTagName As String, ByVal strSource As String) As String
     Dim strPatternTag As String = _
       GetExpressionForTagContents(strTagName)
     ' Fallback: the opening tag plus everything up to the next "<"
     Dim strPatternTagNoClose As String _
       = "<" & strTagName & "(>|\s+[^>]*>)[^<]*"
     Dim opts As RegexOptions _
       = RegexOptions.IgnoreCase _
         Or RegexOptions.Singleline
     Dim m As Match

     m = Regex.Match(strSource, strPatternTag, opts)
     If Not m.Success Then
         ' No closing tag was found, so try the fallback pattern
         m = Regex.Match(strSource, strPatternTagNoClose, opts)
     End If
     ' Regex.Match never returns Nothing; a failed match has an
     ' empty Value, which callers such as Body test for
     GetTagByName = m.Value
  End Function

 

C#:

 

    private string GetTagByName
       (string strTagName, string strSource)
    {
       string strPatternTag
           = GetExpressionForTagContents(strTagName);
       // Fallback: the opening tag plus everything up to the next "<"
       string strPatternTagNoClose
           = "<" + strTagName + @"(>|\s+[^>]*>)[^<]*";
       RegexOptions opts
           = RegexOptions.IgnoreCase | RegexOptions.Singleline;
       Match m;

       m = Regex.Match(strSource, strPatternTag, opts);
       if (!m.Success)
       {
           // No closing tag was found, so try the fallback pattern
           m = Regex.Match(strSource, strPatternTagNoClose, opts);
       }
       // Regex.Match never returns null; a failed match has an
       // empty Value, which callers such as Body test for
       return(m.Value);
    }

 

       We can now pull out any single tag, or the first of multiple tags, given its name. As you may have noticed, the above function was declared as private, so it is for in-class use only. Soon we will write a public function called GetTagsByName that returns an array of tag contents, which is suitable for returning the data for any tag whether it is found one time or many times. First, let’s add properties to return the HEAD, TITLE and BODY. For the HEAD and TITLE properties, we can just return the result of GetTagByName, passing the appropriate tag name. For the BODY, we need to be a bit more careful. It is possible that we could encounter a document with no HEAD or BODY tag at all. In such a case, GetTagByName will return an empty string, and we will simply return the entire document:

 

VB.Net:

 

    Public ReadOnly Property Head() As String
        Get
            Head = GetTagByName("Head", m_strSource)
        End Get
    End Property

    Public ReadOnly Property Title() As String
        Get
            Title = GetTagByName("Title", m_strSource)
        End Get
    End Property

    Public ReadOnly Property Body() As String
        Get
            Dim strBody As String _
             = GetTagByName("Body", m_strSource)
            If strBody = "" Then
                strBody = Me.Source
            End If
            Body = strBody
        End Get
    End Property

 

 

C#:

 

    public string Head
    {
       get
       {
           return(GetTagByName("Head", m_strSource));
       }
    }

    public string Title
    {
       get
       {
           return(GetTagByName("Title", m_strSource));
       }
    }

    public string Body
    {
       get
       {
           string strBody
              = GetTagByName("Body", m_strSource);
           if (strBody == "")
              strBody = this.Source;
           return(strBody);
       }
    }

 

Extracting the Text:

 

       We now have everything to pull out the raw text from the document. The text will be found in the BODY portion of the document so we will start with this. Then any comments plus anything found between SCRIPT blocks will be stripped. Next we will remove all remaining tags. Everything left over should be the textual content of the page.
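The sequence of steps just described can be sketched on a plain string as follows. This is an illustration rather than the class’s Text property: TextExtractionDemo and ExtractText are hypothetical names, the patterns are written out inline instead of being built by GetExpressionForTagContents, the comment pattern is assumed to start with <!--, and whitespace is collapsed with a single \s+ replacement.

```csharp
using System.Text.RegularExpressions;

public static class TextExtractionDemo
{
    public static string ExtractText(string body)
    {
        RegexOptions opts
            = RegexOptions.IgnoreCase | RegexOptions.Singleline;
        // 1. Strip comments.
        string text = Regex.Replace(body, @"<!--.*?-->", "", opts);
        // 2. Strip SCRIPT blocks, contents included.
        text = Regex.Replace(text,
            @"<script(>|\s+[^>]*>).*?</script\s*>", "", opts);
        // 3. Replace every remaining tag with a blank.
        text = Regex.Replace(text, @"<[^>]*>", " ", opts);
        // 4. Collapse runs of whitespace into a single blank.
        text = Regex.Replace(text, @"\s+", " ");
        return text.Trim();
    }
}
```

Replacing each tag with a blank rather than with nothing keeps words from adjacent elements from running together, at the cost of an occasional stray blank before punctuation.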

 

       Having extracted the text from the surrounding tags, we still have the possibility of encoded ISO characters such as the famous &nbsp;. We will write a routine (not listed) to search for each of these and replace the encoded character with its corresponding ASCII equivalent. Lastly, we use a loop to remove any repeated blanks. The code listing for the Text property, as well as a helper function to strip the comments, appears below:

 

VB.Net:

 

    Public ReadOnly Property Text() As String
        Get
            Dim opts As RegexOptions _
              = RegexOptions.IgnoreCase Or RegexOptions.Singleline
            Dim strText As String = Me.Body
            strText = StripComments(strText)
            ' Remove SCRIPT blocks, then every remaining tag
            Dim strPattern As String _
              = GetExpressionForTagContents("SCRIPT")
            strText = Regex.Replace(strText, strPattern, "", opts)
            strPattern = "<[^>]*>"
            strText = Regex.Replace(strText, strPattern, " ", opts)
            strText = ISOtoASCII(strText)
            strText = Regex.Replace(strText, "&amp;", "&", opts)
            ' Collapse pairs of whitespace until none remain
            Dim m As MatchCollection
            Do
                strText = Regex.Replace(strText, "\s\s", " ")
                m = Regex.Matches(strText, "\s\s")
            Loop Until m.Count = 0
            Text = Trim(strText)
        End Get
    End Property

    Private Function StripComments(ByVal strSource As String) As String
        ' Singleline lets "." cross line breaks, so comments that
        ' span several lines are removed as well
        Dim r As New Regex( _
          GetExpressionForTagContents("!"), RegexOptions.Singleline)
        StripComments = r.Replace(strSource, "")
    End Function

 

C#:

 

 public string Text
 {
    get
    {
       RegexOptions opts
           = RegexOptions.IgnoreCase | RegexOptions.Singleline;
       string strText = this.Body;
       strText = StripComments(strText);
       // Remove SCRIPT blocks, then every remaining tag
       string strPattern = GetExpressionForTagContents("SCRIPT");
       strText = Regex.Replace(strText, strPattern, "", opts);
       strPattern = @"<[^>]*>";
       strText = Regex.Replace(strText, strPattern, " ", opts);
       strText = ISOtoASCII(strText);
       strText = Regex.Replace(strText, "&amp;", "&", opts);
       // Collapse pairs of whitespace until none remain
       MatchCollection m;
       do
       {
           strText = Regex.Replace(strText, @"\s\s", " ");
           m = Regex.Matches(strText, @"\s\s");
       }
       while (m.Count > 0);

       return(strText.Trim());
    }
 }

 // Takes the text to clean as a parameter, matching the calls in
 // Text, GetTagsByName and GetHRefs
 private string StripComments(string strSource)
 {
    // Singleline lets "." cross line breaks, so comments that
    // span several lines are removed as well
    Regex r
       = new Regex(GetExpressionForTagContents("!"),
           RegexOptions.Singleline);
    return(r.Replace(strSource, ""));
 }
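The ISOtoASCII routine is not listed in the article. Purely as an assumption about what such a routine might look like, here is a minimal sketch built on a small lookup table. The class name, the method name and the handful of entities chosen are illustrative, and a real routine would cover many more entries.

```csharp
using System.Collections.Generic;

public static class EntityDemo
{
    // A deliberately tiny, hypothetical table of encoded characters
    // and the plain-text equivalents chosen for this sketch.
    private static readonly Dictionary<string, string> Entities
        = new Dictionary<string, string>
        {
            { "&nbsp;", " " },
            { "&lt;", "<" },
            { "&gt;", ">" },
            { "&quot;", "\"" },
            { "&copy;", "(c)" }
        };

    public static string IsoToAscii(string text)
    {
        // string.Replace is case-sensitive, so only the lower-case
        // spellings in the table are converted.
        foreach (KeyValuePair<string, string> entry in Entities)
        {
            text = text.Replace(entry.Key, entry.Value);
        }
        return text;
    }
}
```

The table leaves out &amp; on purpose: the Text property replaces it in a separate step afterwards, which avoids decoding text such as &amp;lt; twice.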

 

Iterating the Tags:

 

       We are now ready to write GetTagsByName, which will accept a tag name and return an array containing the contents of all matching items. We initially might be tempted to re-write the single tag routine GetTagByName to use a Match Collection rather than a single Match, returning the value of each item. This is not going to work. Consider the following HTML snippet:

 

<UL>

 <UL>

  <UL>

    <LI>Hello Mars!

  </UL>

 </UL>

</UL>

 

       If we want to return the <UL> tag contents, the regular expression used by GetTagByName will produce a single match that begins at the outermost <UL> tag. Because the search for the next match resumes after the end of the previous one, the <UL> tags nested inside that match will be ignored. To solve this problem, we will first create a Match Collection that matches just the opening tag rather than the entire contents, which is the <UL> tag itself in this case. Then we will iterate through each of these matches, calling GetTagByName but passing only the portion of the document that begins at the point of that particular match. Let’s take a look at how to put it all together. GetTagsByName will begin as usual by stripping out comments. Of course, we will only want to do this if the tag being requested is not itself a comment. Next we will create a Match Collection by applying a regular expression that returns a Match object for each matching item. Finally, we will loop through the collection, calling GetTagByName for each matching tag. Note that the Index property of the Match object indicates the position of the match, so it is a simple matter to pass only the portion of the document that begins where that tag begins:

 

VB.Net:

 

    Public Function GetTagsByName(ByVal TagName As String, _
      ByVal Source As String) As String()
        Dim opts As RegexOptions _
          = RegexOptions.IgnoreCase Or RegexOptions.Singleline
        Dim strPattern As String
        If TagName <> "!" Then
            ' Comments are removed unless a comment is what was asked for
            Source = StripComments(Source)
            strPattern = "<(?<TagName>" & TagName & ")(>|\s+[^>]*>)"
        Else
            strPattern = "<(?<TagName>" & TagName & ")--"
        End If
        Dim mc As MatchCollection _
            = Regex.Matches(Source, strPattern, opts)
        Dim m As Match
        Dim strTagContents() As String
        Dim intIndex As Integer = 0
        For Each m In mc
            ReDim Preserve strTagContents(intIndex)
            ' The TagName group starts one character after "<". Mid is
            ' 1-based, so this 0-based index lands on the "<" itself.
            strTagContents(intIndex) _
              = GetTagByName(TagName, _
              Mid(Source, m.Groups("TagName").Index))
            intIndex = intIndex + 1
        Next
        GetTagsByName = strTagContents
    End Function

 

C#:

 

 public string[] GetTagsByName(string TagName
    , string Source)
 {
    RegexOptions opts
       = RegexOptions.IgnoreCase | RegexOptions.Singleline;
    string strPattern;
    if (TagName != "!")
    {
       // Comments are removed unless a comment is what was asked for
       Source = StripComments(Source);
       strPattern
           = "<(?<TagName>" + TagName + @")(>|\s+[^>]*>)";
    }
    else
       strPattern
           = "<(?<TagName>" + TagName + @")--";
    MatchCollection mc
       = Regex.Matches(Source, strPattern, opts);
    ArrayList strTagContents = new ArrayList();
    foreach (Match m in mc)
    {
       // The TagName group starts one character after "<", so
       // Index - 1 is the position of the "<" itself
       strTagContents.Add(GetTagByName(TagName,
           Source.Substring
            (m.Groups["TagName"].Index - 1)));
    }
    return((string[]) strTagContents.ToArray(typeof(String)));
 }
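The reason for matching only the opening tags can be checked directly against the nested <UL> snippet. In this sketch, NestedTagDemo is a hypothetical class and the patterns are written out by hand for the UL tag rather than built by the helper functions.

```csharp
using System.Text.RegularExpressions;

public static class NestedTagDemo
{
    public const string Snippet =
        "<UL>\n <UL>\n  <UL>\n    <LI>Hello Mars!\n  </UL>\n </UL>\n</UL>";

    private const RegexOptions Opts
        = RegexOptions.IgnoreCase | RegexOptions.Singleline;

    // Matching whole blocks: the first match swallows the nested
    // opening tags, so they are never reported on their own.
    public static int CountFullBlockMatches()
    {
        return Regex.Matches(Snippet,
            @"<(UL)(>|\s+[^>]*>).*?</\1\s*>", Opts).Count;
    }

    // Matching only the opening tag: one match per <UL>.
    public static int CountOpeningTagMatches()
    {
        return Regex.Matches(Snippet,
            @"<(?<TagName>UL)(>|\s+[^>]*>)", Opts).Count;
    }
}
```

The first method finds one match and the second finds three, which is why GetTagsByName starts from the opening tags and calls GetTagByName once per position.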

 

       Another useful function to add would be one to return all HRefs on each page. We could write a routine to retrieve all the anchor tags, and then pull out the HRef portion, but we can get a little better throughput by writing a specific regular expression especially for this purpose:

 

  <a[^>]*href\s*=\s*""?(?<HRef>[^"">\s]*)""?[^>]*>

 

       The ?<HRef> in the expression will capture the appropriate portion of the anchor tag and assign it a group name of HRef, allowing us to refer to it later using the syntax:

 

  m.Groups("HRef").Value
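The line above uses VB.Net syntax; in C# the group is indexed with square brackets. As a small sketch (HRefDemo and ExtractHRefs are hypothetical names), the expression can be tried on its own before any normalization is added. The doubled quotes in the pattern are simply how a quote is written inside a verbatim C# string.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HRefDemo
{
    public static List<string> ExtractHRefs(string html)
    {
        Regex r = new Regex(
            @"<a[^>]*href\s*=\s*""?(?<HRef>[^"">\s]*)""?[^>]*>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        List<string> hrefs = new List<string>();
        foreach (Match m in r.Matches(html))
        {
            // The named group holds just the address, without quotes.
            hrefs.Add(m.Groups["HRef"].Value);
        }
        return hrefs;
    }
}
```

Both quoted and unquoted attribute values are picked up, since the quotes on either side of the group are optional in the pattern.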

 

We will also go ahead and normalize the HRefs as well by prefixing the server or protocol if missing. So, if the current page is www.bbc.co.uk, an entry such as:

  /MoreStuff.html

      

       will be normalized to:

  http://www.bbc.co.uk/MoreStuff.html

 

      The normalization will be skipped for any empty HRefs or HRefs beginning with the “#” character. Here is the complete listing:

 

VB.Net:

 

    Public Function GetHRefs() As String()
        Dim strSource As String = StripComments(m_strSource)
        Dim r As New Regex( _
          "<a[^>]*href\s*=\s*""?(?<HRef>[^"">\s]*)""?[^>]*>", _
           RegexOptions.IgnoreCase Or RegexOptions.Singleline)
        ' Search the comment-free copy, not the raw source
        Dim mc As MatchCollection _
            = r.Matches(strSource)
        Dim m As Match
        Dim intIndex As Integer = 0
        Dim strHRefs() As String
        For Each m In mc
            Dim strHRef As String = Trim(m.Groups("HRef").Value)
            If strHRef <> "" Then
                If Left(strHRef, 1) <> "#" Then
                    If Left(strHRef, 1) = "/" Then
                        ' Server-relative: prefix protocol and server
                        strHRef = m_strServerURL & strHRef
                    ElseIf StrComp(Left(strHRef, 7), "http://", _
                        CompareMethod.Text) <> 0 Then
                        If StrComp(Left(strHRef, 3), "www", _
                         CompareMethod.Text) = 0 Then
                            ' Protocol missing
                            strHRef = "http://" & strHRef
                        Else
                            ' Relative to the current page's folder
                            strHRef = m_strPathURL & strHRef
                        End If
                    End If
                End If
            End If
            ReDim Preserve strHRefs(intIndex)
            strHRefs(intIndex) = strHRef
            intIndex = intIndex + 1
        Next
        GetHRefs = strHRefs
    End Function

 

C#:

 

 public string[] GetHRefs()
 {
    string strSource = StripComments(m_strSource);
    Regex r = new
       Regex(@"<a[^>]*href\s*=\s*""?(?<HRef>[^"">\s]*)""?[^>]*>",
          RegexOptions.IgnoreCase | RegexOptions.Singleline);
    // Search the comment-free copy, not the raw source
    MatchCollection mc = r.Matches(strSource);
    ArrayList strHRefs = new ArrayList();
    foreach (Match m in mc)
    {
       // Trim returns a new string, so its result must be kept
       string strHRef = m.Groups["HRef"].Value.Trim();
       if (strHRef != "" && Left(strHRef, 1) != "#")
       {
          if (Left(strHRef, 1) == "/")
             // Server-relative: prefix protocol and server
             strHRef = m_strServerURL + strHRef;
          else if (String.Compare(Left(strHRef, 7),
                   "http://", true) != 0)
          {
             if (String.Compare(Left(strHRef, 3),
                   "www", true) == 0)
                // Protocol missing
                strHRef = "http://" + strHRef;
             else
                // Relative to the current page's folder
                strHRef = m_strPathURL + strHRef;
          }
       }
       strHRefs.Add(strHRef);
    }
    return (string[]) strHRefs.ToArray(typeof(string));
 }

 // Stand-in for the VB.Net Left function
 private string Left(string strString, int intLen)
 {
    if (strString.Length <= intLen)
       return(strString);
    else
       return(strString.Substring(0, intLen));
 }
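As an alternative to the hand-written prefix rules, relative addresses can also be resolved with the System.Uri class. This is a different technique from the one in the listing, sketched here under the assumption that the full URL of the current page is available; HRefNormalizer and Normalize are hypothetical names.

```csharp
using System;

public static class HRefNormalizer
{
    // Resolves href against the address of the page it was found on.
    public static string Normalize(string pageUrl, string href)
    {
        // Empty HRefs and in-page anchors are passed through untouched.
        if (string.IsNullOrEmpty(href) || href[0] == '#')
        {
            return href;
        }
        Uri resolved;
        if (Uri.TryCreate(new Uri(pageUrl), href, out resolved))
        {
            return resolved.AbsoluteUri;
        }
        // Leave anything that cannot be resolved as it was found.
        return href;
    }
}
```

One difference worth noting: an address such as www.bbc.co.uk/news with no protocol is treated by Uri as a relative path, whereas the listing prefixes it with http://. That rule would have to be kept as a separate check if it is wanted.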

 

Let’s Have Some Feedback!

 

I am a stickler for feedback. I am one of those people who think there are few things worse in life than waiting for a computer. Assuming we have done our best to provide reasonable response time, the next best thing to do is to inform the user with status information when available. Our vulnerable area in this project is the LoadSource routine, which is dependent on the whims of the Internet Goddess. Let’s add a couple of events that we can use to pass status information back to the user. One will fire during the loop that loads the content from the StreamReader, indicating the number of characters loaded so far along with the total length. These can subsequently be used to populate the Value and Maximum properties of a Progress Bar. The second event will pass back textual status information. Here is the full LoadSource function, complete with the added events and a timeout feature, which should always be present in such a loop. Note that the content length is not always returned by the server, so a default length is used if none is available:

 

VB.Net:

 

 Public Function LoadSource(ByVal URL As String) As Boolean

     m_strURL = URL
     Const DEFAULT_CONTENT_LENGTH As Integer = 40000
     Dim strSource As String = ""
     Dim strHost As String = ""
     Dim strServerURL As String = ""
     Dim strPathURL As String = ""
     Dim strCharacterSet As String = ""
     Dim strContentEncoding As String = ""
     Dim lngContentLength As Long = 0
     Dim strContentType As String = ""
     Dim strServer As String = ""
     Dim strLastModified As String = ""
     Dim intTotalLength As Integer = 0

     If m_strURL = "" Then
         RaiseEvent LoadStatus(m_strURL, "Error")
         LoadSource = False
     Else
         Try
             RaiseEvent LoadStatus(m_strURL, "Request")
             RaiseEvent LoadProgress(m_strURL, 0, 0)
             Dim hrqURL As HttpWebRequest _
               = HttpWebRequest.Create(m_strURL)
             Dim hrspURL As HttpWebResponse = hrqURL.GetResponse()
             Dim srdrInput _
               As New StreamReader(hrspURL.GetResponseStream())
             Dim chrBuff(255) As Char
             Dim intLen As Integer

             ' Use the length reported by the server, or a default
             ' when the server does not report one
             lngContentLength = hrspURL.ContentLength
             If lngContentLength <= 0 Then
                 lngContentLength = DEFAULT_CONTENT_LENGTH
             End If
             RaiseEvent LoadStatus(m_strURL, "Load")

             Dim tmeExpire As DateTime _
               = DateAdd(DateInterval.Second, _
              m_intTimeOutSeconds, Now)

             Do
                 intLen = srdrInput.Read(chrBuff, 0, 256)
                 Dim strBuff As New String(chrBuff, 0, intLen)
                 strSource = strSource & strBuff
                 intTotalLength = intTotalLength + intLen
                 ' Grow the estimate if the page is longer than expected
                 If intTotalLength > lngContentLength Then
                     lngContentLength = 2 * intTotalLength
                 End If
                 RaiseEvent LoadProgress _
                   (m_strURL, lngContentLength, intTotalLength)
                 If DateDiff(DateInterval.Second, tmeExpire, Now) _
                   > 0 Then
                     ' Timed out: release the connection, then let the
                     ' Catch block below report the error
                     srdrInput.Close()
                     hrspURL.Close()
                     Throw New TimeoutException()
                 End If
             Loop While intLen > 0

             ' Read the response properties before closing the response
             With hrspURL
                 strHost = .ResponseUri.Host

                 ' Everything up to the last "/" is the page's folder
                 Dim m As Match _
                     = Regex.Match(.ResponseUri.AbsoluteUri, _
                       "/", RegexOptions.RightToLeft)
                 If Not m.Success Then
                     strPathURL = .ResponseUri.AbsoluteUri & "/"
                 Else
                     strPathURL _
                     = .ResponseUri.AbsoluteUri.Substring(0, m.Index) _
                        & "/"
                 End If

                 ' Everything up to the end of the host is the server URL
                 m = Regex.Match(.ResponseUri.AbsoluteUri, _
                       Regex.Escape(strHost), RegexOptions.RightToLeft _
                         Or RegexOptions.IgnoreCase)
                 If Not m.Success Then
                     strServerURL = .ResponseUri.AbsoluteUri
                 Else
                     strServerURL _
                      = .ResponseUri.AbsoluteUri.Substring _
                        (0, m.Index + strHost.Length)
                 End If

                 strCharacterSet = .CharacterSet
                 strContentEncoding = .ContentEncoding
                 lngContentLength = .ContentLength
                 strContentType = .ContentType
                 strLastModified = .LastModified.ToString()
             End With
             srdrInput.Close()
             hrspURL.Close()
             RaiseEvent LoadStatus(m_strURL, "Complete")
             LoadSource = True
         Catch
             RaiseEvent LoadStatus(m_strURL, "Error")
             LoadSource = False
         End Try
     End If

     m_strHost = strHost
     m_strServerURL = strServerURL
     m_strPathURL = strPathURL
     m_strSource = strSource
     m_strCharacterSet = strCharacterSet
     m_strContentEncoding = strContentEncoding
     m_lngContentLength = lngContentLength
     m_strContentType = strContentType
     m_strLastModified = strLastModified

     RaiseEvent LoadProgress(m_strURL, intTotalLength, _
         intTotalLength)

 End Function

 

C#:

 

public bool LoadSource(string URL)
{
  m_strURL = URL;

  const int DEFAULT_CONTENT_LENGTH = 40000;
  string strSource = "";
  string strHost = "";
  string strServerURL = "";
  string strPathURL = "";
  string strCharacterSet = "";
  string strContentEncoding = "";
  long lngContentLength = 0;
  string strContentType = "";
  string strLastModified = "";
  int intTotalLength = 0;

  if (m_strURL == "")
  {
    if (LoadStatus != null)
       LoadStatus(m_strURL, "Error");
    return(false);
  }
  else
  {
    try
    {
       if (LoadStatus != null)
           LoadStatus(m_strURL, "Request");
       if (LoadProgress != null)
           LoadProgress(m_strURL, 0, 0);
       HttpWebRequest hrqURL
           = (HttpWebRequest)
              HttpWebRequest.Create(m_strURL);
       HttpWebResponse hrspURL
           = (HttpWebResponse)
              hrqURL.GetResponse();
       StreamReader srdrInput
           = new StreamReader
              (hrspURL.GetResponseStream());
       char[] chrBuff = new char[256];
       int intLen;

       // Use the length reported by the server, or a default
       // when the server does not report one
       lngContentLength = hrspURL.ContentLength;
       if (lngContentLength <= 0)
           lngContentLength = DEFAULT_CONTENT_LENGTH;
       if (LoadStatus != null)
           LoadStatus(m_strURL, "Load");

       DateTime tmeExpire
           = DateTime.Now.AddSeconds(m_intTimeOutSeconds);
       do
       {
           intLen = srdrInput.Read(chrBuff, 0, 256);
           string strBuff = new string(chrBuff, 0, intLen);
           strSource = strSource + strBuff;
           intTotalLength = intTotalLength + intLen;
           // Grow the estimate if the page is longer than expected
           if (intTotalLength > lngContentLength)
              lngContentLength = 2 * intTotalLength;
           if (LoadProgress != null)
              LoadProgress(m_strURL, lngContentLength,
                  intTotalLength);
           if (DateTime.Compare(tmeExpire, DateTime.Now) < 0)
           {
              // Timed out: release the connection before giving up
              srdrInput.Close();
              hrspURL.Close();
              if (LoadStatus != null)
                  LoadStatus(m_strURL, "Error");
              return(false);
           }
       }
       while (intLen > 0);

       // Read the response properties before closing the response
       strHost = hrspURL.ResponseUri.Host;

       // Everything up to the last "/" is the page's folder
       Match m
           = Regex.Match(hrspURL.ResponseUri.AbsoluteUri,
               "/", RegexOptions.RightToLeft);
       if (!m.Success)
           strPathURL = hrspURL.ResponseUri.AbsoluteUri + "/";
       else
           strPathURL
               = hrspURL.ResponseUri.AbsoluteUri.Substring
                  (0, m.Index) + "/";

       // Everything up to the end of the host is the server URL
       m = Regex.Match(hrspURL.ResponseUri.AbsoluteUri,
               Regex.Escape(strHost), RegexOptions.RightToLeft
               | RegexOptions.IgnoreCase);
       if (!m.Success)
           strServerURL = hrspURL.ResponseUri.AbsoluteUri;
       else
           strServerURL
               = hrspURL.ResponseUri.AbsoluteUri.Substring
                  (0, m.Index + strHost.Length);

       strCharacterSet = hrspURL.CharacterSet;
       strContentEncoding = hrspURL.ContentEncoding;
       lngContentLength = hrspURL.ContentLength;
       strContentType = hrspURL.ContentType;
       strLastModified = hrspURL.LastModified.ToString();

       srdrInput.Close();
       hrspURL.Close();

       if (LoadStatus != null)
           LoadStatus(m_strURL, "Complete");
    }
    catch
    {
       if (LoadStatus != null)
           LoadStatus(m_strURL, "Error");
       return(false);
    }
  }
  m_strHost = strHost;
  m_strServerURL = strServerURL;
  m_strPathURL = strPathURL;
  m_strSource = strSource;
  m_strCharacterSet = strCharacterSet;
  m_strContentEncoding = strContentEncoding;
  m_lngContentLength = lngContentLength;
  m_strContentType = strContentType;
  m_strLastModified = strLastModified;

  if (LoadProgress != null)
    LoadProgress(m_strURL, intTotalLength, intTotalLength);
  return(true);
}
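The declarations of the LoadStatus and LoadProgress events are not shown in the listings. The following sketch shows one way they might be declared and consumed in C#. The delegate names and parameter names are assumptions inferred from how the events are raised above, and SimulateLoad is a hypothetical stand-in that raises the events in the same order as LoadSource without touching the network.

```csharp
using System;

// Assumed signatures, inferred from the calls in LoadSource.
public delegate void LoadStatusHandler(string url, string status);
public delegate void LoadProgressHandler(string url, long maximum, long value);

public class LoaderEventsDemo
{
    public event LoadStatusHandler LoadStatus;
    public event LoadProgressHandler LoadProgress;

    // Stand-in for LoadSource: same event order, no network access.
    public void SimulateLoad(string url, int totalLength)
    {
        if (LoadStatus != null) LoadStatus(url, "Request");
        if (LoadProgress != null) LoadProgress(url, 0, 0);
        if (LoadStatus != null) LoadStatus(url, "Load");
        // One progress report per 256-character chunk.
        for (int loaded = 256; loaded < totalLength; loaded += 256)
        {
            if (LoadProgress != null)
                LoadProgress(url, totalLength, loaded);
        }
        if (LoadStatus != null) LoadStatus(url, "Complete");
        if (LoadProgress != null)
            LoadProgress(url, totalLength, totalLength);
    }
}
```

A form would subscribe with the += operator and copy the maximum and value arguments into the Maximum and Value properties of its Progress Bar, while the status text could go to a label or status bar.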

 

Note by :yungangwu
