Screen Scraping, ViewState, and Authentication using ASP.NET

Before web services came along, screen scraping was a popular technique for grabbing the output from another application by examining the text it displays on the screen. For web applications, this meant making a request to a URL and examining the HTML the server returns. You could then parse the HTML to grab the latest news headlines or stock quotes from a news site, or the price of a book on amazon.com.

With RSS, XML, and Web Services, the need to screen scrape has diminished, but it is not extinct. In this article we will examine a few methods to grab the HTML from another URL for display in your own page.

HttpServerUtility

If the page you need to fetch is part of the current web application, you can use the Execute method on the Server object of the current page. The Server object is of type HttpServerUtility, which also includes the well-known methods Transfer and MapPath. Using Execute is straightforward:

TextWriter textWriter = new StringWriter();
Server.Execute("myOtherPage.aspx", textWriter);
Response.Output.Write(textWriter.ToString());

You can use Server.Execute to add content to frames, or devise print friendly pages. We generally would not want to write the entire contents of the resulting string into the response as we have in this sample, but instead would parse select content from myOtherPage.aspx. Of course, we are not always so lucky to have the resource we need inside of the same web application, and this is where classes from the System.Net namespace come into play.

WebClient

The WebClient class presents the simplest API possible for retrieving content from a URL, as seen below.

using(WebClient webClient = new WebClient())
{
   byte[] response = webClient.DownloadData(THEURL);
   Response.OutputStream.Write(response, 0, response.Length);   
}

We need only three lines of code, but this time instead of passing the name of an ASPX page inside of our application, we can pass the URL to a remote resource, like http://www.OdeToCode.com/default.aspx.

The next hurdle you might face is retrieving content from a web site requiring forms authentication. Forms authentication usually requires a user to enter credentials into a form and press a submit button. Pressing submit will cause the browser to perform an HTTP “POST” and send the form values, such as the username and password, in the message body to the server (for more information on GET and POST see the resource section at the bottom of the article).

As an example, consider the source code for the following login form:

<form name="Form1" method="post" action="login.aspx" id="Form1">
<P>Username
    <input name="UsernameTextBox" type="text" id="UsernameTextBox" /></P>
<P>Password
    <input name="PasswordTextBox" type="text" id="PasswordTextBox" /></P>
<P>
    <input type="submit" name="LoginButton" value="Login" id="LoginButton" /></P>
</form>

In the message body of the browser POST, the form values could appear like so:

UsernameTextBox=scott&PasswordTextBox=scott&LoginButton=Login

When this payload arrives at the server, the code will know the user entered ‘scott’ into the username textbox, ‘scott’ in the password text box, and posted the form using the Login button. We can use the WebClient class to simulate a POST for this form with the following code.

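// the form values exactly as the browser would send them
string postData = "UsernameTextBox=scott&PasswordTextBox=scott&LoginButton=Login";

WebClient webClient = new WebClient();                  
webClient.Headers.Add("Content-Type", "application/x-www-form-urlencoded");
byte[] response = webClient.UploadData(
      LOGIN_URL, "POST", Encoding.ASCII.GetBytes(postData)
   );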
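The payload in this example is safe to send as written because 'scott' contains no special characters. A real username or password might contain an ampersand or equal sign, which would corrupt the body. The following is a minimal sketch of one way to build the body from name/value pairs; the FormEncoder class and BuildPostData method are hypothetical names introduced for illustration, and Uri.EscapeDataString is used here as one reasonable choice for the percent-encoding.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class FormEncoder
{
    // Builds an application/x-www-form-urlencoded body from name/value pairs.
    // Each name and value is percent-encoded so that '&' and '=' inside user
    // input cannot be mistaken for field separators.
    public static string BuildPostData(IEnumerable<KeyValuePair<string, string>> fields)
    {
        StringBuilder body = new StringBuilder();
        foreach (KeyValuePair<string, string> field in fields)
        {
            if (body.Length > 0)
            {
                body.Append('&');
            }
            body.Append(Uri.EscapeDataString(field.Key));
            body.Append('=');
            body.Append(Uri.EscapeDataString(field.Value));
        }
        return body.ToString();
    }
}
```

The resulting string can be passed to Encoding.ASCII.GetBytes and then to UploadData in place of a hand-built postData string.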

However, trying to POST to an ASP.NET page will usually involve one more obstacle: the Viewstate. We will not be covering Viewstate in this article (see resources below), except we need to know how to correctly POST the Viewstate back to the server. ASP.NET sends Viewstate to the client in a hidden form field, and we must parse out the correct value in order to submit the login form programmatically. If we view the source for a login web form like the form above in ASP.NET, we will see the following appear just after the opening form tag:

<input type="hidden" name="__VIEWSTATE"
 value="dDwtMzg4MDA0NzA7Oz5c3QucjNFeAIFsjceZk8ndLkr4yA==" /> 

You might be asking what else might appear in a form, and what is the easiest way to see what the browser sends to the server? If you are going to do any nontrivial screen-scraping, sooner or later you will need to answer this question and debug problems. The easiest way to debug is to use a tool like Fiddler, which will show you every request and response between your machine and a web server. You can inspect the headers and message content, and watch exactly what happens when your browser performs a POST, then try to replicate the behavior programmatically.

In order to send the correct Viewstate value to the server, we will first need to request the form from the server, parse the Viewstate, and then POST the form back. Let’s try this in our next example.

byte[] response;

WebClient webClient = new WebClient();
response = webClient.DownloadData(LOGIN_URL);

string viewstate = ExtractViewState(
      Encoding.ASCII.GetString(response)
   );

string postData = String.Format(
   "__VIEWSTATE={0}&UsernameTextBox={1}&PasswordTextBox={2}&LoginButton=Login",
   viewstate, USERNAME, PASSWORD);

webClient.Headers.Add("Content-Type", "application/x-www-form-urlencoded");
response = webClient.UploadData(
        LOGIN_URL, "POST", Encoding.ASCII.GetBytes(postData)
    );

Now we have a lot more activity happening. First, we request the login form, then we parse out the Viewstate value (more on this coming up). Once we have the Viewstate, we can create a string (postData) with the form values. We have not mentioned the reason for adding the Content-Type header: it tells the server the message body contains URL-encoded form values, and the POST will not work without it. If you use the Fiddler tool, this is one of those small details you might notice as a difference between your programmatic POST and the browser POST.

We can parse out the Viewstate value with some string manipulation. First, we will find the location of the identifier __VIEWSTATE, then identify the string after the identifier and between the double quotes of the value attribute.

private string ExtractViewState(string s)
{
   string viewStateNameDelimiter = "__VIEWSTATE";
   string valueDelimiter = "value=\"";
            
   // find the hidden field, then the start of its value attribute
   int viewStateNamePosition = s.IndexOf(viewStateNameDelimiter);     
   int viewStateValuePosition = s.IndexOf(
         valueDelimiter, viewStateNamePosition
      );

   // the value runs from just after value=" up to the next double quote
   int viewStateStartPosition = viewStateValuePosition + 
                                valueDelimiter.Length;
   int viewStateEndPosition = s.IndexOf("\"", viewStateStartPosition);

   // URL encode the value so it can be placed in the POST body
   return HttpUtility.UrlEncode(
            s.Substring(
               viewStateStartPosition, 
               viewStateEndPosition - viewStateStartPosition
            )
         );  
}

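Notice the use of URL encoding. The Viewstate value is Base64 text, which can contain characters with a special meaning in form data (like the plus sign and the equal sign), and encoding keeps the server from misinterpreting them.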
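To see why the encoding step matters, here is a small sketch that does not need a web server. The class name, method names, and the sample value are hypothetical and chosen only for illustration. It uses Uri.EscapeDataString and WebUtility.UrlDecode, which behave like the HttpUtility methods for this purpose but do not require a reference to System.Web.

```csharp
using System;
using System.Net;

public static class ViewStateEncoding
{
    // Percent-encodes a raw __VIEWSTATE value for use in a POST body.
    public static string EncodeForPost(string rawViewState)
    {
        return Uri.EscapeDataString(rawViewState);
    }

    // Decodes a form value the way a server decodes URL-encoded form data:
    // percent escapes are resolved and a bare '+' is read as a space.
    public static string DecodeAsFormValue(string postedValue)
    {
        return WebUtility.UrlDecode(postedValue);
    }
}
```

Passing a value containing a plus sign straight to DecodeAsFormValue returns a string with a space where the plus sign was, which is no longer valid Base64. Passing the output of EncodeForPost to DecodeAsFormValue returns the original value unchanged.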

If you are familiar with forms authentication in ASP.NET you’ll know the runtime issues a cookie to the browser when a user has successfully authenticated themselves. On subsequent requests, the browser needs to pass along the cookie value to reach protected resources. Unfortunately, I have not found an easy way for the WebClient to work with cookie values, so we will try a more advanced API with the HttpWebRequest class.

HttpWebRequest

The code using HttpWebRequest will look a bit different from what we have seen with WebClient. HttpWebRequest uses streams to write form values into the request and read the response. We also need to add some code to handle the forms authentication cookie. This final code example will successfully log in to a website and pull the HTML from a protected resource.

private void Button5_Click(object sender, System.EventArgs e)
{
   // first, request the login form to get the viewstate value
   HttpWebRequest webRequest = WebRequest.Create(LOGIN_URL) as HttpWebRequest;         
   StreamReader responseReader = new StreamReader(
         webRequest.GetResponse().GetResponseStream()
      );
   string responseData = responseReader.ReadToEnd();         
   responseReader.Close();
   
   // extract the viewstate value and build our POST data
   string viewState = ExtractViewState(responseData);       
   string postData = 
         String.Format(
            "__VIEWSTATE={0}&UsernameTextBox={1}&PasswordTextBox={2}&LoginButton=Login",
            viewState, USERNAME, PASSWORD
         );
  
   // have a cookie container ready to receive the forms auth cookie
   CookieContainer cookies = new CookieContainer();

   // now post to the login form
   webRequest = WebRequest.Create(LOGIN_URL) as HttpWebRequest;
   webRequest.Method = "POST";
   webRequest.ContentType = "application/x-www-form-urlencoded";
   webRequest.CookieContainer = cookies;        
   
   // write the form values into the request message
   StreamWriter requestWriter = new StreamWriter(webRequest.GetRequestStream());
   requestWriter.Write(postData);
   requestWriter.Close();
   
   // we don't need the contents of the response, just the cookie it issues
   webRequest.GetResponse().Close();
   
   // now we can send our cookie along with a request for the protected page
   webRequest = WebRequest.Create(SECRET_PAGE_URL) as HttpWebRequest;
   webRequest.CookieContainer = cookies;
   responseReader = new StreamReader(webRequest.GetResponse().GetResponseStream());
   
   // and read the response
   responseData = responseReader.ReadToEnd();
   responseReader.Close();
   
   Response.Write(responseData);         
}

If you have been following along, the above code should make some sense, even though the HttpWebRequest class requires us to do a bit more work. For instance, instead of using the UploadData method of WebClient to POST and receive a response, we need to get the request stream, write into the request stream, get the response stream, and read from the response stream. Notice the use of the CookieContainer class to keep the authentication ticket alive across our requests.
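The role of the CookieContainer can be illustrated without making any requests. The sketch below stores a cookie by hand, which is roughly what happens when the login response issues one, and then asks the container which Cookie header it would attach to a later request. The class name, the example.com URLs, and the ticket value are hypothetical; .ASPXAUTH is the default name ASP.NET gives the forms authentication cookie.

```csharp
using System;
using System.Net;

public static class CookieContainerDemo
{
    // Stores a cookie as if siteUrl had issued it, then returns the Cookie
    // header the container would send along with a request to targetUrl.
    public static string HeaderFor(
        string siteUrl, string name, string value, string targetUrl)
    {
        CookieContainer cookies = new CookieContainer();

        // A path of "/" makes the cookie apply to every page on the site.
        cookies.Add(new Uri(siteUrl), new Cookie(name, value, "/"));

        return cookies.GetCookieHeader(new Uri(targetUrl));
    }
}
```

Calling HeaderFor with "http://example.com/login.aspx" as the site and "http://example.com/secret.aspx" as the target returns a header carrying the cookie, while a target on a different host returns an empty string. This is why the same container instance must be assigned to both the login request and the request for the protected page.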

Conclusion

Until every single web site on the Internet offers a web service to programmatically retrieve data, screen scraping will be around. It’s good to know a few tricks to fetch content from the web using code and classes in the .NET framework.

By K. Scott Allen

Additional Resources