Reposted from: http://blog.sina.com.cn/s/blog_414cc36d0100lkf1.html
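I have often scraped web pages with .NET and Java but never dug into how it works underneath. Today a reader sent me a URL that broke my .NET scraper: the program seemed stuck in an endless loop, requesting the page over and over without ever returning any content. The exception was "Too many automatic redirections were attempted". Googling suggested assigning a real CookieContainer object to the HttpWebRequest to hold the cookies. I tried that, but it made no difference. The original code was: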
// Assumption: cc is a class-level field, e.g. CookieContainer cc = new CookieContainer();
public string login()
{
    WebResponse myResponse = null;
    byte[] data = Encoding.UTF8.GetBytes("uid=&langx=zh-cn&mac=&ver=&JE=true&username=acbtrnn9&passwd=qqq111");
    //HttpWebRequest myRequest = WebRequest.Create("http://asd10000.com/app/member/login.php") as HttpWebRequest;
    string url;
    url = "http://asd10000.com/app/member/login.php";
    HttpWebRequest myRequest = WebRequest.Create(url) as HttpWebRequest;
    myRequest.Method = "POST";
    myRequest.ContentType = "application/x-www-form-urlencoded";
    myRequest.AllowAutoRedirect = true; // redirects are followed automatically
    myRequest.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)";
    myRequest.KeepAlive = true;
    myRequest.CookieContainer = cc; // the container starts out empty
    //myRequest.Headers.Add("Cookie", "nn1aab3b=5xPgv3TKq3uURmNsaeuycQ==");
    myRequest.Referer = "http://asd10000.com/app/member/";
    // Write the form data into the request body
    Stream newStream = myRequest.GetRequestStream();
    newStream.Write(data, 0, data.Length);
    newStream.Close();
    myResponse = myRequest.GetResponse();
    string str;
    // Read the response as a stream with an explicit encoding
    using (Stream s = myResponse.GetResponseStream())
    {
        StreamReader objReader = new StreamReader(s, Encoding.UTF8);
        str = objReader.ReadToEnd();
    }
    return str;
}
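Debugging in Firefox showed that the browser sends a cookie when it requests this page. Capturing the program's requests with Http Analyzer showed that, although a CookieContainer had been created, it held no initial cookies. The server could not find the cookie it expected, so it kept redirecting to the login page, and the program ended up in a loop.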
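When reproducing a loop like this, it can help to lower the redirect limit so the request fails quickly and then look at what the container has collected. The following is a minimal sketch, not the original author's code: the class name `RedirectLimitDemo`, the helper `CreateLimitedRequest`, and the limit of 3 are illustrative assumptions. `MaximumAutomaticRedirections` is a standard `HttpWebRequest` property whose default is 50.

```csharp
using System;
using System.Net;

public static class RedirectLimitDemo
{
    // Hypothetical helper: builds a request that gives up after a few redirects.
    public static HttpWebRequest CreateLimitedRequest(string url, CookieContainer cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.AllowAutoRedirect = true;
        request.MaximumAutomaticRedirections = 3; // illustrative value; the default is 50
        request.CookieContainer = cookies;
        return request;
    }

    public static void Main()
    {
        var cookies = new CookieContainer();
        HttpWebRequest request =
            CreateLimitedRequest("http://asd10000.com/app/member/login.php", cookies);
        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                Console.WriteLine(response.StatusCode);
            }
        }
        catch (WebException ex)
        {
            // A redirect loop (or any network failure) surfaces here.
            Console.WriteLine(ex.Status + ": " + ex.Message);
            Console.WriteLine("Cookies held by the container: " + cookies.Count);
        }
    }
}
```

A count of zero after the failure would be consistent with the diagnosis above: the server never saw the cookie it was looking for.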
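Once the cause is understood, the fix is straightforward. There are two approaches.
First, write the cookie directly into the request header: myRequest.Headers.Add("Cookie", "nn1aab3b=5xPgv3TKq3uURmNsaeuycQ==");
Second, keep using CookieContainer but seed it with the cookie beforehand, which can itself be done in either of two ways: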
cc.Add(new Cookie("nn1aab3b", "5xPgv3TKq3uURmNsaeuycQ==", "/", "asd10000.com"));
cc.SetCookies(new Uri("http://asd10000.com"), "nn1aab3b=5xPgv3TKq3uURmNsaeuycQ==");
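To check that a pre-seeded cookie would actually be sent, one can ask the container which Cookie header it would produce for the login URL. This is a minimal sketch using the cookie name and value quoted above; the class name `CookieSeedDemo` and the helper `BuildSeededContainer` are illustrative and not part of the original post.

```csharp
using System;
using System.Net;

public static class CookieSeedDemo
{
    // Hypothetical helper: returns a container seeded with the one known cookie.
    public static CookieContainer BuildSeededContainer()
    {
        var cc = new CookieContainer();
        cc.Add(new Cookie("nn1aab3b", "5xPgv3TKq3uURmNsaeuycQ==", "/", "asd10000.com"));
        return cc;
    }

    public static void Main()
    {
        CookieContainer cc = BuildSeededContainer();
        var loginUri = new Uri("http://asd10000.com/app/member/login.php");

        // Prints the Cookie header value the container would attach to a request for loginUri.
        Console.WriteLine(cc.GetCookieHeader(loginUri));
    }
}
```

If the printed header is empty, the domain or path given when seeding does not match the request URL, and the server would still see no cookie.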
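Both approaches share a drawback: you must know the cookie in advance. The value used here was copied from Firefox after logging in. What if it is not known beforehand?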
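Later I read the SDK documentation, which says that when a POST receives a response with status code 302, the request is by default repeated as a GET. On the first POST the local cookie is empty. The server's response then writes a cookie to the client, and IE issues the follow-up GET, by which time the cookie has a value, so the whole exchange completes.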
So we can mimic IE by making the two requests ourselves, which requires setting myRequest.AllowAutoRedirect = false;
The code is as follows:
// Assumption: cc is a class-level field, e.g. CookieContainer cc = new CookieContainer();
public string mylogin()
{
    HttpWebResponse myResponse = null;
    byte[] data = Encoding.UTF8.GetBytes("uid=&langx=zh-cn&mac=&ver=&JE=true&username=acbtrnn9&passwd=qqq111");
    //HttpWebRequest myRequest = WebRequest.Create("http://asd10000.com/app/member/login.php") as HttpWebRequest;
    string url;
    url = "http://asd10000.com/app/member/login.php";
    HttpWebRequest myRequest = WebRequest.Create(url) as HttpWebRequest;
    myRequest.Method = "POST";
    myRequest.ContentType = "application/x-www-form-urlencoded";
    myRequest.AllowAutoRedirect = false; // handle the redirect ourselves
    myRequest.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)";
    myRequest.KeepAlive = true;
    myRequest.CookieContainer = cc;
    //myRequest.Headers.Add("Cookie", "nn1aab3b=5xPgv3TKq3uURmNsaeuycQ==");
    myRequest.Referer = "http://asd10000.com/app/member/";
    Stream newStream = myRequest.GetRequestStream();
    newStream.Write(data, 0, data.Length);
    newStream.Close();
    myResponse = myRequest.GetResponse() as HttpWebResponse;
    if (myResponse.StatusCode == HttpStatusCode.Redirect) // 302
    {
        // Keep the cookies the server just set, then release the first response
        cc.Add(myResponse.Cookies);
        myResponse.Close();
        // Second request: same URL and form data, now with the cookie present
        url = "http://asd10000.com/app/member/login.php";
        myRequest = WebRequest.Create(url) as HttpWebRequest;
        myRequest.Method = "POST";
        myRequest.ContentType = "application/x-www-form-urlencoded";
        myRequest.AllowAutoRedirect = false;
        myRequest.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)";
        myRequest.KeepAlive = true;
        myRequest.CookieContainer = cc;
        //myRequest.Headers.Add("Cookie", "nn1aab3b=5xPgv3TKq3uURmNsaeuycQ==");
        myRequest.Referer = "http://asd10000.com/app/member/";
        Stream s = myRequest.GetRequestStream();
        s.Write(data, 0, data.Length);
        s.Close();
        myResponse = myRequest.GetResponse() as HttpWebResponse;
    }
    string str;
    // Read the response as a stream with an explicit encoding
    using (Stream s = myResponse.GetResponseStream())
    {
        StreamReader objReader = new StreamReader(s, Encoding.UTF8);
        str = objReader.ReadToEnd();
    }
    return str;
}
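The code above is highly redundant; it is only meant as an example of the request flow and the basic idea. Fully understanding everything that happens between typing a domain name into IE and the page finally rendering is quite involved: at a minimum you need to understand the HTTP protocol and the meaning of each response status code.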
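One possible way to trim the redundancy is to factor request creation into a helper and follow the redirect explicitly. The sketch below is a variant under stated assumptions rather than the author's method: it follows a 302 or 303 with a GET to the URL in the Location header (the behaviour the SDK note describes), whereas the listing above repeats the POST to the login URL. The names `LoginSketch`, `Create`, `IsRedirect`, `ResolveLocation`, and the relative path `index.php` in `Main` are all hypothetical.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class LoginSketch
{
    // True for the redirect codes this sketch follows with a GET (302 and 303).
    public static bool IsRedirect(HttpStatusCode code)
    {
        return code == HttpStatusCode.Found || code == HttpStatusCode.SeeOther;
    }

    // The Location header may be relative, so resolve it against the request URL.
    public static Uri ResolveLocation(Uri requestUri, string location)
    {
        return new Uri(requestUri, location);
    }

    // Shared request setup; the same container carries cookies across requests.
    static HttpWebRequest Create(Uri uri, string method, CookieContainer cc)
    {
        var req = (HttpWebRequest)WebRequest.Create(uri);
        req.Method = method;
        req.AllowAutoRedirect = false;
        req.CookieContainer = cc;
        return req;
    }

    public static string Login(Uri loginUri, byte[] form, CookieContainer cc)
    {
        HttpWebRequest req = Create(loginUri, "POST", cc);
        req.ContentType = "application/x-www-form-urlencoded";
        using (Stream body = req.GetRequestStream())
        {
            body.Write(form, 0, form.Length);
        }

        var resp = (HttpWebResponse)req.GetResponse();
        string location = resp.Headers["Location"];
        if (IsRedirect(resp.StatusCode) && !string.IsNullOrEmpty(location))
        {
            Uri next = ResolveLocation(loginUri, location);
            resp.Close(); // release the first response before the follow-up request
            resp = (HttpWebResponse)Create(next, "GET", cc).GetResponse();
        }

        using (resp)
        using (var reader = new StreamReader(resp.GetResponseStream(), Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Offline illustration of how a relative Location value is resolved.
        var loginUri = new Uri("http://asd10000.com/app/member/login.php");
        Console.WriteLine(ResolveLocation(loginUri, "index.php"));
    }
}
```

Because the container is attached to both requests, any cookie set by the first response is sent with the second without copying it by hand. Whether a given site expects the follow-up as a GET or as a repeated POST has to be confirmed by watching the browser's traffic, as was done with Http Analyzer above.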