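Why does the URL my crawler extracts from a page differ from the URL I see in view source for the same page? Could an expert please advise? Thanks! Details below:
Source program: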
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// The class name is a placeholder; the original post does not show it.
public class Crawler {
    public static void main(String[] args) throws IOException {
        // Send requests through the HTTP proxy.
        System.getProperties().put("proxySet", "true");
        System.getProperties().put("proxyHost", "10.158.140.91");
        System.getProperties().put("proxyPort", "80");
        connect("http://www.amazon.cn/s/ref=pd_sl_6x34at35kw_b?ie=UTF8&n=658619051&page=1");
    }

    private static void connect(String urlString) throws IOException {
        URL url = new URL(urlString);
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        int responseCode = connection.getResponseCode();
        String contentType = connection.getContentType();
        // Note: contentLength == -1 if NOT KNOWN (i.e. not returned from server)
        int contentLength = connection.getContentLength();
        URL parentUrl = null;
        // PageInfo is a class from the rest of the program (not shown in this post).
        PageInfo p = new PageInfo(url, parentUrl, contentType, contentLength, responseCode);
        // Print the fetched page line by line; the reader is closed automatically.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(connection.getInputStream()))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine);
            }
        }
    }
}
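A side note on the program above: it reads the response through an InputStreamReader without naming a charset, so the text is decoded with the platform default encoding. Below is a minimal sketch, assuming the server reports its encoding in the Content-Type header, of choosing the charset from the contentType value the program already reads. The names CharsetSniffer and charsetFrom are hypothetical, and the UTF-8 fallback is an assumption. This only affects how the bytes are decoded; it is not presented as the cause of the differing URLs.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of the original program.
final class CharsetSniffer {
    // Returns the charset named in a Content-Type value such as
    // "text/html; charset=UTF-8". Falls back to UTF-8 (an assumption)
    // when the header is missing, carries no charset parameter, or
    // names a charset this JVM does not support.
    static Charset charsetFrom(String contentType) {
        if (contentType == null) {
            return StandardCharsets.UTF_8;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            // Case-insensitive match on the "charset=" prefix.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return StandardCharsets.UTF_8;
                }
            }
        }
        return StandardCharsets.UTF_8;
    }
}
```

In connect, the reader could then be created as new InputStreamReader(connection.getInputStream(), CharsetSniffer.charsetFrom(contentType)).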
When I open the page and use view source, the link I see is:
新东方·考研英语词汇词根+联想记忆法 俞敏洪 群言出版社 (published 2008-02)
The link my program gets is: