正文:
在用Java的HttpURLConnection 来下载网页,发现访问google的网站时,会被google拒绝掉。
try
{
url = new URL(urlStr);
httpConn = (HttpURLConnection) url.openConnection();
HttpURLConnection.setFollowRedirects(true);
// logger.info(httpConn.getResponseMessage());
in = httpConn.getInputStream();
out = new FileOutputStream(new File(outPath));
chByte = in.read();
while (chByte != -1)
{
out.write(chByte);
chByte = in.read();
}
}
catch (MalformedURLException e)
{
}
}
经过一段时间的研究和查找资料,发现是由于上面的代码缺少了一些必要的信息导致,增加更加详细的属性
httpConn.setRequestMethod("GET");
httpConn.setRequestProperty("User-Agent","Mozilla/4.0 (compatible; MSIE 6.0; Windows 2000)");
完整代码如下:
public static void DownLoadPages(String urlStr, String outPath)
{
int chByte = 0;
URL url = null;
HttpURLConnection httpConn = null;
InputStream in = null;
FileOutputStream out = null;
try
{
url = new URL(urlStr);
httpConn = (HttpURLConnection) url.openConnection();
HttpURLConnection.setFollowRedirects(true);
httpConn.setRequestMethod("GET");
httpConn.setRequestProperty("User-Agent","Mozilla/4.0 (compatible; MSIE 6.0; Windows 2000)");
// logger.info(httpConn.getResponseMessage());
in = httpConn.getInputStream();
out = new FileOutputStream(new File(outPath));
chByte = in.read();
while (chByte != -1)
{
out.write(chByte);
chByte = in.read();
}
}
catch (MalformedURLException e)
{
e.printStackTrace();
}
catch (IOException e)
{
e.printStackTrace();
}
finally
{
try
{
out.close();
in.close();
httpConn.disconnect();
}
catch (Exception ex)
{
ex.printStackTrace();
}
}
}
此外,还有第二种方法可以访问Google的网站,就是用apache的一个工具HttpClient 模仿一个浏览器来访问Google
Document document = null;
HttpClient httpClient = new HttpClient();
GetMethod getMethod = new GetMethod(url);
getMethod.setFollowRedirects(true);
int statusCode = httpClient.executeMethod(getMethod);
if (statusCode == HttpStatus.SC_OK)
{
InputStream in = getMethod.getResponseBodyAsStream();
InputSource is = new InputSource(in);
DOMParser domParser = new DOMParser(); //nekoHtml 将取得的网页转换成dom
domParser.parse(is);
document = domParser.getDocument();
System.out.println(getMethod.getURI());
}
return document;
推荐使用第一种方式,使用HttpConnection 比较轻量级,速度也比第二种HttpClient 的快。
转载一些代码,使用HttpUrlConnection来模拟ie form登陆web:
关于java模拟ie form登陆web的问题
HttpURLConnection urlConn=(HttpURLConnection)(new URL(url).openConnection());
urlConn.addRequestProperty("Cookie",cookie);
urlConn.setRequestMethod("POST");
urlConn.setRequestProperty("User-Agent","Mozilla/4.0 (compatible; MSIE 6.0; Windows 2000)");
urlConn.setFollowRedirects(true);
urlConn.setDoOutput(true); // 需要向服务器写数据
urlConn.setDoInput(true); //
urlConn.setUseCaches(false); // 获得服务器最新的信息
urlConn.setAllowUserInteraction(false);
urlConn.setRequestProperty("Content-Type","application/x-www-form-urlencoded");
urlConn.setRequestProperty("Content-Language","en-US" );
urlConn.setRequestProperty("Content-Length", ""+data.length());
DataOutputStream outStream = new DataOutputStream(urlConn.getOutputStream());
outStream.writeBytes(data);
outStream.flush();
outStream.close();
cookie=urlConn.getHeaderField("Set-Cookie");
BufferedReader br=new BufferedReader(new InputStreamReader(urlConn.getInputStream(),"gb2312"));
本文出处:
http://www.blogjava.net/fisher/articles/86926.aspx