Gugugu~ (yes, this post is overdue.)
As everyone knows, HttpClient is the workhorse of Java crawlers.
In my own projects I generally use HttpClient for fetching pages, handling logins, and going through proxies, then use Jsoup, XPath, or regular expressions for parsing.
Enough talk; here is the code.
public static String getPageContent(String url) {
    // Create a client -- roughly like opening a browser
    DefaultHttpClient httpClient = new DefaultHttpClient();
    // Create a GET request -- like typing an address into the address bar
    HttpGet httpGet = new HttpGet(url);
    String content = "";
    try {
        // Execute the request -- like pressing Enter to load the page
        HttpResponse response = httpClient.execute(httpGet);
        // Inspect the returned content
        HttpEntity entity = response.getEntity();
        if (entity != null) {
            content = EntityUtils.toString(entity, "UTF-8");
            EntityUtils.consume(entity); // release the content stream
        }
    } catch (Exception e) {
        logger.error("Failed to fetch page content: " + e);
    }
    httpClient.getConnectionManager().shutdown();
    return content;
}
That is a bare-bones HttpClient fetch. It uses DefaultHttpClient, so the connection manager must be shut down manually; otherwise subsequent requests will conflict.
Of course, you can also use CloseableHttpClient httpClient = HttpClients.createDefault();, which is more convenient (DefaultHttpClient is deprecated as of HttpClient 4.3).
So is there anything wrong with the code above? No.
But also yes. Why? Because it never sets any request headers, and many sites will block such bare requests outright.
So what do we do?
We can change it to this:
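Since CloseableHttpClient implements Closeable, the manual shutdown bookkeeping can also be delegated to try-with-resources (Java 7+). A minimal sketch, assuming HttpClient 4.3+; the class and method names here are mine, not from the code above:

```java
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class FetchDemo {
    // Hypothetical helper: both the client and the response are closed
    // automatically when the try block exits, even if execute() throws.
    public static String fetch(String url) throws Exception {
        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(new HttpGet(url))) {
            return response.getEntity() != null
                    ? EntityUtils.toString(response.getEntity(), "UTF-8")
                    : "";
        }
    }
}
```

This avoids forgetting the close() call entirely, at the cost of requiring Java 7 and HttpClient 4.3 or later.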
public static String getPageContent_addHeader(String url) {
    CloseableHttpClient httpclient = HttpClients.createDefault();
    try {
        HttpGet httpget = new HttpGet(url);
        httpget.addHeader("Accept", Accept);
        httpget.addHeader("Accept-Charset", Accept_Charset);
        httpget.addHeader("Accept-Encoding", Accept_EnCoding);
        httpget.addHeader("Accept-Language", Accept_Language);
        httpget.addHeader("User-Agent", User_Agent);
        ResponseHandler<String> responseHandler = new ResponseHandler<String>() {
            public String handleResponse(final HttpResponse response)
                    throws ClientProtocolException, IOException {
                int status = response.getStatusLine().getStatusCode();
                if (status >= 200 && status < 300) {
                    HttpEntity entity = response.getEntity();
                    return entity != null ? EntityUtils.toString(entity) : null;
                } else {
                    // Note: do NOT call System.exit() here -- it would kill the
                    // JVM before this exception is ever thrown.
                    throw new ClientProtocolException("Unexpected response status: " + status);
                }
            }
        };
        return httpclient.execute(httpget, responseHandler);
    } catch (Exception e) {
        logger.error(e);
    } finally {
        try {
            httpclient.close();
        } catch (IOException e) {
            logger.error("httpclient did not close cleanly");
        }
    }
    return null;
}
The request headers added above are defined as follows:
private static String User_Agent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.160 Safari/537.22";
private static String Accept = "text/html";
private static String Accept_Charset = "utf-8";
private static String Accept_EnCoding = "gzip";
private static String Accept_Language = "en-US,en";
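Instead of repeating addHeader() on every request, HttpClient 4.3+ also lets you attach a default header set to the client itself via the builder. A minimal sketch; the values mirror the constants above, and the class name HeaderDefaults is mine:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.http.Header;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicHeader;

public class HeaderDefaults {
    // Same values as the constants above
    static final List<Header> DEFAULT_HEADERS = Arrays.<Header>asList(
            new BasicHeader("User-Agent",
                    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 "
                    + "(KHTML, like Gecko) Chrome/25.0.1364.160 Safari/537.22"),
            new BasicHeader("Accept", "text/html"),
            new BasicHeader("Accept-Language", "en-US,en"));

    // Every request sent through this client carries the headers automatically.
    public static CloseableHttpClient newClient() {
        return HttpClients.custom()
                .setDefaultHeaders(DEFAULT_HEADERS)
                .build();
    }
}
```

With this, getPageContent_addHeader() would only need the URL-specific logic, since the browser-like headers ride along on every execute() call.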