Gugugu~ (yes, this post is overdue.)
As everyone knows, HttpClient is the workhorse of Java crawlers.
In my own projects I generally use HttpClient for fetching pages, handling logins, and going through proxies, then use Jsoup, XPath, or regular expressions for parsing.
Enough talk; here is the code.
public static String getPageContent(String url) {
    // Create a client -- roughly like opening a browser
    DefaultHttpClient httpClient = new DefaultHttpClient();
    // Create a GET request -- like typing an address into the address bar
    HttpGet httpGet = new HttpGet(url);
    String content = "";
    try {
        // Execute the request -- like pressing Enter to load the page
        HttpResponse response = httpClient.execute(httpGet);
        // Inspect the returned content
        HttpEntity entity = response.getEntity();
        if (entity != null) {
            content = EntityUtils.toString(entity, "UTF-8");
            EntityUtils.consume(entity); // release the content stream
        }
    } catch (Exception e) {
        logger.error("Failed to fetch page content: " + e);
    }
    httpClient.getConnectionManager().shutdown();
    return content;
}
That is a bare-bones HttpClient fetch. It uses DefaultHttpClient, so the connection manager must be shut down manually; otherwise subsequent requests will conflict.
Of course, you can also use CloseableHttpClient httpClient = HttpClients.createDefault();, which is more convenient (DefaultHttpClient is deprecated as of HttpClient 4.3).
So is there anything wrong with the code above? No.
But also yes. Why? Because it never sets any request headers, and many sites will block such bare requests outright.
So what do we do?
We can change it to this:
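Since CloseableHttpClient implements Closeable, the manual shutdown bookkeeping can also be delegated to try-with-resources (Java 7+). A minimal sketch, assuming HttpClient 4.3+; the class and method names here are mine, not from the code above:

```java
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class FetchDemo {
    // Hypothetical helper: both the client and the response are closed
    // automatically when the try block exits, even if execute() throws.
    public static String fetch(String url) throws Exception {
        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(new HttpGet(url))) {
            return response.getEntity() != null
                    ? EntityUtils.toString(response.getEntity(), "UTF-8")
                    : "";
        }
    }
}
```

This avoids forgetting the close() call entirely, at the cost of requiring Java 7 and HttpClient 4.3 or later.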
public static String getPageContent_addHeader(String url) {
    CloseableHttpClient httpclient = HttpClients.createDefault();
    try {
        HttpGet httpget = new HttpGet(url);
        httpget.addHeader("Accept", Accept);
        httpget.addHeader("Accept-Charset", Accept_Charset);
        httpget.addHeader("Accept-Encoding", Accept_EnCoding);
        httpget.addHeader("Accept-Language", Accept_Language);
        httpget.addHeader("User-Agent", User_Agent);
        ResponseHandler<String> responseHandler = new ResponseHandler<String>() {
            public String handleResponse(final HttpResponse response)
                    throws ClientProtocolException, IOException {
                int status = response.getStatusLine().getStatusCode();
                if (status >= 200 && status < 300) {
                    HttpEntity entity = response.getEntity();
                    return entity != null ? EntityUtils.toString(entity) : null;
                } else {
                    // Note: do NOT call System.exit() here -- it would kill the
                    // JVM before this exception is ever thrown.
                    throw new ClientProtocolException("Unexpected response status: " + status);
                }
            }
        };
        return httpclient.execute(httpget, responseHandler);
    } catch (Exception e) {
        logger.error(e);
    } finally {
        try {
            httpclient.close();
        } catch (IOException e) {
            logger.error("httpclient did not close cleanly");
        }
    }
    return null;
}
The request headers added above are defined as follows:
private static String User_Agent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.160 Safari/537.22";
private static String Accept = "text/html";
private static String Accept_Charset = "utf-8";
private static String Accept_EnCoding = "gzip";
private static String Accept_Language = "en-US,en";
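Instead of repeating addHeader() on every request, HttpClient 4.3+ also lets you attach a default header set to the client itself via the builder. A minimal sketch; the values mirror the constants above, and the class name HeaderDefaults is mine:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.http.Header;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicHeader;

public class HeaderDefaults {
    // Same values as the constants above
    static final List<Header> DEFAULT_HEADERS = Arrays.<Header>asList(
            new BasicHeader("User-Agent",
                    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 "
                    + "(KHTML, like Gecko) Chrome/25.0.1364.160 Safari/537.22"),
            new BasicHeader("Accept", "text/html"),
            new BasicHeader("Accept-Language", "en-US,en"));

    // Every request sent through this client carries the headers automatically.
    public static CloseableHttpClient newClient() {
        return HttpClients.custom()
                .setDefaultHeaders(DEFAULT_HEADERS)
                .build();
    }
}
```

With this, getPageContent_addHeader() would only need the URL-specific logic, since the browser-like headers ride along on every execute() call.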