[Web Crawler] Part 1: Web Page Requests with HttpClient

Latest recommended article published 2022-02-28 10:48:11

hxj19910814

Category: Crawlers

Article link: https://blog.csdn.net/u011490072/article/details/78612799

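This crawler series walks through the core steps of writing a web crawler, in the following parts:

[Web Crawler] Part 1: Web Page Requests with HttpClient

[Crawler Series] Part 2: Parsing Pages with Jsoup

[Crawler Series] Part 3: A Multithreaded Crawler Framework

[Crawler Series] Part 4: Crawler Logging

[Crawler Series] Part 5: URL Deduplication

Web Page Requests with HttpClient

HttpClient is a subproject of Apache HttpComponents (it began under Apache Jakarta Commons) and provides an efficient client-side toolkit for the HTTP protocol. In a crawler, HttpClient plays the role of the browser: it requests a URL and returns the response containing the page data, which is then parsed with jsoup to extract the information we need.

1. Sending an HTTP GET request with HttpClient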

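    import java.io.IOException;

    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class HttpClientHello {
        public static void main(String[] args) throws IOException {
            // Create the HttpClient instance
            CloseableHttpClient httpClient = HttpClients.createDefault();
            // Create the HttpGet instance
            HttpGet httpGet = new HttpGet("http://www.tuicool.com");
            // Execute the HTTP GET request
            CloseableHttpResponse response = httpClient.execute(httpGet);
            HttpEntity entity = response.getEntity(); // the response entity
            // Read the page content, decoding it with an explicit charset
            // (use the page's actual encoding, e.g. "utf-8" or "gb2312")
            System.out.println("Page content: " + EntityUtils.toString(entity, "utf-8"));
            response.close();
            httpClient.close();
        }
    }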
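The example above hard-codes the charset. As a rough sketch of one alternative, the helper below derives the charset from a Content-Type header value and falls back to a default when none is declared. The class name `CharsetSniffer` and the method `fromContentType` are made up for illustration and are not part of HttpClient; the sketch uses only the JDK.

```java
import java.nio.charset.Charset;

public class CharsetSniffer {
    /**
     * Extracts the charset parameter from a Content-Type header value such as
     * "text/html; charset=utf-8". Returns the fallback when the header is
     * missing, has no charset parameter, or names an unknown charset.
     */
    public static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive match on the "charset=" prefix (8 characters).
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // illegal or unsupported charset name
                }
            }
        }
        return fallback;
    }
}
```

With HttpClient 4.x, the header value could come from `entity.getContentType().getValue()` (check for null first, since a response may omit the header), and the result can be passed to `EntityUtils.toString(entity, charset)`.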

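2. Simulating a browser request with HttpClient

When HttpClient sends a request directly, some sites with stricter security identify the GET request as coming from a non-browser client and refuse it. Most projects therefore have HttpClient imitate a browser by setting the key browser request headers, most importantly User-Agent.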

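    import java.io.IOException;

    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class HttpClientHello2 {
        public static void main(String[] args) throws IOException {
            // Create the HttpClient instance
            CloseableHttpClient httpClient = HttpClients.createDefault();
            // Create the HttpGet instance (the site imposes access restrictions)
            HttpGet httpGet = new HttpGet("http://www.tuicool.com");
            // Present the request as coming from a desktop Chrome browser
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36");
            // Execute the HTTP GET request
            CloseableHttpResponse response = httpClient.execute(httpGet);
            HttpEntity entity = response.getEntity(); // the response entity
            // Read the page content, decoding it with an explicit charset
            System.out.println("Page content: " + EntityUtils.toString(entity, "utf-8"));
            // Inspect the response type
            System.out.println(entity.getContentType().getValue());
            // Status code, e.g. 200 for "HTTP/1.1 200 OK"
            System.out.println(response.getStatusLine().getStatusCode());
            response.close();
            httpClient.close();
        }
    }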

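Besides targeting a URL, an HttpGet request can also carry request headers, connection settings, and other parameters. A real browser sends these with every HTTP request; open the Network panel in the browser's developer tools to see which headers a given site expects and what values they take.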
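To keep the browser-like headers in one place, they could be collected in a small map and applied to each request. This is only a sketch: the class `BrowserHeaders`, the method `defaults`, and the Accept and Accept-Language values are illustrative assumptions. Copy the real values from your own browser's Network panel.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BrowserHeaders {
    /**
     * Returns a set of headers resembling what a desktop browser sends.
     * The values are examples only; a null referer omits the Referer header.
     */
    public static Map<String, String> defaults(String referer) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                + "(KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36");
        headers.put("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        headers.put("Accept-Language", "zh-CN,zh;q=0.9,en;q=0.8");
        if (referer != null) {
            headers.put("Referer", referer);
        }
        return headers;
    }
}
```

With HttpClient, each entry can then be applied with `httpGet.setHeader(name, value)`, for example `BrowserHeaders.defaults(null).forEach(httpGet::setHeader)`.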

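From the returned response and its HttpEntity you can read the entity's content type and the response status code. The entity information is mainly used for classification: filter on Content-Type to drop response types you do not need, for example keeping only text types.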
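The content-type filter described above can be sketched in plain Java as follows. `ContentTypeFilter`, `mimeType`, `isText`, and `shouldParse` are hypothetical names, and treating only status 200 as parseable is an assumption made for this sketch.

```java
import java.util.Locale;

public class ContentTypeFilter {
    /** Returns the bare MIME type, e.g. "text/html" for "Text/HTML; charset=UTF-8". */
    public static String mimeType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semi = contentType.indexOf(';');
        String mime = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return mime.trim().toLowerCase(Locale.ROOT);
    }

    /** True for text responses such as text/html or text/plain. */
    public static boolean isText(String contentType) {
        return mimeType(contentType).startsWith("text/");
    }

    /** Keeps only successful text responses for the parsing stage. */
    public static boolean shouldParse(int statusCode, String contentType) {
        return statusCode == 200 && isText(contentType);
    }
}
```

In the example above, the two arguments would come from `response.getStatusLine().getStatusCode()` and `entity.getContentType().getValue()`.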

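3. Collecting images

Responses often include images, such as CAPTCHA images or QR codes used as login credentials. When writing a crawler, the approach to collecting images from a page is roughly as follows.

Filtering images:

Use the response's Content-Type to confirm that the request returned an image. Once an image response is identified, process it and save it.

Storage options:

1. An image server: upload collected images to a dedicated image server.

2. The project itself: save images under the project directory (for example the webapp path) or in a cache.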

    import java.io.File;
    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.commons.io.FileUtils;
    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;

    public class HttpClientHello3img {
        public static void main(String[] args) throws IOException {
            // Create the HttpClient instance
            CloseableHttpClient httpClient = HttpClients.createDefault();
            // Create the HttpGet instance, pointing at an image URL
            HttpGet httpGet = new HttpGet("http://cpro.baidustatic.com/cpro/exp/closead/img/bg_rb.png");
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36");
            // Execute the HTTP GET request
            CloseableHttpResponse response = httpClient.execute(httpGet);
            HttpEntity entity = response.getEntity(); // the response entity
            if (entity != null) {
                // Inspect the response type, e.g. image/png
                System.out.println(entity.getContentType().getValue());
                // Save the image stream to a file
                InputStream input = entity.getContent();
                FileUtils.copyInputStreamToFile(input, new File("C://111.png"));
            }
            // Status code, e.g. 200 for "HTTP/1.1 200 OK"
            System.out.println(response.getStatusLine().getStatusCode());
            response.close();
            httpClient.close();
        }
    }

This example uses FileUtils.copyInputStreamToFile from Apache Commons IO (commons-io), which writes the image stream straight to a file at the given path.
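The filter-then-save flow can also be written with the JDK alone. The sketch below picks a file extension from the Content-Type and skips anything that is not a recognized image; `ImageSaver`, `extensionFor`, `saveIfImage`, and the three supported MIME types are assumptions chosen for illustration. It uses `java.nio.file.Files.copy` in place of commons-io.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Locale;

public class ImageSaver {
    /** Maps an image Content-Type to a file extension, or null if it is not a known image type. */
    public static String extensionFor(String contentType) {
        if (contentType == null) {
            return null;
        }
        int semi = contentType.indexOf(';');
        String mime = (semi >= 0 ? contentType.substring(0, semi) : contentType)
                .trim().toLowerCase(Locale.ROOT);
        switch (mime) {
            case "image/png":  return ".png";
            case "image/jpeg": return ".jpg";
            case "image/gif":  return ".gif";
            default:           return null;
        }
    }

    /** Saves the body under dir/baseName plus extension if it is an image; returns null otherwise. */
    public static Path saveIfImage(String contentType, InputStream body, Path dir, String baseName)
            throws IOException {
        String ext = extensionFor(contentType);
        if (ext == null) {
            return null; // not an image: leave the stream untouched
        }
        Files.createDirectories(dir);
        Path target = dir.resolve(baseName + ext);
        Files.copy(body, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }
}
```

In the example above, `entity.getContentType().getValue()` and `entity.getContent()` would supply the first two arguments.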

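4. Rotating proxy IPs dynamically

High-traffic, security-conscious sites usually have their own anti-crawling measures, one of which is to blacklist IPs that visit at regular, predictable intervals. Targeted crawling of such sites therefore needs a large number of proxy IPs to avoid being blocked. Proxies are classified by how easily the target site can detect them: transparent proxies, anonymous proxies, distorting proxies (which disguise the client address), and high-anonymity proxies (the hardest to detect, designed so that the target cannot tell a proxy is in use).

When an IP is blocked, the request typically returns 403 Forbidden, and you need to switch to another IP and retry. As for managing proxy IPs: if the project has not purchased proxies in bulk, you have to scrape free proxy IPs from the web yourself and put them into first-in, first-out queues. Use n queues, for example 40 IPs per queue, and scrape more once a queue drops below 10. Taking IPs in FIFO order avoids reusing the same one repeatedly. Over time this builds up a proxy IP pool.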
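A single FIFO queue from the scheme above might look like the following sketch. `ProxyPool` and its methods are hypothetical names, proxies are stored as plain "host:port" strings, and each proxy is handed out once and then discarded, which is one possible reading of "avoid reuse".

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;

public class ProxyPool {
    private final Deque<String> queue = new ArrayDeque<>();
    private final int lowWaterMark;

    /** @param lowWaterMark refill threshold, e.g. 10 for a queue that starts with 40 proxies */
    public ProxyPool(int lowWaterMark) {
        this.lowWaterMark = lowWaterMark;
    }

    /** Appends freshly scraped proxies ("host:port") to the tail of the queue. */
    public synchronized void addAll(Collection<String> proxies) {
        queue.addAll(proxies);
    }

    /** Removes and returns the oldest proxy, or null if the pool is empty. */
    public synchronized String take() {
        return queue.pollFirst();
    }

    /** True once the queue has dropped below the threshold and more proxies should be scraped. */
    public synchronized boolean needsRefill() {
        return queue.size() < lowWaterMark;
    }

    /** A 403 response is treated as a sign that the current IP has been blocked. */
    public static boolean isBlocked(int statusCode) {
        return statusCode == 403;
    }
}
```

A crawler loop would call `take()` whenever `isBlocked(status)` is true, split the returned string into host and port for `new HttpHost(host, port)`, and trigger another scrape when `needsRefill()` reports true.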

    import java.io.IOException;

    import org.apache.http.HttpEntity;
    import org.apache.http.HttpHost;
    import org.apache.http.client.config.RequestConfig;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class HttpClientHello4Agent {
        public static void main(String[] args) throws IOException {
            // Create the HttpClient instance
            CloseableHttpClient httpClient = HttpClients.createDefault();
            // Create the HttpGet instance (the site imposes access restrictions)
            HttpGet httpGet = new HttpGet("http://www.tuicool.com");
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36");
            // Route this request through a proxy
            HttpHost proxy = new HttpHost("182.204.18.65", 8118);
            RequestConfig config = RequestConfig.custom().setProxy(proxy).build();
            httpGet.setConfig(config);
            // Execute the HTTP GET request
            CloseableHttpResponse response = httpClient.execute(httpGet);
            HttpEntity entity = response.getEntity(); // the response entity
            if (entity != null) {
                // Inspect the response type
                System.out.println(entity.getContentType().getValue());
                // Read the page content; the entity stream can only be consumed once
                System.out.println("Page content: " + EntityUtils.toString(entity, "utf-8"));
            }
            // Status code, e.g. 200 for "HTTP/1.1 200 OK"
            System.out.println(response.getStatusLine().getStatusCode());
            response.close();
            httpClient.close();
        }
    }

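5. HttpClient connect timeout and read timeout

When HttpClient executes an HTTP request, two durations matter: the time to establish the connection and the time to read the content.

1. Connect time

This is the time from when HttpClient sends the request until the connection to the target host is established. In principle, the shorter the distance and the clearer the route, the faster it is. If no connect timeout is configured, HttpClient 4.x falls back to the system default (the value in RequestConfig defaults to -1), so a request to an unreachable host can hang for a long time. It is better to set the timeout explicitly, for example to 10 s.

2. Read time

This is the time spent fetching the content once HttpClient has connected to the target server. Reading is usually fast, but a large payload or problems on the server side (slow database queries, heavy concurrency, and so on) can slow it down. Set the read timeout according to your business needs as well.

HttpClient provides the RequestConfig class for configuring parameters such as the connect timeout, the read timeout, and the proxy.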

    import java.io.IOException;

    import org.apache.http.HttpEntity;
    import org.apache.http.client.config.RequestConfig;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class HttpClientHello5Timeout {
        public static void main(String[] args) throws IOException {
            // Create the HttpClient instance
            CloseableHttpClient httpClient = HttpClients.createDefault();
            // Create the HttpGet instance (the site imposes access restrictions)
            HttpGet httpGet = new HttpGet("http://www.tuicool.com");
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36");

            RequestConfig config = RequestConfig.custom()
                    .setConnectTimeout(10000) // connect timeout: 10 s
                    .setSocketTimeout(10000)  // read timeout: 10 s
                    .build();
            httpGet.setConfig(config);

            // Execute the HTTP GET request
            CloseableHttpResponse response = httpClient.execute(httpGet);
            HttpEntity entity = response.getEntity(); // the response entity
            if (entity != null) {
                // Inspect the response type
                System.out.println(entity.getContentType().getValue());
                // Read the page content; the entity stream can only be consumed once
                System.out.println("Page content: " + EntityUtils.toString(entity, "utf-8"));
            }
            // Status code, e.g. 200 for "HTTP/1.1 200 OK"
            System.out.println(response.getStatusLine().getStatusCode());
            response.close();
            httpClient.close();
        }
    }
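To see the two phases separately without depending on a remote site, the sketch below uses plain `java.net` sockets rather than HttpClient. It connects to a local server that accepts the TCP connection but never sends data, so the connect phase succeeds and the read phase times out. `TimeoutDemo` and `probe` are illustrative names; in HttpClient the same two phases are bounded by `setConnectTimeout` and `setSocketTimeout`.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class TimeoutDemo {
    /** Connects to a silent local server and reports how the read phase ended. */
    public static String probe(int connectTimeoutMs, int readTimeoutMs) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // The server socket listens on a free port but never writes anything.
        try (ServerSocket silentServer = new ServerSocket(0, 1, loopback);
             Socket client = new Socket()) {
            // Phase 1: establish the TCP connection, bounded by the connect timeout.
            client.connect(new InetSocketAddress(loopback, silentServer.getLocalPort()),
                    connectTimeoutMs);
            // Phase 2: wait for data, bounded by the read timeout.
            client.setSoTimeout(readTimeoutMs);
            try {
                int b = client.getInputStream().read();
                return b < 0 ? "connection closed" : "data received";
            } catch (SocketTimeoutException e) {
                return "read timed out";
            }
        }
    }
}
```

Calling `TimeoutDemo.probe(1000, 200)` connects almost immediately and then gives up on the read after roughly 200 ms.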
