Writing a Crawler with HttpClient

backhorse

Category: Core Java. Tags: crawler

Article link: https://blog.csdn.net/backhorse/article/details/84230287

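HttpClient is a convenient toolkit for working with HTTP connections. It can route requests through a proxy and imitate a browser when downloading pages, which makes it a very good tool for writing crawlers.
In the 4.x API used below, HTTP access goes mainly through the HttpGet and HttpPost classes (the counterparts of GetMethod and PostMethod in the older 3.x API), which correspond to HTTP GET and HTTP POST requests.
The code for both kinds of request, followed by some usage notes, is given below: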

GET request:

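    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import org.apache.http.HttpEntity;
    import org.apache.http.HttpResponse;
    import org.apache.http.HttpStatus;
    import org.apache.http.client.ClientProtocolException;
    import org.apache.http.client.HttpClient;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.DefaultHttpClient;
    import org.apache.http.util.EntityUtils;

    // Create the HttpClient object
    HttpClient client = new DefaultHttpClient();
    // Create the HttpGet request; psURL is the URL to fetch
    HttpGet httpGet = new HttpGet(psURL);
    // Set the request headers; this matters a lot, see the notes at the end
    httpGet.setHeader("user-agent", "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101");
    httpGet.setHeader("accept", "*/*");
    httpGet.setHeader("accept-language", "zh-CN");
    // DefaultHttpClient does not decompress responses, so ask for an uncompressed body
    httpGet.setHeader("accept-encoding", "identity");

    StringBuffer strBuf = new StringBuffer();
    try {
        // Execute the HTTP GET request and obtain the response
        HttpResponse response = client.execute(httpGet);
        HttpEntity entity = response.getEntity();
        try {
            // Only process the body if the status code is 200
            if (HttpStatus.SC_OK == response.getStatusLine().getStatusCode() && entity != null) {
                BufferedReader reader = new BufferedReader(
                    new InputStreamReader(entity.getContent(), "UTF-8"));
                String line;
                // Do not gate on getContentLength(): it is negative when the length is unknown (e.g. chunked responses)
                while ((line = reader.readLine()) != null) {
                    // readLine() strips the line terminator, so put one back
                    strBuf.append(line).append('\n');
                }
            }
        } finally {
            // Release the connection, including for non-200 responses
            if (entity != null) {
                EntityUtils.consume(entity);
            }
        }
    } catch (ClientProtocolException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }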
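
The GET example decodes the body as UTF-8 unconditionally, which garbles pages served in another charset. One possible refinement, sketched here with the JDK only, is to take the charset from the response's Content-Type value and fall back to UTF-8. The class name `CharsetSniffer` and the method `charsetOf` are hypothetical names for this illustration; with HttpClient 4.x the header value would presumably come from `entity.getContentType()`, which can be null.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration; not part of HttpClient.
final class CharsetSniffer {

    /** Returns the charset named in a Content-Type value, or UTF-8 if none is usable. */
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            // A Content-Type value looks like: text/html; charset=ISO-8859-1
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Malformed or unsupported charset name: use the fallback
                        break;
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1"));
        System.out.println(charsetOf("text/html"));
    }
}
```

The returned `Charset` can be passed to `new InputStreamReader(entity.getContent(), charset)` in place of the hard-coded `"UTF-8"`. Pages that declare their charset only in an HTML meta tag are not handled by this sketch.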

POST request:

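    import java.util.ArrayList;
    import java.util.List;
    import org.apache.http.Header;
    import org.apache.http.HttpEntity;
    import org.apache.http.HttpResponse;
    import org.apache.http.HttpStatus;
    import org.apache.http.NameValuePair;
    import org.apache.http.client.entity.UrlEncodedFormEntity;
    import org.apache.http.client.methods.HttpPost;
    import org.apache.http.message.BasicNameValuePair;
    import org.apache.http.protocol.HTTP;
    import org.apache.http.util.EntityUtils;

    // Collect the form parameters
    List<NameValuePair> nvps = new ArrayList<NameValuePair>();
    nvps.add(new BasicNameValuePair("txtUserName", USERNAME));
    nvps.add(new BasicNameValuePair("txtPassword", PASSWORD));
    nvps.add(new BasicNameValuePair("ImageButton1.x", "17"));
    nvps.add(new BasicNameValuePair("ImageButton1.y", "10"));
    nvps.add(new BasicNameValuePair("__VIEWSTATE", FLAG));
    // Create the HttpPost request
    HttpPost httpPostLogin = new HttpPost(URL);
    try {
        // Attach the parameters as a URL-encoded form body
        httpPostLogin.setEntity(new UrlEncodedFormEntity(nvps, HTTP.UTF_8));
        // Execute the request and handle the response, much as in the GET case.
        // getHttpClientInstance() is a helper defined elsewhere (not shown here).
        HttpResponse httpResponseLogin = getHttpClientInstance().execute(httpPostLogin);
        int status = httpResponseLogin.getStatusLine().getStatusCode();

        if (status == HttpStatus.SC_MOVED_TEMPORARILY || status == HttpStatus.SC_OK) {
            // getHeaders() never returns null; an empty array means there is no Location header
            Header[] locationHeader = httpResponseLogin.getHeaders("location");
            if (locationHeader.length > 0) {
                // On a 302, locationHeader[0].getValue() is the redirect target
            }
        }
        // Release the connection whatever the status code was
        HttpEntity httpEntityLogin = httpResponseLogin.getEntity();
        if (httpEntityLogin != null) {
            EntityUtils.consume(httpEntityLogin);
        }
    } catch (Exception e1) {
        e1.printStackTrace();
    }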
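
To make clearer what the POST example actually sends, here is a JDK-only sketch of roughly the body that `UrlEncodedFormEntity` produces from the name/value pairs: each name and value is percent-encoded and the pairs are joined with `&`. The class name `FormBodySketch` and the sample values are made up for illustration, and HttpClient's own encoder may differ in small details.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of application/x-www-form-urlencoded encoding.
final class FormBodySketch {

    /** Encodes the fields, in iteration order, as name=value pairs joined by '&'. */
    static String encode(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> field : fields.entrySet()) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(field.getKey(), "UTF-8"))
                    .append('=')
                    .append(URLEncoder.encode(field.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to exist on every Java platform
            throw new IllegalStateException(e);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in insertion order
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("txtUserName", "some user");   // made-up sample value
        fields.put("ImageButton1.x", "17");
        fields.put("__VIEWSTATE", "/wE=");        // made-up sample value
        System.out.println(encode(fields));
    }
}
```

Values such as `__VIEWSTATE` often contain `/`, `+` and `=`, which is why they must go through the encoder rather than being concatenated into the body by hand.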

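The following are problems I ran into while writing crawlers, together with their solutions; treat them as usage notes:

1. Setting the request headers for GET requests gave me the most trouble:
1) The four request headers listed in the example above are generally enough; there is usually no need to add more.
2) Pay particular attention to the value of accept-encoding. This header lists the content codings (compression formats) the client is willing to accept, such as gzip, deflate, or identity (no compression).
utf-8 is a character set, not a content coding, so a value such as "utf-8, deflate" does not request UTF-8 text; in effect it only advertises deflate.
If you advertise gzip or deflate, the server may compress the body, and DefaultHttpClient does not decompress it automatically, so reading it directly as text yields garbage. The simplest fix is to send identity or to leave the header out altogether.
2. Whichever method you use, always call EntityUtils.consume(entity) once you have finished with the response, so that the connection is released.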
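
If a request does advertise gzip in accept-encoding, the server may return a compressed body that has to be decompressed before it is read as text. The sketch below shows one way to do that by hand with the JDK, choosing the stream wrapper from the Content-Encoding value. The class `GzipBodySketch` and its methods are hypothetical names; with HttpClient 4.x the value would presumably come from `entity.getContentEncoding()`, which can be null. Only gzip is handled here, deflate is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration; not part of HttpClient.
final class GzipBodySketch {

    /** Wraps the raw body stream according to the Content-Encoding value (may be null). */
    static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        if (contentEncoding != null && contentEncoding.trim().equalsIgnoreCase("gzip")) {
            return new GZIPInputStream(raw);
        }
        return raw; // identity, absent, or a coding this sketch does not handle
    }

    /** Decodes a complete body to text, assuming the text itself is UTF-8. */
    static String bodyToString(byte[] body, String contentEncoding) {
        try (InputStream in = decode(new ByteArrayInputStream(body), contentEncoding)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Compresses text with gzip; used here only to simulate a server response. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] compressed = gzip("<html>hello</html>");
        System.out.println(bodyToString(compressed, "gzip"));
    }
}
```

In a crawler, `decode` would wrap `entity.getContent()` before the reader is created, and the connection would still be released with `EntityUtils.consume(entity)` afterwards.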
