Httpclient写爬虫

第1部分           了解爬虫

1.1     什么事爬虫

简单通俗的理解,就是通过Http请求模拟用户在浏览器操作行为的代码。

1.2     爬虫能做什么

常用于抓数据,通过一些列的http请求,将别人网站的内容抓到自己的数据库中。

1.3     爬虫的应用场景

刚使用,大家去别地找一找吧。

第2部分           基础知识准备

第2部分      

2.1     什么事http

对于这个问题,大家觉得可能很搞笑,这么简单的问题谁不知道。然而,我更相信绝大多数人都是一知半解。

记得第一次找找工作就有考官问,http与https的区别,还有如何跟踪一个会话。当场就懵掉了,时至今日我也不能全部说的明白。

跟着前辈们学学吧:http://www.cnblogs.com/rayray/p/3729533.html

2.2     Session与cookie

上班坐地铁需要一个多小时的时间,闲来无事看看这边博客: http://blog.csdn.net/fangaoxin/article/details/6952954,感觉头脑清晰了不少,感谢这些爱写博客又这个严谨的前辈们,我们都应当向他们看起。

第3部分           环境准备

第3部分      

3.1     抓包工具fiddler4

网址:https://www.telerik.com/download/fiddler/fiddler4

安装:一路下一步即可,最好放在英文路径下。

3.2     使用fiddler4

1.      我们使用fiddler进行抓包,打开chrome浏览器的隐身模式输入电信官网。可以发现fiddler已经捕获到了我们的操作行为,而我们的浏览器行为就是一个个的http请求。

2.      Ok,我们打开一个请求,查看其详细内容,发现每个一点记录的都非常的清晰。而想写爬虫程序,这些东西都要搞明白。

例如:Headers(错误大量在这里)

      Cookie(如果涉及到权限部分,这里必须注意)

             http响应码,也是至关重要的,他决定你一个请求访问的链,如302必须自己写重定向过程。

3.      我们在看一个登陆的过程,我们可以查看登陆时需要提交的表单内容

4.      那么接下来爬虫的流程便清晰了,首先找到url,之后报文头,最后整理表单数据即可。

第4部分           Demo

Httpclient工具类:

import com.hbc.api.exception.SimpleException;
import org.apache.http.Header;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.NameValuePair;
import org.apache.http.client.HttpClient;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.entity.StringEntity;
import org.apache.http.util.EntityUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.List;

/**
 * 简单的eHttpClient工具类
 *
 * Created by zc on 2017/6/27. */
public class SimpleHttpClientUtil {

    private Logger logger = LoggerFactory.getLogger(getClass()) ;
    private HttpClient httpClient ;
    private HttpClientContext httpClientContext ;

    /**
     * 无参空构造
     */
    public SimpleHttpClientUtil(){}

    /**
     * 含参构造
     * @param client httpClient
     * @param context HttpClientContext
     */
    public SimpleHttpClientUtil(HttpClient client, HttpClientContext context){
        this.httpClient = httpClient ;
        this.httpClientContext = httpClientContext ;
    }

    /**
     * 重定向
     * @param response 重定向上游HttpResponse
     * @param client HttpClient
     * @param context HttpClientContext
     * @return HttpResponse
     */
    public HttpResponse redirectResp(HttpResponse response, HttpClient client, HttpClientContext context){
        Header header=response.getFirstHeader("Location") ;
        return httpGetResp(client,context,header.getValue()) ;
    }

    /**
     * 重定向
     * @param response 重定向上游HttpResponse
     * @param client HttpClient
     * @param context HttpClientContext
     * @return 响应码
     */
    public int redirect(HttpResponse response, HttpClient client, HttpClientContext context){
        Header header=response.getFirstHeader("Location") ;
        return httpGet(client,context,header.getValue()) ;
    }

    public int redirect(HttpResponse response){
        return redirect(response,this.httpClient,this.httpClientContext) ;
    }

    /**
     * GET 方式访问
     *
     * @param client HttpClient
     * @param context HttpClientContext
     * @param url url
     * @return 响应文本内容
     */
    public String httpGetRespTxt(HttpClient client, HttpClientContext context,String url){
        String webTxt ;
        HttpGet httpGet = new HttpGet(url) ;
        HttpResponse response ;
        httpGet.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.10240") ;//逃避反爬虫
        try{
            response = client.execute(httpGet,context) ;
            webTxt = EntityUtils.toString(response.getEntity()) ;
        }catch (Exception e){
            logger.error("GET 方式访问异常:"+url,e) ;
            throw new SimpleException(2,"Http响应:"+e.getMessage()) ;
        }finally {
            httpGet.abort() ;
        }
        return webTxt ;
    }

    /**
     * GET 方式访问
     *
     * @param client HttpClient
     * @param context HttpClientContext
     * @param url url
     * @return HttpResponse
     */
    public HttpResponse httpGetResp(HttpClient client, HttpClientContext context,String url){
        HttpGet httpGet = new HttpGet(url) ;
        HttpResponse response ;
        httpGet.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.10240") ;//逃避反爬虫
        try{
            response = client.execute(httpGet,context) ;
        }catch (Exception e){
            logger.error("GET 方式访问异常:"+url,e) ;
            throw new SimpleException(2,"Http响应:"+e.getMessage()) ;
        }finally {
            httpGet.abort() ;
        }
        return response ;
    }

    /**
     * GET 方式访问
     *
     * @param client HttpClient
     * @param context HttpClientContext
     * @param url url
     * @return 响应码
     */
    public int httpGet(HttpClient client, HttpClientContext context,String url){
        return httpGetResp(client,context,url).getStatusLine().getStatusCode() ;
    }

    /**
     * GET 方式访问
     * @param url url
     * @return 响应码
     */
    public int httpGet(String url) {
        return httpGet(this.httpClient,this.httpClientContext,url) ;
    }

    /**
     * post json请求
     * @param client HttpClient
     * @param context HttpClientContext
     * @param url url
     * @param json json
     * @return json
     */
    public String httpPostJson(HttpClient client, HttpClientContext context,String url,String json){
        HttpPost httpPost = new HttpPost(url);
        httpPost.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36") ;//逃避反爬虫
        String resultJson="" ;
        HttpResponse httpResponse = null ;
        try {
            StringEntity se = new StringEntity(json);
            se.setContentEncoding("UTF-8");
            se.setContentType("application/json");//发送json数据需要设置contentType
            httpPost.setEntity(se);
            httpResponse = client.execute(httpPost);
            if(httpResponse.getStatusLine().getStatusCode() == HttpStatus.SC_OK){
                resultJson= EntityUtils.toString(httpResponse.getEntity());// 返回json格式:
            }
        } catch (Exception e) {
            System.out.println("httpPostJson:"+httpResponse.getStatusLine());
            logger.error("POST 方式访问异常:"+url+" httpPostJson:"+httpResponse.getStatusLine(),e) ;
            e.printStackTrace();
            throw new RuntimeException(e);
        }
        return resultJson ;
    }
    /**
     * POST 方式访问
     * @param client HttpClient
     * @param context HttpClientContext
     * @param url url
     * @param nvp List<NameValuePair>
     * @return 响应文本内容
     */
    public String httpPostRespTxt(HttpClient client, HttpClientContext context,String url,List<NameValuePair> nvp){
        String webTxt;
        HttpPost httpPost = new HttpPost(url) ;
        httpPost.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36") ;//逃避反爬虫
        HttpResponse response ;
        try{
            httpPost.setEntity(new UrlEncodedFormEntity(nvp,"UTF-8")) ;
            response = client.execute(httpPost,context) ;
            webTxt = EntityUtils.toString(response.getEntity()) ;
        }catch (Exception e){
            logger.error("POST 方式访问异常:"+url,e) ;
            throw new SimpleException(2,"Http响应:"+e.getMessage()) ;
        }finally {
            httpPost.abort() ;
        }
        return webTxt ;
    }

    /**
     * POST 方式访问
     * @param client HttpClient
     * @param context HttpClientContext
     * @param url url
     * @param nvp List<NameValuePair>
     * @return HttpResponse
     */
    public HttpResponse httpPostResp(HttpClient client, HttpClientContext context,String url,List<NameValuePair> nvp){
        HttpPost httpPost = new HttpPost(url) ;
        httpPost.addHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36") ;//逃避反爬虫
        HttpResponse response ;
        try{
            httpPost.setEntity(new UrlEncodedFormEntity(nvp,"UTF-8")) ;
            response = client.execute(httpPost,context) ;
        }catch (Exception e){
            logger.error("POST 方式访问异常:"+url,e) ;
            throw new SimpleException(2,"Http响应:"+e.getMessage()) ;
        }finally {
            httpPost.abort() ;
        }
        return response ;
    }
    /**
     * POST 方式访问
     * @param client HttpClient
     * @param context HttpClientContext
     * @param url url
     * @param nvp List<NameValuePair>
     * @return 响应码
     */
    public int httpPost(HttpClient client, HttpClientContext context,String url,List<NameValuePair> nvp){
        return httpPostResp(client,context,url,nvp).getStatusLine().getStatusCode() ;
    }

    /**
     * POST 方式访问
     * @param url url
     * @param nvp List<NameValuePair>
     * @return 响应码
     */
    public int httpPost(String url,List<NameValuePair> nvp){
        return httpPost(this.httpClient,this.httpClientContext,url,nvp) ;
    }
}

登陆:

private final SimpleHttpClientUtil simpleHttpClientUtil = new SimpleHttpClientUtil() ;


    /**
     * 登陆电信:<br/>
     *  电信登陆页面是统一的<a href="http://login.189.cn/login"> http://login.189.cn/login</a>,
     * 登陆完成后重定向到各省二级页面http://www.189.cn/省名简写/<br/>
     * @param httpClient httpClient
     * @param httpClientContext httpClientContext
     * @param provinceID 省、自治区、直辖市编号
     * @param cityCode 省、自治区、直辖市英文简写
     * @param mobile 手机号码
     * @param pwd 服务密码
     * @return HttpResponse
     */
    final boolean loginDx(HttpClient httpClient,HttpClientContext httpClientContext,String provinceID,String cityCode,String mobile,String pwd){
        //提交表单
        HttpResponse response = submitLoginForm(httpClient,httpClientContext,provinceID,mobile,pwd) ;
        //登陆重定向
        boolean redirectLoginFlag = simpleRedirect(response,httpClient,httpClientContext) ;
        //重定向 到该省二级域名
        boolean redirectToProvinceFlag = redirectToProvince(httpClient,httpClientContext,cityCode) ;
        return redirectLoginFlag && redirectToProvinceFlag ;
    }
    /**
     * 提交登陆表单<br/>
     * 登陆分三步:<br/>
     *  1.提交登陆表单<br/>
     *  2.单点登录重定向<br/>
     *  3.重定向到省份二级域名<br/><hr/>
     * @param httpClient httpClient
     * @param httpClientContext httpClientContext
     * @param provinceID 省、自治区、直辖市编号
     * @param mobile 手机号码
     * @param pwd 服务密码
     * @return HttpResponse
     */
    private HttpResponse submitLoginForm(HttpClient httpClient,HttpClientContext httpClientContext,String provinceID,String mobile,String pwd){
        String url = "http://login.189.cn/login" ;
        List<NameValuePair> nvp = new ArrayList<>() ;
        nvp.add(new BasicNameValuePair("AreaCode", "")) ;
        nvp.add(new BasicNameValuePair("CityNo", "")) ;
        nvp.add(new BasicNameValuePair("Captcha", "")) ;
        nvp.add(new BasicNameValuePair("Account", mobile)) ;
        nvp.add(new BasicNameValuePair("UType", "201")) ;
        nvp.add(new BasicNameValuePair("ProvinceID", provinceID)) ;
        nvp.add(new BasicNameValuePair("RandomFlag", "0")) ;
        nvp.add(new BasicNameValuePair("Password", CryptoJsUtil.getInstance().encryptByAes(pwd))) ;
        return this.simpleHttpClientUtil.httpPostResp(httpClient,httpClientContext,url,nvp) ;
    }

    /**
     * 简单重定向
     * @param response 重定向上游HttpResponse
     * @param httpClient HttpClient
     * @param httpClientContext HttpClientContext
     * @return 重定向是否成功
     */
    private boolean simpleRedirect(HttpResponse response,HttpClient httpClient,HttpClientContext httpClientContext){
        return this.simpleHttpClientUtil.redirect(response,httpClient,httpClientContext)==200 ;
    }

    /**
     * 重定向 到 http://www.189.cn/省份
     * @param httpClient httpClient
     * @param httpClientContext httpClientContext
     * @param cityCode 省、自治区、直辖市英文简写
     * @return 重定向结果
     */
    private boolean redirectToProvince(HttpClient httpClient,HttpClientContext httpClientContext,String cityCode){
        String url = "http://www.189.cn/"+cityCode ;
        return this.simpleHttpClientUtil.httpGet(httpClient,httpClientContext,url)==200 ;
    }


其余的就好弄了,需要注意的是,如果操作分两次请求,如在获取详单时发送短信验证码,和校验短信验证码是,就简单的解决办法就是将CookieStore进行缓存。

前一布操作是进行缓存:

redisUtil.set("cookie_"+mobile,httpClientContext.getCookieStore(),60*2L) ;

下一步提取:

HttpClientBuilder builder = HttpClients.custom() ;
        HttpClient httpClient = builder.build() ;//创建client实例
        HttpClientContext httpClientContext = HttpClientContext.create() ;//创建一个上下文,cookie会自动跟踪
        CookieStore cookieStore = (CookieStore)redisUtil.get("cookie_"+mobile) ;
        httpClientContext.setCookieStore(cookieStore) ;

这样我们就可以轻松搞定cookie问题。


未完待续。。。。



评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值