Form login with HttpClient, then processing the fetched data with jsoup

酷泽


Article link: https://blog.csdn.net/weixin_43957085/article/details/107861662

This technique is shared for learning purposes only.

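Background: scrape the relevant data and load it into a database. Tech stack: HttpClient and jsoup.
Key point: reproduce, step by step, what a person does when visiting the page, and do not skip a step to jump straight to the query result. Do not forget the request headers the site expects; you can include them all at first and then remove the unnecessary ones one at a time during testing.
Goal: fully automatic scraping requires logging in to the site automatically to obtain the cookie, then parsing the data and writing it to the database. (The code in this article sets no Cookie request header because every request goes through the same HttpClient instance, which caches cookies the way a browser does. That is why the login request must come first.)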
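To make the "caches cookies the way a browser does" point concrete, here is a minimal offline sketch. It does not use Apache HttpClient (which keeps its own cookie store inside the client); it uses the JDK's `java.net.CookieManager` as a stand-in, only to show the store-then-replay behavior. The URLs, the `JSESSIONID` cookie, and the `CookieReplayDemo` class are made up for illustration, and the sketch assumes Java 9 or later.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.URI;
import java.util.List;
import java.util.Map;

final class CookieReplayDemo {

    /**
     * Feeds the Set-Cookie headers of a simulated login response into the jar,
     * then returns the Cookie header values the jar would attach to a later request.
     */
    static List<String> cookiesFor(CookieManager jar, String loginUrl,
                                   List<String> setCookieHeaders, String nextUrl) {
        try {
            // Step 1: the "login response" sets a cookie for the site
            jar.put(URI.create(loginUrl), Map.of("Set-Cookie", setCookieHeaders));
            // Step 2: a later request to the same site gets that cookie back
            Map<String, List<String>> headers = jar.get(URI.create(nextUrl), Map.of());
            return headers.getOrDefault("Cookie", List.of());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical URLs and cookie value, for illustration only
        List<String> sent = cookiesFor(new CookieManager(),
                "http://example.com/login",
                List.of("JSESSIONID=abc123; Path=/"),
                "http://example.com/data");
        System.out.println("Cookie header values for the next request: " + sent);
    }
}
```

A brand-new client (or a new jar here) starts empty, which is why every request in this article goes through one shared client instance.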

Connecting and fetching data with HttpClient

This GET request returns different query results depending on its parameters.
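Since the query result depends on the parameters, the data URL ends up being assembled from a base URL plus a set of name/value pairs. A small sketch of one way to do that with the JDK alone is below. The `QueryUrlBuilder` class, the parameter names `year` and `name`, and the example URL are my own placeholders, not the real site's. Which charset the site expects for encoded parameters is also an assumption, so it is passed in explicitly. Requires Java 10 or later for `URLEncoder.encode(String, Charset)`.

```java
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class QueryUrlBuilder {

    /** Appends the parameters to baseUrl as a URL-encoded query string, in map order. */
    static String build(String baseUrl, Map<String, String> params, Charset charset) {
        if (params.isEmpty()) {
            return baseUrl;
        }
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            query.add(URLEncoder.encode(entry.getKey(), charset)
                    + "=" + URLEncoder.encode(entry.getValue(), charset));
        }
        return baseUrl + "?" + query;
    }

    public static void main(String[] args) {
        // Hypothetical parameters; LinkedHashMap keeps them in insertion order
        Map<String, String> params = new LinkedHashMap<>();
        params.put("year", "2020");
        params.put("name", "a b");
        System.out.println(build("http://example.com/query", params, StandardCharsets.UTF_8));
    }
}
```

Looping over several parameter combinations then only means calling `build` once per combination and issuing one GET per resulting URL.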
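Note that the request headers needed to get this data include a Referer! The Referer page must have been visited once, otherwise the query returns no data.
As mentioned above, the steps have to be taken in order; requesting the data URL directly does not work. My guess is that the cookie only takes effect after the Referer URL has been requested.
I tried this in Postman: sending the cookie obtained from the login page without first visiting the Referer URL returned no data.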

So the Referer URL has to be requested once with HttpClient first.

The login request carries the account name and password, and it is a POST request.

// Timeouts: 5 s to connect, 5 s of socket inactivity
RequestConfig requestConfig = RequestConfig.custom().setSocketTimeout(5000).setConnectTimeout(5000).build();
// One client for every request, so the cookie set at login is reused automatically
CloseableHttpClient httpclient = HttpClients.createDefault();

// Login form data
String cookieUrl = "the login URL; requesting it makes the client cache the cookie";
HttpPost httpPost = new HttpPost(cookieUrl);
httpPost.setConfig(requestConfig);
List<NameValuePair> parameters = new ArrayList<>();
parameters.add(new BasicNameValuePair("func", "login"));
parameters.add(new BasicNameValuePair("UserName", "guest"));
parameters.add(new BasicNameValuePair("PassWord", "0oXjO9RdU2DIORqWYPlTcQ=="));
// Build a form-encoded entity
UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(parameters);
// Attach the entity to the POST request
httpPost.setEntity(formEntity);
CloseableHttpResponse cookiesResponse = httpclient.execute(httpPost);
// The status line only shows that the server answered; it does not prove the login succeeded
System.out.println("Login response: " + cookiesResponse.getStatusLine());
cookiesResponse.close();
// The Referer URL must be requested once
String refs = "the Referer URL; request it once up front, then send it as the Referer header of the query";
HttpGet refsGet = new HttpGet(refs);
refsGet.setConfig(requestConfig);
refsGet.setHeader("Upgrade-Insecure-Requests", "1");
CloseableHttpResponse refsGetR = httpclient.execute(refsGet);
refsGetR.close();

// This fragment runs inside a loop (hence the continue below)
String dataUrl = "the data URL; append the query parameters to it";
System.out.println(dataUrl);
HttpGet httpGet = new HttpGet(dataUrl);
httpGet.setConfig(requestConfig);
httpGet.setHeader("Upgrade-Insecure-Requests", "1");
httpGet.setHeader("Referer", refs);
String html = "";
// try-with-resources closes the response even if reading the body fails
try (CloseableHttpResponse dataResponse = httpclient.execute(httpGet)) {
    HttpEntity entity = dataResponse.getEntity();
    if (entity != null) {
        // "utf-8" works too; check the response headers to decide
        html = EntityUtils.toString(entity, "gbk");
    }
} catch (Exception e) {
    // TODO: record which query failed and when
    continue;
}
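"Check the response headers to decide" boils down to reading the `charset` parameter of the `Content-Type` header and falling back to a default when it is missing. As far as I can tell, `EntityUtils.toString(entity, "gbk")` already treats its second argument as that default and prefers a charset declared by the response, but the logic is simple enough to sketch by hand. The `ResponseCharset` class below is a hypothetical helper written with the JDK only; it is not part of HttpClient.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class ResponseCharset {

    /**
     * Returns the charset named in a Content-Type header value such as
     * "text/html; charset=gbk", or the fallback if none is declared or usable.
     */
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback instead
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=gbk", StandardCharsets.UTF_8));
        System.out.println(fromContentType("text/html", StandardCharsets.UTF_8));
    }
}
```

If the page still comes out garbled, the server may be declaring one charset and sending another, in which case hard-coding the real one is the pragmatic fix.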

Parsing the data

// The fetched data is HTML, and when printed to the console it is full of escaped characters. I had no better idea, so I just use string replacement
String sResult = html.replace("&lt;", "<").replace("&gt;", ">").replace("&#47;", "/");
Document doc = Jsoup.parse(sResult);
Elements thead_tr = doc.select("thead tr");
Elements th = thead_tr.get(0).getElementsByTag("th");
// Header names; splitting on spaces assumes that no header cell contains a space
String[] head = th.text().split(" ");

// Get the data rows
Elements align = doc.getElementsByAttribute("align");
Elements trs = align.get(0).getElementsByTag("tr");
List<TxComDischarge> txlist = new LinkedList<>();
TxComDischarge tx;
for (int i = 0; i < trs.size(); i++) {
    // Populate an object from row i and add it to txlist if it meets the requirements
}

// Finally, save the batch to the database with mybatis-plus
txComDischargeService.saveBatch(txlist);
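The string replacement step deserves a closer look, because the order of replacements matters once `&amp;` is involved. Below is a small stdlib-only sketch. The `EntityUnescape` class is hypothetical, and handling `&quot;` and `&amp;` is my own addition beyond the three entities the code above replaces. jsoup also ships `Parser.unescapeEntities(String, boolean)`, which is probably the more complete option, though I have not tested it against this site.

```java
final class EntityUnescape {

    /**
     * Restores a handful of HTML entities using literal (non-regex) replacement.
     * "&amp;" is handled last so that "&amp;lt;" becomes "&lt;" and not "<".
     */
    static String unescapeBasic(String escaped) {
        return escaped
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&#47;", "/")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        // A made-up escaped table cell, for illustration only
        System.out.println(unescapeBasic("&lt;td&gt;12.5&lt;&#47;td&gt;"));
    }
}
```

`String.replace` is used instead of `replaceAll` because the search text is a literal, so there is no need to run it through the regex engine.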

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import org.apache.http.NameValuePair;
import org.apache.http.HttpEntity;
import org.apache.http.client.CookieStore;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.6</version>
    </dependency>
    <!-- jsoup: HTML parsing -->
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.11.3</version>
    </dependency>
