Notes from my first Groovy script: cookie pitfalls, plus a solution for login captchas (original notes)

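Our project uses Groovy for web crawling, and this was my first time learning both crawling and Groovy. As a JVM language it turned out to be very close to Java, and the related material is easy to find online. Here I have collected the problems I hit in a real project and the rough approach to solving them (code included).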

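In our project we crawl customer-complaint data from 12315: we use jsoup to pull the data out of a specific HTML table page and then push it into our internal system as a list. Those steps are routine, so I won't go over them.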

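The main pitfalls are in the login flow. Login requires a username, a password and a captcha. After logging in you take the cookie, and every later request to the crawled endpoints just reuses that cookie. The username and password are already known, so the first step is to get the captcha.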


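Getting the captcha is just a GET request that returns an image file. At first I used our shared httpUtil and it failed, which was the first pitfall. Tools such as Postman automatically turn the response into a file, but a plain GET in code that reads the body as text does not give you a usable image. My solution was to read the response through IO streams and save it to disk as a file.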

Failed attempt

    static String imgCodeReload(String cookie) {
        String url = "http://consumer.12315.sh.cn/consumer/validateImgLogin";

        HttpUtil httpUtil = HttpUtil.newHttp();
        httpUtil.addHeaders("Cookie", cookie);
        httpUtil.addHeaders("Accept", "*/*");
        Map<String, String> configs = new HashMap<>();
        HttpUtil.Response response = httpUtil.get(url, configs);
        // The body comes back as a String here, not as an image file that can be passed to OCR
        String content = response.content;
        return content;
    }
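HttpUtil is internal, so the exact cause of the failure is an assumption on my part, but one common reason a text-based helper cannot return an image is that binary bytes do not survive being decoded into a String. The plain-Java sketch below (callable from Groovy; the class and method names are made up for this illustration) decodes the first bytes of a typical JPEG as UTF-8 and encodes them back, then checks whether the bytes are unchanged.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class BinaryAsStringDemo {
    /** Decodes raw bytes as UTF-8 text and encodes them back, as a String-based HTTP helper might. */
    static byte[] roundTripThroughString(byte[] raw) {
        String asText = new String(raw, StandardCharsets.UTF_8);
        return asText.getBytes(StandardCharsets.UTF_8);
    }

    /** True if the bytes are identical after the String round trip. */
    static boolean survivesRoundTrip(byte[] raw) {
        return Arrays.equals(raw, roundTripThroughString(raw));
    }

    /** First four bytes of a typical JPEG file (FF D8 FF E0). */
    static byte[] jpegHeader() {
        return new byte[] {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};
    }
}
```

Calling `BinaryAsStringDemo.survivesRoundTrip(BinaryAsStringDemo.jpegHeader())` returns false, because bytes that are not valid UTF-8 are replaced during decoding. Plain ASCII text survives the same round trip. This is why the image is handled as a stream of bytes in the rest of these notes.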

Working version: fetch the image as a file

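    /**
     * Download the captcha image and return the recognized code.
     * @param cookie cookie sent with the image request
     * @param url    URL of the captcha image
     * @return the OCR result (expected: 4 digits); "" if the image could not be fetched
     */
    static String downloadFileAndGetCode(String cookie, String url) {
        File file = getImgPath()
        String ocrText = ""
        try {
            // The image must be fetched with the cookie, otherwise the login is rejected
            InputStream input = getInputStream(cookie, url)
            if (input == null) return ""
            // withStream flushes and closes the buffered stream (and the file stream under it)
            new BufferedOutputStream(new FileOutputStream(file)).withStream { out -> out << input }
            input.close()
            int count = 0
            // Retry up to three times: OCR sometimes misreads, e.g. "2313." or "23o1"
            while ((!isNumeric(ocrText) || ocrText.length() != 4) && count++ < 3) {
                ocrText = CommonUtils.getImgOcrText("https://xxxxxxx.com/fileUpload/singleUpload", file, "eng")
                ocrText = ocrText.replaceAll("[\r\n]", "")
                System.out.println(ocrText)
            }
        } finally {
            // Always remove the temporary image, even if an exception was thrown
            file.delete()
        }
        return ocrText
    }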
    
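    static File getImgPath() {
        // ".jpg" is a relative path used as the directory for temporary captcha images
        File dirFile = new File(".jpg");
        if (!dirFile.exists()) {
            dirFile.mkdirs();
        }
        // Each image gets a unique file name inside that directory
        return new File(dirFile, CommonUtils.createId() + ".jpg");
    }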
    
    static boolean isNumeric(String str) {
        if (str == null || str == "") return false
        // Requires: import java.util.regex.Pattern
        Pattern pattern = Pattern.compile("[0-9]+");
        return pattern.matcher(str).matches();
    }

    public static InputStream getInputStream(String cookie, String urlReq) {
        InputStream inputStream = null;
        try {
            URL url = new URL(urlReq);
            HttpURLConnection httpURLConnection = (HttpURLConnection) url.openConnection();
            httpURLConnection.setConnectTimeout(3000); // connect timeout: 3 seconds
            httpURLConnection.setDoInput(true);        // we read the response body
            httpURLConnection.setRequestMethod("GET");
            // Send the cookie as a request header. Writing it to getOutputStream() would put it
            // in the request body, and the JDK's HttpURLConnection then switches the GET to a POST.
            if (cookie) {
                httpURLConnection.setRequestProperty("Cookie", cookie);
            }

            int responseCode = httpURLConnection.getResponseCode();
            if (responseCode == 200) { // 200 means success
                inputStream = httpURLConnection.getInputStream();
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        // null if the request failed or did not return 200
        return inputStream
    }

CommonUtils.getImgOcrText() is our image-recognition (OCR) SDK, which I can't show here.

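This step simply saves the file to disk: a directory is created from a relative path, and the image goes inside it.

    static File getImgPath() {
        // ".jpg" is a relative path used as the directory for temporary captcha images
        File dirFile = new File(".jpg");
        if (!dirFile.exists()) {
            dirFile.mkdirs();
        }
        // Each image gets a unique file name inside that directory
        return new File(dirFile, CommonUtils.createId() + ".jpg");
    }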

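This step does the validation. The 12315 captcha is simple: four digits only. Because OCR can misread the image, the result is checked and the recognition is retried when it does not look like a valid code. The exact handling should be adapted to your own OCR service.

    static boolean isNumeric(String str) {
        if (str == null || str == "") return false
        // Requires: import java.util.regex.Pattern
        Pattern pattern = Pattern.compile("[0-9]+");
        return pattern.matcher(str).matches();
    }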
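The validate-and-retry idea can also be written as a small standalone helper. The sketch below is plain Java (callable from Groovy) and is not the code we run in the project: the class name, the method names and the `Supplier` standing in for the OCR call are all assumptions made for illustration.

```java
import java.util.function.Supplier;

class CaptchaRetry {
    /** Strips whitespace, including the line breaks that OCR output often ends with. */
    static String clean(String raw) {
        return raw == null ? "" : raw.replaceAll("\\s", "");
    }

    /** A valid code in this scenario is exactly four ASCII digits. */
    static boolean isValid(String code) {
        return code != null && code.matches("[0-9]{4}");
    }

    /**
     * Calls the OCR supplier up to maxAttempts times and returns the first
     * valid code, or an empty string if every attempt was rejected.
     */
    static String recognize(Supplier<String> ocr, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            String code = clean(ocr.get());
            if (isValid(code)) {
                return code;
            }
        }
        return "";
    }
}
```

One design difference from the loop in downloadFileAndGetCode: this helper returns an empty string when all attempts fail, instead of the last misread text, so the caller can tell a failure apart from a real code.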

This step is the key one. To fetch the file with the cookie attached, open an HttpURLConnection, set the cookie as the Cookie request header, and read the response as an input stream. Note that the cookie belongs in the header: bytes written to the connection's output stream end up in the request body instead.

    public static InputStream getInputStream(String cookie, String urlReq) {
        InputStream inputStream = null;
        try {
            URL url = new URL(urlReq);
            HttpURLConnection httpURLConnection = (HttpURLConnection) url.openConnection();
            httpURLConnection.setConnectTimeout(3000); // connect timeout: 3 seconds
            httpURLConnection.setDoInput(true);        // we read the response body
            httpURLConnection.setRequestMethod("GET");
            // Send the cookie as a request header. Writing it to getOutputStream() would put it
            // in the request body, and the JDK's HttpURLConnection then switches the GET to a POST.
            if (cookie) {
                httpURLConnection.setRequestProperty("Cookie", cookie);
            }

            int responseCode = httpURLConnection.getResponseCode();
            if (responseCode == 200) { // 200 means success
                inputStream = httpURLConnection.getInputStream();
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        // null if the request failed or did not return 200
        return inputStream
    }
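For comparison, here is a self-contained plain-Java sketch (callable from Groovy) of the same idea: send the cookie in the Cookie request header and stream the response body straight to a file. The class name, method name and timeout values are assumptions for this illustration, not part of our project code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class CookieDownload {
    /**
     * Downloads url to target, sending cookie in the Cookie request header.
     * Returns the HTTP status code; the file is only written on 200.
     */
    static int download(String url, String cookie, Path target) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(3000);
        conn.setReadTimeout(3000);
        conn.setRequestMethod("GET");
        if (cookie != null && !cookie.isEmpty()) {
            // The cookie travels as a header, so the request stays a GET with no body
            conn.setRequestProperty("Cookie", cookie);
        }
        try {
            int status = conn.getResponseCode();
            if (status == HttpURLConnection.HTTP_OK) {
                // Copy the raw bytes to disk without ever turning them into a String
                try (InputStream in = conn.getInputStream()) {
                    Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
                }
            }
            return status;
        } finally {
            conn.disconnect();
        }
    }
}
```

A call would look like `CookieDownload.download(captchaUrl, cookie, Paths.get("captcha.jpg"))`, where `captchaUrl` and the file name are placeholders.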

The returned stream is simply piped into the BufferedOutputStream. Note that the file saved to disk must be deleted at the end.

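Once the file has been saved, it can be passed to the image-recognition SDK. Be aware that if the code throws an exception after the file is written, and file.delete() is not in a finally block, the file stays on disk and the delete is never reached. Don't let images pile up inside the application: make sure file.delete() always runs.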
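A minimal plain-Java sketch of that cleanup rule (callable from Groovy): the temporary image is deleted in a finally block, so it disappears whether the OCR step returns or throws. `TempImage`, `FileTask` and `withTempImage` are hypothetical names for this illustration, and the system temp directory is used here instead of the relative directory from getImgPath().

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class TempImage {
    /** Work to perform on the temporary file, for example an OCR call; may throw. */
    interface FileTask<T> {
        T apply(Path file) throws IOException;
    }

    /**
     * Writes bytes to a fresh temporary file, runs the task on it, and deletes
     * the file afterwards whether the task returned or threw.
     */
    static <T> T withTempImage(byte[] bytes, FileTask<T> task) throws IOException {
        Path file = Files.createTempFile("captcha-", ".jpg");
        try {
            Files.write(file, bytes);
            return task.apply(file);
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Usage would be along the lines of `TempImage.withTempImage(imageBytes, f -> ocr(f))`, where `ocr` stands for whatever recognition call is available.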

The calling code upstream: start the login

    static String loginOnAndGetCookie(String cookie, int i) {
        String url = "http://consumer.12315.sh.cn/consumer/logon";
        Map<String, String> args = new HashMap<>()
        args.put("organId", "organId");
        args.put("loginN", "user");
        args.put("loginP", "password");
        // The captcha is fetched with the same cookie that is sent with the login request
        args.put("imgcode", downloadFileAndGetCode(cookie, "http://consumer.12315.sh.cn/consumer/validateImgLogin"));

        HttpUtil httpUtil = HttpUtil.newHttp();
        httpUtil.addHeaders("Cookie", cookie);
        Map<String, String> configs = new HashMap<>();
        HttpUtil.Response response = httpUtil.post(url, args, configs);
        System.out.println(response.content)
        Map<String, String> headers = response.headers;
        // Safe navigation: the response may not carry a Set-Cookie header at all
        String cookieReturn = headers.get("Set-Cookie")?.toString();
        if (cookieReturn == null) return ""
        // Keep only the leading "name=value" pair and drop attributes such as Path
        cookieReturn = cookieReturn.split("; ")[0];
        System.out.println(cookieReturn)
        if ("rememberMe=deleteMe".equals(cookieReturn)) {
            // Second pass: the cookie we sent is now the logged-in cookie
            return cookie
        } else if (cookieReturn.contains("consumer_sh=") && i == 0) {
            // First pass: the server issued a cookie, so log in once more with it
            return loginOnAndGetCookie(cookieReturn, i + 1)
        }
        return ""
    }

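The recursion runs twice. The first call obtains a cookie, which is then used both to fetch the captcha image and to log in. What the second call ends up with is the logged-in cookie. Which case we are in is decided from the response headers, and that part is specific to this site.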
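The header handling boils down to taking the leading name=value pair of a Set-Cookie value and ignoring the attributes after it. A small plain-Java sketch of just that part follows (callable from Groovy). The class and method names are invented for the illustration, and the sample header values in the usage note are made up.

```java
class SetCookieParser {
    /**
     * Returns the leading "name=value" pair of a Set-Cookie header value,
     * dropping attributes such as Path or HttpOnly. Returns "" for null input.
     */
    static String nameValue(String setCookie) {
        if (setCookie == null) {
            return "";
        }
        int semicolon = setCookie.indexOf(';');
        String pair = semicolon < 0 ? setCookie : setCookie.substring(0, semicolon);
        return pair.trim();
    }

    /** Cookie name, i.e. the text before the first '=' of the pair. */
    static String name(String setCookie) {
        String pair = nameValue(setCookie);
        int eq = pair.indexOf('=');
        return eq < 0 ? pair : pair.substring(0, eq);
    }
}
```

For example, `SetCookieParser.nameValue("consumer_sh=abc123; Path=/; HttpOnly")` gives `consumer_sh=abc123`, and `SetCookieParser.name(...)` on the same input gives `consumer_sh`, which is the kind of check the login code uses to decide whether to recurse.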

The last method upstream is the crawl method itself, which closes the loop.

// Imports needed at the top of the script: java.text.SimpleDateFormat, org.jsoup.Jsoup,
// org.jsoup.nodes.Document, org.jsoup.select.Elements. JSON is presumably fastjson's com.alibaba.fastjson.JSON.
static String process(String url, Map<String, String> args, CrawlerTask crawlerTask, org.dom4j.Document context, SpiderClient client) {
    //args.put(HttpUtil.USER_AGENT,"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36");
    // The query URL is hard-coded, so the url argument is overridden here
    url = "http://consumer.12315.sh.cn/consumer/accuse/queryResult";
    SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd");
    String endTime = df.format(new Date());

    // Fetch three days of data so that nothing is missed if the job is down for a day
    Calendar calendar = Calendar.getInstance();
    calendar.add(Calendar.DAY_OF_MONTH, -3);
    String startTime = df.format(calendar.getTime());

    args.put("currentPage", "1");
    args.put("cnnAppCondition.startDate", startTime);
    args.put("cnnAppCondition.endDate", endTime);

    def logList = [];
    def rsList = [];
    HttpUtil httpUtil = HttpUtil.newHttp();
    // Log in first; an empty string means no logged-in cookie could be obtained
    String newCookie = loginOnAndGetCookie("", 0)
    if (newCookie == "") logList.add("cookie get fail!!!")
    httpUtil.addHeaders("Cookie", newCookie);
    Map<String, String> configs = new HashMap<>();

    String content = httpUtil.post(url, args, configs).content;
    Document doc = Jsoup.parse(content, url);

    def element = doc.select("table").first();
    System.out.println(element);
    if (element != null) {
        def els = element.select("tr");
        // Start at 1 to skip row 0 (presumably the header row)
        for (int i = 1; i < els.size(); i++) {
            Elements ele = els.get(i).select("td");
            // Skip rows that do not have all the columns read below
            if (ele.size() < 6) continue
            rsList.add(["numNo"   : ele.get(1).text()
                        , "date"    : ele.get(2).text()
                        , "name"    : ele.get(3).text()
                        , "category": ele.get(5).text()]);
        }
    } else {
        logList.add("element null!!!")
    }
    return JSON.toJSONString(["rs": rsList, "log": logList]);
}
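The three-day query window can also be computed with java.time, which avoids the Calendar and SimpleDateFormat juggling. This is a plain-Java sketch under that assumption (callable from Groovy), not the code from our job. `QueryWindow` and `dateArgs` are hypothetical names, while the two parameter names are the ones used in process().

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Map;

class QueryWindow {
    private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyy-MM-dd");

    /**
     * Builds the start/end date parameters for a window that ends on `today`
     * and starts `daysBack` days earlier.
     */
    static Map<String, String> dateArgs(LocalDate today, int daysBack) {
        Map<String, String> args = new LinkedHashMap<>();
        args.put("cnnAppCondition.startDate", today.minusDays(daysBack).format(FMT));
        args.put("cnnAppCondition.endDate", today.format(FMT));
        return args;
    }
}
```

In the job this would be called as `QueryWindow.dateArgs(LocalDate.now(), 3)`. Passing the date in as a parameter also makes the window easy to check for a fixed day, for example across a month boundary.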