Automated Login and Scraping of the China Meteorological Data Network with selenium

Five?seven

Category: Python. Tags: python, opencv, window

Article link: https://blog.csdn.net/weixin_45953214/article/details/112834890

Notes on the main difficulties I ran into, plus some useful references

url = 'http://data.cma.cn'

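Difficulties:
(1) The CAPTCHA is recognized with the pytesseract library, but the accuracy is too low. How can it be improved?
(2) When a wrong CAPTCHA is submitted, an alert pops up. How can that state be detected so the retry loop can end?
(3) Logging in with cookies
(4) The site requires clicking JavaScript-driven elements, which is why I chose selenium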
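For difficulty (3), one common pattern is to save the cookies after a successful login and load them back on later runs. The following is only a minimal sketch: the helper names `save_cookies` and `load_cookies` and the file name `cookies.json` are hypothetical, and it assumes the site accepts a restored session. `get_cookies()` and `add_cookie()` are the standard selenium WebDriver methods.

```python
import json


def save_cookies(browser, path="cookies.json"):
    """Dump the current session's cookies to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(browser.get_cookies(), f)


def load_cookies(browser, path="cookies.json"):
    """Add previously saved cookies to the browser and return how many were added.

    The browser should already be on a page of the cookies' domain,
    otherwise selenium rejects them.
    """
    with open(path, encoding="utf-8") as f:
        cookies = json.load(f)
    for cookie in cookies:
        # Some driver versions reject a non-integer expiry, so normalize it.
        if "expiry" in cookie:
            cookie["expiry"] = int(cookie["expiry"])
        browser.add_cookie(cookie)
    return len(cookies)
```

A typical flow would be: log in once and call `save_cookies(browser)`; on later runs call `browser.get(url)`, then `load_cookies(browser)`, then `browser.refresh()`.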

All libraries used

import sys
import time
import json
from selenium import webdriver
from PIL import Image
import pytesseract
import re
from selenium.common.exceptions import NoAlertPresentException
import csv

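Recognizing the CAPTCHA with pytesseract

1. tesseract needs to be added to the PATH environment variable. If you would rather not do that, you can set the paths directly in code: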

pytesseract.pytesseract.tesseract_cmd = 'c://Program Files (x86)//Tesseract-OCR//tesseract'
testdata_dir_config = '--tessdata-dir "C://Program Files (x86)/Tesseract-OCR/tessdata"'

2. Before recognition, preprocess the image: binarization, noise removal, and so on
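The binarization step depends heavily on the threshold. As an alternative to a hand-tuned constant, the threshold could be derived from the image histogram with Otsu's method. This is not what the code in this post does (it uses a fixed value); it is only a sketch, and the function name `otsu_threshold` is hypothetical. It follows the same convention as the binarization here: pixels below the threshold become black.

```python
def otsu_threshold(hist):
    """Return the threshold t that best separates values < t from values >= t.

    hist is a list of pixel counts per gray level (256 entries for an 8-bit image).
    """
    total = sum(hist)
    total_sum = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value < t
    sum0 = 0  # sum of the gray levels of those pixels
    for t in range(1, len(hist)):
        w0 += hist[t - 1]
        sum0 += (t - 1) * hist[t - 1]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, nothing to separate
        mu0 = sum0 / w0
        mu1 = (total_sum - sum0) / w1
        # Between-class variance (up to a constant factor)
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t
```

With Pillow, a grayscale image `img` (mode "L") could be binarized as `t = otsu_threshold(img.histogram())` followed by `img.point(lambda p: 255 if p >= t else 0)`.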

3. Training it yourself with tesseract-ocr

Reference: tesseract-ocr usage and training methods

4. Full code for the CAPTCHA recognition part

# Recognize the CAPTCHA
class img_code(object):
    def __init__(self, browser):
        self.browser = browser

    def run(self):
        self.get_img(self.browser)
        image = Image.open('code.png')
        img = image.convert("L")  # convert to grayscale
        out = self.processing_image(img, 69)
        out = self.cut_noise(out)
        pytesseract.pytesseract.tesseract_cmd = 'c://Program Files (x86)//Tesseract-OCR//tesseract'
        testdata_dir_config = '--tessdata-dir "C://Program Files (x86)/Tesseract-OCR/tessdata"'
        code0 = pytesseract.image_to_string(out, lang='font', config=testdata_dir_config)
        # Keep only word characters (letters, digits, underscore)
        code = ''.join(re.findall(r'\w+', code0))
        self.code = code
        return code


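    def get_img(self, browser):
        # Screenshot only the CAPTCHA element. 'xpath' is the value of By.XPATH;
        # the find_element_by_* helpers were removed in Selenium 4.
        img_code = browser.find_element('xpath', '//*[@id="yw0"]')
        img_code.screenshot('code.png')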

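    def processing_image(self, image, threshold):
        # Binarize the grayscale image in place
        pixdata = image.load()
        w, h = image.size
        # threshold = 160  # no single threshold suits every CAPTCHA; tune it to the images at hand
        # Visit every pixel: below the threshold becomes black (0), everything else white (255)
        for y in range(h):
            for x in range(w):
                if pixdata[x, y] < threshold:
                    pixdata[x, y] = 0
                else:
                    pixdata[x, y] = 255
        return image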

    def cut_noise(self, image):
        width, height = image.size  # image.size is (width, height)
        change_pos = []  # positions of noise pixels

        # Visit every pixel except those on the border
        for i in range(1, width - 1):
            for j in range(1, height - 1):
                # Count the black pixels in the 3x3 neighborhood centered on (i, j)
                black_count = 0
                for m in range(i - 1, i + 2):
                    for n in range(j - 1, j + 2):
                        # After binarization in mode "L", 0 is black and 255 is white
                        if image.getpixel((m, n)) == 0:
                            black_count += 1

                # At most 4 black pixels in the neighborhood: treat the point as noise
                if black_count <= 4:
                    change_pos.append((i, j))

        # Set the noise pixels to white (255)
        for pos in change_pos:
            image.putpixel(pos, 255)

        return image  # return the cleaned image

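Detecting the alert state

I ran into two kinds of alert. The first pops up on the current page at login. The second appears after opening a new page, while that page is still stuck loading.

The first kind of alert

Whether an alert is present is determined by whether switching to it raises an exception.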

    def alert_is_present(self,browser):
        """Returns whether an alert is present"""
        try:
            browser.switch_to.alert
            return True
        except NoAlertPresentException:
            return False
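
A check like this can drive the retry loop from difficulty (2). The sketch below is only one possible shape for it, under the assumption that no alert after submitting means the CAPTCHA was accepted. The function `login_with_retries` and its callback parameters are hypothetical names; the callbacks keep the loop independent of any particular page layout.

```python
import time


def login_with_retries(submit_login, alert_is_present, dismiss_alert,
                       max_attempts=10, wait_seconds=1.0):
    """Retry the login until no alert appears or the attempts run out.

    Returns the 1-based number of the attempt that succeeded, or None.
    """
    for attempt in range(1, max_attempts + 1):
        submit_login()            # fill in the form with a fresh CAPTCHA guess
        time.sleep(wait_seconds)  # give the page time to raise the alert
        if not alert_is_present():
            return attempt        # no alert: assume the CAPTCHA was accepted
        dismiss_alert()           # close the alert before the next attempt
    return None
```

With selenium, `alert_is_present` could wrap the method above, and `dismiss_alert` could be `lambda: browser.switch_to.alert.accept()`.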

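The second kind of alert

I thought about this one for a long time and nothing I tried worked. In the end I found that switching windows closes the alert automatically.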

try:
    browser.switch_to.window(browser.window_handles[0])
    browser
