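A while ago I was scraping data from a website that required a simulated login: a username, a password, and a very simple captcha. The captcha image you get by requesting it directly is not the same as the one shown on the page, so to get past it I had to take a screenshot of the page with Selenium, read the captcha from that screenshot, and then fill in the username, password, and captcha to log in.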
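The screenshot-then-crop idea described above can be tried without a browser or an OCR engine. What follows is only a minimal sketch: the synthetic "screenshot", the coordinates, and the helper names `element_box` and `binarize` are assumptions made for illustration, and the dicts passed in merely imitate what Selenium exposes as `element.location` and `element.size`. Only the threshold value of 150 is taken from this post's code.

```python
from PIL import Image, ImageDraw


def element_box(location, size):
    """Return the (left, top, right, bottom) crop box for an element.

    `location` and `size` imitate Selenium's element.location / element.size.
    """
    left = int(location["x"])
    top = int(location["y"])
    right = int(location["x"] + size["width"])
    bottom = int(location["y"] + size["height"])
    return left, top, right, bottom


def binarize(image, threshold=150):
    """Convert to grayscale, then map pixels below `threshold` to black."""
    table = [0 if value < threshold else 1 for value in range(256)]
    return image.convert("L").point(table, "1")


# Synthetic stand-in for a page screenshot (all values are made up):
# a white page with a light-gray captcha area containing a dark stroke.
screenshot = Image.new("RGB", (400, 300), "white")
draw = ImageDraw.Draw(screenshot)
draw.rectangle([100, 50, 179, 79], fill=(200, 200, 200))  # captcha background
draw.rectangle([110, 60, 120, 70], fill=(30, 30, 30))     # dark glyph stroke

box = element_box({"x": 100, "y": 50}, {"width": 80, "height": 30})
captcha = screenshot.crop(box)   # cut the captcha out of the screenshot
clean = binarize(captcha)        # black-and-white image, easier for OCR

print(box)         # crop box in screenshot pixels
print(clean.size)  # matches the element size
```

After binarization the light background ends up white and the dark stroke black, which is the kind of input an OCR engine such as Tesseract handles best.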
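1. First use Selenium to take a screenshot of the login page, then locate the captcha element, crop it out of the screenshot, and run OCR on the cropped image. The code is as follows: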
    # -*- coding: utf-8 -*-
    from PIL import Image, ImageEnhance
    import pytesseract
    from selenium import webdriver
    from selenium.webdriver.common.by import By


    def get_image(driver):
        driver.set_window_size(1400, 900)
        # Take a screenshot of the whole login page
        driver.get_screenshot_as_file('./1.png')
        # Get the position and size of the captcha element
        element = driver.find_element(By.ID, 'codePic')
        left = int(element.location['x'])
        top = int(element.location['y'])
        right = int(element.location['x'] + element.size['width'])
        bottom = int(element.location['y'] + element.size['height'])
        print(left, top, right, bottom)
        # Crop the captcha out of the screenshot with PIL
        im = Image.open('./1.png')
        im = im.crop((left, top, right, bottom))
        filename = "2.png"
        im.save(filename)
        return filename


    # Lookup table for binarization: below the threshold -> 0 (black), otherwise 1 (white)
    threshold = 150
    table = []
    for i in range(256):
        if i < threshold:
            table.append(0)
        else:
            table.append(1)


    def getverify1(name):
        im = Image.open(name)
        # Convert to grayscale and keep a copy for inspection
        imgry = im.convert('L')
        imgry.save('g' + name)
        # Binarize with the lookup table
        out = imgry.point(table, '1')
        out.save('b' + name)
        # tessdata directory of a Windows Tesseract install; adjust for your machine
        string = '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'
        im = Image.open('b' + name)
        # enhancer = ImageEnhance.Contrast(im)
        # im = enhancer.enhance(6)
        text = pytesseract.image_to_string(im, config=string)
        # Remove surrounding whitespace and newlines from the OCR output
        text = text.strip()
        text = text.upper()
        return text


    def main(driver):
        im = get_image(driver)
        code = getverify1(im)
        print('-----', code)
        return code

2. Simulate the login:
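    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By


    def login():
        driver = webdriver.Chrome()
        driver.get("url")
        time.sleep(10)
        admin = driver.find_element(By.ID, "j_username")
        root = driver.find_element(By.ID, "j_password_show")
        captch = driver.find_element(By.ID, "j_validation_code")
        admin.send_keys('username')  # placeholder: your username
        root.send_keys('password')   # placeholder: your password
        # main() is the captcha recognition function from step 1
        date = main(driver)
        time.sleep(20)
        captch.send_keys(date)
        time.sleep(10)
        # "登录" is the text of the login link on the page
        driver.find_element(By.LINK_TEXT, u"登录").click()
        time.sleep(5)

3. If the script runs on a CentOS server, use PhantomJS for the simulated login. Downloading PhantomJS on CentOS is enough to do the same simple captcha cracking: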
    import time
    import json

    import requests
    import pytesseract
    from lxml import etree
    from PIL import Image, ImageEnhance
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    import update_config
    from get_captach import main


    def login():
        # Note: webdriver.PhantomJS was removed in Selenium 4, so this needs an older Selenium release.
        # The path below is a Windows path; on CentOS, point executable_path at the downloaded phantomjs binary.
        driver = webdriver.PhantomJS(executable_path=r'D:\phantomjs-2.1.1-windows\phantomjs-2.1.1-windows\bin\phantomjs.exe')
        driver.get("url")
        time.sleep(10)
        # Wait up to 30 seconds for the captcha image to be present in the page
        element = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.ID, 'codePic')))
        admin = driver.find_element(By.ID, "j_username")
        root = driver.find_element(By.ID, "j_password_show")
        captch = driver.find_element(By.ID, "j_validation_code")
        admin.send_keys('username')  # placeholder: your username
        root.send_keys('password')   # placeholder: your password
        date = main(driver)