7 Handling dynamically loaded data

MisterClown

Category: Python web scraping. Tags: python

Article link: https://blog.csdn.net/MisterClown/article/details/109689273

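1 Basic usage of the selenium module

1.1 Introduction

selenium began as an automated testing tool. Crawlers use it mainly because requests cannot execute JavaScript, so requests never sees content that a page loads dynamically. selenium works by driving a real browser and simulating what a user does, such as navigating, typing, clicking, and scrolling, which gives you the page as it looks after rendering. It supports multiple browsers.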

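1.2 Environment setup

Install selenium: pip install selenium
Download the browser driver (chromedriver): http://chromedriver.storage.googleapis.com/index.html
Check which driver version matches which browser version: http://blog.csdn.net/huilan_same/article/details/51896672
The code in this article uses the Selenium 3 API (executable_path, find_element_by_*). Selenium 4 deprecated these names and later 4.x releases removed them, so run the examples against Selenium 3 or adapt the calls.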

1.3 Simple examples

1 Scraping data

#2020-11-12
#Scrape cosmetics company names from the National Medical Products Administration site
from selenium import webdriver
from time import sleep
from lxml import etree
#Create a browser object; this opens a browser window automatically
bro = webdriver.Chrome(executable_path='chromedriver.exe')
#Request the URL
bro.get('http://scxk.nmpa.gov.cn:81/xk/')
#Get the page source after the browser has rendered it
page_text = bro.page_source

tree = etree.HTML(page_text)
li_list = tree.xpath('//*[@id="gzlist"]/li')
for li in li_list:
    name = li.xpath('./dl/@title')[0]
    print(name)
sleep(5)
bro.quit() #Close the browser after 5 seconds
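
The script above reads bro.page_source as soon as bro.get() returns, so on a slow connection the list may not have been filled in yet. Below is a minimal sketch of a polling helper that retries until a condition returns a truthy value. The name wait_until and its parameters are assumptions made for illustration and are not part of selenium, which ships its own WebDriverWait for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success. (Hypothetical helper for illustration.)
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)
```

With the browser object from the example, a call such as wait_until(lambda: bro.find_elements_by_xpath('//*[@id="gzlist"]/li')) would block until at least one list item exists, because find_elements_by_xpath returns an empty list while nothing matches.
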
2 Automating actions with selenium

```python
from selenium import webdriver
from time import sleep

bro = webdriver.Chrome(executable_path='./chromedriver.exe')
url = 'https://www.taobao.com/'
bro.get(url=url)
# Locate the search box
ser_input = bro.find_element_by_xpath('//*[@id="q"]')
# ser_input = bro.find_element_by_id('q')
# Type a search term into the box (the term means "Huawei")
ser_input.send_keys('华为')
# Run a JavaScript snippet that scrolls to the bottom of the page
bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
sleep(2)

# Locate the search button
ser_btn = bro.find_element_by_xpath('//*[@id="J_TSearchForm"]/div[1]/button')
# Click the search button to submit the query
ser_btn.click()

bro.get('https://www.baidu.com/')
sleep(2)
# Go back one page
bro.back()
# Go forward one page
bro.forward()

sleep(5)
bro.quit()
```

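1.4 Action chains and iframes

In the examples above, every interaction targets a specific node: for an input box we call its methods for typing or clearing text, and for a button we call its click method. Other operations, such as dragging with the mouse or pressing keys, have no particular target object. These are performed in a different way, with an action chain.
For example, dragging a node from one place to another can be implemented like this: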

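from selenium import webdriver
from time import sleep
from selenium.webdriver import ActionChains
bro = webdriver.Chrome(executable_path='./chromedriver.exe')

bro.get('http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable')
#If the target element lives inside an iframe, switch into that frame before locating it
bro.switch_to.frame('iframeResult')  #Switch the scope used for locating elements
#Locate the element
div = bro.find_element_by_xpath('//*[@id="draggable"]')

#Perform a series of actions on the div with an action chain
#Create an action chain object
action = ActionChains(bro)
#Click and hold the div
action.click_and_hold(div)
sleep(3) #Wait three seconds before moving the div to the right
for i in range(5):
    #17 is the horizontal offset in pixels, 0 the vertical offset; perform() runs the chain immediately
    action.move_by_offset(17,0).perform()
    sleep(0.3) #Pause 0.3 seconds after each move
#Release the mouse button; without perform() the release is never sent
ActionChains(bro).release().perform()

sleep(5)
bro.quit()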
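
The loop above moves the element in five equal steps of 17 pixels. When the total distance is what you know, the per-step offsets can be computed instead. The sketch below splits a distance into integer offsets that sum exactly to the total; split_offsets is a hypothetical helper name, not a selenium API.

```python
def split_offsets(total, steps):
    """Split `total` pixels into `steps` integer offsets that sum to `total`."""
    if steps <= 0:
        raise ValueError("steps must be positive")
    base, remainder = divmod(total, steps)
    # The first `remainder` steps carry one extra pixel each
    return [base + 1 if i < remainder else base for i in range(steps)]


print(split_offsets(85, 5))   # [17, 17, 17, 17, 17]
print(split_offsets(100, 3))  # [34, 33, 33]
```

Each value can then be passed in turn to action.move_by_offset(offset, 0).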

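1.5 Headless Chrome and detection evasion

By default selenium opens a visible browser window to carry out the automated steps, and the browser shows a notice that it is being controlled by automated software. How can we run without a window and avoid being detected?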

from time import sleep
from selenium import webdriver
#ChromeOptions carries both the headless flags and the anti-detection switch
from selenium.webdriver import ChromeOptions

#Headless settings: run Chrome without a visible window
option = ChromeOptions()
option.add_argument('--headless')
option.add_argument('--disable-gpu')
#Detection evasion: exclude the enable-automation switch
option.add_experimental_option('excludeSwitches', ['enable-automation'])

url = 'https://qzone.qq.com/'

#Pass a single options object; the original created a second browser and mixed chrome_options with options
bro = webdriver.Chrome(executable_path='./chromedriver.exe',options=option)
bro.get(url=url)
print(bro.page_source)
sleep(3)
bro.quit()
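
The same settings can be collected in one function so that every script applies them consistently. This is only a sketch: apply_stealth_headless is a hypothetical name, and the function relies on nothing beyond the add_argument and add_experimental_option methods already used above.

```python
def apply_stealth_headless(options):
    """Add the headless flags and the anti-detection switch to `options`.

    `options` is expected to be a selenium ChromeOptions instance, but any
    object with the same two methods works. (Hypothetical helper.)
    """
    for flag in ('--headless', '--disable-gpu'):
        options.add_argument(flag)
    # Exclude the enable-automation switch, as in the script above
    options.add_experimental_option('excludeSwitches', ['enable-automation'])
    return options
```

Usage would look like option = apply_stealth_headless(ChromeOptions()) before creating the browser object.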

1.6 Selenium case studies

1 Simulating a Qzone (QQ空间) login with selenium

# 2020-11-12 Simulate a Qzone login with selenium

from selenium import webdriver
from time import sleep
from selenium.webdriver import ActionChains

url = 'https://qzone.qq.com/'

bro = webdriver.Chrome(executable_path='./chromedriver.exe')
bro.get(url=url)
#The login form lives in an iframe, so switch into it first
bro.switch_to.frame('login_frame')  #Switch the scope used for locating elements
#Locate and click the account-and-password login link
click_btn = bro.find_element_by_id('switcher_plogin')
click_btn.click()
#Locate the account field
user_input = bro.find_element_by_id('u')
user_input.send_keys('12313')
sleep(2)
#Locate the password field
pwd_input = bro.find_element_by_id('p')
pwd_input.send_keys('12313')
sleep(2)
#Locate the login button
login_btn = bro.find_element_by_id('login_button')
login_btn.click()

#iframe = bro.find_element_by_xpath('//*[@id="tcaptcha_iframe"]')  # Find the nested iframe
bro.switch_to.frame('tcaptcha_iframe')  # Switch into the captcha iframe
# #Locate the slider
# div = bro.find_element_by_xpath('//*[@id="tcaptcha_drag_thumb"]')
div = bro.find_element_by_id('tcaptcha_drag_thumb')
#Create an action chain object; it needs the driver as its argument
div_action = ActionChains(bro)
#Click and hold the div
div_action.click_and_hold(div)
#Move the div
for i in range(10):
    div_action.move_by_offset(160,0).perform()
    sleep(0.3)
#Release the mouse button; the original called release() on an undefined name and never performed it
ActionChains(bro).release().perform()

2 Simulating a 12306 login with selenium

# 2020-11-14
# @09
from selenium import webdriver
from time import sleep
from PIL import Image
from selenium.webdriver import ActionChains
import requests
from hashlib import md5
#Detection evasion
from selenium.webdriver import ChromeOptions


class Chaojiying_Client(object):

    def __init__(self, username, password, soft_id):
        self.username = username
        password =  password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: image bytes
        codetype: captcha type code; see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files, headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: ID of the image whose result is being reported as wrong
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()




#Create the anti-detection options object
option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
bro = webdriver.Chrome('./chromedriver.exe',options=option)
bro.maximize_window()
sleep(1)
url = 'https://kyfw.12306.cn/otn/resources/login.html'
bro.get(url=url)
# #Anti-scraping workaround: stops the slider check from always asking for a refresh
script = 'Object.defineProperty(navigator,"webdriver",{get:()=>undefined,});'
bro.execute_script(script)

print(bro.get_window_size())
#Locate and click the account login tab
user_login = bro.find_element_by_xpath('/html/body/div[2]/div[2]/ul/li[2]/a')
user_login.click()
sleep(1)
#save_screenshot captures the current page and saves it to a file
bro.save_screenshot('aa.png')
#Work out the top-left and bottom-right corners of the captcha image, which fixes the crop region
code_img_ele = bro.find_element_by_xpath('//*[@id="J-loginImg"]')
location = code_img_ele.location  # x, y of the captcha image's top-left corner
print('location:',location)
size = code_img_ele.size  #Width and height of the captcha element
print('size:',size)
#Top-left and bottom-right coordinates
rangle = (
int(location['x']), int(location['y']), int(location['x'] + size['width']), int(location['y'] + size['height']))
#The captcha image region is now determined
i = Image.open('./aa.png')
code_img_name = './code.png'
#crop cuts the image down to the given region
frame = i.crop(rangle)
frame.save(code_img_name)
#Chaojiying (a captcha recognition service)
chaojiying = Chaojiying_Client('xxxx', 'xxxxx', '90936')    #Username, password, software ID (generate one under User Center >> Software ID)
with open(code_img_name, 'rb') as f:    #Read the cropped captcha image as bytes
    im = f.read()
#9004 is the captcha type code (see the price list on the official site)
#Call PostPic once and reuse the result; the original called it twice and uploaded the image twice
result = chaojiying.PostPic(im, 9004)['pic_str']
print(result)

all_list = [] #Coordinates of the points to click: [[x1,y1],[x2,y2]]

if '|' in result:
    list_1 = result.split('|')
    print(list_1)
    count_1 = len(list_1)
    for i in range(count_1):
        xy_list = []
        x = int(list_1[i].split(',')[0])
        y = int(list_1[i].split(',')[1])
        xy_list.append(x)
        xy_list.append(y)
        all_list.append(xy_list)

else:
    x = int(float(result.split(',')[0]))
    y = int(float(result.split(',')[1]))
    xy_list = []
    xy_list.append(x)
    xy_list.append(y)
    all_list.append(xy_list)
#Walk the list and use an action chain to click at each x,y offset within the captcha image
for l in all_list:
    x = l[0]
    y = l[1]
    ActionChains(bro).move_to_element_with_offset(code_img_ele, x, y).click().perform()
    sleep(0.5)
    
#Enter the username
bro.find_element_by_id('J-userName').send_keys('xxxxxxxx')
sleep(1)
#Enter the password
bro.find_element_by_id('J-password').send_keys('xxxxxxxx')
sleep(1)
#Click the login button
bro.find_element_by_id('J-login').click()
sleep(3)



#Slider verification
span = bro.find_element_by_xpath('//*[@id="nc_1_n1z"]')
#drag_and_drop_by_offset presses the slider, moves it 400 pixels to the right and releases it
ActionChains(bro).drag_and_drop_by_offset(span, 400, 0).perform()
sleep(5)
while True:
    try:
        info = bro.find_element_by_xpath('//*[@id="J-slide-passcode"]/div/span').text
        print(info)
        #The site shows this text ("Oops, something went wrong, click refresh to try again") when the slide is rejected
        if info == '哎呀，出错了，点击刷新再来一次':
            bro.find_element_by_xpath('//*[@id="J-slide-passcode"]/div/span/a').click()
            sleep(0.2)
            span = bro.find_element_by_xpath('//*[@id="nc_1_n1z"]')
            #Drag again after the refresh
            ActionChains(bro).drag_and_drop_by_offset(span, 400, 0).perform()
            sleep(7)
        else:
            sleep(0.5) #Avoid a busy loop while the prompt shows other text
    except Exception:
        #The prompt element can no longer be found, which this script treats as success
        print('ok!')
        break
#No separate release is needed: drag_and_drop_by_offset already releases the mouse button
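
The coordinate parsing in the script above can be written more compactly. This sketch assumes, as the script does, that the recognition result is a string of x,y pairs separated by | (for example 120,80|250,90); parse_points is a hypothetical helper name.

```python
def parse_points(result):
    """Convert 'x1,y1|x2,y2' (or a single 'x,y') into [[x1, y1], [x2, y2]]."""
    points = []
    for pair in result.split('|'):
        x_text, y_text = pair.split(',')
        # int(float(...)) also accepts values such as '120.0'
        points.append([int(float(x_text)), int(float(y_text))])
    return points


print(parse_points('120,80|250,90'))  # [[120, 80], [250, 90]]
print(parse_points('77,140'))         # [[77, 140]]
```

The returned list has the same shape as all_list, so the click loop can iterate over it unchanged.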

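Problems encountered while simulating the 12306 login

1 The captcha image was cropped at the wrong position. I tried many fixes, including the one suggested most often online, maximizing the window with bro.maximize_window(), and none of them worked. The cause turned out to be the system display scaling setting: changing it from 125% to 100% fixed the problem.
2 Captcha recognition kept failing no matter what I tried. It turned out that my Chaojiying account needed topping up.
3 The slider check also kept failing at first and always asked for a refresh. Adding these two lines fixed it; the site uses this as an anti-scraping check:
script = 'Object.defineProperty(navigator,"webdriver",{get:()=>undefined,});'
bro.execute_script(script)
4 Detection evasion is also required.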
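
For problem 1, an alternative to changing the system setting is to multiply the element coordinates by the scaling factor before cropping, on the assumption that the screenshot is captured in device pixels while location and size are reported in CSS pixels. The sketch below takes the factor as a known input (1.25 for 125% scaling); crop_box and its scale parameter are hypothetical names introduced for illustration.

```python
def crop_box(location, size, scale=1.0):
    """Return (left, top, right, bottom) for PIL's Image.crop.

    `location` and `size` are dicts like the ones selenium returns for an
    element; `scale` is the display scaling factor (assumed known).
    """
    left = location['x'] * scale
    top = location['y'] * scale
    right = (location['x'] + size['width']) * scale
    bottom = (location['y'] + size['height']) * scale
    return tuple(int(round(v)) for v in (left, top, right, bottom))


print(crop_box({'x': 100, 'y': 200}, {'width': 300, 'height': 188}, scale=1.25))
# (125, 250, 500, 485)
```

One way to obtain the factor at run time is bro.execute_script('return window.devicePixelRatio'); the result can then be passed as scale.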
