Python Notes 18.10.25

Today I wrote a crawler for a certain institution's exam questions. Here I note down the approach and the key points.

2018.10.25

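I found a stack of cards under a coworker's desk.

They were trial cards for a training institution's mock exams.

On a whim, I decided to scrape its entire question bank.

And then....

Here I record the overall approach and the rough process, as a reminder to myself

and for the convenience of anyone who comes after.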

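Roughly, these are the pieces I used:

selenium: automated login to obtain the cookies

requests + cookies: fetch the HTML directly

re: clean the data

xlwings: collect the results into Excel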




Details.....

selenium: automated login to obtain the cookies

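from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get(login_url)  # login_url: address of the login page (not shown here)

# card_no / pass_word: the number and password printed on the trial card
card_no_input = driver.find_element(By.XPATH, '//*[@id="card_no"]')
card_no_input.clear()
card_no_input.send_keys(card_no)

card_pwd_input = driver.find_element(By.XPATH, '//*[@id="card_pwd"]')
card_pwd_input.clear()
card_pwd_input.send_keys(pass_word)

driver.find_element(By.XPATH, '//*[@id="input1"]').click()  # log-in button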

After entering the card number and password and clicking log in:

cookies = driver.get_cookies()  # cookies as they stand after logging in


Key point............

cookie = {}
for temp in cookies:
    cookie[temp.get('name')] = temp.get('value')

This dict of name/value pairs, not the list returned by get_cookies(), is what requests needs as its cookies.
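To make the shape of the data clearer, here is a minimal sketch of the same conversion as a small function. The sample list is made up; it only imitates the kind of records get_cookies() returns (each one a dict with keys such as name, value, domain, and path). The function name cookies_to_dict is my own and not part of selenium.

```python
def cookies_to_dict(selenium_cookies):
    """Collapse Selenium's list of cookie records into a name -> value dict."""
    return {c["name"]: c["value"] for c in selenium_cookies}


# Made-up sample in the shape driver.get_cookies() returns
sample_cookies = [
    {"name": "PHPSESSID", "value": "abc123", "domain": "example.com", "path": "/"},
    {"name": "token", "value": "xyz", "domain": "example.com", "path": "/"},
]
cookie = cookies_to_dict(sample_cookies)
```

Only the name and value survive; domain, path, and expiry are dropped, which is fine when every request goes to the same site.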

Second detail...

requests + cookies: fetch the HTML directly

import requests

# url: address of the page to fetch; cookie: the dict built above
response = requests.get(url, cookies=cookie)
response.encoding = 'utf-8'
html = response.text

html now holds the full source of the page.
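When many pages have to be fetched, it may be more convenient to put the cookies on a requests.Session once instead of passing them on every call. The following is only a sketch under that assumption: make_session and fetch_html are names I made up, the cookie value is a placeholder, and the URL in the comment is hypothetical.

```python
import requests


def make_session(cookie):
    """Build a Session that sends the given name -> value cookies on every request."""
    session = requests.Session()
    session.cookies.update(cookie)
    return session


def fetch_html(session, url, timeout=10):
    """Fetch one page and return its text, raising on HTTP errors."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    response.encoding = "utf-8"  # same encoding the post sets explicitly
    return response.text


session = make_session({"PHPSESSID": "abc123"})  # made-up cookie for illustration
# html = fetch_html(session, "https://example.com/exam/1")  # hypothetical URL
```

The timeout and raise_for_status() are additions of mine, so that a hung or failed request shows up as an error rather than as an empty page.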

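The third part, the regex cleaning, is grunt work done by brute force. I used re. As for xpath, I'm still unfamiliar with it; I'm a beginner who has never used it in practice, so I didn't look into it here.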
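As a rough idea of what this kind of brute-force cleaning can look like, here is a small sketch with re. The markup is invented for the example (a div with class "question"); the real page structure is not shown in this post, so the pattern would have to be adapted to the actual HTML.

```python
import html as html_lib
import re

# Hypothetical markup: the real page structure is not shown in this post.
QUESTION_RE = re.compile(r'<div class="question">(.*?)</div>', re.S)
TAG_RE = re.compile(r"<[^>]+>")


def clean_fragment(fragment):
    """Strip tags, decode entities, and collapse whitespace in one HTML fragment."""
    text = TAG_RE.sub("", fragment)
    text = html_lib.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def extract_questions(page_html):
    """Return the cleaned text of every question block found in the page."""
    return [clean_fragment(m) for m in QUESTION_RE.findall(page_html)]


sample = """
<div class="question"><b>1.</b> What is the output of 1 &amp; 3?</div>
<div class="question"><b>2.</b> What does
    len([1, 2, 3]) return?</div>
"""
questions = extract_questions(sample)
# questions == ["1. What is the output of 1 & 3?",
#               "2. What does len([1, 2, 3]) return?"]
```

The re.S flag lets the dot match newlines, which matters because a question often spans several lines of HTML. The non-greedy (.*?) stops at the first closing div, so nested divs would break this pattern.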

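The last part is working with xlwings; a little experimenting is enough to get it going.
Here is a link that covers basic xlwings usage:
https://www.jianshu.com/p/e21894fc5501
so I won't paste it again here.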
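For completeness, a minimal sketch of how the cleaned questions might be written out. The column layout, the function names, and the file name are all assumptions of mine, and write_to_excel needs xlwings plus a local Excel installation, so the call is left commented out.

```python
def build_rows(questions):
    """Turn a list of question strings into a header row plus numbered rows."""
    rows = [["No.", "Question"]]
    for number, text in enumerate(questions, start=1):
        rows.append([number, text])
    return rows


def write_to_excel(rows, path):
    """Write rows to the first sheet of a new workbook and save it.

    Requires xlwings and a local Excel installation.
    """
    import xlwings as xw  # imported here so build_rows works without Excel

    wb = xw.Book()                  # new empty workbook
    sheet = wb.sheets[0]
    sheet.range("A1").value = rows  # a nested list fills a 2-D block from A1
    wb.save(path)
    wb.close()


rows = build_rows(["What is 1 + 1?", "Name a Python web framework."])
# write_to_excel(rows, "questions.xlsx")  # hypothetical output file name
```

Assigning the whole nested list to one range writes everything in a single call, which tends to be much faster than filling cells one at a time.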

Thanks to Python for the joy it has brought me.