Python Web Scraping (Part 2, Final)

Picking up where we left off in Part 1:

Python Web Scraping (Part 1): https://blog.csdn.net/weixin_44271280/article/details/129676683

V. The selenium Library

1. Introduction to selenium

1.1 What is selenium?

  1. Selenium is a tool for testing web applications.

  2. Selenium tests run directly in the browser, just as a real user would operate it.

  3. It drives real browsers to run the tests through various drivers (FirefoxDriver, InternetExplorerDriver, OperaDriver, ChromeDriver).

  4. selenium also supports driving headless (no-GUI) browsers.

1.2 Why use selenium?

It simulates a real browser and automatically executes the JavaScript in a page, so dynamically loaded content can be scraped.
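
As a minimal sketch (assuming a Selenium 4 install that can locate a matching chromedriver on its own; driver setup with an explicit path follows in section 2), the benefit shows up in page_source, which is the DOM after the page's JavaScript has run:

from selenium import webdriver

browser = webdriver.Chrome()        # assumes Selenium can find a matching chromedriver
browser.get('https://www.baidu.com')

# page_source is the rendered DOM, after JavaScript has executed,
# unlike the raw HTML that urllib/requests would return
html = browser.page_source
print(len(html))

browser.quit()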

2. Installing and Using selenium

  1. Download page for the Chrome driver:

http://chromedriver.storage.googleapis.com/index.html

  2. Mapping table between chromedriver versions and Chrome versions:

http://blog.csdn.net/huilan_same/article/details/51896672

  3. Check your Chrome version:

Chrome menu (top-right corner) --> Help --> About Google Chrome

  4. Install selenium

In the /python/Scripts/ directory, run pip install selenium

  5. Import the selenium package

from selenium import webdriver

  6. Create a browser object

path = 'path/to/chromedriver.exe'

browser = webdriver.Chrome(path)

(Note: newer Selenium 4 releases drop this positional path argument; wrap the path in a Service object as shown in the headless example in section 6.)

  7. Visit a website

url = 'https://www.baidu.com'

browser.get(url)

3. Locating Elements

3.1 Getting an element by the value of a tag attribute

3.1.1 Finding an element by id
from selenium import webdriver

# chromedriver.exe

path = 'chromedriver.exe'

browser = webdriver.Chrome(path)

url = 'https://www.baidu.com'

browser.get(url=url)

button = browser.find_element(by='id',value='su')
print(button)
# <selenium.webdriver.remote.webelement.WebElement (session="7bf780ef3a7a01665ebc7dd63bb4309b", element="d6d60e6e-f7a0-4170-9360-d9eea5c4c4a4")>

3.1.2 Getting an element with an XPath expression
button = browser.find_element(by='xpath',value='//input[@id="su"]')
print(button)
# <selenium.webdriver.remote.webelement.WebElement (session="7bf780ef3a7a01665ebc7dd63bb4309b", element="d6d60e6e-f7a0-4170-9360-d9eea5c4c4a4")>

3.1.3 Getting an element by its tag name
input = browser.find_element(by='tag name',value='input')
print(input)
# <selenium.webdriver.remote.webelement.WebElement (session="7bf780ef3a7a01665ebc7dd63bb4309b", element="a777d1fb-8802-41f0-a39b-2e1e1ee1b329")>

3.1.4 Getting an element with a CSS selector (bs4-style syntax)
button = browser.find_element(by='css selector',value='#su')
print(button)
# <selenium.webdriver.remote.webelement.WebElement (session="7bf780ef3a7a01665ebc7dd63bb4309b", element="d6d60e6e-f7a0-4170-9360-d9eea5c4c4a4")> 

3.1.5 Getting an <a> element by its link text
button = browser.find_element(by='link text',value='新闻')
print(button)
# <selenium.webdriver.remote.webelement.WebElement (session="7bf780ef3a7a01665ebc7dd63bb4309b", element="548094f1-bf3d-412a-9d03-40ee6032fcef")>

4. Accessing Element Information

4.1 Get the value of the class attribute of the element whose id is su

from selenium import webdriver

path = 'chromedriver.exe'

browser = webdriver.Chrome(path)

url = 'http://www.baidu.com'

browser.get(url=url)

input = browser.find_element(by='id',value='su')
print(input.get_attribute('class'))
# bg s_btn

4.2 Get the tag name

print(input.tag_name)
# input

4.3 Get the element's text

button = browser.find_element(by='link text',value='新闻')
print(button.text)
# 新闻

5. Interaction

5.1 Typing into a text box

from selenium import webdriver

path = 'chromedriver.exe'

browser = webdriver.Chrome(path)

url = 'http://www.baidu.com'

browser.get(url=url)

# get the search box
input_txt = browser.find_element(by='id',value='kw')

# type 周杰伦 into it
input_txt.send_keys('周杰伦')

import time
# sleep for 2 seconds
time.sleep(2)

5.2 Clicking a button

# get the search button
button_baidu = browser.find_element(by='id',value='su')

# click it
button_baidu.click()

5.3 Scrolling to the bottom of the page

# done by executing a snippet of JavaScript
js_bottom = 'document.documentElement.scrollTop=100000'
browser.execute_script(js_bottom)

5.4 Going back to the previous page

browser.back()

5.5 Going forward again

browser.forward()

5.6 Clearing the input box

input_txt.clear()

5.7 Quitting the browser

browser.quit()

5.8 Exercise: automating Baidu with selenium

Operations to perform:

Open Baidu -> search for 周杰伦 -> scroll to the bottom -> click Next page -> click Next page again -> go back -> go forward -> clear the input box -> quit

from selenium import webdriver

path = 'chromedriver.exe'

browser = webdriver.Chrome(path)

url = 'http://www.baidu.com'

browser.get(url=url)

import time

# sleep for 2 seconds
time.sleep(2)

# get the search box
input_txt = browser.find_element(by='id',value='kw')

# type 周杰伦 into it
input_txt.send_keys('周杰伦')

time.sleep(2)

# get the 百度一下 (search) button
button_baidu = browser.find_element(by='id',value='su')

# click it
button_baidu.click()

time.sleep(2)

# scroll to the bottom
js_bottom = 'document.documentElement.scrollTop=100000'
browser.execute_script(js_bottom)

time.sleep(2)

# get the Next page button
button_next = browser.find_element(by='xpath',value='//a[@class="n"]')

# click Next page
button_next.click()

time.sleep(2)

# click Next page again
button_next.click()

time.sleep(2)

# go back to the previous page
browser.back()

time.sleep(2)

# go forward again
browser.forward()

time.sleep(2)

# clear the search box
input_txt.clear()

time.sleep(3)

# quit the browser
browser.quit()

6. Chrome headless (running the browser without a GUI)

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service


def share_browser() :
    # wrap the chromedriver path in a Service object
    service = Service(r'chromedriver.exe')
    options = Options()
    options.add_argument('--headless')       # run without opening a browser window
    options.add_argument('--disable-gpu')
    browser = webdriver.Chrome(options=options, service=service)
    return browser


browser = share_browser()
url = 'https://baidu.com'
browser.get(url)
browser.save_screenshot('baidu.png')

VI. The requests Library

Install requests: in the /python/Scripts directory, run pip install requests

  1. Basic usage: one type and six properties

1.1 The one type

import requests
url = 'http://www.baidu.com'
response = requests.get(url=url)

print(type(response))
# <class 'requests.models.Response'>
# i.e. the Response type

1.2 The six properties

1.2.1 Set the response's encoding
response.encoding = 'utf-8'

1.2.2 Return the page source as a string
print(response.text)

1.2.3 Return the URL
print(response.url)
# http://www.baidu.com

1.2.4 Return the page source as binary data (bytes)
print(response.content)

1.2.5 Return the response status code
print(response.status_code)
# 200

1.2.6 Return the response headers
print(response.headers)
# {'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Mon, 20 Mar 2023 12:51:11 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:56 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}

  2. GET requests

Fetch a Baidu search result page with a GET request

import requests

url = 'http://www.baidu.com/s?'

headers = {
    'Cookie':'BIDUPSID=BC221A7A6E195D713FF461A23C0C6C03; PSTM=1660552946; BDUSS=h2dzlpZUJuWTNjM25LV1ctZHpKQXVjREhGc2lYb2VxZFV6Y2RvZ3lvNTNybFJqRVFBQUFBJCQAAAAAAAAAAAEAAAALqAno1rvKo7vY0uR3AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAHchLWN3IS1jdE; BDUSS_BFESS=h2dzlpZUJuWTNjM25LV1ctZHpKQXVjREhGc2lYb2VxZFV6Y2RvZ3lvNTNybFJqRVFBQUFBJCQAAAAAAAAAAAEAAAALqAno1rvKo7vY0uR3AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAHchLWN3IS1jdE; BAIDUID=009B581D121BAB0A028000A422796207:SL=0:NR=10:FG=1; BD_UPN=12314753; BAIDUID_BFESS=009B581D121BAB0A028000A422796207:SL=0:NR=10:FG=1; ZFY=lj:AKw8fsMYoAmQP38mmIPNebgbXLMwDG0N9ODwS5quI:C; sug=3; sugstore=0; ORIGIN=2; bdime=0; BD_HOME=1; BDRCVFR[feWj1Vr5u3D]=I67x6TjHwwYf0; BD_CK_SAM=1; PSINO=7; delPer=0; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; BAIDU_WISE_UID=wapp_1679361289881_69; __bid_n=186718ae070167b4ec4207; FPTOKEN=HX9cJIAC1OsbehNXqos0lPZ/kLt5B6mSgi7zWXntOJnl4A6kF+y3jGiOQawhEPMcs0TweODc71y7H75PgfY2tyb6Y7AkqPGxfww+2VM3N0z0zh9sw82VrvG39nWSW2EcLh/vXTruFA3LOas/iF5Q8S/m5GlcL1iM6R8etzBqT2Ys+FalyyytWGJ5b8rjXk6DhUoPoUkxJMce9V2EjezsO+t2k+dE+LBWZRrHar8xANB/0VHMEICLaOOEueHnlgCOefbn0QNWyxeyZx9/8gSnwW0lBDCWx/APlOn9pFHsbmPHMg86HlOOIyivCnnJobN5XyCFkC3I2WMu3DHPFOBUGPNO8ayjKswvhDj9J84plb+A+hmwofCUzoQnNCZlMuFa3gH6hSPkYw5TjTnpcH/SoQ==|29Qf0FbXYRIMwpn44o4mB5+0O6zasymNL74pMZL59jw=|10|ae92cbfed415aa2e64c3be6f636f8677; arialoadData=false; COOKIE_SESSION=5_0_8_9_3_21_1_2_8_8_1_3_427489_0_21_0_1679296227_0_1679296206%7C9%23276178_339_1678349345%7C9; shifen[797184_97197]=1679361811; shifen[1720973_97197]=1679361815; BCLID=11675428486066043407; BCLID_BFESS=11675428486066043407; BDSFRCVID=Km4OJeCT5G09-drfyDIHuUzbzOVXzVJTTPjcTR5qJ04BtyCVcmimEG0PtOg3cu-M_EGSogKKy2OTH90F_2uxOjjg8UtVJeC6EG0Ptf8g0M5; BDSFRCVID_BFESS=Km4OJeCT5G09-drfyDIHuUzbzOVXzVJTTPjcTR5qJ04BtyCVcmimEG0PtOg3cu-M_EGSogKKy2OTH90F_2uxOjjg8UtVJeC6EG0Ptf8g0M5; H_BDCLCKID_SF=tJAO_C82tIP3fP36q45HMt00qxby26nBMgn9aJ5nQI5nhbvb3fnt2f3LbpoPhtjEBacvXfQFQUbmjRO206oay6O3LlO83h5M55rbKl0MLPbceIOn5DcDjJL73xnMBMn8teOnaIIM3fAKftnOM46JehL3346-35543bRTLnLy5KJYMDFRjj8KjjbBDHRf-b-XKD600PK8Kb7Vbp5gqfnkbft7jttjqCrb-Dc8KIjIbx50H4cO3l703xI73b3B5h3NJ66ZoIbPbPTTSROzMtcpQT8r5-nMJ-JtHmjdKl3Rab3vOPI4XpO1ej8zBN5thURB2DkO-4bCWJ5TMl5jDh3Mb6ksD-FtqjDefnIe_Ityf-3bfTrP-trf5DCShUFs5CuOB2Q-5M-a3KJOVU5ObRrUyJ_03q6ahfRpyTQpafbmLncjSM_GKfC2jMD32tbpK-bH5gTxoUJ2Bb05HPKz-6oh3hKebPRih6j9Qg-8opQ7tt5W8ncFbT7l5hKpbt-q0x-jLTnhVn0MBCK0hIKmD6_bj6oM5pJfetjK2CntsJOOaCvvDDbOy4oT35L1DauLKnjhMmnAaP5GMJo08UQy2ljk3h0rMxbnQjQDWJ4J5tbX0MQjDJTzQft20b0gbNb2-CruX2Txbb7jWhvBhl72y5u2QlRX5q79atTMfNTJ-qcH0KQpsIJM5-DWbT8IjH62btt_JJueoKnP; H_BDCLCKID_SF_BFESS=tJAO_C82tIP3fP36q45HMt00qxby26nBMgn9aJ5nQI5nhbvb3fnt2f3LbpoPhtjEBacvXfQFQUbmjRO206oay6O3LlO83h5M55rbKl0MLPbceIOn5DcDjJL73xnMBMn8teOnaIIM3fAKftnOM46JehL3346-35543bRTLnLy5KJYMDFRjj8KjjbBDHRf-b-XKD600PK8Kb7Vbp5gqfnkbft7jttjqCrb-Dc8KIjIbx50H4cO3l703xI73b3B5h3NJ66ZoIbPbPTTSROzMtcpQT8r5-nMJ-JtHmjdKl3Rab3vOPI4XpO1ej8zBN5thURB2DkO-4bCWJ5TMl5jDh3Mb6ksD-FtqjDefnIe_Ityf-3bfTrP-trf5DCShUFs5CuOB2Q-5M-a3KJOVU5ObRrUyJ_03q6ahfRpyTQpafbmLncjSM_GKfC2jMD32tbpK-bH5gTxoUJ2Bb05HPKz-6oh3hKebPRih6j9Qg-8opQ7tt5W8ncFbT7l5hKpbt-q0x-jLTnhVn0MBCK0hIKmD6_bj6oM5pJfetjK2CntsJOOaCvvDDbOy4oT35L1DauLKnjhMmnAaP5GMJo08UQy2ljk3h0rMxbnQjQDWJ4J5tbX0MQjDJTzQft20b0gbNb2-CruX2Txbb7jWhvBhl72y5u2QlRX5q79atTMfNTJ-qcH0KQpsIJM5-DWbT8IjH62btt_JJueoKnP; RT="z=1&dm=baidu.com&si=13a1c330-42ed-4127-b6f2-f24a8d4e32ad&ss=lfhkckh4&sl=g&tt=1ixr&bcn=https%3A%2F%2Ffclog.baidu.com%2Flog%2Fweirwood%3Ftype%3Dperf&ld=bjd8&ul=bu3x&hd=bu4e"; 
BA_HECTOR=2g8g2k8k04ala08l200l24bc1i1ikvd1n; H_PS_PSSID=38185_36550_38354_38366_37862_38170_38289_38246_36804_38261_37937_38312_38382_38285_38041_26350_37958_22159_38282_37881; H_PS_645EC=e73839RPUo9xfSdEWH%2BgZ2bKX7tRqQHy3xBHs9JSckkQDx6S5tfVDzVkc1R94pTfrrr5; BDSVRTM=397; baikeVisitId=0486a65c-3d0d-4c44-9543-410350269b72',
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'
}

data = {
    'wd':'北京'
}


# url: the request URL
# params: the query-string parameters
# kwargs: extra keyword arguments passed as a dict (headers, etc.)
response = requests.get(url=url,params=data,headers=headers)

content = response.text


with open('baidu.html','w',encoding='utf-8') as fp :
    fp.write(content)

  3. POST requests

Scrape Baidu Translate results with a POST request

import requests

url = 'https://fanyi.baidu.com/v2transapi?from=en&to=zh'

headers = {
    'Cookie':'BIDUPSID=BC221A7A6E195D713FF461A23C0C6C03; PSTM=1660552946; REALTIME_TRANS_SWITCH=1; FANYI_WORD_SWITCH=1; HISTORY_SWITCH=1; SOUND_SPD_SWITCH=1; SOUND_PREFER_SWITCH=1; BDUSS=h2dzlpZUJuWTNjM25LV1ctZHpKQXVjREhGc2lYb2VxZFV6Y2RvZ3lvNTNybFJqRVFBQUFBJCQAAAAAAAAAAAEAAAALqAno1rvKo7vY0uR3AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAHchLWN3IS1jdE; BDUSS_BFESS=h2dzlpZUJuWTNjM25LV1ctZHpKQXVjREhGc2lYb2VxZFV6Y2RvZ3lvNTNybFJqRVFBQUFBJCQAAAAAAAAAAAEAAAALqAno1rvKo7vY0uR3AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAHchLWN3IS1jdE; BAIDUID=009B581D121BAB0A028000A422796207:SL=0:NR=10:FG=1; APPGUIDE_10_0_2=1; Hm_lvt_64ecd82404c51e03dc91cb9e8c025574=1677725867,1679036239,1679280393; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; BA_HECTOR=8g0h040g0h8l0404250124cu1i1it081n; BDRCVFR[feWj1Vr5u3D]=I67x6TjHwwYf0; delPer=0; PSINO=7; BAIDUID_BFESS=009B581D121BAB0A028000A422796207:SL=0:NR=10:FG=1; ZFY=lj:AKw8fsMYoAmQP38mmIPNebgbXLMwDG0N9ODwS5quI:C; H_PS_PSSID=38185_36550_38354_38366_37862_38170_38289_38246_36804_38261_37937_38312_38382_38285_38041_26350_37958_22159_38282_37881; BDSFRCVID=JJ-OJexroG07VWbfyqX-uUzbz_weG7bTDYrEOwXPsp3LGJLVFe3JEG0Pts1-dEu-S2OOogKKy2OTH90F_2uxOjjg8UtVJeC6EG0Ptf8g0M5; BDSFRCVID_BFESS=JJ-OJexroG07VWbfyqX-uUzbz_weG7bTDYrEOwXPsp3LGJLVFe3JEG0Pts1-dEu-S2OOogKKy2OTH90F_2uxOjjg8UtVJeC6EG0Ptf8g0M5; BCLID=8097868108805834232; BCLID_BFESS=8097868108805834232; H_BDCLCKID_SF=tbIJoDK5JDD3fP36q45HMt00qxby26n45Pj9aJ5nQI5nhKIzbb5t2f3LQloPhtjEBacvXfQFQUbmjRO206oay6O3LlO83h52aC5LKl0MLPbceIOn5DcYBUL10UnMBMn8teOnaIIM3fAKftnOM46JehL3346-35543bRTLnLy5KJYMDF4DTAhDT3QeaRf-b-XKD600PK8Kb7VbnDzeMnkbft7jttjahJPBKc8KpThbx50H4cO3l703xI73b3B5h3NJ66ZoIbPbPTTSR3TK4OpQT8r5-nMJ-JtHCutLxbCab3vOPI4XpO1ej8zBN5thURB2DkO-4bCWJ5TMl5jDh3Mb6ksD-FtqjttJnut_KLhf-3bfTrP-trf5DCShUFsB-uJB2Q-5M-a3KtBKJb4bRrUyfk03q6ahfRpyTQpafbmLncjSM_GKfC2jMD32tbp5-r0amTxoUJ2Bb05HPKzXqnpQptebPRih6j9Qg-8opQ7tt5W8ncFbT7l5hKpbt-q0x-jLTnhVn0MBCK0hD89DjKKD6PVKgTa54cbb4o2WbCQXMOm8pcN2b5oQTOW3RJaKJcDM6Qt-4ctMf7beq06-lOUWJDkXpJvQnJjt2JxaqRC3JK5Ol5jDh3MKToDb-oteltHB2Oy0hvcMCocShPwDMjrDRLbXU6BK5vPbNcZ0l8K3l02V-bIe-t2XjQhDH-OJ6DHtJ3aQ5rtKRTffjrnhPF3yxTDXP6-hnjy3b4f-f3t5tTao56G3J6D2l4Wbttf5q3Ry6r42-39LPO2hpRjyxv4Q4Qyy4oxJpOJ-bCL0p5aHx8K8p7vbURv2jDg3-A8JU5dtjTO2bc_5KnlfMQ_bf--QfbQ0hOhqP-jBRIE3-oJqC_-MKt93D; H_BDCLCKID_SF_BFESS=tbIJoDK5JDD3fP36q45HMt00qxby26n45Pj9aJ5nQI5nhKIzbb5t2f3LQloPhtjEBacvXfQFQUbmjRO206oay6O3LlO83h52aC5LKl0MLPbceIOn5DcYBUL10UnMBMn8teOnaIIM3fAKftnOM46JehL3346-35543bRTLnLy5KJYMDF4DTAhDT3QeaRf-b-XKD600PK8Kb7VbnDzeMnkbft7jttjahJPBKc8KpThbx50H4cO3l703xI73b3B5h3NJ66ZoIbPbPTTSR3TK4OpQT8r5-nMJ-JtHCutLxbCab3vOPI4XpO1ej8zBN5thURB2DkO-4bCWJ5TMl5jDh3Mb6ksD-FtqjttJnut_KLhf-3bfTrP-trf5DCShUFsB-uJB2Q-5M-a3KtBKJb4bRrUyfk03q6ahfRpyTQpafbmLncjSM_GKfC2jMD32tbp5-r0amTxoUJ2Bb05HPKzXqnpQptebPRih6j9Qg-8opQ7tt5W8ncFbT7l5hKpbt-q0x-jLTnhVn0MBCK0hD89DjKKD6PVKgTa54cbb4o2WbCQXMOm8pcN2b5oQTOW3RJaKJcDM6Qt-4ctMf7beq06-lOUWJDkXpJvQnJjt2JxaqRC3JK5Ol5jDh3MKToDb-oteltHB2Oy0hvcMCocShPwDMjrDRLbXU6BK5vPbNcZ0l8K3l02V-bIe-t2XjQhDH-OJ6DHtJ3aQ5rtKRTffjrnhPF3yxTDXP6-hnjy3b4f-f3t5tTao56G3J6D2l4Wbttf5q3Ry6r42-39LPO2hpRjyxv4Q4Qyy4oxJpOJ-bCL0p5aHx8K8p7vbURv2jDg3-A8JU5dtjTO2bc_5KnlfMQ_bf--QfbQ0hOhqP-jBRIE3-oJqC_-MKt93D; ab_sr=1.0.1_MTVjY2NjMmVjZWI3NDkwMTQwY2IyNjhlODUzODZmMWVmZTY5YTUyNzQyYTkxNjFkZWJkODc4ZGM4YjQ4YjRhYTE1NDQ1ZmQ2NDdjMmVlNGRhMmQ4NzMwZTg0YjYwNGRmMzI4NGUyY2RiYjc2MjgwODM2ZTZiMDEzYmRjNDM2ZjY2NDljYzI4ZWE4OGRhYThiYzc2M2YzNmYzNWYyMzE3ZDZkNjUyMzIwMDE4YjI2ZTI0MTg1ODg5M2ZiZDAzOTFi; Hm_lpvt_64ecd82404c51e03dc91cb9e8c025574=1679391374',
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36',
}

data = {
    'from':'en',
    'to':'zh',
    'query':'lover',
    'simple_means_flag':'3',
    'sign':'779242.983259',
    'token':'fc3e9c7437f19e66e7c851920a776f05',
    'domain':'common',
}

# url: the request URL
# data: the form parameters
# kwargs: extra keyword arguments passed as a dict (headers, etc.)
response = requests.post(url=url,data=data,headers=headers)
response.encoding = 'utf-8'

content = response.text

import json

obj = json.loads(content)

print(obj)

Notes:

(1) With requests, the POST data does not need to be encoded or decoded by hand.

(2) The POST parameters are passed as data.

(3) No Request-object customization is needed; the headers go straight into the call.
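
For contrast with the urllib approach used in Part 1 (recalled here only in the comments), requests takes the plain dict directly:

# with urllib, the data had to be encoded and wrapped in a Request object first:
#     data = urllib.parse.urlencode(data).encode('utf-8')
#     request = urllib.request.Request(url=url, data=data, headers=headers)
# with requests, the dict is passed as-is:
response = requests.post(url=url, data=data, headers=headers)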

  4. Logging in to the gushiwen site with cookies

Submit a deliberately wrong login to find the login endpoint and the parameters it needs:

__VIEWSTATE: Ex1PVx+h6L9R6etKhj960tstXW3+ejGnUvV/SODQ04iUB5sJ15hNVPO6ZP33bPkhe2LPrNLTKSY+pPgxVCNSXIg9xyl0lpOo1XaUvoCoXtZaRnKctEI+vbbrfaYLQ+RRKmkm8NK5exHqSzhok/cQB/Ch5wQ=
__VIEWSTATEGENERATOR: C93BE1AE
from: http://so.gushiwen.cn/user/collect.aspx
email: 19909602290
pwd: wu111111
code: M4j7
denglu: 登录

Looking at the parameters, __VIEWSTATE, __VIEWSTATEGENERATOR and code are the ones that vary.

Difficulties:

(1) The unknown values __VIEWSTATE and __VIEWSTATEGENERATOR

Solution:

Values you cannot see in the form are usually hidden in the page source, so fetch the login page and pull the hidden values out of it.

(2) The captcha code

Solution:

Download the captcha image --> read it by eye (or with image recognition) to get the code

import requests

# URL of the login page
url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36',
}

response = requests.get(url=url,headers=headers)

# get the page source
content = response.text

# parse the page source to extract __VIEWSTATE and __VIEWSTATEGENERATOR
from bs4 import BeautifulSoup

soup = BeautifulSoup(content,'lxml')

# get __VIEWSTATE
viewstate = soup.select('#__VIEWSTATE')[0].attrs.get('value')

# get __VIEWSTATEGENERATOR
viewstategenerator = soup.select('#__VIEWSTATEGENERATOR')[0].attrs.get('value')


# get the captcha image URL
codeimg = soup.select('#imgCode')[0].attrs.get('src')
codeimgUrl = 'https://so.gushiwen.cn' + codeimg

# after getting the captcha image URL, download the image locally, look at it, and then type the code into the console
# import urllib.request

# urllib.request.urlretrieve(url=codeimgUrl,filename='requests库\code.jpg')
# urlretrieve would request the captcha image separately; the login request below would then be served a brand-new captcha, so the code we just read would already be invalid
# conclusion: urllib.request.urlretrieve cannot be used here


# the session() method
# requests.session() returns a Session object; requests made through it share the same cookies
session = requests.session()

# fetch the captcha image through the same session
response_code = session.get(codeimgUrl)
# use .content here: we are downloading an image, so we need the raw binary data
content_code = response_code.content

# the 'wb' mode writes binary data to the file
with open('requests库\code.jpg','wb') as fp :
    fp.write(content_code)

# read the captcha by eye
code_name = input('请输入你的验证码:')

# URL that the login form posts to
url_post = 'https://so.gushiwen.cn/user/login.aspx?from=http%3a%2f%2fso.gushiwen.cn%2fuser%2fcollect.aspx'

data_post = {
    '__VIEWSTATE':viewstate,
    '__VIEWSTATEGENERATOR':viewstategenerator,
    'from':'http://so.gushiwen.cn/user/collect.aspx',
    'email':'19909602290',
    'pwd':'wu101042',
    'code':code_name,
    'denglu':'登录',
}

# post through the same session so the captcha we entered is still tied to our cookies
response_post = session.post(url=url_post,headers=headers,data=data_post)

content_post = response_post.text

with open('requests库\gushiwen.html','w',encoding='utf-8') as fp :
    fp.write(content_post)

  5. Recognizing captcha images automatically with the 超级鹰 (Chaojiying) platform

Reference: https://zhuanlan.zhihu.com/p/558563219
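
Such platforms are typically called over HTTP by uploading the captcha image. The sketch below only shows the general shape of that call; the endpoint, field names, and codetype value are placeholders, not the platform's actual API (see the linked article for the official Python demo):

import requests

def recognize_captcha(image_path) :
    # hypothetical endpoint and fields, for illustration only
    api_url = 'https://example.com/captcha/recognize'
    data = {'user':'your_account','password':'your_password','codetype':'1902'}
    with open(image_path,'rb') as fp :
        files = {'userfile':fp}
        response = requests.post(api_url,data=data,files=files)
    return response.json()

# result = recognize_captcha('requests库/code.jpg')
# print(result)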

VII. The scrapy Framework

Definition: Scrapy is an application framework written for crawling websites and extracting structured data.

It can be used in a wide range of programs, including data mining, information processing, and archiving historical data.

  1. Installing scrapy

In the /python/Scripts directory, run pip install scrapy

A common failure:

Error: building 'twisted.test.raiser' extension

Fix: https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted

Download the Twisted wheel that matches your environment from that page.

cp: your Python version

win: your Windows architecture (32-bit or 64-bit)
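
For example, assuming Python 3.10 on 64-bit Windows (the wheel file name below is only illustrative; use the one you actually downloaded), install it and then retry scrapy:

pip install Twisted-22.10.0-cp310-cp310-win_amd64.whl
pip install scrapy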

  2. Creating and Running a scrapy Project

2.1 Create the project

scrapy startproject <project name>

Note: the project name must not start with a digit and must not contain Chinese characters.
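
For example, the project referenced in the fix under 2.3 was created with:

scrapy startproject scrapy_baidu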

2.2 Create the spider

Create the spider from a terminal inside the project's spiders directory.

Note the command format:

scrapy genspider <spider name> <page to crawl>

eg: scrapy genspider baidu http://www.baidu.com

2.3 Run the spider

scrapy crawl <spider name (the spider's name attribute)>

Running it directly may not work, because scrapy obeys the robots protocol by default and refuses to crawl some pages.

Fix:

1. In ..\爬虫\scrapy_baidu\scrapy_baidu\settings.py, comment out ROBOTSTXT_OBEY = True so the robots protocol is no longer obeyed (see the snippet below).

2. Run scrapy crawl <spider name> again.
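
After the change, the relevant lines in settings.py look like this (the same edit appears again in section 8.1):

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True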

3. Structure of a scrapy Project

project_name/
    project_name/
        spiders/
            __init__.py
            (the spider files are stored in this folder)
        __init__.py
        items.py        (defines the data structure: which fields get scraped)
        middlewares.py  (middleware, e.g. proxies)
        pipelines.py    (pipelines that save the scraped items)
        settings.py     (configuration: robots protocol, UA definition, etc.)

4. response Properties and Methods

Get the response body as a string

content = response.text

Get the response body as bytes

content = response.body

Parse the content of the response with XPath

response.xpath('...')

Extract the data attribute of the matched selector objects

response.xpath('...').extract()

Extract the first item from the selector list

response.xpath('...').extract_first()
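
Put together, inside a spider's parse method these look roughly as follows (the //title XPath is only an illustrative example):

import scrapy

class DemoSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["http://www.baidu.com"]

    def parse(self, response):
        html = response.text                  # page source as a string
        raw = response.body                   # page source as bytes

        # xpath() returns a SelectorList
        title = response.xpath('//title/text()')
        print(title.extract())                # all matched values as a list
        print(title.extract_first())          # the first matched value (or None)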

5. Exercise: Downloading Car Data from Autohome

# create the project
scrapy startproject scrapy_qczj
# create the spider
scrapy genspider bm https://car.autohome.com.cn/price/brand-15.html
# run the spider
scrapy crawl bm

bm.py:

import scrapy

class BmSpider(scrapy.Spider):
    name = "bm"
    allowed_domains = ["https://car.autohome.com.cn/price/brand-15.html"]
    # note: when the page URL ends in .html, do not append a trailing / to it
    start_urls = ["https://car.autohome.com.cn/price/brand-15.html"]

    def parse(self, response):
        bm_imgUrl = response.xpath('//div[@class="list-cont-img"]//img/@src')

        bm_name = response.xpath('//div[@class="list-cont-main"]/div[@class="main-title"]/a/text()')

        bm_price = response.xpath('//div[@class="list-cont-main"]/div[@class="main-lever"]//span[@class="font-arial"]/text()')
        print('=================')

        # print(bm_price.extract())

        # print(len(bm_imgUrl))

        for i in range(len(bm_imgUrl)) :
            url = 'http:' + bm_imgUrl[i].extract()
            name = bm_name[i].extract()
            price = bm_price[i].extract()
            
            with open('D:/study/Python/爬虫/scrapy框架/qczj/qczj/spiders/cars.txt','a',encoding='utf-8') as fp :
                fp.write('网址:' + url + '\n' + '系列:' + name + '\n' +'价格:' + price + '\n\n')

6. yield, Multiple Pipelines, and Paged Downloads: Dangdang Data

6.1 What each file is for

pipelines.py is where the downloaded data gets written out; multiple pipelines can be defined.

items.py defines the data structure of the items to be scraped.
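
A pipeline class has a fixed shape that scrapy expects; the sketch below is a minimal illustration (the class and file names here are placeholders), and the real pipelines for this project appear later in this section. Remember that a pipeline still has to be registered in ITEM_PIPELINES before it runs:

class DemoPipeline:
    def open_spider(self, spider):
        # called once when the spider starts: open files / connections here
        self.fp = open('demo.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # called once for every item the spider yields; must return the item
        self.fp.write(str(item))
        return item

    def close_spider(self, spider):
        # called once when the spider finishes
        self.fp.close()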

6.2 Define the data structure and scrape the data

1. Create the project

scrapy startproject dangdang

2. Create the spider

Run this in the spiders directory:

scrapy genspider dang http://e.dangdang.com/list-AQQG-dd_sale-0-1.html

3. Write the files

items.py:

class DangdangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # the data structure of the items to be scraped
    # cover image
    src = scrapy.Field()

    # title
    name = scrapy.Field()

    # price
    price = scrapy.Field()

dang.py:

Note: every selector object can itself call the xpath method again.

import scrapy

class DangSpider(scrapy.Spider):
    name = "dang" 
    allowed_domains = ["http://e.dangdang.com/list-AQQG-dd_sale-0-1.html"]
    start_urls = ["http://e.dangdang.com/list-AQQG-dd_sale-0-1.html"]

    def parse(self, response):
        print('=====================')
        # every selector object can call .xpath() again
        bookcover = response.xpath('//div[@id="book_list"]//span[@class="bookcover"]')
        bookinfo = response.xpath('//div[@id="book_list"]//div[@class="bookinfo"]')
        for i in range(len(bookcover)) :
                src = bookcover.xpath('./img[not(@class="promotion_label")]/@src')[i].get()
                name = bookinfo.xpath('./div[@class="title"]/text()')[i].get()
                price = bookinfo.xpath('./div[@class="price"]/span/text()')[i].get()
                print(src + '\n' + name + '\n' + price + '\n') 

6.3 Multiple pipelines

dang.py:

import scrapy

class DangSpider(scrapy.Spider):
    name = "dang" 
    allowed_domains = ["http://e.dangdang.com/list-AQQG-dd_sale-0-1.html"]
    start_urls = ["http://e.dangdang.com/list-AQQG-dd_sale-0-1.html"]

    def parse(self, response):
        print('=====================')
        # every selector object can call .xpath() again
        bookcover = response.xpath('//div[@id="book_list"]//span[@class="bookcover"]')
        bookinfo = response.xpath('//div[@id="book_list"]//div[@class="bookinfo"]')
        for i in range(len(bookcover)) :
                src = bookcover.xpath('./img[not(@class="promotion_label")]/@src')[i].get()
                name = bookinfo.xpath('./div[@class="title"]/text()')[i].get()
                price = bookinfo.xpath('./div[@class="price"]/span/text()')[i].get()

                # build an item with the DangdangItem class defined in items.py
                # (the import is placed inline here; normally it goes at the top of the file)
                from dangdang.items import DangdangItem
                book = DangdangItem(src = src,name = name,price = price)

                # hand each book over to the pipelines as soon as it is built
                yield book

4. Write the pipeline file

To use pipelines, they must be enabled in settings.py.

settings.py:

# uncomment this block
ITEM_PIPELINES = {
    # pipelines have a priority between 1 and 1000; the lower the value, the higher the priority
   "dangdang.pipelines.DangdangPipeline": 300,
   "dangdang.pipelines.DangdangImgPipeline": 301
}

pipelines.py:

from itemadapter import ItemAdapter
import urllib.request

# pipelines must be enabled in settings.py before they run
# writes the book info to a file
class DangdangPipeline:
    # open the file once when the spider starts
    def open_spider(self, spider):
        self.fp = open('D:/study/Python/爬虫/scrapy框架/dangdang/dangdang/spiders/book.json','w',encoding='utf-8')

    # item is the book object yielded by the spider
    def process_item(self, item, spider):
        self.fp.write(str(item))
        print('************')
        return item
    # close the file when the spider finishes
    def close_spider(self, spider):
        self.fp.close()

# downloads the cover images with urlretrieve
class DangdangImgPipeline:
    def process_item(self, item, spider):
        url = item.get('src')
        filename = 'D:/study/Python/爬虫/scrapy框架/dangdang/dangdang/BookImg/' + item.get('name') + '.jpg'
        urllib.request.urlretrieve(url=url,filename=filename)
        return item

6.4 Downloading multiple pages

dang.py:

import scrapy
from dangdang.items import DangdangItem


class DangSpider(scrapy.Spider):
    name = "dang" 
    # for multi-page downloads, allowed_domains has to be widened; usually only the domain itself is listed
    allowed_domains = ["e.dangdang.com"]
    start_urls = ["http://e.dangdang.com/list-AQQG-dd_sale-0-1.html"]

    base_url = 'http://e.dangdang.com/list-AQQG-dd_sale-0-'
    page = 1
    def parse(self, response):
        # pipelines: save the data
        # items: define the data structure
        print('=====================')
        # every selector object can call .xpath() again
        bookcover = response.xpath('//div[@id="book_list"]//span[@class="bookcover"]')
        bookinfo = response.xpath('//div[@id="book_list"]//div[@class="bookinfo"]')
        for i in range(len(bookcover)) :
                src = bookcover.xpath('./img[not(@class="promotion_label")]/@src')[i].get()
                name = bookinfo.xpath('./div[@class="title"]/text()')[i].get()
                price = bookinfo.xpath('./div[@class="price"]/span/text()')[i].get()
                # print(src + '\n' + name + '\n' + price + '\n') 

                book = DangdangItem(src = src,name = name,price = price)

                # hand each book over to the pipelines as soon as it is built
                yield book
        

        # multi-page download
        if self.page < 2 :
            self.page = self.page + 1

            url = self.base_url + str(self.page) + '.html'

            # scrapy.Request issues scrapy's GET request
            yield scrapy.Request(url=url,callback=self.parse)

7. CrawlSpider and Link Extractors

7.1 Creating a CrawlSpider project

  1. Create the project: scrapy startproject <project name>

  2. Change into the spiders folder

  3. Create the spider:

scrapy genspider -t crawl <spider file name> <domain to crawl>
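
A quick way to check which links a rule will pick up is the scrapy shell; the allow pattern below is the one used in the exercise that follows:

# in a terminal:
scrapy shell https://www.dushu.com/book/1087_1.html

# then, inside the shell:
from scrapy.linkextractors import LinkExtractor
link = LinkExtractor(allow=r'/book/1087_\d+\.html')
link.extract_links(response)    # the Link objects the rule would follow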

7.2 Exercise: Storing dushu.com Data in a Database with Link Following

items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class DushuwangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    src = scrapy.Field()

dushu.py:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from dushuwang.items import DushuwangItem


class DushuSpider(CrawlSpider):
    name = "dushu"
    allowed_domains = ["www.dushu.com"]
    start_urls = ["https://www.dushu.com/book/1087_1.html"]

    # follow links that match the rule
    rules = (Rule(LinkExtractor(allow=r"/book/1087_\d+\.html"), 
                                callback="parse_item", 
                                follow=True),
                                )

    def parse_item(self, response):
        img_list = response.xpath('//div[@class="bookslist"]/ul//img')
        for img in img_list:
            # extract_first() returns the attribute value itself rather than a one-element list
            name = img.xpath('./@alt').extract_first()
            src = img.xpath('./@data-original').extract_first()

            book = DushuwangItem(name=name,src=src)
            yield book

settings.py:

DB_HOST ='localhost'
DB_POST = 3306
DB_USER = 'root'
DB_PASSWORD = '****'
DB_NAME = 'python'
DB_CHARSET = 'utf8'


# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   "dushuwang.pipelines.DushuwangPipeline": 300,
   "dushuwang.pipelines.MYSQLpiplines":301
}

pipelines.py:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
from scrapy.utils.project import get_project_settings
import pymysql

class DushuwangPipeline:
    def open_spider(self, spider):
        self.fp = open('D:/study/Python/爬虫/scrapy框架/dushuwang/dushuwang/spiders/book.json','w',encoding='utf-8')
        

    def process_item(self, item, spider):
        self.fp.write(str(item))
        return item
    
    def close_spider(self,spider):
        self.fp.close()
        
# writes the items into the MySQL database
class MYSQLpiplines:
    def open_spider(self, spider):
        print("***************")
        settings = get_project_settings()
        self.host = settings['DB_HOST']
        self.port = settings['DB_POST']
        self.user = settings['DB_USER']
        self.password = settings['DB_PASSWORD']
        self.name = settings['DB_NAME']
        self.charset = settings['DB_CHARSET']

        self.conn = pymysql.connect(
            host=self.host,
            port=self.port,
            user=self.user,
            password=self.password,
            db=self.name,
            charset=self.charset
        )
        self.cur = self.conn.cursor()
        
    
    def process_item(self, item, spider):
        # note: the values are interpolated straight into the SQL string; parameterized queries (cur.execute(sql, params)) are the safer choice
        sql = 'insert into book_info(name,src) values("{}","{}")'.format(item['name'],item['src'])
        # execute the SQL
        self.cur.execute(sql)
        # commit
        self.conn.commit()
        return item
    
    def close_spider(self,spider):
        self.cur.close()
        self.conn.close()
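
The pipeline assumes a book_info table already exists in the python database configured above; a minimal sketch of creating it with pymysql (the column sizes are arbitrary):

import pymysql

conn = pymysql.connect(host='localhost', port=3306, user='root',
                       password='****', db='python', charset='utf8')
cur = conn.cursor()
cur.execute(
    'create table if not exists book_info('
    'id int primary key auto_increment,'
    'name varchar(255),'
    'src varchar(255))'
)
conn.commit()
cur.close()
conn.close()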

7.3 Logging and Log Levels

7.3.1 Log levels:

CRITICAL: severe errors

ERROR: ordinary errors

WARNING: warnings

INFO: general information

DEBUG: debugging information

The default log level is DEBUG,

so every message at DEBUG level or above gets printed.

7.3.2 Setting the log level

settings.py:

LOG_LEVEL = 'WARNING'

In practice, rather than changing the log level, the usual approach is to write the logs to a log file instead.

7.3.3 Setting a log file

settings.py:

LOG_FILE = 'logdemo.log'

8. POST Requests in scrapy

A POST request without parameters is meaningless,

so start_urls, and the default parse flow driven by it, is of no use here.

Instead, the start_requests method is overridden to build the POST request yourself,

and the url and data parameters are defined inside it.

8.1 Exercise: Scraping Baidu Translate

fanyipost.py:

import scrapy
import json


class FanyipostSpider(scrapy.Spider):
    name = "fanyipost"
    allowed_domains = ["https://fanyi.baidu.com/sug"]
    # a POST request without parameters is meaningless,
    # so start_urls is of no use here
    # and neither is the default parse method
    # start_urls = ["https://fanyi.baidu.com/langdetect/"]

    # def parse(self, response):
    #     pass

    def start_requests(self):
        url = "https://fanyi.baidu.com/sug/"

        data = {
            'kw':'lover'
        }

        yield scrapy.FormRequest(url=url,formdata = data,callback=self.parse_second)

    def parse_second(self,response):
        content = response.text
        obj = json.loads(content)
        print(obj)

settings.py:

Disable the robots protocol:

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True