1. CrawlSpider
Introduction:
In the code we wrote earlier, a lot of time went into hunting for the next-page URL or the detail-page URL. Can that step be made simpler?
Definition:
CrawlSpider is another way of crawling data with Scrapy.
Learning goal:
Understand how to use CrawlSpider.
CrawlSpider is a subclass of the Spider crawler class.
Its key feature:
It extracts links according to the rules and sends them on to the engine.
How to create a CrawlSpider:
scrapy genspider -t crawl xx xx.com
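For the case below, the spider would have been generated with something like this (spider name and domain taken from the example code):

scrapy genspider -t crawl poems gushiwen.cn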
CrawlSpider is quite convenient in some scenarios, provided that the URL pattern is easy to express with a regular expression.
The regular expression must be written correctly; a quick way to sanity-check it is shown below.
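A minimal sketch for checking the allow pattern before putting it into a Rule, using Python's re module (the test URLs here are made-up examples):

import re

pattern = r'https://so.gushiwen.cn/shiwenv_\w+\.aspx'
test_urls = [
    'https://so.gushiwen.cn/shiwenv_45c396367f59.aspx',  # detail page: should match
    'https://www.gushiwen.cn/default_2.aspx',            # list page: should not match
]
for url in test_urls:
    print(url, bool(re.search(pattern, url)))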
Case study:
Requirements: 1) open the home page 2) open each detail page and get the poem title
Code: (D:\python_spider\day22\ancient_poems\ancient_poems\spiders\poems.py)
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PoemsSpider(CrawlSpider):
    name = 'poems'
    allowed_domains = ['gushiwen.cn', 'gushiwen.org']
    start_urls = ['https://www.gushiwen.cn/default_1.aspx']

    rules = (
        # list pages: just follow them, no callback needed
        Rule(LinkExtractor(allow=r'https://www.gushiwen.cn/default_[12]\.aspx'), follow=True),
        # detail pages: hand each one to parse_item
        Rule(LinkExtractor(allow=r'https://so.gushiwen.cn/shiwenv_\w+\.aspx'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # each poem block sits in div.left > div.sons; the title is in the <h1>
        gsw_divs = response.xpath('//div[@class="left"]/div[@class="sons"]')
        for gsw_div in gsw_divs:
            title = gsw_div.xpath('.//h1/text()').get()
            print(title)
            yield {'title': title}
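The spider is run from the project directory as usual; the spider name comes from the name attribute above:

scrapy crawl poems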
Further reading:
Understanding the Rule() and LinkExtractor() functions in Scrapy
Using the CrawlSpider class in Python crawlers
Basic usage of LinkExtractor
2. Implementing login in Scrapy
1 Send the request to the target URL carrying a cookie
2 Send a POST request to the target URL carrying the form data (account and password)
3 Simulate the login with Selenium (find the input tags, switch the login mode if needed, locate the username and password boxes, locate the button); a sketch is given below
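A minimal sketch of option 3 (not from the course code): log in with Selenium and collect the cookies for Scrapy to reuse. The element IDs below are hypothetical; a real page (Qzone included) needs its own selectors and often an iframe switch.

from selenium import webdriver
from selenium.webdriver.common.by import By


def login_and_get_cookies(login_url, username, password):
    driver = webdriver.Chrome()
    driver.get(login_url)
    # switch to the account/password form if the page defaults to QR-code login
    # driver.find_element(By.ID, 'switcher_plogin').click()
    driver.find_element(By.ID, 'username').send_keys(username)   # username input box (hypothetical ID)
    driver.find_element(By.ID, 'password').send_keys(password)   # password input box (hypothetical ID)
    driver.find_element(By.ID, 'login_button').click()           # login button (hypothetical ID)
    cookies = {c['name']: c['value'] for c in driver.get_cookies()}
    driver.quit()
    return cookies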
2.1 The cookie approach
2.1.1 In the spider file
First way
Target URL:
https://user.qzone.qq.com/your_QQ_number
Summary:
1. By reading the source code we found the start_requests() method: it can carry the cookies directly when the request for start_urls is sent, so the resulting response already carries the cookie.
2. In Scrapy, cookies must be in key-value (dict) format (a caveat about this one-liner is noted below):
cookies = {i.split('=')[0]: i.split('=')[1] for i in cookies.split('; ')}
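One caveat: a value that itself contains '=' (for example pgv_info=ssid=... in the cookie string used below) gets truncated, because split('=') cuts at every '='. Splitting on the first '=' only is safer. A small standalone check (the sample string is made up):

# sample cookie string (made up); in practice it is copied from the browser
cookie_str = 'uin=o123456; skey=@AbCdEf; pgv_info=ssid=s703'
cookies = {kv.split('=', 1)[0]: kv.split('=', 1)[1] for kv in cookie_str.split('; ')}
print(cookies)   # {'uin': 'o123456', 'skey': '@AbCdEf', 'pgv_info': 'ssid=s703'}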
Code: (D:\python_spider\day22\qq_zone\qq_zone\spiders\qq.py)
import scrapy


class QqSpider(scrapy.Spider):
    name = 'qq'
    allowed_domains = ['qq.com']
    start_urls = ['https://user.qzone.qq.com/your_QQ_number']

    def start_requests(self):
        # cookie string copied from the browser after logging in manually
        cookies = 'pgv_pvi=350465024; RK=n8qgPcxyTa; ptcz=c219dcd40cf2d30521a04833cdc036c2162182b9168f3f0886df3339dc6df90a; eas_sid=R185f9z9k6F4F9k2h8g8k938y9; pgv_pvid=8981388016; o_cookie=2023203294; pac_uid=1_2023203294; iip=0; LW_sid=s1K6x1T8D863J7z7S5d4U239z5; LW_uid=u1t6D1X8e8R3N7f7S5a432D9W6; tvfe_boss_uuid=6c52faf80d8c4e13; qz_screen=1536x864; QZ_FE_WEBP_SUPPORT=1; __Q_w_s__QZN_TodoMsgCnt=1; luin=o2023203294; lskey=0001000008d7bbb8a7b813d5133af48265d02b770a5f00e5b307d085e61c105c86e592d006ed430f90301efe; Loading=Yes; cpu_performance_v8=16; _qpsvr_localtk=0.7949891033531788; uin=o2023203294; skey=@IpSNpAXPA; p_uin=o2023203294; pt4_token=5YcSMdjZBJlAcnbjIsTSDRVIWqFdDYKPbMOgBRGXTpY_; p_skey=5pkKMqWMVgegrtwhibgQQme4rxPJVH6V4J2vXSRwq6Y_; 2023203294_todaycount=0; 2023203294_totalcount=5222; pgv_info=ssid=s7030297087'
        # convert the "k=v; k=v" string into the dict format Scrapy expects
        cookies = {i.split('=')[0]: i.split('=')[1] for i in cookies.split('; ')}
        print(cookies)
        yield scrapy.Request(
            url=self.start_urls[0],
            callback=self.parse_1,
            cookies=cookies
        )

    def parse_1(self, response):
        # save the page so it can be opened in a browser to check whether we are logged in
        with open('qzone.html', 'w', encoding='utf-8') as file_obj:
            file_obj.write(response.text)
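To check the result, run the spider and open the saved file in a browser; if the cookie string is still valid, qzone.html shows the logged-in Qzone page:

scrapy crawl qq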
2.1.2 In the downloader middleware
(D:\python_spider\day22\qq_zone\qq_zone\middlewares.py)
Every parameter of Request() is also available as an attribute of the request object.
The parameters of Request() include: url, callback, meta, headers, cookies.
Of these, headers, cookies and meta are dicts.
def process_request(self, request, spider):
    # Called for each request that goes through the downloader
    # middleware.

    # Must either:
    # - return None: continue processing this request
    # - or return a Response object
    # - or return a Request object
    # - or raise IgnoreRequest: process_exception() methods of
    #   installed downloader middleware will be called

    # set the cookies on every outgoing request
    cookies = 'pgv_pvi=350465024; RK=n8qgPcxyTa; ptcz=c219dcd40cf2d30521a04833cdc036c2162182b9168f3f0886df3339dc6df90a; eas_sid=R185f9z9k6F4F9k2h8g8k938y9; pgv_pvid=8981388016; o_cookie=2023203294; pac_uid=1_2023203294; iip=0; LW_sid=s1K6x1T8D863J7z7S5d4U239z5; LW_uid=u1t6D1X8e8R3N7f7S5a432D9W6; tvfe_boss_uuid=6c52faf80d8c4e13; qz_screen=1536x864; QZ_FE_WEBP_SUPPORT=1; __Q_w_s__QZN_TodoMsgCnt=1; luin=o2023203294; lskey=0001000008d7bbb8a7b813d5133af48265d02b770a5f00e5b307d085e61c105c86e592d006ed430f90301efe; Loading=Yes; cpu_performance_v8=16; _qpsvr_localtk=0.7949891033531788; uin=o2023203294; skey=@IpSNpAXPA; p_uin=o2023203294; pt4_token=5YcSMdjZBJlAcnbjIsTSDRVIWqFdDYKPbMOgBRGXTpY_; p_skey=5pkKMqWMVgegrtwhibgQQme4rxPJVH6V4J2vXSRwq6Y_; 2023203294_todaycount=0; 2023203294_totalcount=5222; pgv_info=ssid=s7030297087'
    cookies = {i.split('=')[0]: i.split('=')[1] for i in cookies.split('; ')}
    print(cookies)
    request.cookies = cookies
    return None
What needs editing in settings.py:
# COOKIES_ENABLED left commented out: the cookies passed in the spider file (Request(cookies=...)) are used (the default is True)
# COOKIES_ENABLED = False: the cookies middleware is disabled, so only the Cookie header in DEFAULT_REQUEST_HEADERS in settings.py is sent
# COOKIES_ENABLED = True: cookies set in the downloader middleware take effect
COOKIES_ENABLED = True
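For the COOKIES_ENABLED = False case, the cookie is carried as a plain header instead; a sketch of the relevant settings.py entry (header values are placeholders):

# settings.py (sketch; values are placeholders)
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 ...',
    'Cookie': 'uin=o123456; skey=@AbCdEf',
}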
Extra: setting a proxy IP
In the downloader middleware: (D:\python_spider\day22\qq_zone\qq_zone\middlewares.py)
# at the top of middlewares.py
import random

# placeholder proxy pool; replace with real proxy addresses
PROXIES = ['http://1.2.3.4:8888', 'http://5.6.7.8:8888']

# inside the downloader middleware class
def process_request(self, request, spider):
    # same hook as above; returning None lets processing continue

    # set a proxy IP via request.meta; note the key must be 'proxy'
    # (the original notes mark this part as pseudocode, so the pool above is illustrative)
    proxy = random.choice(PROXIES)
    request.meta['proxy'] = proxy
    return None
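Neither the cookie-setting nor the proxy-setting middleware takes effect unless the middleware class is enabled in settings.py. A minimal sketch, assuming the class name Scrapy generated for this project (adjust if yours differs):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'qq_zone.middlewares.QqZoneDownloaderMiddleware': 543,
}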
2.2 POST requests
Standard approach:
The key is scrapy.FormRequest
import scrapy


class GithubSpider(scrapy.Spider):
    name = 'github'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # read the hidden form fields off the login page
        commit = 'Sign in'
        authenticity_token = response.xpath("//input[@name='authenticity_token']/@value").extract_first()
        login = 'LogicJerry'
        password = '12122121zxl'
        timestamp = response.xpath("//input[@name='timestamp']/@value").extract_first()
        timestamp_secret = response.xpath("//input[@name='timestamp_secret']/@value").extract_first()
        data = {
            'commit': commit,
            'authenticity_token': authenticity_token,
            'login': login,
            'password': password,
            'webauthn-support': 'supported',
            'webauthn-iuvpaa-support': 'unsupported',
            'timestamp': timestamp,
            'timestamp_secret': timestamp_secret,
        }
        # send the form data as a POST request; FormRequest issues a POST
        yield scrapy.FormRequest(
            # target URL
            url='https://github.com/session',
            # data to submit
            formdata=data,
            # callback for the response
            callback=self.after_login
        )

    def after_login(self, response):
        # save the page to a file
        with open('github.html', 'w', encoding='utf-8') as file_obj:
            file_obj.write(response.text)
Simpler approach:
The key is scrapy.FormRequest.from_response, which reads the <form> on the login page and fills in the hidden fields (authenticity_token, timestamp, and so on) automatically, so only login and password need to be supplied.
import scrapy


class Github2Spider(scrapy.Spider):
    name = 'github2'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        yield scrapy.FormRequest.from_response(
            # the response whose login form will be submitted
            response=response,  # note: the login-page response itself is passed in
            # only the fields we fill in ourselves
            formdata={'login': 'LogicJerry', 'password': '12122121zxl'},
            # callback
            callback=self.after_login
        )

    def after_login(self, response):
        # save the page to a file
        with open('github2.html', 'w', encoding='utf-8') as file_obj:
            file_obj.write(response.text)
Summary:
Pay attention to the method names used above: start_requests(), process_request(), scrapy.FormRequest and scrapy.FormRequest.from_response.