2021/6/13 Web Scraping: Seventh Weekly Review

Author: 笔记本IT

Categories: Scrapy, Databases, Web Scraping. Tags: scrapy, redis

Article link: https://blog.csdn.net/httpsssss/article/details/117709338

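Contents

1. Coding summary
2. Countering anti-scraping measures in Scrapy
3. Setting the UA in a downloader middleware
4. Downloading images with Scrapy (two approaches)
5. hashlib usage and hash algorithms
6. CrawlSpider
7. Logging in with Scrapy (pay attention to the method names)
- 7.1 Sending cookies
- 7.2 Sending data with a POST request
8. Redis (a NoSQL database that stores data as key-value pairs)
9. Supplementary notes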

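1. Coding summary

(Before coding)
Analyze the page
Make sure the code is truly understood
(Before running)
Check the indentation
Look for obvious errors
(When a bug appears)
First check the URL and the yield
XPath
settings
Details:
When downloading images with the Images Pipeline, create the images folder yourself
Parsing data

Write the XPath out on scratch paper first
json.loads(response.text)  # converts the JSON string into a dict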
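
As a small illustration of the json.loads step, the sketch below parses a JSON string into a dict and then reads values with ordinary indexing. The body string and its keys are made up for the example; in a spider the string would come from response.text.

```python
import json

# Stand-in for response.text; the payload and its keys are hypothetical.
body = '{"code": 0, "data": {"items": [{"title": "poem A"}, {"title": "poem B"}]}}'

# json.loads turns a JSON string into Python objects (a dict here).
data = json.loads(body)

# After parsing, ordinary dict/list indexing takes the place of XPath.
titles = [entry["title"] for entry in data["data"]["items"]]
print(titles)
```

When an endpoint returns JSON rather than HTML, this is usually simpler than writing XPath against the rendered page.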

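2. Countering anti-scraping measures in Scrapy

UA (chosen at random, from a hand-written list or a random-UA library): goes in headers (a dict)
cookies [spider file, downloader middleware, settings.py] (a dict)
In settings, note:

# Commented out (the default is True) or set to True: the cookies middleware is active and sends
# cookies passed via Request(cookies=...) in the spider or set on request.cookies in a downloader middleware
# Set to False: the cookies middleware is off, so only a raw Cookie header is sent, e.g. the one in DEFAULT_REQUEST_HEADERS in settings.py
COOKIES_ENABLED = True

Proxy IP [set request.meta['proxy'] in a downloader middleware] (meta is a dict)
verify? session?
Note:
cookies are a dict; the proxy is a URL string stored in the request.meta dict

Compared with the requests library:
Parameters: headers [including the UA and cookies] and proxies (both dicts), verify, session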
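
The COOKIES_ENABLED note above can be sketched as a settings.py fragment for the "fixed Cookie header" case. The header values are placeholders, not real ones, and this is only one possible configuration.

```python
# settings.py (fragment): send one fixed Cookie header with every request.

# Turn the cookies middleware off so the raw Cookie header below is sent as-is.
COOKIES_ENABLED = False

DEFAULT_REQUEST_HEADERS = {
    "Accept-Language": "zh-CN,zh;q=0.9",
    # Placeholder; paste the real cookie string copied from the browser.
    "Cookie": "name1=value1; name2=value2",
}
```

With COOKIES_ENABLED left at True, the usual route is instead to pass a dict through Request(cookies=...) or request.cookies.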

3. Setting the UA in a downloader middleware

import random
# Approach 2 below also needs: from fake_useragent import UserAgent


class UserAgentDownloaderMiddleware:
    USER_AGENTS = [
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)"]

    # Approach 1: pick a UA at random from the list above
    def process_request(self, request, spider):
        user_agent = random.choice(self.USER_AGENTS)
        request.headers['User-Agent'] = user_agent

    # Approach 2: let the fake_useragent library generate a random UA
    # def process_request(self, request, spider):
    #     ua = UserAgent()
    #     user_agent = ua.random
    #     # print(user_agent)
    #     request.headers['User-Agent'] = user_agent

To check which UA was actually sent, request http://httpbin.org/user-agent
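
A middleware like the one above only runs once it is enabled in settings.py. A minimal sketch, assuming the project package is called myproject (a hypothetical name; use the real one):

```python
# settings.py (fragment)
# "myproject" is a hypothetical package name, not taken from the original notes.
DOWNLOADER_MIDDLEWARES = {
    # 543 is the priority Scrapy's project template uses for a custom downloader middleware.
    "myproject.middlewares.UserAgentDownloaderMiddleware": 543,
}
```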

4. Downloading images with Scrapy (two approaches)

See lecture 21

5. hashlib usage and hash algorithms

import hashlib
h = hashlib.sha1()
print(h)   # <sha1 HASH object @ 0x0000014F2AB744E0>
h.update('hello'.encode('utf-8'))  # update() takes bytes, so encode the string first
print(h.hexdigest())  # hexdigest() returns the digest as a hexadecimal string
# aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d
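
To round out the hashlib notes, the sketch below shows two details that are easy to miss: feeding the data in several update() calls gives the same digest as hashing it in one go, and other algorithms (sha256 is used here as an example) share the same interface.

```python
import hashlib

text = "hello"

# One-shot: pass the bytes straight to the constructor.
one_shot = hashlib.sha1(text.encode("utf-8")).hexdigest()

# Incremental: feed the same bytes in pieces; the digest is identical.
h = hashlib.sha1()
h.update("hel".encode("utf-8"))
h.update("lo".encode("utf-8"))
incremental = h.hexdigest()

# sha256 exposes the same interface and yields a longer digest (64 hex characters).
sha256_digest = hashlib.sha256(text.encode("utf-8")).hexdigest()

print(one_shot == incremental)
print(len(one_shot), len(sha256_digest))
```

Note that these are one-way hash functions rather than encryption: the original text cannot be recovered from the digest.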

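6. CrawlSpider

Another way to handle pagination (list pages and detail pages)
How to create a CrawlSpider:
scrapy genspider -t crawl xx xx.com

Example:
Requirements: 1) open the index page; 2) follow each link to its detail page and extract the poem title
Code: (D:\python_spider\day22\ancient_poems\ancient_poems\spiders\poems.py)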

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PoemsSpider(CrawlSpider):
    name = 'poems'
    allowed_domains = ['gushiwen.cn', 'gushiwen.org']
    start_urls = ['https://www.gushiwen.cn/default_1.aspx']

    rules = (
        # List pages 1 and 2: follow their links, no callback needed
        Rule(LinkExtractor(allow=r'https://www\.gushiwen\.cn/default_[12]\.aspx'), follow=True),
        # Detail pages: hand each response to parse_item
        Rule(LinkExtractor(allow=r'https://so\.gushiwen\.cn/shiwenv_\w+\.aspx'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Each poem sits in a div.sons block inside div.left
        gsw_divs = response.xpath('//div[@class="left"]/div[@class="sons"]')

        for gsw_div in gsw_divs:
            title = gsw_div.xpath('.//h1/text()').get()
            if title:
                # Yield the title as an item instead of only printing it
                yield {'title': title.strip()}
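
A note on the allow patterns: LinkExtractor treats them as regular expressions, so a character class written as [1,2] also matches a comma, and an unescaped . matches any character. The sketch below compares a loose and a stricter pattern with plain re; the last two URLs are made up purely to show the difference.

```python
import re

loose = r'https://www.gushiwen.cn/default_[1,2].aspx'
strict = r'https://www\.gushiwen\.cn/default_[12]\.aspx'

urls = [
    'https://www.gushiwen.cn/default_1.aspx',
    'https://www.gushiwen.cn/default_2.aspx',
    'https://www.gushiwen.cn/default_,.aspx',   # hypothetical: comma inside the class
    'https://www.gushiwen.cn/default_1xaspx',   # hypothetical: '.' matched as "any character"
]

# Map each URL to (matches loose pattern, matches strict pattern).
results = {
    url: (bool(re.search(loose, url)), bool(re.search(strict, url)))
    for url in urls
}

for url, (loose_hit, strict_hit) in results.items():
    print(url, loose_hit, strict_hit)
```

On a real site the loose pattern rarely causes trouble, but escaping the dots keeps the rule from picking up unintended links.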

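7. Logging in with Scrapy (pay attention to the method names)

7.1 Sending cookies

In the spider file,
inside start_requests():

def start_requests(self):
    # 'aaaaaaa' stands in for the real "k1=v1; k2=v2" cookie string copied from the browser
    cookies = 'aaaaaaa'
    # Split each pair on the first '=' only, so values that contain '=' stay intact
    cookies = {i.split('=', 1)[0]: i.split('=', 1)[1] for i in cookies.split('; ')}
    yield scrapy.Request(
        url=self.start_urls[0],
        callback=self.parse_1,
        cookies=cookies
    )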
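
The cookie-string conversion used here can be pulled out into a small helper. The sketch below is one way to write it; the function name and the sample cookie string are made up for illustration.

```python
def parse_cookie_header(raw: str) -> dict:
    """Turn 'k1=v1; k2=v2' into {'k1': 'v1', 'k2': 'v2'}."""
    cookies = {}
    for pair in raw.split(';'):
        pair = pair.strip()
        if not pair or '=' not in pair:
            continue  # skip empty or malformed fragments
        name, value = pair.split('=', 1)  # split on the first '=' only
        cookies[name] = value
    return cookies


# Hypothetical cookie string; note the value that itself contains '='.
raw = 'sessionid=abc123; token=a=b=c; theme=dark'
print(parse_cookie_header(raw))
```

Splitting on the first '=' matters because cookie values (base64 tokens, for example) often contain '=' themselves.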

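In the downloader middleware,
inside process_request(self, request, spider):

cookies = 'aaaaa'  # placeholder for the real "k1=v1; k2=v2" cookie string
cookies = {i.split('=', 1)[0]: i.split('=', 1)[1] for i in cookies.split('; ')}
request.cookies = cookies

Setting a proxy IP:
inside process_request(self, request, spider):

# A is a list of proxy URLs such as 'http://ip:port'; this needs import random
proxy = random.choice(A)
request.meta['proxy'] = proxy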
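
Put together, the proxy snippet could look like the middleware below. This is a sketch under assumptions: the class name and the proxy addresses are hypothetical, and the stand-in request object exists only so the logic can be tried without Scrapy.

```python
import random
from types import SimpleNamespace


class RandomProxyDownloaderMiddleware:
    # Hypothetical proxy addresses; replace them with working proxies.
    PROXIES = [
        'http://203.0.113.10:8080',
        'http://203.0.113.11:8080',
    ]

    def process_request(self, request, spider):
        # Scrapy reads the proxy URL (a string) from request.meta['proxy'].
        request.meta['proxy'] = random.choice(self.PROXIES)
        # Returning None lets the request continue through the remaining middlewares.
        return None


# Quick check without Scrapy: a stand-in object that only has a meta dict.
fake_request = SimpleNamespace(meta={})
RandomProxyDownloaderMiddleware().process_request(fake_request, spider=None)
print(fake_request.meta['proxy'])
```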

7.2 Sending data with a POST request

Inside parse(self, response):
Approach 1: (scrapy.FormRequest)

# The variables used below (commit, authenticity_token, ...) must be defined earlier in parse()
data = {
    'commit': commit,
    'authenticity_token': authenticity_token,
    'login': login,
    'password': password,
    'webauthn-support': 'supported',
    'webauthn-iuvpaa-support': 'unsupported',
    'timestamp': timestamp,
    'timestamp_secret': timestamp_secret,
}

# Send the data in a POST request; FormRequest issues a POST
yield scrapy.FormRequest(
    # Target URL
    url='https://github.com/session',
    # Data to submit
    formdata=data,
    # Callback that handles the response
    callback=self.after_login)

Approach 2: (scrapy.FormRequest.from_response)

import scrapy


class Github2Spider(scrapy.Spider):
    name = 'github2'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        yield scrapy.FormRequest.from_response(
            # The response that contains the login form (note this argument)
            response=response,
            # Data to submit
            formdata={'login':'LogicJerry','password':'12122121zxl'},
            # Callback that handles the response
            callback=self.after_login
        )

    def after_login(self, response):
        # Save the returned page to a file
        with open('github2.html', 'w', encoding='utf-8') as file_obj:
            file_obj.write(response.text)
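
The second approach needs far fewer fields because FormRequest.from_response pre-fills the request with the values already present in the page's form, hidden inputs included, and then applies the formdata passed in. The sketch below imitates that idea with the standard library only; it is not Scrapy's implementation, and the HTML and credentials are made up.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attrs = dict(attrs)
        if attrs.get('type') == 'hidden' and attrs.get('name'):
            self.fields[attrs['name']] = attrs.get('value') or ''


# Made-up login form; a real page has more fields.
html = '''
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="TOKEN123">
  <input type="hidden" name="timestamp" value="1623550000000">
  <input type="text" name="login">
  <input type="password" name="password">
</form>
'''

collector = HiddenInputCollector()
collector.feed(html)

# Hidden fields from the page first, then the values supplied by the caller.
formdata = {**collector.fields, 'login': 'my_user', 'password': 'my_password'}
print(formdata)
```

With approach 1, by contrast, every one of these fields has to be extracted and placed into the data dict by hand.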

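8. Redis (a NoSQL database that stores data as key-value pairs)

Redis's five major groups of commands (memorize the commonly used ones); see lecture 23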

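9. Supplementary notes

1. Converting copied headers into a dict automatically in PyCharm
Ctrl+R (with Regex enabled)
(.*?): (.*)$
"$1": "$2",
Replace All
2. Scrapy settings explained
#DOWNLOAD_DELAY = 3: how long the downloader waits before fetching the next page from the same site. Use it to throttle the crawl and reduce load on the server. Decimals are supported, e.g. 0.25; the unit is seconds
3. What a Python light application is
4. How cookies differ between GET and POST requests
 Simulating login with cookies and POST
5. Rule

Learn cmd commands when there is time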
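
The PyCharm find-and-replace trick in item 1 can also be done in Python itself. The sketch below applies a pattern of the form (.*?): (.*)$ to header lines copied from the browser; the header values are made up for illustration.

```python
import re

# Header lines as copied from the browser's developer tools (values are made up).
raw_headers = """accept: text/html
accept-language: zh-CN,zh;q=0.9
referer: https://www.gushiwen.cn/
user-agent: Mozilla/5.0"""

headers = {}
for line in raw_headers.splitlines():
    # Group 1 is the header name (up to the first ': '), group 2 is the rest of the line.
    match = re.match(r'(.*?): (.*)$', line)
    if match:
        headers[match.group(1)] = match.group(2)

print(headers)
```

The non-greedy first group stops at the first ": ", so values that contain colons, such as URLs, stay intact.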
