Scrapy Study Notes (Continuously Updated)

Latest recommended article published on 2021-09-06 17:41:15

冬兰

Categories: python, crawlers

Article link: https://blog.csdn.net/quanshui_dd/article/details/104131650

I. Scrapy Shell

Basic usage

1. In a terminal, run the scrapy shell command.
2. At the shell prompt, call fetch(url) to download a page.

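Locating elements

1. Type response to inspect the response (a response must exist before any element can be located).
2. Use response.css to check whether the target element can be selected.
To get an attribute: response.css("a h3::attr(href)").get() or .getall()
To get text: use ::text
3. For details, see the official documentation: https://docs.scrapy.org/en/latest/topics/shell.html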

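Inspecting the URLs that would be followed

Method 1:
scrapy shell
fetch(your_url)

Method 2:
scrapy shell your_url

Then, inside the shell:
from scrapy.linkextractors import LinkExtractor
# allow is a regular expression that filters the extracted URLs
myurl = LinkExtractor(allow=r"com/catalogue/.*?/index.html")
# extract_links returns a list of Link objects; the extractor itself has no length or index
links = myurl.extract_links(response)
len(links)
links[0]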
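
A rough way to preview which links an allow pattern would keep, without starting Scrapy, is to run the same regular expression over a few URLs with re.search. This is only a sketch: it assumes the pattern is matched anywhere in the absolute URL, and the URLs below are invented for illustration.

```python
import re

allow = re.compile(r"com/catalogue/.*?/index.html")

# Hypothetical URLs of the kind a listing page might link to.
candidate_urls = [
    "http://example.com/catalogue/some-book_1/index.html",
    "http://example.com/catalogue/page-2.html",
    "http://example.com/index.html",
]

# Keep only the URLs in which the pattern is found.
kept = [url for url in candidate_urls if allow.search(url)]
print(kept)
```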

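II. Scrapy - CrawlSpider

1. Create the crawler project

In a terminal, run:
scrapy startproject <project_name>
cd <project_name>  (the directory created by the previous command)
Basic spider:
scrapy genspider <spider_name> <domain_to_crawl>

Crawl spider:
scrapy genspider -t crawl <spider_name> <domain_to_crawl>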

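2. Complete the spider in PyCharm

With the project created, the next step is to fill in the code in PyCharm.

For a crawl spider:

1) Define the items, i.e. decide which fields to scrape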

name = scrapy.Field()

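2) Define the pipelines, i.e. decide how scraped data is processed: deduplication, handling missing values, saving, and so on.
Note: avoid reading item data as item['xxx'] (a KeyError is raised if xxx has not been set).
Prefer item.get('xxx'), which returns None instead.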

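from scrapy.exceptions import DropItem
from my_02_books.items import My02BooksItem

# exchange_rate must be defined elsewhere (its value is not given in these notes)

class PriceToRMBPipeline:
    def process_item(self, item, spider):
        # Only book items are converted; any other item passes through unchanged.
        if isinstance(item, My02BooksItem):
            bookprice = item.get("bookprice")
            if bookprice:
                # Convert the price and store it as a string with 4 decimal places.
                temp = float(bookprice) * float(exchange_rate)
                item["bookprice"] = '%.4f' % temp
                return item
            else:
                # Discard book items that have no price.
                raise DropItem("missing bookprice")
        else:
            return item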

3) Write the spider itself.
(1) Define the rules

rules = (
        Rule(LinkExtractor(allow=r'com/catalogue/page-\d+.html'), callback='parse_item', follow=True),
        Rule(LinkExtractor(allow=r'com/catalogue/.*?/index.html'),callback="detail_item",follow=False)
    )

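(2) Complete parse_item and any other callback functions:
parse the response and locate the elements,
populate the item (the item class must be imported from items),
and finally yield the item.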

4) Enable the pipelines in settings and set their order (pipelines with lower numbers run first)

ITEM_PIPELINES = {
   'my_02_books.pipelines.PriceToRMBPipeline': 300,
   'my_02_books.pipelines.SaveToFilePipeline': 400,
}
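
These notes do not show SaveToFilePipeline itself. One possible shape, assuming the items should be appended to a JSON Lines file (the file name and the constructor argument are hypothetical), relies on the open_spider and close_spider hooks that Scrapy calls on a pipeline:

```python
import json
import os
import tempfile


class SaveToFilePipeline:
    def __init__(self, path="books.jl"):
        self.path = path      # hypothetical output file
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "a", encoding="utf-8")

    def process_item(self, item, spider):
        # dict(item) works for both Scrapy items and plain dicts.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()


# Offline check with a plain dict in place of a real item.
out_path = os.path.join(tempfile.mkdtemp(), "books.jl")
pipeline = SaveToFilePipeline(out_path)
pipeline.open_spider(spider=None)
pipeline.process_item({"bookprice": "51.7700"}, spider=None)
pipeline.close_spider(spider=None)

with open(out_path, encoding="utf-8") as f:
    saved = [json.loads(line) for line in f]
print(saved)
```

Because this pipeline is registered at 400, it receives items only after PriceToRMBPipeline (300) has converted or dropped them.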

5) Run the spider from PyCharm's terminal

scrapy crawl <spider_name>

Then wait for the run to finish.

III. Practical Tips

1. '%.4f' % temp fixes the number of decimal places

item["bookprice"] = '%.4f' % temp
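
A quick check of what this formatting produces, using an arbitrary example value. Note that the result is a string, not a float; the f-string form is shown only as an equivalent alternative.

```python
temp = 3.14159265          # arbitrary example value
old_style = '%.4f' % temp  # the form used in the pipeline
new_style = f"{temp:.4f}"  # equivalent f-string
print(old_style, new_style, type(old_style).__name__)
```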

2. isinstance(item, xxxItem) checks which item class an item belongs to

if isinstance(item, My02BooksItem):
    bookprice = item.get("bookprice")
    if bookprice:
        temp = float(bookprice) * float(exchange_rate)
        item["bookprice"] = '%.4f' % temp
        return item
    else:
        raise DropItem
else:
    return item

3. Raising the DropItem exception

1) Import it first

from scrapy.exceptions import DropItem

2) Raise the exception

raise DropItem

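IV. Usage Notes

1. The order of scraped content may differ from the order on the page

Pages are downloaded and parsed at different speeds, and whichever response arrives first is processed first, so the order cannot be guaranteed.
If the order must be preserved, set CONCURRENT_REQUESTS in settings (shown as CONCURRENT_REQUESTS = 32 in the generated file) to 1, but doing so gives up the main advantage of the Scrapy framework.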
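
The effect can be reproduced in miniature with asyncio. This is only an analogy, not how Scrapy is implemented: asyncio.sleep stands in for network latency, and the page numbers and delays are invented.

```python
import asyncio


async def fake_download(page, delay, finished):
    # asyncio.sleep stands in for the time a real request would take.
    await asyncio.sleep(delay)
    finished.append(page)


async def crawl(pages):
    finished = []
    # All downloads run concurrently; each records itself when it completes.
    await asyncio.gather(*(fake_download(p, d, finished) for p, d in pages))
    return finished


# Requested in the order 1, 2, 3, but page 1 is the slowest to respond.
pages = [(1, 0.3), (2, 0.1), (3, 0.2)]
completion_order = asyncio.run(crawl(pages))
print(completion_order)
```

The pages finish in order of their delays rather than in the order they were requested, which is why items reach the pipelines in an unpredictable order when requests run concurrently.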

2. Crawlers suit I/O-bound tasks, not CPU-bound tasks

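V. pandas

from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

# url and headers must be defined before this point
res = requests.get(url, headers=headers)
bs = BeautifulSoup(res.content, "lxml")
table = bs.find("table").prettify()

# Parse the table; read_html returns a list of DataFrames
df = pd.read_html(StringIO(table), header=0)
# Append the first table to a CSV file, without a header row, encoded as GBK
df[0].to_csv("my_result.csv", header=False, encoding="gbk", mode="a")

# A column can be added with dataframe["city"] = ...; see the pandas documentation for details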
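
Adding a column by assignment can be sketched on a small hand-made frame. The column names and values here are invented and simply stand in for df[0] from read_html.

```python
import pandas as pd

# Hand-made frame standing in for df[0] from read_html.
table = pd.DataFrame({"date": ["2020-01-01", "2020-01-02"], "temp": [5, 7]})

# Assigning to a new label adds a column; a scalar value is repeated for every row.
table["city"] = "Beijing"
print(table.columns.tolist())
```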