A First Look at the Scrapy Framework

Flora_xuan1993

Category: python-scrapy. Tags: python, scrapy

Article link: https://blog.csdn.net/Flora_xuan1993/article/details/78729057

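I have a poor memory and keep forgetting how to write the Scrapy code I have already written once,
so after starting from scratch for the second time, I decided to write this post as a record.
It is aimed at absolute beginners.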

First of all, this post skips installing Python and Scrapy and goes straight to creating a new project.

scrapy startproject SCF       # SCF is the name of the new project

Apart from the spider file you write yourself, Scrapy generates everything below for you.

# Basic structure of a Scrapy project
-- SCF
      -- spiders
            -- __init__.py
            -- func_spider.py      # your own spider; the file name is up to you
      -- __init__.py
      -- items.py                 # defines the items
      -- middlewares.py
      -- pipelines.py             # processes the scraped items
      -- settings.py              # configuration file

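items.py

Python is an object-oriented language, so the idea is simple: whatever you want to scrape is an item, and items.py is where you define the fields of that item. Each class is one type of item.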

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class ScfItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    function = scrapy.Field()
    includes = scrapy.Field()

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json

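class ScfPipeline(object):
    def __init__(self):
        # Open in text mode: json.dumps() returns str, which a binary-mode file rejects on Python 3
        self.file = open('StandardC.txt', 'w', encoding='utf-8')

    # process_item() is where each item is actually handled; here it is written to the file as one JSON object per line
    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

    # close_spider() runs when the spider closes, usually to release resources
    def close_spider(self, spider):
        self.file.close()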
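
The line format that process_item writes, one JSON object per line, can be tried without running a crawl. The sketch below is only an illustration: it uses an in-memory buffer instead of StandardC.txt, and write_item, read_items and the sample values are made-up names and data, not part of the project.

```python
import io
import json


def write_item(stream, item):
    """Serialize one item as a JSON object followed by a newline."""
    stream.write(json.dumps(dict(item)) + "\n")


def read_items(stream):
    """Parse a JSON-lines stream back into a list of dicts."""
    return [json.loads(line) for line in stream if line.strip()]


# Sample values are made up for illustration.
buffer = io.StringIO()
write_item(buffer, {"function": "abs", "includes": "stdlib.h"})
write_item(buffer, {"function": "printf", "includes": "stdio.h"})

# Rewind and read the same lines back.
buffer.seek(0)
items = read_items(buffer)
```

Because every line is a complete JSON document, the output file can be processed line by line later without loading it all at once.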

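A pipeline has to be enabled in the settings file. The number is its priority: pipelines with smaller numbers run first, which means an item can be processed by several pipelines in sequence.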

# settings.py

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html

ITEM_PIPELINES = {
   'SCF.pipelines.ScfPipeline': 1,  # smaller numbers run first
}
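
To make the ordering rule concrete, here is a small plain-Python simulation. It is not how Scrapy is implemented internally; it only shows the effect of sorting by the priority number. The two pipeline paths, their priorities, and the stand-in functions are all hypothetical.

```python
# Hypothetical pipeline paths and priorities, in the same shape as ITEM_PIPELINES.
ITEM_PIPELINES = {
    "SCF.pipelines.ExportPipeline": 300,
    "SCF.pipelines.CleanPipeline": 100,
}


def strip_whitespace(item):
    """Stand-in for a cleaning pipeline: trim every string value."""
    return {key: value.strip() for key, value in item.items()}


def mark_exported(item):
    """Stand-in for an export pipeline: flag the item as handled."""
    return {**item, "exported": True}


# Plain functions standing in for the pipeline classes named above.
STAGES = {
    "SCF.pipelines.CleanPipeline": strip_whitespace,
    "SCF.pipelines.ExportPipeline": mark_exported,
}


def run_pipelines(item, priorities, stages):
    """Pass the item through every stage, smallest priority number first."""
    order = sorted(priorities, key=priorities.get)
    for path in order:
        item = stages[path](item)
    return order, item


order, result = run_pipelines({"function": " abs "}, ITEM_PIPELINES, STAGES)
```

Each stage receives what the previous one returned, which is why process_item should return the item.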

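The first spider: func_spider.py

For no particularly exciting reason, I wanted to see which header file each standard C library function belongs to,
so I decided to scrape that from the man pages.
Let's begin.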

import re

import scrapy

from SCF.items import ScfItem        # import the item (model) defined earlier

# Matches lines such as "#include <stdlib.h>" and captures the header name
INCLUDE_RE = re.compile(r"#include\s+<(\w+[/\w+]*\.h)>")


class FuncSpider(scrapy.Spider):
    name = 'function'                # spider name, used when running the spider
    allowed_domains = ['man7.org']   # domains the spider is allowed to crawl
    start_urls = ['http://man7.org/linux/man-pages/dir_section_3.html']  # where the crawl starts

    # parse() is the default callback for the responses to start_urls
    def parse(self, response):
        # Selectors, XPath and similar tools locate elements in the page
        hrefs = response.xpath("//table[1]//td[@valign='top']/a/@href").extract()
        for href in hrefs:
            # Resolve the relative link against the current page URL
            url = response.urljoin(href)
            # Crawl the second-level page; callback names the method that handles it
            yield scrapy.Request(url=url, callback=self.parse_url)

    # Callback for each second-level (per-function) page
    def parse_url(self, response):
        item = ScfItem()            # create the item
        lines = response.xpath("//pre//text()").extract()
        res = []
        for line in lines:
            res.extend(INCLUDE_RE.findall(line))
        res = sorted(set(res))      # drop duplicates and keep a stable order
        item['function'] = response.url.split("/")[-1].split(".")[0]
        item['includes'] = " ".join(res)

        yield item                  # hand the item over to the pipelines

This post is not about selector and XPath syntax, so I won't go into that here.
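
The plain-Python steps of the spider (turning a relative link into an absolute URL, deriving the function name from the URL, and pulling header names out of the page text) can be tried offline. This is a rough sketch under assumptions: the sample href and the sample text are invented, the helper names are hypothetical, and the standard library's urljoin stands in for building the URL by string concatenation.

```python
import re
from urllib.parse import urljoin

# Same pattern the spider uses to capture header names.
INCLUDE_RE = re.compile(r"#include\s+<(\w+[/\w+]*\.h)>")


def absolute_url(page_url, href):
    """Resolve a relative href against the page it was found on."""
    return urljoin(page_url, href)


def function_name(url):
    """Take the last path segment and cut it at the first dot."""
    return url.split("/")[-1].split(".")[0]


def headers_in(text):
    """Return the distinct header names found in text, sorted."""
    return sorted(set(INCLUDE_RE.findall(text)))


index_url = "http://man7.org/linux/man-pages/dir_section_3.html"
# Assumed href format; the real page may use a different form.
page_url = absolute_url(index_url, "./man3/abs.3.html")

# Invented sample of what a SYNOPSIS section might contain.
sample = "#include <stdlib.h>\n#include <sys/types.h>\n#include <stdlib.h>\n"
name = function_name(page_url)
headers = headers_in(sample)
```

Resolving links with urljoin copes with hrefs that do not start with "./", which plain concatenation would not.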

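Running the spider

Scrapy runs spiders from the command line:

scrapy crawl function        # "function" is the spider name

But how do you run a spider when you want to debug it in an IDE?
Scrapy provides a function that helps with this.

Create a new script, debug.py: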

#!/usr/bin/python

from scrapy.cmdline import execute

execute()            # runs the scrapy command line; with no argument it reads sys.argv

In the IDE's debug configuration, pass crawl function as the script arguments, and you can then use the IDE's debugging features.
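
As an alternative to setting script arguments in the IDE, the arguments can be handed to execute() directly, since it accepts an argv list and falls back to sys.argv when none is given. A minimal sketch, where build_argv and main are hypothetical helper names and "function" is the spider name from this post:

```python
def build_argv(spider_name):
    """Return the argument list equivalent to `scrapy crawl <spider_name>`."""
    return ["scrapy", "crawl", spider_name]


def main():
    # Imported here so build_argv can be used without Scrapy installed.
    from scrapy.cmdline import execute

    # Equivalent to typing `scrapy crawl function` in a terminal.
    execute(build_argv("function"))
```

Calling main() at the end of debug.py and running the script from the project root (the directory that contains scrapy.cfg) should then behave like the command-line invocation.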