Scrapy Basics (Part 1)

Latest recommended article published on 2022-03-03 21:53:35

thginWalker

Categories: Web Crawlers # Scrapy  Tags: scrapy

Article link: https://blog.csdn.net/XZ2585458279/article/details/79338014

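Preface

With some free time on my hands, I started tinkering with Python web crawlers and found Scrapy easy to get started with. I am not yet very familiar with XPath and CSS selectors, but a simple crawler is still quite easy to write.

Extensions

scrapyd can deploy multiple Scrapy spiders. It lets you view the currently running jobs in a web page, and it can also start new crawl jobs and stop running ones, which makes it fairly powerful.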
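
Starting and stopping jobs in scrapyd is done through its HTTP JSON API (the `schedule.json` and `cancel.json` endpoints). The sketch below only builds the requests and does not send them. The address `http://localhost:6800`, the project name `tutorial`, and the helper function names are assumptions made for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: scrapyd is running locally on its default port.
SCRAPYD_URL = "http://localhost:6800"


def schedule_request(project, spider):
    """Build (but do not send) a POST request that starts a spider run."""
    data = urlencode({"project": project, "spider": spider}).encode("utf-8")
    return Request(f"{SCRAPYD_URL}/schedule.json", data=data, method="POST")


def cancel_request(project, job_id):
    """Build (but do not send) a POST request that stops a job."""
    data = urlencode({"project": project, "job": job_id}).encode("utf-8")
    return Request(f"{SCRAPYD_URL}/cancel.json", data=data, method="POST")


# Hypothetical project and spider names.
req = schedule_request("tutorial", "dmoz")
print(req.get_method(), req.full_url, req.data)
```

With a scrapyd instance actually running, passing `req` to `urllib.request.urlopen` would submit the job; the response is a JSON document describing the result.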

Installation

pip install Scrapy

Basics

Debugging with the shell

scrapy shell "<URL>"

Note: the shell is convenient for interactive experimentation and practice.

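Creating a project

scrapy startproject <project_name>

cd <project_name>

Spider template

scrapy genspider <spider_name> <allowed_domain>

Modifying the spider
Adapt the generated spider to your needs:

In items.py, define the item fields to be returned, as required.
Write the spider body.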

Running the spider

scrapy crawl <spider_name> -o <output_file>

Note: the output file can be JSON, CSV, or XML.

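Source examples

A first look at an example spider
The following code follows the links to the top-voted questions on StackOverflow and scrapes some data from each one:
stackoverflow_spider.py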

import scrapy

class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['http://stackoverflow.com/questions?sort=votes']

    def parse(self, response):
        # Follow the link of every question on the listing page.
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        # extract_first() returns None instead of raising IndexError
        # when a selector matches nothing.
        yield {
            'title': response.css('h1 a::text').extract_first(),
            'votes': response.css('.question .vote-count-post::text').extract_first(),
            'body': response.css('.question .post-text').extract_first(),
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }

Run:

scrapy runspider stackoverflow_spider.py -o top-stackoverflow-questions.json

Following links

import scrapy

from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/",
    ]

    def parse(self, response):
        for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
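
`response.urljoin` takes only the link to resolve and uses the URL of the current page as the base, in essentially the same way as the standard library's `urllib.parse.urljoin`. The sketch below uses the standard-library function to show how different kinds of href resolve; the `resolve` helper and the sample hrefs are made up for illustration.

```python
from urllib.parse import urljoin

# The start URL from the spider above, playing the role of response.url.
BASE = "http://www.dmoz.org/Computers/Programming/Languages/Python/"


def resolve(href, base=BASE):
    """Turn an href found on the page into an absolute URL."""
    return urljoin(base, href)


relative = resolve("Books/")                    # relative to the current directory
rooted = resolve("/Computers/")                 # relative to the site root
absolute = resolve("http://example.com/other")  # already absolute, left unchanged

print(relative)
print(rooted)
print(absolute)
```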

Tips

A common pattern is for a callback to extract some items, look for a link to the next page to follow, and then yield a Request that uses the same callback:

def parse_articles_follow_next_page(self, response):
    for article in response.xpath("//article"):
        item = ArticleItem()

        # ... extract article data here

        yield item

    next_page = response.css("ul.navigation > li.next-page > a::attr('href')")
    if next_page:
        url = response.urljoin(next_page[0].extract())
        yield scrapy.Request(url, self.parse_articles_follow_next_page)

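The code above creates a loop that follows every next-page link until none is found, which is very effective for crawling blogs, forums, and other paginated sites.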