Environment setup
- Python 3.7
- Scrapy
Spider development
Note: Scrapy development works from a plain Windows cmd prompt or from Git Bash; after writing the spider logic, run it with the scrapy command.
- Create the project
scrapy startproject tutorial
- Under the spiders directory, add quotes.py
```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
```
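The filename in parse comes from slicing the page URL; that slicing can be checked on its own in plain Python, no Scrapy needed:

```python
# response.url.split("/")[-2] takes the second-to-last path segment.
# The trailing '/' makes the last split element '', so [-2] is the page number.
url = 'http://quotes.toscrape.com/page/1/'
page = url.split("/")[-2]
filename = 'quotes-%s.html' % page
print(filename)  # quotes-1.html
```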
- Run the spider
scrapy crawl quotes
Here quotes is the name defined in the class above, i.e. the spider's name; if there are multiple spider files under the spiders directory, their names must not clash.
Debugging selectors interactively
scrapy shell 'http://quotes.toscrape.com/page/1/'
response.css('title::text').getall()
response.css('title::text').re(r'Quotes.*')
response.xpath('//title/text()').get()
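Outside the shell, a rough stdlib stand-in for the title::text extraction on a saved page is a regex (a sketch only; Scrapy's selectors are real CSS/XPath engines, not regexes, and the html string here is an assumed sample):

```python
import re

# Mimic response.css('title::text').get() on raw HTML with a regex.
html = '<html><head><title>Quotes to Scrape</title></head></html>'
match = re.search(r'<title>(.*?)</title>', html)
title = match.group(1)
print(title)  # Quotes to Scrape
```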
Extracting data and writing it to a file
```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```
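response.follow accepts a relative href and resolves it against the current page URL before requesting it; the resolution step behaves roughly like urllib.parse.urljoin (next_href below is an assumed sample value for what li.next a::attr(href) returns on this site):

```python
from urllib.parse import urljoin

# What response.follow does with a relative link, approximately:
current = 'http://quotes.toscrape.com/page/1/'
next_href = '/page/2/'
resolved = urljoin(current, next_href)
print(resolved)  # http://quotes.toscrape.com/page/2/
```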
Run
scrapy crawl quotes -o quotes.json
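The -o feed export collects every dict the spider yields into one JSON array in the output file. A sketch of the resulting shape, using placeholder items rather than real scraped data:

```python
import json

# Placeholder items standing in for the dicts yielded by parse().
items = [
    {'text': 'quote one', 'author': 'author one', 'tags': ['tag-a']},
    {'text': 'quote two', 'author': 'author two', 'tags': ['tag-b']},
]
# The .json feed is a single JSON array of such objects.
print(json.dumps(items, indent=2))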
Passing arguments on the command line
```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
```
Run
scrapy crawl quotes -o quotes-humor.json -a tag=humor
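Each -a key=value becomes an attribute on the spider instance, which is why the getattr pattern above works. The mechanism can be sketched in plain Python (FakeSpider is a hypothetical stand-in, not a Scrapy class):

```python
# Sketch: Scrapy sets -a arguments as spider attributes before crawling.
class FakeSpider:
    pass

spider = FakeSpider()
spider.tag = 'humor'  # as if run with: scrapy crawl quotes -a tag=humor

url = 'http://quotes.toscrape.com/'
tag = getattr(spider, 'tag', None)  # None when -a tag=... is omitted
if tag is not None:
    url = url + 'tag/' + tag
print(url)  # http://quotes.toscrape.com/tag/humor
```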