Web Crawlers, Part 3

Calm微笑

Category: Crawlers

Article link: https://blog.csdn.net/yao1373446012/article/details/85856444

Scrapy is a popular web crawling framework.

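1. Create the project. In a terminal, run:

$ scrapy startproject example   # example is the project name

$ cd example

This generates the project skeleton: scrapy.cfg plus an example/ package containing items.py, pipelines.py, settings.py, and a spiders/ directory, among others.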

2. Define the model

The file example/items.py contains the following code:

# -*- coding: utf-8 -*-

import scrapy

# Holds the fields we want to scrape
class ExampleItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    population = scrapy.Field()

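3. Create the spider

$ scrapy genspider country example.webscraping.com --template=crawl

Pass the spider name, the domain, and an optional template argument. This generates country.py automatically; then add the corresponding crawling code to it.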

country.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from example.items import ExampleItem


class CountrySpider(CrawlSpider):
    name = 'country'  # name used to refer to the spider
    allowed_domains = ['example.webscraping.com']  # domains the spider may crawl
    start_urls = ['http://example.webscraping.com/']  # URLs the crawl starts from

    # Rules built from regular expressions that tell the spider which links to follow
    rules = (
        Rule(LinkExtractor(allow='/index/', deny='/user/'), follow=True),
        Rule(LinkExtractor(allow='/view/', deny='/user/'), callback='parse_item'),
    )

    # Extract data from the response, here using CSS selectors
    def parse_item(self, response):
        item = ExampleItem()
        name_css = 'tr#places_country_row td.w2p_fw::text'
        item['name'] = response.css(name_css).extract()
        pop_css = 'tr#places_population_row td.w2p_fw::text'
        item['population'] = response.css(pop_css).extract()
        return item
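To make the effect of the two rules concrete, here is a rough sketch using only the standard library. It is not Scrapy's actual implementation: it assumes that a link is accepted when the allow pattern is found somewhere in the URL and the deny pattern is not, and that the first matching rule wins. The RULES list, the match_rule helper, and the sample URLs are all made up for illustration.

```python
import re

# Simplified stand-in for the spider's two rules (hypothetical structure).
# callback=None means "follow links only"; otherwise the page is parsed.
RULES = [
    {"allow": r"/index/", "deny": r"/user/", "callback": None},
    {"allow": r"/view/", "deny": r"/user/", "callback": "parse_item"},
]


def match_rule(url):
    """Return the first rule whose allow pattern occurs in the URL and whose
    deny pattern does not, or None when no rule applies."""
    for rule in RULES:
        if re.search(rule["allow"], url) and not re.search(rule["deny"], url):
            return rule
    return None


if __name__ == "__main__":
    # Hypothetical URLs, used only to exercise the patterns.
    samples = [
        "http://example.webscraping.com/index/1",
        "http://example.webscraping.com/view/some-country-1",
        "http://example.webscraping.com/user/login?_next=/index/1",
    ]
    for url in samples:
        rule = match_rule(url)
        if rule is None:
            print(url, "-> skipped")
        else:
            print(url, "->", rule["callback"] or "follow only")
```

Note that the third URL contains /index/ but is still skipped, because the deny pattern /user/ also matches. This is why both rules deny /user/.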

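Tuning the settings:

To keep the crawler from being banned, Scrapy's settings need to be updated. By default, Scrapy allows up to 8 concurrent downloads per domain, with no delay between downloads.

So add the following to example/settings.py, so that the crawler sends only one request at a time to each domain, with a delay between every two requests:

CONCURRENT_REQUESTS_PER_DOMAIN = 1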

DOWNLOAD_DELAY = 5
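These two settings make the crawl much slower, and a quick estimate shows by how much. The sketch below assumes Scrapy's default RANDOMIZE_DOWNLOAD_DELAY behaviour, where each wait is drawn between 0.5 and 1.5 times DOWNLOAD_DELAY, and it ignores the time spent downloading the pages themselves. The function name and the page count of 100 are hypothetical.

```python
def estimated_crawl_seconds(num_pages, download_delay=5.0):
    """Return (low, expected, high) estimates of the total waiting time in
    seconds when pages are fetched one at a time from a single domain.

    Assumption: each wait lies between 0.5x and 1.5x download_delay, and
    there is one wait between every two consecutive requests.
    """
    gaps = max(num_pages - 1, 0)  # no wait is needed before the first request
    return (0.5 * download_delay * gaps,
            download_delay * gaps,
            1.5 * download_delay * gaps)


if __name__ == "__main__":
    # 100 pages is an arbitrary figure chosen for illustration.
    low, expected, high = estimated_crawl_seconds(100)
    print(f"waiting time: {low:.0f}s to {high:.0f}s, about {expected:.0f}s on average")
```

With a delay of 5 seconds, even a few hundred pages take tens of minutes, so the value is a trade-off between politeness and crawl time.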

Run:

$ scrapy crawl country --output=countries.csv -s LOG_LEVEL=INFO

The scraped data is then saved automatically to countries.csv.
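Once the file exists, it can be read back with Python's csv module. The sketch below assumes the CSV has a header row with the columns name and population, matching the item fields, and that population values may contain thousands separators. The sample rows are invented placeholders, not real crawl output, and load_countries is a hypothetical helper.

```python
import csv
import io

# Invented stand-in for countries.csv; real content depends on the crawl.
SAMPLE_CSV = 'name,population\nExampleland,"1,234"\nTestonia,"56,789"\n'


def load_countries(csv_file):
    """Read rows from an open CSV file object and convert the population
    column to an int when it consists of digits and commas only."""
    rows = []
    for row in csv.DictReader(csv_file):
        digits = row["population"].replace(",", "")
        population = int(digits) if digits.isdigit() else None
        rows.append({"name": row["name"], "population": population})
    return rows


if __name__ == "__main__":
    for country in load_countries(io.StringIO(SAMPLE_CSV)):
        print(country["name"], country["population"])
```

To read the real output instead, pass a file opened with open("countries.csv", newline="", encoding="utf-8") to load_countries.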
