A Concise Scrapy Tutorial

水墨小龙虾

Category: Web crawlers

Article link: https://blog.csdn.net/jianhong1990/article/details/48252389

A concise tutorial for Scrapy 0.24

Creating a project

scrapy startproject <project-name>

With demo as the project name, the generated directory structure is as follows:

│  scrapy.cfg
└─demo
    │  items.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    └─spiders
            __init__.py

Adding an item

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DemoItem(scrapy.Item):
    # Placeholder generated by startproject; define fields here like:
    # name = scrapy.Field()
    pass


class DmozItem(scrapy.Item):
    # One Field per piece of data the spider extracts
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

Adding a spider

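import scrapy

from demo.items import DmozItem


# scrapy.Spider is the current name of the deprecated scrapy.spider.BaseSpider
class DmozSpider(scrapy.Spider):
    name = "dmoz"                    # name used to refer to this spider
    allowed_domains = ["dmoz.org"]   # requests outside this domain are filtered
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Called with the downloaded response for each URL in start_urls
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            # extract() returns a list of strings, so every field holds a list
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item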
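
To get a feel for what the three XPath expressions in parse select, the sketch below approximates them without Scrapy or network access. It swaps in the standard library's xml.etree.ElementTree, which supports only a subset of XPath, so attribute and text access go through .get, .text and .tail rather than a/@href and text(). The sample markup and the extract_entries helper are hypothetical and not part of the original spider.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup; not taken from dmoz.org.
SAMPLE = """<html><body><ul>
<li><a href="http://example.com/a">Book A</a> - First description</li>
<li><a href="http://example.com/b">Book B</a> - Second description</li>
</ul></body></html>"""


def extract_entries(markup):
    """Return one dict per <li>, with list-valued fields like the spider's items."""
    root = ET.fromstring(markup)
    entries = []
    for li in root.findall(".//ul/li"):  # rough counterpart of //ul/li
        anchors = li.findall("a")
        # Text directly under <li>: its leading text plus the tail of each child.
        direct_text = [li.text] + [child.tail for child in li]
        entries.append({
            "title": [a.text for a in anchors if a.text],               # a/text()
            "link": [a.get("href") for a in anchors if a.get("href")],  # a/@href
            "desc": [t for t in direct_text if t],                      # text()
        })
    return entries


entries = extract_entries(SAMPLE)
```

Each field comes back as a list, mirroring the lists that extract() returns in the spider; the description is the text that follows the link inside the same list element.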
