1、新建项目
scrapy startproject tutorial
整体结构如下
2、修改items
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy
class SinaminiItem(scrapy.Item):
# define the fields for your item here like:
name = scrapy.Field()
content = scrapy.Field()
3、spider
最后就是spider了,我也懒得写pipe了,结果截图展示
spider也是继承了最简单的spider,起名为myspider.py ,如下:
代码如下:
import scrapy
from sinamini.items import SinaminiItem
class DmozSpider(scrapy.Spider):
name = "mysina"
allowed_domains = ["sports.sina.com.cn"]
start_urls = [
"http://sports.sina.com.cn/g/pl/2017-05-23/doc-ifyfkqks4451477.shtml"
]
def parse(self, response):
try:
for sel in response.xpath("//article[@class='article-a']"):
item = SinaminiItem()
item['name'] = sel.xpath('h1/text()').extract()
item['content'] = sel.xpath("div[@class='article-a__content']/p/text()").extract()
yield item
except:
print('error')
for url in response.selector.xpath("//a/@href").re(r'^http://sports.sina.*'):
yield scrapy.Request(url,callback = self.parse)
结果展示
scrapy crawl mysina ###执行spider
结果如下: