Exercise: crawling the Biquge (笔趣阁) novel index with Scrapy
- Install Scrapy and create the project
    pip install scrapy
    scrapy startproject tutorial
- Define the item structure (in the project's items.py)
    import scrapy

    class Biquge(scrapy.Item):
        title = scrapy.Field()  # novel title
        href = scrapy.Field()   # URL of the novel's page
- Create the spider (genspider takes a domain, not a URL path)
    scrapy genspider biquge www.xbiquge.la
- Edit the spider
    # -*- coding: utf-8 -*-
    import scrapy
    from tutorial.items import Biquge


    class BiqugeSpider(scrapy.Spider):
        name = 'biquge'
        # allowed_domains takes bare domain names, not URL paths
        allowed_domains = ['www.xbiquge.la']
        start_urls = ['http://www.xbiquge.la/xiaoshuodaquan/']

        def parse(self, response):
            # Each <li> under div.novellist holds one novel link
            for sel in response.xpath('//div[@class="novellist"]/ul/li'):
                item = Biquge()
                # default='' keeps .strip() from failing when an <li> has no <a>
                item['title'] = sel.xpath('a/text()').extract_first(default='').strip()
                item['href'] = sel.xpath('a/@href').extract_first(default='').strip()
                yield item
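The extraction logic in parse can be tried offline before running a crawl. The following is a minimal sketch that uses the standard library's ElementTree in place of Scrapy selectors; the SAMPLE markup and the extract_entries helper are hypothetical stand-ins, and the real page is assumed to have the same div.novellist/ul/li/a layout that the spider's XPath expects.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the real page markup.
SAMPLE = """
<html><body>
<div class="novellist">
  <ul>
    <li><a href="http://www.xbiquge.la/15/15409/"> 牧神记 </a></li>
    <li><a href="http://www.xbiquge.la/7/7931/">终极斗罗</a></li>
    <li>no link here</li>
  </ul>
</div>
</body></html>
"""


def extract_entries(markup):
    """Return a list of {"title", "href"} dicts, one per <li> that has a link."""
    root = ET.fromstring(markup.strip())
    entries = []
    # Same path as the spider's XPath, in ElementTree's limited XPath syntax.
    for li in root.findall(".//div[@class='novellist']/ul/li"):
        link = li.find("a")
        if link is None:
            continue  # skip list items without an <a> element
        entries.append({
            "title": (link.text or "").strip(),
            "href": link.get("href", "").strip(),
        })
    return entries


if __name__ == "__main__":
    for entry in extract_entries(SAMPLE):
        print(entry)
```

ElementTree only parses well-formed markup, so this is suited to small hand-written samples; for real pages, Scrapy's own selectors are the better tool.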
- Crawl and write the results to a JSON file
    scrapy crawl biquge -o biquge.json
- Result
    [
    {"title": "牧神记", "href": "http://www.xbiquge.la/15/15409/"},
    {"title": "终极斗罗", "href": "http://www.xbiquge.la/7/7931/"},
    ...
    {"title": "废土巫师", "href": "http://www.xbiquge.la/0/874/"},
    {"title": "我的玉雕不正常", "href": "http://www.xbiquge.la/25/25679/"}
    ]
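A quick way to sanity-check the exported file is to load it back with the standard json module. This is a small sketch, assuming the file is a UTF-8 JSON array shaped like the result above; load_entries is a hypothetical helper name, not part of Scrapy.

```python
import json


def load_entries(path="biquge.json"):
    """Load the exported feed and keep only entries with both fields filled in."""
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)
    return [e for e in entries if e.get("title") and e.get("href")]


if __name__ == "__main__":
    entries = load_entries()
    print(len(entries), "novels loaded")
```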
Next goal: write the items to MongoDB.
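One possible shape for that step is a Scrapy item pipeline backed by pymongo. The sketch below is an assumption about how it could look, not a finished implementation: the class name MongoPipeline, the connection URI, and the database and collection names are all hypothetical, and it assumes pymongo is installed and a MongoDB server is reachable at that URI.

```python
class MongoPipeline:
    """Upsert each scraped item into a MongoDB collection, keyed by href."""

    # Hypothetical connection settings; adjust to the actual deployment.
    mongo_uri = "mongodb://localhost:27017"
    db_name = "tutorial"
    collection_name = "biquge"

    def open_spider(self, spider):
        # Imported here so the module loads even where pymongo is absent.
        import pymongo
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        doc = dict(item)
        # Upsert on href so re-running the crawl does not create duplicates.
        self.collection.update_one(
            {"href": doc["href"]}, {"$set": doc}, upsert=True
        )
        return item
```

To try it, the class would go in the project's pipelines.py and be enabled in settings.py with something like ITEM_PIPELINES = {"tutorial.pipelines.MongoPipeline": 300}. Upserting on href is a design choice made here so repeated crawls stay idempotent; a plain insert_one would also work if duplicates are acceptable.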