1. Scrapy: crawling multiple pages
# Spider: on top of the original code, construct the URL of the next page and issue a
# new request with scrapy.Request; the callback is still parse.
# Class attributes on the spider:
page = 1
base_url = 'http://www.xiaohuar.com/list-1-%s.html'
# At the end of parse(), after yielding the items of the current page:
if self.page < 4:
    page_url = self.base_url % self.page
    self.page += 1
    yield scrapy.Request(url=page_url, callback=self.parse)
# (no changes are needed in the other files)
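The paging logic above can be sketched outside Scrapy to make the URL construction visible: the spider keeps a page counter, formats it into the URL template, and stops at the page limit (here 4, as in the snippet). The `page_urls` helper name is an illustration, not part of the original code.

```python
# Standalone sketch of the spider's paging logic: build each list-page URL
# from the template until the page limit is reached.
base_url = 'http://www.xiaohuar.com/list-1-%s.html'

def page_urls(limit=4):
    """Yield the list-page URLs the spider would request, mirroring
    the `if self.page < limit` check in parse()."""
    page = 1
    while page < limit:
        yield base_url % page
        page += 1

urls = list(page_urls())
```

Each generated URL would be wrapped in a `scrapy.Request` with `callback=self.parse`, so every page is parsed by the same method.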
2. Scrapy: crawling detail pages
Requirement: crawl each joke's title and its detail-page link, then follow the link and crawl the joke's content from the detail page.
# items.py: define the fields to persist
import scrapy

class JokeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
# Spider code:
# -*- coding: utf-8 -*-
import scrapy
from ..items import JokeItem

class XhSpider(scrapy.Spider):
    name = 'xh'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.jokeji.cn/list.htm']

    def parse(self, response):
        li_list = response.xpath('//div[@class="list_title"]/ul/li')
        for li in li_list:
            title = li.xpath('./b/a/text()')
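The usual continuation of this loop extracts the detail-page link, puts the title into a partially filled item, and hands the item to a second callback (via `meta` or `cb_kwargs`) that fills in the content. The pattern can be sketched without Scrapy by treating each request as a `(url, callback, meta)` tuple; the `parse_detail` name, the plain-dict item, and the tiny `run` scheduler are illustrative assumptions, not the original spider's code.

```python
# Simulated title/detail hand-off: parse() emits "requests" carrying the
# half-filled item, and parse_detail() completes it with the page content.
def parse(listing):
    # listing: (title, detail_url) pairs, standing in for the XPath results
    for title, url in listing:
        item = {'title': title}            # a JokeItem would be used in Scrapy
        yield (url, parse_detail, {'item': item})

def parse_detail(content, meta):
    item = meta['item']                    # retrieve the item passed along
    item['content'] = content              # fill the second field
    return item

def run(listing, pages):
    # pages maps detail URL -> page text, standing in for the downloader
    items = []
    for url, callback, meta in parse(listing):
        items.append(callback(pages[url], meta))
    return items
```

In the real spider, the tuple becomes `scrapy.Request(url, callback=self.parse_detail, meta={'item': item})`, and `parse_detail` yields the finished item to the pipelines.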