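Define the goal
- Target URL: https://www.ryjiaoyu.com/tag/details/7
- The site is a static web page.
- Each book's detail page must be visited to scrape the full information.
- The crawler is built with the Scrapy framework.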
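Create the project
- Create a new Scrapy project:
1. Create the crawler project with the command: scrapy startproject <project name>
For example: scrapy startproject RenYou
2. Create the spider file with the command: scrapy genspider <file name> <domain>
For example: scrapy genspider ry ryjiaoyu.com
- For details, see https://www.jianshu.com/p/8e78dfa7c368
- A freshly created project does not yet contain a pictures folder.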
Code implementation
import scrapy

from ..items import RenyouItem


class RySpider(scrapy.Spider):
    name = 'ry'
    allowed_domains = ['ryjiaoyu.com']
    start_urls = ["http://www.ryjiaoyu.com/tag/books/7?page=0"]

    def parse(self, response):
        # Request each of the 32 listing pages (page=0 .. page=31).
        for i in range(0, 32):
            url = "http://www.ryjiaoyu.com/tag/books/7?page=" + str(i)
            yield scrapy.Request(url=url, callback=self.get_title)

    def get_title(self, response):
        # Each book on a listing page is an <li class="block-item">.
        lis = response.xpath("//li[@class='block-item']")
        for i in lis:
            name = i.xpath("div[@class='book-info']/h4[@class='name']/a/text()").get()
            ZuoZhe1 = i.xpath("string(div[@class='book-info']/div[@class='author']/span)").get()
            ZuoZhe = str(ZuoZhe1).strip()
            JiaGe = i.xpath("div[@class='book-info']/span[@class='paperback']/span/text()").get()
            if JiaGe is not None:
                JiaGe = JiaGe.replace("¥", "")
            else:
                JiaGe = "暂无价格"  # placeholder: no price available
            url = i.xpath("div[@class='book-info']/h4[@class='name']/a/@href").get()
            url1 = response.urljoin(url)  # make the relative link absolute
            item = RenyouItem()
            item['书名'] = name
            item['作者'] = ZuoZhe
            item['价格'] = JiaGe
            item['链接'] = url1
            # Pass the partly filled item on to the detail-page callback.
            yield scrapy.Request(url=url1, meta={'item': item}, callback=self.parseDetail)

    def parseDetail(self, response):
        item = response.meta.get('item')
        item['内容介绍'] = response.xpath('//div[@class="intro"]/text()').get()
        if item['内容介绍'] is not None:
            item['内容介绍'] = str(item['内容介绍']).strip()
            item['内容介绍'] = item['内容介绍'].replace("\r", "")
            item['内容介绍'] = item['内容介绍'].replace("'", "")
            item['内容介绍'] = item['内容介绍'].replace("\n", "")
        else:
            item['内容介绍'] = "没得内容介绍"  # placeholder: no description
        # The ISBN and the publication date sit in li[3] and li[4], but their
        # order differs between pages, so check the label before reading.
        # "or ''" avoids a TypeError when the <li> is missing.
        item['书号'] = response.xpath('//ul[@class="publish-info"]/li[4]').get() or ""
        if "书 号:" in item['书号']:
            item['书号'] = response.xpath('//ul[@class="publish-info"]/li[4]/text()').get()
        else:
            item['书号'] = response.xpath('//ul[@class="publish-info"]/li[3]/text()').get()
        item['出版日期'] = response.xpath('//ul[@class="publish-info"]/li[3]').get() or ""
        if "出版日期:" in item['出版日期']:
            item['出版日期'] = response.xpath('//ul[@class="publish-info"]/li[3]/text()').get()
        else:
            item['出版日期'] = response.xpath('//ul[@class="publish-info"]/li[4]/text()').get()
        if item['出版日期'] is not None:
            item['出版日期'] = item['出版日期'].replace("\r\n ", "").strip()
        else:
            item['出版日期'] = "没得出版日期"  # placeholder: no publication date
        item['图片链接'] = response.css('ul.block-items img::attr(src)').getall()
        print(item)
        yield item
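The detail-page callback reads the ISBN and publication date by position (li[3] or li[4]) and swaps them depending on which label is present. A label-based lookup may be less fragile. The helper below is only a sketch: pick_by_label is a hypothetical name, and the sample strings are made-up stand-ins for the text of the publish-info list items, not data taken from the site.

```python
def pick_by_label(entries, label, default=None):
    """Return the text that follows `label` in the first entry containing it."""
    for entry in entries:
        # Collapse line breaks and runs of whitespace into single spaces.
        text = " ".join(entry.split())
        if label in text:
            return text.split(label, 1)[1].strip()
    return default


# Hypothetical sample: the order of the entries does not matter.
sample = [
    "出版日期:\r\n    2020-01-01",
    "书 号:\r\n    978-7-115-00000-0",
]

isbn = pick_by_label(sample, "书 号:")
pub_date = pick_by_label(sample, "出版日期:", default="没得出版日期")
print(isbn, pub_date)
```

Inside the spider, the entries could be collected with something like response.xpath('//ul[@class="publish-info"]/li').xpath('string(.)').getall(), assuming each label and its value live in the same li element.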
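- Run the command: in the Terminal, enter scrapy crawl ry -o renyou.json; this runs the spider and saves the scraped items to a JSON file.
- An online JSON viewer is recommended for inspecting the JSON file: bejson.com
Result screenshot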
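As a local alternative to an online viewer, the exported file can also be inspected with Python's standard json module. This is a minimal sketch, assuming the file is the renyou.json produced by the command above and that it contains a JSON array of items; load_items and preview are hypothetical helper names. Depending on the FEED_EXPORT_ENCODING setting, Chinese text in the file may appear as \uXXXX escapes, which json.load decodes back into readable characters.

```python
import json
from pathlib import Path


def load_items(path):
    """Load the list of scraped items from a JSON export."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def preview(items, limit=3):
    """Pretty-print the first few items, keeping Chinese text readable."""
    return json.dumps(items[:limit], ensure_ascii=False, indent=2)


# Only runs if the export exists in the current directory.
if Path("renyou.json").exists():
    items = load_items("renyou.json")
    print(len(items), "items")
    print(preview(items))
```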