1. Requirements analysis
We want to scrape the novel bestseller list. For each novel we extract four fields: name, author, type (genre), and form (serialization status). The scraped data is then saved to a CSV file.
2. Creating the project
Create a Scrapy project as follows:
(1) Create a folder named scrapyProject on drive D.
(2) Open the Run dialog, type cmd, and press Enter to open a terminal.
(3) Change the working directory to the folder created above:
d:
cd d:\scrapyProject
(4) Create a project named qidian_hot:
scrapy startproject qidian_hot
(5) Open the qidian_hot project in PyCharm.
The other generated files can be ignored for now; create a Python file named qidian_hot_spider.py under the spiders folder and write the spider code there.
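For reference, `scrapy startproject` generates a standard layout along these lines (exact contents may vary slightly by Scrapy version):

```text
scrapyProject/
└── qidian_hot/
    ├── scrapy.cfg          # deployment configuration
    └── qidian_hot/
        ├── __init__.py
        ├── items.py        # item definitions
        ├── middlewares.py  # spider/downloader middlewares
        ├── pipelines.py    # item pipelines
        ├── settings.py     # project settings
        └── spiders/
            └── __init__.py # spider files go in this folder
```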
3. Writing the code
from scrapy import Request
from scrapy.spiders import Spider


class HotSalesSpider(Spider):
    name = 'hot'  # spider name, needed later to run the crawl
    # list of start URLs
    start_urls = ['https://www.qidian.com/all/']

    # parse callback: extract the data with XPath
    def parse(self, response):
        # locate every novel's info <div>
        list_selector = response.xpath("//div[@class='book-mid-info']")
        for one_selector in list_selector:
            name = one_selector.xpath("h2/a/text()").extract()[0]
            author = one_selector.xpath("p[1]/a[1]/text()").extract()[0]
            type = one_selector.xpath("p[1]/a[2]/text()").extract()[0]
            form = one_selector.xpath("p[1]/span/text()").extract()[0]
            # collect each novel's info into a dict
            hot_dict = {"name": name,
                        "author": author,
                        "type": type,
                        "form": form}
            yield hot_dict  # yield the dict back to the Scrapy engine
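The XPath expressions above can be sanity-checked without launching a crawl. Below is a minimal sketch using only the standard library's xml.etree.ElementTree, which supports the subset of XPath used here, run against a hypothetical, simplified fragment of the page's markup (the real page has more attributes and nesting):

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment mimicking one book entry (simplified markup)
html = """
<body>
  <div class="book-mid-info">
    <h2><a>Sample Novel</a></h2>
    <p><a>Sample Author</a><a>Fantasy</a><span>Serial</span></p>
  </div>
</body>
"""

root = ET.fromstring(html)
books = []
for div in root.findall(".//div[@class='book-mid-info']"):
    books.append({
        "name": div.find("h2/a").text,         # novel title
        "author": div.find("p[1]/a[1]").text,  # first <a> in the first <p>
        "type": div.find("p[1]/a[2]").text,    # second <a> in the first <p>
        "form": div.find("p[1]/span").text,    # the <span> holds the form
    })

print(books)
```

The same relative paths (`h2/a`, `p[1]/a[1]`, …) are what the spider applies to each `book-mid-info` selector.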
4. Running the spider from the terminal
(1) Change the working directory to the project folder:
d:
cd d:\scrapyProject\qidian_hot
(2) Type the crawl command and press Enter. Here hot is the spider name, and -o hot.csv saves the scraped data to a CSV file:
scrapy crawl hot -o hot.csv
The scraped data below has been saved to the CSV file:
name,author,type,form
夜的命名术,会说话的肘子,都市,连载
灵境行者,卖报小郎君,科幻,连载
不科学御兽,轻泉流响,玄幻,连载
这游戏也太真实了,晨星LL,轻小说,连载
深空彼岸,辰东,都市,连载
择日飞升,宅猪,仙侠,连载
神秘复苏,佛前献花,仙侠,连载
我的属性修行人生,滚开,玄幻,连载
宇宙职业选手,我吃西红柿,科幻,连载
家父汉高祖,历史系之狼,历史,连载
明克街13号,纯洁滴小龙,都市,连载
大夏文圣,七月未时,仙侠,连载
道诡异仙,狐尾的笔,玄幻,连载
我已不做大佬好多年,萌俊,都市,连载
我在修仙界长生不死,木工米青,仙侠,连载
术师手册,听日,轻小说,连载
镇妖博物馆,阎ZK,悬疑,连载
星界使徒,齐佩甲,游戏,连载
诸界第一因,裴屠狗,玄幻,连载
修仙就是这样子的,凤嘲凰,仙侠,连载
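Once exported, the file can be read back with the standard library's csv module. A small sketch, using an inline sample that mirrors the first rows above in place of opening hot.csv from disk:

```python
import csv
import io

# Inline sample standing in for hot.csv (header plus first two data rows)
sample = (
    "name,author,type,form\n"
    "夜的命名术,会说话的肘子,都市,连载\n"
    "灵境行者,卖报小郎君,科幻,连载\n"
)

# csv.DictReader maps each row to a dict keyed by the header row
rows = list(csv.DictReader(io.StringIO(sample)))
print(rows[0]["name"], rows[0]["author"])
```

To read the real file, replace `io.StringIO(sample)` with `open("hot.csv", encoding="utf-8")`.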
5. Improvement
Set request headers and override the start_requests() method.
from scrapy import Request
from scrapy.spiders import Spider


class HotSalesSpider(Spider):
    name = 'hot1'
    # set a user agent to pose as a browser
    qidian_headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; "
                                    "Win64; x64) AppleWebKit/537.36 "
                                    "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"}
    '''
    start_urls=['https://www.qidian.com/rank/yuepiao/year2022-month06-page1/',
                'https://www.qidian.com/rank/yuepiao/year2022-month06-page2/',
                'https://www.qidian.com/rank/yuepiao/year2022-month06-page3/',
                'https://www.qidian.com/rank/yuepiao/year2022-month06-page4/',
                'https://www.qidian.com/rank/yuepiao/year2022-month06-page5/']
    '''

    # build the initial requests by overriding start_requests()
    def start_requests(self):
        url = "https://www.qidian.com/rank/yuepiao/year2022-month06-page1/"
        # create a Request with the URL, headers, and callback; the downloaded
        # page (a Response object) is handed to the callback for parsing
        yield Request(url, headers=self.qidian_headers, callback=self.qidian_parse)

    def qidian_parse(self, response):
        list_selector = response.xpath("//div[@class='book-mid-info']")
        for one_selector in list_selector:
            name = one_selector.xpath("h2/a/text()").extract()[0]
            author = one_selector.xpath("p[1]/a[1]/text()").extract()[0]
            type = one_selector.xpath("p[1]/a[2]/text()").extract()[0]
            form = one_selector.xpath("p[1]/span/text()").extract()[0]
            hot_dict = {"name": name,
                        "author": author,
                        "type": type,
                        "form": form}
            yield hot_dict
The request headers can also be placed in settings.py, so they do not have to be defined in every spider:
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'qidian_hot (+http://www.yourdomain.com)'
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
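Besides USER_AGENT, Scrapy also supports DEFAULT_REQUEST_HEADERS in settings.py, which is merged into every outgoing request. A sketch (the header values below are illustrative, not taken from this project):

```python
# settings.py -- headers merged into every outgoing request
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9",
}
```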
6. Scraping multiple pages
The key is to find how the parameters in the URL change as the page number changes.
from scrapy import Request
from scrapy.spiders import Spider


class HotSalesSpider(Spider):
    name = 'hot'
    current_page = 1  # track the current page number, starting at 1
    # user agent commented out: the headers are now set in settings.py
    '''
    qidian_headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; "
                                    "Win64; x64) AppleWebKit/537.36 "
                                    "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"}
    '''
    '''
    start_urls=['https://www.qidian.com/rank/yuepiao/year2022-month06-page1/',
                'https://www.qidian.com/rank/yuepiao/year2022-month06-page2/',
                'https://www.qidian.com/rank/yuepiao/year2022-month06-page3/',
                'https://www.qidian.com/rank/yuepiao/year2022-month06-page4/',
                'https://www.qidian.com/rank/yuepiao/year2022-month06-page5/']
    '''

    # build the initial request by overriding start_requests()
    def start_requests(self):
        url = "https://www.qidian.com/rank/yuepiao/year2022-month06-page1/"
        # headers are already set in settings.py, so only URL and callback are needed
        yield Request(url, callback=self.qidian_parse)

    def qidian_parse(self, response):
        list_selector = response.xpath("//div[@class='book-mid-info']")
        for one_selector in list_selector:
            name = one_selector.xpath("h2/a/text()").extract()[0]
            author = one_selector.xpath("p[1]/a[1]/text()").extract()[0]
            type = one_selector.xpath("p[1]/a[2]/text()").extract()[0]
            form = one_selector.xpath("p[1]/span/text()").extract()[0]
            hot_dict = {"name": name,
                        "author": author,
                        "type": type,
                        "form": form}
            yield hot_dict
        # build the next page's URL and yield a new Request to the engine
        self.current_page += 1
        if self.current_page <= 3:
            next_url = 'https://www.qidian.com/rank/yuepiao/year2022-month06-page%d/' % self.current_page
            yield Request(next_url, callback=self.qidian_parse)
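The paging rule the spider relies on, that only the page number at the end of the URL changes, can be checked standalone:

```python
# Template with the page number as the only varying part (same pattern as the spider)
base = "https://www.qidian.com/rank/yuepiao/year2022-month06-page%d/"

# First three pages, matching the current_page <= 3 cutoff in the spider
urls = [base % page for page in range(1, 4)]
for u in urls:
    print(u)
```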