Scrapy basic commands
Create a project:
`scrapy startproject name`
Create a spider:
`scrapy genspider pcname XX.com`
Run a spider:
`scrapy crawl pcname`
Debug a spider:
scrapy shell url
or
scrapy shell
With a url, the shell downloads the page automatically and opens with `response` already populated.
Without a url, nothing is downloaded; call fetch(url) inside the shell to download a page.
Response class
Attributes
- body: the raw response bytes
- text: the response body decoded as text
- headers: the response headers
- encoding: the character encoding of the response data
- status: the HTTP status code
- url: the requested url
- request: the Request object that produced this response
- meta: metadata dict, used to pass values from a Request to its callback
Parsing
- selector: the Selector object built from the response
- css(): CSS selector; returns a SelectorList (an iterable, list-like collection of Selector objects)
    - scrapy.selector.SelectorList
- xpath(): XPath selector, takes an XPath path expression
    - scrapy.selector.Selector
- CSS extraction syntax
    - ::text extracts the text of the selected nodes
    - ::attr("attr_name") extracts the value of the named attribute
- Common selector methods
    - css()/xpath(), chainable on any selector
    - extract(): extracts all selected content, returns a list
    - extract_first()/get(): extracts the first selector's content, returns a string
Hands-on 1: scraping novel info from Qidian (起点中文网)
Code
```python
import scrapy
from scrapy.http import Request, Response


class WanbenSpider(scrapy.Spider):
    name = 'wanben'
    allowed_domains = ['qidian.com', 'book.qidian.com']
    start_urls = ['https://www.qidian.com/finish']

    def parse(self, response):
        if response.status == 200:
            lis = response.css('.all-img-list li')  # SelectorList
            for i in lis:
                items = {}
                # i is a Selector; note that Selector has no x() method
                a = i.xpath('./div[1]/a')
                items['book_url'] = a.xpath('./@href').get()
                items['book_cover'] = a.xpath('./img/@src').get()
                items['book_name'] = i.xpath('div[2]/h4//text()').get()
                items['author'], *items['tags'] = i.css('.author a::text').extract()
                items['summary'] = i.css('.intro::text').get()
                yield items
                # yield Request('https:' + items['book_url'], callback=self.parse_info, priority=1)
            next_url = 'https:' + response.css('.lbf-pagination-item-list').xpath('./li[last()]/a/@href').get()
            if next_url.find('page') != -1:
                yield Request(next_url, priority=100)  # higher priority requests are downloaded first

    def parse_info(self, response: Response):
        # print("----------------------- parse the novel detail page")
        pass
```
Hands-on 2: scraping a full page of cosplay galleries from 绝对领域 (jder.net)
Code
```python
import scrapy
import requests
import os
from scrapy.http import Request, Response


class JdlySpider(scrapy.Spider):
    name = 'jdly'
    allowed_domains = ['jder.net']
    start_urls = ['https://www.jder.net/cosplay']

    def parse(self, response):
        if response.status == 200:
            lis = response.css('.post-module-thumb a')
            c = lis.xpath('./@href').extract()
            for i in c:
                yield Request(i, callback=self.get_img)
            # for j in range(2, 10):
            #     next_url = 'https://www.jder.net/cosplay/page/' + str(j)
            #     yield Request(next_url)  # higher priority requests are downloaded first

    def get_img(self, response: Response):
        if response.status == 200:
            paragraphs = response.css('.entry-content p')
            head = response.css('.entry-header h1::text').get()
            imgs = paragraphs.xpath('./img/@src').extract()
            # for i in imgs:
            #     print(i)
            x = 1
            path = r'F:\360MoveData\Users\ZMZ\Desktop\pachong\\'
            os.mkdir(path + head)
            path2 = path + head + '\\'
            for url in imgs:
                file_name = 'pic' + str(x) + '.jpeg'
                q = requests.get(url)
                with open(path2 + file_name, 'wb') as f:
                    f.write(q.content)
                print('Saved image %d' % x)
                x += 1
            print('Done crawling')
```
Result
![Result screenshot](https://i.loli.net/2020/08/31/rm7ViMSesHzclk4.png)