Common Scrapy questions
scrapy.Request(url[, callback, method="GET", headers, body, cookies, meta, dont_filter=False])

dont_filter: defaults to False, which means requested URLs are deduplicated — a URL that has already been requested will not be requested again. Set it to True for URLs that genuinely need to be requested more than once, e.g. Tieba pagination requests where the page data keeps changing. The addresses in start_urls are requested with dont_filter=True, otherwise the spider would never start.
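The deduplication behaviour can be sketched in plain Python. This is a toy model, not Scrapy's actual dupe-filter implementation (the class and method names here are invented for illustration):

```python
class SimpleDupeFilter:
    """Toy model of request deduplication; not Scrapy's real filter."""

    def __init__(self):
        self.seen = set()

    def should_schedule(self, url, dont_filter=False):
        # dont_filter=True bypasses the seen-set check entirely
        if dont_filter:
            return True
        if url in self.seen:
            return False
        self.seen.add(url)
        return True

f = SimpleDupeFilter()
print(f.should_schedule("http://example.com/p/1"))                     # True: first request
print(f.should_schedule("http://example.com/p/1"))                     # False: duplicate, filtered
print(f.should_schedule("http://example.com/p/1", dont_filter=True))   # True: filter bypassed
```

This is why a Tieba page that must be re-fetched needs dont_filter=True: without it, the second request for the same URL is silently dropped.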
In the pagination case above (no "next page" button), you can also loop over all the page links with a for loop:

next_page = response.xpath("//div[@class='meta']/div/a")
if next_page:
    for i in next_page:
        next_page_url = i.xpath("./@href").extract_first()
        yield scrapy.Request(url=response.urljoin(next_page_url), callback=self.parse_a, meta={'name': name})
How to use multiple item model classes in Scrapy
Using multiple items in Scrapy and selecting which item type to process for output
items.py

import scrapy

class ZhongGuoErTongWenXueWangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    title = scrapy.Field()
    zuo_zhe = scrapy.Field()
    text_count = scrapy.Field()

class ZhongGuoErTongWenXueWangItem_1(scrapy.Item):
    name = scrapy.Field()
    title = scrapy.Field()
pipelines.py

Note: you must import the item classes from items.py into pipelines.py.

from zhong_guo_er_tong_wen_xue_wang.items import ZhongGuoErTongWenXueWangItem, ZhongGuoErTongWenXueWangItem_1
class ZhongGuoErTongWenXueWangPipeline(object):
    def process_item(self, item, spider):
        # Only ZhongGuoErTongWenXueWangItem has the zuo_zhe and text_count
        # fields, so that is the type to check for here (checking the
        # two-field Item_1 would raise a KeyError below)
        if isinstance(item, ZhongGuoErTongWenXueWangItem):
            with open(item['name'] + '.txt', 'a', encoding='utf-8') as f:
                f.write('Title: ' + item['title'] + '\n')
                f.write('Author: ' + item['zuo_zhe'] + '\n')
                f.write('Content: ' + item['text_count'] + '\n')
        return item
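The isinstance dispatch in process_item can be exercised without Scrapy by substituting plain classes for the Item subclasses (ItemA, ItemB, and the return strings here are stand-ins for illustration, not part of the project above):

```python
class ItemA:
    """Stand-in for the full item type (e.g. ZhongGuoErTongWenXueWangItem)."""
    pass

class ItemB:
    """Stand-in for the reduced item type (e.g. ZhongGuoErTongWenXueWangItem_1)."""
    pass

def process_item(item):
    # Route each item to the handling branch for its concrete type,
    # exactly like the pipeline's process_item does
    if isinstance(item, ItemA):
        return "full record"
    elif isinstance(item, ItemB):
        return "name/title only"
    return "unknown"

print(process_item(ItemA()))  # full record
print(process_item(ItemB()))  # name/title only
```

The same pattern scales to any number of item classes: one pipeline receives everything the spiders yield, and isinstance decides which branch handles each item.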
>>> a = 2
>>> isinstance(a, int)
True
>>> isinstance(a, str)
False
>>> isinstance(a, (str, int, list))  # True if a matches any type in the tuple
True
The difference between isinstance() and type():

class A:
    pass

class B(A):
    pass

isinstance(A(), A)  # returns True
type(A()) == A      # returns True
isinstance(B(), A)  # returns True  (isinstance treats a subclass instance as the parent type)
type(B()) == A      # returns False (type() only matches the exact class)