I. Parsing Operations

1. Parsing data with XPath
- `response.xpath()`: xpath returns a list, but every element of that list is a Selector object (see the sketch after this list)
- `extract()`: pulls out the string stored in a Selector object's data attribute; calling extract() on the list extracts the data string from every Selector object in it
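A minimal, self-contained sketch of the two calls above, built on a standalone scrapy Selector and a made-up HTML snippet (the markup and variable names are illustrative, not taken from the project):

```python
# Illustrative only: build a Selector from a hypothetical HTML snippet to show
# what xpath() and extract() return.
from scrapy import Selector

html = "<div class='content'><span>line one</span><span>line two</span></div>"
sel = Selector(text=html)

spans = sel.xpath("//div[@class='content']/span/text()")
print(type(spans[0]))      # a Selector object
print(spans[0].extract())  # 'line one' -- the string stored in its data attribute
print(spans.extract())     # ['line one', 'line two'] -- extract() applied to every element
```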
II. Persistent Storage

1. Via terminal command

Can only store the return value of the parse method into a local text file.
- Use the command `scrapy crawl <spider name> -o ./<file name>.csv` to save the output locally.
```python
import scrapy


class QiubaiproSpider(scrapy.Spider):
    name = 'qiubaiPro'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        joke_list = response.xpath("//div[contains(@class, 'article block')]")
        all_dic = []
        for joke in joke_list:
            # the author name is the text node of the h2 tag
            author = joke.xpath("./div/a[2]/h2/text()")[0].extract()
            print(author)
            # the content may span several text nodes, so join them into one string
            content = joke.xpath(".//div[@class='content']/span/text()").extract()
            content = ''.join(content)
            dic = {
                'author': author,
                'content': content
            }
            all_dic.append(dic)
        # this returned list is what `scrapy crawl ... -o` serializes to the file
        return all_dic
```
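With the spider above, running `scrapy crawl qiubaiPro -o ./qiubai.csv` writes the returned list of dicts to a local CSV file with author and content columns.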
2. Via item pipelines

- Coding workflow:
- Parse the data
- Define the relevant fields in the Item class
```python
import scrapy


class ScrapyworkItem(scrapy.Item):
    # define the fields for your item here like:
    author = scrapy.Field()
    content = scrapy.Field()
```
- Wrap the parsed data in an object of the item type
```python
import scrapy
from scrapyWork.items import ScrapyworkItem


class QiubaiproSpider(scrapy.Spider):
    ...

    def parse(self, response):
        ...
        item = ScrapyworkItem()
        item['author'] = author
        item['content'] = content
```
- Submit the item object to the pipeline for persistent storage
```python
import scrapy
from scrapyWork.items import ScrapyworkItem


class QiubaiproSpider(scrapy.Spider):
    ...

    def parse(self, response):
        ...
        item = ScrapyworkItem()
        item['author'] = author
        item['content'] = content
        # submit the item to the pipeline
        yield item
```
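For completeness, a sketch of what the elided parse method looks like once the parsing code from the first example is combined with the item submission; yielding inside the loop hands one item per entry to the pipeline:

```python
# Sketch only: the earlier parse() logic combined with item submission,
# yielding one item per entry instead of returning a list of dicts.
import scrapy
from scrapyWork.items import ScrapyworkItem


class QiubaiproSpider(scrapy.Spider):
    name = 'qiubaiPro'
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        joke_list = response.xpath("//div[contains(@class, 'article block')]")
        for joke in joke_list:
            author = joke.xpath("./div/a[2]/h2/text()")[0].extract()
            content = ''.join(joke.xpath(".//div[@class='content']/span/text()").extract())

            item = ScrapyworkItem()
            item['author'] = author
            item['content'] = content
            # each yielded item is passed to every enabled pipeline in turn
            yield item
```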
- In the pipeline class's process_item method, persist the data carried by the item object it receives
```python
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class ScrapyworkPipeline:
    # called once when the spider starts: open the output file
    def open_spider(self, spider):
        self.fp = open('./qiubai.txt', 'w', encoding='utf-8')
        print('spider started')

    # called for every item the spider yields
    def process_item(self, item, spider):
        author = item['author']
        content = item['content']
        self.fp.write(author + ':' + content + '\n')
        return item

    # called once when the spider finishes: close the file
    def close_spider(self, spider):
        self.fp.close()
        print('spider finished')
```
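The template imports ItemAdapter, but the code above never uses it. Below is a sketch of the same pipeline rewritten with ItemAdapter, which gives dict-like access whether the spider yields Item objects or plain dicts; the file name and field names match the example above:

```python
from itemadapter import ItemAdapter


class ScrapyworkPipeline:
    def open_spider(self, spider):
        self.fp = open('./qiubai.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ItemAdapter wraps the item so the same code works for Item objects and dicts
        adapter = ItemAdapter(item)
        self.fp.write(adapter['author'] + ':' + adapter['content'] + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()
```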
- Enable the pipeline in the settings file
```python
ITEM_PIPELINES = {
    # 300 is the priority: the smaller the number, the higher the priority
    'scrapyWork.pipelines.ScrapyworkPipeline': 300,
}
```
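If the project later registers more than one pipeline, these priority numbers decide the order each item flows through them. The second class name below is hypothetical, purely to illustrate the ordering:

```python
# Sketch only: 'AnotherPipeline' is a hypothetical second pipeline class.
ITEM_PIPELINES = {
    'scrapyWork.pipelines.ScrapyworkPipeline': 300,  # lower number: runs first
    'scrapyWork.pipelines.AnotherPipeline': 400,     # runs second, receiving the item
                                                     # returned by the first pipeline
}
```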