I am new to Scrapy. I used XPath to match the page structure and wrote the following parse function:
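def parse(self, response):
    # Each teacher entry sits in a <div class="li_txt"> block.
    teacherList = response.xpath('//div[@class="li_txt"]')
    teacherItem = []
    for node in teacherList:
        item = ItcastItem()
        # extract() returns a list of strings; the first match is used below.
        name = node.xpath('./h3/text()').extract()
        title = node.xpath('./h4/text()').extract()
        info = node.xpath('./p/text()').extract()
        item['name'] = name[0]
        item['title'] = title[0]
        item['info'] = info[0]
        teacherItem.append(item)
    # Return the items so that the feed exporter (-o) receives them.
    return teacherItem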
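One caveat with indexing name[0] directly: if an XPath matches nothing, extract() returns an empty list and the index raises IndexError. Below is a minimal sketch of a guard in plain Python. The helper name first_or_default and the sample lists are hypothetical and are not part of Scrapy; Scrapy selectors also offer extract_first(), which serves a similar purpose.

```python
def first_or_default(values, default=""):
    """Return the first element of values, or default if values is empty."""
    return values[0] if values else default


# Lists standing in for the result of .extract() (hypothetical data).
name = ["Teacher A"]
info = []  # the XPath matched nothing

print(first_or_default(name))
print(repr(first_or_default(info)))
```

In parse, item['info'] = first_or_default(info) would then store an empty string instead of aborting the whole callback.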
Then run:
scrapy crawl itcast -o teacher.json
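Afterwards I found that the text in the JSON file was not human-readable: the Chinese characters were written as \uXXXX escape sequences. After looking into it, I found that adding one line to settings.py solves the problem completely: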
FEED_EXPORT_ENCODING = 'utf-8'
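To illustrate the difference between the two kinds of output, here is a small sketch using only the standard json module, not Scrapy itself. The sample record is hypothetical, and the assumption is that the exporter's default output resembles the escaped form while the utf-8 setting yields the readable form.

```python
import json

# Sample record standing in for one scraped item (hypothetical data).
item = {"name": "张老师", "title": "高级讲师"}

# Default json.dumps behaviour: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(item)

# With ensure_ascii=False the characters are written as-is.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same data, so the escaped file is not corrupted, only harder to read.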
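As for exporting a .csv file to open in Excel: in my tests the utf-8 encoding did not work well there, so the output should use the gbk encoding instead: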
FEED_EXPORT_ENCODING = 'gbk'
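A likely explanation, stated here as an assumption rather than something verified in the original test, is that Excel on a Chinese-locale Windows system reads a CSV without a byte order mark as GBK. The sketch below uses only the standard library to show what happens when UTF-8 bytes are read as GBK. The helper to_csv_bytes and the sample rows are hypothetical.

```python
import csv
import io

# Rows standing in for exported items (hypothetical data).
rows = [["name", "title"], ["张老师", "高级讲师"]]


def to_csv_bytes(rows, encoding):
    """Serialize rows to CSV text and encode it with the given codec."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue().encode(encoding)


utf8_bytes = to_csv_bytes(rows, "utf-8")
gbk_bytes = to_csv_bytes(rows, "gbk")

# A reader that assumes GBK garbles the UTF-8 file but reads the GBK file correctly.
misread = utf8_bytes.decode("gbk", errors="replace")
correct = gbk_bytes.decode("gbk")

print(misread)
print(correct)
```

Note that gbk can only encode characters that exist in that character set, so this setting suits Chinese text aimed at Excel rather than arbitrary Unicode content.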