Download the 科技特派员 (science and technology commissioner) CSV file from the Guizhou Province open data platform; the file URL is http://gzopen.oss-cn-guizhou-a.aliyuncs.com/科技特派员.csv
- Create the project with the command:
>>>scrapy startproject csvfeedspider
- Enter the project directory and generate a spider from the csvfeed template:
>>>cd csvfeedspider
>>>scrapy genspider -t csvfeed csvdata gzdata.gov.cn
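After these two commands, the project skeleton should look roughly like the layout below (the exact set of generated files varies slightly across Scrapy versions):

    csvfeedspider/
        scrapy.cfg                # deploy configuration
        csvfeedspider/
            __init__.py
            items.py              # item definitions (edited in the next step)
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py
                csvdata.py        # spider skeleton generated from the csvfeed template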
- Write the items file (items.py)
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://docs.scrapy.org/en/latest/topics/items.html
    import scrapy


    class CsvfeedspiderItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        # Name (姓名)
        name = scrapy.Field()
        # Research field (研究领域)
        SearchField = scrapy.Field()
        # Service category (服务分类)
        Service = scrapy.Field()
        # Specialty (专业特长)
        Specialty = scrapy.Field()
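As a quick sanity check of the definition, a scrapy.Item supports dict-style access; the values below are hypothetical, only to show how the item behaves:

    from csvfeedspider.items import CsvfeedspiderItem

    # Hypothetical values, purely for illustration
    item = CsvfeedspiderItem(name='张三', SearchField='农业',
                             Service='技术咨询', Specialty='果树栽培')
    print(item['name'])   # dict-style field access
    print(dict(item))     # converts cleanly to a plain dict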
- Write the spider file csvdata.py
    from scrapy.spiders import CSVFeedSpider
    from csvfeedspider.items import CsvfeedspiderItem


    class CsvdataSpider(CSVFeedSpider):
        name = 'csvdata'
        allowed_domains = ['gzdata.gov.cn']
        start_urls = ['http://gzopen.oss-cn-guizhou-a.aliyuncs.com/科技特派员.csv']
        # Map the four CSV columns, in order, to these field names
        headers = ['name', 'SearchField', 'Service', 'Specialty']
        delimiter = ','
        # Standard CSV quote character
        quotechar = '"'

        # Do any adaptations you need here
        def adapt_response(self, response):
            # The file is GB18030-encoded; decode it so the Chinese text
            # parses correctly (Scrapy's CSV iterator also accepts a str)
            return response.body.decode('gb18030')

        def parse_row(self, response, row):
            # Called once per CSV row; copy each column into an item
            i = CsvfeedspiderItem()
            i['name'] = row['name']
            i['SearchField'] = row['SearchField']
            i['Service'] = row['Service']
            i['Specialty'] = row['Specialty']
            return i
In the adapt_response() method we decode the response body from GB18030 so that the Chinese data can be extracted correctly.
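An equivalent approach, assuming the download produced a TextResponse (the usual case for a text/csv file), is to re-declare the response encoding and let Scrapy decode the body itself; this is a sketch of that alternative, not part of the original example:

    # Alternative sketch: keep the Response object and only fix its
    # declared encoding; Scrapy then decodes the body with it.
    def adapt_response(self, response):
        return response.replace(encoding='gb18030')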
- Run the spider
>>>scrapy crawl csvdata
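By default the scraped items only appear in the crawl log. Scrapy's feed exports can also write them to a file via the -o option; the output file name below is arbitrary:

>>>scrapy crawl csvdata -o tepaiyuan.json

If the Chinese text shows up as \uXXXX escapes in the JSON output, adding FEED_EXPORT_ENCODING = 'utf-8' to settings.py makes the exporter write it as readable UTF-8.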