Scrapy Spider Templates: csvfeed
Create the project and generate a spider from the csvfeed template (the project is named csvpjt here so that the `from csvpjt.items import CsvpjtItem` import used later resolves):

```shell
scrapy startproject csvpjt
cd csvpjt
scrapy genspider -l        # list available templates: basic, crawl, csvfeed, xmlfeed
scrapy genspider -t csvfeed mycsvspider "iqianyue.com"
```

After generation, mycsvspider.py looks like this:
```python
# -*- coding: utf-8 -*-
from scrapy.spiders import CSVFeedSpider


class MycsvspiderSpider(CSVFeedSpider):
    name = 'mycsvspider'
    allowed_domains = ['iqianyue.com']
    start_urls = ['http://iqianyue.com/feed.csv']
    # headers = ['id', 'name', 'description', 'image_link']
    # delimiter = '\t'

    # Do any adaptations you need here
    # def adapt_response(self, response):
    #     return response

    def parse_row(self, response, row):
        i = {}
        # i['url'] = row['url']
        # i['name'] = row['name']
        # i['description'] = row['description']
        return i
```
Where:

- headers: a list of the column names used to extract the fields contained in the CSV file.
- delimiter: the string used to separate fields (e.g. ',' or '\t').
- parse_row(): called once per data row; it receives the response object together with a dict mapping each header to that row's value, and returns the extracted item.
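Conceptually, CSVFeedSpider splits the response body into rows, zips each row with `headers` using `delimiter`, and hands the resulting dict to `parse_row()`. The standalone sketch below reproduces that flow with only the stdlib `csv` module; the sample data and the header-line skip are illustrative assumptions, not Scrapy internals:

```python
import csv
import io

# Sample CSV data, standing in for the body of an HTTP response.
SAMPLE = "name,sex\nsteve,male\nalice,female\n"


def parse_row(row):
    # Per-row callback, analogous to CSVFeedSpider.parse_row().
    return {"name": row["name"], "sex": row["sex"]}


def parse_rows(body, headers, delimiter=","):
    # Zip each data line with the header list, as CSVFeedSpider does.
    reader = csv.DictReader(io.StringIO(body), fieldnames=headers,
                            delimiter=delimiter)
    next(reader)  # skip the header line present in the file itself
    return [parse_row(row) for row in reader]


items = parse_rows(SAMPLE, headers=["name", "sex"])
print(items)
```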
The URL we want to crawl: http://yum.iqianyue.com/weisuenbook/pyspd/part12/mydata.csv

Modify mycsvspider.py as follows:
```python
# -*- coding: utf-8 -*-
from scrapy.spiders import CSVFeedSpider
from csvpjt.items import CsvpjtItem


class SteveSpider(CSVFeedSpider):
    name = 'steve'
    allowed_domains = ['iqianyue.com']
    start_urls = ['http://yum.iqianyue.com/weisuenbook/pyspd/part12/mydata.csv']
    # headers: the list of column names used to extract fields from the CSV file
    headers = ['name', 'sex', 'addr', 'email']
    # delimiter: the separator between fields
    delimiter = ','

    # Do any adaptations you need here
    # def adapt_response(self, response):
    #     return response

    # Called once per data row: receives the response and a dict for that row
    def parse_row(self, response, row):
        item = CsvpjtItem()
        item["name"] = row['name'].encode()
        item["sex"] = row['sex'].encode()
        print("Name:")
        print(item["name"])
        print("Sex:")
        print(item["sex"])
        print("--------------------------------------")
        return item
```
Run the spider (its name attribute is 'steve', which is what `scrapy crawl` takes):

```shell
scrapy crawl steve --nolog
```