1. Installing Python and Scrapy
http://www.scrapyd.cn/doc/ is the Chinese Scrapy documentation; the Scrapy installation instructions there work well and are highly recommended.
2. Writing the crawler, problems encountered, and their solutions
Following the workflow on the Scrapy Chinese site (creating the project, writing the spider, CSS selectors, pagination) is all explained there in detail. Writing the data to a txt file would be simple, but my requirement was to write it into MySQL. The data volume is not huge, though not trivial either, and quite a few problems came up while writing to MySQL.
First, define the item fields in items.py: one class per table, and if there are several tables, create several classes:
import scrapy


class ProjectItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    companyName = scrapy.Field()
    companyRank = scrapy.Field()
    companyCode = scrapy.Field()
    proQitName = scrapy.Field()
    proQit = scrapy.Field()

    def get_insert_sql(self):
        insert_sql = """insert statement"""
        params = (self['companyName'], self['companyCode'], self['proQit'], self['proQitName'])
        return insert_sql, params


class GreenConst(scrapy.Item):
    companyName = scrapy.Field()
    companyRank = scrapy.Field()
    companyCode = scrapy.Field()
    greenConst = scrapy.Field()
    greenConstName = scrapy.Field()

    def get_insert_sql(self):
        insert_sql = """insert statement"""
        params = (self['companyName'], self['companyCode'], self['greenConst'], self['greenConstName'])
        return insert_sql, params


class Safety(scrapy.Item):
    companyName = scrapy.Field()
    companyRank = scrapy.Field()
    companyCode = scrapy.Field()
    proSafety = scrapy.Field()
    proSafetyName = scrapy.Field()

    def get_insert_sql(self):
        insert_sql = """insert statement"""
        params = (self['companyName'], self['companyCode'], self['proSafety'], self['proSafetyName'])
        return insert_sql, params
I created three tables, hence the three classes. As for why get_insert_sql is defined in items.py: I insert into MySQL asynchronously, and if the insert statements were defined in pipelines.py, an item's data could get paired with the wrong SQL (I read this explanation on a blog and have not tested it myself).
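The insert statements themselves are elided above. As a rough sketch of what one of them might return, assuming a hypothetical table project_quality whose columns simply mirror the ProjectItem fields (the table and column names are made up for illustration):

def get_insert_sql(self):
    # hypothetical table/column names, only to show the expected (sql, params) shape
    insert_sql = """
        INSERT INTO project_quality (company_name, company_code, pro_qit, pro_qit_name)
        VALUES (%s, %s, %s, %s)
    """
    params = (self['companyName'], self['companyCode'],
              self['proQit'], self['proQitName'])
    return insert_sql, params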
Defining pipelines.py
import pymysql
import pymysql.cursors
from twisted.enterprise import adbapi


class ProjectPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        # build an asynchronous connection pool from the MySQL settings
        dbpool = adbapi.ConnectionPool(
            "pymysql",
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DB"],
            user=settings["MYSQL_USER"],
            password=settings["MYSQL_PASSWORD"],
            charset=settings["CHARSET"],
            cursorclass=pymysql.cursors.DictCursor,
            use_unicode=True,
        )
        return cls(dbpool)

    def process_item(self, item, spider):
        # run the insert in a pool thread; errors are routed to handle_error
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item)
        return item

    def handle_error(self, failure, item):
        print(u'Insert failed, reason: {}, item: {}'.format(failure, item))

    def do_insert(self, cursor, item):
        insert_sql, args = item.get_insert_sql()
        cursor.execute("table creation statement")  # one per table, three in total
        cursor.execute("table creation statement")
        cursor.execute("table creation statement")
        cursor.execute(insert_sql, args)
pipelines.py establishes the connection to MySQL. Because the inserts are asynchronous, the connection pool comes from Twisted (adbapi); once connected you don't have to worry about closing it, the framework takes care of that. In my do_insert method the table-creation statements create the three tables corresponding to the items, and cursor.execute(insert_sql, args) then writes each item's data into its matching table.
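The table-creation statements are likewise elided. Here is a minimal sketch of what do_insert could look like with one of them filled in, again using the hypothetical project_quality schema from above (IF NOT EXISTS keeps the statement safe to run before every insert):

def do_insert(self, cursor, item):
    insert_sql, args = item.get_insert_sql()
    # hypothetical schema that mirrors the ProjectItem fields
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS project_quality (
            id INT AUTO_INCREMENT PRIMARY KEY,
            company_name VARCHAR(255),
            company_code VARCHAR(64),
            pro_qit VARCHAR(255),
            pro_qit_name VARCHAR(255)
        ) DEFAULT CHARSET = utf8mb4
    """)
    # ...the other two tables would be created the same way...
    cursor.execute(insert_sql, args)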
Writing settings.py
MYSQL_HOST = ''
MYSQL_USER = ''
MYSQL_PASSWORD = ''
MYSQL_PORT = 3306
MYSQL_DB = ''
CHARSET = ''
ROBOTSTXT_OBEY = False
COOKIES_ENABLED = False
# number of retries on failure
RETRY_ENABLED = True
RETRY_TIMES = 3
# download delay
DOWNLOAD_DELAY = 1
# enable the pipeline
ITEM_PIPELINES = {
    'project.pipelines.ProjectPipeline': 300,
}
For disguising the Scrapy request headers I only did something very simple, directly in the spider; I did not set up rotating or dynamic IPs.
Writing the spider
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.5,en;q=0.3",
    "Accept-Encoding": "gzip, deflate",
    'Content-Length': '0',
    "Connection": "keep-alive",
    "Cache-Control": "no-cache"
}
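# A minimal sketch (not from the original post) of how these headers could be attached
# to the very first request; the URL and the meta keys are assumptions, inferred from
# what parse() reads below:
def start_requests(self):
    first_url = ""  # the real listing URL is not shown in this post
    yield scrapy.Request(first_url, headers=self.headers, callback=self.parse,
                         meta={'draw': 1, 'number': 0, 'start': 0})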
def parse(self, response):
    meta = response.request.meta
    draw = meta.get('draw')
    number = meta.get('number')
    start = meta.get('start')
    # read Cache-Control from the response headers
    cookie = response.headers.get("Cache-Control")
    if cookie == b'no-cache':
        a = response.text
        data0 = json.loads(a)
        allname = data0['rows']
        for number in range(0, 10):
            finall = json.dumps(allname[number])
            finall = json.loads(finall)
            TotalRank = finall['TotalRank']
            EnterpriseCode = finall['EnterpriseCode']
            EnterpriseName = finall['EnterpriseName']
            ComputeDate = finall['ComputeDate']

def text(self, response):
    a = response.css(".table-hover td::text").extract()

# crawling the project-quality data
def content(self, response):
    meta = response.meta
    EnterpriseName = meta['EnterpriseName']
    EnterpriseCode = meta['EnterpriseCode']
    cookie = response.headers.get("Cache-Control")
    if cookie != b'private':
        # the response carried no data: re-request the same URL
        print("error")
        yield scrapy.Request(
            "",
            callback=self.content, dont_filter=True, headers=self.headers,
            meta=dict(EnterpriseCode=EnterpriseCode, EnterpriseName=EnterpriseName))
    else:
        a = response.css(".table-hover td::text").extract()
        a = '\\N'.join(a)
        a = a.replace('\n', '').replace('\r', '').replace('\t', '')
        a = a.split('\\N')
        for i in range(0, len(a), 3):
            if i + 1 < len(a):
                s = a[i + 1].replace('\\', '')
                item = ProjectItem()
                filename = 'gczl' + self.date + '.txt'
                write = EnterpriseName + '\t' + EnterpriseCode + '\t' + a[i + 2] + '\t' + s
                with open(filename, "a+", encoding='utf8') as f:
                    f.write(write)
                    f.write('\n')
        yield scrapy.Request(
            "", callback=self.content, dont_filter=True, headers=self.headers,
            meta=dict(EnterpriseCode=EnterpriseCode, EnterpriseName=EnterpriseName))
I have only written part of the spider and it is not finished; the point is to show the simplest form of request-header disguise, how data is passed along through meta, and, most importantly:
The data-loss problem
Data can get lost for many reasons; the likely causes are:
1. An exception while writing to MySQL
2. Crawling too fast, so reads and writes fall out of step
3. Lost connections
4. The response succeeds, but the returned page contains no data
I ran into the fourth case. I searched a lot of material and never found this cause documented; I discovered it by monitoring the requested URLs with Fiddler and inspecting the response headers. If the Cache-Control response header is no-store, the returned page has no data. Add a check for this and re-request the URLs that came back empty, and the data stops disappearing. Because Scrapy filters duplicate URLs by default, the retry request also needs dont_filter=True; with that, the problem is completely solved.
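Expressed in code, the check described above comes down to something like this sketch (the callback name and the reuse of response.url and response.meta are my assumptions; the original snippet leaves the URL blank):

def content(self, response):
    # Cache-Control: no-store means the page rendered without data, so re-issue the
    # same request; dont_filter=True keeps Scrapy's duplicate filter from dropping
    # the retry of an already-seen URL.
    if response.headers.get("Cache-Control") == b'no-store':
        yield scrapy.Request(response.url, callback=self.content,
                             headers=self.headers, dont_filter=True,
                             meta=response.meta)
        return
    # ...otherwise parse the table data as usual...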
I referred to a lot of blog posts along the way; there were too many and I did not save them at the time, so I won't list them here.