Crawled URLs are stored in a set(), which the pickle package serializes from memory and saves to a file:
import pickle

def save_progress(self, path, data):
    """
    Save the current crawl progress (the set of visited URLs) to a file.
    :param path: path of the file to write
    :param data: the set() to serialize
    :return: None
    """
    with open(path, 'wb') as f:
        pickle.dump(data, f)
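As a quick sanity check of this approach, the sketch below round-trips a set of URLs through pickle; the file name and URLs here are made up for illustration:

import pickle

# Dump a set of (made-up) URLs to disk, load it back, and confirm
# the contents survive the round trip.
urls = {'https://example.com/page/1', 'https://example.com/page/2'}
with open('progress_demo.bin', 'wb') as f:
    pickle.dump(urls, f)
with open('progress_demo.bin', 'rb') as f:
    restored = pickle.load(f)
assert restored == urls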
However, as the volume of crawled data grows, this set consumes more and more memory, so the hashlib package is used to hash each URL with MD5 before storing it, cutting memory consumption several-fold:
import hashlib

def get_new_url(self):
    """
    Get a new URL from the container, add its MD5 digest to old_urls
    to reduce memory consumption, and return the URL.
    :return: the next URL to crawl
    """
    new_url = self.new_urls.pop()
    m = hashlib.md5()
    m.update(new_url.encode('utf-8'))
    md5_url = m.hexdigest()
    self.old_urls.add(md5_url)
    return new_url
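To see where the saving comes from: MD5 collapses an arbitrarily long URL to a fixed 32-character hex digest. A quick check (the sample URL below is made up for illustration):

import hashlib
import sys

# A made-up long URL versus its fixed-length MD5 hex digest.
url = 'https://example.com/jobs/search?keyword=python&city=beijing&page=17&sort=date'
digest = hashlib.md5(url.encode('utf-8')).hexdigest()

print(len(url), sys.getsizeof(url))        # grows with the URL
print(len(digest), sys.getsizeof(digest))  # always 32 characters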
The next time the crawler starts, the progress file is read back into memory so that crawling resumes where the previous run left off:
def load_progress(self, path):
    """
    Load the saved progress from a local file.
    :return: the saved set(), or an empty set() if the file does not exist
    """
    try:
        with open(path, 'rb') as f:
            tmp = pickle.load(f)
            print('Resuming progress from %s' % path)
            return tmp
    except FileNotFoundError as e:
        print(e, 'No progress file found, creating: %s' % path)
        return set()
We also need to check whether this is the first crawl:
import os

# Resume from the previous run only if a progress file already exists.
if os.path.exists('python_job_old_urls.txt'):
    old_urls = crawl.url_manager.load_progress('python_job_old_urls.txt')
    crawl.url_manager.old_urls = old_urls
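Putting the pieces together, the three methods presumably live on a URL-manager class. One point the snippets above leave implicit: because old_urls stores digests rather than raw URLs, deduplication must hash each candidate URL before the membership test. A minimal self-contained sketch (the UrlManager name, the add_new_url helper, and the __init__ wiring are assumptions, not the original code):

import hashlib
import os
import pickle

class UrlManager:
    PROGRESS_FILE = 'python_job_old_urls.txt'

    def __init__(self):
        self.new_urls = set()
        # Resume from a previous run if a progress file exists.
        self.old_urls = (self.load_progress(self.PROGRESS_FILE)
                         if os.path.exists(self.PROGRESS_FILE) else set())

    def add_new_url(self, url):
        # old_urls holds MD5 digests, so hash before the membership test.
        md5_url = hashlib.md5(url.encode('utf-8')).hexdigest()
        if md5_url not in self.old_urls:
            self.new_urls.add(url)

    def get_new_url(self):
        new_url = self.new_urls.pop()
        self.old_urls.add(hashlib.md5(new_url.encode('utf-8')).hexdigest())
        return new_url

    def save_progress(self, path, data):
        with open(path, 'wb') as f:
            pickle.dump(data, f)

    def load_progress(self, path):
        try:
            with open(path, 'rb') as f:
                return pickle.load(f)
        except FileNotFoundError:
            return set()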