I have a large number of websites that have already been mirrored to S3. These sites need to be indexed for search (Google is not an option here), so I have some Celery workers that, given a list of URLs aws_keys, do the following:

import requests
from bs4 import BeautifulSoup

for k in aws_keys:
    try:
        req = requests.get(k, stream=True)
        if req.status_code >= 400:
            raise requests.RequestException('URL [{}] returned status [{}]'.format(k, req.status_code))
        soup = BeautifulSoup(req.raw.data, 'lxml')
        title = soup.title.text.strip()
        # Extract the style, script, [document], head and title tags from the soup
        extracts = [s.extract() for s in soup(['style', 'script', '[document]', 'head', 'title'])]
        # Get the text of the page
        text = soup.get_text(' ', strip=True)
        # save to db
        soup.decompose()  # must be called; a bare soup.decompose is a no-op
        req.close()
        req = None
    except (requests.RequestException, AttributeError) as e:
        self.log.error('Exception occurred accessing key [{0}]. Message: [{1}]'.format(k, e))
        continue
    except UnicodeEncodeError as e:
        self.log.error('Unicode Exception [{0}]'.format(e))
        continue
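To see where the memory actually accumulates, one option is to diff allocation snapshots from the stdlib tracemalloc module around a batch of keys. A minimal sketch, where process_batch is a hypothetical helper that runs the loop above over a slice of keys:

import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()
process_batch(aws_keys[:1000])  # hypothetical: run the indexing loop over one batch
after = tracemalloc.take_snapshot()
# Print the source lines whose Python-level allocations grew the most
for stat in after.compare_to(before, 'lineno')[:10]:
    print(stat)

Note that tracemalloc only sees Python-level allocations; memory held inside lxml's C structures will not show up in the diff.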
Now, once aws_keys has more than 10,000 items, I notice that the worker processes consume more and more memory, to the point where all of it is used up and they have to be restarted.
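Independent of finding the leak itself, Celery can bound the growth by recycling its child processes; a config sketch, assuming Celery 4+ setting names and with arbitrary example thresholds:

# celeryconfig.py
worker_max_tasks_per_child = 100       # replace each child after 100 tasks
worker_max_memory_per_child = 200000   # ...or once it exceeds ~200 MB (value is in KiB)

The same knobs are also available as worker CLI flags (--max-tasks-per-child, --max-memory-per-child).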
Now, I suspect two places where this could be happening: the requests response and the BeautifulSoup soup. However, since both of those objects are destroyed inside the loop, they should be garbage collected and their memory released.
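That assumption is worth testing. A BeautifulSoup tree is full of parent/child reference cycles, so dropping the last reference does not free it through reference counting alone; the tree sits around until CPython's cyclic collector runs (calling decompose() breaks the tree apart so it can be reclaimed promptly). A small sketch to observe this:

import gc
from bs4 import BeautifulSoup

html = '<html><head><title>t</title></head><body><p>hello</p></body></html>'
soup = BeautifulSoup(html, 'lxml')
soup = None          # drop the only reference; the cycles keep the tree alive
print(gc.collect())  # the cycle collector reports the unreachable objects it found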