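I'm new to Python. A feature I was writing felt slow, so I switched to threads. Because the program is essentially an infinite loop, every pass keeps creating new threads. In testing it runs fast, but memory usage shoots up. Could someone more experienced take a look?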
=======================================================
import requests
from lxml import etree
import threading
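lock = threading.Lock()  # lock shared by all threads
initial = ['gif图片']  # initial task list (seed keyword, "gif images")
existed = []  # keywords that have already been handled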
# Process the downloaded HTML source: extract keywords and write them to a file
def processHTML(self):
    etree_obj = etree.HTML(self)
    table_th = etree_obj.xpath('//div[@id="rs"]/table//tr//th//text()')
    print('Current seed words: ' + str(table_th))
    if (len(table_th) > 0):
        for item in range(len(table_th)):
            # Filter keywords: each one must contain a given word
            if 'gif' in table_th[item]:
                # A keyword missing from the existed list is new, so add it to the task list
                if table_th[item] not in existed:
                    initial.append(table_th[item])
                    # Constant file IO: without the lock, errors leave the file contents garbled
                    with lock:
                        # Append the newly collected keyword to the file
                        with open('keyword.txt', "a", encoding='utf-8') as file:
                            file.write(table_th[item] + "\n")
    print('Pending: ' + str(len(initial)))
    print('Collected: ' + str(len(existed)))
    if (len(initial) > 0):
        with lock:
            runStart()
# Download the page source (runs inside a thread)
def threadDown(self):
    userAgen = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36 Edg/81.0.41'
    }
    try:
        # Named response so the built-in str() used in the except branch is not shadowed
        response = requests.get(self, headers=userAgen, timeout=30)
        # Pass the source on to extract keywords
        processHTML(response.text)
    except Exception as Err:
        print('Error: ' + str(Err))
# Build the search URL for a keyword
def create_url(self):
    print('Current keyword: ' + self)
    threadDown('https://www.baidu.com/s?wd=' + self + '&pn=0')
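def runStart():
    i = 0
    # i is never incremented; the loop ends because pop() shrinks the list
    while i < len(initial):
        # print(initial[i])
        existed.append(initial[i])
        # create_url(initial[i])  # pass the keyword on to build the crawl URL
        # One new thread is started for every pending keyword
        run = threading.Thread(target=create_url, args=(initial[i],))
        run.start()
        initial.pop(i)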
if __name__ == '__main__':
    runStart()
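Running the code above keeps creating threads without end, until memory usage reaches 99%.
As for what it is meant to do: I want it to keep collecting keywords indefinitely, so after every fetch it checks the initial list and starts again. Ideally it would look at how many tasks are waiting in the initial list, assign a matching number of threads to run them, and release those threads afterwards. For example, with more than 100 tasks it would run 5 threads, and with more than 500 it would run 10 or more. I'm a bit lost on how to write this, and I'd appreciate pointers from anyone more experienced who has the time.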
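
For reference, one possible shape for the behaviour described above is a fixed pool of worker threads fed from a queue.Queue, instead of one new thread per keyword. This is only a rough sketch under assumptions, not a tested replacement for the code in the post: pick_worker_count, crawl, fetch_related and max_keywords are hypothetical names, the floor of 2 workers and the cap of 1000 keywords are made up for illustration, and fetch_related stands in for the download and xpath step so the sketch runs without network access.

```python
import queue
import threading


def pick_worker_count(pending):
    """Hypothetical sizing rule based on the numbers mentioned in the post."""
    if pending > 500:
        return 10
    if pending > 100:
        return 5
    return 2  # assumed floor for small backlogs (not from the post)


def crawl(seeds, fetch_related, workers=5, max_keywords=1000):
    """Expand seed keywords using a fixed number of worker threads.

    fetch_related(keyword) must return an iterable of related keywords.
    It is a placeholder for the requests + lxml step in the original code.
    """
    tasks = queue.Queue()
    seen = set()
    seen_lock = threading.Lock()

    def submit(keyword):
        # Deduplicate and enforce the cap under a lock, so two workers
        # can never enqueue the same keyword twice.
        with seen_lock:
            if keyword in seen or len(seen) >= max_keywords:
                return
            seen.add(keyword)
        tasks.put(keyword)

    def worker():
        while True:
            keyword = tasks.get()
            if keyword is None:  # sentinel: time to shut down
                tasks.task_done()
                return
            try:
                for related in fetch_related(keyword):
                    submit(related)
            except Exception as err:
                print('Error:', err)
            finally:
                tasks.task_done()

    for seed in seeds:
        submit(seed)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    tasks.join()  # returns once every queued keyword has been processed
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return seen
```

The thread count stays constant however many keywords are discovered, because new keywords go back onto the queue rather than spawning threads. A caller could size the pool once up front, for example crawl(initial, fetch_related, workers=pick_worker_count(len(initial))).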