Crawler functions in Python: how do you call functions in a multithreaded Python 3 crawler?

Latest recommended article published on 2022-03-26 20:23:56

weixin_39852647

Tags: crawler functions in Python

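Functions and Python crawlers can generally be used together, but only if you understand how the functions work well enough to connect them with what you already know about crawling. After several days of material on multithreading and crawlers, you should be fairly familiar with the relevant modules, so this post revisits an old friend, the download() function, and looks at how it is used in a Python crawler.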

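The Crawler class is built as follows: the constructor opens a log file; download() creates and starts a thread; update_queque_url() refreshes the list of queued links; get_Url() uses bs4 to match and extract links; download_all() calls download() repeatedly to download in batches; and spider() is the entry point that drives the crawl. The code relies on a CrawlerThread class and on the module-level names g_queue_urls, g_exist_urls, g_urls and g_total_count, which this post does not define.

from bs4 import BeautifulSoup

# Module-level state shared with CrawlerThread. The post does not show these
# definitions, so the initial values below are assumptions.
g_queue_urls = []   # URLs waiting to be crawled
g_exist_urls = []   # URLs already crawled
g_urls = []         # contents of the pages downloaded so far
g_total_count = 0   # number of downloads started

class Crawler: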

    def __init__(self, name, domain, thread_number):
        self.name = name
        self.domain = domain
        self.thread_number = thread_number
        self.logfile = open('log.txt', 'w')
        self.thread_pool = []
        self.url = 'http://' + domain

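    def spider(self):
        # The queue contents change as the crawl proceeds
        global g_queue_urls
        # Initially the queue holds a single URL
        g_queue_urls.append(self.url)
        # Crawl depth
        depth = 0
        print(f'Crawler {self.name} starting...')
        while g_queue_urls:
            depth += 1
            print(f'Current crawl depth: {depth}')
            self.logfile.write(f'URL:{g_queue_urls[0]}\n')
            self.download_all()  # download everything in the queue
            self.update_queque_url()  # refresh the URL queue
            self.logfile.write(f">>>Depth:{depth}\n")
            count = 0
            # The loop condition is cut off in the source; this bound
            # (every URL queued for the next depth) is assumed.
            while count < len(g_queue_urls):
                self.logfile.write(f"Crawled {g_total_count} pages so far; queued: {g_queue_urls[count]}\n")
                count += 1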

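    def download_all(self):
        global g_queue_urls
        global g_total_count
        i = 0
        while i < len(g_queue_urls):
            j = 0
            # The loop condition is cut off in the source; a batch of at most
            # thread_number downloads is assumed.
            while j < self.thread_number and i + j < len(g_queue_urls):
                g_total_count += 1
                print(g_queue_urls[i + j])
                thread_result = self.download(g_queue_urls[i + j], f"{g_total_count}.html", j)
                if thread_result is not None:
                    print(f'Thread {i + j} started')
                j += 1
            i = i + j
            # Wait up to 25 seconds for each thread before the next batch
            for thread in self.thread_pool:
                thread.join(25)
        g_queue_urls = []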

    def download(self, url, filename, tid):
        print(url, filename, tid)
        crawler_thread = CrawlerThread(url, filename, tid)
        self.thread_pool.append(crawler_thread)
        crawler_thread.start()
        # Return the thread so download_all() can tell that it was started
        return crawler_thread

    def update_queque_url(self):
        global g_queue_urls
        global g_exist_urls  # URLs already crawled
        new_urls = []  # newly discovered URLs
        for url_content in g_urls:
            new_urls += self.get_Url(url_content)  # extract new URLs from the page
        # Drop duplicates and URLs that have already been crawled
        g_queue_urls = list(set(new_urls) - set(g_exist_urls))

    def get_Url(self, content):
        '''
        Extract URLs from the page source.
        '''
        links = []  # collected href values
        try:
            soup = BeautifulSoup(content, 'html.parser')
            for link in soup.find_all('a'):
                if link is not None and link.get('href') is not None:
                    if self.domain in link['href']:
                        # The link is an absolute URL on this site;
                        # store the href string, not the tag object
                        links.append(link['href'])
                    elif len(link['href']) > 10 and 'http://' not in link['href']:
                        # The link is a relative URL
                        links.append(self.url + link['href'])
        except Exception as e:
            print("fail to get url", e)
        return links
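
The link filtering in get_Url() and the set difference in update_queque_url() can be tried out in isolation. The following is a minimal sketch, not the author's code: it swaps BeautifulSoup for the standard library's html.parser so that it runs without extra packages, and it uses urllib.parse.urljoin instead of string concatenation to resolve relative links. The names LinkCollector, extract_links and next_queue are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)


def extract_links(content, base_url):
    """Return the absolute same-site URLs found in content, in page order."""
    collector = LinkCollector()
    collector.feed(content)
    site = urlparse(base_url).netloc
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolves relative hrefs
        if urlparse(absolute).netloc == site:
            links.append(absolute)
    return links


def next_queue(new_urls, exist_urls):
    """Drop duplicates and already-crawled URLs (order is not preserved)."""
    return list(set(new_urls) - set(exist_urls))


if __name__ == "__main__":
    page = '<a href="/a/1.html">one</a><a href="http://other.example/x">two</a>'
    found = extract_links(page, 'http://www.geyanw.com')
    print(next_queue(found, []))
```

Comparing the host name with urlparse is stricter than the substring test in get_Url(), which would also accept an off-site link that merely contains the domain somewhere in its path.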

The main block calls the crawler's spider() method:

if __name__ == "__main__":
    domain = "www.geyanw.com"
    thread_number = 10
    name = "geyan"
    crawler = Crawler(name, domain, thread_number)
    crawler.spider()
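
Running the script also needs the CrawlerThread class that download() instantiates, which this post does not show. The following is a minimal sketch of what such a class could look like, assuming it subclasses threading.Thread, fetches the page with urllib.request, saves it under the given file name, and records the result in the shared lists. The fetch parameter, the g_lock lock and the fetch_page helper are hypothetical additions for illustration; fetch lets the class be exercised without network access.

```python
import threading
import urllib.request

g_exist_urls = []   # URLs that have been fetched (assumed shared state)
g_urls = []         # page contents collected so far (assumed shared state)
g_lock = threading.Lock()  # hypothetical: guards the two lists above


def fetch_page(url):
    """Download url and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=20) as response:
        return response.read()


class CrawlerThread(threading.Thread):
    def __init__(self, url, filename, tid, fetch=fetch_page):
        super().__init__()
        self.url = url
        self.filename = filename
        self.tid = tid
        self.fetch = fetch  # injectable so the thread can be tested offline

    def run(self):
        try:
            content = self.fetch(self.url)
            with open(self.filename, 'wb') as f:
                f.write(content)
        except Exception as e:
            print(f'thread {self.tid} failed on {self.url}:', e)
            return
        # Record the result only after the page was saved successfully
        with g_lock:
            g_exist_urls.append(self.url)
            g_urls.append(content)
```

With this shape, download() can keep calling CrawlerThread(url, filename, tid) and the default fetch_page is used; a failed download is printed and leaves the shared lists unchanged.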

Besides download(), spider() can also be called in a Python crawler. Both functions have come up several times in earlier posts; see those explanations for more detail on how to use them.
