Web Crawler (3): Multithreading and Multiprocessing

Latest recommended article published on 2022-12-17 20:20:01

北落师门XY


Category column: Big Data

Article link: https://blog.csdn.net/weixin_41819299/article/details/85548320


1. Code source (Git)

https://github.com/shenxiangzhuang/PythonDataAnalysis/tree/master/Ch1Spider/muti-threads

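After consulting the author, I swapped the last two lines of the code. Mymultithread takes URLs off the shared urls list with pop(), so the list is empty when it returns, and any method called after it has no URLs left to fetch.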
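
The effect of the call order can be seen without any network access. The sketch below is only an illustration, not the original author's code: the `drain` worker and the placeholder URLs are assumptions made for the example. It shows that once several threads have consumed a shared list with `pop()`, nothing is left for whatever runs next.

```python
import threading

# Placeholder URLs, assumed for illustration only.
urls = ["http://example.com/%d" % i for i in range(20)]
seen = []

def drain():
    """Hypothetical worker: pop URLs from the shared list until it is empty."""
    while True:
        try:
            url = urls.pop()  # a single list.pop() is atomic in CPython
        except IndexError:
            break  # the shared list is exhausted
        seen.append(url)

threads = [threading.Thread(target=drain) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(seen))  # 20: every URL was processed exactly once
print(len(urls))  # 0: the shared list is now empty
```

A function that should leave the list intact could work on a copy, for example `list(urls)`, instead of the shared list.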

2. Key points

3. Code


import re
import time
import requests
import concurrent.futures
import pandas as pd
import threading
from multiprocessing import Pool
from fake_useragent import UserAgent

# Scrape the links on a portal page and save the first 1000 of them to 123.csv
def get_csv_files():
    url = 'http://www.hao123.com/'
    ua = UserAgent()
    headers = {'User-Agent': ua.random}  # send a random User-Agent string
    response = requests.get(url, headers=headers)
    data = response.text
    urls = re.findall(r'href="(http.*?)"', data)
    print(len(urls))  # 1033

    df = pd.DataFrame()
    df['url'] = urls[:1000]
    df.to_csv('123.csv', index=False)  # do not write the row index

# Read n URLs from the file for testing
def get_urls_from_file(n):
    df = pd.read_csv('123.csv')  # 1000 URLs in total
    urls = list(df['url'][:n])
    return urls

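# Decorator that prints how long the wrapped function takes to run
def gettime(func):
    def wrapper(*args, **kwargs):
        print("=" * 50)
        print(func.__name__, 'Start...')
        starttime = time.time()
        result = func(*args, **kwargs)  # forward keyword arguments as well
        endtime = time.time()
        spendtime = endtime - starttime
        print(func.__name__, "End...")
        print("Spend", spendtime, "s totally")
        print("=" * 50)
        return result  # pass the wrapped function's return value through

    return wrapper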





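# Request and parse a page to get its data (here the data is simply the page source)
def getdata(url, retries=3):
    # print("Downloading:", url)
    headers = {}
    try:
        html = requests.get(url, headers=headers)
        # print(html)

    except requests.exceptions.RequestException as e:
        # RequestException is the base class of the requests errors,
        # so connection errors and other failed requests are all treated as "no data"
        # print('Download failed:', e)
        html = None

    # 5xx errors are server errors, so the request can be retried (up to three times here)
    if html is not None and 500 <= html.status_code < 600 and retries:
        # print('Server error, retrying...')
        return getdata(url, retries - 1)

    # Return the page source, or None if the request failed
    data = html.text if html is not None else None

    return data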


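# Serial
@gettime
def Mynormal():
    for url in urls:
        getdata(url)


# Process pool
# In short: create a pool of 10 processes, hand out the tasks with map, call close so that no more tasks are accepted, then join to wait for the worker processes to finish
@gettime
def MyprocessPool(num=10):
    # Pool starts the given number of worker processes and reuses them. Submitted tasks are queued;
    # when every worker is busy, a task waits until one of the workers becomes free.
    pool = Pool(num)  # from multiprocessing import Pool
    results = pool.map(getdata, urls)  # call getdata once for each element of urls

    pool.close()  # close the pool so that it accepts no new tasks
    pool.join()  # block the main process until all worker processes have exited
    return results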


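# Multithreading
@gettime
def Mymultithread(max_threads=10):
    # Worker: keep taking URLs from the shared list and processing them
    def urls_process():
        while True:
            try:
                # Take one url off the end of urls (this empties the shared list)
                url = urls.pop()
            except IndexError:
                # urls is empty, so everything has been crawled: stop
                break
            data = getdata(url, retries=3)
            # Extracting and storing the page data would go here

    threads = []

    # While the thread limit has not been reached and URLs remain to be crawled, start another thread to speed things up
    while len(threads) < max_threads and len(urls):
        thread = threading.Thread(target=urls_process)  # create a thread
        # print('Created thread', thread.name)
        thread.start()  # start the thread
        threads.append(thread)
        print('len(urls) :', len(urls))

    for thread in threads:
        thread.join()  # block until this worker finishes; without join the main thread would carry on before the workers are done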


# Thread pool
@gettime
def Myfutures(num_of_max_works=10):
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_of_max_works) as executor:  # create the thread pool
        executor.map(getdata, urls)


if __name__ == '__main__':
    # get_csv_files()  # generate 123.csv
    urls = get_urls_from_file(100)  # take 100 pages for the test
    Mynormal()  # serial
    MyprocessPool(10)  # process pool
    Myfutures(10)  # thread pool
    Mymultithread(10)  # multithreading; runs last because it empties urls
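
Myfutures above discards whatever executor.map returns. When the page contents are actually needed, they can be collected from the iterator that map returns, which yields results in the same order as the inputs. A minimal sketch, assuming a hypothetical `fake_fetch` stand-in for getdata so that it runs without network access (`fetch_all` and `sample_urls` are also names invented for this example):

```python
import concurrent.futures
import time

def fake_fetch(url):
    """Hypothetical stand-in for getdata: sleeps briefly instead of downloading."""
    time.sleep(0.01)
    return "<html>%s</html>" % url

def fetch_all(urls, max_workers=10):
    """Fetch every URL with a thread pool and return the results in input order."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the order of the inputs,
        # regardless of which task happens to finish first.
        return list(executor.map(fake_fetch, urls))

sample_urls = ["http://example.com/%d" % i for i in range(5)]
pages = fetch_all(sample_urls)
print(len(pages))  # 5
print(pages[0])    # <html>http://example.com/0</html>
```

Passing the URL list in as an argument, rather than reading a global, also keeps the function independent of the order in which the other methods are called.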

4. Results:

==================================================
Mynormal Start...
Mynormal End...
Spend 29.723828554153442 s totally
==================================================
==================================================
MyprocessPool Start...
MyprocessPool End...
Spend 7.878110885620117 s totally
==================================================
==================================================
Myfutures Start...
Myfutures End...
Spend 5.8683295249938965 s totally
==================================================
==================================================
Mymultithread Start...
Mymultithread End...
Spend 6.390867710113525 s totally
==================================================
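
To make the comparison easier to read, the timings above can be converted into speedup factors relative to the serial run. The short sketch below does only that arithmetic, with the timings rounded to two decimals. These numbers come from a single run over 100 URLs and depend on network conditions, so the factors are a rough indication only.

```python
# Timings in seconds, copied from the output above and rounded to two decimals.
timings = {
    "Mynormal": 29.72,
    "MyprocessPool": 7.88,
    "Myfutures": 5.87,
    "Mymultithread": 6.39,
}

baseline = timings["Mynormal"]  # the serial run is the reference point
speedups = {name: round(baseline / t, 2) for name, t in timings.items()}

for name, factor in speedups.items():
    print("%-14s %.2fx" % (name, factor))
```

In this run the process pool was roughly 3.8 times faster than the serial version, the thread pool roughly 5.1 times, and the hand-written threads roughly 4.7 times.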

Reference code source: 《Python数据分析入门》
