Python Crawlers: Processes and Threads

梦亦殇

Category: python. Tags: python, crawler, process, thread

Article link: https://blog.csdn.net/weixin_42163525/article/details/84727105


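Multitasking

Definition: the operating system can run multiple tasks at the same time.
Truly parallel execution is only possible on a multi-core CPU. Because there are usually far more tasks than CPU cores, the operating system also schedules tasks onto each core in turn.

Concurrency: there are more tasks than CPU cores, and the operating system's scheduling algorithms make the tasks appear to run "together". At any instant some tasks are not running; switching is fast enough that they only look simultaneous.
Parallelism: the number of tasks is less than or equal to the number of CPU cores, so the tasks really do run at the same time.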
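
To make the concurrency definition concrete, here is a minimal sketch that starts more waiting tasks than a typical machine has cores and measures the total time. The names wait_task and run_concurrently are made up for this illustration, and time.sleep stands in for waiting on I/O.

```python
import os
import threading
import time


def wait_task(seconds):
    time.sleep(seconds)  # stands in for waiting on I/O


def run_concurrently(task_count, seconds):
    """Start task_count waiting tasks at once and return the total elapsed time."""
    threads = [threading.Thread(target=wait_task, args=(seconds,))
               for _ in range(task_count)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    tasks = cores * 4  # deliberately more tasks than cores
    elapsed = run_concurrently(tasks, 0.2)
    print("cores:", cores, "tasks:", tasks, "elapsed: %.2fs" % elapsed)
```

The elapsed time stays close to the length of one task rather than the sum of all of them, because the waiting tasks overlap even though they outnumber the cores.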

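Relationship between threads and processes:
Process: once a program is running, its code plus the resources it uses is called a process. It is the basic unit by which the operating system allocates resources.
Thread: the smallest unit of execution flow in a program. A thread is an entity inside a process and is the basic unit that the system schedules and dispatches independently (a thread cannot exist without a process).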

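Threads (Thread)

Python's thread module (named _thread in Python 3) is a fairly low-level module. The threading module wraps it and is more convenient to use.

Thread parameters:
- target: the function the thread runs
- name: the name of the thread
- args: the arguments passed to the target function (a tuple)
- daemon: if False (the default), the program waits at exit for this thread to finish before it terminates; if True, it does not wait, and the thread is killed when the main thread exits
  Either way, the main thread's own code does not pause for a child thread unless you call join()
  Example: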

import threading

def work1(parameter):
    for i in range(30):
        # threading.current_thread().name is this thread's name; you can set it yourself
        print('ssss', str(i), threading.current_thread().name, parameter)

def work2():
    for i in range(30):
        print('xxxxxxx', i, threading.current_thread().name)

def main():
    t1 = threading.Thread(target=work1, name='hahaha', args=('I am the argument',))
    t2 = threading.Thread(target=work2, name='aaaa')

    # after creating a thread, call start() to launch it
    t1.start()
    t2.start()
    # join() makes the main thread wait for the child thread
    t1.join()
    t2.join()

if __name__ == '__main__':
    main()
    print('Main thread finished', threading.current_thread().name)
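
The daemon and join() behaviour described in the parameter list can be checked with a small sketch. The helper names background and run are invented for this illustration.

```python
import threading
import time


def background():
    time.sleep(0.2)  # pretend to do some work


def run(daemon, wait):
    """Start one thread and report (daemon flag, still alive right now)."""
    t = threading.Thread(target=background, name="worker", daemon=daemon)
    t.start()
    if wait:
        t.join()  # the main thread blocks here until the worker is done
    return t.daemon, t.is_alive()


if __name__ == "__main__":
    print(run(daemon=True, wait=True))    # joined, so the worker has finished
    print(run(daemon=False, wait=False))  # not joined, so the worker is still running
```

With wait=False the call returns while the worker is still sleeping. Since that worker is not a daemon, the interpreter still waits for it before the program exits.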

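Threads: shared global variables and locks

Threads share global variables. This means any thread can modify a global variable freely, which can leave it in an inconsistent state across threads (that is, the code is not thread-safe).
So a lock is needed.
Example: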

import threading

lock = threading.Lock()
# acquire the lock:  lock.acquire()
# release the lock:  lock.release()
# "with lock:" does both, and still releases if the body raises

a = 0

def aaaaaa():
    global a
    for i in range(1000000):
        with lock:
            a += 1

def bbbbb():
    global a
    for i in range(1000000):
        with lock:
            a += 1


def main():
    qq = threading.Thread(target=aaaaaa)
    ww = threading.Thread(target=bbbbb)

    qq.start()
    ww.start()
    qq.join()
    ww.join()
    print(a)  # 2000000

if __name__ == '__main__':
    main()

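Note: be careful when locking, otherwise you can end up with a deadlock.
Deadlock: when threads share several resources, if two threads each hold some of the resources and both wait for the other's, the result is a deadlock.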
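
One common way to avoid the deadlock described above is to make every thread acquire the locks in the same order. The sketch below assumes two hypothetical resources guarded by lock_a and lock_b; the names transfer, run_transfers and balance are made up for the illustration.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
balance = {"a": 100, "b": 100}


def transfer(src, dst, amount):
    # Always take lock_a before lock_b, whichever direction the transfer goes.
    # If one thread locked a then b while another locked b then a, each could
    # hold one lock and wait forever for the other.
    with lock_a:
        with lock_b:
            balance[src] -= amount
            balance[dst] += amount


def run_transfers():
    threads = [
        threading.Thread(target=transfer, args=("a", "b", 10)),
        threading.Thread(target=transfer, args=("b", "a", 5)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return balance


if __name__ == "__main__":
    print(run_transfers())
```

The two transfers go in opposite directions, yet neither can block the other indefinitely because the lock order never differs between threads.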

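Thread queue: Queue

A queue is the most common way to exchange data between threads.
Queue is thread-safe, so use it whenever it fits the use case.

Common Queue methods
- myqueue.put(10): add an item to the queue
- myqueue.get(): take an item out of the queue
- Queue.qsize(): return the size of the queue
- Queue.empty(): return True if the queue is empty, otherwise False
- Queue.full(): return True if the queue is full, otherwise False
- Queue.full() is relative to maxsize
- Queue.get([block[, timeout]]): take an item from the queue; timeout is how long to wait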

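import queue  # this module is named Queue in Python 2
import requests


def work1(url_queue):
    while not url_queue.empty():
        url = url_queue.get()
        requests.get(url)

def main():
    myqueue = queue.Queue(maxsize=10)
    for i in range(4):
        url = 'https://xxx.com/page/' + str(i)
        myqueue.put(url)
    # consume only after every URL has been queued
    work1(myqueue)

if __name__ == '__main__':
    main()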
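
The example above drains the queue from a single thread. As a rough sketch of how several worker threads could share one queue, the code below uses put(), get(), task_done() and join() from the standard queue module. The worker and run helpers are invented here, squaring a number stands in for downloading a page, and None is used as an assumed "no more work" marker.

```python
import queue
import threading


def worker(tasks, results):
    while True:
        item = tasks.get()
        if item is None:          # sentinel: no more work for this worker
            tasks.task_done()
            break
        results.put(item * item)  # stand-in for downloading a page
        tasks.task_done()


def run(numbers, worker_count=3):
    tasks = queue.Queue()
    results = queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(worker_count)]
    for t in threads:
        t.start()
    for n in numbers:
        tasks.put(n)
    for _ in threads:
        tasks.put(None)           # one sentinel per worker
    tasks.join()                  # blocks until every put() has a matching task_done()
    for t in threads:
        t.join()
    return sorted(results.get() for _ in range(results.qsize()))


if __name__ == "__main__":
    print(run(range(5)))
```

A blocking get() with a sentinel avoids the gap between empty() and get(), where another thread could take the last item first.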

Thread pool

A practical example:

# ------------ thread pool ---------

import requests
from lxml import etree
from concurrent.futures import ThreadPoolExecutor

def download_article_list(req_url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
    }
    response = requests.get(url=req_url, headers=headers)
    if response.status_code == 200:
        html = response.text
        return html
    # any other status code falls through and returns None

def parse_data_by_callback(future):
    html = future.result()
    if html is None:
        return  # the download did not succeed, so there is nothing to parse
    html = etree.HTML(html)
    data_list = html.xpath('//div[@class="post floated-thumb"]')
    for data in data_list[:1]:
        title = '|'.join(data.xpath('.//div[@class="post-meta"]/p[1]//text()')).replace(' ', '').replace('\r\n', '')
        print(title)

def main():
    # leaving the with block waits for all submitted tasks to finish
    with ThreadPoolExecutor(10) as pool:
        # add tasks to the thread pool
        for i in range(1, 11):
            ful_url = 'http://blog.jobbole.com/all-posts/page/{}/'.format(i)
            pl = pool.submit(download_article_list, ful_url)
            pl.add_done_callback(parse_data_by_callback)

if __name__ == '__main__':
    main()
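
Callbacks are one way to collect results from a thread pool. Another is concurrent.futures.as_completed, sketched below without any network access so it can be run anywhere. fake_download and fetch_all are hypothetical helpers, and the returned HTML string is a placeholder rather than real page content.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time


def fake_download(page):
    time.sleep(0.1)  # stand-in for network latency
    return page, "<html>page {}</html>".format(page)


def fetch_all(pages, workers=4):
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fake_download, p) for p in pages]
        # as_completed yields each future as soon as it finishes,
        # which is not necessarily the order of submission
        for future in as_completed(futures):
            page, html = future.result()
            results[page] = html
    return results


if __name__ == "__main__":
    for page, html in sorted(fetch_all(range(1, 6)).items()):
        print(page, html)
```

Here the results are handled in the thread that submitted the tasks, which keeps the parsing code out of the pool's worker threads.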

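Processes (process)

Process: once a program is running, its code plus the resources it uses is called a process. It is the basic unit by which the operating system allocates resources.
The API is similar to the thread API.

Process parameters:
- target: a reference to a function; the child process runs this function's code
- args: the arguments passed to the target function, as a tuple
- kwargs: keyword arguments passed to the target function
- name: a name for the process; optional
Common methods of a Process instance
- start(): start the child process (this creates it)
- is_alive(): check whether the child process is still alive
- join([timeout]): wait for the child process to finish, or wait at most timeout seconds
- terminate(): terminate the child process immediately, whether or not its task is done

Example: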

from multiprocessing import Process


def work1(a):
    print(a)


def work2():
    print('aaaaa')


def main():
    print(11111111111111111111111111111)
    pr1 = Process(target=work1, args=('hello',))
    pr2 = Process(target=work2)
    pr1.start()
    pr2.start()

    pr1.join()
    pr2.join()
    print(22222222222222222222222222222)


if __name__ == '__main__':
    main()
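
The example above only uses start() and join(). As a sketch of how join([timeout]), is_alive() and terminate() from the method list fit together, the code below gives a slow child process a time limit. slow_task and run_with_timeout are names made up for this illustration.

```python
from multiprocessing import Process
import time


def slow_task():
    time.sleep(60)  # pretend to be a task that hangs


def run_with_timeout(seconds):
    """Return (timed_out, alive_after_cleanup) for one child process."""
    p = Process(target=slow_task, name="slow-worker")
    p.start()
    p.join(seconds)           # wait at most `seconds`
    timed_out = p.is_alive()  # still alive means the limit was exceeded
    if timed_out:
        p.terminate()         # stop it regardless of progress
        p.join()              # wait for the terminated process to exit
    return timed_out, p.is_alive()


if __name__ == "__main__":
    print(run_with_timeout(0.5))
```

Calling join() again after terminate() waits until the child has actually exited, so the final is_alive() check is reliable.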

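Inter-process communication: Queue

Processes do not share global variables.
Processes sometimes need to communicate, and the operating system provides many mechanisms for inter-process communication.
This Queue is not the same as the thread Queue used earlier; pay attention to which package you import.
That said, it is used in much the same way as the thread Queue.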

# inter-process communication
# multiprocessing.Queue lets processes pass data to each other
from multiprocessing import Queue, Process
import os


def write(dataqueue):
    for i in range(10):
        dataqueue.put(i)
    dataqueue.put(None)  # sentinel: nothing more to read
    print(os.getpid(), 'finished')


def read(dataqueue):
    # empty() is not reliable across processes, so read until the sentinel
    while True:
        item = dataqueue.get()
        if item is None:
            break
        print(item)


def main():
    data_queue = Queue()
    process = Process(target=write, args=(data_queue,))
    process.start()
    process.join()
    p2 = Process(target=read, args=(data_queue,))
    p2.start()
    p2.join()


if __name__ == '__main__':
    main()
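
The statement that processes do not share global variables can be checked directly. In this sketch (counter, increment and run_child are invented names), the child process increments a global and reports its value through a queue, while the parent's copy stays unchanged.

```python
from multiprocessing import Process, Queue

counter = 0


def increment(result_queue):
    global counter
    counter += 1               # changes only the child's own copy
    result_queue.put(counter)


def run_child():
    """Return (value seen in the child, value seen in the parent)."""
    result_queue = Queue()
    p = Process(target=increment, args=(result_queue,))
    p.start()
    child_value = result_queue.get()  # read the result before join()
    p.join()
    return child_value, counter


if __name__ == "__main__":
    print(run_child())
```

The child sees 1 and the parent still sees 0, which is why a Queue (or another IPC mechanism) is needed to move data between processes.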

Process pool

Option 1: the process pool in multiprocessing

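from multiprocessing import Pool
import os, time

def runtest(num):
    print('process started ' + str(os.getpid()))
    time.sleep(2)
    # print(num)
    print('process finished ' + str(os.getpid()))
    return num, num

def done(result):
    # apply_async passes the task's return value itself, not a future
    print(result)

def main():
    # build a process pool
    p = Pool(4)
    for i in range(0, 50):
        # func: the function to run; args: its arguments, as a tuple
        # callback: a callback function (optional, depending on your needs)
        p.apply_async(func=runtest, args=(i,), callback=done)

    # close() closes the pool; no more tasks can be added
    p.close()
    p.join()

if __name__ == '__main__':
    main()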
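
When every task calls the same function and only the results matter, Pool.map can be a shorter alternative to a loop of apply_async calls. This is a minimal sketch with made-up helper names (square, squares), not a replacement for the callback style above.

```python
from multiprocessing import Pool


def square(n):
    return n * n


def squares(numbers, workers=2):
    # map() blocks until all results are ready and returns them in input order
    with Pool(workers) as p:
        return p.map(square, numbers)


if __name__ == "__main__":
    print(squares(range(5)))
```

The `if __name__ == "__main__":` guard matters here: on platforms that start child processes by re-importing the main module, unguarded pool creation would run again in each child.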

Communication between pool processes: use Queue() from Manager; everything else works much the same

from multiprocessing import Manager, Pool
import time

def write(queue, num):
    for i in range(num):
        queue.put(i)
    print('finished storing tasks')


def read(queue):
    print(queue.get())

def main():
    # use Manager
    q = Manager().Queue()

    pool = Pool()
    print(111)
    pool.apply_async(func=write, args=(q, 2,))
    print(2222)
    time.sleep(3)
    pool.apply_async(func=read, args=(q,))
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()

Option 2: the process pool in concurrent.futures

from concurrent.futures import ProcessPoolExecutor
import time, os

def runtest(num):
    print('process started ' + str(os.getpid()))
    time.sleep(2)
    # print(num)
    print('process finished ' + str(os.getpid()))
    return num

def done(future):
    print(future.result())

def main():
    # create a process pool
    pool = ProcessPoolExecutor(4)
    for i in range(0, 20):
        # submit() takes the arguments directly, not wrapped in a tuple
        handler = pool.submit(runtest, i)
        handler.add_done_callback(done)

    pool.shutdown(wait=True)

if __name__ == '__main__':
    main()

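Coroutines

In plain terms: a function running in a thread can save its temporary state (local variables and so on) at any point and switch to another function. The switch is not a function call, and the developer decides how often to switch and when to return to the original function.
gevent is recommended: this third-party library switches for you automatically whenever a task is waiting.

Install: pip3 install gevent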

from gevent import monkey
# Needed when there are blocking operations. Patch before importing requests,
# so the blocking calls it uses are replaced by gevent's own implementations.
monkey.patch_all()

import gevent, requests
from gevent import pool


def download(url):
    print(url + ' downloading 1')
    header = {'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'}
    response = requests.get(url, headers=header)
    print(len(response.text), url + ' finished 1')

def download2(url):
    print(url + ' downloading 2')
    header = {'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'}
    response = requests.get(url, headers=header)
    print(len(response.text), url + ' finished 2')

# named workers so the gevent pool module is not shadowed
workers = pool.Pool(2)

gevent.joinall(
    [
        workers.spawn(download, 'https://www.yahoo.com/'),
        workers.spawn(download, 'https://www.taobao.com/'),
        workers.spawn(download, 'https://github.com/'),
        workers.spawn(download2, 'https://www.yahoo.com/'),
        workers.spawn(download2, 'https://www.taobao.com/'),
        workers.spawn(download2, 'https://github.com/'),
    ]
)
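
The idea of saving a function's state and switching to another function without calling it can be illustrated with plain Python generators, with no third-party library. This is only a sketch of the switching mechanism, not how gevent is implemented; task and round_robin are names invented for the example.

```python
from collections import deque


def task(name, steps):
    for i in range(steps):
        # yield pauses here; name and i are kept until the task is resumed
        yield "{}-{}".format(name, i)


def round_robin(tasks):
    """Resume each task in turn until all have finished; return the step order."""
    ready = deque(tasks)
    trace = []
    while ready:
        current = ready.popleft()
        try:
            trace.append(next(current))  # run the task up to its next yield
        except StopIteration:
            continue                     # finished: do not queue it again
        ready.append(current)
    return trace


if __name__ == "__main__":
    print(round_robin([task("A", 2), task("B", 3)]))
    # ['A-0', 'B-0', 'A-1', 'B-1', 'B-2']
```

Everything runs in a single thread: the scheduler loop decides when each task resumes, which is the "developer decides when to switch" part of the description.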

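Summary

A process is the unit of resource allocation.
A thread is the unit of operating-system scheduling.

Switching between processes costs the most resources and is the least efficient.
Switching between threads costs a moderate amount and is moderately efficient (leaving the GIL aside).
Switching between coroutines costs very little and is highly efficient.

Depending on the number of CPU cores, multiple processes and multiple threads may run in parallel (although in CPython the GIL normally keeps threads from executing Python bytecode in parallel), while coroutines live inside a single thread, so they are concurrent.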
