Advanced Python Programming and Asynchronous IO Concurrent Programming (Part 2)

weixin_30795127

Original article: http://www.cnblogs.com/Eric15/articles/9769044.html

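I. Multithreading, Multiprocessing, and Thread Pool Programming

1. GIL

GIL stands for global interpreter lock.

In CPython, each Python thread maps to one native (C-level) thread.

The GIL ensures that only one thread executes bytecode at any given moment, so CPython cannot run the bytecode of several threads on several CPUs in parallel.

When the GIL is released:

The interpreter makes the running thread give up the GIL periodically: after a fixed number of bytecode instructions in Python 2, and after a time slice (the switch interval) in Python 3.2 and later.
A thread releases the GIL on its own when it performs a blocking IO operation.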
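
To make the time-slice rule concrete, here is a minimal sketch, assuming CPython 3. It reads the current switch interval and runs two CPU-bound threads that take turns holding the GIL. The function name count_down and the loop size are made up for illustration.

```python
import sys
import threading

def count_down(n, results, index):
    # Pure-Python, CPU-bound loop: the thread needs the GIL for every bytecode it runs
    while n > 0:
        n -= 1
    results[index] = "done"

# Seconds a thread may run before the interpreter asks it to release the GIL
switch_interval = sys.getswitchinterval()
print("switch interval:", switch_interval)

results = [None, None]
threads = [
    threading.Thread(target=count_down, args=(200_000, results, i))
    for i in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

Both threads finish, but because of the GIL their loops are interleaved rather than run in parallel, so for CPU-bound work like this a second thread should not be expected to cut the running time.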

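2. Multithreaded programming with threading

The thread is the smallest unit the operating system can schedule.

For IO-bound work, multithreading and multiprocessing perform about the same.

setDaemon method: marks a thread as a daemon thread. When the main thread finishes, the program exits without waiting for daemon threads. (Since Python 3.10, setDaemon() is deprecated in favor of the daemon attribute.)
join method: blocks the calling thread until the joined thread finishes, so the main thread cannot move on before the child thread is done.

Two ways to create threads

1) Use threading.Thread directly: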

import time
import threading

def get_detail_html(url):
    print("get detail html started")
    time.sleep(2)
    print("get detail html end")

def get_detail_url(url):
    print("get detail url started")
    time.sleep(4)
    print("get detail url end")

if __name__ == "__main__":
    # Both functions take a url argument, so pass one through args
    thread1 = threading.Thread(target=get_detail_html, args=("",))
    thread2 = threading.Thread(target=get_detail_url, args=("",))
    # thread1.daemon = True
    # thread2.daemon = True  # daemon threads: the program exits without waiting for them

    start_time = time.time()
    thread1.start()
    thread2.start()

    thread1.join()
    thread2.join()   # block until the thread finishes

    print("last time: {}".format(time.time() - start_time))

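2) Subclass threading.Thread:

This approach is recommended because the class gives us room for any extra processing we need.

import time
import threading

class GetDetailHtml(threading.Thread):
    def __init__(self, name):
        super().__init__(name=name)

    def run(self):   # override threading.Thread.run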
        print("get detail html started")
        time.sleep(2)
        print("get detail html end")

class GetDetailUrl(threading.Thread):
    def __init__(self, name):
        super().__init__(name=name)

    def run(self): 
        print("get detail url started")
        time.sleep(4)
        print("get detail url end")

if  __name__ == "__main__":
    thread1 = GetDetailHtml("get_detail_html")
    thread2 = GetDetailUrl("get_detail_url")
    start_time = time.time()
    thread1.start()
    thread2.start()

    thread1.join()
    thread2.join()

    print ("last time: {}".format(time.time()-start_time))
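
One thing the subclass approach makes easy is keeping per-thread state, such as a result. The sketch below is an illustration only: the Worker class and its result attribute are hypothetical names. It also shows the daemon attribute together with join.

```python
import threading
import time

class Worker(threading.Thread):
    """Hypothetical worker that keeps its result on the instance."""

    def __init__(self, name, delay):
        super().__init__(name=name)
        self.delay = delay
        self.result = None

    def run(self):
        time.sleep(self.delay)
        self.result = "{} finished".format(self.name)

worker = Worker("worker-1", 0.1)
worker.daemon = True   # attribute form of setDaemon(True); must be set before start()
worker.start()
worker.join()          # block until run() returns, daemon or not
print(worker.result, worker.is_alive())
```

Without the join() call, a daemon thread could be cut off when the main thread exits, and result might still be None.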

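3. Communication between threads: shared variables and Queue

1) Shared variables: define a global variable and use it from several threads. This is not thread-safe unless access is protected by a lock, so it is not recommended.

2) Queue: thread-safe and recommended. The source code of queue.Queue shows its commonly used methods.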

# Synchronizing threads through a Queue is safer
from queue import Queue
import time
import threading

def get_detail_html(queue):
    # Crawl article detail pages
    while True:
        url = queue.get()  # get() blocks until an item is available
        print("get detail html started")
        time.sleep(2)
        print("get detail html end")
        queue.task_done()  # mark the item as processed so queue.join() can return

def get_detail_url(queue):
    # Crawl the article list page once and queue the detail URLs
    print("get detail url started")
    time.sleep(4)
    for i in range(20):
        queue.put("http://projectsedu.com/{id}".format(id=i))  # put() adds an item
    print("get detail url end")

if __name__ == "__main__":
    detail_url_queue = Queue(maxsize=1000)

    thread_detail_url = threading.Thread(target=get_detail_url, args=(detail_url_queue,))
    for i in range(10):
        # daemon=True so the endless consumer loops do not keep the program alive
        html_thread = threading.Thread(target=get_detail_html, args=(detail_url_queue,), daemon=True)
        html_thread.start()
    start_time = time.time()
    thread_detail_url.start()

    thread_detail_url.join()  # wait until every URL has been queued
    detail_url_queue.join()   # block until task_done() has been called for every item

    print("last time: {}".format(time.time() - start_time))
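
A common alternative for stopping consumer threads cleanly is to send a sentinel value through the queue. The sketch below assumes a hypothetical SENTINEL marker and helper names (producer, consumer, run) that are not part of the original example.

```python
import threading
from queue import Queue

SENTINEL = None  # hypothetical marker meaning "no more work"

def producer(queue, count, consumers):
    for i in range(count):
        queue.put("http://projectsedu.com/{id}".format(id=i))
    for _ in range(consumers):
        queue.put(SENTINEL)   # one sentinel per consumer so each of them exits

def consumer(queue, seen, lock):
    while True:
        url = queue.get()     # blocks until an item is available
        if url is SENTINEL:
            break
        with lock:
            seen.append(url)

def run(count=20, consumers=3):
    queue = Queue(maxsize=1000)
    seen, lock = [], threading.Lock()
    workers = [
        threading.Thread(target=consumer, args=(queue, seen, lock))
        for _ in range(consumers)
    ]
    for w in workers:
        w.start()
    producer(queue, count, consumers)
    for w in workers:
        w.join()              # every consumer returns after reading its sentinel
    return sorted(seen)

urls = run()
print(len(urls))
```

With this pattern the consumers do not need to be daemon threads, because each one leaves its loop by itself.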

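4. Thread synchronization: Lock and RLock

1) Lock

from threading import Lock

total = 0
lock = Lock()

def add():
    global total
    lock.acquire()     # acquire the lock
    # lock.acquire()   # acquiring again before release would block forever (deadlock)
    total += 1
    lock.release()     # release the lock; until then, other threads calling acquire() are blocked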

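Notes:

1. Locks cost performance.

2. Locks can cause deadlocks:

    1) Calling acquire twice in a row on a Lock, without a release in between, deadlocks.

    2) Resource contention: thread one holds lock A and waits for lock B, while thread two holds lock B and waits for lock A. Neither can proceed, so neither ever releases the lock the other one needs.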
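
The resource-contention deadlock is usually avoided by making every thread take the locks in the same order. The following is a minimal sketch of that idea; the names lock_a, lock_b, balance and transfer are made up for illustration.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
balance = {"a": 100, "b": 100}

def transfer(src, dst, amount):
    # Always take lock_a before lock_b, whatever the direction of the transfer,
    # so two threads can never each hold one lock while waiting for the other
    with lock_a:
        with lock_b:
            balance[src] -= amount
            balance[dst] += amount

t1 = threading.Thread(target=transfer, args=("a", "b", 10))
t2 = threading.Thread(target=transfer, args=("b", "a", 5))
t1.start()
t2.start()
t1.join()
t2.join()
print(balance)
```

If one thread took lock_a first and the other took lock_b first, each could end up waiting for the lock the other already holds.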

2) RLock

A reentrant lock: within the same thread, acquire can be called several times in a row. The number of release calls must match the number of acquire calls.

from threading import RLock

total = 0
lock = RLock()
def add():
    global lock
    global total
    for i in range(1000000):
        lock.acquire()   # acquire the lock
        lock.acquire()   # RLock: no deadlock within the same thread
        total += 1
        lock.release()   # release the lock
        lock.release()   # release as many times as acquire was called

RLock is far more forgiving than Lock. If you need a lock, RLock is the recommended choice.

from threading import RLock


total = 0
lock = RLock()
def add():
    global lock
    global total
    for i in range(1000000):
        lock.acquire()   # acquire the lock
        lock.acquire()
        total += 1
        lock.release()   # release the lock
        lock.release()

def desc():
    global total
    global lock
    for i in range(1000000):
        lock.acquire()
        total -= 1
        lock.release()

import threading
thread1 = threading.Thread(target=add)
thread2 = threading.Thread(target=desc)
thread1.start()
thread2.start()

thread1.join()
thread2.join()
print(total)
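
The difference between the two lock types can be probed safely with a non-blocking acquire, which returns False instead of hanging. This is a small sketch; the variable names second_try and reentrant_try are only illustrative.

```python
from threading import Lock, RLock

lock = Lock()
lock.acquire()
# A second blocking acquire() here would hang forever; blocking=False reports failure instead
second_try = lock.acquire(blocking=False)
print("second acquire on Lock:", second_try)
lock.release()

rlock = RLock()
rlock.acquire()
# The owning thread may acquire an RLock again
reentrant_try = rlock.acquire(blocking=False)
print("second acquire on RLock:", reentrant_try)
rlock.release()
rlock.release()   # one release per successful acquire
```

In everyday code, `with lock:` is often preferred over paired acquire/release calls because the release also happens when an exception is raised.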


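5. Thread synchronization: using Condition and reading its source

Condition implements the magic methods __enter__ and __exit__, so it is a context manager and can be used in a with statement.

# Two threads recite a poem in turn, coordinated by a Condition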

import threading

class XiaoAi(threading.Thread):
    # Lines spoken by XiaoAi, in order
    LINES = [
        "I'm here",
        "Sure",
        "You live at the tail of the Yangtze",
        "Yet we drink the same river's water",
        "When will this sorrow end",
        "And I will never betray this longing",
    ]

    def __init__(self, cond):
        super().__init__(name="XiaoAi")
        self.cond = cond

    def run(self):
        with self.cond:
            for line in self.LINES:
                self.cond.wait()      # wait for the other speaker's line
                print("{} : {} ".format(self.name, line))
                self.cond.notify()    # hand the turn back

class TianMao(threading.Thread):
    # Lines spoken by TianMao, in order
    LINES = [
        "XiaoAi",
        "Let's recite a classical poem together",
        "I live at the head of the Yangtze",
        "Day after day I think of you but never see you",
        "When will this water cease to flow",
        "I only wish your heart were like mine",
    ]

    def __init__(self, cond):
        super().__init__(name="TianMao")
        self.cond = cond

    def run(self):
        with self.cond:
            for line in self.LINES:
                print("{} : {} ".format(self.name, line))
                self.cond.notify()    # wake the other speaker
                self.cond.wait()      # wait for the reply

if __name__ == "__main__":
    cond = threading.Condition()
    xiaoai = XiaoAi(cond)
    tianmao = TianMao(cond)

    xiaoai.start()
    tianmao.start()
    # The start order matters: XiaoAi must already be waiting when TianMao sends the first notify
    # wait() and notify() may only be called after entering "with cond"
    # Condition uses two layers of locks: the underlying lock is released when a thread calls wait(); on top of that, each wait() call allocates a new lock and puts it in the Condition's waiter queue, where it waits to be woken by notify()
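
When the start order cannot be guaranteed, Condition.wait_for with a predicate is a common way to avoid missing a notify. The sketch below is not from the original post; the names items, consumer, producer and received are hypothetical.

```python
import threading

cond = threading.Condition()
items = []

def consumer(out):
    with cond:
        # wait_for checks the predicate first and after every wake-up,
        # so it works even if the producer has already run
        cond.wait_for(lambda: len(items) > 0)
        out.append(items.pop())

def producer():
    with cond:
        items.append("hello")
        cond.notify()

received = []
c = threading.Thread(target=consumer, args=(received,))
p = threading.Thread(target=producer)
c.start()
p.start()
c.join()
p.join()
print(received)
```

Because the state lives in items rather than in the notify call itself, the consumer gets the value whichever thread runs first.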

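6. Thread synchronization: using Semaphore and reading its source

Semaphore: a lock that limits how many threads may enter at once. It is essentially a lock: Lock is a single lock, while a semaphore hands out a set number of locks, so that several threads can access the same resource at the same time. A typical case is allowing many concurrent reads but only one write at a time.

A Semaphore manages a counter. Each acquire() call decrements the counter by one and each release() call increments it by one. The counter defaults to 1 and can never drop below 0: when it is 0, a thread calling acquire() waits until release() is called. This makes it a handy way to limit the number of running threads.

Limiting the crawler to at most 3 threads at a time: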

import threading
import time
# Limit the number of crawlers: at most three run at a time
class HtmlSpider(threading.Thread):
    def __init__(self, url, sem):
        super().__init__()
        self.url = url
        self.sem = sem

    def run(self):
        time.sleep(2)
        print("got html text success")
        self.sem.release()       # Semaphore.release(): give the slot back

class UrlProducer(threading.Thread):
    def __init__(self, sem):
        super().__init__()
        self.sem = sem

    def run(self):
        for i in range(20):
            self.sem.acquire()    # Semaphore.acquire(): each crawler started decrements the counter; once 3 are running the counter is 0 and the next acquire() blocks
            html_thread = HtmlSpider("https://baidu.com/{}".format(i), self.sem)
            html_thread.start()

if __name__ == "__main__":
    sem = threading.Semaphore(3)  # allow at most 3 crawlers at a time
    url_producer = UrlProducer(sem)
    url_producer.start()
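
A Semaphore is also a context manager, so acquire and release can be paired with a with statement. The sketch below records the peak number of threads inside the guarded section to show that the limit holds. The names crawl, running and peak are made up for illustration.

```python
import threading
import time

sem = threading.Semaphore(3)
state_lock = threading.Lock()
running = 0
peak = 0

def crawl(i):
    global running, peak
    with sem:                      # acquire on entry, release on exit, even on error
        with state_lock:
            running += 1
            peak = max(peak, running)
        time.sleep(0.05)           # stand-in for the real download
        with state_lock:
            running -= 1

threads = [threading.Thread(target=crawl, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("peak concurrency:", peak)
```

Here the thread that acquires the semaphore is also the one that releases it, which tends to be easier to reason about than releasing from a different thread.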

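7. ThreadPoolExecutor thread pool

Why use a thread pool: it does more than limit the number of threads. We can query the state of threads and tasks and get their return values; the main thread knows right away when a task finishes; and futures gives multithreaded and multiprocess code the same interface.

Thread pool class: ThreadPoolExecutor

Steps for using a thread pool:

Instantiate the pool
Submit tasks; submit does not block and immediately returns a Future object
Have the main thread wait for the tasks to finish
Shut down the pool

Methods of the returned Future:

done(): checks whether the task has finished
result(): returns the task's result; blocks until it is available
cancel(): cancels the task; a task that is running or already finished cannot be cancelled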

from concurrent.futures import ThreadPoolExecutor
import time

def get_html(times):
    time.sleep(times)
    print("get page {} success".format(times))
    return times
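# Tasks submitted from here on run on the pool's threads
executor = ThreadPoolExecutor(max_workers=2) # at most two worker threads
# submit hands a function to the pool; it does not block and returns a Future that exposes the task's state
task1 = executor.submit(get_html, 3)
task2 = executor.submit(get_html, 2)

# Querying state:
# done() tells whether a task has finished
# print(task1.done())     # prints: False
# print(task2.cancel())   # only a task that has not started can be cancelled
# time.sleep(3)
# print(task1.done())     # prints: True

# result() returns the task's return value
# print(task1.result())   # prints: 3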
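
The Future methods are easier to observe with a single worker, because the second task then stays queued and can still be cancelled. This is a minimal sketch with shortened delays; the variable names cancelled and first_result are only illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def get_html(times):
    time.sleep(times)
    return times

# The with statement shuts the pool down on exit and waits for running tasks
with ThreadPoolExecutor(max_workers=1) as executor:
    task1 = executor.submit(get_html, 0.2)
    task2 = executor.submit(get_html, 0.1)   # queued behind task1: only one worker
    cancelled = task2.cancel()               # still pending, so it can be cancelled
    first_result = task1.result()            # blocks until task1 has finished

print(cancelled, task2.cancelled(), first_result, task1.done())
```

With max_workers=2, as in the example above, task2 would normally start right away and cancel() would return False.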

1) as_completed() from futures: yields the results of tasks as they finish (recommended)

from concurrent.futures import as_completed
urls = [3,2,4]
all_task = [executor.submit(get_html, url) for url in urls]
# wait(all_task, return_when=FIRST_COMPLETED)
for future in as_completed(all_task):  # as_completed is a generator that yields each task as it finishes
    data = future.result()
    print("get {} page".format(data)) # print the tasks that completed

2) The pool's own map() method: returns the results of the tasks, in the order of the inputs

from concurrent.futures import ThreadPoolExecutor, as_completed, wait, FIRST_COMPLETED

import time

def get_html(times):
    time.sleep(times)
    print("get page {} success".format(times))
    return times

executor = ThreadPoolExecutor(max_workers=2)

urls = [3, 2, 4]
# executor.map yields the return value of each task, in the order of urls
for data in executor.map(get_html, urls):
    print("get {} page".format(data))
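
The ordering difference between map() and as_completed() is easy to overlook, so here is a small side-by-side sketch. The delays are shortened and the names delays, map_order and completion_order are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def get_html(times):
    time.sleep(times)
    return times

delays = [0.3, 0.1, 0.2]
with ThreadPoolExecutor(max_workers=3) as executor:
    # map() yields results in the order of the inputs
    map_order = list(executor.map(get_html, delays))

    # as_completed() yields futures in the order they finish
    futures = [executor.submit(get_html, d) for d in delays]
    completion_order = [f.result() for f in as_completed(futures)]

print("map:", map_order)
print("as_completed:", completion_order)
```

map() always gives back the input order, while as_completed() will typically list the shortest delay first.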

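3) wait() from futures: blocks until the given condition is met, then continues with the code below it

urls = [3,2,4]
all_task = [executor.submit(get_html, url) for url in urls]
wait(all_task, return_when=FIRST_COMPLETED)   # blocks; return_when sets the condition, and FIRST_COMPLETED means continue as soon as the first task finishes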
print("main")
for future in as_completed(all_task):
    data = future.result()
    print("get {} page".format(data))
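
wait() also returns two sets of futures, finished and unfinished, which the example above does not use. A minimal sketch, with shortened delays and illustrative variable names:

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED, ALL_COMPLETED
import time

def get_html(times):
    time.sleep(times)
    return times

with ThreadPoolExecutor(max_workers=3) as executor:
    tasks = [executor.submit(get_html, d) for d in (0.3, 0.05, 0.4)]

    # Returns as soon as at least one task has finished
    done_first, pending = wait(tasks, return_when=FIRST_COMPLETED)
    print("finished so far:", len(done_first), "still pending:", len(pending))

    # ALL_COMPLETED is the default: returns once every task has finished
    done_all, not_done = wait(tasks, return_when=ALL_COMPLETED)
    print("finished:", len(done_all), "not done:", len(not_done))
```

The futures in the first returned set are already finished, so calling result() on them does not block.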

8. Multiprocess programming with multiprocessing

Reference on multiprocessing: http://www.cnblogs.com/kaituorensheng/p/4445418.html

1) Ways to create processes

Method 1:

import multiprocessing
import time

def worker_1(interval):
    print("worker_1")
    time.sleep(interval)
    print("end worker_1")

def worker_2(interval):
    print("worker_2")
    time.sleep(interval)
    print("end worker_2")

def worker_3(interval):
    print("worker_3")
    time.sleep(interval)
    print("end worker_3")

if __name__ == "__main__":
    p1 = multiprocessing.Process(target=worker_1, args=(2,))
    p2 = multiprocessing.Process(target=worker_2, args=(3,))
    p3 = multiprocessing.Process(target=worker_3, args=(4,))

    p1.start()
    p2.start()
    p3.start()

    print("The number of CPU is:" + str(multiprocessing.cpu_count()))
    for p in multiprocessing.active_children():
        print("child   p.name:" + p.name + "\tp.id" + str(p.pid))
    print("END!!!!!!!!!!!!!!!!!")

# Output:
The number of CPU is:4
child   p.name:Process-3    p.id7992
child   p.name:Process-2    p.id4204
child   p.name:Process-1    p.id6380
END!!!!!!!!!!!!!!!!!
worker_1
worker_3
worker_2
end worker_1
end worker_2
end worker_3

Method 2:

import multiprocessing
import time

class ClockProcess(multiprocessing.Process):
    def __init__(self, interval):
        multiprocessing.Process.__init__(self)
        self.interval = interval

    def run(self):
        n = 5
        while n > 0:
            print("the time is {0}".format(time.ctime()))
            time.sleep(self.interval)
            n -= 1

if __name__ == '__main__':
    p = ClockProcess(3)
    p.start()

# Output:
the time is Tue Apr 21 20:31:30 2015
the time is Tue Apr 21 20:31:33 2015
the time is Tue Apr 21 20:31:36 2015
the time is Tue Apr 21 20:31:39 2015
the time is Tue Apr 21 20:31:42 2015
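
Neither example above waits for its child processes. As a minimal sketch, join() and exitcode can be used to wait for a process and check how it ended. The helper describe and the process name worker-1 are made up for illustration.

```python
import multiprocessing
import time

def worker(interval):
    time.sleep(interval)

def describe(process):
    # One-line summary built from attributes every Process object has
    return "{} alive={} exitcode={}".format(
        process.name, process.is_alive(), process.exitcode
    )

if __name__ == "__main__":
    p = multiprocessing.Process(target=worker, args=(0.1,), name="worker-1")
    print(describe(p))   # not started yet: exitcode is None
    p.start()
    p.join()             # block until the child process has exited
    print(describe(p))   # exitcode is 0 after a clean exit
```

The `if __name__ == "__main__":` guard matters on platforms that start children by importing the main module, where unguarded code would run again in every child.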

2) Process pools:

Method 1:

from concurrent.futures import ProcessPoolExecutor
import requests

def task(url):
    response = requests.get(url)
    print(url, response)
    # parse the response here, for example with a regular expression


url_list = [
    'http://www.cnblogs.com/wupeiqi',
    'http://huaban.com/favorite/beauty/',
    'http://www.bing.com',
    'http://www.zhihu.com',
    'http://www.sina.com',
    'http://www.baidu.com',
    'http://www.autohome.com.cn',
]

if __name__ == "__main__":
    # The guard keeps child processes from re-running this block when they import the module
    pool = ProcessPoolExecutor(7)
    for url in url_list:
        pool.submit(task, url)

    pool.shutdown(wait=True)

Method 2:

from concurrent.futures import ProcessPoolExecutor
import requests

def task(url):
    response = requests.get(url)
    return response

def done(future, *args, **kwargs):
    # Callback: called with the Future once the task has finished
    response = future.result()
    print(response.status_code, response.content)

url_list = [
    'http://www.cnblogs.com/wupeiqi',
    'http://huaban.com/favorite/beauty/',
    'http://www.bing.com',
    'http://www.zhihu.com',
    'http://www.sina.com',
    'http://www.baidu.com',
    'http://www.autohome.com.cn',
]

if __name__ == "__main__":
    # The guard keeps child processes from re-running this block when they import the module
    pool = ProcessPoolExecutor(7)
    for url in url_list:
        v = pool.submit(task, url)
        v.add_done_callback(done)

    pool.shutdown(wait=True)

3) multiprocessing's Pool

Using a process pool (non-blocking):

import multiprocessing
import time

def func(msg):
    print("msg:", msg)
    time.sleep(3)
    print("end")

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=3)
    for i in range(4):
        msg = "hello %d" % i
        pool.apply_async(func, (msg,))   # the pool keeps at most `processes` tasks running; when one finishes, the next queued task starts

    print("Mark~ Mark~ Mark~~~~~~~~~~~~~~~~~~~~~~")
    pool.close()
    pool.join()   # call close() before join(), otherwise join() raises an error; after close() no new tasks are accepted, and join() waits for all worker processes to exit
    print("Sub-process(es) done.")

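Function reference:

apply_async(func[, args[, kwds[, callback]]]) is non-blocking, while apply(func[, args[, kwds]]) blocks (compare the output of the non-blocking and blocking examples to see the difference)
close() closes the pool so that it accepts no new tasks.
terminate() stops the worker processes without finishing outstanding tasks.
join() blocks the main process until the worker processes exit. join must be called after close or terminate.

Walkthrough: a pool is created with 3 processes. The loop produces four values, 0, 1, 2, and 3, and each one is submitted to the pool. Because the pool has 3 processes, 0, 1, and 2 are executed right away; 3 is handled only once one of them finishes and frees a process, which is why "msg: hello 3" appears after an "end". Since the calls are non-blocking, the main function carries on without waiting for the workers, so right after the for loop it prints "Mark~ Mark~ Mark~~~~~~~~~~~~~~~~~~~~~~". The main program then waits at pool.join() for the workers to finish.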
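
apply_async also returns an AsyncResult object whose get() method fetches the task's return value. The sketch below illustrates this with made-up names (square, collect); it is not part of the original examples.

```python
import multiprocessing

def square(x):
    return x * x

def collect(async_results):
    # AsyncResult.get() blocks until that task's return value is available
    return [r.get() for r in async_results]

if __name__ == "__main__":
    with multiprocessing.Pool(processes=3) as pool:
        # apply_async returns immediately with an AsyncResult per task
        pending = [pool.apply_async(square, (i,)) for i in range(4)]
        print(collect(pending))
```

Leaving the with block terminates the pool, which is why the results are collected inside it.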

Using a process pool (blocking):

import multiprocessing
import time

def func(msg):
    print("msg:", msg)
    time.sleep(3)
    print("end")

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=3)
    for i in range(4):
        msg = "hello %d" % i
        pool.apply(func, (msg,))   # apply blocks until func returns, so the tasks run one after another

    print("Mark~ Mark~ Mark~~~~~~~~~~~~~~~~~~~~~~")
    pool.close()
    pool.join()   # call close() before join(), otherwise join() raises an error; after close() no new tasks are accepted, and join() waits for all worker processes to exit

9. Communication between processes: Queue, Pipe, and Manager

1) Queue

This is the Queue that ships with multiprocessing.

import time
from multiprocessing import Process, Queue
# from queue import Queue  # not this Queue

def producer(queue):
    queue.put("a")
    time.sleep(2)

def consumer(queue):
    time.sleep(2)
    data = queue.get()
    print(data)

if __name__ == "__main__":
    queue = Queue(10)
    my_producer = Process(target=producer, args=(queue,))
    my_consumer = Process(target=consumer, args=(queue,))
    my_producer.start()
    my_consumer.start()
    my_producer.join()
    my_consumer.join()

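The Queue from multiprocessing cannot be used with multiprocessing's Pool. Processes in a Pool must communicate through the Queue provided by a Manager.

So far we have met three Queues:

from queue import Queue: for communication between threads
from multiprocessing import Queue: for communication between processes
from multiprocessing import Manager → q = Manager().Queue(): for communication between processes in a multiprocessing Pool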
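
The third case can be sketched as follows: a Manager().Queue() proxy is passed to the pool's workers as an ordinary argument. The function names producer and consumer and the item "a" are illustrative, and the 10-second timeout is an arbitrary safety margin.

```python
from multiprocessing import Manager, Pool

def producer(queue, item):
    queue.put(item)

def consumer(queue):
    return queue.get()

if __name__ == "__main__":
    with Manager() as manager:
        queue = manager.Queue(10)      # a proxy object that can be sent to pool workers
        with Pool(2) as pool:
            pool.apply_async(producer, (queue, "a"))
            result = pool.apply_async(consumer, (queue,))
            print(result.get(timeout=10))
```

The queue itself lives in the manager's own process, and the workers only hold proxies to it.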

2) Pipe

Pipe lives in multiprocessing. A Pipe connects only two processes, and it performs better than Queue.

from multiprocessing import Process, Pipe

def producer(pipe):
    pipe.send("MJ")

def consumer(pipe):
    print(pipe.recv())

if __name__ == "__main__":
    receive_pipe, send_pipe = Pipe()
    # a Pipe only supports communication between two processes
    my_producer = Process(target=producer, args=(send_pipe,))
    my_consumer = Process(target=consumer, args=(receive_pipe,))

    my_producer.start()
    my_consumer.start()
    my_producer.join()
    my_consumer.join()
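
Pipe() returns two connection objects, and its duplex argument decides whether both ends can send. The sketch below exercises both modes inside a single process, which is enough for small messages. The variable names are illustrative.

```python
from multiprocessing import Pipe

# duplex=False gives a one-way pipe: the first end only receives, the second only sends
recv_end, send_end = Pipe(duplex=False)
send_end.send({"name": "MJ"})
one_way = recv_end.recv()

# The default, duplex=True, lets both ends send and receive
left, right = Pipe()
left.send("ping")
from_left = right.recv()
right.send("pong")
from_right = left.recv()

print(one_way, from_left, from_right)

for conn in (recv_end, send_end, left, right):
    conn.close()
```

In the two-process example above, each process gets one end, and with the default duplex pipe either end could serve as the sender.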