Crawling Web Pages Concurrently with gevent Coroutines

Latest recommended article published on 2023-06-06 01:02:12

LA7388


Original link: http://www.cnblogs.com/xiangjun555/articles/7737089.html


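1. Preface

So far we have only described in theory how gevent switches automatically when it hits I/O. Now let's try it in practice: we will crawl several web pages with coroutines and see how gevent achieves concurrency.

2. Crawling pages serially

2.1 Serial crawl

Note: first, look at the serial version of the crawler and see how long it takes.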

from urllib import request   # simple built-in HTTP client; fine for basic crawling
import time

def f(url):
    print("GET:{0}".format(url))
    resp = request.urlopen(url)    # open the URL (a blocking call)
    data = resp.read()             # read the response body
    with open("url.html", "wb") as fh:   # note: every URL is written to the same file
        fh.write(data)
    print('{0} bytes received from {1}'.format(len(data), url))

urls = [
    'http://www.163.com/',
    'https://www.yahoo.com/',
    'https://github.com/'
]
time_start = time.time()    # start time
for url in urls:
    f(url)
print("sync cost", time.time() - time_start)  # time the program took

Output:

GET:http://www.163.com/
658380 bytes received from http://www.163.com/
GET:https://www.yahoo.com/
468153 bytes received from https://www.yahoo.com/
GET:https://github.com/
55467 bytes received from https://github.com/
sync cost 5.505090951919556    # time the program took
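
One side effect of the code above is that f saves every page to the same url.html, so each download overwrites the previous one. If you want to keep all three pages, a minimal sketch is to derive the file name from the URL. The helper name filename_for is hypothetical (it is not part of the original article), and two URLs on the same host would still collide.

```python
from urllib.parse import urlparse

def filename_for(url):
    """Build a file name such as 'www.163.com.html' from a URL's host name.

    Hypothetical helper for illustration; not part of the original article.
    """
    host = urlparse(url).netloc or "page"
    # Replace characters that are awkward in file names (e.g. ':' in 'host:8080').
    safe = "".join(c if c.isalnum() or c in ".-" else "_" for c in host)
    return safe + ".html"
```

Inside f, open(filename_for(url), "wb") could then replace open("url.html", "wb").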

2.2 Crawling with gevent coroutines

Note: that run was serial. Now let's send the same requests through gevent and see what happens.

from urllib import request
import gevent, time

def f(url):
    print("GET:{0}".format(url))
    resp = request.urlopen(url)
    data = resp.read()
    with open("url.html", "wb") as fh:
        fh.write(data)
    print('{0} bytes received from {1}'.format(len(data), url))

async_time_start = time.time()
gevent.joinall([                     # start the coroutines with gevent and wait for all of them
    gevent.spawn(f, 'http://www.163.com/'),  # arguments after the function are passed to it; earlier examples took no arguments
    gevent.spawn(f, 'https://www.yahoo.com/'),
    gevent.spawn(f, 'https://github.com/'),
])
print("async cost", time.time() - async_time_start)   # elapsed time

Output:

GET:http://www.163.com/
658380 bytes received from http://www.163.com/
GET:https://www.yahoo.com/
466264 bytes received from https://www.yahoo.com/
GET:https://github.com/
55459 bytes received from https://github.com/
async cost 6.204461574554443  # elapsed time

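Question: I used concurrency, so why did the run time not drop, and even get a little longer?

By default, urllib knows nothing about gevent. It does its network I/O through the standard blocking socket calls, and gevent cannot detect that I/O. Because gevent never learns that a greenlet is waiting on I/O, it never switches to another one, so the three requests still run one after another (the output shows each download finishing before the next GET starts). The same holds for the plain socket module we used earlier: handed to gevent unmodified, it simply blocks.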
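
The effect can be reproduced without any network access. The sketch below, which assumes monkey patching has not been applied, uses the standard time.sleep as a stand-in for a blocking call that gevent cannot see, and gevent.sleep as a call that yields to other greenlets. The function names blocking_task, cooperative_task and timed are made up for this illustration.

```python
import time
import gevent

def blocking_task():
    # Standard-library sleep: gevent is unaware of it, so no switch happens.
    time.sleep(0.3)

def cooperative_task():
    # gevent's own sleep: yields control so other greenlets can run.
    gevent.sleep(0.3)

def timed(task, n=3):
    """Run n greenlets of `task` and return the total elapsed seconds."""
    start = time.time()
    gevent.joinall([gevent.spawn(task) for _ in range(n)])
    return time.time() - start

if __name__ == "__main__":
    print("blocking   :", timed(blocking_task))     # roughly n * 0.3 seconds
    print("cooperative:", timed(cooperative_task))  # roughly 0.3 seconds
```

The blocking version takes about as long as running the tasks one by one, which is the same behaviour the unpatched urllib crawler showed.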

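3. Crawling pages concurrently

Since that approach does not work, how do we let gevent know that urllib is doing I/O?

Answer: apply a patch. Import monkey from gevent and add a single call, monkey.patch_all(); nothing else in the program has to change.

3.1 Code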

from gevent import monkey   # import monkey
monkey.patch_all()  # make the standard library's blocking I/O cooperative; this one call is enough
                    # (gevent recommends patching as early as possible, before other imports)

from urllib import request
import gevent, time

def f(url):
    print("GET:{0}".format(url))
    resp = request.urlopen(url)
    data = resp.read()
    with open("url.html", "wb") as fh:
        fh.write(data)
    print('{0} bytes received from {1}'.format(len(data), url))

urls = [
    'http://www.163.com/',
    'https://www.yahoo.com/',
    'https://github.com/'
]
time_start = time.time()
for url in urls:
    f(url)
print("sync cost", time.time() - time_start)  # time for the serial run

async_time_start = time.time()
gevent.joinall([
    gevent.spawn(f, 'http://www.163.com/'),
    gevent.spawn(f, 'https://www.yahoo.com/'),
    gevent.spawn(f, 'https://github.com/'),
])
print("async cost", time.time() - async_time_start)   # time for the concurrent run

Output:

GET:http://www.163.com/
658315 bytes received from http://www.163.com/
GET:https://www.yahoo.com/
467577 bytes received from https://www.yahoo.com/
GET:https://github.com/
55467 bytes received from https://github.com/
sync cost 4.895136833190918    # result of the serial run
GET:http://www.163.com/
GET:https://www.yahoo.com/
GET:https://github.com/
658315 bytes received from http://www.163.com/
471042 bytes received from https://www.yahoo.com/
55467 bytes received from https://github.com/
async cost 3.0067789554595947   # result of the concurrent run

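As you can see, it works: all three GET lines now appear before any response arrives. The gap is not huge here, and network conditions also play a part, but the point is that the patch is required. monkey.patch_all() replaces the blocking standard-library pieces that urllib relies on (such as socket) with gevent's cooperative versions. You can think of it as placing a marker, much like gevent.sleep(), in front of every operation that may do I/O, so that whenever urllib would block, gevent switches to another greenlet.

Note that gevent.sleep() is what we used earlier to simulate I/O. The marker means: this is an I/O operation, so switch when it blocks.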
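
To see the patch at work without depending on network speed, here is a small sketch in which the standard time.sleep stands in for a slow request. After monkey.patch_all(), time.sleep itself becomes cooperative, so the three calls overlap. The names fake_fetch and crawl are hypothetical; the sketch also spawns from a list instead of writing each gevent.spawn by hand and reads each return value from the greenlet's value attribute.

```python
from gevent import monkey
monkey.patch_all()   # patch first, so the imports below see the cooperative versions

import time
import gevent

def fake_fetch(url, delay=0.3):
    """Stand-in for a network request (hypothetical helper)."""
    time.sleep(delay)   # patched: yields to other greenlets instead of blocking
    return url

def crawl(urls):
    """Spawn one greenlet per URL and collect the return values in order."""
    jobs = [gevent.spawn(fake_fetch, url) for url in urls]
    gevent.joinall(jobs)
    return [job.value for job in jobs]

if __name__ == "__main__":
    start = time.time()
    print(crawl(['http://www.163.com/', 'https://www.yahoo.com/', 'https://github.com/']))
    print("elapsed", time.time() - start)   # close to one delay, not three
```

monkey.is_module_patched("time") can be used to check whether a given module has been patched.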

4. Multi-socket concurrency in a single thread with gevent

4.1 Server

import gevent
from gevent import socket, monkey
monkey.patch_all()

def server(port):
    s = socket.socket()
    s.bind(('0.0.0.0', port))
    s.listen(500)
    while True:
        cli, addr = s.accept()
        gevent.spawn(handle_request, cli)   # one greenlet per client connection

def handle_request(conn):
    try:
        while True:
            data = conn.recv(1024)
            print("recv:", data)
            if not data:                    # empty bytes: the client closed the connection
                conn.shutdown(socket.SHUT_WR)
                break                       # leave the loop instead of spinning on a dead socket
            conn.sendall(data)              # echo the data back
    except Exception as ex:
        print(ex)
    finally:
        conn.close()

if __name__ == '__main__':
    server(8001)

4.2 Client

import socket

HOST = 'localhost'    # The remote host
PORT = 8001           # The same port as used by the server
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((HOST, PORT))
while True:
    msg = bytes(input(">>:"), encoding="utf8")
    if not msg:       # empty input: stop, otherwise recv() would wait forever for an echo
        break
    s.sendall(msg)
    data = s.recv(1024)
    print('Received', repr(data))
s.close()
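
To check that a single thread really serves several sockets at once, the following self-contained sketch runs an echo server and three clients as greenlets in one process. It is an illustration only, not the article's code: it binds to an ephemeral port on 127.0.0.1 instead of port 8001, uses gevent.socket directly rather than monkey patching, and the names handle, serve, echo_once and demo are made up.

```python
import gevent
from gevent import socket

def handle(conn):
    """Echo everything back until the client closes the connection."""
    try:
        while True:
            data = conn.recv(1024)
            if not data:
                break
            conn.sendall(data)
    finally:
        conn.close()

def serve(listener):
    """Accept connections forever, one greenlet per client."""
    while True:
        conn, _addr = listener.accept()
        gevent.spawn(handle, conn)

def echo_once(port, msg):
    """Connect, send one message, and return the echoed reply."""
    client = socket.create_connection(("127.0.0.1", port))
    try:
        client.sendall(msg)
        return client.recv(1024)
    finally:
        client.close()

def demo():
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
    listener.listen(5)
    port = listener.getsockname()[1]
    server = gevent.spawn(serve, listener)
    clients = [gevent.spawn(echo_once, port, b"hello %d" % i) for i in range(3)]
    gevent.joinall(clients, timeout=5)
    server.kill()                             # stop the accept loop
    listener.close()
    return [c.value for c in clients]

if __name__ == "__main__":
    print(demo())
```

Each client greenlet gets back its own message, even though the server, its handlers, and the clients all share one thread.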

Reposted from: https://www.cnblogs.com/xiangjun555/articles/7737089.html
