Multithreaded Crawler

小白_橙子

Article link: https://blog.csdn.net/weixin_43958804/article/details/86515941

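The Global Interpreter Lock (GIL) controls whether a Python thread can get CPU time: within one interpreter process, only one thread executes Python bytecode at any given moment. This limits CPU-bound code, but a crawler spends most of its time waiting on the network, and CPython releases the GIL during blocking I/O, so multiple threads still speed up crawling.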
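A quick way to see how the GIL behaves for I/O-bound work such as crawling is to time blocking waits with and without threads. The sketch below is only an illustration: it uses time.sleep as a stand-in for a network request (an assumption made so the example needs no network), and the names fake_download, run_serial, and run_threaded are hypothetical.

```python
import time
from threading import Thread


def fake_download(delay):
    # Stand-in for a blocking network request (assumption);
    # time.sleep releases the GIL while it waits
    time.sleep(delay)


def run_serial(n, delay):
    # Run n fake downloads one after another and return the elapsed seconds
    start = time.perf_counter()
    for _ in range(n):
        fake_download(delay)
    return time.perf_counter() - start


def run_threaded(n, delay):
    # Run n fake downloads in n threads and return the elapsed seconds
    start = time.perf_counter()
    threads = [Thread(target=fake_download, args=(delay,)) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    print("serial:   %.2fs" % run_serial(4, 0.2))
    print("threaded: %.2fs" % run_threaded(4, 0.2))
```

The threaded run should take roughly as long as one wait, while the serial run takes about as long as all of them added together.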

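There are two ways to write multithreaded code: an object-oriented style (subclass Thread and override run) and a function-based style (pass a function to Thread as its target).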
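The two styles can be sketched side by side as follows. This is a minimal illustration, not code from the original post; SquareThread, square, and results are hypothetical names chosen for the example.

```python
from threading import Thread

results = []


class SquareThread(Thread):
    # Object-oriented style: subclass Thread and override run()
    def __init__(self, number):
        super().__init__()
        self.number = number

    def run(self):
        results.append(("class", self.number ** 2))


def square(number):
    # Function style: this function is handed to Thread as its target
    results.append(("function", number ** 2))


t1 = SquareThread(3)
t2 = Thread(target=square, args=(4,))
for t in (t1, t2):
    t.start()
for t in (t1, t2):
    t.join()

print(sorted(results))
```

The class style is convenient when each thread carries its own state (headers, proxies, a shared queue); the function style is shorter for one-off tasks.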

Queue types:

FIFO (first in, first out): queue.Queue(maxsize=0)
LIFO (last in, first out): queue.LifoQueue(maxsize=0)
Priority queue (lowest value retrieved first): queue.PriorityQueue(maxsize=0)
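The difference between the three queue types is the order in which items come back out. A small sketch, with hypothetical helper names fill and drain, puts the same values into each type and reads them back:

```python
import queue


def fill(q, values):
    # Put every value into the queue and return the queue
    for v in values:
        q.put(v)
    return q


def drain(q):
    # Remove and return all items in retrieval order
    # (checking empty() first is safe here because only one thread is used)
    items = []
    while not q.empty():
        items.append(q.get())
    return items


values = [3, 1, 2]
fifo = drain(fill(queue.Queue(), values))
lifo = drain(fill(queue.LifoQueue(), values))
prio = drain(fill(queue.PriorityQueue(), values))

print("FIFO:    ", fifo)
print("LIFO:    ", lifo)
print("Priority:", prio)
```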

Creating a queue

# Create a queue. maxsize is the maximum number of items the queue can hold;
# maxsize <= 0 means the queue size is unlimited.
import queue
myqueue = queue.Queue(maxsize=10)

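Queue methods

myqueue.qsize()    # approximate number of items in the queue
myqueue.full()     # True if the queue is full
myqueue.empty()    # True if the queue is empty
myqueue.put(item)  # add an item to the queue
myqueue.get()      # remove and return an item from the queue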
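A short sketch of these methods on a bounded queue. By default put() blocks when the queue is full and get() blocks when it is empty; the standard-library variants put_nowait() and get_nowait() raise queue.Full and queue.Empty instead, which the example uses so that it never hangs. The variable names are illustrative only.

```python
import queue

q = queue.Queue(maxsize=2)
q.put("a")
q.put("b")

size_when_full = q.qsize()
was_full = q.full()

# A third item does not fit; put_nowait() raises instead of blocking
try:
    q.put_nowait("c")
    overflowed = False
except queue.Full:
    overflowed = True

first = q.get()
second = q.get()
was_empty = q.empty()

# Nothing is left; get_nowait() raises instead of blocking
try:
    q.get_nowait()
    underflowed = False
except queue.Empty:
    underflowed = True

print(size_when_full, was_full, overflowed, first, second, was_empty, underflowed)
```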

Multithreaded crawler example

# Crawl book titles from dushu.com using multiple threads
import requests
from threading import Thread
import queue
from lxml import etree


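class Mythread(Thread):
    def __init__(self, url_queue):
        # Initialize the parent Thread class
        super().__init__()
        self.queue = url_queue
        # Proxy settings (example address; replace it with a working proxy)
        self.proxy = {
            "http": 'http://111.40.84.73:9999',
            'https': 'http://111.40.84.73:9999'
        }
        # Request headers
        self.headers = {
            "User-Agent": 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0'
        }

    def run(self):
        self.spider()

    def spider(self):
        while True:
            # Take a URL without blocking. Checking empty() and then calling
            # get() is racy: another thread may take the last URL in between,
            # leaving this thread blocked forever.
            try:
                url = self.queue.get_nowait()
            except queue.Empty:
                break
            # Send the request and get the response
            try:
                response = requests.get(url, headers=self.headers,
                                        proxies=self.proxy, timeout=10)
            except requests.RequestException as e:
                print("request failed:", url, e)
                continue
            # Extract the titles with XPath
            html = etree.HTML(response.text)
            title = html.xpath('//div[@class="bookslist"]/ul/li/div/h3/a/text()')
            print(title)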


def main():
    # Create a queue
    myqueue = queue.Queue()
    # Generate the page URLs
    for v in range(1, 4):
        url = "https://www.dushu.com/lianzai/1115_{}.html".format(v)
        myqueue.put(url)

    allthread = []
    # Create and start several threads
    for v in range(5):
        mythread = Mythread(myqueue)
        mythread.start()
        allthread.append(mythread)

    # Wait for all worker threads to finish
    for v in allthread:
        v.join()


if __name__ == '__main__':
    main()
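A variation on the example: the sketch below collects results in a dict rather than printing them, and takes URLs with get_nowait() so that a thread never blocks on an already-drained queue. To keep it runnable offline, the HTTP request is replaced by a stand-in function; worker, crawl, and fake_fetch are hypothetical names, and in real use fake_fetch would be swapped for a function that calls requests.get.

```python
import queue
from threading import Lock, Thread


def worker(url_queue, fetch, results, lock):
    # Keep taking URLs until the queue is drained
    while True:
        try:
            url = url_queue.get_nowait()
        except queue.Empty:
            return
        try:
            value = fetch(url)
        except Exception as exc:
            # Record the failure instead of letting the thread die
            value = exc
        with lock:
            results[url] = value


def crawl(urls, fetch, num_threads=5):
    # Fill the queue, start the workers, wait for them, return all results
    url_queue = queue.Queue()
    for url in urls:
        url_queue.put(url)
    results = {}
    lock = Lock()
    threads = [
        Thread(target=worker, args=(url_queue, fetch, results, lock))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def fake_fetch(url):
    # Hypothetical stand-in for an HTTP request (assumption: no network here)
    return "page for " + url


urls = ["https://www.dushu.com/lianzai/1115_{}.html".format(v) for v in range(1, 4)]
pages = crawl(urls, fake_fetch)
print(len(pages), "pages fetched")
```

Passing the fetch function in as a parameter keeps the threading logic separate from the network code, which makes the worker loop easy to test.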

Chaojiying platform

CAPTCHA recognition

# Simulate logging in to Douban
import requests
import re
from chaojiying import Chaojiying_Client

# Create a session
session = requests.Session()

url = 'https://www.douban.com/login'

# Request headers
headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0'
}

res = session.get(url, headers=headers)
print(res.text)

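# Get the CAPTCHA image URL
cap_url = re.findall('<img id="captcha_image" src="(.*?)&amp;size=s"', res.text)[0]
print(cap_url)
# Download the CAPTCHA image and save it locally
capimg = session.get(cap_url, headers=headers).content
with open("cap.jpg", 'wb') as fp:
    fp.write(capimg)

# Get the CAPTCHA id
capid = re.findall('<input type="hidden" name="captcha-id" value="(.*?)"/>', res.text)[0]
print(capid)

# Recognize the CAPTCHA with Chaojiying.
# Replace the three placeholders with your own Chaojiying username, password,
# and the software ID generated under User Center >> Software ID.
chaojiying = Chaojiying_Client('your_username', 'your_password', 'your_software_id')
# Read the saved CAPTCHA image (on Windows the path may need escaped separators)
with open('cap.jpg', 'rb') as fp:
    im = fp.read()
# The second argument is the Chaojiying CAPTCHA type code
cap = chaojiying.PostPic(im, 1007)['pic_str']
print(cap)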

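data = {
    'source': 'None',
    'redir': 'https://www.douban.com',
    'form_email': 'your Douban account',
    'form_password': 'your Douban password',
    'captcha-solution': cap,
    'captcha-id': capid,
    'login': '登录',  # value submitted by the login button; keep as-is
}
# Submit the login form
res = session.post(url, data=data, headers=headers)

# Check whether the login succeeded by saving the home page for inspection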
index = "https://www.douban.com/"
response = session.get(index, headers=headers)
with open('dou.html', 'w', encoding='utf-8') as fp:
    fp.write(response.text)

Installing Scrapy

pip install scrapy

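Possible problems

If Twisted fails to install, it can be installed offline.

Offline download: https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted

Choose the wheel that matches your interpreter version. The file below, for example, targets 64-bit CPython 3.7 on Windows: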

pip install Twisted-18.9.0-cp37-cp37m-win_amd64.whl
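To pick a matching wheel, it helps to know the interpreter's version tag and whether it is a 32-bit or 64-bit build. A minimal sketch that prints both using only the standard library; cpython_tag and pointer_bits are hypothetical helper names, and the "cp" prefix assumes the interpreter is CPython.

```python
import struct
import sys


def cpython_tag():
    # Version part of the wheel name, e.g. "cp37" for CPython 3.7
    return "cp{}{}".format(sys.version_info.major, sys.version_info.minor)


def pointer_bits():
    # Size of a C pointer in bits: 64 on a 64-bit interpreter, 32 on a 32-bit one
    return struct.calcsize("P") * 8


print(cpython_tag(), "{}-bit".format(pointer_bits()))
```

On 64-bit Windows builds the wheel name ends in win_amd64, and on 32-bit builds in win32.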

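Then run pip install scrapy again.
Running scrapy in cmd prints its usage information, but this alone does not mean the installation succeeded.
Run scrapy bench in cmd; it may fail because the win32api module is missing.
pip install pywin32
Once that is done, run scrapy bench again. If it reports the crawl speed of your machine, Scrapy is installed correctly.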
