10. Learning Distributed Crawlers: Multithreading

RichRichJay

Category: Python crawlers. Tags: python

Article link: https://blog.csdn.net/Mr_Little_li/article/details/104338955

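The GIL in multithreading

Python's reference interpreter is CPython. Multithreading in CPython is effectively "fake" multithreading: on a multi-core CPU, Python code runs on only one core at a time and cannot use the other cores in parallel. Only one thread executes at any given moment, and CPython enforces this with the GIL (Global Interpreter Lock). The lock is necessary because CPython's memory management is not thread-safe. CPython is not the only interpreter, and some of the others have no GIL: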

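Jython: a Python interpreter implemented in Java; it has no GIL.
IronPython: a Python interpreter implemented on .NET; it has no GIL.
PyPy: a Python interpreter implemented in Python (strictly, in RPython, a restricted subset of Python); it has a GIL.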
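Although the GIL makes multithreading "fake", threads still improve efficiency considerably for I/O work such as file reads and writes and network requests, because a thread releases the GIL while it waits on I/O. Use multithreading for I/O-bound work; for CPU-bound computation, use multiple processes instead.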
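The effect on I/O-bound work can be seen with a small experiment. The sketch below is only an illustration: `fake_download` is a hypothetical stand-in that uses `time.sleep` in place of a real network request (sleeping releases the GIL, as blocking I/O does), and the delays are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(seconds):
    """Hypothetical stand-in for a network request; sleeping releases the GIL."""
    time.sleep(seconds)
    return seconds


def run_sequential(delays):
    """Run the fake downloads one after another and time the whole batch."""
    start = time.perf_counter()
    results = [fake_download(d) for d in delays]
    return results, time.perf_counter() - start


def run_threaded(delays, workers=4):
    """Run the same fake downloads on a thread pool and time the whole batch."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fake_download, delays))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    delays = [0.2] * 4
    _, sequential_time = run_sequential(delays)
    _, threaded_time = run_threaded(delays)
    print("sequential: %.2fs, threaded: %.2fs" % (sequential_time, threaded_time))
```

The threaded run should finish in roughly the time of one delay rather than the sum of all of them. Replacing the sleep with a pure-Python computation would remove most of that gain, which is why multiple processes are the better fit for CPU-bound work.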
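With the GIL, why is a Lock still needed?
The GIL only guarantees that one thread executes at any given moment; it does not make your code atomic (all or nothing). A single operation may be split into several steps, and a thread switch between those steps can corrupt shared data. Use a Lock to make such operations atomic.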
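As a minimal sketch of this idea, the example below protects a shared counter with `threading.Lock`. The `Counter` class and `count_with_threads` helper are illustrative names, not part of the original article.

```python
import threading


class Counter:
    """A shared counter whose increment is made atomic with a Lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # `self.value += 1` is a read, an add and a write. The lock stops
        # another thread from running between those steps.
        with self._lock:
            self.value += 1


def count_with_threads(num_threads, increments_per_thread):
    """Increment one shared counter from several threads and return the total."""
    counter = Counter()

    def worker():
        for _ in range(increments_per_thread):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value


if __name__ == "__main__":
    print(count_with_threads(4, 10000))
```

With the lock in place the total always equals `num_threads * increments_per_thread`. Without it, updates can be lost whenever a thread switch lands between the read and the write.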
Multithreaded crawl of jokes from budejie (百思不得姐)

import re
import threading
import time
from queue import Queue, Empty

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36'
}

class Producer(threading.Thread):
    """Fetches list pages and queues (username, post time, content) tuples."""
    def __init__(self,url_queue,info_queue,*args,**kwargs):
        super(Producer, self).__init__(*args,**kwargs)
        self.url_queue = url_queue
        self.info_queue = info_queue
    def run(self):
        while True:
            try:
                # get_nowait() cannot block forever if another producer
                # takes the last URL between an empty() check and get()
                url = self.url_queue.get_nowait()
            except Empty:
                break
            resp = requests.get(url,headers=headers)
            text = resp.text
            text_infos = re.findall(r"""
                <div.+?<li.+?j-r-list.+?<div.+?u-txt.+?<a.+?u-user-name.+?>(.+?)</a>  # username
                .+?<span.+?>(.+?)</span>  # post time
                .+?<div.+?j-r-list-c-desc.+?<a.+?>(.+?)</a>  # post content
                """, text, re.VERBOSE | re.DOTALL)
            for text_info in text_infos:
                self.info_queue.put(text_info)
            time.sleep(0.5)

class Consumer(threading.Thread):
    """Takes tuples off the info queue and appends them to budejie.txt."""
    def __init__(self,info_queue,*args,**kwargs):
        super(Consumer, self).__init__(*args,**kwargs)
        self.info_queue = info_queue
    def run(self):
        while True:
            try:
                info = self.info_queue.get(timeout=10)  # a tuple
            except Empty:
                # nothing arrived for 10 seconds: assume the producers are done
                break
            with open('budejie.txt', 'a', encoding='utf-8') as f:
                f.write('User: %s, posted: %s, content: %s\n'%(info[0],info[1],info[2].replace("<br />","")))
            print("Written successfully!")

def main():
    url_queue = Queue(10)
    info_queue = Queue(1000)  # intermediate container between producers and consumers
    for x in range(1,10):
        url = 'http://www.budejie.com/text/{}'.format(x)
        url_queue.put(url)

    for x in range(5):
        th = Producer(url_queue,info_queue,name="producer-%d"%x)
        th.start()

    for x in range(3):
        th = Consumer(info_queue,name='consumer-%d'%x)
        th.start()

if __name__ == '__main__':
    main()

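Crawling dynamic pages

A dynamic page is one where the site uses ajax to update part of its data without reloading the page. On Lagou's job listing page, for example, the url does not change when you move to another page of results, yet the job data is updated dynamically.
AJAX (Asynchronous JavaScript and XML) lets the front end exchange small amounts of data with the server so that a page can update asynchronously. This means part of a page can be updated without reloading the whole page.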
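Approaches to crawling dynamic pages

Analyze the interface that the ajax call hits, then request that interface directly from code.
Use Selenium + chromedriver to simulate browser behavior and collect the data.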
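The first approach usually comes down to two steps: build the parameters the interface expects, then parse the JSON it returns. The sketch below shows only those two steps, with no network access. The parameter names (`page`, `keyword`) and the response shape (`results` holding objects with a `title`) are assumptions for illustration; a real site's names have to be read from the browser's network panel.

```python
import json


def build_form(page, keyword):
    """Build the form fields for one page of a hypothetical ajax listing interface."""
    # Hypothetical field names; copy the real ones from the browser's network panel.
    return {"page": page, "keyword": keyword}


def extract_titles(response_text):
    """Extract titles from a JSON response with an assumed structure."""
    payload = json.loads(response_text)
    return [item["title"] for item in payload.get("results", [])]


if __name__ == "__main__":
    sample = '{"results": [{"title": "Python Engineer"}, {"title": "Crawler Developer"}]}'
    print(build_form(2, "python"))
    print(extract_titles(sample))
```

With requests, sending the form would look like `requests.post(api_url, data=build_form(2, "python"), headers=headers)`, where `api_url` stands for whatever endpoint the page actually calls.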

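Introduction to selenium and chromedriver
selenium works like a robot: it imitates what a person does in a browser and automates actions such as clicking, filling in data and deleting cookies. chromedriver is the driver program for the Chrome browser; selenium needs it in order to control Chrome.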
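Installing selenium
pip install selenium
Note: if pip is not on your PATH, run the command from the Scripts directory of your Python installation (or use python -m pip install selenium); otherwise the shell reports that pip cannot be found.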
Basic usage of selenium

from selenium import webdriver
import time

# Note: this is the Selenium 3 API. Selenium 4 replaces executable_path with a
# Service object and find_element_by_* with driver.find_element(By.ID, "kw") etc.

# Create a driver and point it at the chromedriver executable
driver = webdriver.Chrome(executable_path=r"D:\chromedriver\chromedriver.exe")

# Request a page
# driver.get('https://www.baidu.com')
# time.sleep(4)
#
# driver.close()  # close the current page
# driver.quit()  # quit the whole browser

######################## Locating elements ########################
# Find an element by id
# input_tag = driver.find_element_by_id("kw")
# input_tag.send_keys("python")  # type a value into the input tag

# Find an element by class name
# input_tag = driver.find_element_by_class_name("s_ipt")
# input_tag.send_keys("python")

# Find an element by its name attribute
# input_tag = driver.find_element_by_name("wd")
# input_tag.send_keys("python")

# Find an element by tag name
# input_tag = driver.find_element_by_tag_name("input")
# input_tag.send_keys("python")

# Find an element with xpath syntax
# input_tag = driver.find_element_by_xpath("//input[@id='kw']")
# input_tag.send_keys("python")

# Find an element with a css selector
# input_tag = driver.find_element_by_css_selector("#form #kw")
# input_tag.send_keys("python")

######################## Working with form elements ########################
# Test code against Zhihu, mainly to exercise input boxes and buttons
# driver.get("https://www.zhihu.com/signin?next=%2F")
#
# password_login = driver.find_element_by_class_name("SignFlow-tab--active")
# password_login.click()
#
# username_tag = driver.find_element_by_name("username")
# username_tag.send_keys("15*******414")
#
# password_tag = driver.find_element_by_xpath("//input[@name='password']")
# password_tag.send_keys("wu********c")
#
# submit_btn = driver.find_element_by_class_name("SignFlow-submitButton")
# submit_btn.click()

# Test code against Douban, mainly to exercise a checkbox
# driver.get("https://accounts.douban.com/passport/login_popup?login_source=anony")
# checkbox = driver.find_element_by_name("remember")
# checkbox.click()

selenium action chains

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains

driver = webdriver.Chrome(executable_path=r"D:\chromedriver\chromedriver.exe")

driver.get("https://www.zhihu.com/signin?next=%2F")

# Create the action chain object
actions = ActionChains(driver)

tag = driver.find_elements_by_class_name("SignFlow-tab")  # Zhihu offers two login methods; switch to the second one
actions.move_to_element(tag[1])
actions.click()

actions.perform()  # perform the tab switch

username_tag = driver.find_element_by_name("username")
password_tag = driver.find_element_by_name("password")
submit_btn = driver.find_element_by_class_name("SignFlow-submitButton")

# Start a fresh chain so the tab-switch actions queued above are not replayed
actions = ActionChains(driver)

actions.move_to_element(username_tag)  # move the mouse onto this tag
actions.send_keys_to_element(username_tag,"15*******414")  # type a value into this tag

actions.move_to_element(password_tag)
actions.send_keys_to_element(password_tag,"wu********c")

actions.move_to_element(submit_btn)
actions.click()

actions.perform()  # perform the login

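Why action chains are needed
Some sites run checks in the browser to decide whether the behavior looks human, as an anti-crawler measure. Action chains let us imitate how a person operates the page.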
Working with cookies in selenium

from selenium import webdriver

driver = webdriver.Chrome(executable_path=r"D:\chromedriver\chromedriver.exe")

driver.get("https://www.baidu.com")

# Get all cookies
# cookies = driver.get_cookies()
# for cookie in cookies:
#     print(cookie)

# Get one cookie by key
# cookie = driver.get_cookie("BD_UPN")
# print(cookie)

# Add a cookie
# driver.add_cookie({"name":'username',"value":'123'})
# cookies = driver.get_cookies()
# for cookie in cookies:
#     print(cookie)

# Delete one cookie by key
# driver.delete_cookie("username")

# Delete all cookies
# driver.delete_all_cookies()

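Page waits in selenium
More and more pages use Ajax, so a program cannot know when a given element has finished loading. If the page loads slowly and a DOM element has not appeared yet when your code tries to locate it, selenium raises NoSuchElementException. To address this, Selenium provides two kinds of waits: implicit waits and explicit waits.

Implicit wait: set a timeout once on the driver; every element lookup then keeps retrying for up to that long before failing.
Explicit wait: wait up to a given time for a specific condition. As soon as the condition holds, the wait ends; if the condition is still not met when the time runs out, a TimeoutException is raised. Explicit waits combine the conditions in selenium.webdriver.support.expected_conditions with selenium.webdriver.support.ui.WebDriverWait.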
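Conceptually, an explicit wait is a polling loop: check a condition, return as soon as it holds, and give up once the deadline passes. The plain-Python sketch below illustrates that idea only. It is not Selenium's implementation, and `wait_until` and `WaitTimeout` are hypothetical names.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy (hypothetical name)."""


def wait_until(condition, timeout, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value  # condition met: stop waiting immediately
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %s seconds" % timeout)
        time.sleep(poll_interval)


if __name__ == "__main__":
    start = time.monotonic()
    # The condition becomes true after about 0.3 seconds, well before the timeout.
    print(wait_until(lambda: time.monotonic() - start > 0.3, timeout=2, poll_interval=0.1))
```

In Selenium, WebDriverWait plays the role of this loop and the functions in expected_conditions supply the condition.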

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(executable_path=r"D:\chromedriver\chromedriver.exe")

# 1. Implicit wait
# driver.get("https://www.baidu.com")
# driver.implicitly_wait(10)  # retry element lookups for up to 10 seconds
# driver.find_element_by_class_name("asdasda")  # nonexistent class, so this fails after the wait

# 2. Explicit wait
driver.get("https://kyfw.12306.cn/otn/leftTicket/init?linktypeid=dc")

# Wait up to 100 seconds for the user to make the condition in parentheses true
WebDriverWait(driver,100).until(
    EC.text_to_be_present_in_element_value((By.ID,"fromStationText"),"上海")  # true once the element's value contains this text
)
print('departure set')

WebDriverWait(driver,100).until(
    EC.text_to_be_present_in_element_value((By.ID,"toStationText"),"北京")
)
print('destination set')

btn = driver.find_element_by_id("query_ticket")
btn.click()

Opening a new window and switching pages in selenium

from selenium import webdriver

driver = webdriver.Chrome(executable_path=r"D:\chromedriver\chromedriver.exe")

driver.get("https://www.baidu.com/")

driver.implicitly_wait(5)

driver.execute_script("window.open('https://www.douban.com/')")  # open a new window

# print(driver.page_source)  # at this point the source is still Baidu's

# print(driver.window_handles)  # two handles: the Baidu window and the Douban window
driver.switch_to.window(driver.window_handles[1])  # switch windows
print(driver.page_source)  # now the source is Douban's

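Setting a proxy in selenium
When you crawl some pages frequently, the server may detect the crawler and block your ip address. In that case you can switch to a proxy ip.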

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=http://123.149.141.133:9999")  # add the proxy

# Pass the options object (the older chrome_options keyword is deprecated)
driver = webdriver.Chrome(executable_path=r"D:\chromedriver\chromedriver.exe",options=options)
driver.get('http://www.httpbin.org/ip')

selenium extras

from selenium import webdriver

driver = webdriver.Chrome(executable_path=r"D:\chromedriver\chromedriver.exe")

# driver.get(r'D:\pyPro\爬虫进阶\abc.html')

# div = driver.find_element_by_id("mydiv")

# get_property reads a property of the DOM object
# print(div.get_property("id"))  #mydiv
# print(div.get_property("data-name"))  #None

# get_attribute reads an attribute written on the HTML tag
# print(div.get_attribute("id"))  #mydiv
# print(div.get_attribute("data-name"))  #xyz

driver.get("https://www.baidu.com/")
driver.save_screenshot("baidu.png")  # save a screenshot of the current page
