Advanced Web Scraping (Part 2)

Latest recommended article published on 2022-03-09 08:28:50

没蜡笔的小鑫++

Category columns: web scraping, python. Article tags: python, web scraping

Article link: https://blog.csdn.net/weixin_46604741/article/details/118892406

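1. Handling common anti-scraping measures

1.1 Using a proxy

# How a proxy works: the request is sent through a third-party machine
import requests

# Fill in the address of a real proxy, e.g. "http://<host>:<port>"
proxies = {
    "http": "",
    "https": ""   # the target URL below is https, so an "https" entry is needed for the proxy to apply
}

resp = requests.get("https://www.baidu.com", proxies=proxies)
resp.encoding = 'utf-8'
print(resp.text)

That is all there is to it.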
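The keys of the proxies dictionary are URL schemes, which is why a dictionary with only an "http" entry has no effect on an https request. The sketch below is a simplified stand-in for that lookup, not the actual implementation inside requests (which also accepts more specific keys). The function name select_proxy and the address 127.0.0.1:8080 are assumptions made for illustration.

```python
from urllib.parse import urlparse


def select_proxy(url, proxies):
    """Return the proxy configured for the URL's scheme, or None."""
    scheme = urlparse(url).scheme
    # An empty string counts as "no proxy configured".
    return proxies.get(scheme) or None


if __name__ == "__main__":
    # Hypothetical proxy address, for illustration only.
    proxies = {"http": "http://127.0.0.1:8080"}
    print(select_proxy("http://example.com", proxies))   # proxy is used
    print(select_proxy("https://example.com", proxies))  # no https entry, so no proxy
```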

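1.2 Simulating a user login to handle cookies

On a certain novel site, viewing the bookshelf requires logging in. We could paste a Cookie: xxx header into every request by hand, but that is tedious.

import requests

# A session keeps cookies between requests
session = requests.session()
data = {
    "loginName": "xxx",
    "password": "xxx"
}

# Log in
url = "https://passport.17k.com/ck/user/login"
resp = session.post(url, data=data)
# print(resp.json())

# Fetch the bookshelf data
# The session now holds the cookies set by the login response
resp = session.get('https://user.17k.com/ck/author/shelf?page=1&appKey=2406394919')
print(resp.json())

Because we log in through the session, it stores the cookies returned by the login response and sends them automatically with every later request.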
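To see the mechanism without touching a real site, here is a minimal, self-contained sketch that uses only the standard library. A toy local server (an assumption made for this demo, not the novel site's real behaviour) sets a cookie on /login and refuses other pages without it. An opener with a CookieJar plays the role that requests.session() plays above.

```python
import threading
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError
from urllib.request import HTTPCookieProcessor, ProxyHandler, build_opener


class DemoHandler(BaseHTTPRequestHandler):
    """Toy site: /login sets a cookie, every other path requires it."""

    def do_GET(self):
        if self.path == "/login":
            self._reply(200, b"logged in", cookie="token=abc123; Path=/")
        elif "token=abc123" in (self.headers.get("Cookie") or ""):
            self._reply(200, b"shelf data")
        else:
            self._reply(403, b"please log in")

    def _reply(self, status, body, cookie=None):
        self.send_response(status)
        if cookie:
            self.send_header("Set-Cookie", cookie)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging


def fetch(opener, url):
    """Return (status, body), treating HTTP error statuses as normal results."""
    try:
        with opener.open(url, timeout=5) as resp:
            return resp.status, resp.read().decode()
    except HTTPError as err:
        return err.code, err.read().decode()


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_port}"
    no_proxy = ProxyHandler({})  # ignore proxy settings from the environment
    try:
        plain = build_opener(no_proxy)  # forgets cookies
        session = build_opener(no_proxy, HTTPCookieProcessor(CookieJar()))  # remembers them
        return {
            "without_login": fetch(plain, base + "/shelf"),
            "login": fetch(session, base + "/login"),
            "after_login": fetch(session, base + "/shelf"),
        }
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    for step, result in run_demo().items():
        print(step, result)
```

The request made without a cookie store is rejected, while the opener that logged in first gets the protected page, because its cookie jar replays the cookie from the login response.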

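1.3 Handling hotlink protection

Hotlink protection means that when a URL is requested, the server first checks the Referer header, i.e. which URL the request came from. Requests form a chain with a fixed order: the page first, then the resources that the page links to.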
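Seen from the server side, such a check can be as simple as the sketch below. This is an assumed, simplified version for illustration; real sites may apply additional rules. The function name is_allowed and the example.com addresses are hypothetical.

```python
def is_allowed(headers, site_prefix):
    """Accept the request only if its Referer points back to our own site."""
    referer = headers.get("Referer", "")
    return referer.startswith(site_prefix)


if __name__ == "__main__":
    site = "https://video.example.com/"  # hypothetical site
    print(is_allowed({"Referer": "https://video.example.com/video_1"}, site))
    print(is_allowed({}, site))  # a scraper that sends no Referer is rejected
```

This is why the scraper has to send a Referer header that names the page a real visitor would have come from.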

import requests
url = "https://www.pearvideo.com/video_1735274"
cont_id = url.split("_")[1]
video_Status_url = f'https://www.pearvideo.com/videoStatus.jsp?contId={cont_id}&mrd=0.6952007481227842'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.67",
    # Hotlink protection: tells the server which page this request came from
    "Referer": url
}
resp = requests.get(video_Status_url, headers=headers)
dic = resp.json()
src_url = dic['videoInfo']['videos']['srcUrl']
system_time = dic['systemTime']
# The returned srcUrl contains systemTime where the real address has "cont-<id>", so swap it back
src_url = src_url.replace(system_time, f"cont-{cont_id}")
print(src_url)
# Download the video
with open("a.mp4", mode='wb') as f:
    f.write(requests.get(src_url).content)
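The URL rewrite in the script above is easier to follow with concrete values. The sketch below isolates that one step. The sample address and timestamp are made up for illustration and are not real data from the site.

```python
def real_video_url(src_url, system_time, cont_id):
    """Replace the timestamp placeholder in src_url with 'cont-<id>'."""
    return src_url.replace(system_time, f"cont-{cont_id}")


if __name__ == "__main__":
    # Hypothetical values shaped like the fields used in the script above.
    fake_src = "https://video.example.com/mp4/1626000000000-12345-hd.mp4"
    print(real_video_url(fake_src, "1626000000000", "1735274"))
    # prints https://video.example.com/mp4/cont-1735274-12345-hd.mp4
```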

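2. Using multithreading to speed up scraping

2.1 Multithreading in Python

# Every program starts with one main thread by default
# Multithreading
from threading import Thread

def func():
    for i in range(1000):
        print("func", i)


if __name__ == '__main__':
    t = Thread(target=func)   # create a thread and give it a task; note: no parentheses after func
    t.start() # marks the thread as ready to run; when it actually runs is up to the scheduler

    # t = Thread(target=func, args=("argument",))   # arguments must be passed as a tuple; a one-element tuple needs the trailing comma inside the parentheses
    # t.start()

    for i in range(1000):
        print("main", i)

Import Thread from the threading module, hand the task to a Thread object, and call start(). Note that start() does not mean "run right now"; it means the thread is ready to run, and the operating system decides when it actually gets scheduled.

Note: there are no parentheses after func, because we pass the function itself rather than the result of calling it.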
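In a scraper the main thread usually has to wait for the workers before it uses their results. The sketch below shows one way to do that with join(), which the example above does not use. fetch_page is a hypothetical stand-in for real download code.

```python
from threading import Thread


def fetch_page(page, results):
    # Stand-in for real download work; stores one result per page.
    results[page] = f"content of page {page}"


def crawl(pages):
    results = {}
    threads = [Thread(target=fetch_page, args=(page, results)) for page in pages]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # block until this thread has finished
    return results


if __name__ == "__main__":
    print(crawl([1, 2, 3]))
```

Each thread writes to its own key, so no lock is needed here. Shared state that several threads modify at once would need one.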

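2.2 Another way to write a thread (recommended)

from threading import Thread
class MyThread(Thread): # inherit from Thread
    def run(self):        # fixed name: when the thread runs, run() is what gets executed
        for i in range(1000):
            print("child thread", i)

if __name__ == '__main__':
    t = MyThread()
    # t.run()  # this is a plain method call -> still single-threaded
    t.start()  # start the thread
    for i in range(1000):
        print("main", i)

Write a class that inherits from Thread and override its run() method, which holds the code to be executed in the new thread.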
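A subclass also gives a natural place to keep per-thread input and output. The following is a minimal sketch under that assumption. The class name PageThread and its attributes are hypothetical.

```python
from threading import Thread


class PageThread(Thread):
    def __init__(self, page):
        super().__init__()  # always initialise the base Thread first
        self.page = page
        self.result = None

    def run(self):
        # Stand-in for real download work.
        self.result = f"content of page {self.page}"


if __name__ == "__main__":
    t = PageThread(7)
    t.start()
    t.join()  # wait for run() to finish before reading the result
    print(t.result)
```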

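3. Using multiprocessing to speed up scraping

A process can contain multiple threads.

Multithreading is generally the recommended choice here, because creating a process object itself takes time.

3.1 Multiprocessing in Python

from multiprocessing import Process
import time


def func():
    for i in range(1000):
        print("child process", i)
        time.sleep(0.001)


if __name__ == '__main__':
    p = Process(target=func)
    p.start()

    for i in range(1000):
        print('main process', i)
        time.sleep(0.001)

The usage mirrors that of threads.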

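4. Thread pools and process pools

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def fn(name):
    for i in range(1000):
        print(name, i)

if __name__ == '__main__':
    # Create a thread pool
    with ThreadPoolExecutor(50) as t:
        for i in range(100):
            t.submit(fn, name=f"thread{i}")

    # Leaving the with block waits until every task in the pool has finished
    print(123)

The idea of a pool is to create a batch of threads (or processes) once. When a task finishes, its thread is not destroyed but is reused for the next task, which avoids the cost of creating threads over and over.

As before, fn is passed without ().

Arguments can be passed as keyword arguments, e.g. name=xxx.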
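submit() also returns a Future, which is how a task's return value gets back to the caller. The sketch below shows two common ways to collect results. square is a hypothetical stand-in for a download function.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def square(n):
    return n * n


def square_all(numbers, workers=4):
    """Collect results as tasks finish, keyed by their input."""
    with ThreadPoolExecutor(workers) as pool:
        futures = {pool.submit(square, n): n for n in numbers}
        # as_completed yields each future as soon as its task is done
        return {futures[f]: f.result() for f in as_completed(futures)}


def square_in_order(numbers, workers=4):
    """map() returns results in the same order as the inputs."""
    with ThreadPoolExecutor(workers) as pool:
        return list(pool.map(square, numbers))


if __name__ == "__main__":
    print(square_all([1, 2, 3]))
    print(square_in_order([1, 2, 3]))
```

Calling result() also re-raises any exception that happened inside the task, which would otherwise go unnoticed when the Future is ignored.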

5. Quickly scraping an entire site with a thread pool

import requests
from lxml import etree
import csv
import time
from threading import Lock
from concurrent.futures import ThreadPoolExecutor

f = open("data.csv", mode='w', encoding="utf-8", newline='')
csv_writer = csv.writer(f)
write_lock = Lock()   # csv_writer is shared by all worker threads

def download_one_page(url):
    resp = requests.get(url)
    html = etree.HTML(resp.text)
    table = html.xpath("/html/body/div[2]/div[4]/div[1]/table")[0]
    trs = table.xpath("./tr[position()>1]")
    # Process each tr (table row), skipping the header row
    for tr in trs:
        txt = tr.xpath("./td/text()")
        # Light clean-up: strip backslashes and slashes
        txt = [item.replace("\\", "").replace("/", "") for item in txt]
        with write_lock:   # one row at a time, so rows from different threads do not interleave
            csv_writer.writerow(txt)
    print(url, "over")

if __name__ == '__main__':
    # download_one_page('http://www.xinfadi.com.cn/marketanalysis/0/list/1.shtml')
    start = time.perf_counter()
    with ThreadPoolExecutor(50) as t:
        for i in range(1, 200):
            t.submit(download_one_page, f"http://www.xinfadi.com.cn/marketanalysis/0/list/{i}.shtml")

    # Single-threaded version, for comparing the elapsed time
    # for i in range(1, 200):
    #     download_one_page(f"http://www.xinfadi.com.cn/marketanalysis/0/list/{i}.shtml")

    elapsed = (time.perf_counter() - start)
    f.close()   # flush and close the CSV file once all pages are done
    print("Time used:", elapsed)
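An alternative layout, sketched below, is to let the worker threads only return rows and have the main thread do all the writing, so the CSV writer is never shared between threads. This is a design variation, not what the script above does. parse_page is a hypothetical stand-in for the download and XPath steps, and the rows it returns are made up.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


def parse_page(page):
    # Stand-in for download + XPath parsing: returns the rows found on one page.
    return [[f"item{page}-{i}", str(page * 10 + i)] for i in range(3)]


def scrape_to_csv(pages, out):
    writer = csv.writer(out)
    with ThreadPoolExecutor(8) as pool:
        # map() yields results in page order; only this thread touches the writer
        for rows in pool.map(parse_page, pages):
            writer.writerows(rows)


if __name__ == "__main__":
    buf = io.StringIO()  # in-memory file, so the demo leaves nothing on disk
    scrape_to_csv(range(1, 4), buf)
    print(buf.getvalue())
```

A side effect of this layout is that the rows end up in page order, whereas writing from the worker threads produces whatever order the pages happen to finish in.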

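6. Selenium: the last resort

Selenium uses a program to drive a real browser. Python and the Edge browser are used as the example here. Some sites encrypt their data or apply anti-scraping measures, but what a real browser finally displays cannot be encrypted, so driving a real browser gives us the decrypted result directly.

First install the package:

pip install selenium

Then install the browser driver. Windows 10 and Edge are used as the example here.

Open the Edge settings page, click "About Microsoft Edge" and note the browser version, then go to

https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/

and download the driver that matches that version. Put the downloaded executable into Python's installation root directory.

# Environment setup for Chrome
# pip install selenium
# Download the browser driver from https://npm.taobao.org/mirrors/chromedriver
# Let selenium start Chrome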

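6.1 Demo

# Can my program connect to a browser, let the browser do all the complicated work, and just take the final result?
# selenium: a browser automation and testing tool
# It can open a browser and operate it the way a person would
# The programmer can then pull all kinds of information off the page directly through selenium
# Environment setup:
# pip install selenium
# Download the browser driver from https://npm.taobao.org/mirrors/chromedriver
# Let selenium start Chrome
from selenium.webdriver import Edge
# 1. Create the browser object
web = Edge()
# 2. Open a page and read its title
web.get("http://www.baidu.com")
print(web.title)
web.quit()  # close the browser and end the driver process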

6.2 Common Selenium operations (clicking and typing into an input box)

import time
from selenium.webdriver import Edge
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

web = Edge()
web.get("http://lagou.com")

# find_element(By.XPATH, ...) works in Selenium 3 and 4;
# the older find_element_by_xpath helpers were removed in Selenium 4.3
# Find an element and click it
el = web.find_element(By.XPATH, '//*[@id="changeCityBox"]/ul/li[1]/a')
# Click event
el.click()

time.sleep(1)
# Find the input box and type python => then press Enter / click the search button
web.find_element(By.XPATH, '//*[@id="search_input"]').send_keys("python", Keys.ENTER)
# web.find_element(By.XPATH, '//*[@id="search_input"]').send_keys("python", "\n")
li_list = web.find_elements(By.XPATH, '//*[@id="s_position_list"]/ul/li')
for li in li_list:
    job_name = li.find_element(By.TAG_NAME, "h3").text
    job_price = li.find_element(By.XPATH, "./div/div/div[2]/div/span").text
    company_name = li.find_element(By.XPATH, './div/div[2]/div/a').text
    print(job_name, company_name, job_price)

# web.close()

Page elements can be selected with XPath expressions, just as before.

Use click() to click an element and send_keys() to type into an input box.

Likewise, the text attribute returns an element's text.

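6.3 Headless browser

A headless browser runs in the background without opening a visible window, so it does not get in the way of whatever is in the foreground.

Method 1:

from selenium.webdriver import Edge
from selenium.webdriver.common.by import By
from selenium.webdriver.edge.options import Options
from selenium.webdriver.support.select import Select
import time

# Prepare the configuration
path = "MicrosoftWebDriver.exe"
EDGE = {
            "browserName": "MicrosoftEdge",
            "version": "",
            "platform": "WINDOWS",
            # This is the important part
            "ms:edgeOptions": {
                'extensions': [],
                'args': [
                    '--headless',
                    '--disable-gpu',
                    '--remote-debugging-port=9222',
                ]}
        }
# Apply the configuration to the browser
# (Selenium 3 style: executable_path and capabilities were deprecated in Selenium 4)
web = Edge(executable_path=path, capabilities=EDGE)
# web = Edge()
web.get("https://www.endata.com.cn/BoxOffice/BO/Year/index.html")
time.sleep(2)
# # Locate the drop-down list
# sel_el = web.find_element(By.XPATH, '//*[@id="OptionDate"]')
# # Wrap the element as a drop-down menu
# sel = Select(sel_el)
# # Have the browser switch through the options
# for i in range(len(sel.options)): # i is the index of each drop-down option
#     sel.select_by_index(i) # switch by index
#     time.sleep(2)
#     table = web.find_element(By.XPATH, '//*[@id="TableList"]/table')
#     print(table.text)
#     print("==============================")


# How to get the page source (the result after data loading and after JS and CSS have run)
print(web.page_source)

The commented-out part shows how to handle a drop-down menu.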

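Method 2:

# pip install msedge-selenium-tools (a helper package for Selenium 3; Selenium 4 has Chromium Edge support built in)
from msedge.selenium_tools import EdgeOptions
from msedge.selenium_tools import Edge

edge_options = EdgeOptions()
edge_options.use_chromium = True
# Enable headless mode; other settings can be added the same way
edge_options.add_argument('headless')
driver = Edge(options=edge_options)
driver.get('https://www.baidu.com')
print(driver.title)
driver.quit()

Note that the two methods import Edge from different packages.

The second method drives the Chromium-based Edge, so it can be used in the same way as Chrome.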

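6.4 Getting past browser automation checks

When Selenium is used, the browser shows a notice that it is being controlled by automated software. Some sites, such as 12306, check for this to decide whether a real person is operating the browser.

The check is based on window.navigator.webdriver, which you can evaluate in the browser console: it returns true when the browser is controlled by a test tool, and false in a normal browser.

# What if your program gets detected?
# 1. Chrome versions below 88: right after starting the browser (before any page content is loaded), inject JS into the page to remove the webdriver flag
# web = Chrome()
# web.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
#   "source": """
#       window.navigator.webdriver = undefined
#       Object.defineProperty(navigator, 'webdriver', {
#           get: () => undefined
#       })
#   """
# })
# web.get(xxxxXxX)

# 2. Chrome version 88 or later
from msedge.selenium_tools import Edge, EdgeOptions

edge_options = EdgeOptions()
edge_options.use_chromium = True
edge_options.add_argument('--disable-blink-features=AutomationControlled')

web = Edge(options=edge_options)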

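6.5 Switching between browser windows

The object created by web = Edge() stays focused on one window and does not switch automatically when a new window opens. Likewise, it does not automatically enter a page embedded through an iframe tag.

from selenium.webdriver import Edge
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
web = Edge()

# web.get("http://lagou.com")
#
# web.find_element(By.XPATH, '//*[@id="cboxClose"]').click()
#
# time.sleep(1)
# web.find_element(By.XPATH, '//*[@id="search_input"]').send_keys("python", Keys.ENTER)
# time.sleep(3)
#
# el = web.find_element(By.XPATH, '//*[@id="s_position_list"]/ul/li[1]/div[1]/div[1]/div[1]/a/span')
# web.execute_script("arguments[0].click();", el)
# # How to move into the new window and extract data from it
# # Note: from selenium's point of view, a new window is not switched to by default
# web.switch_to.window(web.window_handles[-1])
# # Extract content in the new window
# job_detail = web.find_element(By.XPATH, '//*[@id="job_detail"]/dd[2]/div').text
# print(job_detail)
# web.close()
# # After closing it, selenium's focus still has to be switched back to the original window
# web.switch_to.window(web.window_handles[0])
# print(web.find_element(By.XPATH, '//*[@id="s_position_list"]/ul/li[1]/div[1]/div[1]/div[1]/a/h3').text)

# How to handle an iframe in the page
web.get("https://www.91kanju.com/vod-play/541-2-1.html")

# To handle an iframe: first locate it, then switch focus into it, and only then read data
iframe = web.find_element(By.XPATH, '//*[@id="player_iframe"]')
web.switch_to.frame(iframe)
# ... read data from inside the iframe here ...
web.switch_to.default_content()     # switch back to the original page

Use switch_to.window() to switch between windows, switch_to.frame() to enter an iframe, and switch_to.default_content() to return to the main page.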

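6.6 Selenium summary

As shown above, Selenium really can hand us the decrypted content directly, but it is quite slow: every operation has to wait until the page has loaded. It is therefore best to pause between operations, for example with time.sleep() for a few seconds, before continuing, and overall efficiency is low. It is only worth using when a site encrypts its data.
To deal with CAPTCHAs during login, you can look into services such as Chaojiying (超级鹰) on your own; they are not covered here.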
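A fixed time.sleep() is either longer than necessary or too short on a slow page. One common alternative is to poll for a condition until it holds. Selenium ships WebDriverWait in selenium.webdriver.support.ui for this purpose; the stand-alone helper below only sketches the idea and is not Selenium's implementation. The name wait_until and its default values are assumptions.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.2):
    """Poll condition() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


if __name__ == "__main__":
    start = time.monotonic()
    # Stand-in for "the element has appeared": becomes true after half a second.
    wait_until(lambda: time.monotonic() - start > 0.5, timeout=5)
    print("ready after", round(time.monotonic() - start, 1), "seconds")
```

With a browser, the condition would be a function that tries to find the element and returns it once it exists, so the script continues as soon as the page is ready.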