Batch-downloading Sci-Hub papers with Python

Latest recommended article published on 2024-07-18 15:53:23


Article link: https://blog.csdn.net/qq_45179361/article/details/138973666


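1. Overview

This post uses Python and Selenium to look up a list of paper titles on Sci-Hub and collect their download links in one batch.

2. Software environment

2.1 VS Code

2.2 Anaconda

version: conda 22.9.0

2.3 Code

Link:

3. Main workflow

3.1 Download the driver and packages

I am not sure a separate driver download is needed: I never set a driver path, and Edge also ran without my downloading any driver. This is probably because Selenium 4.6 and later ship Selenium Manager, which locates or downloads a matching driver automatically.
If Chrome does not start, download the browser driver (https://registry.npmmirror.com/binary.html?path=chromedriver/) in the version that matches your browser. The newest versions are not listed there, so search Google or CSDN for them. In short: install the packages first and fetch a driver only if that stops working; in my tests no driver was needed.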
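If a driver does turn out to be necessary, the sketch below shows one way to check for it and pass its path explicitly. The helper names `find_drivers` and `make_chrome` are my own, not part of the original code, and I am assuming Selenium 4, where the driver path goes through a `Service` object. When no path is given, Selenium resolves the driver on its own.

```python
import shutil


def find_drivers(names=("chromedriver", "msedgedriver")):
    """Return {driver name: full path or None} for executables found on PATH."""
    return {name: shutil.which(name) for name in names}


def make_chrome(driver_path=None, options=None):
    """Create a Chrome WebDriver, optionally with an explicit driver path.

    Selenium is imported lazily so this module loads even where
    Selenium is not installed.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    if driver_path:
        # Assumption: driver_path points at a chromedriver matching the browser.
        service = Service(executable_path=driver_path)
        return webdriver.Chrome(service=service, options=options)
    # No path given: let Selenium locate the driver itself.
    return webdriver.Chrome(options=options)
```

Calling `find_drivers()` before starting the browser shows whether a driver is already installed, which helps explain why a script runs on one machine and not on another.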

The individual steps follow.
Install the packages:

! pip install selenium
! pip install pyautogui

Installation succeeded.

3.2 Open a page in Chrome (test)

No driver path is set here and it still runs; I only specify the URL and the path to the browser executable.

# Import the required modules
from selenium.webdriver.common.by import By
from selenium import webdriver
# Random numbers
import random
# Delays
import time
import pyautogui

url = 'https://www.sci-hub.se/'
options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-automation'])
options.add_experimental_option('useAutomationExtension', False)
# Location of the Chrome executable
options.binary_location = r"C:\Program Files\Google\Chrome\Application\chrome.exe"
# Uncomment to run without a visible browser window
# options.add_argument("--headless")  # enable headless mode
# GPU acceleration sometimes causes bugs
options.add_argument("--disable-gpu")  # disable GPU acceleration
options.add_argument("--disable-blink-features=AutomationControlled")
driver = webdriver.Chrome(options=options)
# Hide navigator.webdriver from scripts on every new page
driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument',
                       {'source': 'Object.defineProperty(navigator, "webdriver", {get: () => undefined})'})
# Navigate to the URL
driver.get(url)
# Element lookups retry for up to 1 second; adjust as needed
driver.implicitly_wait(1)

# Get the full page source
full_page_content = driver.page_source
# Keep the window open for 20 seconds to inspect the page
time.sleep(20)
# Close the browser
driver.quit()

3.3 Page operations

# Import the required modules
from selenium.webdriver.common.by import By
from selenium import webdriver
import time
import re

def dooooLoab(nameee, model):
    """Search Sci-Hub for the title nameee; return the download URL, or 0 if none is found."""
    # Build the options object for the chosen browser
    if model == "edge":
        options = webdriver.EdgeOptions()
    elif model == "chrome":
        options = webdriver.ChromeOptions()
    else:
        raise ValueError(f"unsupported browser: {model}")
    options.add_experimental_option('excludeSwitches', ['enable-automation'])
    options.add_experimental_option('useAutomationExtension', False)
    options.add_argument("--disable-gpu")  # disable GPU acceleration
    options.add_argument("--disable-blink-features=AutomationControlled")

    # Set the browser executable location and create the WebDriver
    if model == "edge":
        options.binary_location = r"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe"
        driver = webdriver.Edge(options=options)
    else:
        options.binary_location = r"C:\Program Files\Google\Chrome\Application\chrome.exe"
        driver = webdriver.Chrome(options=options)

    # Navigate to the site
    url = 'https://www.sci-hub.se/'
    driver.get(url)

    # Element lookups retry for up to 30 seconds; then give the page time to settle
    driver.implicitly_wait(30)
    time.sleep(20)

    # Find the search box and type in the paper title
    while True:
        if len(driver.find_elements(By.ID, 'request')) == 1:
            print("Found the search box")
            textarea_element = driver.find_element(By.ID, 'request')
            print("Entering the paper title")
            textarea_element.send_keys(nameee)
            print("Title entered")
            break

    print("Looking for the search button")
    buttons = driver.find_elements(By.TAG_NAME, 'button')
    if buttons:
        buttons[0].click()  # click the first button found
        print("Search started")
    else:
        print("Button not found")

    uuurl = 0  # returned unchanged when no link is found
    while True:
        if len(driver.find_elements(By.ID, 'buttons')) == 1:
            print("On the download page")
            button_par = driver.find_element(By.ID, 'buttons')
            button = button_par.find_elements(By.TAG_NAME, 'button')

            # get_attribute returns None when the attribute is missing
            onclick_value = button[0].get_attribute('onclick') or ""
            # Extract the link from the onclick handler with a regular expression
            link_pattern = re.compile(r"location\.href='([^']+)'")
            matches = link_pattern.findall(onclick_value)

            if matches:
                download_link = matches[0]
                uuurl = f"https://sci-hub.se/{download_link}"
                print(f"Download link: {uuurl}")
            else:
                print("Link not found")
            break
        # An element with id 'smile' is treated as a failed lookup
        if len(driver.find_elements(By.ID, 'smile')) == 1:
            break
    # Close the browser
    time.sleep(10)
    driver.quit()
    return uuurl
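The two `while True` loops in the function above keep polling forever if the expected element never appears, for example when the page layout changes. A minimal sketch of a bounded alternative follows; `wait_until` is a hypothetical helper of mine, not something from Selenium or the original code, and the timeout values are arbitrary.

```python
import time


def wait_until(predicate, timeout=30.0, interval=0.5):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or None once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

Inside the function it could be used as `boxes = wait_until(lambda: driver.find_elements(By.ID, 'request'))`, treating a `None` result as a failed lookup instead of hanging. Selenium's own `WebDriverWait` covers the same need if you prefer the built-in tool.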

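3.4 Looping over the titles

The run is fairly slow because I added waits: requests that come too quickly get blocked, and the site sometimes stalls for a long time, hence the delays.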
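The waits in section 3.3 are fixed `time.sleep` calls. One possible variation, sketched here under my own assumptions, is a pause with some random jitter between lookups so that requests are not evenly spaced. The name `polite_pause` and the default durations are hypothetical; I have not measured what the site tolerates.

```python
import random
import time


def polite_pause(base=10.0, jitter=5.0, sleep=time.sleep):
    """Sleep for `base` seconds plus a random extra of up to `jitter` seconds.

    Returns the delay actually used. `sleep` can be swapped out for testing.
    """
    delay = base + random.uniform(0, jitter)
    sleep(delay)
    return delay
```

Calling `polite_pause()` at the end of each loop iteration would spread the lookups out without changing anything else in the script.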

# dooooLoab is the function defined in section 3.3

names = [
    "Numerical investigation on heat transfer characteristics of Taylor-Couette flows operating with CO2",
    "Taylor-Couette-Poiseuille flow and heat transfer in an annular channel with a slotted rotor",
    "low and heat transfer in a micro-cylindrical gas-liquid Couette flow",
    "Numerical study of heat transfer in a Taylor-Couette system with forced radial throughflow",
    "A solution for the finite journal bearing and its application to analysis and design: III"
]
resUlr = []
for index, name in enumerate(names, start=1):
    print(f"Looking up paper {index}")
    # Pass "chrome" or "edge"
    urlll = dooooLoab(name, "chrome")
    if urlll == 0:
        print(f"Paper {index}: lookup failed")
    else:
        print(f"Paper {index}: lookup succeeded")
    resUlr.append(urlll)

for name, link in zip(names, resUlr):
    print(f"{name} download link: {link}")

Scraping succeeded:

Numerical investigation on heat transfer characteristics of Taylor-Couette flows operating with CO2 download link: https://sci-hub.se//downloads/2019-11-18/3c/qin2019.pdf?download=true
Taylor-Couette-Poiseuille flow and heat transfer in an annular channel with a slotted rotor download link: https://sci-hub.se///zero.sci-hub.se/6085/85b743f49d1e6dff7e6959cf3a76a2e3/lancial2017.pdf?download=true
low and heat transfer in a micro-cylindrical gas-liquid Couette flow download link: 0
Numerical study of heat transfer in a Taylor-Couette system with forced radial throughflow download link: https://sci-hub.se//downloads/2019-10-26/a1/10.1016@j.ijthermalsci.2019.106142.pdf?download=true
A solution for the finite journal bearing and its application to analysis and design: III download link: https://sci-hub.se///zero.sci-hub.se/838/c53f1d3b2d78f9947f62e1464f9f2c45/a-solution-for-the-finite-journal-bearing-and-its-application-to-1958.pdf?download=true
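The links above contain doubled or tripled slashes. Judging from the output, the value captured from `onclick` is either an absolute path (`/downloads/...`) or a protocol-relative URL (`//zero.sci-hub.se/...`), and the code prepends `https://sci-hub.se/` to it as plain text. A small sketch of one way to resolve both forms with the standard library's `urljoin` follows; `extract_link` and `BASE_URL` are hypothetical names of mine, and the regular expression is the one from section 3.3.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://sci-hub.se/"
LINK_PATTERN = re.compile(r"location\.href='([^']+)'")


def extract_link(onclick, base=BASE_URL):
    """Pull the href out of an onclick string and resolve it against `base`.

    Returns None when the string contains no location.href assignment.
    """
    match = LINK_PATTERN.search(onclick or "")
    if match is None:
        return None
    # urljoin handles both "/path" and "//host/path" references.
    return urljoin(base, match.group(1))
```

With this, an `onclick` of `location.href='//zero.sci-hub.se/6085/85b743f49d1e6dff7e6959cf3a76a2e3/lancial2017.pdf?download=true'` resolves to a URL on `zero.sci-hub.se` rather than a path under `sci-hub.se`.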