1. Overview
Batch-downloading papers from Sci-Hub with Python.
2. Software environment
2.1 VS Code
2.2 Anaconda
version: conda 22.9.0
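2.3 Code
Link:
3. Main workflow
3.1 Download the driver and packages
I am not sure a separate driver download is needed: I never specified a driver path, and I never downloaded a driver for Edge either, yet both browsers ran. This is likely because Selenium 4.6 and later include Selenium Manager, which locates or downloads a matching driver automatically.
To drive Google Chrome the traditional way, download the browser driver that matches your Chrome version (https://registry.npmmirror.com/binary.html?path=chromedriver/). The newest versions are not listed there, so search for them elsewhere (Google or CSDN). In practice, install the Python packages first and download a driver only if the browser fails to start; in my tests no driver was needed.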
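One way to see which case applies on a given machine is to check whether a driver executable is already on the PATH. The sketch below uses only the standard library; the helper name `find_driver` and the candidate executable names are illustrative assumptions, not part of the original code. If nothing is found and the browser still starts, the driver is presumably being resolved automatically by Selenium.

```python
import shutil

def find_driver(candidates=("chromedriver", "msedgedriver")):
    """Return {name: path or None} for each candidate driver executable."""
    # shutil.which searches the directories listed in PATH
    return {name: shutil.which(name) for name in candidates}

if __name__ == "__main__":
    for name, path in find_driver().items():
        print(f"{name}: {path or 'not found on PATH'}")
```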
The steps are as follows.
Install the packages
! pip install selenium
! pip install pyautogui
Installation succeeded.
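To confirm the installation from Python rather than from the pip output, the installed versions can be queried with the standard library. This is a minimal sketch; `installed_version` is a hypothetical helper name, and the package names are the two installed above.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

if __name__ == "__main__":
    for pkg in ("selenium", "pyautogui"):
        print(pkg, installed_version(pkg) or "not installed")
```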
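3.2 Open a page in Google Chrome (test only)
No driver path is set here, and it still seems to run: I only specified the URL and the path to the browser executable.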
# Import the required modules
from selenium.webdriver.common.by import By
from selenium import webdriver
# Random number generation
import random
# Delays
import time
import pyautogui

url = 'https://www.sci-hub.se/'
options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-automation'])
options.add_experimental_option('useAutomationExtension', False)
# Location of the Chrome executable
options.binary_location = r"C:\Program Files\Google\Chrome\Application\chrome.exe"
# Whether to show a browser window
# options.add_argument("--headless")  # enable headless mode
# GPU acceleration sometimes causes bugs
options.add_argument("--disable-gpu")  # disable GPU acceleration
options.add_argument("--disable-blink-features=AutomationControlled")
# Creating the driver launches the browser
driver = webdriver.Chrome(options=options)
# Make navigator.webdriver report undefined on every new document
driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument',
    {'source': 'Object.defineProperty(navigator, "webdriver", {get: () => undefined})'})
# Navigate to the target URL
driver.get(url)
# Implicit wait for element lookups; adjust to your situation
driver.implicitly_wait(1)
# Get the full page source
full_page_content = driver.page_source
time.sleep(20)
# Close the browser
driver.quit()
3.3 Page interaction
# Import the required modules
from selenium.webdriver.common.by import By
from selenium import webdriver
import time
import re

def dooooLoab(nameee, model):
    """Search Sci-Hub for the title `nameee`; return the download link, or 0 if none is found."""
    # Create the options object for the chosen browser
    if model == "edge":
        options = webdriver.EdgeOptions()
    elif model == "chrome":
        options = webdriver.ChromeOptions()
    else:
        raise ValueError(f"unsupported browser: {model}")
    options.add_experimental_option('excludeSwitches', ['enable-automation'])
    options.add_experimental_option('useAutomationExtension', False)
    options.add_argument("--disable-gpu")  # disable GPU acceleration
    options.add_argument("--disable-blink-features=AutomationControlled")
    # Set the browser executable location and create the WebDriver
    if model == "edge":
        options.binary_location = r"C:\Program Files (x86)\Microsoft\Edge\Application\msedge.exe"
        driver = webdriver.Edge(options=options)
    else:
        options.binary_location = r"C:\Program Files\Google\Chrome\Application\chrome.exe"
        driver = webdriver.Chrome(options=options)
    # Navigate to the site
    url = 'https://www.sci-hub.se/'
    driver.get(url)
    # Wait for the page to load
    driver.implicitly_wait(30)
    time.sleep(20)
    # Get the full page source
    full_page_content = driver.page_source
    # Find the input box and fill in the paper title
    while True:
        if len(driver.find_elements(By.ID, 'request')) == 1:
            print("Locating the input box")
            textarea_element = driver.find_element(By.ID, 'request')
            print("Entering the paper title")
            textarea_element.send_keys(nameee)
            print("Input complete")
            break
    print("Locating the button")
    buttons = driver.find_elements(By.TAG_NAME, 'button')
    if buttons:
        buttons[0].click()  # click the first button found
        print("Search started")
    else:
        print("Button not found")
    uuurl = 0  # default: no link found (avoids returning an unset variable)
    while True:
        if len(driver.find_elements(By.ID, 'buttons')) == 1:
            print("Reached the download page")
            button_par = driver.find_element(By.ID, 'buttons')
            button = button_par.find_elements(By.TAG_NAME, 'button')
            onclick_value = button[0].get_attribute('onclick')
            print("Download link")
            # Extract the link with a regular expression
            link_pattern = re.compile(r"location\.href='([^']+)'")
            matches = link_pattern.findall(onclick_value or "")
            if matches:
                download_link = matches[0]
                # download_link may itself start with "/" or "//"
                uuurl = f"https://sci-hub.se/{download_link}"
                print(uuurl)
            else:
                print("Link not found")
            break
        # The 'smile' element is treated as the sign that no result was found
        if len(driver.find_elements(By.ID, 'smile')) == 1:
            uuurl = 0
            break
    # Close the browser
    time.sleep(10)
    driver.quit()
    return uuurl
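The two `while True` loops above poll the page with no upper bound, so a page that never shows the expected element would hang the run. A minimal sketch of bounded polling, using only the standard library, might look like this; `wait_until`, its parameters, and the stand-in predicate are hypothetical names for illustration. Selenium also provides `WebDriverWait` for the same purpose.

```python
import time

def wait_until(predicate, timeout=60.0, interval=0.5):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or None if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)

if __name__ == "__main__":
    # Stand-in for a check such as driver.find_elements(By.ID, 'request')
    answers = iter([[], [], ["element"]])
    print(wait_until(lambda: next(answers), timeout=5, interval=0.1))
```

In the function above, the predicate could be `lambda: driver.find_elements(By.ID, 'request')`, with a `None` result treated as a failed lookup instead of looping forever.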
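3.4 Loop over the titles
The run is fairly slow because I added wait times on purpose: requests that come too quickly get blocked, and the site sometimes hangs for a long time, so the delays are needed.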
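The code in this post uses fixed delays. A common variation, sketched here as an assumption and not as the method used in this post, adds a random component so that requests are not evenly spaced. The name `polite_sleep` and its default values are hypothetical; the `sleep` parameter exists only so the function can be tried without actually waiting.

```python
import random
import time

def polite_sleep(base=10.0, jitter=5.0, sleep=time.sleep):
    """Pause for base plus a random extra in [0, jitter] seconds; return the delay used."""
    delay = base + random.uniform(0.0, jitter)
    sleep(delay)
    return delay

if __name__ == "__main__":
    # Short values so the demonstration finishes quickly
    used = polite_sleep(base=0.2, jitter=0.3)
    print(f"slept for {used:.2f} s")
```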
# Uses dooooLoab() defined in section 3.3
names = [
    "Numerical investigation on heat transfer characteristics of Taylor-Couette flows operating with CO2",
    "Taylor-Couette-Poiseuille flow and heat transfer in an annular channel with a slotted rotor",
    "low and heat transfer in a micro-cylindrical gas-liquid Couette flow",
    "Numerical study of heat transfer in a Taylor-Couette system with forced radial throughflow",
    "A solution for the finite journal bearing and its application to analysis and design: III"
]
resUlr = []
for index, name in enumerate(names, start=1):
    print(f"Fetching paper {index}")
    # model can be "chrome" or "edge"
    urlll = dooooLoab(name, "chrome")
    if urlll == 0:
        print(f"Paper {index}: no download link found")
    else:
        print(f"Paper {index}: download link found")
    resUlr.append(urlll)
for name, link in zip(names, resUlr):
    print(f"{name} download link: {link}")
Scraping succeeded
Numerical investigation on heat transfer characteristics of Taylor-Couette flows operating with CO2 download link: https://sci-hub.se//downloads/2019-11-18/3c/qin2019.pdf?download=true
Taylor-Couette-Poiseuille flow and heat transfer in an annular channel with a slotted rotor download link: https://sci-hub.se///zero.sci-hub.se/6085/85b743f49d1e6dff7e6959cf3a76a2e3/lancial2017.pdf?download=true
low and heat transfer in a micro-cylindrical gas-liquid Couette flow download link: 0
Numerical study of heat transfer in a Taylor-Couette system with forced radial throughflow download link: https://sci-hub.se//downloads/2019-10-26/a1/10.1016@j.ijthermalsci.2019.106142.pdf?download=true
A solution for the finite journal bearing and its application to analysis and design: III download link: https://sci-hub.se///zero.sci-hub.se/838/c53f1d3b2d78f9947f62e1464f9f2c45/a-solution-for-the-finite-journal-bearing-and-its-application-to-1958.pdf?download=true
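The links above contain doubled slashes (`https://sci-hub.se//downloads/...` and `https://sci-hub.se///zero.sci-hub.se/...`) because the extracted `href` is appended to the base URL as plain text, while the page appears to use both root-relative (`/downloads/...`) and protocol-relative (`//zero.sci-hub.se/...`) links. A sketch of resolving them with `urllib.parse.urljoin` follows. The function name `resolve_download_url` is hypothetical, and the sample `onclick` strings are reconstructed from the output above as an assumption, not captured from the site.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://sci-hub.se/"
LINK_PATTERN = re.compile(r"location\.href='([^']+)'")

def resolve_download_url(onclick_value, base=BASE_URL):
    """Extract the href from an onclick attribute and resolve it against base.

    Returns the absolute URL, or None if no href is present.
    """
    match = LINK_PATTERN.search(onclick_value or "")
    if match is None:
        return None
    # urljoin handles both "/path" and "//host/path" references
    return urljoin(base, match.group(1))

if __name__ == "__main__":
    samples = [
        "location.href='/downloads/2019-11-18/3c/qin2019.pdf?download=true'",
        "location.href='//zero.sci-hub.se/6085/85b743f49d1e6dff7e6959cf3a76a2e3/lancial2017.pdf?download=true'",
    ]
    for sample in samples:
        print(resolve_download_url(sample))
```

Returning `None` instead of `0` for a missing link is a design choice of this sketch; the loop in this section would need its `== 0` check adjusted to match.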