[Python Automation] Automatically scraping paper metadata from web pages and downloading PDFs (account credentials required), updated 2024-09-05

Black_Wordswoth

Last modified 2024-09-05 11:51:34

Tags: python, automation, pdf

First published 2024-09-05 11:12:00

Article link: https://blog.csdn.net/Black_Fury/article/details/141925667

This post collects the scripts I use to automatically scrape paper metadata from several journals I read regularly.

1. [AIP]

AIP_Publishing

# Fetch a paper's metadata from its URL and save it to Excel (this script targets AIP Publishing)
import re

import httpx
import requests
import pandas as pd
import openpyxl
import time
from selenium import webdriver
from selenium.webdriver.common.by import By


def open_web(url):
    headers = {
        # cookie: session information, sent whether or not you are logged in
        'cookie': 'your cookie',
        # Referer: checked by the site's anti-hotlinking protection
        'Referer': 'your Referer',
        # user-agent: browser, version, and operating system
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36 Edg/119.0.0.0'
    }
    # verify=False turns off TLS certificate checks; drop it unless your network requires it
    response = requests.get(url, headers=headers, timeout=10, verify=False)
    # print(response.text)
    return response, headers


# Strip characters that are not allowed in file names
def replace_x(words):
    for ch in '/\\:*?"<>|〈〉':
        words = words.replace(ch, '')
    return words


# Append one record to the Excel sheet
def save_list(file_path_name, publication_time, Journal_name, url, title):
    # Open an existing workbook
    workbook = openpyxl.load_workbook(file_path_name)
    # Select the active worksheet
    sheet = workbook.active
    # Collect the fields in a dictionary
    paper_info = {'time': publication_time, 'Journal': Journal_name, 'url': url, 'title': title}
    # Append the values as the next empty row
    sheet.append(list(paper_info.values()))

    # Save the workbook
    workbook.save(file_path_name)
    # print(paper_info.values())


# Download the PDF
def Down_PDF(pdf_url, title, time, file_path, headers):
    print(pdf_url)
    PDF_title = '（' + time + '）' + title + '.pdf'
    PDF_content = requests.get(url=pdf_url, headers=headers).content
    with open(file_path + PDF_title, mode='wb') as f:
        f.write(PDF_content)


# Main script
file_path = 'D:\\paper\\'
while True:
    print('Enter the paper URL (enter 0 to stop)')
    url = input()
    if url == '0':
        break
    (response, headers) = open_web(url)

    # Paper title
    title = re.findall('<title>(.*?)</title>', response.text)[0]
    title = replace_x(title)
    title = title.replace('  Physics of Fluids  AIP Publishing', '')
    print('Title:', title)

    # Publication year (first four characters of the date)
    publication_time = re.findall('<meta name="citation_publication_date" content="(.*?)" />', response.text)[0]
    publication_time = publication_time[0:4]
    print('Year:', publication_time)

    # Journal name
    Journal_name = 'Physics of Fluids'
    print('Journal:', Journal_name)

    print('（' + publication_time + '）' + title)
    # Ask whether to save this record
    print('Save? (1/0)')
    is_save = input()
    if is_save.strip() == '1':
        file_path_n
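
The title and date lookups in the script rely on regular expressions that match one exact attribute order and spacing, so a small change in the page markup makes `re.findall(...)[0]` raise an `IndexError`. As a hedged alternative, the sketch below reads the same `<meta>` tags with Python's standard-library `html.parser`. The names `CitationMetaParser` and `extract_citation_meta`, the sample HTML, and the `citation_journal_title` tag are assumptions made for illustration and are not part of the original script. Whether a given page exposes that tag would need checking.

```python
from html.parser import HTMLParser


class CitationMetaParser(HTMLParser):
    """Collect <meta name="citation_*" content="..."> pairs (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />
        if tag != 'meta':
            return
        attrs = dict(attrs)
        name = attrs.get('name') or ''
        if name.startswith('citation_') and attrs.get('content') is not None:
            # Keep the first value seen for each name
            self.meta.setdefault(name, attrs['content'])


def extract_citation_meta(html):
    """Return a dict of citation_* meta tags found in an HTML string."""
    parser = CitationMetaParser()
    parser.feed(html)
    return parser.meta


# Made-up sample page, used only to illustrate the parsing
sample_html = '''
<html><head>
<title>Some paper | Physics of Fluids | AIP Publishing</title>
<meta content="2024/09/01" name="citation_publication_date">
<meta name="citation_journal_title" content="Physics of Fluids" />
</head><body></body></html>
'''

meta = extract_citation_meta(sample_html)
year = meta['citation_publication_date'][:4]
journal = meta['citation_journal_title']
print(year, journal)
```

In the script, `extract_citation_meta(response.text)` could stand in for the date regex. Using `meta.get(...)` instead of indexing lets the loop skip a page that lacks the tag rather than crash.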
