I've been reading NeurIPS papers recently, but downloading them one at a time is tedious, so I wanted a quick way to fetch them in bulk.
That made me think of Python web scraping, which I had long heard about but never tried, so this was my first attempt.
The script uses the requests, BeautifulSoup (bs4), and urllib.request packages.
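In a nutshell: requests fetches a page's HTML, BeautifulSoup turns it into a searchable tree, and urllib.request.urlretrieve saves a remote file to disk. A minimal sketch of that pipeline (the URL is the NeurIPS 2020 index used below; the selector and output path are only illustrative):

import requests
from bs4 import BeautifulSoup
from urllib.request import urlretrieve

html = requests.get('https://proceedings.neurips.cc/paper/2020').text  # 1. fetch the HTML
soup = BeautifulSoup(html, 'lxml')                                     # 2. parse it into a tree
links = [a.get('href') for a in soup.select('a')]                      # 3. query it with a CSS selector
# urlretrieve(some_pdf_url, 'out.pdf')                                 # 4. save a remote file locally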
Here is the final working program:
import os
import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlretrieve

BASE_URL = 'https://proceedings.neurips.cc/'

# Open one paper's detail page and follow its PDF link
def openAndDownload(url, title):
    str_subhtml = requests.get(url)
    soup1 = BeautifulSoup(str_subhtml.text, 'lxml')
    # On the detail page, the 4th <a> is the "Paper" (PDF) link
    subdata = soup1.select('body > div.container-fluid > div > div > a:nth-child(4)')
    downloadUrl = BASE_URL + subdata[0].get('href')
    print(downloadUrl)
    getFile(downloadUrl, title)

# Download a single file into ./essay/
def getFile(url, title):
    title = replaceIllegalStr(title)
    filename = title + '.pdf'
    urlretrieve(url, './essay/%s' % filename)
    print('Successfully downloaded ' + title)

# Strip characters that are illegal in file names
def replaceIllegalStr(name):
    for ch in (':', '?', '/', '\\'):
        name = name.replace(ch, '')
    return name

def main():
    os.makedirs('./essay', exist_ok=True)  # urlretrieve fails if the target directory is missing
    url = 'https://proceedings.neurips.cc/paper/2020'
    strhtml = requests.get(url)
    soup = BeautifulSoup(strhtml.text, 'lxml')
    # Each paper on the index page is an <a> inside the <ul> of titles
    data = soup.select('body > div.container-fluid > div > ul > li > a')
    paper_list = []  # renamed from `list` to avoid shadowing the builtin
    for item in data:
        paper_list.append([item.get_text(), item.get('href')])
    name = ['title', 'link']
    test = pd.DataFrame(columns=name, data=paper_list)
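    # The loop below is an assumed continuation, not the author's original code:
    # presumably main() finishes by walking the collected rows and downloading
    # each paper. `link` is assumed to be a site-relative href, hence BASE_URL.
    for title, link in zip(test['title'], test['link']):
        openAndDownload(BASE_URL + link, title)

if __name__ == '__main__':
    main()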