To study the CVPR 2024 papers, I wanted to save electronic copies locally for offline reading. Clicking through and downloading them one by one is tedious, so I turned to a web crawler to batch-download the papers' PDF files and supplementary material, and this note records the method and some takeaways. Since my theoretical background is limited, there is likely still plenty of room to optimize this approach.
Disclaimer: this post is for learning purposes only. If there are any infringement or reproduction concerns, please let me know; discussion is welcome. The resources are officially open access, so you can also fetch them directly from the official site: CVPR 2024 open access repository
Code
Downloading the files takes a while. If you are in a hurry, feel free to kick off the download first; if you get curious while waiting, come back for the rambling notes in the rest of this post...
import os

import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}

# Get html from website (cached locally so repeated runs skip the request)
if not os.path.exists('./website.html'):
    response_html = requests.get(url='https://openaccess.thecvf.com/CVPR2024?day=all', headers=headers)
    with open('./website.html', 'w', encoding='utf-8') as web_html:
        web_html.write(response_html.text)

# Get the item
driver = webdriver.Chrome()
driver.get(os.path.abspath('./website.html'))  # Chrome turns this into a file:///C:/... URL (Windows)

root_ = os.getcwd()  # fixed download root, so every paper gets a sibling directory
for paper in driver.find_elements(By.CLASS_NAME, 'ptitle'):
    # locate: the second following <dd> holds the [pdf] / [supp] links
    node_ = paper.find_element(By.XPATH, r'following-sibling::dd[2]')
    # The href is resolved against the local file:///C:-style base, so strip the
    # first 10 characters ('file:///C:') and prepend the real domain; note this
    # slice assumes a Windows drive-letter path.
    pdf_url = 'https://openaccess.thecvf.com' + node_.find_element(By.XPATH, './a[1]').get_attribute('href')[10:]
    print(f'locate pdf from link:\n{pdf_url}')
    try:
        if node_.find_element(By.LINK_TEXT, 'supp').get_attribute('href')[-4:] == '.zip':
            zip_url = 'https://openaccess.thecvf.com' + node_.find_element(By.LINK_TEXT, 'supp').get_attribute('href')[10:]
            print(f'locate zip from link:\n{zip_url}')
        else:
            zip_url = None
    except NoSuchElementException:
        zip_url = None  # no 'supp' link at all; still download the PDF
    # download: strip the '.../papers/' prefix (54 chars) and the
    # '_CVPR_2024_paper.pdf' suffix (20 chars) to recover the title slug
    paper_title = pdf_url[54:-20]
    file_root = os.path.join(root_, paper_title)
    file_name = paper_title[:35]  # keep file names reasonably short
    os.makedirs(file_root, exist_ok=True)
    response_pdf = requests.get(url=pdf_url, headers=headers)
    if response_pdf.status_code == 200:
        with open(os.path.join(file_root, file_name + '.pdf'), 'wb') as pdf_file:
            pdf_file.write(response_pdf.content)
    if zip_url is not None:
        response_zip = requests.get(url=zip_url, headers=headers)
        if response_zip.status_code == 200:
            with open(os.path.join(file_root, file_name + '.zip'), 'wb') as zip_file:
                zip_file.write(response_zip.content)

driver.quit()
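The bare slice indices in the script (`href[10:]`, `pdf_url[54:-20]`) are tied to this exact setup (Windows drive-letter paths, this URL layout) and break silently if either changes. As a less brittle alternative, here is a small sketch that derives the same title slug from the URL path with the standard library; the function name is my own, not part of the script above:

```python
import os
from urllib.parse import urlparse

def paper_title_from_url(pdf_url: str) -> str:
    """Derive the paper's title slug from an openaccess PDF URL by taking
    the last path component and dropping the fixed '_CVPR_2024_paper.pdf'
    suffix, instead of relying on hard-coded character offsets."""
    name = os.path.basename(urlparse(pdf_url).path)  # e.g. 'Foo_Bar_CVPR_2024_paper.pdf'
    return name[:-len('_CVPR_2024_paper.pdf')]
```

This works no matter how long the URL prefix is, so the same helper could also parse the supplementary-material links.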