To study the CVPR 2024 papers, I wanted to save electronic copies locally for offline reading. Clicking through and downloading them one by one is tedious, so I turned to a web crawler to batch-download the papers' PDF files and supplementary material, and this note records the method and some takeaways. Since my theoretical background is limited, there is likely still plenty of room to optimize this approach.
Disclaimer: this post is for learning purposes only. If there are any infringement or reproduction concerns, please let me know; discussion is welcome. The resources are officially open access, so you can also fetch them directly from the official site: CVPR 2024 open access repository
Code
Downloading the files takes a while. If you are in a hurry, feel free to kick off the download first; if you get curious while waiting, come back for the rambling notes in the rest of this post...
import os

import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}

# Get html from website (cached locally so repeated runs skip the request)
if not os.path.exists('./website.html'):
    response_html = requests.get(url='https://openaccess.thecvf.com/CVPR2024?day=all', headers=headers)
    with open('./website.html', 'w', encoding='utf-8') as web_html:
        web_html.write(response_html.text)

# Get the item
driver = webdriver.Chrome()
driver.get(os.path.abspath('./website.html'))  # Chrome turns this into a file:///C:/... URL (Windows)

root_ = os.getcwd()  # fixed download root, so every paper gets a sibling directory
for paper in driver.find_elements(By.CLASS_NAME, 'ptitle'):
    # locate: the second following <dd> holds the [pdf] / [supp] links
    node_ = paper.find_element(By.XPATH, r'following-sibling::dd[2]')
    # The href is resolved against the local file:///C:-style base, so strip the
    # first 10 characters ('file:///C:') and prepend the real domain; note this
    # slice assumes a Windows drive-letter path.
    pdf_url = 'https://openaccess.thecvf.com' + node_.find_element(By.XPATH, './a[1]').get_attribute('href')[10:]
    print(f'locate pdf from link:\n{pdf_url}')
    try:
        if node_.find_element(By.LINK_TEXT, 'supp').get_attribute('href')[-4:] == '.zip':
            zip_url = 'https://openaccess.thecvf.com' + node_.find_element(By.LINK_TEXT, 'supp').get_attribute('href')[10:]
            print(f'locate zip from link:\n{zip_url}')
        else:
            zip_url = None
    except NoSuchElementException:
        zip_url = None  # no 'supp' link at all; still download the PDF
    # download: strip the '.../papers/' prefix (54 chars) and the
    # '_CVPR_2024_paper.pdf' suffix (20 chars) to recover the title slug
    paper_title = pdf_url[54:-20]
    file_root = os.path.join(root_, paper_title)
    file_name = paper_title[:35]  # keep file names reasonably short
    os.makedirs(file_root, exist_ok=True)
    response_pdf = requests.get(url=pdf_url, headers=headers)
    if response_pdf.status_code == 200:
        with open(os.path.join(file_root, file_name + '.pdf'), 'wb') as pdf_file:
            pdf_file.write(response_pdf.content)
    if zip_url is not None:
        response_zip = requests.get(url=zip_url, headers=headers)
        if response_zip.status_code == 200:
            with open(os.path.join(file_root, file_name + '.zip'), 'wb') as zip_file:
                zip_file.write(response_zip.content)

driver.quit()
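The bare slice indices in the script (`href[10:]`, `pdf_url[54:-20]`) are tied to this exact setup (Windows drive-letter paths, this URL layout) and break silently if either changes. As a less brittle alternative, here is a small sketch that derives the same title slug from the URL path with the standard library; the function name is my own, not part of the script above:

```python
import os
from urllib.parse import urlparse

def paper_title_from_url(pdf_url: str) -> str:
    """Derive the paper's title slug from an openaccess PDF URL by taking
    the last path component and dropping the fixed '_CVPR_2024_paper.pdf'
    suffix, instead of relying on hard-coded character offsets."""
    name = os.path.basename(urlparse(pdf_url).path)  # e.g. 'Foo_Bar_CVPR_2024_paper.pdf'
    return name[:-len('_CVPR_2024_paper.pdf')]
```

This works no matter how long the URL prefix is, so the same helper could also parse the supplementary-material links.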