[Python] Scraping CVPR Papers from the Web

Vincent_gc

Category: python

Article link: https://blog.csdn.net/a529975125/article/details/79479438

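This article shows how to use Python to automatically download papers from the CVPR conference. The requests module fetches the page content, a regular expression finds the paper links, and urllib downloads the papers. The steps covered are fetching the page content, matching the paper links, and downloading.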

Motivation

Use Python to download CVPR papers automatically.

Workflow

Fetch the page content
Find all paper links
Download
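As a rough sketch of how the last two steps connect: if the links found in step 2 are relative to the index page, they need to be resolved against the page URL before step 3 can download them. The function names `resolve_links` and `download_papers`, the `papers` output directory, and the example.com URL in the usage note are assumptions made for illustration, not part of the original script.

```python
import os
import urllib.request
from urllib.parse import urljoin, urlparse


def resolve_links(index_url, hrefs):
    """Turn hrefs found on the index page into absolute URLs."""
    return [urljoin(index_url, href) for href in hrefs]


def download_papers(urls, out_dir="papers"):
    """Save each URL into out_dir and return the local paths (network I/O)."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for url in urls:
        # Use the last path component of the URL as the local file name.
        filename = os.path.basename(urlparse(url).path)
        target = os.path.join(out_dir, filename)
        urllib.request.urlretrieve(url, target)
        saved.append(target)
    return saved
```

A call such as `download_papers(resolve_links("http://example.com/index.html", link_list))` would then fetch every paper in the list, one request at a time.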

1. Fetch the page content

Module used: requests

Key function: requests.get

Output: web_context

Reference:
http://blog.csdn.net/fly_yr/article/details/51525435

# get web context
import requests

def get_context(url):
    """
    params:
        url: link of the page to fetch
    return:
        web_context: the page body as text
    """
    web_context = requests.get(url)
    return web_context.text
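A slightly more defensive variant is sketched below, in case the request hangs or the server returns an error page. The name `fetch_html`, the 10-second default timeout, and the optional `session` argument are assumptions added for illustration. `timeout=` and `raise_for_status()` are standard parts of the requests API.

```python
import requests


def fetch_html(url, timeout=10.0, session=None):
    """Return the page body as text, raising on HTTP errors."""
    # `session` is a hypothetical hook: any object with a requests-style
    # .get() method, such as a requests.Session, can be passed in.
    client = session if session is not None else requests
    response = client.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of returning an error page.
    response.raise_for_status()
    return response.text
```

Passing a `requests.Session()` as `session` reuses one connection across several requests, which can help when more than one index page is fetched.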

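2. Find the paper links

Module used: import re

Key function: re.findall()

Output: a list of download links for the CVPR papers

Form of a paper's PDF link:
href="content_cvpr_2016/papers/Hendricks_Deep_Compositional_Captioning_CVPR_2016_paper.pdf">pdf

Use a regular expression to find every link that matches this text form.

References: https://www.cnblogs.com/MrFiona/p/5954084.html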
http://blog.csdn.net/u014467169/article/details/51345657
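To make the link form above concrete, here is a small sketch that runs two equivalent patterns on a hand-written snippet. The `sample` markup is an assumption modelled on the link form shown above, and the second paper name in it is made up. The lookaround version checks the surrounding `href="` and `">pdf` without returning them, while the capture-group version returns only the parenthesised part.

```python
import re

# Hand-written sample markup; not taken from the real CVPR page.
sample = (
    '<a href="content_cvpr_2016/papers/'
    'Hendricks_Deep_Compositional_Captioning_CVPR_2016_paper.pdf">pdf</a>\n'
    '<a href="content_cvpr_2016/papers/Example_Paper_CVPR_2016_paper.pdf">pdf</a>\n'
    '<a href="content_cvpr_2016/html/Example_Paper_CVPR_2016_paper.html">abstract</a>\n'
)

# Lookbehind and lookahead: the context is required but not part of the match.
lookaround = re.findall(r'(?<=href=").+?pdf(?=">pdf)', sample)

# Capture group: findall returns only the text inside the parentheses.
captured = re.findall(r'href="(.+?pdf)">pdf', sample)

for link in lookaround:
    print(link)
```

Both lists contain the two PDF paths and skip the HTML link. The lazy `.+?` is what stops each match at the end of its own link.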

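# find paper files
import re

'''
(?<=href=\"): lookbehind; match only text that comes right after href="
.+?: match one or more characters (except newline), as few as possible (lazy)
pdf: the literal text pdf that ends the link
(?=\">pdf): lookahead; the match must be followed by ">pdf
|: alternation (or)
'''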
#link pattern: href="***_CVPR_2016_paper.pdf">pdf
# the second alternative handles single-quoted href values
link_list = re.findall(r"(?<=href=\").+?pdf(?=\">pdf)|(?<=href=\').+?pdf(?=\'>pdf)", web_context)
#name pattern: <a href="***_CVPR_2016_paper.html">***</a>
name_list = re.findall(r