Learning Python Web Scraping: Crawling and Downloading PDF Papers

Latest recommended article published on 2024-06-11 20:20:32

弓长女爱♡


Category: Python Web Scraping; Tags: python

Article link: https://blog.csdn.net/qq_45742126/article/details/112415509


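import os
from urllib.parse import urljoin
from urllib.request import Request, urlopen, urlretrieve

from bs4 import BeautifulSoup

url = "http://cjc.ict.ac.cn/qwjs/No2020-01.htm"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}

# Fetch the issue's table-of-contents page with a browser-like User-Agent.
req = Request(url, headers=headers)
html = urlopen(req)
bs = BeautifulSoup(html, "html.parser")

# Paper titles and PDF links both sit inside the "Section1" div.
div = bs.find("div", {"class": "Section1"})
titles = div.find_all("span", {"style": "color:#006688"})
a_all = div.find_all("a", href=True)  # skip anchors that have no href

# The text appears to be GBK bytes decoded as ISO-8859-1, so each title is
# re-encoded and decoded as GBK to recover the Chinese characters.
title_list = [t.get_text().encode('iso-8859-1').decode('gbk') for t in titles]

# Create the output directory once, before the download loop.
out_dir = os.path.abspath('./Directory')
os.makedirs(out_dir, exist_ok=True)

# Pair links with titles in page order; zip() stops at the shorter list,
# which avoids an IndexError when the two counts differ.
for i, (a, title) in enumerate(zip(a_all, title_list), start=1):
    href = urljoin(url, a["href"])  # resolve relative links against the page URL
    work_path = os.path.join(out_dir, '{}{}.pdf'.format(i, title))
    urlretrieve(href, work_path)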
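
The script above builds each file name directly from the scraped title, so a title containing characters such as `/`, `:` or `?` could make the save step fail, especially on Windows. Below is a minimal sketch of one way to guard against that. The helper name `safe_filename`, the `max_len` parameter, and the choice of `_` as the replacement character are assumptions made for illustration; they are not part of the original script.

```python
import re

# Characters that Windows forbids in file names, plus line breaks and tabs.
# (Assumption: replacing them with "_" is acceptable for these titles.)
_ILLEGAL = re.compile(r'[\\/:*?"<>|\r\n\t]+')


def safe_filename(title, index, max_len=100):
    """Return '<index><title>.pdf' with illegal characters replaced by '_'.

    Hypothetical helper: each run of illegal characters becomes a single
    underscore, leading/trailing spaces and dots are trimmed, and the
    title part is truncated to max_len characters.
    """
    cleaned = _ILLEGAL.sub('_', title).strip(' .')
    if not cleaned:
        cleaned = 'untitled'
    return '{}{}.pdf'.format(index, cleaned[:max_len])
```

To use it, pass the return value to `os.path.join` together with the output directory, in place of the formatted `.pdf` name built inside the download loop.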
