embed标签修饰的pdf文件下载——（embed src=“about:blank“）

最新推荐文章于 2024-05-13 15:52:19 发布

ALittleHigh

最新推荐文章于 2024-05-13 15:52:19 发布

阅读量3.5k

点赞数 2

分类专栏： python爬虫文章标签： python windows embedded httpwebrequest

本文链接：https://blog.csdn.net/whitedrogen/article/details/107973308

版权

python爬虫专栏收录该内容

14 篇文章 2 订阅

订阅专栏

今天需要在BioLib网站上看点东西，发现都是pdf文件，还不如下载下来
结果查看源代码后发现都是embed标签修饰，一开始有点蒙，因为src属性里没有链接。
最后发现直接用content就可以。淦。
幸亏没用自动化调试。
在这里插入图片描述
随便点进去一个

发现没有下载链接
但是！！！，打开它的Network，神奇的事情发生了

Content-Length的大小和pdf文档大小很接近，试着下载了一个，发现没有问题，内容无缺失。
既然已经找到了内容，思路理清后，代码也就水到渠成了

import requests
from bs4 import BeautifulSoup
import os

path = r'E:\desktop\BioLib'
url = 'http://www.biolib.de/library/pdf_index_de.html'
headers = {'user-agent': 'Mozilla/5.0'}

res = requests.get(url, headers)

soup = BeautifulSoup(res.text, 'html.parser')

box = soup.find('td', class_='boxrot')

files = box.find_all('a')

for file in files:
    title = file.text
    link = file['href']

    if not os.path.exists(path + '\\' + title):
        content = requests.get(link, headers, stream=True)

        size = 0
        chunk_size = 1024
        content_size = int(content.headers['content-length'])
        print('start download {title}, File size: {size:.2f} MB'.format(title=title,
                                                                        size=content_size / chunk_size / 1024))

        with open(path + '\\' + title, 'wb') as f:
            for data in content.iter_content(chunk_size):
                f.write(data)
                size += len(data)
                print('\r' + '[DOWNLOADING]:%s%.2f%%' %
                      ('>' * int(size * 50 / content_size), float(size / content_size * 100)), end=' ')
        print('\rFile %s download completed!' % title)
    else:
        print('File %s existed!' % title)

欢迎留言交流

只动脑，不动手，知识不长久

ALittleHigh

关注

2
点赞
踩
1

收藏

觉得还不错? 一键收藏
打赏
0
评论
embed标签修饰的pdf文件下载——（embed src=“about:blank“）

今天需要在BioLib网站上看点东西，发现都是pdf文件，还不如下载下来结果查看源代码后发现都是embed标签修饰，一开始有点蒙，因为src属性里没有链接。最后发现直接用content就可以。淦。幸亏没用自动化调试。随便点进去一个发现没有下载链接但是！！！，打开它的Network，神奇的事情发生了Content-Length的大小和pdf文档大小很接近，试着下载了一个，发现没有问题，内容无缺失。既然已经找到了内容，思路理清后，代码也就水到渠成了import requestsfrom
复制链接

扫一扫