Downloading .xls files from a web page using Python and BeautifulSoup


氧化三氢正离子


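Copyright notice: this is an original article by the blogger, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/weixin_31186111/article/details/114910316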


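The problems with your script as it stands are:

The url has a trailing / which gives an invalid page when requested, rather than the page listing the files you want to download.

The CSS selector in soup.select(...) selects div elements with the attribute webpartid, which does not exist anywhere in the linked document.

You are joining the URL and quoting it, even though the links in the page are given as absolute URLs and do not need quoting.

The try:...except: block stops you from seeing the errors generated when trying to download a file. Using an except block without a specific exception is bad practice and should be avoided.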
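To illustrate the last point, here is a minimal sketch of catching only the specific exceptions a download can raise and reporting them, instead of hiding them behind a bare except. The helper name download_file and its retrieve parameter are hypothetical and not part of the original script; retrieve exists only so the function can be exercised without network access.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlretrieve


def download_file(href, filename, retrieve=urlretrieve):
    """Download href to filename; return True on success, False on a reported failure.

    Hypothetical helper: `retrieve` defaults to urllib's urlretrieve and can be
    replaced by a stand-in function when testing offline.
    """
    try:
        retrieve(href, filename)
    except HTTPError as e:
        # The server answered, but with an error status such as 403 or 404.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print("HTTP error %s for %s: %s" % (e.code, href, e.reason))
        return False
    except URLError as e:
        # The server could not be reached at all (DNS failure, refused connection, ...)
        print("Could not reach %s: %s" % (href, e.reason))
        return False
    return True
```

Any other exception (for example a programming error such as a NameError) is deliberately not caught here, so it still surfaces with a full traceback.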

A modified version of your code that gets the correct files and attempts to download them is as follows:

from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve

# Remove the trailing / you had, as that gives a 404 page
url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'

u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")

# Select all A elements with href attributes containing URLs starting with http://
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')

    # Make sure it has one of the correct extensions
    if not any(href.endswith(x) for x in ['.csv', '.xls', '.xlsx']):
        continue

    # The links are already absolute URLs, so no joining or quoting is needed
    filename = href.rsplit('/', 1)[-1]
    print("Downloading %s to %s..." % (href, filename))
    urlretrieve(href, filename)
    print("Done.")

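However, if you run this you will notice that a urllib.error.HTTPError: HTTP Error 403: Forbidden exception is thrown, even though the file can be downloaded in the browser.

At first I thought this was a referrer check (to prevent hotlinking), but if you watch the request in your browser (e.g. in Chrome Developer Tools) you will notice that the initial http:// request is blocked there too, and that Chrome then attempts an https:// request for the same file.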

In other words, the request must go over HTTPS to work (despite what the URLs in the page say). To fix this, you need to rewrite http: to https: before using the URL for the request. The following code correctly modifies the URLs and downloads the files. I have also added a variable to specify the output folder, which is prepended to the filename using os.path.join:

import os
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve

URL = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
OUTPUT_DIR = ''  # path to output folder, '.' or '' uses current folder

u = urlopen(URL)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")

for link in soup.select('a[href^="http://"]'):
    href = link.get('href')

    # Skip links that do not point to a spreadsheet or CSV file
    if not any(href.endswith(x) for x in ['.csv', '.xls', '.xlsx']):
        continue

    filename = os.path.join(OUTPUT_DIR, href.rsplit('/', 1)[-1])

    # We need a https:// URL for this site
    href = href.replace('http://', 'https://')

    print("Downloading %s to %s..." % (href, filename))
    urlretrieve(href, filename)
    print("Done.")
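The string replacement above is enough for this site. As an alternative, here is a small sketch that performs the same http-to-https rewrite and the filename extraction with urllib.parse, so that only the scheme is touched and any query string is ignored when deriving the filename. The function names force_https and filename_from_url, and the example URL, are illustrative assumptions and do not come from the original answer.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def force_https(href):
    """Return href with an http scheme switched to https; other URLs are unchanged."""
    parts = urlsplit(href)
    if parts.scheme == 'http':
        parts = parts._replace(scheme='https')
    return urlunsplit(parts)


def filename_from_url(href):
    """Return the last path component of href, ignoring any query string or fragment."""
    return posixpath.basename(urlsplit(href).path)


# Made-up URL, for illustration only
example = 'http://example.com/files/report.xls?download=1'
print(force_https(example))        # https://example.com/files/report.xls?download=1
print(filename_from_url(example))  # report.xls
```

In the loop above, these helpers would stand in for href.replace('http://', 'https://') and href.rsplit('/', 1)[-1] respectively.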
