Download file from Blob URL with Python

Question

I wish to have my Python script download the Master data (Download, XLSX) Excel file from this Frankfurt stock exchange webpage.

When I try to retrieve it with urllib or wget, it turns out that the URL leads to a Blob, and the downloaded file is only 289 bytes and unreadable.

http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx

I'm entirely unfamiliar with Blobs and have these questions:

Can the file "behind the Blob" be successfully retrieved using Python?

If so, is it necessary to uncover the "true" URL behind the Blob – if there is such a thing – and how? My concern here is that the link above won't be static but actually change often.

Answer 1:

That 289-byte response might be the HTML of a 403 Forbidden page. This happens because the server rejects requests that do not specify a user agent.

Python 3

# python3

import urllib.request as request

url = 'http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx'

# fake user agent of Safari

fake_useragent = 'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25'

r = request.Request(url, headers={'User-Agent': fake_useragent})

f = request.urlopen(r)

# print or write

print(f.read())
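The snippet above prints the raw bytes. To write them to disk instead, a minimal sketch along the same lines might look like this. The `download` helper, its `opener` parameter, and the 30-second timeout are illustrative assumptions, not part of the original answer.

```python
import shutil
import urllib.request


def download(url, dest_path, user_agent, opener=urllib.request.urlopen):
    """Fetch url with an explicit User-Agent and stream the body to dest_path."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    # urlopen raises urllib.error.HTTPError for responses such as 403,
    # so a rejected request fails loudly instead of saving an error page.
    with opener(req, timeout=30) as resp, open(dest_path, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest_path
```

Calling `download(url, 'etfs.xlsx', fake_useragent)` with the variables defined above (the output filename is a placeholder) should save the workbook next to the script, assuming the server accepts that user agent.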

Python 2

# python2

import urllib2

url = 'http://www.xetra.com/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx'

# fake user agent of safari

fake_useragent = 'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25'

r = urllib2.Request(url, headers={'User-Agent': fake_useragent})

f = urllib2.urlopen(r)

print(f.read())
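To tell whether a download is the real workbook or a small error page like the 289-byte response from the question, one quick heuristic is to look at the first bytes: an .xlsx file is a ZIP archive, and a ZIP archive with at least one entry begins with the signature `PK\x03\x04`. The helper name `looks_like_xlsx` below is hypothetical, and the check only inspects the signature, not the workbook contents.

```python
ZIP_MAGIC = b"PK\x03\x04"  # local file header signature that starts a ZIP archive


def looks_like_xlsx(data):
    """Return True if data starts like a ZIP archive, which is what an .xlsx file is."""
    return data[:4] == ZIP_MAGIC
```

With either snippet above, `looks_like_xlsx(f.read())` would be expected to return False for an HTML error page and True for the actual Excel file.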

Answer 2:

from bs4 import BeautifulSoup

import requests

import re

url='http://www.xetra.com/xetra-en/instruments/etf-exchange-traded-funds/list-of-tradable-etfs'

html=requests.get(url)

page=BeautifulSoup(html.content,'html.parser') #name the parser explicitly to avoid a bs4 warning

reg=re.compile('Master data')

find=page.find('span',text=reg) #find the file url

file_url='http://www.xetra.com'+find.parent['href']

file=requests.get(file_url)

with open(r'C:\Users\user\Downloads\file.xlsx','wb') as ff:

    ff.write(file.content)

I recommend requests and BeautifulSoup; both are good libraries.
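The code above builds the file URL by concatenating the host name with the link's href. A slightly more defensive variant, sketched here as an assumption rather than as part of the original answer, is to resolve the href against the page URL with `urllib.parse.urljoin`, which also copes with hrefs that are already absolute. The `href` value below is simply the path from the question's URL, used for illustration; on the live page it would come from `find.parent['href']`.

```python
from urllib.parse import urljoin

page_url = 'http://www.xetra.com/xetra-en/instruments/etf-exchange-traded-funds/list-of-tradable-etfs'

# Illustrative href taken from the URL in the question.
href = '/blob/1193366/b2f210876702b8e08e40b8ecb769a02e/data/All-tradable-ETFs-ETCs-and-ETNs.xlsx'

# A root-relative href replaces the path of page_url and keeps its scheme and host.
file_url = urljoin(page_url, href)
print(file_url)
```

Because the link is looked up on the page each time, this approach should keep working even if the hash in the Blob URL changes, as long as the page layout stays the same.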

Source: https://stackoverflow.com/questions/39517522/download-file-from-blob-url-with-python
