除了使用来自urllib2的不存在的请求实体之外,您不能正确地输出到文件中—您不能只打开目录,必须分别打开每个文件进行输出。在
另外,“Requests”包的接口比urllib2好得多。我建议安装它。在
请注意,今天,第一个.zip文件是5.7Gb,因此流式传输到文件是必不可少的。在
真的,你想要更像这样的东西:from BeautifulSoup import BeautifulSoup
import requests
# point to output directory
outpath = 'D:/patent_zips/'
url = 'http://www.google.com/googlebooks/uspto-patents-grants.html'
mbyte=1024*1024
print 'Reading: ', url
html = requests.get(url).text
soup = BeautifulSoup(html)
print 'Processing: ', url
for name in soup.findAll('a', href=True):
zipurl = name['href']
if( zipurl.endswith('.zip') ):
outfname = outpath + zipurl.split('/')[-1]
r = requests.get(zipurl, stream=True)
if( r.status_code == requests.codes.ok ) :
fsize = int(r.headers['content-length'])
print 'Downloading %s (%sMb)' % ( outfname, fsize/mbyte )
with open(outfname, 'wb') as fd:
for chunk in r.iter_content(chunk_size=1024): # chuck size can be larger
if chunk: # ignore keep-alive requests
fd.write(chunk)
fd.close()