Downloading All XKCD Comics with Python

脑电信号要分类

Tags: python

Article link: https://blog.csdn.net/haojun1996/article/details/106930502

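1. What the program does:

Load the home page
Save the comic image on that page
Follow the link to the previous comic
Repeat until the first comic is reached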

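This means the code needs to:

Download the page with the requests module
Find the URL of the comic image in the page with Beautiful Soup
Download the comic image with iter_content() and save it to disk
Find the URL of the link to the previous comic, then repeat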

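Step 1: Design the program

Open the browser's developer tools and inspect the elements on the page. You will find the following:
The URL of the comic image file is given by the src attribute of an <img> element
The <img> element is inside the <div id="comic"> element
The Prev button has a rel HTML attribute whose value is prev
The Prev button of the first comic links to the URL http://xkcd.com/#, which indicates that there is no previous page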

import requests,os,bs4
url='https://xkcd.com/'     #starting url
os.makedirs('xkcd',exist_ok=True)   #store comics in ./xkcd
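The code later in this article builds URLs by string concatenation: it prepends a scheme to the image src (which implies the src starts with //) and joins the site address with the Prev link's href. As an alternative, the standard library's urllib.parse.urljoin can resolve both kinds of value against the page URL. The following is only a sketch of that idea, not the approach this article uses; the helper names image_url and previous_page_url and the sample paths are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = 'https://xkcd.com/'  # starting URL used in this article


def image_url(page_url, src):
    """Resolve an <img> src (absolute, protocol-relative, or relative) against the page URL."""
    return urljoin(page_url, src)


def previous_page_url(page_url, href):
    """Return the URL of the previous comic, or None when the Prev link is '#'."""
    if href == '#':
        return None  # first comic: there is no previous page
    return urljoin(page_url, href)


# Sample values are invented for illustration only.
print(image_url(BASE_URL, '//imgs.xkcd.com/comics/example.png'))
print(previous_page_url(BASE_URL, '/2/'))
print(previous_page_url('https://xkcd.com/1/', '#'))
```

The explicit check for '#' matters in this sketch because the helper signals "no previous page" with None instead of relying on the resulting URL ending in '#'.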

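Step 2: Download the page

print('Downloading page %s...' % url)
res=requests.get(url)   #download the page
res.raise_for_status()  #raise an exception and stop the program if the download failed
soup=bs4.BeautifulSoup(res.text,features='html.parser')   #parse the HTML for the following steps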
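By default requests.get() does not time out, so a crawl over many pages can hang if the server stops responding. A minimal sketch of the download step with a timeout follows; the helper name fetch_page and the 10-second value are assumptions chosen for this example and are not part of the original program.

```python
import requests


def fetch_page(url, timeout=10):
    """Download a page and return its HTML text.

    Raises requests.exceptions.HTTPError for 4xx/5xx responses and
    requests.exceptions.Timeout if the server takes too long.
    """
    res = requests.get(url, timeout=timeout)
    res.raise_for_status()
    return res.text
```

A caller could wrap fetch_page(url) in try/except requests.exceptions.RequestException to report a failed page and decide whether to stop or continue.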

Step 3: Find and download the comic image

#Find the URL of the comic image.
#select() returns an empty list if nothing matches; otherwise the list holds one <img> element,
#whose src attribute can be passed to requests.get() to download the image file.
comicElem=soup.select('#comic img')
if comicElem==[]:
    print('Could not find comic image.')
else:
    comicUrl=comicElem[0].get('src')
    print('Downloading image %s...' % (comicUrl))
    res=requests.get('https:'+comicUrl)   #src starts with //, so prepend the scheme
    res.raise_for_status()

The comic's <img> element sits inside a <div> element whose id attribute is set to comic, so the selector '#comic img' picks the correct <img> element out of the BeautifulSoup object.
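The selectors can be tried in isolation on a small hand-written HTML fragment, without touching the network. The markup below is a simplified stand-in invented for this example, not a copy of the real XKCD page.

```python
import bs4

# Simplified, made-up markup that mimics the structure found in Step 1.
SAMPLE_HTML = '''
<div id="comic">
  <img src="//imgs.xkcd.com/comics/example.png" alt="Example">
</div>
<a rel="prev" href="/2/">&lt; Prev</a>
'''

soup = bs4.BeautifulSoup(SAMPLE_HTML, features='html.parser')

comic_elems = soup.select('#comic img')    # list of matching <img> elements
prev_links = soup.select('a[rel="prev"]')  # list of matching <a> elements

img_src = comic_elems[0].get('src')
prev_href = prev_links[0].get('href')
print(img_src)
print(prev_href)

# A selector that matches nothing yields an empty list rather than an error.
no_match = soup.select('#missing img')
print(len(no_match))
```

Because select() returns a list in both cases, checking for an empty result before indexing with [0] avoids an IndexError on pages without a comic image.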

Step 4: Save the image and find the previous comic

#Save the image to ./xkcd.
imageFile=open(os.path.join('xkcd',os.path.basename(comicUrl)),'wb')
for chunk in res.iter_content(100000):
    imageFile.write(chunk)
imageFile.close()

#Get the Prev button's URL.
prevLink=soup.select('a[rel="prev"]')[0]
url='https://xkcd.com'+prevLink.get('href')   #href already starts with / (or is #)

At this point the comic's image data is held in the variable res. You need to write that image data to a file on disk.
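Writing the chunks inside a with block closes the file even if an error interrupts the loop. The sketch below shows this with a hypothetical helper, save_chunks, which accepts any iterable of bytes objects; it could be given res.iter_content(100000), but here it is fed a plain list so the example runs without a download.

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of bytes objects to path and return the byte count."""
    total = 0
    with open(path, 'wb') as image_file:  # closed automatically on exit
        for chunk in chunks:
            image_file.write(chunk)
            total += len(chunk)
    return total


# Demonstration with fake data written to a temporary directory.
with tempfile.TemporaryDirectory() as folder:
    file_name = os.path.basename('//imgs.xkcd.com/comics/example.png')
    target = os.path.join(folder, file_name)
    written = save_chunks([b'abc', b'defg'], target)
    print(file_name, written)
```

In the downloader, the equivalent call would be save_chunks(res.iter_content(100000), os.path.join('xkcd', os.path.basename(comicUrl))).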

The complete code for the project:

import requests,os,bs4

url='https://xkcd.com/' #starting url
os.makedirs('xkcd',exist_ok=True) #store comics in ./xkcd
while not url.endswith('#'):
    #Download the page.
    print('Downloading page %s...' % url)
    res=requests.get(url)
    res.raise_for_status()
    soup=bs4.BeautifulSoup(res.text,features='html.parser')

    #Find the URL of the comic image.
    comicElem=soup.select('#comic img')
    if comicElem==[]:
        print('Could not find comic image.')
    else:
        comicUrl=comicElem[0].get('src')
        #Download the image.
        print('Downloading image %s...' % (comicUrl))
        res=requests.get('https:'+comicUrl)
        res.raise_for_status()

        #Save the image to ./xkcd.
        imageFile=open(os.path.join('xkcd',os.path.basename(comicUrl)),'wb')
        for chunk in res.iter_content(100000):
            imageFile.write(chunk)
        imageFile.close()

    #Get the Prev button's URL. This runs even when no image was found,
    #so the loop always moves on to the previous comic.
    prevLink=soup.select('a[rel="prev"]')[0]
    url='https://xkcd.com'+prevLink.get('href')

print('Done')