国家地理爬虫

最新推荐文章于 2022-02-19 10:31:42 发布

weixin_30481087

最新推荐文章于 2022-02-19 10:31:42 发布

阅读量199

点赞数

文章标签：爬虫 python

原文链接：http://www.cnblogs.com/yxxblog/p/9543333.html

版权

先分析一下网页结构：

可以看到图片都是放在div class=‘imgbox’

　　　　　　　　　　　　div class=‘img’

用requests解析网页，用Beautiful soup将class= ‘img’中的图片链接解析出来，

然后（比较重要的一步）：

再次运用requests将图片的链接解析，然后保存。

文件的的创建，按照链接的名字创建链接，具体代码如下所示，具体讲解无。

纯文本：

#国家地理中一篇文档中图片的爬虫
import os
import requests
from bs4 import BeautifulSoup
def craete_dir(name):
if not os.path.exists(name):
os.makedirs(name)
def getlink(url):
herader = {'User-Agent':'Mozilla/5.0'}
req = requests.get(url,headers=herader)
req.encoding=req.apparent_encoding
soup = BeautifulSoup(req.text,'lxml')
soups= soup.findAll('div',class_='imgbox')
for link in soups:
links =link.find('div',class_='img').find('img')
links_a=links.get('src')
#craete_dir('c:/Users/****/Desktop/haha/{}'.format(links_a.split('/')[-1]))
print(links_a)
jieguo = requests.get(links_a)
print(links_a.split('/')[-1][:-5])
craete_dir('c:/Users/***/Desktop/haha')
with open('c:/Users/***/Desktop/haha/{}'.format(links_a.split('/')[-1])[:-5],'wb')as f:
f.write(jieguo.content)
getlink('http://www.dili360.com/article/p5b57eab97878985.htm')

转载于:https://www.cnblogs.com/yxxblog/p/9543333.html

weixin_30481087

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
国家地理爬虫

先分析一下网页结构：可以看到图片都是放在div class=‘imgbox’　　　　　　　　　　　　div class=‘img’用requests解析网页，用Beautiful soup将class= ‘img’中的图片链接解析出来，然后（比较重要的一步）：再次运用requests将图片的链接解析，然后保存。文件的的创建，按照链接的名字创建链接，具体代码如下所示，具体...
复制链接

扫一扫