Python 3: save a web page with requests and its images with BeautifulSoup, so that the article text and images display correctly locally...

weixin_30292843

Tags: python, crawler

Original link: http://www.cnblogs.com/nancyzhu/p/8412950.html

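I wrote a simple crawler with the requests module that saves a blog article and its images to the local disk, storing the article as an '.html' file. Once the article is saved locally, its image links still point to the target site, as either absolute or relative paths. To display the images locally as well, those links in the local HTML file have to be replaced with the local paths of the downloaded images.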
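
The different link forms mentioned above (protocol-relative, site-relative, absolute) can be normalized into a downloadable URL before fetching. Here is a minimal sketch using `urljoin` from the standard library; `resolve_image_url` is a hypothetical helper name and the `example.com` links are made-up sample values, not taken from the original program.

```python
from urllib.parse import urljoin

def resolve_image_url(page_url, src):
    """Turn the src value of an <img> tag into an absolute URL, relative to page_url."""
    return urljoin(page_url, src)

page = 'http://www.cnblogs.com/nancyzhu/p/8146408.html'
# Protocol-relative link: inherits the scheme of the page
print(resolve_image_url(page, '//example.com/a.png'))
# Site-relative link: inherits the scheme and host of the page
print(resolve_image_url(page, '/images/b.gif'))
# Absolute link: returned unchanged
print(resolve_image_url(page, 'http://example.com/c.jpg'))
```

With this approach a single call covers the cases that would otherwise need separate `startswith` checks.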

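The page is saved with the requests module, and the image links are extracted with BeautifulSoup (the image files themselves are also downloaded with requests). Both are third-party modules: they must be installed first and imported explicitly before use.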

Installation:

pip install requests

In Python 3, pip install beautifulsoup may fail with an error. Run pip install bs4 instead, which installs successfully.

This is because BeautifulSoup lives in the bs4 package. Import it with: from bs4 import BeautifulSoup
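
Because the install name and the import name can differ, it may help to check that the modules are importable before running the crawler. This is a small sketch using only the standard library; `missing_modules` is a hypothetical helper name.

```python
import importlib.util

def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Note that the import name is 'bs4', not 'beautifulsoup'
print(missing_modules(['requests', 'bs4']))
```

An empty list means both modules were found in the current environment.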

The full code is as follows:

from bs4 import BeautifulSoup
import os
import requests

# Directory where images are saved, e.g. if this script is in 'D:\Coding', images go to 'D:\Coding\imgs1\'
targetDir = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'imgs1')
if not os.path.isdir(targetDir):  # create the directory if it does not exist
    os.mkdir(targetDir)
url = 'http://www.cnblogs.com/nancyzhu/p/8146408.html'
domain = 'http://www.cnblogs.com'

# Save the page to a local file
def save_html():
    r_page = requests.get(url)
    with open('page.html', 'wb') as f:
        f.write(r_page.content)  # save to page.html
    return r_page

# Rewrite the file, replacing an image link with the local path
def update_file(old, new):
    # Open two files: the original for reading, the other to receive the modified content
    with open('page.html', encoding='utf-8') as f, \
            open('page_bak.html', 'w', encoding='utf-8') as fw:
        for line in f:  # each line is a string, so replace() can be used
            new_line = line.replace(old, new)  # replace line by line
            fw.write(new_line)  # write to the new file
    os.remove('page.html')  # delete the original file
    os.rename('page_bak.html', 'page.html')  # rename the new file, old -> new

# Save the images locally
def save_file_to_local():
    # 'lxml' selects the lxml parser, which is fast and tolerant of bad markup (requires pip install lxml)
    obj = BeautifulSoup(save_html().content, 'lxml')
    imgs = obj.find_all('img')
    # Collect the image links on the page into a list
    urls = []
    for img in imgs:
        if img.has_attr('data-src'):
            urls.append(img['data-src'])
        elif img.has_attr('src'):  # skip <img> tags that have neither attribute
            urls.append(img['src'])
    # Download every image into the target directory, naming the files 0, 1, 2...
    i = 0
    for img_url in urls:  # handle each link format found in the article
        if img_url.startswith('//'):
            new_url = 'http:' + img_url
        elif img_url.startswith('/') and img_url.endswith('gif'):
            new_url = domain + img_url
        elif img_url.endswith('.png') or img_url.endswith('jpg') or img_url.endswith('gif'):
            new_url = img_url
        else:
            continue  # unhandled link format: skip it instead of reusing a previous response
        r = requests.get(new_url)
        t = os.path.join(targetDir, str(i) + '.jpg')  # absolute path of the local file
        with open(t, 'wb') as fw:
            fw.write(r.content)  # save the image into the target directory
        i += 1
        # Replace the old link (possibly relative) with the local path,
        # so the images load when the HTML file is opened locally
        update_file(img_url, t)

if __name__ == '__main__':
    save_file_to_local()  # this already calls save_html(), so the page is downloaded only once
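
The program above names every image `N.jpg` regardless of its real format and rewrites `page.html` once per image. One possible variation, sketched here under assumptions, keeps the extension found in the link and replaces all links in memory in a single pass. `local_name` and `replace_links` are hypothetical helpers, and the sample HTML and `example.com` links are made up for illustration.

```python
import os
from urllib.parse import urlparse

def local_name(index, img_url, default_ext='.jpg'):
    """Build a file name such as '0.png' from the extension in the URL path."""
    ext = os.path.splitext(urlparse(img_url).path)[1]
    return str(index) + (ext if ext else default_ext)

def replace_links(html, mapping):
    """Replace each original link in html with its local path."""
    # Replace longer links first, so a link that is a substring
    # of a longer link does not corrupt the longer one.
    for old in sorted(mapping, key=len, reverse=True):
        html = html.replace(old, mapping[old])
    return html

html = '<img src="//example.com/a.png"><img src="/images/b.gif">'
links = ['//example.com/a.png', '/images/b.gif']
# Map every original link to a path relative to the saved page (assumed layout)
mapping = {link: 'imgs1/' + local_name(i, link) for i, link in enumerate(links)}
print(replace_links(html, mapping))
```

The relative `imgs1/...` paths assume that the saved HTML file sits next to the `imgs1` folder. If the page is written elsewhere, an absolute path like the one used in the program above is the safer choice.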

Reposted from: https://www.cnblogs.com/nancyzhu/p/8412950.html
