python-75: BS4 Example 1 Source Code

weixin_33998125

Tags: python, web crawler

Original link: https://my.oschina.net/u/2429887/blog/600910

The final source code that implements all of our features looks like this:

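#!/usr/bin/env python
# -*- coding:UTF-8 -*-
__author__ = '217小月月坑'

'''
Final source code for Example 1 (Python 2)
'''

from bs4 import BeautifulSoup
import urllib2
# Work around encoding errors when writing non-ASCII text (Python 2 only)
import sys
reload(sys)
sys.setdefaultencoding("utf-8")

url = 'http://beautifulsoup.readthedocs.org/zh_CN/latest/#'
# Download the page source
request = urllib2.Request(url)
response = urllib2.urlopen(request)
contents = response.read()
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(contents, "html.parser")
# Get the page title, used below as the output file name
title = soup.title.string
# Get the article body
result = soup.find(itemprop="articleBody")
# Empty every "headerlink" anchor so its marker character is left out of the text
for i in result.find_all(attrs={"class": "headerlink"}):
    i.clear()
# Write the extracted text to a file; "with" closes the file even if writing fails
path = '/home/ym/' + title
with open(path, "w") as f:
    f.write(result.get_text())
print "done!!"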
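
The script above targets Python 2: urllib2, the print statement, and the reload(sys) / sys.setdefaultencoding workaround are not available in Python 3. As a rough sketch only, the download and file-writing steps might look like this under Python 3. The helper names fetch_html, safe_filename, and save_text are hypothetical and not part of the original script. The sanitizing step and the ".txt" suffix are added assumptions, motivated by the fact that a page title can contain characters such as "/" that would break the output path.

```python
import os
import re
import urllib.request


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes (Python 3 stand-in for urllib2)."""
    request = urllib.request.Request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def safe_filename(title):
    """Replace whitespace and characters that are awkward in file names with '_'."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "_", title.strip())
    return cleaned.strip("_") or "untitled"


def save_text(directory, title, text):
    """Write text as UTF-8 to <directory>/<sanitized title>.txt and return the path."""
    path = os.path.join(directory, safe_filename(title) + ".txt")
    # An explicit encoding replaces the Python 2 setdefaultencoding workaround.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return path
```

With these helpers, the bytes returned by fetch_html(url) would be passed to BeautifulSoup as before, and save_text('/home/ym', title, result.get_text()) would take the place of the open/write/close sequence.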

The content written to the file looks like this:

And with that, this example is done. Let's review the steps we went through along the way:

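1. Use urllib2 to download the page source so that it can be analyzed afterwards.
2. Use BS4 to extract the content we want.
3. The content obtained in step 2 contains characters we do not want, so we use one of BS4's methods for modifying the document tree, clear(), to remove those special characters.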
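
The second and third steps above can be tried without a network connection on a small inline page. The sketch below uses Python 3 syntax and a made-up HTML snippet: SAMPLE_HTML and the function name extract_text are illustrative, and the "¶" marker is an assumption about what the headerlink anchors contain, as is typical of Sphinx-generated pages. It also contrasts clear(), which the script uses and which empties a tag but leaves the tag itself in the tree, with decompose(), which removes the tag entirely.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment shaped like the documentation page in the script.
SAMPLE_HTML = (
    '<html><head><title>Demo</title></head><body>'
    '<div itemprop="articleBody">'
    '<h1>Quick Start<a class="headerlink" href="#quick-start">¶</a></h1>'
    '<p>Hello</p>'
    '</div></body></html>'
)


def extract_text(html, remove_tag=False):
    """Return (article text, number of <a> tags left in the article body)."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find(itemprop="articleBody")
    # find_all returns a list-like result, so modifying the tree in the loop is safe.
    for link in body.find_all(attrs={"class": "headerlink"}):
        if remove_tag:
            link.decompose()  # drop the whole <a> tag from the tree
        else:
            link.clear()      # keep the <a> tag but empty its contents
    return body.get_text(), len(body.find_all("a"))
```

Either variant keeps the marker out of get_text(); the difference only matters if the modified tree is reused or printed as HTML later.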

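Along the way, we learned how to search the document tree with BS4, which plays a very important role in web crawlers. We also learned part of how BS4 modifies the document tree, such as removing content from it, and a little about output and error handling in BS4. What we used to build this example already covers roughly half of the BS4 documentation; the rest can be picked up gradually as the need arises.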
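
As a small illustration of the output and error-handling side, here is a sketch assuming Python 3 and the same itemprop="articleBody" container that the script searches for; the function name article_text is hypothetical. find() returns None when nothing matches, so checking for that avoids an AttributeError on pages with a different layout, and get_text() accepts a separator and a strip flag that control how the text pieces are joined.

```python
from bs4 import BeautifulSoup


def article_text(html):
    """Return the article body text, one text piece per line, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find(itemprop="articleBody")
    if body is None:
        # The page has no element with itemprop="articleBody".
        return None
    # Join text pieces with newlines and strip surrounding whitespace from each.
    return body.get_text("\n", strip=True)
```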
