1. Installing BeautifulSoup:
**1.1 Download the BeautifulSoup source package.
**1.2 Install
From the unpacked source directory, run:
python setup.py build
python setup.py install
To import the package, use:
import bs4
from bs4 import BeautifulSoup
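A quick way to confirm that the installation worked is to import the package and print its version. This is only a minimal sketch: `installed_version` is a name introduced here for illustration, and the version printed depends on your environment.

```python
# Minimal install check (a sketch, not part of the original post).
import bs4
from bs4 import BeautifulSoup

# bs4 exposes its version string as bs4.__version__
installed_version = bs4.__version__
print("bs4 version:", installed_version)

# Parsing a trivial document with the standard-library parser
# confirms that the import is usable end to end.
soup = BeautifulSoup("<p>ok</p>", "html.parser")
print(soup.p.string)
```

If the import fails with `ImportError`, the install step did not complete for the interpreter you are running.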
2. A first BeautifulSoup example:
#!/usr/bin/env python
#coding:utf-8
#FileName:be_learn01.py
#Function:first example of using BeautifulSoup
#History:25-10-2013
from bs4 import BeautifulSoup

def bea_Demo():
    demoHtml = """
<html>
<body>
<div class="icon_col">
<h1 class="h1user">Certtt</h1>
</div>
</body>
</html>
"""
    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(demoHtml, "html.parser")
    print("type(soup)=", type(soup))
    print("soup=", soup)
    # Find the first <h1> whose class attribute is "h1user"
    h1userSoup = soup.find(name="h1", attrs={"class": "h1user"})
    print("h1userSoup=", h1userSoup)
    # .string returns the tag's text when the tag has a single text child
    h1userUnicodeStr = h1userSoup.string
    print("h1userUnicodeStr=", h1userUnicodeStr)

if __name__ == '__main__':
    bea_Demo()
Output:
# python be_learn01.py
type(soup)= <class 'bs4.BeautifulSoup'>
soup=
<html>
<body>
<div class="icon_col">
<h1 class="h1user">Certtt</h1>
</div>
</body>
</html>
h1userSoup= <h1 class="h1user">Certtt</h1>
h1userUnicodeStr= Certtt
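One detail about `.string` is worth seeing in action: it returns text only when the tag has exactly one child. The sketch below uses a hypothetical variant of the demo markup (the `h1nested` heading is not in the original example) to show `.string` coming back as `None` while `get_text()` still collects the text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <h1> from the demo, plus one with a nested tag.
html = """
<div class="icon_col">
<h1 class="h1user">Certtt</h1>
<h1 class="h1nested">Cert<b>tt</b></h1>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ is shorthand for attrs={"class": ...}; both forms find the same kind of match
simple = soup.find("h1", class_="h1user")
nested = soup.find("h1", attrs={"class": "h1nested"})

simple_text = simple.string        # single text child, so the text is returned
nested_string = nested.string      # two children (text and <b>), so this is None
nested_text = nested.get_text()    # concatenates all text below the tag

print(simple_text)
print(nested_string)
print(nested_text)
```

When the markup is not under your control, `get_text()` is usually the safer choice for reading a tag's text.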
3. A test against a simple live page:
#!/usr/bin/env python
#coding:utf-8
#FileName:bea_learn02.py
#Function:print the links inside every <ul> of a live page
#History:25-10-2013
from urllib.request import urlopen
from bs4 import BeautifulSoup

def bea_Demo():
    url = 'http://home.51cto.com/index.php?s=/space/7743046'
    # Download the raw page
    page = urlopen(url).read()
    soup = BeautifulSoup(page, "html.parser")
    print("type(soup)=", type(soup))
    # Collect every <ul>, then every <a> inside each of them
    h1userSoup = soup.find_all(name="ul")
    for h in h1userSoup:
        res = h.find_all('a')
        for r in res:
            # r.string is None when the link has no single text child
            print("***:", r.string, "::", r, "\n")

if __name__ == '__main__':
    bea_Demo()
Output:
$ python bea_learn02.py
***: 家园 :: <a href="http://home.51cto.com" target="_blank">家园</a>
***: 学院 :: <a href="http://edu.51cto.com" target="_blank">学院</a>
***: 博客 :: <a href="http://blog.51cto.com" target="_blank">博客</a>
***: 论坛 :: <a href="http://bbs.51cto.com" target="_blank">论坛</a>
***: 下载 :: <a href="http://down.51cto.com" target="_blank">下载</a>
***: 自测 :: <a href="http://selftest.51cto.com" target="_blank">自测</a>
***: 门诊 :: <a href="http://doctor.51cto.com" target="_blank">门诊</a>
***: 周刊 :: <a href="http://blog.51cto.com/newsletter/" target="_blank">周刊</a>
***: 读书 :: <a href="http://book.51cto.com" target="_blank">读书</a>
***: 技术圈 :: <a href="http://g.51cto.com" target="_blank">技术圈</a>
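Because each result is a tag object, the link target and the label can be read separately instead of printing the whole tag. The following is a sketch under stated assumptions: it runs against a static snippet built from two links in the output above so that it needs no network access, the `<li>` wrappers and the link without an `href` are made up for illustration, and `extract_links` is a helper name introduced here.

```python
from bs4 import BeautifulSoup

# Static sample: two links copied from the output above, wrapped in an assumed <ul>/<li> layout.
SAMPLE = """
<ul>
<li><a href="http://home.51cto.com" target="_blank">家园</a></li>
<li><a href="http://edu.51cto.com" target="_blank">学院</a></li>
<li><a>no href here</a></li>
</ul>
"""

def extract_links(html):
    """Return (text, href) pairs for every <a> inside a <ul> that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for ul in soup.find_all("ul"):
        for a in ul.find_all("a"):
            href = a.get("href")   # .get() returns None instead of raising KeyError
            if href is not None:
                links.append((a.get_text(strip=True), href))
    return links

links = extract_links(SAMPLE)
for text, href in links:
    print(text, "->", href)
```

On pages with nested `<ul>` elements the same link can be reached through more than one list, so a real script may want to de-duplicate the pairs.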