Beautiful Soup 是一个处理Python HTML/XML的模块,功能相当强劲,最近仔细的看了一下他的帮助文档,终于看明白了一些。 准备好好研究一下,顺便将Beautiful Soup的一些用法整理一下,放到这个wiki上面,那个文档确实不咋地。
Beautiful Soup 中文教程的官方页面:http://www.crummy.com/software/BeautifulSoup/
BeautifulSoup 下载与安装
下载地址为:
http://www.crummy.com/software/BeautifulSoup/
安装其实很简单,BeautifulSoup只有一个文件,只要把这个文件拷到你的工作目录,就可以了。
from BeautifulSoup import BeautifulSoup # For processing HTML
from BeautifulSoup import BeautifulStoneSoup # For processing XML
import BeautifulSoup # To get everything
创建 BeautifulSoup 对象
BeautifulSoup对象需要一段html文本就可以创建了。
下面的代码就创建了一个BeautifulSoup对象:
from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>PythonClub.org</title></head>',
'<body><p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.',
'<p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.',
'</html>']
soup = BeautifulSoup(''.join(doc))
查找HTML内指定元素
BeautifulSoup可以直接用”.”访问指定HTML元素
根据html标签(tag)查找:查找html title
可以用 soup.html.head.title 得到title的name,和字符串值。
>>> soup.html.head.title
<title>PythonClub.org</title>
>>> soup.html.head.title.name
u'title'
>>> soup.html.head.title.string
u'PythonClub.org'
>>>
也可以直接通过soup.title直接定位到指定HTML元素:
>>> soup.title
<title>PythonClub.org</title>
>>>
根据html内容查找:查找包含特定字符串的整个标签内容
下面的例子给出了查找含有”para”的html tag内容:
>>> soup.findAll(text=re.compile("para"))
[u'This is paragraph ', u'This is paragraph ']
>>> soup.findAll(text=re.compile("para"))[0].parent
<p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>
>>> soup.findAll(text=re.compile("para"))[0].parent.contents
[u'This is paragraph ', <b>one</b>, u' of ptyhonclub.org.']
根据CSS属性查找HTML内容
soup.findAll(id=re.compile("para$"))
# [<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>,
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]
soup.findAll(attrs={'id' : re.compile("para$")})
# [<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>,
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]