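Installing Beautiful Soup 4 and Related Issues
At the time of writing, the latest version of Beautiful Soup is 4.1.1, which can be downloaded here (http://www.crummy.com/software/BeautifulSoup/bs4/download/)
Documentation: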
(http://www.crummy.com/software/BeautifulSoup/bs4/doc/)
Usage:
from bs4 import BeautifulSoup
Example:
HTML document:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
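Code:
from bs4 import BeautifulSoup
# Naming the parser explicitly avoids a warning in newer bs4 releases
soup = BeautifulSoup(html_doc, "html.parser")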
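From here you can start using the various features.
soup.X (X is any tag name; returns the first tag with that name in full, including its attributes and contents)
e.g. soup.title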
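As a minimal sketch of attribute-style access, the snippet below parses a shortened copy of the sample document and looks up a few tags by name. The shortened markup and the variable names are only illustrative, and the built-in "html.parser" is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Shortened version of the sample document, for illustration only
html_doc = """<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a> and
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>.</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

title_tag = soup.title   # first <title> tag in the document
first_link = soup.a      # first <a> tag, even though there are several
missing = soup.table     # no <table> in the document, so this is None

print(title_tag)           # <title>The Dormouse's story</title>
print(title_tag.name)      # title
print(title_tag.string)    # The Dormouse's story
print(first_link["href"])  # http://example.com/elsie
print(missing)             # None
```

Because soup.X only returns the first match, a lookup such as soup.a is best suited to tags that appear once, or to cases where the first occurrence is the one wanted.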
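Objects in BeautifulSoup
tag (corresponds to a tag in the HTML)
tag.attrs (returns all of the tag's attributes as a dictionary)
A tag's attributes can be added, removed, and modified directly, just like the entries of a dictionary: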
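# Setup assumed here, following the example in the Beautiful Soup documentation
tag = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser').b
tag.name = "blockquote"
tag['class'] = 'verybold'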
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>
del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>
tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None
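For reading attributes rather than changing them, a small sketch along these lines may help. The one-line markup is taken from the sample document, the variable names are illustrative, and "html.parser" is assumed as the parser.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>',
    "html.parser",
)
link = soup.a

attrs = link.attrs                         # all attributes, as a dictionary
print(attrs["href"])                       # http://example.com/elsie
print(link["id"])                          # link1
print(link["class"])                       # ['sister'] (class can hold several values, so it is a list)
print(link.get("title", "no title"))       # no title
print(link.has_attr("id"))                 # True
```

Using get() with a default avoids the KeyError shown above when an attribute may be absent.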
X.contents (X is a tag; returns the tag's direct children as a list)
e.g.
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>
head_tag.contents
# [<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# [u'The Dormouse's story']
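When a tag holds both text and nested tags, .contents returns them together in document order. The sketch below, using a shortened and purely illustrative paragraph, separates the two kinds of children; "html.parser" is assumed as the parser.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup(
    '<p class="story">Once <a id="link1">Elsie</a> and <a id="link2">Lacie</a>.</p>',
    "html.parser",
)
story = soup.p

children = story.contents  # list of direct children: strings and tags mixed
kinds = ["tag" if isinstance(child, Tag) else "text" for child in children]

# .children iterates over the same items without building a list
link_ids = [child["id"] for child in story.children if isinstance(child, Tag)]

print(len(children))  # 5
print(kinds)          # ['text', 'tag', 'text', 'tag', 'text']
print(link_ids)       # ['link1', 'link2']
```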
Fixing garbled text when parsing web pages:
# Beautiful Soup 3 API, running on Python 2
import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen('http://www.leeon.me')
soup = BeautifulSoup(page, fromEncoding="gb18030")

print soup.originalEncoding
print soup.prettify()
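If a Chinese page is encoded as gb2312 or gbk, passing fromEncoding="gb18030" to the BeautifulSoup constructor fixes the garbled output, because gb18030 is a superset of both encodings. In bs4 the argument is spelled from_encoding and the attribute is original_encoding. Even when the page is actually UTF-8, passing gb18030 generally does not produce garbled output, since Beautiful Soup falls back to other candidate encodings when the suggested one fails to decode the page.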
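A rough bs4 and Python 3 equivalent of the fix is sketched below. To keep it runnable offline, it parses a hypothetical in-memory page encoded as GBK instead of fetching a URL; the markup and variable names are made up for illustration, and "html.parser" is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Hypothetical page content, encoded as GBK to mimic a Chinese site
raw = (
    "<html><head><title>中文网页</title></head>"
    "<body><p>你好</p></body></html>"
).encode("gbk")

# gb18030 is a superset of gbk, so both codecs decode these bytes identically
same_text = raw.decode("gb18030") == raw.decode("gbk")

# bs4 spells the argument from_encoding (fromEncoding in Beautiful Soup 3)
soup = BeautifulSoup(raw, "html.parser", from_encoding="gb18030")

title_text = soup.title.string
print(same_text)               # True
print(soup.original_encoding)  # the encoding bs4 ended up using
print(title_text)              # 中文网页
```

For a live page, the bytes returned by urllib.request.urlopen(url).read() can be passed in place of raw.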