Using the Python BeautifulSoup module

weixin_39821718

Reposted from: http://www.pythonclub.org/modules/beautifulsoup/start

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

An article

------------------------------------

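Soup ingredients: the objects in a soup

Tag

A tag corresponds to an HTML element, that is, a pair of HTML tags together with everything they enclose (nested tags and text). For example: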

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')

tag = soup.b

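soup.b is a tag. soup itself can also be treated as a tag, because an HTML document is simply tags nested inside tags, layer by layer.

Name

The name is the name of the HTML tag (the first item inside the angle brackets). Every tag has a name, accessed through .name. In the example above: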

tag.name == u'b'

soup.name == u'[document]'

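Attributes

Attributes correspond to the attribute part of an HTML tag (the items inside the angle brackets that carry an equals sign). A tag can have many attributes or none. Attributes are accessed like dictionary entries, with the attribute name in square brackets. In the example above:

tag['class'] == ['boldest'] # in bs4, class is multi-valued, so the value is a list

The underlying dictionary is available directly through .attrs:

tag.attrs == {'class': ['boldest']}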
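As a minimal sketch of the dictionary-style access described above: the markup and the extra id attribute are made up for illustration, and html.parser is chosen only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Assumed sample markup, modelled on the one-tag example in this section.
soup = BeautifulSoup('<b class="boldest" id="intro">Extremely bold</b>', "html.parser")
tag = soup.b

print(tag.name)          # b
print(tag["id"])         # intro
print(tag["class"])      # ['boldest'] (class is multi-valued, so bs4 returns a list)
print(tag.get("title"))  # None: .get() avoids a KeyError for a missing attribute
print(tag.attrs)         # {'class': ['boldest'], 'id': 'intro'}
```

Using tag.get(name) instead of tag[name] is usually the safer choice when an attribute may be absent.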

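Text

Text corresponds to the text in the HTML (the part outside the angle brackets). Text is accessed through .text. In the example above:

tag.text == u'Extremely bold'

Difference between string and text: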
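The difference can be sketched with a short example. The markup here is an assumption chosen to show the three cases: .string returns the single string of a tag (looking through a lone child tag), returns None when a tag has several children, while .text concatenates all nested text.

```python
from bs4 import BeautifulSoup

# Assumed sample markup: <p> mixes text with a nested tag.
soup = BeautifulSoup("<p>Hello <b>world</b>!</p>", "html.parser")
p = soup.p

print(p.b.string)  # world
print(p.string)    # None, because <p> has three children
print(p.text)      # Hello world!

# When a tag's only child is another tag, .string is taken from that child.
nested = BeautifulSoup("<head><title>Story</title></head>", "html.parser")
print(nested.head.string)  # Story
```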

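Finding ingredients: searching the soup

An HTML document is usually parsed in order to locate the interesting parts and extract them. BeautifulSoup provides the find and find_all methods for this. find returns only the first matching tag, while find_all returns a list. Because searching is so common, BeautifulSoup offers some convenient shorthand:

tag.find_all("a") # equivalent to tag("a"); find_all is the 4.0 name of the method

tag.find("a") # equivalent to tag.a

When nothing matches, find_all returns an empty list and find returns None; neither raises an exception, so tag("a") and tag.a will not fail just because nothing was found. Because of Python's rules for identifiers, the tag.a form can only search by name, since only an identifier can follow the dot. The call forms tag() and tag.find() support all of the search methods below.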
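The shorthand forms and the not-found behaviour can be checked with a small sketch; the markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Assumed sample markup with two links and no images.
html = '<div><a href="/one">one</a><a href="/two">two</a></div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.div

links = div.find_all("a")   # same result as div("a")
first = div.find("a")       # same result as div.a

no_images = div.find_all("img")  # empty list, no exception
no_image = div.find("img")       # None, no exception

print(len(links), first["href"], no_images, no_image)
```

Note that chaining onto a missing tag, such as div.img.text, still fails, because the intermediate result is None.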

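A search can use several kinds of filter: strings, lists, key-value pairs (dictionaries), regular expressions, and functions.

String: a string matches the tag name, for example tag.a or tag("a")

List: a list of strings matches any tag whose name equals one of the strings, for example tag(["h2", "p"])

Key-value: the form tag(key=value) searches by tag attribute. Key-value searches have quite a few small tricks; here are some:

class

class is a reserved word in Python and cannot be used as an argument name, yet class=XXX is very common in HTML. BeautifulSoup's solution is to append an underscore and use class_ instead, as in tag(class_=XXX).

True

When the value is True, every tag that has the attribute matches, as in tag(href=True)

text

With text as the key, the search matches on the text inside tags, as in tag(text=something). Since Beautiful Soup 4.4.0 the same argument is also available under the name string.

Regular expression: for example tag(href=re.compile("elsie"))

Function: when none of the above is enough, a function is the last resort. Write a function that takes a single tag as its argument and pass it to find or find_all. For example: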

def fun(tag):
    # has_attr replaces the deprecated has_key
    return tag.has_attr("class") and not tag.has_attr("id")

tag(fun) # returns every tag that has a class attribute but no id attribute
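The filter kinds listed above can be tried side by side. This is a sketch under assumptions: the markup, the example.com URL, and the helper name has_class_but_no_id are illustrative, and the string= keyword requires Beautiful Soup 4.4.0 or later.

```python
import re
from bs4 import BeautifulSoup

# Assumed sample markup for illustration.
html = """
<h2 id="top">Title</h2>
<p class="story">Once upon a time <a href="http://example.com/elsie" id="link1">Elsie</a></p>
<p class="note">No link here</p>
"""
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("p")                        # string: match the tag name
by_list = soup.find_all(["h2", "p"])                # list: match any of the names
by_class = soup.find_all(class_="story")            # class_ stands in for class
with_href = soup.find_all(href=True)                # True: attribute merely present
by_text = soup.find_all(string="Elsie")             # match on text, returns strings
by_regex = soup.find_all(href=re.compile("elsie"))  # regular expression on a value


def has_class_but_no_id(tag):
    """Hypothetical filter: tags with a class attribute and no id attribute."""
    return tag.has_attr("class") and not tag.has_attr("id")


by_func = soup.find_all(has_class_but_no_id)

print(len(by_name), len(by_list), len(by_class), len(with_href), len(by_func))
```

A list must be passed as a single argument: a second positional string is interpreted as a CSS class, not as another tag name.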

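Another bowl: searching by document structure

HTML parses into a tree of tags, so you can also search by the relationships between tags in the tree.

Find ancestor nodes: find_parents() and find_parent()

Find following siblings: find_next_siblings() and find_next_sibling()

Find preceding siblings: find_previous_siblings() and find_previous_sibling()

The four sibling methods above only search siblings under the same parent.

Find descendant nodes: this is the job of find and find_all described earlier

Find following nodes (regardless of parent, child, or sibling relationships): find_all_next() and find_next()

Find preceding nodes (regardless of parent, child, or sibling relationships): find_all_previous() and find_previous()

All of these methods take the same arguments as find and can be combined freely.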
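A small sketch of the structural searches, starting from the middle item of a list. The markup is an assumption for illustration; it places a p element after the list so that the difference between sibling search and document-order search is visible.

```python
from bs4 import BeautifulSoup

# Assumed sample markup: three list items followed by a paragraph.
html = "<ul><li>a</li><li>b</li><li>c</li></ul><p>after</p>"
soup = BeautifulSoup(html, "html.parser")
b = soup.find_all("li")[1]  # the middle <li>

parent = b.find_parent("ul")
next_li = b.find_next_sibling("li")
prev_li = b.find_previous_sibling("li")

# <p> is not a sibling of <li>, so the sibling search finds nothing...
sibling_p = b.find_next_sibling("p")
# ...but find_next walks the rest of the document and does find it.
later_p = b.find_next("p")
earlier_li = b.find_previous("li")

print(parent.name, next_li.text, prev_li.text, sibling_p, later_p.text, earlier_li.text)
```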

Choosing soup by color: searching by CSS
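The source gives no example for this heading. As a hedged sketch, Beautiful Soup 4 exposes CSS selector search through select() and select_one(); the markup and selectors below are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Assumed sample markup for illustration.
html = (
    '<div id="menu">'
    '<p class="soup hot">tomato</p>'
    '<p class="soup">miso</p>'
    '<a href="/x">x</a>'
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

soups = soup.select("p.soup")          # every <p> carrying the class "soup"
hot = soup.select("p.hot")             # only those that are also "hot"
link = soup.select_one("#menu > a")    # first <a> directly inside #menu
with_href = soup.select("a[href]")     # <a> tags that have an href attribute

print([p.text for p in soups], [p.text for p in hot], link["href"])
```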

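A few tricks

BeautifulSoup supports several parsers, such as lxml, html5lib, and html.parser. For example: BeautifulSoup("", "html.parser")

BeautifulSoup converts the input to Unicode before parsing. Use from_encoding to specify the encoding, for example: BeautifulSoup(markup, from_encoding="iso-8859-8")

soup.prettify() outputs nicely indented HTML. For Chinese text you can pass an encoding so that it displays correctly, for example soup.prettify("gbk"), which returns the output as bytes in that encoding.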
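These tricks can be combined in a short sketch. The GBK-encoded input is an assumption for illustration; from_encoding only matters when the markup is passed as bytes.

```python
from bs4 import BeautifulSoup

# Assumed input: a small document stored as GBK-encoded bytes.
raw = "<p>汤</p>".encode("gbk")

# Name the parser explicitly and tell BeautifulSoup how the bytes are encoded.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")
print(soup.p.text)  # 汤

pretty_text = soup.prettify()        # str, indented one tag per line
pretty_bytes = soup.prettify("gbk")  # bytes, encoded as GBK
print(pretty_text)
```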

Reposted from: http://cndenis.iteye.com/blog/1746706

Two important attributes of a soup:

.contents and .children

A tag’s children are available in a list called .contents:

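head_tag = soup.head

head_tag

# <head><title>The Dormouse's story</title></head>

head_tag.contents

# [<title>The Dormouse's story</title>]

type(head_tag.contents[0])

# <class 'bs4.element.Tag'>

This shows that the items in .contents are not plain strings but the library's own types.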

title_tag = head_tag.contents[0]

title_tag

# <title>The Dormouse's story</title>

title_tag.contents

# [u'The Dormouse's story']

The BeautifulSoup object itself has children. In this case, the <html> tag is the child of the BeautifulSoup object:

len(soup.contents)

# 1

soup.contents[0].name

# u'html'

A string does not have .contents, because it can’t contain anything:

text = title_tag.contents[0]

text.contents

# AttributeError: 'NavigableString' object has no attribute 'contents'

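If a tag has more than one child, for example text followed by a nested tag, its .string is None. (When the only child is a single tag, bs4 returns that child's .string instead.)

# assumed markup: a <head> that holds both text and a <title> tag
soup = BeautifulSoup("<head>text<title>The Dormouse's story</title></head>", "html.parser")

head = soup.head

print(head.string)

The output is None, which demonstrates the point.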

Instead of getting them as a list, you can iterate over a tag’s children using the .children generator:

for child in title_tag.children:
    print(child)

# The Dormouse's story
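To put .contents and .children side by side, here is a minimal sketch on assumed markup. It also uses .descendants, which is not covered in the text above, to show that the direct children are only the first level of the tree.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Assumed sample markup for illustration.
soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head_tag = soup.head

contents = head_tag.contents        # a real list of direct children
children = list(head_tag.children)  # the same items, produced by an iterator

# Direct children can be tags or strings, both bs4 types.
first_child = contents[0]
inner_text = first_child.contents[0]

# .descendants keeps going below the first level.
descendants = list(head_tag.descendants)

print(type(first_child).__name__, type(inner_text).__name__, len(descendants))
```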

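A function that collects text recursively:

def gettextonly(self, soup):
    v = soup.string
    if v is None:
        # Several children: recurse into each and join the pieces with newlines
        c = soup.contents
        resulttext = ''
        for t in c:
            subtext = self.gettextonly(t)
            resulttext += subtext + '\n'
        return resulttext
    else:
        return v.strip()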

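A function that splits a string into words:

import re

def separatewords(self, text):
    # Split on every non-word character and drop the empty pieces
    splitter = re.compile('\\W')
    return [s.lower() for s in splitter.split(text) if s != '']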
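The two helpers above are written as methods of a class that is not shown. As a sketch, here are standalone versions with a usage example; the function names, the sample markup, and the use of `\W+` (which yields the same words after empty pieces are dropped) are assumptions.

```python
import re
from bs4 import BeautifulSoup


def get_text_only(node):
    """Recursively collect stripped text, one line per leaf string."""
    text = node.string
    if text is not None:
        return text.strip()
    # Several children: recurse into each and end every piece with a newline.
    return "".join(get_text_only(child) + "\n" for child in node.contents)


def separate_words(text):
    """Split on runs of non-word characters and lowercase the pieces."""
    splitter = re.compile(r"\W+")
    return [s.lower() for s in splitter.split(text) if s != ""]


# Assumed sample markup for illustration.
html = "<div><h1>Hello, Soup</h1><p>Tasty <b>and</b> warm</p></div>"
soup = BeautifulSoup(html, "html.parser")

text = get_text_only(soup)
words = separate_words(text)
print(words)  # ['hello', 'soup', 'tasty', 'and', 'warm']
```

For plain text extraction without custom joining, the built-in get_text() method of a tag may be enough.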
