What is BeautifulSoup in Python: a summary of using BeautifulSoup in Python

Latest recommended article published on 2021-09-27 16:52:34

臭熊的哥哥


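import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.baidu.com/')
soup = BeautifulSoup(r.text, 'html.parser') # parse HTML fetched over the network
soup = BeautifulSoup(open('index.html'), 'html.parser') # or parse a local file
print(soup.prettify()) # pretty-print the HTML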
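The two ways of building a soup shown above can be tried without network access. The following is a minimal sketch, assuming Beautiful Soup 4 is installed as `bs4`; the HTML snippet, the temporary `index.html` file, and the variable names are made up for illustration.

```python
import os
import tempfile

from bs4 import BeautifulSoup

html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# Parse markup held in a string, naming the parser explicitly.
soup_from_string = BeautifulSoup(html, "html.parser")

# Parse the same markup from a file; the with-block closes the file handle.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index.html")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    with open(path, encoding="utf-8") as f:
        soup_from_file = BeautifulSoup(f, "html.parser")

print(soup_from_string.title.string)
print(soup_from_file.p.string)
```

Naming the parser avoids the warning Beautiful Soup emits when it has to guess one, and keeps the result the same across machines with different parsers installed.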

Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object:


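soup.head
soup.a
# attribute access returns only the first tag with that name
soup.head.name # the tag name; here it is 'head'
soup.head.attrs # the tag's attributes, returned as a dictionary
soup.head['class'] # the value of the head tag's class attribute
soup.head['class'] = 'newclass' # set the head tag's class attribute to 'newclass'
del soup.head['class'] # delete the head tag's class attribute
soup.head.string # the text inside the tag, as a NavigableString (None if the tag has more than one child)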
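The tag operations above can be checked on a small document. This is a minimal sketch, assuming Beautiful Soup 4 with the built-in `html.parser`; the HTML snippet and variable names are invented for the example. Note that `class` is a multi-valued attribute, so reading it returns a list.

```python
from bs4 import BeautifulSoup

html = (
    '<html><head class="top"><title>Demo</title></head>'
    '<body><a href="/one">first</a><a href="/two">second</a></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a                 # only the first <a> tag
tag_name = soup.head.name           # 'head'
original_class = soup.head["class"] # ['top'], because class is multi-valued

soup.head["class"] = "newclass"     # overwrite the attribute
updated_class = soup.head["class"]  # 'newclass'

del soup.head["class"]              # remove the attribute entirely
attrs_after_delete = soup.head.attrs  # {}

title_text = soup.title.string      # 'Demo'
body_text = soup.body.string        # None: body has two children

print(first_link["href"], tag_name, original_class, updated_class)
print(attrs_after_delete, title_text, body_text)
```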

6. Traversal

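soup.body.contents[0] # the first child node of body; contents is a list
for child in soup.body.children:
    print(child.string) # like contents, children covers every direct child, but it is an iterator and must be looped over
for child in soup.body.descendants:
    print(child.string) # recursively visits every node at every level below the tag, in document order (depth-first)
for string in soup.body.strings:
    print(repr(string)) # strings yields every piece of text and must be looped over; stripped_strings is used the same way but strips leading and trailing whitespace from each string and skips whitespace-only strings
soup.body.parent # the parent node
for parent in soup.head.title.string.parents:
    print(parent.name) # walks up through the ancestors: title, head, html, [document]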
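The downward and upward traversal attributes can be compared side by side. A minimal sketch, assuming Beautiful Soup 4 with `html.parser`; the HTML snippet and variable names are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    "<html><head><title>Demo</title></head>"
    "<body><p>One <b>two</b></p><p>three</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Direct children only: the two <p> tags.
child_names = [child.name for child in soup.body.children]

# All descendants in document order; keep only the tags here,
# because descendants also yields the text nodes.
descendant_tags = [
    node.name for node in soup.body.descendants if isinstance(node, Tag)
]

# Every piece of text below <body>, raw and stripped.
raw_strings = list(soup.body.strings)
clean_strings = list(soup.body.stripped_strings)

# Ancestors of the title text, nearest first.
parent_names = [parent.name for parent in soup.title.string.parents]

print(child_names)      # ['p', 'p']
print(descendant_tags)  # ['p', 'b', 'p']
print(raw_strings)      # ['One ', 'two', 'three']
print(clean_strings)    # ['One', 'two', 'three']
print(parent_names)     # ['title', 'head', 'html', '[document]']
```

The last ancestor is the `BeautifulSoup` object itself, whose name is `[document]`.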

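.next_sibling # the next sibling node
.previous_sibling # the previous sibling node
.next_siblings # iterates forward over all following siblings
.previous_siblings # iterates backward over all preceding siblings
.next_element # the next node in parse order, regardless of level
.previous_element # the previous node in parse order, regardless of level
.next_elements # iterates forward over all following nodes, regardless of level
.previous_elements # iterates backward over all preceding nodes, regardless of level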
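The difference between the sibling attributes (same level only) and the element attributes (parse order, any level) is easier to see on a concrete tree. A minimal sketch, assuming Beautiful Soup 4 with `html.parser`; the markup and `id` values are made up. The markup deliberately has no whitespace between tags, since in real documents such whitespace shows up as text-node siblings.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<html><body><p id="a">one</p><p id="b">two</p>'
    '<p id="c">three</p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")
first = soup.find(id="a")
middle = soup.find(id="b")
last = soup.find(id="c")

# Siblings stay on the same level of the tree.
next_id = middle.next_sibling["id"]          # 'c'
previous_id = middle.previous_sibling["id"]  # 'a'
following_ids = [tag["id"] for tag in first.next_siblings]     # ['b', 'c']
preceding_ids = [tag["id"] for tag in last.previous_siblings]  # ['b', 'a']

# Elements follow parse order, so they step into and out of tags.
next_node = first.next_element           # the text 'one' inside the first <p>
previous_node = middle.previous_element  # also the text 'one'
later_tag_ids = [
    node["id"] for node in first.next_elements if isinstance(node, Tag)
]

print(next_id, previous_id, following_ids, preceding_ids)
print(repr(next_node), repr(previous_node), later_tag_ids)
```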

7. Searching for tags

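find_all(name, attrs, recursive, text, limit, **kwargs)
# Examples
import re # needed for the regular-expression filters
# (1) The name parameter
soup.find_all('a') # find all a tags
soup.body.div.find_all('a') # find all a tags inside the first div under body
for tag in soup.find_all(re.compile('^b')):
    print(tag.name) # regular expression: all tags whose names start with b
soup.find_all(['a','b']) # list: returns all a tags and all b tags
soup.find_all(True) # True matches every tag, so all tags are returned
# A custom filter function selects the tags that satisfy it
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
soup.find_all(has_class_but_no_id) # returns all tags that have a class attribute but no id attribute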
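The five kinds of name filter (string, regular expression, list, `True`, function) can be run against one small document. A minimal sketch, assuming Beautiful Soup 4 with `html.parser`; the HTML snippet is invented for the example.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<html><body><div>'
    '<a class="sister" id="link1" href="/elsie">Elsie</a>'
    '<a class="sister" href="/lacie">Lacie</a>'
    '</div><b>bold</b></body></html>'
)
soup = BeautifulSoup(html, "html.parser")


def has_class_but_no_id(tag):
    """Custom filter: keep tags that have a class attribute but no id."""
    return tag.has_attr("class") and not tag.has_attr("id")


by_string = [t["href"] for t in soup.find_all("a")]
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]
by_list = [t.name for t in soup.find_all(["a", "b"])]
every_tag = [t.name for t in soup.find_all(True)]
by_function = [t["href"] for t in soup.find_all(has_class_but_no_id)]
scoped = soup.body.div.find_all("a")  # search only inside the first div

print(by_string)    # ['/elsie', '/lacie']
print(by_regex)     # ['body', 'b']
print(by_list)      # ['a', 'a', 'b']
print(every_tag)    # ['html', 'body', 'div', 'a', 'a', 'b']
print(by_function)  # ['/lacie']
print(len(scoped))  # 2
```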

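# (2) Keyword arguments and the attrs parameter: searching by tag attribute
soup.find_all(id='nd2') # returns all tags whose id attribute equals nd2
soup.find_all(href=re.compile("elsie"), id='link1') # several conditions at once; regular expressions are allowed
soup.find_all("a", class_="sister") # class is a Python keyword, so class='sister' is a syntax error; append an underscore and write class_='sister'
soup.find_all(attrs={"data-foo": "value"})
# HTML5 attributes such as data-foo cannot be written as soup.find_all(data-foo='value'), because Python keyword-argument names cannot contain a hyphen. Pass them as a dictionary through the attrs parameter instead; any attribute search can be written this way.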
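Attribute searches can be sketched the same way. This example assumes Beautiful Soup 4 with `html.parser`; the markup, the `example.com` URLs, and the attribute values are placeholders chosen for illustration.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<div>'
    '<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>'
    '<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>'
    '<span data-foo="value" id="nd2">x</span>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# Keyword argument: match on a single attribute.
by_id = [t.name for t in soup.find_all(id="nd2")]

# Several attributes at once; each one can be a string or a regex.
combined = [t["id"] for t in soup.find_all(href=re.compile("elsie"), id="link1")]

# class is a Python keyword, so the keyword argument is spelled class_.
by_class = [t["id"] for t in soup.find_all("a", class_="sister")]

# Attribute names that are not valid Python identifiers go in attrs.
by_data_attr = [t.name for t in soup.find_all(attrs={"data-foo": "value"})]

# The attrs dictionary works for ordinary attributes too.
by_class_dict = [t["id"] for t in soup.find_all("a", attrs={"class": "sister"})]

print(by_id, combined, by_class, by_data_attr, by_class_dict)
```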

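# (3) The text parameter
soup.find_all(text="Tillie") # finds strings in the document equal to "Tillie"; like name, it accepts lists, regular expressions, and so on (Beautiful Soup 4.4.0 and later call this parameter string)
# (4) The limit parameter
soup.find_all('a', limit=2) # returns the first two a tags in the document; saves resources on large documents
# (5) The recursive parameter
soup.head.find_all("title", recursive=False)
# Searches only the direct children of head. The default, recursive=True, searches all descendants.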
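The text, limit, and recursive parameters can be exercised together. A minimal sketch, assuming Beautiful Soup 4.4.0 or later, where the text filter is spelled `string` (the older `text` spelling is the one used in the listing above); the HTML snippet is made up.

```python
import re

from bs4 import BeautifulSoup

html = (
    "<html><head><title>Story</title></head>"
    "<body><a>Elsie</a><a>Lacie</a><a>Tillie</a></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Searching by text returns the matching strings, not the tags.
exact = soup.find_all(string="Tillie")
by_regex = soup.find_all(string=re.compile("ie$"))
by_list = soup.find_all(string=["Elsie", "Tillie"])

# limit stops the search after the given number of matches.
first_two = [a.string for a in soup.find_all("a", limit=2)]

# recursive=False looks at direct children only.
under_html = soup.html.find_all("title", recursive=False)  # title is a grandchild
under_head = soup.head.find_all("title", recursive=False)  # title is a child

print(exact)      # ['Tillie']
print(by_regex)   # ['Elsie', 'Lacie', 'Tillie']
print(by_list)    # ['Elsie', 'Tillie']
print(first_two)  # ['Elsie', 'Lacie']
print(len(under_html), len(under_head))  # 0 1
```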

find(name, attrs, recursive, text, **kwargs)
# Used exactly like find_all, except that find returns only the first match (or None if nothing matches), whereas find_all returns a list that has to be iterated.
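The practical consequence of the different return types is that a missing tag gives `None` from `find` but an empty list from `find_all`. A minimal sketch, assuming Beautiful Soup 4 with `html.parser`; the markup and variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = '<body><a href="/one">first</a><a href="/two">second</a></body>'
soup = BeautifulSoup(html, "html.parser")

first_match = soup.find("a")       # a single Tag
all_matches = soup.find_all("a")   # a list of Tags

missing_one = soup.find("table")       # None when nothing matches
missing_all = soup.find_all("table")   # empty list when nothing matches

# Guard against None before touching attributes of a find() result.
table_id = missing_one.get("id") if missing_one is not None else None

print(first_match["href"])               # /one
print([a["href"] for a in all_matches])  # ['/one', '/two']
print(missing_one, missing_all, table_id)
```

Skipping the `None` check and writing `soup.find("table")["id"]` would raise a `TypeError` on this document.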

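# The following methods take the same arguments as find_all(); only the differences are listed below.
find_parents() find_parent()
# find_all() and find() search only the current node's descendants (children, grandchildren, and so on). find_parents() and find_parent() search the current node's ancestors instead, using the same matching rules as an ordinary tag search.
find_next_siblings() find_next_sibling()
# These two methods iterate over the siblings that come after the current tag, using .next_siblings. find_next_siblings() returns all matching later siblings; find_next_sibling() returns only the first matching one.
find_previous_siblings() find_previous_sibling()
# These two methods iterate over the siblings that come before the current tag, using .previous_siblings. find_previous_siblings() returns all matching earlier siblings; find_previous_sibling() returns the first matching one.
find_all_next() find_next()
# These two methods iterate over the tags and strings that come after the current tag, using .next_elements. find_all_next() returns all matching nodes; find_next() returns the first matching one.
find_all_previous() find_previous()
# These two methods iterate over the tags and strings that come before the current node, using .previous_elements. find_all_previous() returns all matching nodes; find_previous() returns the first matching one.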
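The directional search methods can be compared on one small tree. A minimal sketch, assuming Beautiful Soup 4.4.0 or later with `html.parser`; the markup and `id` values are invented for the example.

```python
from bs4 import BeautifulSoup

html = (
    '<html><body><div id="box">'
    '<p id="p1">one</p><p id="p2">two</p><p id="p3">three</p>'
    '</div></body></html>'
)
soup = BeautifulSoup(html, "html.parser")
first = soup.find(id="p1")
middle = soup.find(id="p2")
last = soup.find(id="p3")

# Upward: ancestors, nearest first.
nearest_div = middle.find_parent("div")["id"]
ancestors = [t.name for t in middle.find_parents(["div", "body"])]

# Sideways: siblings after and before the current tag.
next_sibling_id = middle.find_next_sibling("p")["id"]
later_siblings = [t["id"] for t in first.find_next_siblings("p")]
previous_sibling_id = last.find_previous_sibling("p")["id"]
earlier_siblings = [t["id"] for t in last.find_previous_siblings("p")]

# Parse order: everything after or before, tags and strings alike.
next_p = first.find_next("p")["id"]
later_strings = first.find_all_next(string=True)
previous_p = last.find_previous("p")["id"]
earlier_ps = [t["id"] for t in last.find_all_previous("p")]

print(nearest_div, ancestors)
print(next_sibling_id, later_siblings, previous_sibling_id, earlier_siblings)
print(next_p, later_strings, previous_p, earlier_ps)
```

`find_all_next(string=True)` starts with the current tag's own text, because that text is parsed after the tag's opening.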
