BeautifulSoup can parse the HTML that the Requests library fetches.
I. Basic usage of BeautifulSoup
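import requests
from bs4 import BeautifulSoup as bs


def get_page(url):
    """Fetch a page and return its HTML text, or None if the request fails."""
    header = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36'}
    try:
        r = requests.get(url, headers=header, timeout=10)
        r.raise_for_status()
        return r.text
    except requests.RequestException as e:
        # Catch only request errors instead of using a bare except
        print("Request failed:", e)
        return None


# Create the BeautifulSoup object (fall back to an empty document if the request failed)
html = get_page("https://www.baidu.com")
soup = bs(html or "", "html.parser")
# soup.prettify() returns the HTML re-indented for readability
print(soup.prettify())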
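The same parsing step can be tried without any network access. The following is a minimal sketch that assumes a small hand-written HTML string (SAMPLE_HTML is a made-up stand-in, not the real Baidu page) and reads two values from it.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, loosely modelled on a navigation bar.
SAMPLE_HTML = (
    '<html><head><title>Demo</title></head><body>'
    '<div id="u"><a class="mnav" href="http://news.baidu.com" name="tj_trnews">News</a></div>'
    '</body></html>'
)

# Parse the string with Python's built-in HTML parser.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

title_text = soup.title.string   # text inside <title>
first_link = soup.a["href"]      # href attribute of the first <a>

print(title_text)
print(first_link)
```

Working from a fixed string like this makes it easier to check what each attribute returns before pointing the code at a live page.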
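II. Tag attributes in BeautifulSoup
- soup.tag: the first tag with that name in the document; tag can be any HTML tag name, such as div, a, or p
print(soup.a, type(soup.a)) # output: <a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻</a> <class 'bs4.element.Tag'>
- soup.tag.name: the name of the tag
print(soup.a.name, type(soup.a.name)) # output: a <class 'str'>
- soup.tag.parent.name: the name of the tag's parent
print(soup.a.parent.name, type(soup.a.parent.name)) # output: div <class 'str'>
- soup.tag.attrs: the tag's attributes, returned as a dict
print(soup.a.attrs, type(soup.a.attrs)) # output: {'href': 'http://news.baidu.com', 'name': 'tj_trnews', 'class': ['mnav']} <class 'dict'>
- soup.tag.string: the string inside the tag; it is None when the tag has more than one child
print(soup.a.string, type(soup.a.string)) # output: 新闻 <class 'bs4.element.NavigableString'>
- soup.tag.comment: not a special attribute. BeautifulSoup reads it as a search for a child tag named comment, so it returns None here. An HTML comment shows up instead as a bs4.element.Comment object (a subclass of NavigableString), for example through .string.
print(soup.a.comment, type(soup.a.comment)) # output: None <class 'NoneType'>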
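The attributes listed above can be checked on a small fixed document. This is a sketch under the assumption of a hand-written snippet (the markup and the comment text are invented for illustration); it also shows how an HTML comment is actually reached.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup: one link and one paragraph holding only a comment.
html = (
    '<div><a class="mnav" href="http://news.baidu.com" name="tj_trnews">News</a>'
    '<p><!-- a note --></p></div>'
)
soup = BeautifulSoup(html, "html.parser")
a = soup.a

print(a.name)          # tag name
print(a.parent.name)   # name of the enclosing tag
print(a.attrs)         # all attributes as a dict; class is a list
print(a["href"])       # a single attribute by key
print(a.get("id"))     # .get() returns None for a missing attribute

# A comment is a NavigableString subclass, reached like any other string.
note = soup.p.string
print(type(note).__name__, repr(str(note)))

# soup.a.comment looks for a child tag called <comment>, which does not exist.
print(a.comment)
```

Indexing with a["href"] raises KeyError when the attribute is absent, so a.get() is the safer choice for optional attributes.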
III. Traversing tags in BeautifulSoup
(1) Downward traversal
- soup.tag.contents: the tag's direct children (tags and strings), returned as a list
print(soup.div.contents) # output: <div id="head"><div class="s-top-wrap s-isindex-wrap" id="s_top_wrap"><div class="s-top-nav"></div><div class="s-center-box"></div></div><div id="u"><a class="toindex" href="/">百度首页</a><a class="pf" href="javascript:;" name="tj_settingicon">设置<i class="c-icon c-icon-triangle-down"></i></a><a class="lb" href="https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F&sms=5" name="tj_login" onclick="return false;">登录</a><div class="bdpfmenu"></div></div>...
- soup.tag.children: an iterator over the tag's direct children
for tag in soup.div.children:
    print(tag.name)
# output (one name per line; None marks a text node): script div None div None div div
- soup.tag.descendants: an iterator over all of the tag's descendants
tagname = []
for tag in soup.div.descendants:
    tagname.append(tag.name)
print(tagname)
# output: ['script', None, 'div', 'div', 'div', 'div', 'div', 'a', None, 'a', None, 'i', 'a', None, 'div', 'div', 'a', None, 'a', None, 'a', None, 'a', None, 'a', None, 'a', None, 'div', 'a', None, 'div', 'div', 'a', 'img', 'div', None, 'a', 'img', 'div', None, 'a', 'img', 'div', None, 'a', 'img', 'div', None, 'div', 'a', 'img', 'div', None, 'a', 'img', 'div', None, 'a', 'img', 'div', None, 'a', 'img', 'div', None, 'div', 'a', None, 'div', 'a', None, 'span', None, 'a', None, 'div', 'div', 'a', None, 'a', None, 'div', 'div', 'div', 'style', None, 'div', 'img', 'img', 'map', 'area', 'a', 'img', 'img', 'form', 'input', 'input', 'input', 'input', 'input', 'input', 'input', 'span', 'input', 'span', 'input', 'span', 'span', 'div', 'span', None, 'ul', 'li', 'a', None, 'li', 'a', None, 'li', 'li', 'a', None, 'input', 'input', 'input', 'input', 'input', 'input', 'div', 'div', 'div', 'div', 'a', 'div', None, 'a', 'i', None, 'span', None, 'ul', 'li', 'a', 'span', None, 'span', None, 'span', 'li', 'a', 'span', None, 'span', None, 'span', 'li', 'a', 'span', None, 'span', None, 'span', 'li', 'a', 'span', None, 'span', None, 'span', 'li', 'a', 'span', None, 'span', None, 'span', 'li', 'a', 'span', None, 'span', None, 'span', 'textarea', None, None, 'div', 'div', None, 'div', 'div', 'p', 'a', None, 'p', 'a', None, 'p', 'a', None, 'p', 'a', None, 'p', 'a', None, 'p', 'a', None, 'p', 'a', None, 'div', 'span', None, 'span', None, 'a', 'span', None, 'span', None, None, 'div', None, 'div', None, 'b', None, None, 'a', None, None, 'a', None, None, 'a', None, None, 'a', None, None, 'a', None, None, 'a', None, None, 'a', None, None, 'a', None, None, None, None, 'div', 'div', 'div', 'img', 'img', 'div', 'div', 'div', 'i', None, None, 'div', None, 'div', 'div']
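The difference between the three downward attributes is easier to see on a tiny tree. The sketch below assumes a made-up snippet with no whitespace between tags, so the results contain no stray text nodes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a <div> with two direct children, one of them nested.
html = '<div><p>Hello <b>world</b></p><span>!</span></div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# .contents is a list of direct children only.
direct = div.contents
print(len(direct))

# .children yields the same nodes lazily.
child_names = [node.name for node in div.children]
print(child_names)

# .descendants walks the whole subtree depth-first; strings have name None.
descendant_names = [node.name for node in div.descendants]
print(descendant_names)
```

Text nodes appear in .descendants alongside tags, which is why the long output above is full of None entries.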
(2) Sibling traversal
- soup.tag.next_sibling: the next node at the same level as tag (this can be a text node rather than a tag)
print(soup.a.next_sibling) # output: <a class="pf" href="javascript:;" name="tj_settingicon">设置<i class="c-icon c-icon-triangle-down"></i></a>
- soup.tag.next_siblings: an iterator over all nodes that follow tag at the same level
tagname = []
for tag in soup.a.next_siblings:
    tagname.append(tag.name)
print(tagname)
# output: ['a', 'a', 'div']
- soup.tag.previous_sibling: likewise, the previous node at the same level as tag
- soup.tag.previous_siblings: likewise, an iterator over all nodes that precede tag at the same level
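The sibling attributes, including the two previous_* ones that have no example above, can be sketched as follows. The markup is hypothetical and deliberately contains one space between the first two links to show that a sibling may be a text node.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: note the single space after the first <a>.
html = '<div><a id="first">one</a> <a id="second">two</a><a id="third">three</a></div>'
soup = BeautifulSoup(html, "html.parser")
first = soup.find("a", id="first")
third = soup.find("a", id="third")

# The immediate sibling of the first link is the whitespace string, not a tag.
print(repr(first.next_sibling))

# Names of everything after the first link; None marks the text node.
next_names = [node.name for node in first.next_siblings]
print(next_names)

# Keep only real tags when walking backwards from the last link.
previous_ids = [node["id"] for node in third.previous_siblings if isinstance(node, Tag)]
print(previous_ids)
```

Filtering with isinstance(node, Tag) is one simple way to ignore whitespace between tags on real pages.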
(3) Upward traversal
- soup.tag.parent: the parent of tag
print(soup.a.parent.name) # output: div
- soup.tag.parents: an iterator over all ancestors of tag, from the nearest up to the document itself
tagname = []
for tag in soup.a.parents:
    tagname.append(tag.name)
print(tagname)
# output: ['div', 'div', 'div', 'body', 'html', '[document]']
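A short sketch of upward traversal on a hypothetical nested snippet, showing that the chain of parents ends at the BeautifulSoup object itself, whose name is '[document]':

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a link nested four levels deep.
html = '<html><body><div><p><a>link</a></p></div></body></html>'
soup = BeautifulSoup(html, "html.parser")

# .parent is the immediately enclosing tag.
parent_name = soup.a.parent.name
print(parent_name)

# .parents walks outward until it reaches the document object.
ancestor_names = [node.name for node in soup.a.parents]
print(ancestor_names)

# The document object has no parent.
print(soup.parent)
```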
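IV. Searching for tags in BeautifulSoup
(1) Downward search
- tag.find_all(name, attrs, recursive, string): searches inside tag for the tags that match name and returns a list (a bs4 ResultSet) of every match. tag(name, attrs, recursive, string) is shorthand for the same call.
  - name: the tag name to search for; pass a list to search for several names at once
  - attrs: the attributes to match, given as a dict
  - recursive: True (the default) searches all descendants; False searches direct children only
  - string: the text to search for; used on its own, it returns the matching strings rather than tags
- tag.find(): takes the same arguments but returns only the first match, or None if nothing matches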
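Each find_all parameter described above can be tried on a small document. The following sketch assumes an invented navigation snippet; the class names and paths are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two direct links, one nested link, and a <b> tag.
html = (
    '<div id="nav">'
    '<a class="mnav" href="/news">News</a>'
    '<a class="mnav" href="/map">Map</a>'
    '<p><a class="pf" href="/settings">Settings</a></p>'
    '<b>News</b>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
nav = soup.div

all_links = nav.find_all("a")                          # name: every <a> descendant
links_and_bold = nav.find_all(["a", "b"])              # name as a list
mnav_links = nav.find_all("a", attrs={"class": "mnav"})  # attrs filter
direct_links = nav.find_all("a", recursive=False)      # direct children only
news_strings = nav.find_all(string="News")             # matching strings, not tags
shorthand = nav("a")                                   # same as nav.find_all("a")

first_link = nav.find("a")                             # first match only
missing = nav.find("table")                            # None when nothing matches

print(len(all_links), len(links_and_bold), len(mnav_links), len(direct_links))
print(news_strings, first_link["href"], missing)
```

Because find() returns None on a miss, it is worth checking the result before indexing into it.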
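(2) Sibling search
- tag.find_next_sibling(): searches the siblings that follow tag and returns the first match as a Tag object (not a string), or None
- tag.find_next_siblings(): searches the siblings that follow tag and returns a list of all matches
- tag.find_previous_sibling(): searches the siblings that precede tag and returns the first match as a Tag object (not a string), or None
- tag.find_previous_siblings(): searches the siblings that precede tag and returns a list of all matches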
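The four sibling search methods can be sketched on a hypothetical snippet as below. A tag name is passed in every call, and the return types are checked to show that single-result methods give back a Tag or None.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: a heading, a bare string, and several sibling tags.
html = '<div><h2>Title</h2>intro<p class="x">one</p><span>gap</span><p>two</p></div>'
soup = BeautifulSoup(html, "html.parser")
h2 = soup.h2
last_p = soup.find_all("p")[-1]

# First following sibling that is a <p>; the text node "intro" is skipped.
next_p = h2.find_next_sibling("p")
print(type(next_p).__name__, next_p.get_text())

# Every following <p> sibling, in document order.
following = [p.get_text() for p in h2.find_next_siblings("p")]
print(following)

# Searching backwards returns matches nearest-first.
prev_p = last_p.find_previous_sibling("p")
prev_names = [t.name for t in last_p.find_previous_siblings(["p", "h2"])]
print(prev_p.get_text(), prev_names)

# No match gives None rather than an empty string.
print(h2.find_next_sibling("table"))
```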
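(3) Upward search
- tag.find_parent(): searches the ancestors of tag and returns the nearest match as a Tag object (not a string), or None
- tag.find_parents(): searches the ancestors of tag and returns a list of all matches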
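A minimal sketch of upward search, assuming a made-up document with two nested div elements whose id values are chosen only for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a link inside two nested <div> elements.
html = (
    '<html><body><div id="outer"><div id="inner">'
    '<p><a href="#">link</a></p>'
    '</div></div></body></html>'
)
soup = BeautifulSoup(html, "html.parser")
a = soup.a

# Nearest enclosing <div>.
nearest_div = a.find_parent("div")
print(nearest_div["id"])

# All enclosing <div> elements, nearest first.
div_ids = [d["id"] for d in a.find_parents("div")]
print(div_ids)

# Attribute filters work here as well.
outer = a.find_parent("div", id="outer")
print(outer["id"])

# None when no ancestor matches.
print(a.find_parent("table"))
```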