The BeautifulSoup library (bs4) is a library for parsing, traversing, and maintaining the tag tree of an HTML or XML document.
- BeautifulSoup test run
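import requests
from bs4 import BeautifulSoup
url = "http://python123.io/ws/demo.html"
r = requests.get(url)
r.raise_for_status()  # stop early if the request failed
demo = r.text
soup = BeautifulSoup(demo, "html.parser")  # parse demo with Python's built-in HTML parser
print(soup.prettify())  # print the tag tree, one tag per line, indented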
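The same test can be run without network access. The sketch below is only an illustration: the inline `html` string is a hypothetical stand-in for the downloaded page, not the content of the demo URL.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document standing in for the downloaded page.
html = (
    "<html><head><title>Demo page</title></head>"
    "<body><p class='title'><b>Hello</b></p></body></html>"
)

soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string   # text inside <title>
first_p_class = soup.p["class"]  # class is multi-valued, so bs4 returns a list

print(title_text)
print(first_p_class)
```

If both values print as expected, the parser is installed and working.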
- BeautifulSoup attributes
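A minimal sketch of the attributes most often read from a tag: `name`, `attrs`, and `string`. The inline markup is a hypothetical example, and the selection of attributes is an assumption about what this heading covers.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
html = '<p class="title" id="intro"><b>Hello</b></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p              # first <p> tag in the document
tag_name = tag.name       # the tag's name as a string
tag_attrs = tag.attrs     # dict of attributes; class is stored as a list
tag_text = tag.string     # the single string nested inside the tag

print(tag_name, tag_attrs, tag_text)
```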
BeautifulSoup traversal
- BeautifulSoup downward traversal
.contents returns a list
.children and .descendants return iterators, so they are normally consumed in a for loop
# Downward traversal of the tag tree
# Child nodes
for child in soup.body.children:
    print(child)
# All descendant nodes (children, grandchildren, and so on)
for child in soup.body.descendants:
    print(child)
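The difference between the three downward attributes can be checked on a small document. This is a sketch on hypothetical inline markup; it also shows that descendants include text nodes, not only tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: <body> has two <p> children, one containing a <b>.
html = "<body><p><b>one</b></p><p>two</p></body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

direct = body.contents                  # a real list, so len() and indexing work
n_children = len(list(body.children))   # iterator over the same direct children
all_nodes = list(body.descendants)      # every nested node, including strings
descendant_tags = [n.name for n in all_nodes if isinstance(n, Tag)]

print(len(direct), n_children, len(all_nodes), descendant_tags)
```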
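- BeautifulSoup upward traversal
# Upward traversal of the tag tree
# Parent node
soup.title.parent
# All ancestor nodes
for parent in soup.a.parents:
    # The last ancestor is the BeautifulSoup object itself, named '[document]'
    if parent is None:
        print(parent)
    else:
        print(parent.name)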
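As a quick check of what upward traversal yields, the sketch below collects the ancestor names of a link in a hypothetical inline document.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <a> tag sits inside <p>, <body>, and <html>.
html = "<html><body><p><a href='x'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

parent_name = soup.a.parent.name                    # immediate parent only
ancestor_names = [p.name for p in soup.a.parents]   # nearest ancestor first

print(parent_name)
print(ancestor_names)
```

The final entry is the BeautifulSoup object, which acts as the root of the tree.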
- BeautifulSoup sibling (parallel) traversal
Condition for sibling traversal: it only takes place among nodes that share the same parent.
# Sibling traversal of the tag tree
soup.a.next_sibling
for sibling in soup.a.next_siblings:
    print(sibling)
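Siblings are not necessarily tags: text between tags is a sibling too. The sketch below, on hypothetical inline markup, shows this and filters the siblings down to tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: two links separated by plain text inside one <p>.
html = "<p>Start <a id='first'>A</a> and <a id='second'>B</a>.</p>"
soup = BeautifulSoup(html, "html.parser")
first = soup.a

next_node = first.next_sibling        # the text " and ", not the second link
prev_node = first.previous_sibling    # the text "Start "
following = list(first.next_siblings)
tag_siblings = [s for s in following if isinstance(s, Tag)]
next_link = first.find_next_sibling("a")  # skips text nodes

print(repr(next_node), len(following), tag_siblings, next_link)
```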
HTML content search methods in the bs4 library
find_all
Extended find methods
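A minimal sketch of common `find_all` filters and of `find`, which returns only the first match. The markup, ids, and URLs are hypothetical. The `string=` keyword assumes bs4 4.4.0 or later (older releases call it `text=`).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
html = (
    '<p class="title"><b>Demo</b></p>'
    '<p class="course">'
    '<a href="http://example.com/a" id="link1">Basic</a>'
    '<a href="http://example.com/b" id="link2">Advanced</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")                       # filter by tag name
by_list = soup.find_all(["a", "b"])                  # any of several names
by_class = soup.find_all("p", class_="course")       # name plus CSS class
by_id = soup.find_all(id=re.compile("^link"))        # attribute matched by regex
by_text = soup.find_all(string=re.compile("Basic"))  # match text nodes
first_link = soup.find("a")                          # first match, or None

print(len(all_links), [t.name for t in by_list], first_link["id"])
```

The other extended methods (`find_parent`, `find_parents`, `find_next_sibling`, `find_next_siblings`, `find_previous_sibling`, `find_previous_siblings`) take the same filters but search upward or sideways instead of downward.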
- Extract all links
# Extract all links
for link in soup.find_all('a'):
    print(link.get("href"))
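Pages often contain `<a>` tags without an `href`, as well as relative links. One possible refinement, sketched here with hypothetical markup and an assumed base URL, skips the former and resolves the latter with `urllib.parse.urljoin`.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Assumed address of the page the markup came from.
base_url = "http://python123.io/ws/demo.html"
# Hypothetical markup: a relative link, an anchor with no href, an absolute link.
html = (
    '<a href="/ws/a.html">A</a>'
    '<a>no href</a>'
    '<a href="http://example.com/">B</a>'
)
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute.
links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

for link in links:
    print(link)
```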
- Print all tag names
# Print all tag names
for tag in soup.find_all(True):
    print(tag.name)
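`find_all(True)` matches every tag, so its result can also be summarized. The sketch below counts how often each tag name occurs in a hypothetical inline document.

```python
from collections import Counter
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
html = "<html><body><p><b>x</b></p><p>y</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

names = [tag.name for tag in soup.find_all(True)]  # tags in document order
counts = Counter(names)                            # occurrences per tag name

print(names)
print(counts.most_common())
```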