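These notes cover the basic concepts and usage of BeautifulSoup.
Getting started with BeautifulSoup
Basic elements of the BeautifulSoup library
- Parsing a web page's markup
For example: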
import requests
from bs4 import BeautifulSoup as bs
r = requests.get("https://python123.io/ws/demo.html")
soup = bs(r.text, 'html.parser')  # use the HTML parser that bs4 wraps from the standard library
print(soup.prettify())  # pretty-print the parsed document
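The parse step above can also be tried without a network connection. The following is a minimal sketch, assuming a small hand-written HTML string (hypothetical, not the real demo page); it suggests that the constructor accepts either text or raw bytes and yields the same tree in both cases.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so no request is needed.
html = (
    "<html><head><title>Demo page</title></head>"
    "<body><p class='title'><b>Hello</b></p></body></html>"
)

# Parse from a str, as r.text would provide.
soup_from_text = BeautifulSoup(html, "html.parser")

# Parse from bytes, as r.content would provide; bs4 decodes them itself.
soup_from_bytes = BeautifulSoup(html.encode("utf-8"), "html.parser")

title_from_text = soup_from_text.title.string
title_from_bytes = soup_from_bytes.title.string
print(title_from_text, title_from_bytes)
```

Passing bytes can be convenient when the declared encoding of a response is unreliable, since the decoding is then left to bs4.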
- Basic elements of the BeautifulSoup class
For example:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get("https://python123.io/ws/demo.html")
soup = bs(r.text, 'html.parser')
#print(soup.prettify())
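print(soup.title)  # the <title> tag
print(soup.a)  # the first <a> tag in the document
print(soup.a.name)  # the tag's name, 'a'
print(soup.a.parent.name)  # the name of the tag that contains it
print(soup.a.attrs)  # the tag's attributes, as a dict
print(soup.a.string)  # the text inside the tag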
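The attributes printed above correspond to the basic element kinds in bs4: Tag, name, attributes, NavigableString, and Comment. A small sketch on a hand-written snippet (the HTML and the example.com URL are made up for illustration) shows which Python type each one has.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup containing a link and an HTML comment.
html = (
    '<p class="title">'
    '<a href="http://example.com/a" id="link1">Basic Python</a>'
    "<b><!--a comment--></b>"
    "</p>"
)
soup = BeautifulSoup(html, "html.parser")

link = soup.a                   # Tag: the first <a> element
tag_name = link.name            # str: the tag's name
parent_name = link.parent.name  # str: the name of the enclosing tag
attrs = link.attrs              # dict: attribute name -> value
text = link.string              # NavigableString: text inside the tag
comment = soup.b.string         # Comment: a NavigableString subclass

for value in (link, tag_name, parent_name, attrs, text, comment):
    print(type(value).__name__, repr(value))
```

Because Comment is a subclass of NavigableString, code that only wants visible text may need to check for Comment explicitly.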
- Understanding the BeautifulSoup library
For example:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get("https://python123.io/ws/demo.html")
soup = bs(r.text, 'html.parser')
#print(soup)
#print(soup.a)
print(soup.a.attrs)
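Since .attrs is an ordinary dict, individual attributes can be read in a few ways. The sketch below uses a made-up <a> tag (not taken from the demo page) to illustrate subscript access, .get() for attributes that may be absent, and the list value that bs4 returns for class.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with a multi-valued class attribute.
html = '<a class="py1 link" href="http://example.com/course" id="link1">Basic Python</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

href = link["href"]          # subscript access; raises KeyError if missing
classes = link["class"]      # class is multi-valued, so bs4 returns a list
missing = link.get("title")  # .get() returns None when the attribute is absent

print(href, classes, missing)
```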
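Traversing HTML content with the bs4 library
Basic structure of HTML: a document is a tree of nested tags, with <html> at the root and <head> and <body> as its children
- Downward traversal of the tag tree
For example: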
import requests
from bs4 import BeautifulSoup as bs
r = requests.get("https://python123.io/ws/demo.html")
soup = bs(r.text, 'html.parser')
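#print(soup.head)  # the <head> tag
#print(soup.head.contents)  # its direct child nodes, as a list
print(soup.body.contents)  # the direct child nodes of <body>
print(len(soup.body.contents))  # number of direct child nodes of <body>, text nodes included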
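The length printed above counts text nodes as well as tags, which often surprises on a first read. A minimal sketch, assuming a small hand-written <body> (not the demo page), shows the difference and one way to keep only the tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical body: the newlines between tags become text nodes.
html = "<body>\n<p>first</p>\n<p>second</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")

all_children = soup.body.contents  # list of direct children, text included
tag_children = [node for node in all_children if isinstance(node, Tag)]

print(len(all_children))  # counts the three "\n" strings and the two <p> tags
print(len(tag_children))  # counts only the <p> tags
```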
The .children attribute, an iterator intended for use in loops
import requests
from bs4 import BeautifulSoup as bs
r = requests.get("https://python123.io/ws/demo.html")
soup = bs(r.text, 'html.parser')
for child in soup.body.children:
    print(child)
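.children only yields direct children; bs4 also provides .descendants, which walks the whole subtree. The sketch below compares the two on a hand-written snippet (hypothetical markup, chosen so that the two results differ).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical body with one level of nesting inside the first <p>.
html = "<body><p><b>bold</b> text</p><p>plain</p></body>"
soup = BeautifulSoup(html, "html.parser")

# Direct children of <body> only.
child_names = [child.name for child in soup.body.children]

# Every tag below <body>, at any depth, in document order.
descendant_names = [
    node.name for node in soup.body.descendants if isinstance(node, Tag)
]

print(child_names)
print(descendant_names)
```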
- Upward traversal of the tag tree
For example:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get("https://python123.io/ws/demo.html")
soup = bs(r.text, 'html.parser')
#print(soup.title.parent)  # the parent tag of <title>
for parent in soup.a.parents:  # walk up through every ancestor of the first <a>
    if parent is None:
        print(parent)
    else:
        print(parent.name)
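The loop above ends at the BeautifulSoup object itself, whose name is the special string '[document]'. A small offline sketch on hand-written markup (hypothetical) makes the full ancestor chain visible.

```python
from bs4 import BeautifulSoup

# Hypothetical document with a single link nested three levels deep.
html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Collect the name of every ancestor, from nearest to farthest.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)

# The last ancestor is the soup object, which has no parent of its own.
top = list(soup.a.parents)[-1]
print(top is soup, soup.parent)
```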
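- Sideways (sibling) traversal of the tag tree
Condition for sibling traversal: it only applies among nodes that share the same parent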
import requests
from bs4 import BeautifulSoup as bs
r = requests.get("https://python123.io/ws/demo.html")
soup = bs(r.text, 'html.parser')
print(soup.a)
print(soup.a.next_sibling.next_sibling)  # the node two positions after the first <a> under the same parent
import requests
from bs4 import BeautifulSoup as bs
r = requests.get("https://python123.io/ws/demo.html")
soup = bs(r.text, 'html.parser')
for sibling in soup.a.next_siblings:  # iterate over every following sibling node
    print(sibling)
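Sibling traversal returns text nodes as well as tags, which is why the earlier snippet needs two .next_sibling steps to reach the second link. The following sketch, on a hand-written paragraph (hypothetical markup), shows this and uses find_next_sibling as one way to skip the text in between.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph: two links separated by plain text.
html = '<p><a id="link1">A</a> and <a id="link2">B</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

next_node = first.next_sibling             # the text " and ", not a tag
next_link = first.find_next_sibling("a")   # the next sibling that is an <a> tag
following = list(first.next_siblings)      # all later siblings, text included

print(repr(next_node))
print(next_link)
print(len(following))
```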
HTML formatting and encoding with the bs4 library
This comes down to using prettify
import requests
from bs4 import BeautifulSoup as bs
r = requests.get("https://python123.io/ws/demo.html")
demo = r.text
#print(demo)
soup = bs(demo, 'html.parser')
print(soup.prettify())  # print the document with one node per line, indented by depth
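On the encoding side, bs4 holds the parsed document as Unicode text, so non-ASCII content such as Chinese passes through prettify unchanged. A minimal sketch, assuming a one-line hand-written snippet (hypothetical), shows prettify on a single tag and the conversion to UTF-8 bytes with encode.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with non-ASCII text.
soup = BeautifulSoup("<p>中文</p>", "html.parser")

pretty = soup.p.prettify()      # prettify also works on an individual tag
text = soup.p.string            # the text is kept as Unicode
encoded = soup.encode("utf-8")  # bytes, e.g. for writing to a file

print(pretty)
print(text)
print(type(encoded).__name__)
```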