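Basic elements of the Beautiful Soup library
A library for parsing, traversing, and maintaining the tag tree
<p>..</p>: a tag (Tag)
p is the tag's Name
class="title" is an attribute; attributes are made up of key-value pairs
Importing Beautiful Soup: from bs4 import BeautifulSoup, or import bs4
An HTML document, its tag tree, and a BeautifulSoup object correspond to one another and can be treated as equivalent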
from bs4 import BeautifulSoup
soup = BeautifulSoup("<html>data</html>","html.parser")
soup2 = BeautifulSoup(open("D://demo.html"),"html.parser")
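The two constructor calls above can be tried end to end with the following minimal sketch. The HTML string and the temporary file location are assumptions made only for illustration; the file is opened in a with block so that it is closed after parsing.

```python
import os
import tempfile
from bs4 import BeautifulSoup

# Sample markup, assumed for illustration only.
html = "<html><body><p class='title'>data</p></body></html>"

# 1) Parse directly from a string.
soup = BeautifulSoup(html, "html.parser")

# 2) Parse from an open file handle.
#    The path is a hypothetical temporary location, not the D:// path in the notes.
path = os.path.join(tempfile.mkdtemp(), "demo.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(html)
with open(path, encoding="utf-8") as f:
    soup2 = BeautifulSoup(f, "html.parser")

# Both objects describe the same tag tree.
print(soup.p.string)
print(soup2.p.string)
```

Either way, the result is a BeautifulSoup object that represents the whole document.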
Parsers
- bs4's HTML parser: 'html.parser' (requires only the bs4 library)
- lxml's HTML parser: 'lxml' (pip install lxml)
- lxml's XML parser: 'xml' (pip install lxml)
- html5lib's parser: 'html5lib' (pip install html5lib)
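Because lxml and html5lib are optional installs, one possible way to choose a parser is to try them in order and fall back to 'html.parser'. The sketch below is only an illustration: make_soup and its preference order are hypothetical names and choices, not part of bs4. It relies on bs4 raising FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) for the first installed parser."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>hi</p>")
print(used)           # depends on which parsers are installed
print(soup.p.string)
```

Note that different parsers may repair broken markup differently, so the resulting tag tree can vary slightly between them.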
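Basic elements of the BeautifulSoup class
- Tag: a tag, the most basic unit of information, delimited by <> and </>
- Name: the tag's name, such as p; accessed with .name
- Attributes: the tag's attributes, such as class; accessed with .attrs
- NavigableString: the non-attribute string inside a tag, i.e. its content; accessed with .string
- Comment: the comment part of a string inside a tag; a special type (a subclass of NavigableString)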
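The five elements listed above can be seen side by side in a short sketch. The HTML fragment is made up for illustration and is not the demo page used elsewhere in these notes.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Illustrative markup: one tag with an attribute and text, one tag holding a comment.
html = '<p class="title"><b>The demo page</b></p><div><!--a note--></div>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                 # Tag
name = tag.name              # Name: "p"
attrs = tag.attrs            # Attributes: a dict
text = tag.string            # NavigableString (reached through the single child <b>)
note = soup.div.string       # Comment

print(isinstance(tag, Tag), name, attrs)
print(type(text).__name__, text)
print(type(note).__name__, note)
# A Comment is also a NavigableString, so check for Comment first when filtering.
print(isinstance(note, NavigableString))
```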
from bs4 import BeautifulSoup
soup = BeautifulSoup(demo,"html.parser") # demo is a string holding the HTML source
soup.title
tag = soup.a
tag
Getting tag names
from bs4 import BeautifulSoup
soup = BeautifulSoup(demo,"html.parser")
soup.a.name
soup.a.parent.name
soup.a.parent.parent.name
tag = soup.a
tag.attrs # this is a dictionary
tag.attrs['class']
tag.attrs['href']
type(tag.attrs) #dict
type(tag) #bs4.element.Tag
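Since .attrs is an ordinary dictionary, a few details are worth seeing in a small sketch: class is treated as a multi-valued attribute and comes back as a list, and a missing attribute is safer to read with .get than with square brackets. The anchor tag below is invented for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup; the URL and attribute values are assumptions.
html = '<a class="py1 link" href="http://example.com/one" id="link1">Basic Python</a>'
tag = BeautifulSoup(html, "html.parser").a

classes = tag.attrs["class"]   # multi-valued attribute -> list of strings
href = tag["href"]             # shorthand for tag.attrs["href"]
title = tag.get("title")       # missing attribute -> None instead of KeyError
has_id = tag.has_attr("id")

print(classes, href, title, has_id)
```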
#NavigableString
soup.a.string
soup.p.string
type(soup.p.string) #bs4.element.NavigableString
#Comment: check the type to tell comments apart and filter them out
newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
newsoup.b.string
type(newsoup.b.string) #bs4.element.Comment
type(newsoup.p.string) #bs4.element.NavigableString
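Because .string returns the comment text without the <!-- --> markers, a comment can be mistaken for ordinary content. A minimal sketch of filtering by type, using the same markup as above, might look like this:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = "<b><!--This is a comment--></b><p>This is not a comment</p>"
newsoup = BeautifulSoup(html, "html.parser")

# find_all(string=True) returns every string node, comments included.
all_strings = newsoup.find_all(string=True)

# Split the nodes by type: Comment versus ordinary text.
comments = [s for s in all_strings if isinstance(s, Comment)]
visible = [s for s in all_strings if not isinstance(s, Comment)]

print(comments)
print(visible)
```

The isinstance check against Comment is what separates the two, since both kinds of node behave like plain strings otherwise.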