Beginner's Web Scraping Notes 5: Basic Elements of the beautifulsoup Library

Latest recommended article published on 2021-12-17 23:16:06

paleyellow

Categories: Notes, python

Basic elements of the Beautiful Soup library

Beautiful Soup is a library for parsing, traversing, and maintaining a tag tree.

<p class="title">..</p>: a Tag
p is the Name
class="title" is an attribute; attributes are made up of key-value pairs

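Importing the Beautiful Soup library: from bs4 import BeautifulSoup, or import bs4
An HTML document, its tag tree, and a BeautifulSoup object are equivalent: each corresponds one-to-one with the others.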

from bs4 import BeautifulSoup
soup = BeautifulSoup("<html>data</html>","html.parser")
soup2 = BeautifulSoup(open("D://demo.html"),"html.parser")
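
The two construction styles above can be sketched in a self-contained way as follows. This is only an illustration: the HTML text and the temporary file are made up for the example, and the file is opened in a with-block so that the handle is closed after parsing.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration.
html_text = '<html><body><p class="title">data</p></body></html>'

# 1) Parse directly from a string.
soup_from_str = BeautifulSoup(html_text, "html.parser")

# 2) Parse from an open file handle; each with-block closes its file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html_text)
    with open(path, encoding="utf-8") as f:
        soup_from_file = BeautifulSoup(f, "html.parser")

# Both routes produce the same tag tree.
print(str(soup_from_str) == str(soup_from_file))
print(soup_from_file.p.string)
```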

Parsers

bs4's HTML parser: 'html.parser' (needs only the bs4 library; it uses Python's built-in html.parser)
lxml's HTML parser: 'lxml' (pip install lxml)
lxml's XML parser: 'xml' (pip install lxml)
html5lib's parser: 'html5lib' (pip install html5lib)
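
Because 'lxml' and 'html5lib' are optional installs, one possible way to pick a parser is to try them in order and fall back to 'html.parser'. The sketch below assumes that approach; make_soup is a hypothetical helper name, not part of bs4, while FeatureNotFound is the exception bs4 raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) for the first installed parser."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>hello</p>")
print(used)           # whichever parser was available first
print(soup.p.string)
```

The different parsers may wrap a fragment in extra <html> and <body> tags, so the full tree can vary between them even when soup.p is the same.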

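Basic elements of the BeautifulSoup class

Tag: a tag, the most basic unit for organizing information, delimited by <> and </>
Name: the tag's name, e.g. p; accessed as .name
Attributes: the tag's attributes, e.g. class, held as a dictionary; accessed as .attrs
NavigableString: the non-attribute string inside a tag, i.e. its content; accessed as .string
Comment: the comment part of a string inside a tag, represented by the special Comment type (a subclass of NavigableString)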
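
As a rough illustration of how these elements map onto Python objects, here is a minimal sketch on a made-up one-line document (the markup is an assumption chosen for the example). It also shows two details that are easy to miss: with the HTML parsers, class comes back as a list, and .string is None when a tag has more than one child.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
html = '<p class="title" id="top">Hello <b>world</b></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p
print(type(tag).__name__)           # Tag
print(tag.name)                     # p
print(tag.attrs)                    # dict; 'class' maps to a list of values
print(tag.b.string)                 # world
print(type(tag.b.string).__name__)  # NavigableString
print(tag.string)                   # None: <p> has two children (text and <b>)
```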

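from bs4 import BeautifulSoup
# demo is assumed to be a string holding the HTML source of the page
soup = BeautifulSoup(demo, "html.parser")
soup.title      # the <title> tag
tag = soup.a    # the first <a> tag in the document
tag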

Getting a tag's name, attributes, and string content

from bs4 import BeautifulSoup
# demo is assumed to be a string holding the HTML source of the page
soup = BeautifulSoup(demo, "html.parser")
soup.a.name
soup.a.parent.name
soup.a.parent.parent.name
tag = soup.a
tag.attrs  # this is a dictionary
tag.attrs['class']
tag.attrs['href']
type(tag.attrs)  # dict
type(tag)  # bs4.element.Tag
# NavigableString
soup.a.string
soup.p.string
type(soup.p.string)  # bs4.element.NavigableString
# Comment: check the type to filter out comment text
newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
newsoup.b.string
type(newsoup.b.string)  # bs4.element.Comment
type(newsoup.p.string)  # bs4.element.NavigableString
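
Since .string returns the comment text without its <!-- --> markers, a type check is needed to tell a comment from ordinary content. The following is a minimal sketch of such a filter; visible_string is a hypothetical helper name introduced here, not a bs4 function.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)


def visible_string(tag):
    """Hypothetical helper: return tag.string, or None if it is missing or a comment."""
    text = tag.string
    if text is None or isinstance(text, Comment):
        return None
    return str(text)


print(visible_string(newsoup.b))  # None, because the only content is a comment
print(visible_string(newsoup.p))  # This is not a comment
```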