Supplementary notes
BeautifulSoup
I. BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into 4 types:
-Tag
-NavigableString
-BeautifulSoup
-Comment
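As a quick orientation before the individual sections, the sketch below builds a small inline document and prints the type of one object of each kind. The HTML string and the names sample_html and soup are made up for illustration and are not part of baidu.html.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical inline document, used only for illustration
sample_html = (
    "<html><head><title>Demo</title></head>"
    '<body><a href="/x"><!--note--></a></body></html>'
)
soup = BeautifulSoup(sample_html, "html.parser")

title_tag = soup.title          # a Tag
title_text = soup.title.string  # a NavigableString
link_text = soup.a.string       # a Comment (a subclass of NavigableString)

for obj in (soup, title_tag, title_text, link_text):
    print(type(obj).__name__)
```

Comment is a subclass of NavigableString, which is why section 4 calls it a special NavigableString.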
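1. Tag: a tag together with its contents. Accessing a tag by name (for example bs.title) returns only the first matching tag. This is the second most commonly used type.
1.1 Print the title
from bs4 import BeautifulSoup
with open("./baidu.html", "rb") as file:  # read the saved page as bytes
    html = file.read()
bs = BeautifulSoup(html, "html.parser")  # "html.parser" is Python's built-in HTML parser
print(bs.title)
Result: <title>百度一下,你就知道</title>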
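1.2 Print the first <a> tag (everything from the opening <a> to the closing </a>)
from bs4 import BeautifulSoup
with open("./baidu.html", "rb") as file:  # read the saved page as bytes
    html = file.read()
bs = BeautifulSoup(html, "html.parser")  # "html.parser" is Python's built-in HTML parser
print(bs.a)
Result: <a class="mnav" href="http://news.baidu.com" name="tj_trnews"><!--新闻--></a>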
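Because bs.a stops at the first <a> tag, the other links on the page need a different call. The sketch below contrasts attribute access with find_all, which returns every match in document order. The three-link HTML string is a made-up stand-in for baidu.html.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with several links, used only for illustration
sample_html = (
    '<p><a href="/news">News</a>'
    '<a href="/map">Map</a>'
    '<a href="/video">Video</a></p>'
)
soup = BeautifulSoup(sample_html, "html.parser")

first_link = soup.a             # only the first <a> tag
all_links = soup.find_all("a")  # every <a> tag, in document order
hrefs = [link["href"] for link in all_links]

print(first_link["href"])
print(hrefs)
```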
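1.3 Print the <head> tag (everything from <head> to </head>)
from bs4 import BeautifulSoup
with open("./baidu.html", "rb") as file:  # read the saved page as bytes
    html = file.read()
bs = BeautifulSoup(html, "html.parser")  # "html.parser" is Python's built-in HTML parser
print(bs.head)
Result: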
<head>
<meta content="text/html;charest=utf-8" http-equiv="content-type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="always" name="referrer"/>
<link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>
<title>百度一下,你就知道</title>
</head>
1.4 Type
from bs4 import BeautifulSoup
with open("./baidu.html", "rb") as file:  # read the saved page as bytes
    html = file.read()
bs = BeautifulSoup(html, "html.parser")  # "html.parser" is Python's built-in HTML parser
print(type(bs.title))
print(type(bs.a))
print(type(bs.head))
Result:
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
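2. NavigableString: the text inside a tag, as a string
from bs4 import BeautifulSoup
with open("./baidu.html", "rb") as file:  # read the saved page as bytes
    html = file.read()
bs = BeautifulSoup(html, "html.parser")  # "html.parser" is Python's built-in HTML parser
print(bs.title.string)
print(type(bs.title.string))
Result: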
百度一下,你就知道
<class 'bs4.element.NavigableString'>
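One detail worth knowing about .string: it only returns a value when the tag has a single child. When a tag mixes text and nested tags, .string is None and get_text() is the usual way to collect all the text. The sketch below illustrates this on a made-up snippet (sample_html is not from baidu.html).

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for illustration
sample_html = "<div><p>Hello</p><p>World <b>again</b></p></div>"
soup = BeautifulSoup(sample_html, "html.parser")

first_p, second_p = soup.find_all("p")

single_child_text = first_p.string   # one text child, so a NavigableString
mixed_child_text = second_p.string   # text plus a <b> tag, so None
all_text = second_p.get_text()       # concatenates every piece of text

print(single_child_text)
print(mixed_child_text)
print(all_text)
```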
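3. BeautifulSoup: represents the whole document. This is the most commonly used type.
from bs4 import BeautifulSoup
with open("./baidu.html", "rb") as file:  # read the saved page as bytes
    html = file.read()
bs = BeautifulSoup(html, "html.parser")  # "html.parser" is Python's built-in HTML parser
print(bs.name)
print(type(bs))
Result: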
[document]
<class 'bs4.BeautifulSoup'>
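from bs4 import BeautifulSoup
with open("./baidu.html", "rb") as file:  # read the saved page as bytes
    html = file.read()
bs = BeautifulSoup(html, "html.parser")  # "html.parser" is Python's built-in HTML parser
print(bs)  # printing the BeautifulSoup object prints the parsed document
Result: the whole document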
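4. Comment: a special kind of NavigableString. When printed, the output does not include the comment markers (<!-- and -->).
from bs4 import BeautifulSoup
with open("./baidu.html", "rb") as file:  # read the saved page as bytes
    html = file.read()
bs = BeautifulSoup(html, "html.parser")  # "html.parser" is Python's built-in HTML parser
print(bs.a.string)
print(type(bs.a.string))
Result: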
新闻
<class 'bs4.element.Comment'>
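Since a printed Comment looks just like ordinary text, code that extracts link text may want to tell the two apart. A minimal sketch, assuming a made-up two-link snippet rather than baidu.html, is to check the type with isinstance:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical snippet: the first link holds a comment, the second holds real text
sample_html = '<a href="/news"><!--News--></a><a href="/map">Map</a>'
soup = BeautifulSoup(sample_html, "html.parser")

kinds = []
for link in soup.find_all("a"):
    text = link.string
    # Comment is a subclass of NavigableString, so test for Comment explicitly
    kind = "comment" if isinstance(text, Comment) else "text"
    kinds.append(kind)
    print(link["href"], kind, text)
```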
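5. Supplement: tag attributes as a dict
from bs4 import BeautifulSoup
with open("./baidu.html", "rb") as file:  # read the saved page as bytes
    html = file.read()
bs = BeautifulSoup(html, "html.parser")  # "html.parser" is Python's built-in HTML parser
print(bs.a.attrs)  # get all attributes of the tag
print(type(bs.a.attrs))
Result: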
{'class': ['mnav'], 'href': 'http://news.baidu.com', 'name': 'tj_trnews'}
<class 'dict'>
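Individual attributes can also be read without going through .attrs, because a Tag supports dictionary-style access. The sketch below reuses the attribute values shown in the result above on an inline string (sample_html and the link text are assumptions for illustration). Note that class comes back as a list, since an element can carry several classes.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the first link of baidu.html, used only for illustration
sample_html = '<a class="mnav" href="http://news.baidu.com" name="tj_trnews">News</a>'
soup = BeautifulSoup(sample_html, "html.parser")
link = soup.a

href = link["href"]          # raises KeyError if the attribute is missing
target = link.get("target")  # returns None if the attribute is missing
classes = link["class"]      # multi-valued attribute, returned as a list

print(href)
print(target)
print(classes)
```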