Python Web Scraping: Getting Started with the Beautiful Soup Library

Latest recommended article published on 2023-04-10 23:12:44

错落星辰.


Article link: https://blog.csdn.net/qq_46068895/article/details/106197611


Installing the BeautifulSoup library

pip install beautifulsoup4
Quick installation test:
Demo HTML page: https://python123.io/ws/demo.html

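import requests
from bs4 import BeautifulSoup

r = requests.get("https://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")  # second argument selects the parser
print(soup.prettify())  # pretty-printed output; prettify is a method and must be called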
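
The same check can be run without network access. The sketch below is only an illustration: the inline HTML string is a made-up stand-in for the demo page, not its real content.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for the downloaded demo page.
html = (
    "<html><head><title>Demo</title></head>"
    "<body><p class='title'>Hello</p></body></html>"
)

soup = BeautifulSoup(html, "html.parser")

# prettify is a method; printing soup.prettify without parentheses
# would show the bound method object instead of the formatted HTML.
pretty = soup.prettify()
print(pretty)
```

If this prints an indented tag tree, the library is installed and working.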

Output: the pretty-printed HTML of the demo page (shown as a screenshot in the original post).

Basic elements of the BeautifulSoup library

Beautiful Soup parsers:

Parser	Usage
bs4's HTML parser (Python's built-in html.parser)	BeautifulSoup(mk, 'html.parser')
lxml's HTML parser	BeautifulSoup(mk, 'lxml')
lxml's XML parser	BeautifulSoup(mk, 'xml')
html5lib's parser	BeautifulSoup(mk, 'html5lib')
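
The lxml and html5lib parsers are separate packages, so they may not be installed. A minimal sketch of one way to handle that, assuming you would rather fall back to html.parser than fail; the helper name make_soup is hypothetical and not part of bs4:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    make_soup is an illustrative helper, not a bs4 function.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is missing.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello<b>world</b></p>")
print(soup.p.b.string)
```

Different parsers can build slightly different trees from malformed HTML, so results on broken markup may vary with the parser that ends up being used.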

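Basic element	Description
Tag	A tag, the basic unit of information; opened with <> and closed with </>
Name	The tag's name; the name of <p>…</p> is 'p'. Format: <tag>.name
Attributes	The tag's attributes, organized as a dictionary. Format: <tag>.attrs
NavigableString	The non-attribute string inside a tag, i.e. the text between <> and </>. Format: <tag>.string
Comment	The comment part of the string inside a tag; a special Comment type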

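from bs4 import BeautifulSoup
# Assumes soup was built from the demo page as in the installation test
print(soup.title)
tag = soup.a  # the first <a> tag in the document
print(tag)
print(soup.a.name)  # tag name
print(soup.a.parent.name)  # name of the parent tag
tag.attrs  # attributes
tag.attrs['class']
type(tag.attrs)  # a dict
tag.string  # the string inside the tag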
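
The five basic elements can also be inspected on a small self-contained document. This is a sketch under stated assumptions: the HTML string is invented for illustration and is not the demo page.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup: one paragraph with text, one containing only a comment.
html = '<p class="title" id="intro"><b>Bold text</b></p><p><!-- a note --></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                             # Tag: the first <p>
first_text = tag.b.string                # NavigableString inside <b>
comment = soup.find_all("p")[1].string   # Comment inside the second <p>

print(type(tag) is Tag)
print(tag.name)
print(tag.attrs)
print(isinstance(first_text, NavigableString))
print(isinstance(comment, Comment))
```

Since .string returns a Comment for the second paragraph and plain text for the first, checking the type is one way to tell comments apart from ordinary text. Note that class is a multi-valued attribute, so its value is a list.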

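Traversing HTML content with bs4

Downward traversal of the tag tree:

Attribute	Description
.contents	List of child nodes; stores all direct children of <tag> in a list
.children	Iterator over child nodes; like .contents, used to loop over direct children
.descendants	Iterator over all descendant nodes, used to loop over the whole subtree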

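import requests
from bs4 import BeautifulSoup

demo = requests.get("https://python123.io/ws/demo.html").text
soup = BeautifulSoup(demo, "html.parser")
# Downward traversal
soup.head.contents  # direct children of <head>
soup.body.contents
len(soup.body.contents)
for i in soup.body.contents:  # a list
    print(i)
for i in soup.body.children:  # direct children (an iterator, meant for use in a for loop)
    print(i)
for i in soup.body.descendants:  # all descendants
    print(i)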
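
To see how direct children differ from descendants, here is a small sketch on invented markup (not the demo page). It keeps only Tag nodes, because both iterators also yield text nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: <body> has two <p> children, one of which nests a <b>.
html = "<body><p>one <b>bold</b></p><p>two</p></body>"
soup = BeautifulSoup(html, "html.parser")

children = soup.body.contents  # a list of direct children
child_names = [c.name for c in soup.body.children if isinstance(c, Tag)]
descendant_names = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(len(children))      # 2
print(child_names)        # ['p', 'p']
print(descendant_names)   # ['p', 'b', 'p']
```

The nested b tag appears only among the descendants, in document order.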

Upward traversal of the tag tree:

Attribute	Description
.parent	The node's parent tag
.parents	Iterator over the node's ancestor tags, used to loop over ancestors

# Upward traversal: .parent is the parent, .parents iterates over all ancestors
soup.title.parent
for parent in soup.a.parents:  # iterate over the ancestors of the first <a> tag
    if parent is None:
        print(parent)
    else:
        print(parent.name)
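
A short sketch of what the ancestor chain looks like, using invented markup. The last ancestor is the BeautifulSoup object itself, whose name is '[document]' and whose own parent is None.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a single nested link.
html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk from the <a> tag up to the document root.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)  # ['p', 'body', 'html', '[document]']
print(soup.parent)     # None
```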

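Sibling (parallel) traversal of the tag tree:

Attribute	Description
.next_sibling	The next sibling node in HTML document order
.previous_sibling	The previous sibling node in HTML document order
.next_siblings	Iterator over all following sibling nodes in HTML document order
.previous_siblings	Iterator over all preceding sibling nodes in HTML document order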

# Sibling traversal
soup.a.next_sibling
soup.a.previous_sibling
for i in soup.a.previous_siblings:  # an iterator, meant for use in a for loop
    print(i)
for i in soup.a.next_siblings:  # an iterator, meant for use in a for loop
    print(i)
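
Siblings are nodes, not only tags: text between two tags is itself a sibling. The sketch below, on invented markup, shows this and filters the siblings down to Tag objects.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: two links separated by plain text.
html = "<p><a id='one'>A</a> and <a id='two'>B</a> end</p>"
soup = BeautifulSoup(html, "html.parser")

first = soup.a
print(repr(first.next_sibling))  # ' and ' (a text node, not a tag)
print(first.previous_sibling)    # None, since nothing precedes the first <a>

# Keep only the siblings that are tags.
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]
ids = [s["id"] for s in tag_siblings]
print(ids)  # ['two']
```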

HTML formatting and encoding in bs4

The prettify() method

print(soup.a.prettify())  # pretty-print the first <a> tag

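Encoding in bs4

soup = BeautifulSoup("<p>中文</p>", "html.parser")  # Chinese text; bs4 handles it as Unicode
soup.p.string
print(soup.p.prettify())  # this document has no <a> tag, so prettify the <p> tag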
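
A minimal sketch of the text and bytes sides of this, assuming UTF-8 as the target encoding: strings taken from the tree are Unicode text, and encode() turns a tag back into bytes.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>中文</p>", "html.parser")

text = soup.p.string          # Unicode text (a str subclass)
raw = soup.p.encode("utf-8")  # the tag serialized as UTF-8 bytes

print(text)
print(raw)
print(raw.decode("utf-8"))
```

Decoding the bytes gives back the original markup, so the round trip does not lose the Chinese characters.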