BeautifulSoup库的用法详解

最新推荐文章于 2024-08-12 23:17:41 发布

天涯笨熊

最新推荐文章于 2024-08-12 23:17:41 发布

阅读量10w+

点赞数 5

分类专栏： python类库介绍

本文链接：https://blog.csdn.net/qq_29186489/article/details/78641608

版权

python类库介绍专栏收录该内容

8 篇文章 4 订阅

订阅专栏

BeautifulSoup库是灵活又方便的网页解析库，处理高效，支持多种解析器。利用它不用编写正则表达式即可方便地实现网页信息的提取。
BeautifulSoup库的安装，可参见博客：http://blog.csdn.net/qq_29186489/article/details/78581249
常用的解析库如下：
这里写图片描述
基本使用如下所示：

#_*_coding: utf-8_*_
from bs4 import BeautifulSoup
import requests
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")
print(soup.prettify())
print(soup.title.string)

标签选择器
选择元素

from bs4 import BeautifulSoup
import requests
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)

获取名称

#_*_coding: utf-8_*_

from bs4 import BeautifulSoup
import requests
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")
print(soup.title.name)

获取属性

#_*_coding: utf-8_*_
from bs4 import BeautifulSoup
import requests
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")
print(soup.a.attrs['href'])
print(soup.a['href'])

获取内容

#_*_coding: utf-8_*_
from bs4 import BeautifulSoup
import requests
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")
print(soup.a.string)

嵌套选择

#_*_coding: utf-8_*_
from bs4 import BeautifulSoup
import requests
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")
print(soup.head.title.string)

子节点和子孙节点

#_*_coding: utf-8_*_

from bs4 import BeautifulSoup
import requests
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")
#以列表的形式返回子节点
print(soup.p.contents)
#以迭代器的形式返回子节点
print(soup.p.children)
for i,child in enumerate(soup.p.children):
    print(i,child)
#以迭代器的形式返回子孙节点
print(soup.p.descendants)
for i,child in enumerate(soup.p.descendants):
    print(i,child)

父节点和祖先节点

#_*_coding: utf-8_*_

from bs4 import BeautifulSoup
import requests
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")
#获取a标签的父节点
print(soup.a.parent)
#获取a标签的父节点和祖先节点
print(list(enumerate(soup.a.parents)))

获取兄弟节点

#_*_coding: utf-8_*_

from bs4 import BeautifulSoup
import requests
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")
print(soup.p.previous_siblings)
print(soup.p.next_siblings)

标准选择器
根据name进行查找

#_*_coding: utf-8_*_

from bs4 import BeautifulSoup
import requests
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")
print(soup.find_all("p"))
for item in soup.find_all("p"):
    print(item.find_all("a"))

根据属性进行查找

#_*_coding: utf-8_*_

from bs4 import BeautifulSoup
import requests
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")
print(soup.find_all(attrs={"id":"link2"}))
print(soup.find_all(attrs={"class":"title"}))
print(soup.find_all(id="link2"))
print(soup.find_all(class_="title"))

如果使用find方法，返回单个元素
find_parents()返回所有祖先节点
find_parent()返回直接父节点
find_next_siblings()返回后面所有兄弟节点
find_next_sibling()返回后面第一个兄弟节点
find_previous_siblings()返回前面所有的兄弟节点
find_previous_sibling()返回前面第一个的兄弟节点
find_all_next()返回节点后所有符合条件的节点
find_next()返回节点后第一个符合条件的节点
find_all_previous()返回节点后所有符合条件的节点
find_previous()返回第一个符合条件的节点
CSS选择器
通过select()直接传入CSS选择器即可完成选择

#_*_coding: utf-8_*_

from bs4 import BeautifulSoup
import requests
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")
#CSS选择器
print(soup.select(".title b"))
print(soup.select("#link2"))
#获取属性值
print(soup.select("#link2")[0]["class"])
print(soup.select("#link2")[0].attrs["class"])
#获取标签里面的文本内容
print(soup.select("#link2")[0].get_text())