BeautifulSoup库是灵活又方便的网页解析库,处理高效,支持多种解析器。利用它不用编写正则表达式即可方便地实现网页信息的提取。
BeautifulSoup库的安装,可参见博客:http://blog.csdn.net/qq_29186489/article/details/78581249
常用的解析库如下:
基本使用如下所示:
#_*_coding: utf-8_*_
from bs4 import BeautifulSoup
import requests
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")
print(soup.prettify())
print(soup.title.string)
标签选择器
选择元素
from bs4 import BeautifulSoup
import requests
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)
获取名称
#_*_coding: utf-8_*_
from bs4 import BeautifulSoup
import requests
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")
print(soup.title.name)
获取属性
#_*_coding: utf-8_*_
from bs4 import BeautifulSoup
import requests
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")
print(soup.a.attrs['href'])
print(soup.a['href'])
获取内容
#_*_coding: utf-8_*_
from bs4 import BeautifulSoup
import requests
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")
print(soup.a.string)
嵌套选择
#_*_coding: utf-8_*_
from bs4 import BeautifulSoup
import requests
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")
print(soup.head.title.string)
子节点和子孙节点
#_*_coding: utf-8_*_
from bs4 import BeautifulSoup
import requests
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")
#以列表的形式返回子节点
print(soup.p.contents)
#以迭代器的形式返回子节点
print(soup.p.children)
for i,child in enumerate(soup.p.children):
print(i,child)
#以迭代器的形式返回子孙节点
print(soup.p.descendants)
for i,child in enumerate(soup.p.descendants):
print(i,child)
父节点和祖先节点
#_*_coding: utf-8_*_
from bs4 import BeautifulSoup
import requests
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")
#获取a标签的父节点
print(soup.a.parent)
#获取a标签的父节点和祖先节点
print(list(enumerate(soup.a.parents)))
获取兄弟节点
#_*_coding: utf-8_*_
from bs4 import BeautifulSoup
import requests
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")
print(soup.p.previous_siblings)
print(soup.p.next_siblings)
标准选择器
根据name进行查找
#_*_coding: utf-8_*_
from bs4 import BeautifulSoup
import requests
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")
print(soup.find_all("p"))
for item in soup.find_all("p"):
print(item.find_all("a"))
根据属性进行查找
#_*_coding: utf-8_*_
from bs4 import BeautifulSoup
import requests
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")
print(soup.find_all(attrs={"id":"link2"}))
print(soup.find_all(attrs={"class":"title"}))
print(soup.find_all(id="link2"))
print(soup.find_all(class_="title"))
如果使用find方法,返回单个元素
find_parents()返回所有祖先节点
find_parent()返回直接父节点
find_next_siblings()返回后面所有兄弟节点
find_next_sibling()返回后面第一个兄弟节点
find_previous_siblings()返回前面所有的兄弟节点
find_previous_sibling()返回前面第一个的兄弟节点
find_all_next()返回节点后所有符合条件的节点
find_next()返回节点后第一个符合条件的节点
find_all_previous()返回节点后所有符合条件的节点
find_previous()返回第一个符合条件的节点
CSS选择器
通过select()直接传入CSS选择器即可完成选择
#_*_coding: utf-8_*_
from bs4 import BeautifulSoup
import requests
r=requests.get("http://python123.io/ws/demo.html")
soup=BeautifulSoup(r.text,"lxml")
#CSS选择器
print(soup.select(".title b"))
print(soup.select("#link2"))
#获取属性值
print(soup.select("#link2")[0]["class"])
print(soup.select("#link2")[0].attrs["class"])
#获取标签里面的文本内容
print(soup.select("#link2")[0].get_text())
总结:
1:推荐使用LXML解析库,必要时使用html.parse
2:标签悬着器筛选功能弱,但是速度快
3:建议使用find(),find_all()查询匹配单个结果或者多个结果
4:如果对CSS熟悉的话,建议使用select()
5:熟悉常用的获取属性和文本值的方法