1. Example HTML
Sample HTML file, bs4.html:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>bs4_text</title>
</head>
<body>
<div id = "id1" name = "div1" class = "div_class1">
<h1></h1>
<h2>h2的内容</h2>
<a class = "a_class1" href = "http://www.baidu.com">
baidu
</a>
<a class = "a_class1" href = "http://www.taobao.com">taobao</a>
<span>
<span>span_text</span>
</span>
</div>
<div id = "id2" name = 'div2' class = "div_class2">
<a class = "a_class2" href = "http://www.jd.com"></a>
<a class = "a_class2" href = "http://www.vip.com">vip</a>
</div>
</body>
</html>
2. Converting between str and BeautifulSoup
2.1 Install bs4:
pip install beautifulsoup4
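2.2 Using BeautifulSoup from bs4
from bs4 import BeautifulSoup

# Read the sample page into a str
with open("bs4.html", "r", encoding="utf-8") as f:
    html = f.read()

# str -> BeautifulSoup; naming the parser avoids the "no parser specified" warning
bs = BeautifulSoup(html, "html.parser")

# BeautifulSoup -> str
html_str1 = str(bs)         # compact markup
html_str2 = bs.prettify()   # indented markup, one node per line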
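The round trip can also be tried without a file. The sketch below is only an illustration: the `sample` string is a made-up stand-in for the contents of bs4.html, and `html.parser` is chosen because it ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical inline stand-in for the contents of bs4.html.
sample = '<div id="id1"><h2>h2 text</h2></div>'

# str -> BeautifulSoup
soup = BeautifulSoup(sample, "html.parser")

# BeautifulSoup -> str
compact = str(soup)        # markup on a single line, as parsed
pretty = soup.prettify()   # same markup, indented

print(compact)
print(pretty)
```

For well-formed input like this, `str(soup)` gives back the original markup, while `prettify()` is meant for reading rather than for byte-exact output.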
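3. The four object types
3.1 Tag objects (two important attributes: name and attrs)
div_tag = bs.div
print(div_tag)             # the first <div> element and everything inside it
print(type(div_tag))       # <class 'bs4.element.Tag'>
print(div_tag.name)        # div
print(div_tag.attrs)       # dict of the tag's attributes
print(div_tag['id'])       # id1
print(div_tag.a['class'])  # ['a_class1'] (class values come back as a list)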
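| Command | Explanation |
|---|---|
| BeautifulSoup.<tag name> (e.g. bs.div) | The first Tag object with that tag name |
| Tag.name | The tag's name |
| Tag.attrs | The tag's attributes as a dict of key/value pairs |
| Tag['key'] | The value of the tag's key attribute |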
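A short sketch of these accessors on an inline snippet follows. The `sample` string is a cut-down, hypothetical version of the first div in bs4.html; `Tag.get()` is included because it returns None for a missing attribute where `Tag['key']` would raise KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical cut-down version of the first <div> in bs4.html.
sample = (
    '<div id="id1" name="div1" class="div_class1">'
    '<a class="a_class1" href="http://www.baidu.com">baidu</a>'
    '</div>'
)
soup = BeautifulSoup(sample, "html.parser")

div_tag = soup.div                # first <div> in the document
tag_name = div_tag.name           # the tag name, not the name="..." attribute
tag_id = div_tag["id"]            # single-valued attribute -> str
a_classes = div_tag.a["class"]    # class is multi-valued -> list of str
missing = div_tag.get("title")    # absent attribute -> None, no KeyError

print(tag_name, tag_id, a_classes, missing)
```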
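3.2 NavigableString (text) objects
string = bs.div.a.string
print(string)         # the link text, including the surrounding newlines
print(type(string))   # <class 'bs4.element.NavigableString'>
text = bs.div.a.get_text()
print(text)
print(type(text))     # <class 'str'>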
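| Command | Explanation |
|---|---|
| Tag.string | The text inside the element as a NavigableString; None if the element has more than one child |
| Tag.get_text() | All text inside the element and its descendants, joined into one str |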
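The difference in return type can be checked directly. This is a minimal sketch on a hypothetical inline snippet that mimics the baidu link, newlines included; the `strip=True` argument of `get_text()` is shown as one way to drop the surrounding whitespace.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical snippet mimicking the multi-line baidu link in bs4.html.
sample = '<div><a href="http://www.baidu.com">\nbaidu\n</a></div>'
soup = BeautifulSoup(sample, "html.parser")

s = soup.div.a.string                       # NavigableString (a str subclass)
t = soup.div.a.get_text()                   # plain str, whitespace kept
stripped = soup.div.a.get_text(strip=True)  # plain str, whitespace removed

print(type(s).__name__, type(t).__name__, repr(t), repr(stripped))
```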
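3.3 BeautifulSoup objects
# The BeautifulSoup object represents the whole parsed document
print(type(bs))   # <class 'bs4.BeautifulSoup'>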
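3.4 Comment objects
obj1 = bs.div.h1.string   # Comment if <h1> holds only a comment; None if <h1> is empty
obj2 = bs.div.h2.string   # NavigableString
print(obj1)
print(type(obj1))
print(obj2)
print(type(obj2))
obj3 = bs.div.h2.get_text()
print(obj3)
print(type(obj3))         # <class 'str'>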
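| Element contains | .string | .get_text() |
|---|---|---|
| Only a comment | The comment text without the <!-- --> markers; type Comment | A str; current bs4 releases skip comments in get_text(), so it is empty |
| Only text | The text as a NavigableString | The text as a str |
| Both a comment and text | None | Only the text |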
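The three rows of the table can be reproduced with a small sketch. The markup below is invented for the purpose (the comment texts are not taken from bs4.html), and the results are what a recent bs4 release with `html.parser` is expected to give.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup: one element per row of the table above.
sample = (
    "<h1><!--only a comment--></h1>"
    "<h2>only text</h2>"
    "<h3><!--a comment-->and text</h3>"
)
soup = BeautifulSoup(sample, "html.parser")

only_comment = soup.h1.string      # Comment object, markers removed
comment_text = soup.h1.get_text()  # comments are skipped by get_text()
only_text = soup.h2.string         # NavigableString
mixed = soup.h3.string             # two children -> None
mixed_text = soup.h3.get_text()    # only the text part

print(repr(only_comment), repr(comment_text), repr(mixed), repr(mixed_text))
```

Because Comment is a subclass of NavigableString, an `isinstance(x, Comment)` check is the usual way to keep comment text out of scraped results when `.string` is used.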
4. Two kinds of nodes
4.1 Child nodes: contents and children
contents = bs.div.contents
print(contents)
children = bs.div.children
print(children)
for obj in children:
    print(obj)
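| Command | Explanation |
|---|---|
| .contents | A list of the tag's direct children; the newline text between tags counts as a child too |
| .children | An iterator over the same direct children |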
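The whitespace entries are easy to overlook, so here is a minimal sketch on a hypothetical three-line snippet. It counts the entries in `.contents` and then filters `.children` down to Tag objects.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet with newlines between the tags, as in bs4.html.
sample = "<div>\n<h2>h2 text</h2>\n<a>taobao</a>\n</div>"
soup = BeautifulSoup(sample, "html.parser")

contents = soup.div.contents   # list: '\n', <h2>, '\n', <a>, '\n'
tag_names = [c.name for c in soup.div.children if isinstance(c, Tag)]

print(len(contents), tag_names)
```

Filtering with `isinstance(c, Tag)` is one common way to ignore the newline text nodes when only elements matter.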
4.2 Descendant nodes: descendants
descendants = bs.div.descendants
print(descendants)   # a generator object, not a list
for obj in descendants:
    print(obj)
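To make the contrast with children concrete, the sketch below uses a hypothetical snippet modeled on the nested span in bs4.html and compares the two traversals.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modeled on the nested <span> in bs4.html.
sample = "<div><span><span>span_text</span></span></div>"
soup = BeautifulSoup(sample, "html.parser")

direct = list(soup.div.children)         # outer <span> only
everything = list(soup.div.descendants)  # outer <span>, inner <span>, 'span_text'

print(len(direct), len(everything))
```

Descendants are yielded in document order, so each tag appears before the nodes nested inside it.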
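5. Search methods
5.1 find and find_all
- find returns the first matching object (None if nothing matches); find_all returns a list of all matches
5.1.1 Method signatures: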
def find(self, name=None, attrs={}, recursive=True, text=None,
**kwargs):
def find_all(self, name=None, attrs={}, recursive=True, text=None,
limit=None, **kwargs):
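| Parameter | Explanation |
|---|---|
| name | Tag name to match (a str, a list, or a compiled regular expression) |
| attrs | Dict of attribute names and values to match |
| recursive | True searches all descendants; False searches direct children only |
| text | Text content to match; since bs4 4.4.0 the same argument is also available as string |
| limit | Maximum number of results to return |
| **kwargs | Attribute filters written as keyword arguments, e.g. id=... or class_=... |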
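The effect of `recursive` and `limit` is easiest to see side by side. The following is a sketch on a hypothetical snippet in which one link is a direct child of the div and the other is nested inside a span.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one direct-child link, one nested link.
sample = (
    '<div id="id1"><a href="http://www.baidu.com">baidu</a>'
    '<span><a href="http://www.taobao.com">taobao</a></span></div>'
)
soup = BeautifulSoup(sample, "html.parser")
div = soup.find("div")

all_links = div.find_all("a")                      # both links
direct_links = div.find_all("a", recursive=False)  # direct children only
first_only = div.find_all("a", limit=1)            # list with one element
first = div.find("a")                              # the element itself
nothing = div.find("table")                        # no match -> None

print(len(all_links), len(direct_links), len(first_only), nothing)
```

Note that `find` returns None on a miss, while `find_all` returns an empty list, so only the former needs a None check before attribute access.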
5.1.2 name as a str or a list:
ret1 = bs.find_all(name="div")
print(ret1)
# A list matches any of the listed tag names
ret2 = bs.find_all(name=["a", "span"])
print(ret2)
5.1.3 name as a regular expression:
import re
pattern = re.compile('.*?iv')
ret3 = bs.find(name=pattern)
print(ret3)
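The three forms of the name filter can be compared on one small document. This is a sketch with an invented snippet; the pattern `pa` is arbitrary and only illustrates that bs4 applies the regular expression with `search()`, so it may match in the middle of a tag name.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet with three different tag names.
sample = "<div><a>baidu</a><span>span_text</span></div>"
soup = BeautifulSoup(sample, "html.parser")

by_str = [t.name for t in soup.find_all("a")]
by_list = [t.name for t in soup.find_all(["a", "span"])]
by_regex = [t.name for t in soup.find_all(re.compile("pa"))]  # matches s-pa-n

print(by_str, by_list, by_regex)
```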
5.1.4 attrs and **kwargs:
ret4 = bs.find_all(attrs={"class": "a_class1"})
print(ret4)
# class is a reserved word in Python, so the keyword form is class_
ret5 = bs.find(class_="a_class1")
ret6 = bs.find(id="….")
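Both spellings give the same result for class and id. The sketch below uses a hypothetical snippet; the `data-role` attribute and the extra `external` class are invented to show two cases the keyword form does not make obvious: attribute names that are not valid Python identifiers, and elements with several classes.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; data-role and "external" are made up for illustration.
sample = (
    '<div id="id1" data-role="box">'
    '<a class="a_class1 external" href="http://www.baidu.com">baidu</a>'
    '<a class="a_class2" href="http://www.jd.com">jd</a></div>'
)
soup = BeautifulSoup(sample, "html.parser")

by_attrs = soup.find_all(attrs={"class": "a_class1"})  # dict form
by_kwarg = soup.find_all(class_="a_class1")            # keyword form
by_id = soup.find(id="id1")
# data-role is not a valid keyword argument name, so attrs is required here.
by_data = soup.find(attrs={"data-role": "box"})

print(len(by_attrs), len(by_kwarg), by_id.name, by_data.name)
```

A class filter matches if any one of the element's classes equals the given value, which is why the first link is found despite its second class.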
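5.1.5 text
taobao = bs.find_all(text='taobao')
print(taobao)
# A str is matched exactly. The baidu link text spans several lines
# in bs4.html, so the plain string 'baidu' finds nothing
baidu = bs.find_all(text='baidu')
print(baidu)    # []
# A regular expression can match inside the text, newlines included
pattern = re.compile('\n*?.*?baidu\n*?.*?')
baidu = bs.find_all(text=pattern)
print(baidu)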
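The exact-match behavior is worth a small experiment. The sketch below uses a hypothetical snippet with the same multi-line link text, and it passes the filter as `string`, which is the name bs4 4.4.0 and later use for the `text` argument; a plain `re.compile("baidu")` is assumed to be sufficient in place of the longer pattern.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet: one multi-line link text, one single-line link text.
sample = "<div><a>\nbaidu\n</a><a>taobao</a></div>"
soup = BeautifulSoup(sample, "html.parser")

exact = soup.find_all(string="taobao")              # exact match succeeds
miss = soup.find_all(string="baidu")                # '\nbaidu\n' != 'baidu'
by_regex = soup.find_all(string=re.compile("baidu"))  # search() inside the text

print(exact, miss, by_regex)
```

The results are text nodes, not tags; `by_regex[0].parent` leads back to the enclosing `<a>` element when the tag itself is needed.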
5.2 select and select_one (CSS selectors)
select_one returns the first match (None if nothing matches); select returns all matches
5.2.1 Method signatures:
def select_one(self, selector, namespaces=None, **kwargs):
def select(self, selector, namespaces=None, limit=None, **kwargs):
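A minimal sketch of the two methods on a hypothetical snippet follows. It relies on the CSS selector support that recent bs4 releases get from the soupsieve package, which pip normally installs alongside beautifulsoup4.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two links of the same class.
sample = (
    '<div id="id1"><a class="a_class1">baidu</a>'
    '<a class="a_class1">taobao</a></div>'
)
soup = BeautifulSoup(sample, "html.parser")

matches = soup.select("a.a_class1")           # all matches, as a list
first = soup.select_one("a.a_class1")         # first match only
limited = soup.select("a.a_class1", limit=1)  # stop after one match
none = soup.select_one("table")               # no match -> None

print(len(matches), first.get_text(), len(limited), none)
```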
5.2.2 Selecting by tag name
div = bs.select("div")
print(div)
print(len(div))
print(div[0].prettify())
print(div[0].get_text())
5.2.3 Selecting by class name
class1 = bs.select(".a_class1")
5.2.4 Selecting by id
id1 = bs.select(‘
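As an illustration of the three basic selector kinds, here is a sketch on a hypothetical snippet. The `#id1` selector is an assumption based on standard CSS syntax and the ids in the sample page; it is not recovered from the original line.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet reusing the ids and classes of bs4.html.
sample = (
    '<div id="id1" class="div_class1"><a class="a_class1">baidu</a></div>'
    '<div id="id2" class="div_class2"></div>'
)
soup = BeautifulSoup(sample, "html.parser")

by_tag = soup.select("div")          # tag name, no prefix
by_class = soup.select(".a_class1")  # class name, "." prefix
by_id = soup.select("#id1")          # id, "#" prefix (assumed selector)

print(len(by_tag), by_class[0].get_text(), by_id[0]["class"])
```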
5.2.5 Combined selectors
print(bs.select('div > .a_class1'))
print(bs.select('.div_class1 > .a_class1'))
print(bs.select('div > a'))
print(‘
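Notes:
- > selects direct children only. Writing a space instead also works, but the selector then matches descendants at any depth
- The parts of a combined selector are not parallel conditions on one element; each part narrows the result of the part before it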
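The difference between the two combinators shows up as soon as a match is nested more than one level deep. The sketch below uses a hypothetical snippet with one link directly under the div and one inside a span.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one direct-child link, one link nested in a <span>.
sample = (
    '<div id="id1"><a class="a_class1">baidu</a>'
    '<span><a class="a_class1">taobao</a></span></div>'
)
soup = BeautifulSoup(sample, "html.parser")

# "div > .a_class1": elements of class a_class1 that are direct children of a div.
children_only = [t.get_text() for t in soup.select("div > .a_class1")]
# "div .a_class1": elements of class a_class1 anywhere below a div.
any_depth = [t.get_text() for t in soup.select("div .a_class1")]

print(children_only, any_depth)
```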
5.2.6 Selecting by attribute
print(bs.select('a[class="a_class1"]'))
print(bs.select('div > a[class="a_class1"]'))
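Attribute selectors are not limited to exact values. The sketch below, on a hypothetical snippet, adds the standard CSS presence form `[href]` and prefix form `[href^=...]` next to the exact match used above; these two forms are extra illustrations, not part of the original examples.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two links from the sample page.
sample = (
    '<div><a class="a_class1" href="http://www.baidu.com">baidu</a>'
    '<a class="a_class2" href="http://www.vip.com">vip</a></div>'
)
soup = BeautifulSoup(sample, "html.parser")

exact = soup.select('a[class="a_class1"]')         # attribute equals value
has_href = soup.select("a[href]")                  # attribute is present
prefix = soup.select('a[href^="http://www.vip"]')  # attribute starts with value

print(len(exact), len(has_href), prefix[0].get_text())
```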