## The BeautifulSoup4 parser
Like lxml, BeautifulSoup is an HTML/XML parser.
| Scraping tool | Speed | Ease of use | Ease of installation |
| --- | --- | --- | --- |
| Regular expressions | Fastest | Hard | None (built in) |
| BeautifulSoup | Slow | Easiest | Easy |
| lxml | Fast | Easy | Moderate |
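The parser backend is picked by the second argument to the `BeautifulSoup` constructor. A minimal sketch comparing the built-in `html.parser` (no extra install) with `lxml` (assumed to be installed separately):

```
from bs4 import BeautifulSoup

html = "<p class='title'>hello</p>"

# "html.parser" ships with Python -- nothing to install
soup_builtin = BeautifulSoup(html, "html.parser")

# "lxml" is faster but must be installed first (pip install lxml)
soup_lxml = BeautifulSoup(html, "lxml")

print(soup_builtin.p.text)  # hello
print(soup_lxml.p.text)     # hello
```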
**First, the bs4 library must be imported:**
```
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

html = """
"""
# Create a BeautifulSoup object
soup = BeautifulSoup(html, "lxml")
# Or create the object from a local html file
# soup = BeautifulSoup(open('index.html'), "lxml")
# Pretty-print the contents of the soup object
print(soup.prettify())
```
**1. Tag**
```
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

html = """
<html><head><title>Test</title></head>
<body><a href="/explore" class="item" name="test">Explore</a></body></html>
"""
# Create a BeautifulSoup object
soup = BeautifulSoup(html, "lxml")

print(soup.head)
# <head><title>Test</title></head>
print(soup.title)
# <title>Test</title>
print(soup.a)
# <a href="/explore" class="item" name="test">Explore</a>
print(soup.a.text)
# Explore
```
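Note that attribute access like `soup.a` only returns the *first* matching tag; to collect every match, use `find_all`. A short sketch with a made-up two-link document:

```
from bs4 import BeautifulSoup

html = """
<body>
<a href="/a">first</a>
<a href="/b">second</a>
</body>"""

soup = BeautifulSoup(html, "lxml")

print(soup.a.text)  # first -- only the first <a> is returned
for tag in soup.find_all("a"):  # iterate over all <a> tags
    print(tag["href"], tag.text)
```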
**2. Tags have two attributes: name and attrs**
```
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

html = """
<a href="/explore" class="item" name="test">Explore</a>
"""
# Create a BeautifulSoup object
soup = BeautifulSoup(html, "lxml")

print(soup.a.attrs)         # {'href': '/explore', 'class': ['item'], 'name': 'test'}
print(soup.a['class'])      # ['item']
print(soup.a.get('class'))  # ['item']

# Modify an attribute
soup.a['class'] = 'newClass'
print(soup.a)  # <a href="/explore" class="newClass" name="test">Explore</a>

# Delete an attribute
del soup.a['class']
print(soup.a)  # <a href="/explore" name="test">Explore</a>
```
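Indexing a missing attribute raises `KeyError`, while `tag.get()` is the safe form; it is worth preferring when an attribute may be absent. A sketch with a minimal made-up tag:

```
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/explore">Explore</a>', "lxml")

print(soup.a.get("href"))       # /explore
print(soup.a.get("class"))      # None -- attribute absent, no exception
print(soup.a.get("class", []))  # [] -- with an explicit default
try:
    soup.a["class"]             # indexing a missing attribute raises
except KeyError:
    print("KeyError")
```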
**3. Getting a tag's contents**
```
# Create a BeautifulSoup object
soup = BeautifulSoup(html, "lxml")
print(soup.a.string)  # Explore
print(soup.a.text)    # Explore
```
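`.string` and `.text` look interchangeable here because the tag has a single text child; they diverge once a tag contains nested tags. A sketch with a made-up nested paragraph:

```
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "lxml")

print(soup.p.text)    # Hello world -- concatenates all descendant text
print(soup.p.string)  # None -- more than one child, so .string gives up
print(soup.b.string)  # world -- exactly one text child
```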
Creating a BeautifulSoup object (passing the parser explicitly avoids a warning):
```
soup = BeautifulSoup(html, "lxml")
```
Now let's print the soup object's contents, pretty-printed:
```
print(soup.prettify())
```
**4. Finding tags**
The examples below assume the classic "three sisters" HTML document from the official BeautifulSoup documentation. Print a tag directly:
```
print(soup.title)
# <title>The Dormouse's story</title>
print(soup.head)
# <head><title>The Dormouse's story</title></head>
print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
```
A tag has two important attributes: name and attrs.
```
print(soup.name)
# [document]
print(soup.head.name)
# head
```
The soup object itself is special: its name is [document]. For ordinary tags, name is simply the tag's own name.
```
print(soup.p.attrs)
# {'class': ['title'], 'name': 'dromouse'}
```
Here we printed all the attributes of the p tag; the result is a dictionary.
To fetch a single attribute, index the tag by name; for example, to get its class:
```
print(soup.p['class'])
# ['title']
```
**5. Getting the text**
```
print(soup.p.string)
# The Dormouse's story
```
**6. CSS selectors**
1. Look up by tag name
```
print(soup.select('title'))
# [<title>The Dormouse's story</title>]
```
2. Look up by class name
```
print(soup.select('.title'))
# [<p class="title" name="dromouse"><b>The Dormouse's story</b></p>]
```
3. Look up by id
```
print(soup.select('#test'))
# the list of tags with id="test" (here, a tag containing the text "ceshi")
```
4. Combined lookup
```
print(soup.select('p #link1'))
# the tags with id="link1" nested inside a <p>
```
5. Direct child lookup
```
print(soup.select('head > title'))
# [<title>The Dormouse's story</title>]
```
6. Look up by attribute
```
print(soup.select('a[class="title"]'))
print(soup.select('a[href="http://www.baidu.com"]'))
```
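`select()` always returns a list of tags, so a typical pattern is to loop over the results and pull out text or attributes. A sketch scraping links from a made-up navigation block:

```
from bs4 import BeautifulSoup

html = """
<div id="nav">
<a class="item" href="/explore">Explore</a>
<a class="item" href="/topics">Topics</a>
</div>"""

soup = BeautifulSoup(html, "lxml")

# Combine a CSS selector with attribute/text access
links = [(a["href"], a.get_text()) for a in soup.select("#nav a.item")]
print(links)  # [('/explore', 'Explore'), ('/topics', 'Topics')]
```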