CSS Selectors: BeautifulSoup4

Latest recommended article published on 2023-02-24 22:33:19

lowson0810

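Like lxml, Beautiful Soup is an HTML/XML parser whose main job is to parse HTML/XML and extract data from it.

lxml only traverses the document locally, whereas Beautiful Soup is based on the HTML DOM: it loads the whole document and parses the entire DOM tree, so its time and memory overhead is much higher and its performance is lower than lxml's.

Beautiful Soup makes parsing HTML fairly simple. Its API is very user-friendly, and it supports CSS selectors, the HTML parser in the Python standard library, and lxml's XML parser.

Beautiful Soup 3 is no longer being developed, so Beautiful Soup 4 is recommended for current projects. It can be installed with pip: pip install beautifulsoup4

Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0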

Scraping tool	Speed	Difficulty of use	Installation difficulty
Regular expressions	Fastest	Hard	None (built in)
BeautifulSoup	Slow	Easiest	Easy
lxml	Fast	Easy	Moderate

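Example:

First, import the bs4 library.

The BeautifulSoup constructor takes several arguments. The first is the response content, i.e. a string; the second specifies which parser to use. We can keep using Beautiful Soup's syntax while letting the lxml library do the parsing, which improves efficiency: soup = BeautifulSoup(html, "lxml")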

# beautifulsoup4_test.py

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p><b>The Dormouse's story</b></p>
<p>Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p>...</p>
"""

# Create the Beautiful Soup object, naming the parser explicitly
soup = BeautifulSoup(html, "lxml")

# Alternatively, create the object from a local HTML file
# soup = BeautifulSoup(open('index.html'), "lxml")

# Pretty-print the contents of the soup object
print(soup.prettify())

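The four kinds of objects

Beautiful Soup converts a complex HTML document into a complex tree structure in which every node is a Python object. All of these objects fall into four kinds:

Tag: a tag; its two most important attributes are name and attrs
NavigableString
BeautifulSoup
Comment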
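
To make the four kinds concrete, here is a minimal sketch. The one-line snippet and the variable names are made up for illustration, and "html.parser" (the parser bundled with Python) is used so that lxml is not required.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical snippet, not the article's full document.
snippet = '<p><a class="sister" id="link1"><!-- Elsie --></a><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(snippet, "html.parser")

link = soup.a              # Tag: exposes .name and .attrs
heading = soup.b.string    # NavigableString: the text inside <b>
hidden = soup.a.string     # Comment: a special kind of NavigableString

# Show the class of each object next to its text form.
for obj in (soup, link, heading, hidden):
    print(type(obj).__name__, "->", repr(str(obj)))

print(link.name, link.attrs)
```

Because Comment is a subclass of NavigableString, an isinstance check against Comment is a simple way to tell a comment apart from ordinary text.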

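Searching the document tree

1.find_all(name, attrs, recursive, text, **kwargs)

1) The name parameter

The name parameter finds all tags whose name is name; string objects are ignored automatically.

A. Passing a string

The simplest filter is a string. Pass a string to a search method and Beautiful Soup looks for tags whose name matches that string exactly. The following example finds all <b> tags in the document: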

 
soup.find_all('b')
# [<b>The Dormouse's story</b>]

print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

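B. Passing a regular expression

If you pass in a regular expression, Beautiful Soup matches tag names with the pattern's search() method, so the ^ anchor is what restricts the match to the start of the name. The following example finds all tags whose names start with b, which means that both <body> and <b> should be found: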

 
import re

for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

# body
# b
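
The effect of anchoring a tag-name pattern can be sketched as follows. The tiny document and the variable names are made up for illustration, and "html.parser" is used so the sketch does not depend on lxml.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical miniature document.
doc = "<html><head><title>Story</title></head><body><b>bold</b></body></html>"
soup = BeautifulSoup(doc, "html.parser")

# "^b" only accepts names that start with b.
anchored = [tag.name for tag in soup.find_all(re.compile("^b"))]

# "t" accepts any name that contains a t anywhere.
unanchored = [tag.name for tag in soup.find_all(re.compile("t"))]

print(anchored)    # ['body', 'b']
print(unanchored)  # ['html', 'title']
```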
 

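C. Passing a list

If you pass in a list, Beautiful Soup returns everything that matches any item in the list. The following code finds all <a> tags and <b> tags in the document: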

 
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
 

2) keyword parameters
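
A keyword argument whose name is not one of find_all's own parameters is treated as a filter on the tag attribute of that name. The sketch below shows a few common forms under stated assumptions: the shortened HTML and the variable names are illustrative, class_ is the spelling Beautiful Soup uses because class is a Python keyword, and attrs takes a plain dictionary.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the sample document.
html = """
<p>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find_all(id="link2")                         # exact attribute value
by_class = soup.find_all("a", class_="sister")            # CSS class, note the underscore
by_href = soup.find_all(href=re.compile("lacie|tillie"))  # regular expression on the value
by_attrs = soup.find_all(attrs={"id": "link3"})           # same idea, passed as a dict

print([tag["id"] for tag in by_href])
```

The attrs dictionary is handy for attribute names that are not valid Python identifiers.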

 
soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
 

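3) The text parameter

The text parameter searches the string content of the document. Like the name parameter, it accepts a string, a regular expression, or a list. Since Beautiful Soup 4.4.0 the same argument is also available under the name string.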

 
soup.find_all(text="Elsie")
# [] (in this document "Elsie" sits inside a comment and is stored as " Elsie ", so the exact match fails)

soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# ['Lacie', 'Tillie']

soup.find_all(text=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]
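
A string search returns the matching strings rather than the tags that contain them. The sketch below shows the three filter types and one way to get back to the enclosing tag through .parent. The one-line document and the variable names are illustrative, and the argument is written as string, the name used since Beautiful Soup 4.4.0.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical one-line document; link1 holds only a comment.
html = ('<p><b>The Dormouse\'s story</b> <a id="link1"><!-- Elsie --></a> '
        '<a id="link2">Lacie</a> <a id="link3">Tillie</a></p>')
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="Lacie")                   # exact match on the whole string
no_match = soup.find_all(string="Elsie")                # the comment text is " Elsie ", so nothing matches
by_list = soup.find_all(string=["Tillie", "Lacie"])     # any item in the list, in document order
by_regex = soup.find_all(string=re.compile("Dormouse")) # pattern found anywhere in the string

# Each result is a string node; .parent is the tag that contains it.
owner = exact[0].parent
print(owner.name, owner["id"])
```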

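CSS selectors

This is another way to search that achieves much the same result as find_all.

When writing CSS, a tag name is written bare, a class name is prefixed with ., and an id is prefixed with #.
We can filter elements here in a similar way using soup.select(), which returns a list.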
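
Since select() returns a list of tags, the usual next step is to pull values out of each tag. The sketch below is one way to do that. The shortened HTML and the variable names are illustrative; select_one() is the companion method that returns only the first match, or None when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the sample document.
html = """
<p>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

links = soup.select("a")         # list of Tag objects
first = soup.select_one("a")     # first match only
missing = soup.select_one("h1")  # no <h1> in the document, so None

# tag["attr"] raises KeyError if the attribute is absent; tag.get("attr") returns None instead.
rows = [(a["id"], a.get("href"), a.get_text(strip=True)) for a in links]
for row in rows:
    print(row)
```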

(1) Finding by tag name

print(soup.select('title'))
# [<title>The Dormouse's story</title>]

print(soup.select('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('b'))
# [<b>The Dormouse's story</b>]

(2) Finding by class name

print(soup.select('.sister'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
 

(3) Finding by id

print(soup.select('#link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
 

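(4) Combined lookups

A combined lookup works on the same principle as combining tag names with class names and ids when writing a CSS file. For example, to find the element whose id is link1 inside a p tag, separate the two parts with a space:

print(soup.select('p #link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]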
 

To find direct children only, separate the parts with >:

print(soup.select("head > title"))
# [<title>The Dormouse's story</title>]
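
The difference between the space (any descendant) and > (direct child) can be sketched as follows. The shortened HTML and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the sample document.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p><a class="sister" id="link1">Elsie</a> and
<a class="sister" id="link2">Lacie</a></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Space: <a> anywhere below <body>, at any depth.
descendants = [a["id"] for a in soup.select("body a")]

# ">": <a> must be an immediate child; here the links sit inside <p>, not directly in <body>.
body_children = soup.select("body > a")
p_children = [a["id"] for a in soup.select("p > a")]

# No space: tag name, class and id all describe the same node.
same_node = [a["id"] for a in soup.select("a.sister#link2")]

print(descendants, body_children, p_children, same_node)
```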

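(5) Finding by attribute

A lookup can also include attributes, which must be enclosed in square brackets. Because the attribute and the tag name refer to the same node, there must be no space between them; otherwise the selector will not match.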
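
As a hedged sketch of this rule, the example below contrasts a selector with and without the space, and also tries the standard CSS attribute operators ^= (starts with), $= (ends with) and *= (contains). These operators are not covered in the text above. The shortened HTML and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the sample document.
html = """
<p>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

no_space = soup.select("a[href]")     # <a> tags that carry an href attribute
with_space = soup.select("a [href]")  # elements with href nested inside an <a>: none here

starts = [a["id"] for a in soup.select('a[href^="http://example.com/"]')]
ends = [a["id"] for a in soup.select('a[href$="tillie"]')]
contains = [a["id"] for a in soup.select('a[href*="lac"]')]

print(len(no_space), len(with_space), starts, ends, contains)
```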

 
print(soup.select('a[class="sister"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('a[href="http://example.com/elsie"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
 

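Likewise, attribute selectors can be combined with the lookups above: parts that refer to different nodes are separated by a space, and parts that refer to the same node are written without one.

print(soup.select('p a[href="http://example.com/elsie"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]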