The BeautifulSoup Library in Detail

Installation: pip3 install beautifulsoup4

Parsers

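Parser | Usage | Advantages | Disadvantages
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents | Requires an external C library
lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires an external C library
html5lib | BeautifulSoup(markup, "html5lib") | Most tolerant; parses documents the way a browser does; produces HTML5-style documents | Slow; requires an external Python package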
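
Since lxml and html5lib have to be installed separately, a script can try the parsers in order of preference and fall back to the built-in one. The following is only a minimal sketch: the helper name make_soup and the preference order are my own assumptions, not part of bs4.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first parser that is installed."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing
            continue
    raise RuntimeError("no usable parser found")


soup, parser_used = make_soup("<p>Unclosed paragraph")
print(parser_used)    # whichever parser was available first
print(soup.p.string)  # every listed parser repairs the missing </p>
```

html.parser ships with Python, so the loop should always find at least one parser.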

Basic usage

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
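from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
print(soup.prettify())  # pretty-print the parsed document
print(soup.title.string)  # print the text inside the title tag

Output: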
①:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
②:The Dormouse's story

Tag selectors

Selecting elements

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
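from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
print(soup.title)  # print the title tag
print(type(soup.title))  # print the type of the title tag
print(soup.head)  # print the head tag
print(soup.p)  # print the p tag (only the first one is returned)

Output: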
①:<title>The Dormouse's story</title>
②:<class 'bs4.element.Tag'>
③:<head><title>The Dormouse's story</title></head>
④:<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

Getting the tag name

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
print(soup.title.name)  # print the name of the tag

Output:
title

Getting attributes

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
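from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
print(soup.p.attrs['name'])  # print the name attribute of the p tag
print(soup.p['name'])  # another way to print the name attribute of the p tag

Output: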
dromouse
dromouse
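
Indexing with tag['name'] raises a KeyError when the attribute is absent, and class comes back as a list because it can hold several values. A small sketch of the safer lookups, using html.parser and a shortened snippet of my own rather than the full document above:

```python
from bs4 import BeautifulSoup

html = '<p class="title main" name="dromouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

classes = p["class"]              # multi-valued attribute, returned as a list
name = p.get("name")              # same value as p["name"]
missing = p.get("id")             # absent attribute: None instead of KeyError
fallback = p.get("id", "no-id")   # optional default value

print(classes, name, missing, fallback)
```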

Getting the content

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
print(soup.p.string)  # print the text inside the first p tag

Output:
The Dormouse's story
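
One thing to watch with .string: it only works when a tag has a single child. A tag with several children gives None, and a tag that holds only a comment gives a Comment object. A short sketch on a cut-down snippet (my own, parsed with html.parser):

```python
from bs4 import BeautifulSoup, Comment

html = (
    '<p class="story">Once upon a time '
    '<a id="link1"><!-- Elsie --></a> and '
    '<a id="link2">Lacie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

whole = soup.p.string                   # None: <p> has several children
first = soup.find(id="link1").string    # the comment inside the first link
second = soup.find(id="link2").string   # plain text of the second link
text = soup.p.get_text()                # all text under <p>, comments skipped

print(whole, isinstance(first, Comment), second)
print(text)
```

When a tag may contain mixed content, get_text() is usually the safer choice.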

Nested selection

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
print(soup.head.title.string)  # walk down level by level: the text of the title tag inside head

Output:
The Dormouse's story

Child and descendant nodes

html = '''
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
print(soup.p.contents)  # get all direct children of the p tag, returned as a list

Output:
['\n            Once upon a time there were three little sisters; and their names were\n            
', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, '\n            
and\n            ', <a class="sister" href="http://example.com/tillie" 
id="link3">Tillie</a>, '\n            and they lived at the bottom of a well.\n        ']



PS: here is another way to get the child nodes

html = '''
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
print(soup.p.children)  # .children is an iterator rather than a list, so loop over it to get the child nodes
for i, child in enumerate(soup.p.children):  # enumerate yields each child node together with its index
    print(i, child)

Output:
<list_iterator object at 0x0000028DA218BBE0>
0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4 
            and
            
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 
            and they lived at the bottom of a well.
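
As the output shows, whitespace between tags counts as a child node too. If only the tags are wanted, the children can be filtered by type. A small sketch, assuming a shortened snippet and html.parser:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """
<p class="story">
    Once upon a time
    <a id="link1"><span>Elsie</span></a>
    <a id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only Tag children and drop the text and whitespace nodes.
child_tags = [node for node in soup.p.children if isinstance(node, Tag)]
child_ids = [tag["id"] for tag in child_tags]

# .descendants also walks into nested tags, so it yields more nodes.
n_children = len(list(soup.p.children))
n_descendants = len(list(soup.p.descendants))

print(child_ids)
print(n_children, n_descendants)
```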

PS: getting the descendant nodes

html = '''
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
print(soup.p.descendants)  # again an iterator (a generator): all children and deeper descendants of the p tag, each listed separately
for i, child in enumerate(soup.p.descendants):
    print(i, child)

Output:
<generator object Tag.descendants at 0x0000015CFAA0A318>
0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <span>Elsie</span>
4 Elsie
5 

6 

7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9 
            and
            
10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12 
            and they lived at the bottom of a well.

Parent and ancestor nodes

html = '''
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
print(soup.a.parent)  # get the parent node of the first a tag

Output:
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>

html = '''
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
print(list(enumerate(soup.a.parents)))  # .parents yields the parent and every ancestor of the first a tag

Output:
[(0, <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>), (1, <body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
<p class="story">...</p>
</body>), (2, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
<p class="story">...</p>
</body></html>), (3, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
<p class="story">...</p>
</body></html>)]
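
The last two entries above print the same markup because the final ancestor is the BeautifulSoup object itself, which sits above the html tag. Printing only the names makes this easier to see. A minimal sketch on a shortened snippet of my own, parsed with html.parser:

```python
from bs4 import BeautifulSoup

html = "<html><body><p class='story'><a id='link1'>Elsie</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk upward from the first <a> and record each ancestor's name.
ancestors = list(soup.a.parents)
ancestor_names = [node.name for node in ancestors]

print(ancestor_names)         # ['p', 'body', 'html', '[document]']
print(ancestors[-1] is soup)  # True: the top ancestor is the soup object
```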

Sibling nodes

html = '''
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
'''
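from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
print(list(enumerate(soup.a.next_siblings)))  # get the siblings that come after the first a tag
print(list(enumerate(soup.a.previous_siblings)))  # get the siblings that come before the first a tag

Output: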
①:[(0, '\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), 
(2, '\n            and\n            '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n            and they 
lived at the bottom of a well.\n        ')]
②:[(0, '\n            Once upon a time there were three little sisters; and their names 
were\n 

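Standard selectors

find_all(name, attrs, recursive, text, **kwargs)   Note: much like findall in regular expressions, it finds every match and returns the results as a list
Documents can be searched by tag name, attributes, or text content

Searching by name (tag name):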

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
print(soup.find_all('ul'))  # pass the tag name ul to get every ul tag
print(type(soup.find_all('ul')[0]))  # check the type of the first ul tag in the result

Output:
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
for ul in soup.find_all('ul'):  # loop over every ul tag
    print(ul.find_all("li"))  # extract the li tags inside each ul tag

Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
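
Besides a single tag name, the name argument also accepts a list of names or a compiled regular expression, and limit and recursive narrow the search. A rough sketch on a condensed version of the markup, parsed with html.parser:

```python
import re

from bs4 import BeautifulSoup

html = (
    '<div><h4>Hello</h4>'
    '<ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul></div>'
)
soup = BeautifulSoup(html, "html.parser")

# A list matches any of the given tag names, in document order.
names = [tag.name for tag in soup.find_all(["h4", "li"])]

# limit stops the search after the given number of matches.
first_two = [li.get_text() for li in soup.find_all("li", limit=2)]

# A compiled regex is matched against each tag name.
starts_with_u = [tag.name for tag in soup.find_all(re.compile("^u"))]

# recursive=False looks only at direct children of the div.
direct_li = soup.div.find_all("li", recursive=False)

print(names, first_two, starts_with_u, direct_li)
```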

attrs: searching by attribute

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
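from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
print(soup.find_all(attrs={'id': 'list-1'}))  # attrs takes a dict: keys are attribute names, values are attribute values
print(soup.find_all(attrs={'name': 'elements'}))  # the result is a list of matching elements

Output: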
①:[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
②:[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

PS: another way to search by attribute

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
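from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
print(soup.find_all(id="list-1"))  # common attributes such as id can be passed directly as keyword arguments
print(soup.find_all(class_="element"))  # class is a Python keyword, so the argument is spelled class_

Output: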
①:
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
②:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li 
class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
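
Two details are easy to miss here: class_ matches when any one of a tag's classes equals the value, and attribute names that are not valid Python identifiers (data-* attributes, for example) have to go through the attrs dict. A short sketch; the data-role attribute is made up for illustration:

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1" data-role="primary"></ul>
<ul class="list list-small" id="list-2" data-role="secondary"></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches any single class, so "list" finds both ul tags.
by_list = [ul["id"] for ul in soup.find_all(class_="list")]
by_small = [ul["id"] for ul in soup.find_all(class_="list-small")]

# data-role cannot be written as a keyword argument, so use attrs.
by_data = [ul["id"] for ul in soup.find_all(attrs={"data-role": "secondary"})]

print(by_list, by_small, by_data)
```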

text: selecting by text content (Beautiful Soup 4.4.0 and later also accept this argument under the name string)

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
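from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
print(soup.find_all(text="Foo"))  # returns the matching text strings, not the tags that contain them

Output: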
['Foo', 'Foo']
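
Text search combines well with regular expressions and with a tag name, in which case tags are returned instead of strings. The sketch below assumes Beautiful Soup 4.4.0 or later, where the argument is spelled string; the Baz item is my own addition to make the pattern match twice.

```python
import re

from bs4 import BeautifulSoup

html = "<ul><li>Foo</li><li>Bar</li><li>Baz</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# Exact match on the text: returns the strings themselves.
exact = soup.find_all(string="Foo")

# Regex match on the text.
pattern = soup.find_all(string=re.compile("^Ba"))

# With a tag name as well, the matching tags are returned.
tags = soup.find_all("li", string=re.compile("^Ba"))

# Each matched string still knows which tag it lives in.
parent_names = [s.parent.name for s in exact]

print(exact, pattern, tags, parent_names)
```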

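find(name, attrs, recursive, text, **kwargs): used the same way as find_all; the only difference is what is returned

find returns a single element (the first match, or None when nothing matches); find_all returns all matches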

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
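from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
print(soup.find('ul'))  # the first ul tag
print(type(soup.find('ul')))  # the type of the result
print(soup.find('page'))  # no such tag exists, so the result is None

Output: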
①:
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
②:
<class 'bs4.element.Tag'>
③:
None
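
Because find returns None when nothing matches, chaining another call onto the result raises AttributeError. One way to guard against that is a tiny wrapper; text_of below is a hypothetical helper of my own, not a bs4 function.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul id="list-1"><li>Foo</li></ul>', "html.parser")


def text_of(soup, name, default=""):
    """Hypothetical helper: text of the first matching tag, or a default."""
    node = soup.find(name)
    return node.get_text(strip=True) if node is not None else default


found = text_of(soup, "li")
absent = text_of(soup, "page", default="(not found)")

print(found, absent)
```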

find_parents(): used the same way as find_all

Returns all ancestor nodes that match

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
primaryconsumers = soup.find_all(class_='element')  # first find all tags whose ancestors we want; with find() the next line would be unnecessary
primaryconsumer = primaryconsumers[0]  # take the first tag from the list
print(primaryconsumer.find_parents('div'))  # find every div ancestor of this tag, returned as a list

Output:
[<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>, <div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>]

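find_parent(): used the same way as find_all

Returns the closest ancestor that matches; in the example below that is the nearest div, not the li's direct parent ul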

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
primaryconsumer = soup.find(class_='element')  # locate the first tag whose class is element
print(primaryconsumer.find_parent('div'))  # find the nearest div ancestor of this tag

Output:
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
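
To make the difference between .parent, find_parent and find_parents concrete, here is a small comparison on a condensed version of the same markup (parsed with html.parser; the condensed snippet is my own):

```python
from bs4 import BeautifulSoup

html = (
    '<div class="panel"><div class="panel-body">'
    '<ul id="list-1"><li class="element">Foo</li></ul>'
    '</div></div>'
)
soup = BeautifulSoup(html, "html.parser")
li = soup.find(class_="element")

direct_parent = li.parent.name                      # the enclosing ul
nearest_div = li.find_parent("div")["class"]        # closest div ancestor
all_divs = [d["class"][0] for d in li.find_parents("div")]  # nearest first

print(direct_parent, nearest_div, all_divs)
```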

find_next_siblings():

Returns all matching siblings after the node

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
primaryconsumer = soup.find(class_='element')  # locate the first tag whose class is element
print(primaryconsumer.find_next_siblings('li'))  # find every li sibling after this tag, returned as a list

Output:
[<li class="element">Bar</li>, <li class="element">Jay</li>]

find_next_sibling():

Returns the first matching sibling after the node

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
primaryconsumer = soup.find(class_='element')  # locate the first tag whose class is element
print(primaryconsumer.find_next_sibling('li'))  # find the first li sibling after this tag

Output:
<li class="element">Bar</li>
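
It helps to contrast this with the plain .next_sibling attribute, which returns whatever node comes next, often just the whitespace between tags. A minimal sketch on a shortened list, parsed with html.parser:

```python
from bs4 import BeautifulSoup

html = "<ul>\n<li>Foo</li>\n<li>Bar</li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")
first = soup.li

raw_next = first.next_sibling              # the newline text node after </li>
next_li = first.find_next_sibling("li")    # skips ahead to the next li tag

print(repr(raw_next))
print(next_li)
```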

find_previous_siblings():

Returns all matching siblings before the node

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
primaryconsumers = soup.find_all(class_='element')  # find every tag whose class is element
primaryconsumer = primaryconsumers[2]  # take the third tag
print(primaryconsumer.find_previous_siblings('li'))  # find every li sibling before this tag (nearest first)

Output:
[<li class="element">Bar</li>, <li class="element">Foo</li>]

find_previous_sibling():

Returns the first matching sibling before the node

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
primaryconsumers = soup.find_all(class_='element')  # find every tag whose class is element
primaryconsumer = primaryconsumers[2]  # take the third tag
print(primaryconsumer.find_previous_sibling('li'))  # find the first li sibling before this tag

Output:
<li class="element">Bar</li>

find_all_next():

Returns all matching nodes after the node

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
primaryconsumer = soup.find("div")  # locate the first div tag
print(primaryconsumer.find_all_next('li'))  # return every li tag that comes after this tag

Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li 
class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]

find_next():

Returns the first matching node after the node

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
primaryconsumer = soup.find("div")  # locate the first div tag
print(primaryconsumer.find_next('li'))  # return the first li tag that comes after this tag

Output:
<li class="element">Foo</li>
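
The difference from find_all is where the search looks: find_all searches inside a tag, while find_all_next and find_next search everything that follows it in the document. A small sketch, starting from the h4 tag of a condensed snippet (my own, parsed with html.parser):

```python
from bs4 import BeautifulSoup

html = (
    '<div class="panel-heading"><h4>Hello</h4></div>'
    '<ul><li>Foo</li><li>Bar</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")
h4 = soup.h4

inside = h4.find_all("li")                                # nothing inside h4
after = [li.get_text() for li in h4.find_all_next("li")]  # everything later on
first_after = h4.find_next("li").get_text()

print(inside, after, first_after)
```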

find_all_previous():

Returns all matching tags before the node

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
primaryconsumer = soup.find(class_="list")  # locate the first tag whose class is list
print(primaryconsumer.find_all_previous('div'))  # return every div tag that comes before this tag

Output:
[<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>, <div class="panel-heading">
<h4>Hello</h4>
</div>, <div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>]

find_previous():

Returns the first matching tag before the node

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
primaryconsumer = soup.find(class_="list")  # locate the first tag whose class is list
print(primaryconsumer.find_previous('div'))  # return the first div tag that comes before this tag

Output:
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>

CSS selectors

Pass a CSS selector straight to select() to make a selection
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
print(soup.select('.panel .panel-heading'))  # class selectors are prefixed with .
print(soup.select('ul li'))  # tag names need no prefix
print(soup.select("#list-2 .element"))  # id selectors are prefixed with #
print(type(soup.select('ul')[0]))  # print the type of a selected node

Output:
①:
[<div class="panel-heading">
<h4>Hello</h4>
</div>]
②:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li 
class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
③:
[<li class="element">Foo</li>, <li class="element">Bar</li>]
④:
<class 'bs4.element.Tag'>  Note: because the result is a bs4.element.Tag, nested selection is possible

 

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
for ul in soup.select('ul'):  # nested selection, level by level
    print(ul.select('li'))

Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li 
class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
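
A few more selector forms are worth knowing: select_one returns only the first match (or None), > restricts the match to direct children, and pseudo-classes such as :nth-of-type work as well. A brief sketch on the two lists, assuming a reasonably recent bs4 and html.parser:

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

first_item = soup.select_one("#list-2 > li")     # first direct li child of #list-2
small_items = soup.select("ul.list.list-small li")  # ul carrying both classes
second_items = [li.get_text() for li in soup.select("li:nth-of-type(2)")]
missing = soup.select_one("#no-such-id")         # None when nothing matches

print(first_item, len(small_items), second_items, missing)
```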

PS: another approach

Getting attributes

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
for ul in soup.select('ul'):
    print(ul['id'])  # get the id attribute of the ul tag
    print(ul.attrs['id'])  # a different way to get the same value

Output:
list-1
list-1
list-2
list-2

Getting the content

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # create a BeautifulSoup object from html; the second argument selects the lxml parser
for li in soup.select('li'):  # loop over every li tag
    print(li.get_text())  # get_text() returns the text inside the tag

Output:
Foo
Bar
Jay
Foo
Bar

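Summary:

  • Prefer the lxml parser; fall back to html.parser when necessary
  • Tag selectors (soup.tag) offer only weak filtering, but they are fast
  • Use find() and find_all() to match a single result or multiple results
  • If you are comfortable with CSS selectors, use select()
  • Remember the common ways to get attributes and text values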