6. Using BeautifulSoup
- Beautiful Soup is a library for extracting data from HTML and XML documents.
6.1 Basic Usage
Method | Purpose |
---|---|
BeautifulSoup(html_doc, 'lxml') | build the bs object |
bs.prettify() | pretty-print the document |
bs.title | get a tag by tag name |
bs.title.name | get the tag's name |
bs.title.string | get the text inside the tag |
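The calls in the table can be tried end to end. The sketch below (not from the original notes) uses Python's built-in 'html.parser' so it runs without installing lxml:

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Demo</title></head><body><p>hello</p></body></html>"

# BeautifulSoup(doc, parser) builds the bs object
soup = BeautifulSoup(html, "html.parser")

print(soup.prettify())    # the document, pretty-printed with indentation
print(soup.title)         # <title>Demo</title>  (the whole tag)
print(soup.title.name)    # title  (the tag's name)
print(soup.title.string)  # Demo   (the text inside the tag)
```

The same calls work unchanged with the 'lxml' parser used in the rest of this section, once lxml is installed.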
6.2 bs4 Object Types
Object | Meaning |
---|---|
Tag | a tag |
NavigableString | a navigable string |
BeautifulSoup | the soup object itself |
Comment | a comment |
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
# print(type(soup)) # <class 'bs4.BeautifulSoup'>
#
# print(type(soup.title)) # <class 'bs4.element.Tag'>
# print(type(soup.a)) # <class 'bs4.element.Tag'>
# print(type(soup.p)) # <class 'bs4.element.Tag'>
#
# print(soup.p.string) # The Dormouse's story
# print(type(soup.p.string)) # <class 'bs4.element.NavigableString'>
p_tag = soup.p
print(p_tag)
print(p_tag.name)
print(p_tag.string)

# A comment inside a tag comes back as a Comment object
html_comment = '<a><!-- comment text here --></a>'
soup = BeautifulSoup(html_comment, 'lxml')
print(soup.a.string)
print(type(soup.a.string))  # <class 'bs4.element.Comment'>
6.3 Traversing the Tree: Child Nodes
In bs4 there are three kinds of operations: traversing, searching, and modifying.
- contents / children / descendants
  - contents returns a list
  - children returns an iterator that can be looped over
  - descendants returns a generator that walks every descendant
- .string / .strings / .stripped_strings
  - string gets the text inside a tag
  - strings returns a generator used to get the text of multiple tags
  - stripped_strings is like strings, but strips the extra whitespace
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
# Tag
# print(soup.title)
# print(soup.p)
# print(soup.p.b)
# print(soup.a)
# all_p = soup.find_all('p')
# print(all_p)

# Use [] to read a tag's attribute
title_tag = soup.p
print(title_tag['class'])  # ['title']

# contents returns a list
# children returns an iterator that can be looped over
#   (iteration = looping; in Python, for ... in ... visits the items
#   of an iterable one by one)
# descendants returns a generator that walks every descendant

# contents returns a list
# links = soup.contents
# print(type(links))  # <class 'list'>
# print(links)

# children returns an iterator
html = '''
<div>
<a href='#'>百度</a>
<a href='#'>阿里</a>
<a href='#'>腾讯</a>
</div>
'''
# We want the data under the div tag
soup2 = BeautifulSoup(html, 'lxml')
# links = soup2.div.children
# print(type(links))  # <class 'list_iterator'>
# for link in links:
#     print(link)

# descendants returns a generator that walks every descendant
# print(len(soup.contents))
# print(len(soup.descendants))  # TypeError: object of type 'generator' has no len()
# for x in soup.descendants:
#     print('----------------')
#     print(x)

# string gets the text inside a tag
# strings returns a generator used to get the text of multiple tags
# stripped_strings is like strings, but strips the extra whitespace
# title_tag = soup.title
# print(title_tag)
# print(title_tag.string)
# head_tag = soup.head
# print(head_tag.string)
# print(soup.html.string)

# strings = soup.strings
# print(strings)  # <generator object _all_strings at 0x000001D9053745C8>
# for s in strings:
#     print(s)
strings = soup.stripped_strings
for s in strings:
    print(s)
6.4 Traversing the Tree: Parent Nodes
- parent gets the direct parent node
- parents gets all ancestor nodes
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
# parent gets the direct parent node
# title_tag = soup.title
# print(title_tag)
# print(title_tag.parent)
# print(soup.html.parent)
# parents gets all ancestor nodes
a_tag = soup.a
# print(a_tag)
# print(a_tag.parents) # <generator object parents at 0x0000025F937E9678>
for x in a_tag.parents:
    print(x)
    print('----------------')
6.5 Traversing the Tree: Sibling Nodes
- next_sibling: the next sibling node
- previous_sibling: the previous sibling node
- next_siblings: all following sibling nodes
- previous_siblings: all preceding sibling nodes
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# html = '<a><b>bbb</b><c>ccc</c><a>'
soup = BeautifulSoup(html_doc,'lxml')
#
# # print(soup.prettify())
# b_tag = soup.b
# print(b_tag)
# print(b_tag.next_sibling)
# c_tag = soup.c
# # print(c_tag.next_sibling)
# print(c_tag.previous_sibling)
a_tag = soup.a
# print(a_tag)
for x in a_tag.next_siblings:
    print(x)
6.6 Searching the Tree
Filters that can be passed to the search methods:
- string filter
- regular-expression filter: compile a pattern with re.compile() and pass it to find() or find_all() to search with a regex
- list filter
- True filter
- function filter
from bs4 import BeautifulSoup
import re
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
# String filter
# a_tag2 = soup.a
# a_tags = soup.find_all('a')
# print(a_tags)

# Regular-expression filter: matches any tag whose name contains 't'
# (bs4 runs pattern.search against each tag name)
# print(soup.find_all(re.compile('t')))

# List filter: find both p tags and a tags
# print(soup.find_all(['p', 'a']))
# print(soup.find_all(['title', 'b']))

# True filter: matches every tag
# print(soup.find_all(True))

# Function filter: keep only tags that have a class attribute
def fn(tag):
    return tag.has_attr('class')

print(soup.find_all(fn))
6.7 Review
Method / Attribute | Purpose |
---|---|
soup.prettify() | pretty-print the source |
soup.title | the whole title tag |
soup.title.name | the tag's name |
soup.title.string | the tag's text |
soup.contents | returns a list |
soup.div.children | returns an iterator |
soup.descendants | returns a generator walking every descendant |
soup.string | gets the text of a single tag |
soup.strings | gets the text of all tags |
soup.stripped_strings | gets the text of all tags, with extra whitespace removed |
soup.a.parent | gets the a tag's parent node |
soup.a.previous_sibling | the previous sibling node |
soup.a.next_sibling | the next sibling node |
soup.a.next_siblings | all following sibling nodes |
soup.a.previous_siblings | all preceding sibling nodes |
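The traversal attributes in the table above can be checked on a two-paragraph snippet; this is a minimal sketch (not from the original notes) using the built-in 'html.parser':

```python
from bs4 import BeautifulSoup

html = "<div><p>one</p><p>two</p></div>"
soup = BeautifulSoup(html, "html.parser")

div = soup.div
print([t.name for t in div.children])  # ['p', 'p'] - direct children only
print(len(list(div.descendants)))      # 4: two <p> tags plus their two text nodes
print(list(soup.stripped_strings))     # ['one', 'two']

first_p = div.p
print(first_p.parent.name)             # div
print(first_p.next_sibling)            # <p>two</p>
```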
6.8 find
Function | Purpose |
---|---|
find('tag', class_='value') | find a single tag |
find_all() | find all matching tags |
find_parents() | search all ancestors |
find_parent() | search a single parent |
find_next_siblings() | search all following siblings |
find_next_sibling() | search a single following sibling |
find_previous_siblings() | search all preceding siblings |
find_previous_sibling() | search a single preceding sibling |
find_all_next() | search all elements after this one |
find_next() | search a single element after this one |

find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
name: the tag name
attrs: the tag's attributes
recursive: whether to search recursively
text: the text content
limit: cap on the number of results returned
**kwargs: keyword arguments, used as attribute filters
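Each parameter above can be exercised on a small, hypothetical snippet; a minimal sketch with the built-in 'html.parser' (recent bs4 releases prefer `string=` over `text=`, but both are accepted):

```python
from bs4 import BeautifulSoup

html = """
<div id="top">
  <a class="sister" href="#1">A</a>
  <span><a class="sister" href="#2">B</a></span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.find_all("a")))                        # 2 - name filter
print(len(soup.find_all(attrs={"class": "sister"})))  # 2 - attrs dict
print(len(soup.div.find_all("a", recursive=False)))   # 1 - div's direct children only
print(len(soup.find_all("a", limit=1)))               # 1 - cap on results
print(soup.find_all(text="A"))                        # ['A'] - text filter
print(soup.find_all(href="#2"))                       # keyword-argument filter
```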
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
# find_all(self, name=None, attrs={}, recursive=True, text=None,
#          limit=None, **kwargs)
# name: the tag name
# attrs: the tag's attributes
# recursive: whether to search recursively
# text: the text content
# limit: cap on the number of results returned
# **kwargs: keyword arguments, used as attribute filters
# a_tags = soup.find_all('a')
# p_tags = soup.find_all('p','title')
# print(soup.find_all(id = 'link1'))
# print(soup.find_all('a',limit=2))
# print(soup.a)
# print(soup.find('a'))
# print(soup.find_all('a',recursive=True))
# print(soup.find_all('a',limit=1)[0])
# print(soup.find('a'))
# find_parents() searches all ancestors
# find_parent() searches a single parent
# find_next_siblings() searches all following siblings
# find_next_sibling() searches a single following sibling
title_tag = soup.title
# print(title_tag.find_parent('head')) # <head><title>The Dormouse's story</title></head>
s = soup.find(text = 'Elsie')
# print(s.find_previous('p'))
# print(s.find_parents('p'))
# a_tag = soup.a
#
# # print(a_tag)
# #
# # print(a_tag.find_next_sibling('a'))
#
# print(a_tag.find_next_siblings('a'))
# find_previous_siblings() searches all preceding siblings
# find_previous_sibling() searches a single preceding sibling
# find_all_next() searches all elements after this one
# find_next() searches a single element after this one
a_tag = soup.find(id='link3')
# print(a_tag)
# print(a_tag.find_previous_sibling())
# print(a_tag.find_previous_siblings())
p_tag = soup.p
# print(p_tag.find_all_next())
print(p_tag.find_next('a'))
6.9 Modifying the Document Tree
- Modify a tag's name and attributes.
- Modify string: assigning to the string property replaces the original content.
- append(): add content to a tag, like Python's list .append() method.
- decompose(): delete a tag; useful for removing unnecessary parts of a document.
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
# 1. Modify a tag's name and attributes
# tag_p = soup.p
# print(tag_p)
# tag_p.name = 'w'            # change the name
# tag_p['class'] = 'content'  # change the attribute
# print(tag_p)

# 2. Modify string
tag_p = soup.p
# print(tag_p.string)
# tag_p.string = 'new text replaces the old'
# print(tag_p.string)

# 3. tag.append() adds content to a tag
# print(tag_p)
# tag_p.append('hahaha')
# print(tag_p)

# 4. decompose() deletes a tag
result = soup.find(class_='title')
result.decompose()
print(soup)
6.10 Scraping Data from the China Weather Site
- Key points
  - find('div', class_='conMidtab'): get an element by tag name and attribute
  - table.find_all('tr')[2:]: skip the first two tr (header) rows
  - enumerate returns two values: the index, and the element at that index
- BeautifulSoup supports several parsers; this example uses html5lib (html.parser and lxml are alternatives)
import requests
from bs4 import BeautifulSoup
# Parse one page of the weather table
def parse_page(url):
    response = requests.get(url)
    # Decode explicitly to avoid mojibake
    text = response.content.decode('utf-8')
    soup = BeautifulSoup(text, 'html5lib')  # pip install html5lib
    # Parse the page:
    # 1. the div with class="conMidtab"
    conMidtab = soup.find('div', class_='conMidtab')
    # print(conMidtab)
    # 2. the tables inside it
    tables = conMidtab.find_all('table')
    # print(tables)
    for table in tables:
        # print(table)
        # 3. the tr rows, skipping the first two header rows
        trs = table.find_all('tr')[2:]
        # enumerate returns two values: the index, and the element at that index
        for index, tr in enumerate(trs):
            # print(tr)
            tds = tr.find_all('td')
            # The first data row of each table starts with the province name,
            # so the city sits in the second td there
            city_td = tds[0]      # city
            if index == 0:
                city_td = tds[1]  # provincial capital
            # stripped_strings collects the text of all descendant nodes
            city = list(city_td.stripped_strings)[0]
            temp_td = tds[-2]
            temp = list(temp_td.stripped_strings)[0]
            print('City:', city, 'Temperature:', temp)
            # break  # print Beijing first
def main():
    # url = 'http://www.weather.com.cn/textFC/hb.shtml'   # North China
    # url = 'http://www.weather.com.cn/textFC/db.shtml'   # Northeast China
    # url = 'http://www.weather.com.cn/textFC/gat.shtml'  # Hong Kong, Macao and Taiwan
    urls = ['http://www.weather.com.cn/textFC/hb.shtml',
            'http://www.weather.com.cn/textFC/db.shtml',
            'http://www.weather.com.cn/textFC/gat.shtml']
    for url in urls:
        parse_page(url)

if __name__ == '__main__':
    main()