Day 32. Web Scraping Basics: BeautifulSoup

晶晶家的小可爱

Published 2021-04-12 22:32:41

Column: 100 Days With Python. Tags: python, xml, dom, html, css

Article link: https://blog.csdn.net/Tomandjava/article/details/115643862

Web Scraping Basics: BeautifulSoup

Contents

Web Scraping Basics: BeautifulSoup
Preface
1. Basic BeautifulSoup operations
2. Advanced BeautifulSoup operations
Summary

Preface

This article covers the basics of the BeautifulSoup library and its most common uses.

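1. Basic BeautifulSoup operations

1.1 Parsers

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; tolerant of malformed documents	Less tolerant of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; tolerant of malformed documents	Requires an external C library
lxml XML parser	BeautifulSoup(markup, "xml")	Fast; the only parser that supports XML	Requires an external C library
html5lib	BeautifulSoup(markup, "html5lib")	Most tolerant; parses documents the way a browser does; produces valid HTML5	Slow; requires installing an external Python package (html5lib)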
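
Because lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when they are missing. The sketch below is one possible way to do that; the helper name make_soup and the preference order are assumptions for illustration, not something from the original post. It relies on BeautifulSoup raising FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed.

    make_soup is a hypothetical helper, not part of bs4.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


# Deliberately malformed markup: neither <p> nor <b> is closed.
soup, used = make_soup("<p>Unclosed paragraph<b>bold")
print(used)               # name of the parser that was actually used
print(soup.p.get_text())  # text recovered from the repaired tree
```

Since html.parser ships with Python, the fallback should always succeed; only the speed and the exact repair strategy differ between parsers.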

1.2 Basic usage

from bs4 import BeautifulSoup

html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())  # pretty-print the parsed HTML (the parser has already closed the unclosed tags)
print(soup.title.string)  # text content of the <title> tag
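
One detail worth knowing about .string: it only returns a value when the tag has a single child, and that child may be a comment rather than ordinary text. The short sketch below uses a reduced version of the sample markup (the snippet variable is an assumption made for illustration) and the built-in html.parser so it runs without extra installs.

```python
from bs4 import BeautifulSoup, Comment

# A reduced, hypothetical version of the sample document above.
snippet = (
    '<p class="story">Once upon a time <a href="#">Lacie</a> lived.</p>'
    '<a id="c"><!-- Elsie --></a>'
)
soup = BeautifulSoup(snippet, "html.parser")

p = soup.p
print(p.string)      # None: <p> has several children, so .string is ambiguous
print(p.get_text())  # get_text() joins all nested text: Once upon a time Lacie lived.

a = soup.find("a", id="c")
# The only child of this <a> is a comment, so .string returns a Comment object.
print(isinstance(a.string, Comment))  # True
```

When a tag may contain nested markup, get_text() is usually the safer choice; .string is best kept for leaf tags such as title.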

1.3 Tag selectors

# Get the tag name
print(soup.title.name)

# Get an attribute
print(soup.p.attrs['name'])
print(soup.p['name'])

# Get the content
print(soup.p.string)  # the string inside the first <p> tag

# Nested selection
print(soup.head.title.string)  # text of a tag reached through nested access

# ++++++++++++++++++++++++++++++++++++++++++++++++++++++ #
# Child and descendant nodes

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
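print(soup.p.contents)  # all direct children of the first <p> (tags and text nodes), returned as a list


print(soup.p.children)  # an iterator over the same direct children
for i, child in enumerate(soup.p.children):
    print(i, child)  # iterate to print each child

print(soup.p.descendants)  # a generator over all descendants
for i, child in enumerate(soup.p.descendants):  # iterate over every descendant node
    print(i, child)

# Parent and ancestor nodes
print(soup.a.parent)
print(list(enumerate(soup.a.parents)))


# Sibling nodes
print(list(enumerate(soup.a.next_siblings)))  # siblings after this node (same level)
print(list(enumerate(soup.a.previous_siblings)))  # siblings before this node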
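
The navigation attributes above return text nodes as well as tags, so whitespace between tags shows up as children and siblings. The following sketch, which uses a trimmed copy of the sample markup and html.parser (both assumptions for illustration), shows one way to keep only the Tag objects.

```python
from bs4 import BeautifulSoup, Tag

html = """
<p class="story">
    Once upon a time
    <a id="link1">Elsie</a>
    <a id="link2">Lacie</a>
    and
    <a id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# .children yields text nodes too; keep only real tags.
tag_children = [c for c in soup.p.children if isinstance(c, Tag)]
print([c["id"] for c in tag_children])  # ['link1', 'link2', 'link3']

first = soup.find("a", id="link1")
# The immediate next sibling is the whitespace between the two <a> tags.
print(repr(first.next_sibling))
# find_next_sibling() skips text nodes and returns the next matching tag.
print(first.find_next_sibling("a")["id"])  # link2
# .parents walks up to the document root.
print([t.name for t in first.parents])  # ['p', '[document]']
```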

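2. Advanced BeautifulSoup operations

2.1 Standard selectors

find_all(name, attrs, recursive, text, **kwargs) searches the document by tag name, attributes, or text content. (In Beautiful Soup 4.4.0 and later the text argument is called string; text is the older name.)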

# name

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')

print(soup.find_all('ul'))  # search by tag name
print(type(soup.find_all('ul')[0]))

for ul in soup.find_all('ul'):
    print(ul.find_all('li'))

# ++++++++++++++++++++++++++++++++++++++++++++++++++++++ #

# attrs

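print(soup.find_all(attrs={'id': 'list-1'}))  # pass the attributes as a dict
print(soup.find_all(attrs={'name': 'elements'}))  # no tag has this name attribute, so this prints []

print(soup.find_all(id='list-1'))  # pass the attribute directly as a keyword argument
print(soup.find_all(class_='element'))  # class is a Python keyword, so use class_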

# ++++++++++++++++++++++++++++++++++++++++++++++++++++++ #

# text

print(soup.find_all(text='Foo'))  # match text nodes whose content is exactly 'Foo'
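
The remaining find_all arguments can be sketched in the same way. This example assumes Beautiful Soup 4.4.0 or later (it uses string rather than text), a shortened copy of the panel markup, and html.parser; it is an illustration of limit, recursive, and regular-expression matching rather than an exhaustive tour.

```python
import re

from bs4 import BeautifulSoup

html = """
<div class="panel">
  <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
  </ul>
  <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# limit stops the search after the given number of matches.
first_two = soup.find_all("li", limit=2)
print([li.get_text() for li in first_two])  # ['Foo', 'Bar']

# recursive=False only looks at direct children of the tag.
print(len(soup.div.find_all("ul", recursive=False)))  # 2
print(soup.div.find_all("li", recursive=False))       # []: the <li> tags are grandchildren

# A compiled regex can be used to match tags by their text.
b_or_j = soup.find_all("li", string=re.compile(r"^[BJ]"))
print([li.get_text() for li in b_or_j])  # ['Bar', 'Jay', 'Bar']

# class_ matches a tag if any one of its classes equals the value.
print([ul["id"] for ul in soup.find_all("ul", class_="list-small")])  # ['list-2']
```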

find(name, attrs, recursive, text, **kwargs): find returns the first matching element (or None if nothing matches), while find_all returns a list of all matching elements.

print(soup.find('ul'))
print(type(soup.find('ul')))
print(soup.find('page'))
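
Because find returns None for a missing tag, chaining an attribute access straight onto it raises AttributeError. A small guard avoids that; the helper name first_text and its default parameter are hypothetical, chosen only for this sketch.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul id="list-1"><li>Foo</li></ul>', "html.parser")


def first_text(soup, name, default=""):
    """Return the text of the first <name> tag, or default if there is none.

    first_text is a hypothetical helper, not part of bs4.
    """
    tag = soup.find(name)
    return tag.get_text(strip=True) if tag is not None else default


print(first_text(soup, "li"))                         # Foo
print(first_text(soup, "page", default="<missing>"))  # <missing>
```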

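find_parents() and find_parent()
find_parents() returns all matching ancestor nodes; find_parent() returns the direct parent (or, when a filter is given, the nearest matching ancestor).
find_next_siblings() and find_next_sibling()
find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.
find_previous_siblings() and find_previous_sibling()
find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.
find_all_next() and find_next()
find_all_next() returns all matching nodes after the current node; find_next() returns the first matching one.
find_all_previous() and find_previous()
find_all_previous() returns all matching nodes before the current node; find_previous() returns the first matching one.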
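
A short sketch of these relative searches, starting from the "Bar" list item in a shortened copy of the panel markup. It assumes Beautiful Soup 4.4.0 or later (for the string argument) and uses html.parser. Note that find_all_previous() also returns enclosing tags, since their opening tags come earlier in the document.

```python
from bs4 import BeautifulSoup

html = """
<div class="panel">
  <div class="panel-heading"><h4>Hello</h4></div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
bar = soup.find("li", string="Bar")

# Ancestors: nearest match first.
print(bar.find_parent("ul")["id"])                       # list-1
print([d["class"] for d in bar.find_parents("div")])     # [['panel-body'], ['panel']]

# Siblings: same level, before or after the current node.
print(bar.find_next_sibling("li").get_text())            # Jay
print(bar.find_previous_sibling("li").get_text())        # Foo

# Document order: anything parsed before or after the current node.
print(bar.find_next("li").get_text())                    # Jay
print(bar.find_previous("h4").get_text())                # Hello
print([d["class"] for d in bar.find_all_previous("div")])
```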

2.2 CSS selectors

Pass a CSS selector string directly to select() to pick out the matching elements.

print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))

for ul in soup.select('ul'):
    print(ul.select('li'))

# Get attributes
for ul in soup.select('ul'):
    print(ul['id'])  # subscript access
    print(ul.attrs['id'])  # same value via the attrs dict

# Get the content
for li in soup.select('li'):
    print(li.get_text())  # get_text() returns the text inside the tag
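
A few more selector forms can be sketched against a shortened copy of the same markup, using html.parser. select_one() is the single-result counterpart of select() and returns None when nothing matches; the selectors chosen here are only examples.

```python
from bs4 import BeautifulSoup

html = """
<div class="panel">
  <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
  </ul>
  <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first match instead of a list.
print(soup.select_one("#list-2")["class"])  # ['list', 'list-small']

# The child combinator > restricts the match to direct children.
print([li.get_text() for li in soup.select("ul#list-1 > li.element")])  # ['Foo', 'Bar', 'Jay']

# Chained class selectors require every listed class to be present.
print([ul["id"] for ul in soup.select("ul.list.list-small")])  # ['list-2']

# Attribute selectors: ids that start with "list-".
print([ul["id"] for ul in soup.select('ul[id^="list-"]')])  # ['list-1', 'list-2']

# No match: select() gives an empty list, select_one() gives None.
print(soup.select("li.missing"), soup.select_one("li.missing"))  # [] None
```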