Python Web Scraping Basics, Part 4

Huelse

Article link: https://blog.csdn.net/u011532601/article/details/94560494

I. Common Data Formats
II. Python Parsers
III. BeautifulSoup4
- 1. Traversing the document tree
- 2. Searching the document tree

I. Common Data Formats

XML
<name>Huelse</name>
HTML
<html></html>
JSON
{"name": "Huelse"}
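
As a rough illustration of how the three formats above differ in practice, the following sketch reads each sample with the Python standard library only. The `TagCollector` class is a hypothetical helper written for this example; it is not part of the original article.

```python
import json
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

xml_text = "<name>Huelse</name>"
json_text = '{"name": "Huelse"}'
html_text = "<html></html>"

# XML: the root element is <name>, and its text is the payload.
xml_name = ET.fromstring(xml_text).text

# JSON: decodes directly into a Python dict.
json_name = json.loads(json_text)["name"]


class TagCollector(HTMLParser):
    """Hypothetical helper that records every opening tag it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# HTML: feed the markup to the event-based parser and collect tag names.
collector = TagCollector()
collector.feed(html_text)
collector.close()

print(xml_name, json_name, collector.tags)
```

JSON maps straight onto Python dicts and lists, while XML and HTML have to be walked as trees, which is where a library such as BeautifulSoup becomes convenient.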

II. Python Parsers

For speed, the lxml HTML parser is usually the first choice, followed by Python's built-in html.parser.

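III. BeautifulSoup4

Why use BeautifulSoup4

BeautifulSoup provides a set of methods for searching the document tree, so we can quickly locate the data we want to scrape. Recall the re module used earlier: it scans the raw text from start to end, and we had to combine greedy and non-greedy patterns to pull out the data we needed. The whole process is tedious, and the patterns break easily when the markup changes.
BeautifulSoup accepts re patterns as search filters and adds more powerful, higher-level tools on top. We can match the data we want quickly, which improves both scraping efficiency and development efficiency.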
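
To make the comparison concrete, here is a minimal sketch that extracts the same links once with re and once with BeautifulSoup. The shortened `html_doc` and the variable names are assumptions made for this example, and the built-in `html.parser` is used so that no extra parser needs to be installed.

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""

# Regex approach: a non-greedy pattern that depends on href being
# the first attribute written inside the <a> tag.
links_re = re.findall(r'<a href="(.*?)"', html_doc)

# BeautifulSoup approach: ask for <a> tags and read the attribute,
# regardless of how the attributes are ordered in the markup.
soup = BeautifulSoup(html_doc, "html.parser")
links_bs = [a["href"] for a in soup.find_all("a")]

print(links_re)
print(links_bs)
```

Both lists are identical here, but the regex would stop matching if a page wrote `class` before `href`, while the BeautifulSoup version would keep working.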

Installation
- Parser: pip install lxml
- Parsing library: pip install beautifulsoup4
Usage
from bs4 import BeautifulSoup
soup_res = BeautifulSoup(html_doc, 'html.parser')  # bundled with Python
or
soup_res = BeautifulSoup(html_doc, 'lxml')  # faster
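
If a script has to run on machines where lxml may not be installed, one possible approach is to pick the parser at runtime. The sketch below is only an illustration: `pick_parser` is a hypothetical helper name, and the one-line `html_doc` is made up for the example.

```python
import importlib.util

from bs4 import BeautifulSoup


def pick_parser():
    """Hypothetical helper: prefer lxml when installed, else the built-in parser."""
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"


html_doc = "<html><head><title>Demo</title></head><body><p>hi</p></body></html>"

parser = pick_parser()
soup_res = BeautifulSoup(html_doc, parser)

# Which parser was chosen depends on the local environment.
print(parser, soup_res.title.string)
```

Different parsers can build slightly different trees from malformed markup, so it is worth testing with the parser that will be used in production.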

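1. Traversing the document tree

- Direct access by tag name
- Getting a tag's name
- Getting a tag's attributes
- Getting a tag's content
- Nested selection
- Children and descendants
- Parent and ancestors
- Siblings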

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="sister"><b>$37</b></p>
<p class="story" id="p">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" >Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# html.parser, the parser bundled with Python
# soup = BeautifulSoup(html_doc, 'html.parser')

soup_res = BeautifulSoup(html_doc, 'lxml')

'''
Traversing the document tree
'''
# 1. Direct access: returns the first matching tag
print(soup_res.a)
print(soup_res.p)

# 2. Get the tag's name
print(soup_res.a.name)

# 3. Get the tag's attributes
print(soup_res.a.attrs)  # all attributes of the first <a> tag, as a dict
print(soup_res.a.attrs['href'])

# 4. Get the tag's content
print(soup_res.a.text)

# 5. Nested selection
print(soup_res.html.body.p)

# 6. Children and descendants
print(soup_res.p.children)  # returns an iterator object
print(list(soup_res.p.children))

# 7. Parent and ancestors
print(soup_res.b.parent)
print(soup_res.b.parents)  # returns a generator object
# print(list(soup_res.b.parents))

# 8. Siblings
print(soup_res.a.next_sibling)  # the next sibling node
print(soup_res.a.next_siblings)  # all following siblings, returned as a generator
# print(list(soup_res.a.next_siblings))


print(soup_res.a.previous_sibling)  # the previous sibling node
print(soup_res.a.previous_siblings)  # all preceding siblings, returned as a generator
# print(list(soup_res.a.previous_siblings))
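
The script above lists descendants as a topic but only prints `.children`. The following sketch, using a shortened document invented for this example, shows how `.children`, `.descendants` and `.parents` differ, and how `.get()` reads an attribute that may be missing. It uses the built-in `html.parser`; the variable names are assumptions.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = (
    '<html><body><p id="p">Hi '
    '<a href="/x" class="sister"><b>Elsie</b></a></p></body></html>'
)
soup = BeautifulSoup(html_doc, "html.parser")
p = soup.p

# .children yields only direct children (text nodes included),
# so text nodes are filtered out here to keep tags only.
child_tags = [c.name for c in p.children if isinstance(c, Tag)]

# .descendants walks the whole subtree below the tag.
descendant_tags = [d.name for d in p.descendants if isinstance(d, Tag)]

# .parents climbs from the nearest ancestor up to the BeautifulSoup object.
ancestor_names = [t.name for t in soup.b.parents]

# .get() returns None for a missing attribute instead of raising KeyError.
missing_id = soup.a.get("id")

print(child_tags)
print(descendant_tags)
print(ancestor_names)
print(missing_id)
```

Note that `.children`, `.descendants` and `.parents` are lazy, which is why the original script wraps them in `list()` before printing.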

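2. Searching the document tree

find() returns the first match; find_all() returns every match.
You can search by tag and by attribute:

Tags

- String filter: matches a string exactly
  - name: matches the tag name
  - attrs: matches attributes
  - text: matches the text content
- Regex filter: matches with a pattern from the re module
- List filter: matches any item in the list
- Bool filter: True matches everything
- Function filter: for searches that require some attributes and exclude others

Attributes

- class_
- id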
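
The difference between find() and find_all() also shows up in what they return when nothing matches. The sketch below uses a small document made up for this example and the built-in `html.parser`; the variable names are assumptions.

```python
from bs4 import BeautifulSoup

html_doc = '<p class="story"><a id="link2">Lacie</a><a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html_doc, "html.parser")

first = soup.find(name="a")                # a single Tag: the first match
every = soup.find_all(name="a")            # a list-like ResultSet of all matches
capped = soup.find_all(name="a", limit=1)  # stop after the first match

nothing = soup.find(name="table")          # None when nothing matches
empty = soup.find_all(name="table")        # an empty ResultSet

print(first["id"], len(every), len(capped))
print(nothing, list(empty))
```

Because find() can return None, it is safer to check the result before accessing attributes on it.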

import re
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="sister"><b>$37</b></p>
<p class="story" id="p">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" >Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# html.parser, the parser bundled with Python
# soup = BeautifulSoup(html_doc, 'html.parser')

soup_res = BeautifulSoup(html_doc, 'lxml')

'''
Searching the document tree
find() returns the first match; find_all() returns every match
'''

'''
String filter
'''
# Find the first <p> tag
print(soup_res.find(name='p'))
# print(soup_res.find_all(name='p'))

# name + attrs
print(soup_res.find(name='p', attrs={'id': 'p'}))

# name + string (older bs4 releases call this argument text)
print(soup_res.find(name='title', string="The Dormouse's story"))

# name + class
print(soup_res.find(name='a', attrs={'class': 'sister'}))

print('-' * 50)
'''
Regex filter
'''
# Match with the re module
# Find tags whose name contains the letter a
print(soup_res.find(name=re.compile('a')))

print(soup_res.find_all(name=re.compile('a')))

# Find tags whose id contains "link"
print(soup_res.find(attrs={'id': re.compile('link')}))
print(soup_res.find_all(attrs={'id': re.compile('link')}))

print('-' * 50)
'''
List filter
'''
# Match any item in the list
print(soup_res.find(name=['a', 'b', 'html', re.compile('a')]))
print(soup_res.find_all(name=['a', 'b', 'html', re.compile('a')]))

print('-' * 50)
'''
Function filter
'''


# print(soup_res.find_all(name=function_object))
def foo(tag):
    # Match <a> tags that have a class attribute but no id attribute
    return tag.name == 'a' and tag.has_attr('class') and not tag.has_attr('id')


print(soup_res.find_all(name=foo))

# Search by attribute: class_ and id
print(soup_res.find(class_='sister'))
print(soup_res.find(id='link3'))
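
The bool filter appears in the list of filters but not in the script above. A minimal sketch of how it might be used follows; the shortened `html_doc` and the variable names are assumptions made for this example, and the built-in `html.parser` is used.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story" id="p">
<a href="http://example.com/elsie" class="sister">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# name=True matches every tag and skips text nodes.
all_tag_names = [tag.name for tag in soup.find_all(name=True)]

# id=True matches any tag that has an id attribute, whatever its value.
ids = [tag["id"] for tag in soup.find_all(id=True)]

print(all_tag_names)
print(ids)
```

Passing True for an attribute is a compact way to say "this attribute must exist", which covers part of what the function filter does in the script above.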
