Python Web Scraping: bs4 from Beginner to Mastery (Part 1)
Introduction to BeautifulSoup4
Beautiful Soup is a Python library for extracting data from HTML or XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree, and it can save you hours or even days of work.
1. It is used to parse data.
2. Different parsing approaches handle websites with different page structures (a short sketch comparing the three follows this list):
Regex: match the data with regular expressions; relatively complex.
XPath: path-based syntax; the node relationships can be hard to work out.
bs4: the find() and find_all() methods.
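As a quick comparison, here is a minimal sketch, using a tiny HTML snippet of my own, that extracts the text of the first <a> tag with each of the three approaches (re is in the standard library; lxml and bs4 are installed in the quick-start section below):

import re
from lxml import etree
from bs4 import BeautifulSoup

html = '<p>Hello <a href="http://example.com">Example</a> world</p>'

# 1. Regex: works, but fragile against attribute order, nesting and whitespace
match = re.search(r'<a[^>]*>(.*?)</a>', html)
print(match.group(1))                # Example

# 2. XPath (lxml): concise, but you need to know the node paths
tree = etree.HTML(html)
print(tree.xpath('//a/text()')[0])   # Example

# 3. bs4: find()/find_all() read closest to plain English
soup = BeautifulSoup(html, 'lxml')
print(soup.find('a').string)         # Example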
Basic Concepts
Beautiful Soup is a web page information extraction library that can pull data out of HTML or XML files.
Source Code Analysis
When browsing the library's source in an IDE (e.g. PyCharm's structure view), the abbreviations mean:
c class: a class
m method: a method
f field: a field
p: a method decorated with @property, which can be used like an attribute (see the tiny example after this list)
v variable: a variable
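A tiny illustrative example (my own class, not part of bs4) of what the "p" entry means: a method decorated with @property is accessed without parentheses, as if it were an attribute.

class Page:
    def __init__(self, html):
        self._html = html

    @property
    def length(self):            # defined as a method...
        return len(self._html)

page = Page('<html></html>')
print(page.length)               # ...but used like an attribute: prints 13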
bs4 Quick Start
1. Installation
First install lxml: pip install lxml
Then install bs4: pip install bs4 (the library is published on PyPI as beautifulsoup4, so pip install beautifulsoup4 installs the same thing)
2. Import the module
from bs4 import BeautifulSoup
3. Create a soup object
soup = BeautifulSoup(html_doc, 'lxml')
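The second argument chooses the parser. The examples here use lxml; as a minimal fallback sketch, Python's built-in 'html.parser' also works without any extra install:

from bs4 import BeautifulSoup

html_doc = "<html><head><title>The Dormouse's story</title></head></html>"

# 'lxml' is fast and lenient but requires the lxml package;
# 'html.parser' ships with the standard library
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.title.string)   # The Dormouse's story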
bs4 Object Types
● Tag: a tag
● NavigableString: a navigable string
● BeautifulSoup: the soup object (the whole document)
● Comment: a comment
Code demo with detailed comments
# 1. Import the module
from bs4 import BeautifulSoup
"""
These are just for awareness:
● Tag: a tag
● NavigableString: a navigable string
● BeautifulSoup: the soup object
● Comment: a comment
"""
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<span><!--comment注释内容举例--></span>
"""
# 2. Create the soup object
soup = BeautifulSoup(html_doc, features='lxml')
# print(type(soup.title)) # <class 'bs4.element.Tag'>  a tag
# print('-' * 30)
# print(type(soup.a)) # <class 'bs4.element.Tag'>  a tag
# print('-' * 30)
# print(type(soup.p)) # <class 'bs4.element.Tag'>  a tag
# print('-' * 30)
# print(type(soup.body)) # <class 'bs4.element.Tag'>  a tag
# print('-' * 30)
# print(type(soup.title.string)) # <class 'bs4.element.NavigableString'>  a navigable string
# print('-' * 30)
# print(type(soup)) # <class 'bs4.BeautifulSoup'>  the soup object
# print('-' * 30)
# print(type(soup.span.string)) # <class 'bs4.element.Comment'>  a comment
Traversing the Document Tree
contents, children, descendants
● contents returns a list of all direct child nodes
● children returns an iterator over the direct child nodes
● descendants returns a generator that walks all descendants recursively
Code demo with detailed comments
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<span><!--comment注释内容举例--></span>
"""
"""
contents children descendants
● contents 返回的是一个所有子节点的列表
● children 返回的是一个子节点的迭代器通
● descendants 返回的是一个生成器遍历子子孙孙
"""
soup = BeautifulSoup(html_doc, 'lxml')
head = soup.head
a = soup.a
# print(head.contents) # [<title>The Dormouse's story</title>]  returns a list of all direct child nodes
# print('-' * 30)
# for i in head.contents:
#     print(i)
# print(head.children) # <list_iterator object at 0x000001B13C39B708>  returns an iterator over the direct child nodes
# print('-' * 30)
# for i in head.children:  # (anything that is an iterator can be looped over)
#     print(i)
# print(head.descendants) # <generator object Tag.descendants at 0x000001B13C456648>  returns a generator that walks all descendants
# print('-' * 30)
# for i in a.descendants:
#     print(i)
# Note: newline characters are also treated as child nodes and show up in the results
html = soup.html
# print(html.contents)
# print(html.descendants)
"""
[<head><title>The Dormouse's story</title></head>, '\n', <body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<span><!--comment注释内容举例--></span>
</body>]
"""
# for h in html.descendants:
#     print(h)
string, strings, stripped_strings
● string gets the text inside a single tag
● strings returns a generator used to get the text of multiple tags
● stripped_strings is basically the same as strings, but it also strips extra whitespace
Code demo with detailed comments
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<span><!--comment注释内容举例--></span>
"""
"""
contents children descendants
● contents 返回的是一个所有子节点的列表
● children 返回的是一个子节点的迭代器通
● descendants 返回的是一个生成器遍历子子孙孙
"""
soup = BeautifulSoup(html_doc, 'lxml')
head = soup.head
a = soup.a
"""
需要重点掌握
string strings stripped_strings
● string获取标签里面的内容
● strings 返回是一个生成器对象用过来获取多个标签内容
● stripped_strings 和strings基本一致 但是它可以把多余的空格去掉
"""
# string: gets the text inside a single tag
# print(soup.title.string)
# strings: returns a generator used to get the text of multiple tags
html = soup.html
# print(html.strings) # <generator object Tag._all_strings at 0x000001E54C0D85C8>
# for i in html.strings:
#     print(i)
# stripped_strings: basically the same as strings, but it also strips extra whitespace
# print(html.stripped_strings) # <generator object PageElement.stripped_strings at 0x000001A59E8A85C8>
# for i in html.stripped_strings:
#     print(i)
'''
<generator object PageElement.stripped_strings at 0x000001D66FDA6648>
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
,
Lacie
and
Tillie
;
and they lived at the bottom of a well.
...
'''
parent and parents
● parent gets the direct parent node
● parents gets all ancestor nodes (as a generator)
Code demo with detailed comments
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<span><!--comment注释内容举例--></span>
"""
"""
contents children descendants
● contents 返回的是一个所有子节点的列表
● children 返回的是一个子节点的迭代器通
● descendants 返回的是一个生成器遍历子子孙孙
"""
soup = BeautifulSoup(html_doc, 'lxml')
head = soup.head
a = soup.a
"""
遍历文档树 遍历父节点
parent 和 parents
● parent直接获得父节点
● parents获取所有的父节点
"""
title = soup.title
# parent: find the direct parent node
# print(title.parent)
# parents: returns a generator
# print(title.parents) # <generator object PageElement.parents at 0x0000018B3E8C8548>
# for p in title.parents:
#     print(p)
'''
1. First the direct parent of title is found: <head><title>The Dormouse's story</title></head>
2. Then the parent of that parent (the parent of head):
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<span><!--comment注释内容举例--></span>
</body></html>
3. Finally the parent of that node, i.e. the parent of html, which is the whole document, so the same markup is printed once more:
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<span><!--comment注释内容举例--></span>
</body></html>
'''
# The parent of html is the entire document (the BeautifulSoup object itself)
find() and find_all()  [Key Topic]
● find_all() returns every matching tag in a list
● find() returns only the first match
● find_all() parameters:
● name: the tag name
● attrs: the tag's attributes
● recursive: whether to search recursively
● text: match by text content (called string in newer bs4 versions)
● limit: limit the number of results returned
● kwargs: keyword arguments such as id=... or class_=... (a sketch exercising these parameters follows the demo below)
Code demo with detailed comments
from bs4 import BeautifulSoup
# html_doc is the page source
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')
# String filter
# Get all the a tags; find_all() returns every match in a list
# a_list = soup.find_all('a')
# print(a_list)
# print('-' * 70)
# for a in a_list:
#     print(a)
#     print('-' * 70)
# print(soup.find('a'))  # find() returns only the first match
# Find the title node and the p nodes
# result = soup.find_all('title', 'p')  # returns []: the second positional argument is attrs, not another tag name
result = soup.find_all(['title', 'p'])  # pass a list to match several tag names at once
print(result)
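The demo above only exercises the name filter. As a minimal sketch run against the same soup object, the remaining parameters listed earlier (attrs, keyword arguments, limit, string/text, recursive) look like this:

# attrs: filter by attributes passed as a dict
print(soup.find_all('a', attrs={'class': 'sister'}))
# keyword arguments: the same filter written as kwargs (class needs a trailing underscore)
print(soup.find_all('a', class_='sister', id='link1'))
# limit: stop after the first N matches
print(soup.find_all('a', limit=2))
# string (older versions call it text): match by the text content
print(soup.find_all(string='Elsie'))
# recursive=False: only search direct children instead of all descendants
print(soup.html.find_all('head', recursive=False))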
Practice Example: Reviewing bs4
from bs4 import BeautifulSoup
html = """
<table class="tablelist" cellpadding="0" cellspacing="0">
<tbody>
<tr class="h">
<td class="l" width="374">职位名称</td>
<td>职位类别</td>
<td>人数</td>
<td>地点</td>
<td>发布时间</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=33824&keywords=python&tid=87&lid=2218">22989-金融云区块链高级研发工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=29938&keywords=python&tid=87&lid=2218">22989-金融云高级后台开发</a></td>
<td>技术类</td>
<td>2</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=31236&keywords=python&tid=87&lid=2218">SNG16-腾讯音乐运营开发工程师(深圳)</a></td>
<td>技术类</td>
<td>2</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=31235&keywords=python&tid=87&lid=2218">SNG16-腾讯音乐业务运维工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-25</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=34531&keywords=python&tid=87&lid=2218">TEG03-高级研发工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=34532&keywords=python&tid=87&lid=2218">TEG03-高级图像算法研发工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=31648&keywords=python&tid=87&lid=2218">TEG11-高级AI开发工程师(深圳)</a></td>
<td>技术类</td>
<td>4</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=32218&keywords=python&tid=87&lid=2218">15851-后台开发工程师</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=32217&keywords=python&tid=87&lid=2218">15851-后台开发工程师</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>
<tr class="odd">
<td class="l square"><a id="test" class="test" target='_blank' href="position_detail.php?id=34511&keywords=python&tid=87&lid=2218">SNG11-高级业务运维工程师(深圳)</a></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2017-11-24</td>
</tr>
</tbody>
</table>
"""
soup = BeautifulSoup(html, 'lxml')
# 1. Get all the tr nodes
# tr_list = soup.find_all('tr')
# for tr in tr_list:
#     print(tr)
#     print('-'*74)
# 2. Get the second tr node
# print(tr_list[1])
# 3. Find all tr nodes with class="even"
# Option 1: use the class_ keyword argument
# class_list = soup.find_all('tr', class_='even')
# for c in class_list:
#     print(c)
#     print('-'*74)
# Option 2: pass the attributes as a dict
# class_list = soup.find_all('tr', attrs={'class': 'even'})
# for c in class_list:
#     print(c)
#     print('-' * 74)
# 4. Locate the a tag with id="test"
# a_list = soup.find_all('a', id="test")
# for a in a_list:
#     print(a)
#     print('-'*74)
# a_lists = soup.find_all('a', attrs={'id': "test", 'class': 'test'})
# for a in a_lists:
#     print(a)
#     print('-' * 74)
# 5. Get the href attribute value of every a tag
# a_list = soup.find_all('a')
# for a in a_list:
#     # The first form is recommended
#     print(a.get('href'))
#     # print(a.attrs['href'])
#     # print(a['href'])
# 6. Get all the position names
# The first tr is the table header, so skip it
tr_list = soup.find_all('tr')[1:]
for tr in tr_list:
    a = tr.find('a')
    print(a.string)  # the position name is the text of the a tag
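Building on the same table, a short sketch (my own extension of the example, assuming the same soup object as above) that collects every column of each data row into a dict:

jobs = []
for tr in soup.find_all('tr')[1:]:       # skip the header row
    tds = tr.find_all('td')
    jobs.append({
        'name': tds[0].string,           # position name (text of the <a> in the first cell)
        'category': tds[1].string,       # position category
        'headcount': tds[2].string,      # number of openings
        'location': tds[3].string,       # location
        'published': tds[4].string,      # publish date
    })

for job in jobs:
    print(job)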