Lecture 2: The Beautiful Soup Library
I. Beautiful Soup Basics
1. An Introductory Example
# First, fetch the page
>>>import requests
>>>r = requests.get('https://python123.io/ws/demo.html')
>>>r.status_code
200
>>>demo = r.text
>>>print(demo)
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>
# Then parse it with BeautifulSoup
>>>from bs4 import BeautifulSoup
>>>soup = BeautifulSoup(demo,'html.parser')
>>>print(soup.prettify())
<html>
<head>
<title>
This is a python demo page
</title>
</head>
<body>
<p class="title">
<b>
The demo python introduces several python courses.
</b>
</p>
<p class="course">
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
Basic Python
</a>
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
Advanced Python
</a>
.
</p>
</body>
</html>
The core usage of the Beautiful Soup library is just two lines of code:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>','html.parser') # BeautifulSoup() parses the input; it takes two arguments
# the first argument, '<p>data</p>', is the HTML text to parse
# the second argument, 'html.parser', names the parser to use
2. Basic Elements of BeautifulSoup
(1) HTML and BeautifulSoup
A BeautifulSoup object corresponds to the entire content of an HTML/XML document. There are two ways to create one:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>','html.parser')
soup2 = BeautifulSoup(open('D:/demo.html'),'html.parser')
Four parsers are available:
Parser | Usage | Requirement |
---|---|---|
bs4's HTML parser | BeautifulSoup(mk,'html.parser') | install the bs4 library |
lxml's HTML parser | BeautifulSoup(mk,'lxml') | pip install lxml |
lxml's XML parser | BeautifulSoup(mk,'xml') | pip install lxml |
html5lib's parser | BeautifulSoup(mk,'html5lib') | pip install html5lib |
After parsing, every kind of HTML tag in the document can be accessed through the corresponding soup.tag attribute.
When the document contains several tags of the same kind, soup.tag returns only the first one.
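A minimal sketch of this first-match behavior, using a made-up two-link snippet rather than the demo page:

```python
from bs4 import BeautifulSoup

# Two <a> tags of the same kind: soup.a yields only the first one
html = '<a id="link1">Basic Python</a><a id="link2">Advanced Python</a>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.a['id'])  # link1
```

To get every occurrence rather than just the first, use .find_all(), covered later in this lecture.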
(2) The five basic element classes of BeautifulSoup
Element | Description |
---|---|
Tag | A tag, the most basic unit of information, delimited by <> and </>; soup.<tag> retrieves the corresponding tag's content |
Name | The tag's name; e.g. the name of <p>…</p> is 'p'; accessed as <tag>.name |
Attributes | The tag's attributes, organized as a dictionary; every tag has zero or more attributes; accessed as <tag>.attrs |
NavigableString | The non-attribute string inside a tag, i.e. the text between <> and </>; accessed as <tag>.string |
Comment | A comment within a tag's string; a special Comment type |
An element's type can be inspected with type(soup.<tag>).
Diagram of the basic elements:
# Tag: returns the whole tag, including its content
>>>soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
# Name: returns the tag's name
>>>soup.a.name
'a'
>>>soup.a.parent.name
'p'
# Attributes: returns a dict, which can be indexed further
>>> soup.a.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>>soup.a.attrs['class']
['py1']
>>>type(soup.a.attrs)
<class 'dict'>
# NavigableString: returns the string inside the tag
>>>soup.p.string
'The demo python introduces several python courses.'
>>>type(soup.p.string)
<class 'bs4.element.NavigableString'>
# Comment: when retrieving strings with <tag>.string, comments are NOT filtered out; they are returned too, typed as Comment
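A minimal sketch of this behavior, using a made-up snippet that contains an HTML comment:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

newsoup = BeautifulSoup('<b><!--This is a comment--></b><p>This is not a comment</p>',
                        'html.parser')
print(newsoup.b.string)        # the comment text still comes back via .string
print(type(newsoup.b.string))  # <class 'bs4.element.Comment'>
print(type(newsoup.p.string))  # <class 'bs4.element.NavigableString'>
# To filter comments out, test the type explicitly:
print(isinstance(newsoup.b.string, Comment))  # True
```

Since Comment is a subclass of NavigableString, an isinstance check like the one above is the usual way to tell the two apart.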
3. Three Ways to Traverse HTML Content
The basic structure of an HTML document is a tree:
The HTML tree can be traversed in three directions (differing in traversal order):
- downward traversal
- upward traversal
- sibling traversal
(1) Downward traversal
Attribute | Description |
---|---|
.contents | A list of child nodes; all direct children are stored in the list |
.children | An iterator over child nodes, similar to .contents; used to loop over direct children |
.descendants | An iterator over all descendant nodes; used to loop over every descendant |
# Traversal examples
>>>soup.body
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body>
# .contents returns all the direct children of a tag
>>>soup.body.contents
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
# .contents is a list, so it can be indexed
>>>soup.body.contents[3]
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
# Common traversal templates
# 1 loop over the direct children
for child in soup.body.children:
    print(child)
# 2 loop over all descendants
for child in soup.body.descendants:
    print(child)
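The difference between .children and .descendants shows up on a tiny made-up tree: .children stops at the direct children, while .descendants walks the whole subtree:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<body><p><b>text</b></p></body>', 'html.parser')
children = [c.name for c in soup.body.children]
# string nodes have name None, so filter them out to list only tags
descendants = [d.name for d in soup.body.descendants if d.name is not None]
print(children)     # ['p']
print(descendants)  # ['p', 'b']
```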
(2) Upward traversal
Attribute | Description |
---|---|
.parent | The node's parent tag |
.parents | An iterator over the node's ancestor tags; used to loop over ancestors |
# Upward traversal example
soup = BeautifulSoup(demo,'html.parser')
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
# Output
p
body
html
[document]
(3) Sibling traversal
Attribute | Description |
---|---|
.next_sibling | The next sibling node in HTML document order |
.previous_sibling | The previous sibling node in HTML document order |
.next_siblings | An iterator over all following sibling nodes in document order |
.previous_siblings | An iterator over all preceding sibling nodes in document order |
# Given that soup.a.parent is:
<p class="course">
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.
</p>
# 1 loop over the siblings after <a>
for sibling in soup.a.next_siblings:
    print(sibling)
# Output
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
# 2 loop over the siblings before <a>
for sibling in soup.a.previous_siblings:
    print(sibling)
# Output
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
Note that in the tag tree, strings are also nodes: in the example above, ' and ' and 'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:' are both string nodes.
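When only tag siblings are wanted, those string nodes can be filtered out with an isinstance check; a minimal sketch on a made-up snippet:

```python
import bs4
from bs4 import BeautifulSoup

html = '<p><a id="link1">Basic</a> and <a id="link2">Advanced</a>.</p>'
soup = BeautifulSoup(html, 'html.parser')
# keep only Tag siblings, skipping the ' and ' and '.' string nodes
tags_only = [s['id'] for s in soup.a.next_siblings
             if isinstance(s, bs4.element.Tag)]
print(tags_only)  # ['link2']
```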
4. Pretty-Printing HTML
The .prettify() method makes an HTML page easier to read: it appends a newline '\n' after every tag and every string (strings are nodes too), so the document prints clearly. For example:
>>>soup.prettify()
'<html>\n <head>\n <title>\n This is a python demo page\n </title>\n </head>\n <body>\n <p class="title">\n <b>\n The demo python introduces several python courses.\n </b>\n </p>\n <p class="course">\n Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n Basic Python\n </a>\n and\n <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">\n Advanced Python\n </a>\n .\n </p>\n </body>\n</html>'
>>>print(soup.prettify())
<html>
<head>
<title>
This is a python demo page
</title>
</head>
<body>
<p class="title">
<b>
The demo python introduces several python courses.
</b>
</p>
<p class="course">
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
Basic Python
</a>
and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
Advanced Python
</a>
.
</p>
</body>
</html>
A single tag can also be prettified on its own:
>>>print(soup.a.prettify())
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
Basic Python
</a>
II. Information Markup and Extraction
1. Three Ways to Mark Up Information
Markup | Characteristics | Comparison | Typical use |
---|---|---|---|
XML | Organizes information with tags, similar to HTML | The earliest general-purpose markup language; highly extensible but verbose | Information exchange and transfer on the Internet |
JSON | Typed key:value pairs; objects can be nested, and values can be lists of pairs; no comments | Values are typed, well suited to processing by programs (e.g. JavaScript); more concise than XML | Communication between mobile apps, the cloud, and nodes; no comments |
YAML | Untyped key-value pairs; indentation expresses nesting; '-' marks parallel items; '#' marks comments | Values are untyped; highest proportion of plain text; very readable | Configuration files for all kinds of systems; commented and easy to read |
Examples of the three formats:
# XML
<person>
    <firstName>Tian</firstName>
    <lastName>Song</lastName>
    <address>
        <streetAddr>中关村南大街5号</streetAddr>
        <city>北京市</city>
        <zipcode>100081</zipcode>
    </address>
    <prof>Computer System</prof><prof>Security</prof>
</person>
# JSON
{
    "firstName" : "Tian",
    "lastName" : "Song",
    "address" : {
        "streetAddr" : "中关村南大街5号",
        "city" : "北京市",
        "zipcode" : "100081"
    },
    "prof" : [ "Computer System", "Security" ]
}
# YAML
firstName : Tian
lastName : Song
address :
    streetAddr : 中关村南大街5号
    city : 北京市
    zipcode : 100081
prof :
    - Computer System
    - Security
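Because JSON values are typed, a program can load them straight into native data structures; a minimal sketch with Python's standard json module, reusing fields from the example above:

```python
import json

text = '{"firstName": "Tian", "lastName": "Song", "prof": ["Computer System", "Security"]}'
data = json.loads(text)   # parses the JSON text into a Python dict
print(type(data['prof'])) # the JSON array becomes a Python list
print(data['prof'][1])    # Security
```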
2. Searching for Content
(1) The .find_all() method
<tag>.find_all(name, attrs, recursive, string, **kwargs)
Returns a list containing the search results.
- name: a search string for tag names; pass the name of the tag you want to find
- attrs: a search string for tag attribute values; a plain string here matches against the tag's class attribute
- recursive: whether to search all descendants; defaults to True; if False, only direct children are searched
- string: a search string for the text inside <>…</>
Code examples:
### 1 name
## 1.1 search by tag name
>>>soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
# regular expressions also work
>>>import re # re is the regular-expression library
>>>for tag in soup.find_all(re.compile('b')): # re.compile('b') matches any tag name containing 'b'
    print(tag.name)
# output: 'b' and 'body' are both tag names containing 'b'
body
b
### 2 attrs
# e.g. for the following <a> tags; note that soup(...) is shorthand for soup.find_all(...)
>>>soup('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
## adding an attribute value restricts the search; here only <a> tags whose class is 'py1' match
>>>soup('a','py1')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>>soup('a',re.compile('py'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
### 3 recursive
>>>soup.find_all('a') # searches all descendant nodes for <a>
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>>soup.find_all('a',recursive=False) # searches only the direct children for <a>
[]
### 4 keyword arguments such as id and string can also be used
>>>soup(id='link1')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>>soup(id=re.compile('link'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>>soup(string=re.compile('python'))
['This is a python demo page', 'The demo python introduces several python courses.']
(2) Extended methods
Method | Description |
---|---|
<>.find() | Searches and returns only the first result; same arguments as .find_all() |
<>.find_parents() | Searches the ancestor nodes; returns a list; same arguments as .find_all() |
<>.find_parent() | Returns the first result among ancestor nodes; same arguments as .find() |
<>.find_next_siblings() | Searches the following siblings; returns a list; same arguments as .find_all() |
<>.find_next_sibling() | Returns the first result among following siblings; same arguments as .find() |
<>.find_previous_siblings() | Searches the preceding siblings; returns a list; same arguments as .find_all() |
<>.find_previous_sibling() | Returns the first result among preceding siblings; same arguments as .find() |
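A minimal sketch of a few of these methods on a made-up snippet; .find() returns a single tag where .find_all() returns a list:

```python
from bs4 import BeautifulSoup

html = '<p class="course"><a id="link1">Basic</a> and <a id="link2">Advanced</a></p>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('a')['id'])                 # link1: the first match only
print(len(soup.find_all('a')))              # 2: all matches, as a list
print(soup.a.find_parent().name)            # p: the nearest ancestor
print(soup.a.find_next_sibling('a')['id'])  # link2: the next matching sibling
```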
III. A Complete Example: Crawling and Processing Information
Step 1: fetch the university-ranking page from the network, getHTMLText()
Step 2: extract the information from the page into a suitable data structure, fillUnivList()
Step 3: display and output the results from the data structure, printUnivList()
#CrawUnivRankingA.py
import requests
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):  # keep only nodes of type bs4.element.Tag, skipping string nodes
            tds = tr('td')  # collect the row's <td> tags into the list tds
            ulist.append([tds[0].string, tds[1].string, tds[3].string])  # store the rank, university name, and score

def printUnivList(ulist, num):
    print("{:^10}\t{:^6}\t{:^10}".format("排名","学校名称","总分"))
    for i in range(num):
        u = ulist[i]
        print("{:^10}\t{:^6}\t{:^10}".format(u[0],u[1],u[2]))

def main():
    uinfo = []
    url = 'https://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)  # top 20 universities

main()
The Chinese columns in the output do not line up; the improved version below fixes the alignment:
#CrawUnivRankingB.py
import requests
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])

def printUnivList(ulist, num):
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    print(tplt.format("排名","学校名称","总分",chr(12288)))
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0],u[1],u[2],chr(12288)))

def main():
    uinfo = []
    url = 'https://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)  # 20 univs

main()
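The key change is chr(12288), the full-width CJK space U+3000, used as the fill character for the university-name column: each Chinese character occupies the width of one full-width space (an ASCII space is only half that), so padding with U+3000 keeps the columns aligned. A minimal sketch with an illustrative, made-up row:

```python
# {1:{3}^10} centers field 1 in width 10, taking its fill character from field 3
tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
header = tplt.format("排名", "学校名称", "总分", chr(12288))
row = tplt.format("1", "清华大学", "95.9", chr(12288))  # illustrative data, not real output
print(header)
print(row)
```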