Web Scraping, Lecture 2: The Beautiful Soup Library

I. Beautiful Soup Basics

1. An introductory example

# First, fetch the page with requests
>>>import requests   
>>>r = requests.get('https://python123.io/ws/demo.html')
>>>r.status_code
200
>>>demo = r.text
>>>print(demo)
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>

# Then parse it with BeautifulSoup
>>>from bs4 import BeautifulSoup
>>>soup = BeautifulSoup(demo,'html.parser')
>>>print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

Using the BeautifulSoup library boils down to two lines of code:

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>','html.parser') # parse with BeautifulSoup(), which takes two arguments
# the first argument, '<p>data</p>', is the HTML text to parse
# the second argument, 'html.parser', names the parser to use

2. Basic elements of BeautifulSoup

(1) HTML and BeautifulSoup

A BeautifulSoup object corresponds to the entire content of an HTML/XML document. There are two ways to create one:

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>','html.parser')
soup2 = BeautifulSoup(open('D://demo.html'),'html.parser')


There are four parsers to choose from:

Parser              Usage                               Requirement
bs4's HTML parser   BeautifulSoup(mk, 'html.parser')    install the bs4 library
lxml's HTML parser  BeautifulSoup(mk, 'lxml')           pip install lxml
lxml's XML parser   BeautifulSoup(mk, 'xml')            pip install lxml
html5lib's parser   BeautifulSoup(mk, 'html5lib')       pip install html5lib
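The parser mostly matters for malformed HTML, since each parser repairs a broken document differently. A minimal sketch using the built-in 'html.parser' (lxml and html5lib are drop-in replacements once installed; the markup here is a made-up broken snippet):

```python
from bs4 import BeautifulSoup

# '<p>data' is missing its closing tag; html.parser closes it automatically
soup = BeautifulSoup("<p>data", "html.parser")
print(soup.p)         # <p>data</p>
print(soup.p.string)  # data

# With lxml or html5lib installed, only the second argument changes:
# soup = BeautifulSoup("<p>data", "lxml")
# soup = BeautifulSoup("<p>data", "html5lib")
```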

See an HTML tag reference manual for the full list of tags.

After BeautifulSoup parses a document, every kind of HTML tag has a corresponding soup.tag attribute.

When the document contains more than one tag of the same kind, soup.tag returns only the first one.
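A quick illustration of this first-match behavior, on a small hypothetical snippet rather than the demo page:

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">one</a><a id="link2">two</a></p>'
soup = BeautifulSoup(html, "html.parser")

print(soup.a['id'])             # link1 -- soup.a returns only the first <a>
print(len(soup.find_all('a')))  # 2 -- find_all returns every <a>
```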

(2) The five basic element types of the BeautifulSoup class

Element          Description
Tag              a tag, the basic unit of information, delimited by <> and </>; soup.tag extracts the corresponding tag's content
Name             the tag's name; the name of <p>…</p> is 'p'; format: <tag>.name
Attributes       the tag's attributes, organized as a dict; every tag has zero or more attributes; format: <tag>.attrs
NavigableString  the non-attribute string inside a tag, i.e. the text between <> and </>; format: <tag>.string
Comment          a comment string inside a tag; a special Comment type

You can inspect an element's type with type(soup.<tag>).

[Figure: schematic of the five basic element types]

# Tag returns the tag with all of its content
>>>soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

# Name returns the tag's name
>>>soup.a.name
'a'
>>>soup.a.parent.name
'p'

# Attributes returns a dict, so it can be indexed further
>>> soup.a.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>>soup.a.attrs['class']
['py1']
>>>type(soup.a.attrs)
<class 'dict'>

# NavigableString returns the string inside the tag
>>>soup.p.string
'The demo python introduces several python courses.'
>>>type(soup.p.string)
<class 'bs4.element.NavigableString'>

# Comment: when extracting text with <tag>.string, comments are NOT filtered out; they are returned as well, typed as Comment
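A minimal sketch of this Comment pitfall on a made-up snippet: .string returns comment text just like ordinary text, and only the type tells them apart:

```python
from bs4 import BeautifulSoup
import bs4

newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>",
                        "html.parser")
print(newsoup.b.string)        # This is a comment
print(type(newsoup.b.string))  # <class 'bs4.element.Comment'>
print(newsoup.p.string)        # This is not a comment
print(type(newsoup.p.string))  # <class 'bs4.element.NavigableString'>

# Comments can be filtered out by checking the type:
if not isinstance(newsoup.b.string, bs4.element.Comment):
    print(newsoup.b.string)
```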

3. Three ways to traverse HTML content

The basic structure of an HTML document is a tree:


An HTML tree can be traversed in three directions (visiting nodes in different orders):

  • downward traversal
  • upward traversal
  • sideways (sibling) traversal

(1) Downward traversal

Attribute     Description
.contents     a list of direct children; all child nodes stored in a list
.children     an iterator over direct children, similar to .contents, for looping over child nodes
.descendants  an iterator over all descendant nodes, for looping over the whole subtree
# Traversal examples
>>>soup.body
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body>

# .contents returns all direct children of a tag
>>>soup.body.contents
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']

# .contents is a list, so it can be indexed
>>>soup.body.contents[3]
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>

# Common traversal templates
# 1 loop over direct children
for child in soup.body.children:
    print(child)
# 2 loop over all descendants
for child in soup.body.descendants:
    print(child)
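Because the whitespace between tags also counts as a child node (note the '\n' entries in the .contents output above), traversal loops often filter for Tag nodes. A small sketch on a hypothetical snippet:

```python
from bs4 import BeautifulSoup
import bs4

html = "<body>\n<p>first</p>\n<p>second</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")

# .children yields NavigableString nodes ('\n') as well as Tag nodes,
# so keep only the Tags with an isinstance check
names = [child.name for child in soup.body.children
         if isinstance(child, bs4.element.Tag)]
print(names)  # ['p', 'p']
```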

(2) Upward traversal

Attribute  Description
.parent    the node's parent tag
.parents   an iterator over the node's ancestor tags, for looping over ancestors
# Upward traversal
soup = BeautifulSoup(demo,'html.parser')
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
# Output
p
body
html
[document]

(3) Sideways (sibling) traversal


Attribute           Description
.next_sibling       the next sibling tag in HTML document order
.previous_sibling   the previous sibling tag in HTML document order
.next_siblings      an iterator over all following siblings in document order
.previous_siblings  an iterator over all preceding siblings in document order
# Recall that soup.a.parent is:
<p class="course">
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> 
and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.
</p>

# 1 loop over the siblings after <a>
for sibling in soup.a.next_siblings:
	print(sibling)
# Output
and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>

# 2 loop over the siblings before <a>
for sibling in soup.a.previous_siblings:
	print(sibling)
# Output
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

Note that in the tag tree, strings are also nodes: in the example above, ' and ' and 'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:' are two string nodes.

4. Formatted output of HTML

The .prettify() method renders an HTML page in a friendlier form: it appends a newline '\n' after every tag and every string (strings count as nodes too), so the tree prints clearly. Example:

>>>soup.prettify()
'<html>\n <head>\n  <title>\n   This is a python demo page\n  </title>\n </head>\n <body>\n  <p class="title">\n   <b>\n    The demo python introduces several python courses.\n   </b>\n  </p>\n  <p class="course">\n   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n    Basic Python\n   </a>\n   and\n   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">\n    Advanced Python\n   </a>\n   .\n  </p>\n </body>\n</html>'

>>>print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

It can also be applied to a single tag:

>>>print(soup.a.prettify())
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>

II. Information Markup and Extraction Methods

1. The three information-markup formats

Format  Characteristics                                                                            Typical use
XML     tag-based, similar to HTML; the earliest general-purpose markup language; highly extensible but verbose                 information exchange and transfer on the Internet
JSON    typed key:value pairs; supports nested objects and lists of pairs; no comments; typed data suits programs (e.g. JavaScript); more concise than XML     communication between mobile apps, the cloud, and nodes; no comments
YAML    untyped key:value pairs; indentation expresses nesting; '-' marks parallel items; '#' marks comments; highest proportion of readable text              configuration files for all kinds of systems; supports comments, easy to read

Examples of the three formats:

# XML
<person>
	<firstName>Tian</firstName>
	<lastName>Song</lastName>
	<address>
			<streetAddr>中关村南大街5号</streetAddr>
			<city>北京市</city>
			<zipcode>100081</zipcode>
	</address>
	<prof>Computer System</prof><prof>Security</prof>
</person>
# JSON
{
    "firstName": "Tian",
    "lastName": "Song",
    "address": {
        "streetAddr": "中关村南大街5号",
        "city": "北京市",
        "zipcode": 100081
    },
    "prof": ["Computer System", "Security"]
}
# YAML
firstName : Tian
lastName : Song
address :
    streetAddr : 中关村南大街5号
    city : 北京市
    zipcode : 100081
prof :
    - Computer System
    - Security
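Of the three formats, JSON is the easiest to consume from Python, since the standard library ships a parser (YAML would need the third-party PyYAML package). A sketch parsing the record above:

```python
import json

text = '''{
  "firstName": "Tian",
  "lastName": "Song",
  "address": {"streetAddr": "中关村南大街5号", "city": "北京市", "zipcode": 100081},
  "prof": ["Computer System", "Security"]
}'''

record = json.loads(text)            # parse JSON text into nested dicts/lists
print(record["firstName"])           # Tian
print(record["address"]["zipcode"])  # 100081
print(record["prof"][1])             # Security
```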

2. Searching content

(1) The .find_all() method

<tag>.find_all(name, attrs, recursive, string, **kwargs) returns a list containing the search results.

  • name: a string (or regular expression) matched against tag names; pass the tag name you want to find
  • attrs: a string matched against tag attribute values; a bare string matches the class attribute
  • recursive: whether to search all descendants; default True; if False, only direct children are searched
  • string: a string matched against the text regions between <> and </>

Code examples:

### 1 name
## 1.1 search by tag name
>>>soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

# A regular expression also works
>>>import re        # re is the regular-expression library
>>>for tag in soup.find_all(re.compile('b')):  # re.compile('b') matches any tag name containing 'b'
	    print(tag.name)
# Output: 'body' and 'b' are both tag names containing 'b'
body
b
### 2 attrs
# e.g. the document holds the following <a> tags; soup(...) is shorthand for soup.find_all(...)
>>>soup('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

## adding an attrs value narrows the search; a bare string matches the class attribute, so this finds <a> tags with class='py1'
>>>soup('a','py1')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]

>>>soup('a',re.compile('py'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
### 3 recursive
>>>soup.find_all('a')  # search all descendants for <a>
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

>>>soup.find_all('a',recursive=False)  # search only direct children; <html> has no direct <a> child, so nothing is found
[]
### 4 keyword arguments such as id and string also work
>>>soup(id='link1')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]

>>>soup(id=re.compile('link'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

>>>soup(string=re.compile('python'))
['This is a python demo page', 'The demo python introduces several python courses.']

(2) Related methods

Method                       Description
<>.find()                    search and return the first result only; same parameters as .find_all()
<>.find_parents()            search among ancestors; returns a list; same parameters as .find_all()
<>.find_parent()             return one result from the ancestors; same parameters as .find()
<>.find_next_siblings()      search among following siblings; returns a list; same parameters as .find_all()
<>.find_next_sibling()       return one result from the following siblings; same parameters as .find()
<>.find_previous_siblings()  search among preceding siblings; returns a list; same parameters as .find_all()
<>.find_previous_sibling()   return one result from the preceding siblings; same parameters as .find()
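A minimal sketch of the single-result variants, on a hypothetical snippet modeled on the demo page:

```python
from bs4 import BeautifulSoup

html = ('<p class="course"><a id="link1">Basic Python</a> and '
        '<a id="link2">Advanced Python</a>.</p>')
soup = BeautifulSoup(html, "html.parser")

first = soup.find('a')                     # one tag, not a list
print(first['id'])                         # link1
print(first.find_parent().name)            # p
print(first.find_next_sibling('a')['id'])  # link2
```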

III. Example: Crawling and Processing Information

Step 1: fetch the university-ranking page from the network, using getHTMLText()
Step 2: extract the information into a suitable data structure, using fillUnivList()
Step 3: display the results from that data structure, using printUnivList()

#CrawUnivRankingA.py
import requests
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):  # keep only nodes of type 'bs4.element.Tag'
            tds = tr('td')   # collect every <td> tag of this row into the list tds
            ulist.append([tds[0].string, tds[1].string, tds[3].string])  # store the rank, name, and score

def printUnivList(ulist, num):
    print("{:^10}\t{:^6}\t{:^10}".format("排名","学校名称","总分"))
    for i in range(num):
        u=ulist[i]
        print("{:^10}\t{:^6}\t{:^10}".format(u[0],u[1],u[2]))
    
def main():
    uinfo = []
    url = 'https://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20) # 20 univs
main()

In the output, the Chinese columns do not line up; the version below fixes the alignment by padding with the full-width space chr(12288):

#CrawUnivRankingB.py
import requests
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])

def printUnivList(ulist, num):
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    print(tplt.format("排名","学校名称","总分",chr(12288)))
    for i in range(num):
        u=ulist[i]
        print(tplt.format(u[0],u[1],u[2],chr(12288)))
    
def main():
    uinfo = []
    url = 'https://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20) # 20 univs
main()