Python Web Crawling and Information Extraction (Beijing Institute of Technology MOOC): Study Notes 2

Latest recommended article published 2023-03-23 21:24:53

陆空生

Category: Study Notes. Tags: python, html

Article link: https://blog.csdn.net/weixin_43754153/article/details/105604973

Python Web Crawling and Information Extraction (Basics, Part 2)

Getting Started with the Beautiful Soup Library

Installing Beautiful Soup

pip install beautifulsoup4

A quick test that Beautiful Soup installed correctly
Fetch the source code of the page https://python123.io/ws/demo.html

>>> import requests
>>> r=requests.get("http://python123.io/ws/demo.html")
>>> r.text
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> demo=r.text
>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(demo,"html.parser")
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

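Using Beautiful Soup:

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>', 'html.parser')
# BeautifulSoup is a class
# '<p>data</p>' is the HTML to parse; 'html.parser' names the parser

Basic Elements of the Beautiful Soup Library

An HTML file can be viewed as a tag tree.
Beautiful Soup is a library for parsing, traversing, and maintaining that tag tree.
The library is also known as beautifulsoup4 or bs4.
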
from bs4 import BeautifulSoup

import bs4

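HTML <-> tag tree <-> BeautifulSoup class
A BeautifulSoup object corresponds to the entire content of one HTML/XML document.
Parsers available to Beautiful Soup:

Parser	Usage	Requirement
Python's built-in HTML parser (used through bs4)	BeautifulSoup(mk, 'html.parser')	Install bs4
lxml's HTML parser	BeautifulSoup(mk, 'lxml')	pip install lxml
lxml's XML parser	BeautifulSoup(mk, 'xml')	pip install lxml
html5lib's parser	BeautifulSoup(mk, 'html5lib')	pip install html5lib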
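
Which parser is usable depends on what is installed. The following is a minimal sketch, assuming you prefer lxml when it is present and otherwise fall back to html.parser. The helper name make_soup is hypothetical and not part of bs4; FeatureNotFound is the exception bs4 raises when a requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The requested parser is not installed; use the built-in one.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>data</p>")
print(soup.p.string)
```

Either parser yields the same p tag here; on malformed HTML the parsers can build different trees, so it is worth fixing one choice per project.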

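Basic elements of the BeautifulSoup class:

Element	Description
Tag	A tag, the most basic unit of information organization; opened with <> and closed with </>
Name	The tag's name; the name of <p>…</p> is 'p'. Syntax: <tag>.name
Attributes	The tag's attributes, organized as a dictionary. Syntax: <tag>.attrs
NavigableString	The non-attribute string inside a tag, i.e. the text between <> and </>. Syntax: <tag>.string
Comment	The comment portion of the text inside a tag; a special kind of NavigableString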

>>> soup.title  # get the title tag of the HTML document
<title>This is a python demo page</title>
>>> tag = soup.a  # get the first a tag in the HTML document
>>> tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

>>> soup.a.name  # name of the a tag
'a'
>>> soup.a.parent.name  # name of the a tag's parent
'p'
>>> soup.a.parent.parent.name  # name of the p tag's parent
'body'

>>> tag = soup.a
>>> tag.attrs  # attributes of the a tag
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> tag.attrs['class']  # value stored under the 'class' key
['py1']
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'

>>> type(tag.attrs)
<class 'dict'>  # tag attributes are a dict; a tag with no attributes has an empty dict
>>> type(tag)
<class 'bs4.element.Tag'>

>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.string  # string inside the a tag
'Basic Python'
>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.p.string  # string inside the p tag
'The demo python introduces several python courses.'  # the nested b tag itself is not part of the result
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>  # so .string can reach through a nested tag

>>> newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
>>> # the text inside <!-- --> is a comment
>>> newsoup.b.string
'This is a comment'
>>> type(newsoup.b.string)  # the b tag holds a comment, but .string alone gives no sign of that
<class 'bs4.element.Comment'>
>>> newsoup.p.string
'This is not a comment'
>>> type(newsoup.p.string)
<class 'bs4.element.NavigableString'>
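
Because .string returns comments and ordinary text in the same way, code that wants only visible text has to check the type. A minimal sketch, assuming a hypothetical helper named visible_string (not a bs4 function):

```python
from bs4 import BeautifulSoup
from bs4.element import Comment


def visible_string(tag):
    """Return tag.string as plain text, or None if it is missing or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser"
)
print(visible_string(newsoup.b))
print(visible_string(newsoup.p))
```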

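Traversing HTML Content with bs4

HTML is text with a tree structure.
There are three ways to traverse it:
1. Downward, from the root toward the leaves
2. Upward, from a leaf toward the root
3. Sideways, among siblings
Downward traversal of the tag tree:

Attribute	Description
.contents	List of child nodes; stores all of <tag>'s children in a list
.children	Iterator over child nodes; like .contents, used to loop over the children
.descendants	Iterator over all descendant nodes; used to loop over every descendant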

>>> soup.head
<head><title>This is a python demo page</title></head>
>>> soup.head.contents
[<title>This is a python demo page</title>]
>>> soup.body.contents  # includes string nodes as well as tag nodes
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
>>> len(soup.body.contents)
5
>>> soup.body.contents[1]
<p class="title"><b>The demo python introduces several python courses.</b></p>

>>> for child in soup.body.children:  # loop over the child nodes
...     print(child)
...


<p class="title"><b>The demo python introduces several python courses.</b></p>


<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>


>>> for descendant in soup.body.descendants:  # loop over all descendant nodes
...     print(descendant)
...


<p class="title"><b>The demo python introduces several python courses.</b></p>
<b>The demo python introduces several python courses.</b>
The demo python introduces several python courses.


<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
Basic Python
 and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
Advanced Python
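
The traversals above mix tags with the '\n' strings between them. One possible way to keep only the tags is an isinstance check; the HTML below is a shortened stand-in for the demo page, not its exact source.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    "<html><head><title>demo</title></head>\n"
    "<body>\n"
    '<p class="title"><b>bold text</b></p>\n'
    '<p class="course">text <a id="link1">Basic Python</a></p>\n'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Direct children of <body> that are tags (newline strings are skipped).
child_tags = [c.name for c in soup.body.children if isinstance(c, Tag)]
# Every tag anywhere below <body>, in document order.
descendant_tags = [d.name for d in soup.body.descendants if isinstance(d, Tag)]

print(child_tags)
print(descendant_tags)
```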

Upward traversal of the tag tree:

Attribute	Description
.parent	The node's parent tag
.parents	Iterator over the node's ancestor tags; used to loop over ancestors

>>> for parent in soup.a.parents:
...     if parent is None:
...             print(parent)
...     else:
...             print(parent.name)
...
p
body
html
[document]
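# The soup object itself has no parent (soup.parent is None), so the loop includes a branch for a None parent.
# In practice .parents stops at [document] without yielding None, so that branch only acts as a safeguard.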

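Sideways (sibling) traversal of the tag tree

Attribute	Description
.next_sibling	The next sibling node in HTML document order
.previous_sibling	The previous sibling node in HTML document order
.next_siblings	Iterator over all following sibling nodes in HTML document order
.previous_siblings	Iterator over all preceding sibling nodes in HTML document order

Condition for sibling traversal: it only happens among nodes under the same parent. A sibling may be a NavigableString rather than a tag.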
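
A small sketch of sibling traversal, using a shortened paragraph modeled on the demo page (the exact HTML string is an assumption made for illustration). It shows that the immediate neighbors of the first a tag are text nodes, and that tags have to be filtered out explicitly.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<p>Intro: <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")
first = soup.a

print(repr(first.previous_sibling))  # text before the first <a>
print(repr(first.next_sibling))      # text between the two <a> tags

# Keep only the siblings that are tags, skipping text nodes.
later_tags = [s for s in first.next_siblings if isinstance(s, Tag)]
print([t["id"] for t in later_tags])
```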

HTML Formatting and Encoding with bs4

Formatted HTML output with bs4

Displaying HTML in a more readable way

>>> soup.prettify()  # newlines are added
'<html>\n <head>\n  <title>\n   This is a python demo page\n  </title>\n </head>\n <body>\n  <p class="title">\n   <b>\n    The demo python introduces several python courses.\n   </b>\n  </p>\n  <p class="course">\n   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n    Basic Python\n   </a>\n   and\n   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">\n    Advanced Python\n   </a>\n   .\n  </p>\n </body>\n</html>'
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

>>> print(soup.a.prettify())  # prettify a single tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>

bs4 converts every HTML document and string it reads to Unicode internally, and writes output as UTF-8 by default.
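
The encoding behavior can be sketched as follows, assuming the input arrives as GBK-encoded bytes (the sample markup is made up for illustration). from_encoding tells bs4 how to decode the bytes, and Tag.encode serializes the result back to bytes.

```python
from bs4 import BeautifulSoup

# Bytes as they might arrive from a GBK-encoded page.
raw = "<p>中文</p>".encode("gbk")
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string           # a Unicode string inside Python
out = soup.p.encode("utf-8")   # serialized back out as UTF-8 bytes

print(text)
print(out)
```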

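Information Organization and Extraction Methods

Three Forms of Information Markup

Marking up information:

Marked-up information forms an organizational structure, which adds a dimension to the information.
Marked-up information can be used for communication, storage, and display.
The markup structure is as valuable as the information itself.
Marked-up information is easier for programs to understand and use.

The three forms of information markup: XML, JSON, YAML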

XML:

eXtensible Markup Language (closely related to HTML); information is built mainly from tags.

<name>…</name>  # tag with content
<name />  # tag without content
<!-- -->  # comment
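
To see the three XML constructs in action, here is a minimal sketch using the standard library's xml.etree.ElementTree. The element names are illustrative, borrowed from the city example in these notes.

```python
import xml.etree.ElementTree as ET

doc = """<city>
  <newName>Nanjing</newName>
  <oldName>Jinling</oldName>
  <note />
  <!-- comments are dropped by the default parser -->
</city>"""

root = ET.fromstring(doc)
# Map each child tag to its text; an empty tag has text None.
fields = {child.tag: child.text for child in root}
print(root.tag, fields)
```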

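JSON:

JavaScript Object Notation, the object-oriented way of representing information in JavaScript.
Information is built from typed key:value pairs (strings, numbers, and so on).

"city": "Nanjing"
"city": ["Nanjing", "Shanghai"]  # multiple values
"city": {"newName": "Nanjing",
"oldName": "Jinling"
}  # nested key-value pairs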
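
Since JSON values carry types, parsing them gives Python objects of matching types. A short sketch with the standard json module; the key names cities and districts are chosen here for illustration.

```python
import json

text = """
{
  "city": {"newName": "Nanjing", "oldName": "Jinling"},
  "cities": ["Nanjing", "Shanghai"],
  "districts": 11
}
"""

data = json.loads(text)
# Record the Python type each JSON value was converted to.
kinds = {key: type(value).__name__ for key, value in data.items()}
print(kinds)
print(data["city"]["oldName"])
```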

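YAML:

YAML Ain't Markup Language
Untyped key:value pairs (values are written as plain strings)

name: Nanjing
# indentation expresses nesting
name:
    oldName: Jinling
    newName: Nanjing
# - expresses a list of parallel items
name:
- Nanjing
- Jinling
# | introduces a whole block of text
# # marks a comment
text: |
    Nanjing, abbreviated "Ning" and historically known as Jinling and Jiankang, is the capital of Jiangsu Province, a sub-provincial city, a megacity, and the core city of the Nanjing metropolitan area. The State Council has designated it an important central city in eastern China and a major national base for research, education, and transportation. As of 2018 the city administered 11 districts, with a total area of 6,587 square kilometers and a built-up area of 971.62 square kilometers. In 2019 it had a permanent population of 8.50 million, of whom 7.072 million were urban residents, for an urbanization rate of 83.2%.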

Comparing the Three Forms of Information Markup

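XML: the earliest general-purpose information markup language; highly extensible but verbose.
Used for information exchange and transfer on the Internet.
JSON: information is typed, which suits processing by programs (JavaScript); more concise than XML.
Used for communication between mobile apps, the cloud, and nodes; has no comments.
YAML: information is untyped, with the highest proportion of actual text; very readable.
Used for configuration files of all kinds of systems; supports comments and is easy to read.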

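General Methods of Information Extraction

Method 1: fully parse the markup, then extract the key information. This needs a markup parser, for example tag-tree traversal in bs4. (Parsing is accurate, but extraction is tedious and slow.)
Method 2: ignore the markup and search for the key information directly, using ordinary text-search functions. (Extraction is simple and fast, but the accuracy of the result depends on the content.)
Combined method: combine markup parsing with searching to extract the key information. This needs both a markup parser and text-search functions.
Example: extract all URL links from an HTML page
1) Find all <a> tags
2) Parse each <a> tag and extract the link that follows href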

>>> for link in soup.find_all('a'):
...     print(link.get('href'))
...
http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001
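
For contrast with the combined approach shown above, method 2 (searching the text while ignoring the markup) can be sketched with the standard re module alone. The pattern below is an assumption that only suits simple, double-quoted href attributes like those on the demo page; it would also match href text that appears outside a tags.

```python
import re

html = (
    '<p class="course">Courses: '
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>'
)

# Capture whatever sits between the double quotes after href=.
hrefs = re.findall(r'href="([^"]+)"', html)
for link in hrefs:
    print(link)
```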

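HTML Content Search Methods in bs4

<>.find_all(name, attrs, recursive, string, **kwargs)
Returns a list holding the search results
name: the tag name(s) to search for
attrs: the tag attribute value(s) to search for; a specific attribute can be named
recursive: whether to search all descendants; defaults to True
string: the string to search for in the text between <> and </>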

name

>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all(['a','b'])
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> for tag in soup.find_all(True):
...     print(tag.name)
...
html
head
title
body
p
b
p
a
a
>>> import re  # regular expression library
>>> for tag in soup.find_all(re.compile('b')):
...     print(tag.name)
...
body
b

attrs

>>> soup.find_all('p','course')
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
>>> soup.find_all(id='link1')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>> soup.find_all(id='link')
[]
>>> import re
>>> soup.find_all(id=re.compile('link'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

recursive

>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a',recursive=False)
[]

string

>>> soup
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.find_all(string="Basic Python")
['Basic Python']
>>> soup.find_all(string=re.compile("python"))
['This is a python demo page', 'The demo python introduces several python courses.']

<tag>(…) is equivalent to <tag>.find_all(…)
That is, soup(…) is equivalent to soup.find_all(…)
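
A few more find_all variations, sketched on a shortened version of the demo paragraph (the HTML string here is an assumption for illustration). class_ is the keyword bs4 uses because class is reserved in Python, attrs accepts a dictionary, and limit caps the number of results.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="course">Courses: '
    '<a class="py1" id="link1">Basic Python</a> and '
    '<a class="py2" id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find_all("a", class_="py1")      # filter on the class attribute
by_attrs = soup.find_all(attrs={"id": "link2"})  # filter with an attrs dictionary
limited = soup.find_all("a", limit=1)            # stop after the first match
shorthand = soup("a")                            # same as soup.find_all("a")

print([t["id"] for t in by_class])
print([t["id"] for t in by_attrs])
print([t["id"] for t in limited])
print([t["id"] for t in shorthand])
```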

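Extension methods

Method	Description
<>.find()	Searches and returns only one result (None if nothing matches); same parameters as find_all()
<>.find_parents()	Searches the ancestor nodes and returns a list; same parameters as find_all()
<>.find_parent()	Returns a single result from the ancestor nodes (a node, or None); same parameters as find_all()
<>.find_next_siblings()	Searches the following sibling nodes and returns a list; same parameters as find_all()
<>.find_next_sibling()	Returns a single result from the following sibling nodes (a node, or None); same parameters as find_all()
<>.find_previous_siblings()	Searches the preceding sibling nodes and returns a list; same parameters as find_all()
<>.find_previous_sibling()	Returns a single result from the preceding sibling nodes (a node, or None); same parameters as find_all()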
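
The single-result methods in the table can be tried on a shortened version of the demo paragraph (the HTML string is an assumption for illustration). Each returns one node, or None when nothing matches.

```python
from bs4 import BeautifulSoup

html = (
    '<body><p class="course">Courses: '
    '<a class="py1" id="link1">Basic Python</a> and '
    '<a class="py2" id="link2">Advanced Python</a>.</p></body>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                    # first matching tag
parent_p = first.find_parent("p")         # nearest <p> ancestor
next_a = first.find_next_sibling("a")     # next <a> under the same parent
prev_a = next_a.find_previous_sibling("a")
missing = soup.find("table")              # no match, so None

print(first["id"], parent_p["class"], next_a["id"], prev_a["id"], missing)
```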

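Example: A Targeted Crawler for Chinese University Rankings

Input: the URL of the university ranking page
Output: the ranking information printed to the screen (rank, university name, total score)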

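import requests
import bs4
from bs4 import BeautifulSoup


def getHTMLText(url):
    """Fetch a page and return its text, or "" if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise for 4xx/5xx responses
        r.encoding = r.apparent_encoding  # guess the encoding from the content
        return r.text
    except requests.RequestException:
        return ""


def fillUnivList(ulist, html):
    """Append [rank, name, score] to ulist for each table row in html."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find('tbody')
    if tbody is None:  # empty page or unexpected layout
        return
    for tr in tbody.children:
        # Skip the whitespace strings between rows; keep only tags.
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')  # shorthand for tr.find_all('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])


def printUnivList(ulist, num):
    """Print the first num entries as centered, tab-separated columns."""
    print("{:^10}\t{:^6}\t{:^10}".format("排名", "学校名称", "总分"))
    for u in ulist[:num]:  # slicing avoids an IndexError on short lists
        print("{:^10}\t{:^6}\t{:^10}".format(u[0], u[1], u[2]))


def main():
    ufo = []
    url = "http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html"
    html = getHTMLText(url)
    fillUnivList(ufo, html)
    printUnivList(ufo, 20)  # 20 univs


if __name__ == "__main__":
    main()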

    排名    	 学校名称 	    总分    
    1     	 清华大学 	   94.6   
    2     	 北京大学 	   76.5   
    3     	 浙江大学 	   72.9   
    4     	上海交通大学	   72.1   
    5     	 复旦大学 	   65.6   
    6     	中国科学技术大学	   60.9   
    7     	华中科技大学	   58.9   
    7     	 南京大学 	   58.9   
    9     	 中山大学 	   58.2   
    10    	哈尔滨工业大学	   56.7   
    11    	北京航空航天大学	   56.3   
    12    	 武汉大学 	   56.2   
    13    	 同济大学 	   55.7   
    14    	西安交通大学	   55.0   
    15    	 四川大学 	   54.4   
    16    	北京理工大学	   54.0   
    17    	 东南大学 	   53.6   
    18    	 南开大学 	   52.8   
    19    	 天津大学 	   52.3   
    20    	华南理工大学	   52.0
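
The columns above drift because str.format pads with the half-width ASCII space, while Chinese characters are full-width. One possible tweak, not part of the code above, is to pad the name column with the full-width space chr(12288). The helper name format_row and the column width of 10 are assumptions made for this sketch.

```python
FULL_WIDTH_SPACE = chr(12288)  # U+3000, as wide as a Chinese character


def format_row(rank, name, score):
    """Center each field; pad the name column with full-width spaces."""
    template = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    return template.format(rank, name, score, FULL_WIDTH_SPACE)


print(format_row("排名", "学校名称", "总分"))
print(format_row("1", "清华大学", "94.6"))
print(format_row("6", "中国科学技术大学", "60.9"))
```

The nested field {3} supplies the fill character for the name column, so names of different lengths occupy the same visual width in a monospaced terminal font.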

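Page link: http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html
Inspecting the page's HTML source shows that each university's information sits inside a <tr> tag pair, and each individual field sits inside a <td> tag pair.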
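
The row structure just described can be tried offline on a small hand-written table. The markup below is a hypothetical stand-in for the real page: the ranks, names, and scores come from the output above, while the third column (a region) is an assumption about what the skipped column holds.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample; the real page's layout may differ.
html = """<table><tbody>
<tr><td>1</td><td>清华大学</td><td>北京</td><td>94.6</td></tr>
<tr><td>2</td><td>北京大学</td><td>北京</td><td>76.5</td></tr>
</tbody></table>"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find("tbody").children:
    if isinstance(tr, Tag):  # skip the newline strings between rows
        tds = tr("td")       # all <td> cells in this row
        rows.append([str(tds[0].string), str(tds[1].string), str(tds[3].string)])

print(rows)
```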
