Getting Started with the BeautifulSoup and Re Libraries for Python Web Scraping

Latest recommended article published on 2022-12-24 11:23:20

Divine0

Category: Python Web Scraping and Information Extraction
Tags: python, regular expressions, data analysis

Article link: https://blog.csdn.net/Divine0/article/details/105422579

1 Getting Started with Beautiful Soup

1.1 Installing Beautiful Soup

pip install beautifulsoup4

Testing the Beautiful Soup installation:

import requests
from bs4 import BeautifulSoup  # import the BeautifulSoup class from bs4

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
print(soup.prettify())

1.2 Basic Elements of Beautiful Soup

Importing Beautiful Soup:
The Beautiful Soup library is also known as beautifulsoup4 or bs4

from bs4 import BeautifulSoup
soup = BeautifulSoup(demo,"html.parser")

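Basic elements of the BeautifulSoup class:

Basic element	Description
Tag	A tag, the most basic unit for organizing information; <> and </> mark its start and end
Name	The name of a tag; the name of <p>…</p> is 'p'. Format: <tag>.name
Attributes	The attributes of a tag, organized as a dictionary. Format: <tag>.attrs
NavigableString	The non-attribute string inside a tag, i.e. the string between <> and </>. Format: <tag>.string
Comment	The comment part of a string inside a tag; Comment is a special string type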

tag = soup.a
type(tag)
# <class 'bs4.element.Tag'>
type(tag.attrs)
# <class 'dict'>
type(soup.p.string)
# <class 'bs4.element.NavigableString'>
newsoup = BeautifulSoup('<b><!--This is a comment--></b><p>This is not a comment</p>', "html.parser")
type(newsoup.b.string)
# <class 'bs4.element.Comment'>
type(newsoup.p.string)
# <class 'bs4.element.NavigableString'>
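To see the element types side by side without a network request, here is a minimal sketch that parses a small inline HTML string. The markup and the example.com URL are made up for illustration; they are not the demo page used above.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup, chosen only to exercise each element type.
html = ('<p class="title"><a href="http://example.com/py" id="link1">Basic Python</a></p>'
        '<b><!--This is a comment--></b>')
soup = BeautifulSoup(html, "html.parser")

tag = soup.a
tag_name = tag.name        # the tag's name
tag_attrs = tag.attrs      # the tag's attributes as a dict
tag_text = tag.string      # the NavigableString inside <a>
comment = soup.b.string    # the Comment inside <b>

print(isinstance(tag, Tag), tag_name, tag_attrs["id"])
print(type(tag_text).__name__, type(comment).__name__)
# Comment is a subclass of NavigableString, so test for Comment first
# when the two need to be told apart.
print(isinstance(comment, NavigableString))
```

Note that a multi-valued attribute such as class comes back as a list, so soup.p["class"] is ['title'] here rather than a plain string.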

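1.3 Traversing HTML Content with bs4

Downward traversal of the tag tree:

Attribute	Description
.contents	A list of child nodes; all direct children of <tag> are stored in a list
.children	An iterator over child nodes; like .contents, used to loop over the direct children
.descendants	An iterator over all descendant nodes, used to loop over every descendant

# Iterate over the direct children
for child in soup.body.children:
    print(child)
# Iterate over all descendants
for child in soup.body.descendants:
    print(child)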

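Upward traversal of the tag tree:

Attribute	Description
.parent	The parent tag of a node
.parents	An iterator over a node's ancestor tags, used to loop over the ancestors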

soup = BeautifulSoup(demo,"html.parser")
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
# Output:
# p
# body
# html
# [document]

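Sibling traversal of the tag tree:

Attribute	Description
.next_sibling	Returns the next sibling node in HTML text order
.previous_sibling	Returns the previous sibling node in HTML text order
.next_siblings	An iterator over all following sibling nodes in HTML text order
.previous_siblings	An iterator over all preceding sibling nodes in HTML text order

# Iterate over the following siblings
for sibling in soup.a.next_siblings:
    print(sibling)
# Iterate over the preceding siblings
for sibling in soup.a.previous_siblings:
    print(sibling)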
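One detail that is easy to miss when traversing: children and siblings include plain strings as well as tags. The sketch below uses a made-up one-line document (not the demo page) to show this, and filters with isinstance when only tags are wanted.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup for illustration.
html = '<p>Intro <a id="a1">one</a> and <a id="a2">two</a>.</p>'
soup = BeautifulSoup(html, "html.parser")

children = list(soup.p.children)          # strings and tags, in document order
siblings = list(soup.a.next_siblings)     # everything after the first <a>
tag_siblings = [s for s in siblings if isinstance(s, Tag)]
ancestors = [p.name for p in soup.a.parents]

print(len(children))                      # text nodes count as children too
print([s["id"] for s in tag_siblings])    # only the real tags remain
print(ancestors)                          # ends with the BeautifulSoup object itself
```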

1.4 Formatted HTML Output with bs4

The prettify() method of bs4:

soup = BeautifulSoup(demo,"html.parser")
print(soup.a.prettify())

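1.5 Searching HTML Content with bs4

Method	Description
<>.find_all(name, attrs, recursive, string, **kwargs)	Returns a list containing the search results

Parameters:

Parameter	Description
name	A filter on tag names
attrs	A filter on tag attribute values; a specific attribute can be named
recursive	Whether to search all descendants; defaults to True
string	A filter on the string content between <> and </>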

# name: a filter on tag names
soup.find_all('a')
soup.find_all(['a', 'b'])
soup.find_all(True)  # returns every tag in soup
for tag in soup.find_all(True):
    print(tag.name)  # html head title body p b p a a
# Print every tag whose name contains 'b' (bs4 applies the pattern with search());
# on the demo page these are body and b. Use re.compile('^b') to require a leading b.
import re
for tag in soup.find_all(re.compile('b')):
    print(tag.name)  # body b

# attrs: a filter on tag attribute values; a specific attribute can be named
soup.find_all('p', 'course')  # <p> tags whose class is 'course'
soup.find_all(id='link1')
soup.find_all(id=re.compile('link'))

# recursive: whether to search all descendants; defaults to True
soup.find_all('p', recursive=False)

# string: a filter on the string content between <> and </>
soup.find_all(string="Basic Python")
soup.find_all(string=re.compile('Python'))
# Shorthand: <tag>(..) is equivalent to <tag>.find_all(..)
#            soup(..) is equivalent to soup.find_all(..)
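The filters above can be tried offline. The following is a small sketch on a made-up document (the table and links are hypothetical, not the demo page); it also shows the difference between a pattern that may match anywhere in the tag name and one anchored with ^.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = ('<body><table><tbody><tr><td>1</td></tr></tbody></table>'
        '<p class="course"><a id="link1">Basic Python</a> '
        '<a id="link2">Advanced Python</a></p><b>end</b></body>')
soup = BeautifulSoup(html, "html.parser")

# A compiled pattern is applied with search(), so 'b' matches anywhere in the name.
contains_b = [t.name for t in soup.find_all(re.compile("b"))]
starts_with_b = [t.name for t in soup.find_all(re.compile("^b"))]

link_ids = [t["id"] for t in soup.find_all(id=re.compile("link"))]
python_strings = [str(s) for s in soup.find_all(string=re.compile("Python"))]

# recursive=False looks only at direct children of soup, and <p> is nested deeper.
top_level_p = soup.find_all("p", recursive=False)

print(contains_b, starts_with_b)
print(link_ids, python_strings, top_level_p)
```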

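Extended methods:

Method	Description
<>.find()	Searches and returns only the first result (None if nothing matches); takes the same filters as .find_all()
<>.find_parents()	Searches the ancestor nodes and returns a list; takes the same filters as .find_all()
<>.find_parent()	Returns a single result from the ancestor nodes (None if nothing matches); takes the same filters as .find_all()
<>.find_next_siblings()	Searches the following siblings and returns a list; takes the same filters as .find_all()
<>.find_next_sibling()	Returns a single result from the following siblings (None if nothing matches); takes the same filters as .find_all()
<>.find_previous_siblings()	Searches the preceding siblings and returns a list; takes the same filters as .find_all()
<>.find_previous_sibling()	Returns a single result from the preceding siblings (None if nothing matches); takes the same filters as .find_all()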
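A short sketch of the single-result methods, again on made-up markup. The point to notice is that a search with no match yields None, so the result should be checked before it is used.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = ('<div id="box"><p class="course">'
        '<a id="link1">Basic Python</a><a id="link2">Advanced Python</a>'
        '</p></div>')
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")                       # first matching tag
missing = soup.find("table")                      # nothing matches
parent_div = first_link.find_parent("div")        # nearest <div> ancestor
next_link = first_link.find_next_sibling("a")     # next <a> at the same level
no_previous = first_link.find_previous_sibling("a")

print(first_link["id"], parent_div["id"], next_link["id"])
print(missing, no_previous)
```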

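1.6 Beautiful Soup in Practice: A Focused Crawler for Chinese University Rankings

Functional description:

Input: the URL of the university ranking page
Output: the ranking information printed to the screen (rank, university name, total score)
Technical approach: requests-bs4
Focused crawler: crawls only the given URL and does not follow further links

Program structure:

Step 1: fetch the ranking page from the web
getHTMLText()
Step 2: extract the information from the page into a suitable data structure
fillUnivList()
Step 3: use the data structure to display and print the results
printUnivList()

First draft of the code: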

import requests
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):
    try:
        r = requests.get(url, timeout= 30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])

def printUnivList(ulist, num):
    print("{:^10}\t{:^6}\t{:^10}".format("排名", "学校名称", "分数"))  # headers: rank, university name, score
    for i in range(num):
        u = ulist[i]
        print("{:^10}\t{:^6}\t{:^10}".format(u[0], u[1], u[2]))

def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html'
    html = getHTMLText(url)
    fillUnivList(uinfo,html)
    printUnivList(uinfo, 20)  # print the top 20 universities
main()

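Alignment problem with Chinese output:

When a Chinese string is shorter than the field width, format() pads it with Western (half-width) spaces by default. These are narrower than Chinese characters, so the columns no longer line up.

:	<fill>	<align>	<width>	,	<.precision>	<type>
Leading symbol	Single character used for padding	< left-align, > right-align, ^ center	Output width of the slot	Thousands separator for numbers; applies to integers and floats	Precision of the fractional part of a float, or maximum output length of a string	Integer types b, c, d, o, x, X; float types e, E, f, %

The problem can be solved by padding with the full-width (Chinese) space, chr(12288).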
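As a minimal sketch of the fix, the snippet below centers the same Chinese string in a 10-character slot twice, once with the default fill and once with chr(12288). The variable names are chosen for illustration only.

```python
FULL_WIDTH_SPACE = chr(12288)  # the full-width (ideographic) space

name = "清华大学"

# Default fill: half-width spaces, narrower than the Chinese characters.
default_pad = "{:^10}".format(name)

# Custom fill passed through a nested replacement field.
wide_pad = "{:{fill}^10}".format(name, fill=FULL_WIDTH_SPACE)

print("[" + default_pad + "]")
print("[" + wide_pad + "]")
```

Both results contain 10 characters, but only the second one occupies a consistent display width in a monospaced CJK font, which is why the columns line up.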

Optimized code:

import requests
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):
    try:
        r = requests.get(url, timeout= 30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])

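def printUnivList(ulist, num):
    # {3} is the fill character for the name column: the full-width space chr(12288)
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    print(tplt.format("排名", "学校名称", "分数", chr(12288)))  # headers: rank, university name, score
    for i in range(min(num, len(ulist))):  # never index past the rows actually scraped
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], chr(12288)))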

def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html'
    html = getHTMLText(url)
    fillUnivList(uinfo,html)
    printUnivList(uinfo, 20)  # print the top 20 universities

main()

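2 Getting Started with the Re Library

2.1 The Concept of Regular Expressions

A regular expression is:

A general framework for expressing strings
An expression that concisely describes a set of strings
A tool built around the ideas of "concise" and "feature" when describing strings
A way to decide whether a given string has a particular feature

Regular expressions are widely used in text processing:

Expressing the features of a type of text (viruses, intrusions, and so on)
Finding or replacing a group of strings at once
Matching all or part of a string

Using regular expressions:

Compilation: converting a string that follows regular expression syntax into a regular expression feature (a compiled pattern)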

import re

# Target strings: 'PN' 'PYN' 'PYTN' 'PYTHN' 'PYTHON'
# Regular expression: P(Y|YT|YTH|YTHO)?N
regex = 'P(Y|YT|YTH|YTHO)?N'
# Compile it into a pattern object
p = re.compile(regex)
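To check that this one expression really describes the whole set, the compiled pattern can be run against a few candidates. This is a small sketch; the two extra strings at the end of the list are added here only as negative cases.

```python
import re

p = re.compile(r'P(Y|YT|YTH|YTHO)?N')

candidates = ['PN', 'PYN', 'PYTN', 'PYTHN', 'PYTHON', 'PYTHONS', 'PTN']

# fullmatch() succeeds only if the entire string fits the pattern.
accepted = [s for s in candidates if p.fullmatch(s)]
rejected = [s for s in candidates if not p.fullmatch(s)]

print(accepted)
print(rejected)
```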

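2.2 Regular Expression Syntax

Common regular expression operators:

Operator	Description	Example
.	Any single character (except a newline by default)
[ ]	Character set; gives the allowed values for a single character	[abc] means a, b, or c; [a-z] means a single character from a to z
[^ ]	Negated character set; gives the excluded values for a single character	[^abc] means a single character other than a, b, or c
*	Zero or more repetitions of the previous character	abc* means ab, abc, abcc, abccc, and so on
+	One or more repetitions of the previous character	abc+ means abc, abcc, abccc, and so on
?	Zero or one repetition of the previous character	abc? means ab or abc
|	Either the left or the right expression	abc|def means abc or def
{m}	Exactly m repetitions of the previous character	ab{2}c means abbc
{m,n}	From m to n repetitions of the previous character (n included)	ab{1,2}c means abc or abbc
^	Matches the start of the string	^abc means abc at the start of a string
$	Matches the end of the string	abc$ means abc at the end of a string
( )	Grouping marker; | can be used inside to separate alternatives	(abc) means abc; (abc|def) means abc or def
\d	A digit, equivalent to [0-9] for ASCII text
\w	A word character, equivalent to [A-Za-z0-9_] for ASCII text
\s	A whitespace character (a space, a tab, or other whitespace)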

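Classic regular expression examples:

Regular expression	Description
^[A-Za-z]+$	A string made up of the 26 letters
^[A-Za-z0-9]+$	A string made up of the 26 letters and digits
^-?\d+$	A string in integer form
^[0-9]*[1-9][0-9]*$	A string in positive-integer form
[1-9]\d{5}	A postal code in mainland China, 6 digits
[\u4e00-\u9fa5]	Matches a Chinese character
\d{3}-\d{8}|\d{4}-\d{7}	A domestic (Chinese) phone number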
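A few of these patterns can be tried directly. The sketch below uses made-up inputs (the phone number is not a real one) and also shows why the unanchored postal-code pattern needs care: it happily matches inside a longer run of digits.

```python
import re

is_integer = bool(re.fullmatch(r'-?\d+', '-42'))
zip_ok = bool(re.fullmatch(r'[1-9]\d{5}', '100081'))
zip_bad = bool(re.fullmatch(r'[1-9]\d{5}', '012345'))    # leading 0 is not allowed
chinese = re.findall(r'[\u4e00-\u9fa5]', 'BIT北京')
phone_ok = bool(re.fullmatch(r'\d{3}-\d{8}|\d{4}-\d{7}', '010-12345678'))

# search() without anchors takes the first 6 digits that fit, wherever they are.
embedded = re.search(r'[1-9]\d{5}', '20230100081').group(0)

print(is_integer, zip_ok, zip_bad, phone_ok)
print(chinese, embedded)
```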

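2.3 Basic Use of the Re Library

How regular expressions are written:

Raw string type: use a raw string whenever the regular expression contains the escape character '\'

The re library conventionally uses raw strings for regular expressions, written as r'text'
For example: r'[1-9]\d{5}'
r'\d{3}-\d{8}|\d{4}-\d{7}'

Ordinary string type: more cumbersome, because every backslash has to be doubled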
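The difference between the two notations is easy to verify. This is a minimal sketch: the raw string and the doubled-backslash string below are the same pattern, while an unescaped '\n' in an ordinary string is a single newline character.

```python
import re

raw = r'[1-9]\d{5}'          # raw string: backslashes are kept as written
escaped = '[1-9]\\d{5}'      # ordinary string: each backslash must be doubled

same = (raw == escaped)
lengths = (len(r'\n'), len('\n'))   # two characters versus one newline

found = re.search(escaped, 'BIT 100081').group(0)

print(same, lengths, found)
```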

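Main functions of the Re library:

Function	Description
re.search()	Searches a string for the first position where the regular expression matches and returns a match object
re.match()	Matches the regular expression starting at the beginning of a string and returns a match object
re.findall()	Searches a string and returns all matching substrings as a list
re.split()	Splits a string at the matches of the regular expression and returns a list
re.finditer()	Searches a string and returns an iterator over the matches; each element is a match object
re.sub()	Replaces every substring that matches the regular expression and returns the resulting string

1. re.search(pattern, string, flags=0)

Searches a string for the first position where the regular expression matches and returns a match object (None if there is no match)
pattern: the regular expression, as a string or raw string;
string: the string to search;
flags: control flags that modify how the regular expression is applied.

Common flag	Description
re.I re.IGNORECASE	Ignores case in the regular expression, so [A-Z] also matches lowercase characters
re.M re.MULTILINE	The ^ operator in the regular expression matches at the start of every line of the given string
re.S re.DOTALL	The . operator in the regular expression matches every character; by default it matches every character except a newline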
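The effect of each flag is clearest on a short two-line string. The following sketch uses a made-up sample text; flags are combined with the | operator.

```python
import re

text = "Python\npython 3"

no_flag = re.findall(r'^python', text)                  # case-sensitive, start of string only
ignore_case = re.findall(r'^python', text, re.I)        # case-insensitive
multi_line = re.findall(r'^python', text, re.I | re.M)  # ^ also matches after each newline

dot_default = re.search(r'Python.python', text)         # . does not match the newline
dot_all = re.search(r'Python.python', text, re.S)       # . matches the newline too

print(no_flag, ignore_case, multi_line)
print(dot_default, dot_all is not None)
```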

Example:

import re
match = re.search(r'[1-9]\d{5}','BIT 100081')
if match:
    print(match.group(0))  
# 100081

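2. re.match(pattern, string, flags=0)

Matches the regular expression starting at the beginning of a string and returns a match object (None if there is no match)
pattern: the regular expression, as a string or raw string;
string: the string to match;
flags: control flags that modify how the regular expression is applied.

Example:

import re
match = re.match(r'[1-9]\d{5}', 'BIT 100081')
if match:
    print(match.group(0))
# Prints nothing: the string does not start with a postal code, so match is None.
# Without the if guard, match.group(0) would raise AttributeError.
match = re.match(r'[1-9]\d{5}', '100081 BIT')
if match:
    print(match.group(0))
# 100081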

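3. re.findall(pattern, string, flags=0)

Searches a string and returns all matching substrings as a list
pattern: the regular expression, as a string or raw string;
string: the string to search;
flags: control flags that modify how the regular expression is applied.

Example: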

import re
ls = re.findall(r'[1-9]\d{5}', 'BIT100081 TSU100084')
print(ls) 
# ['100081', '100084']

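4. re.split(pattern, string, maxsplit=0, flags=0)

Splits a string at the matches of the regular expression and returns a list
pattern: the regular expression, as a string or raw string;
string: the string to split;
maxsplit: the maximum number of splits; the remainder is returned as the last element;
flags: control flags that modify how the regular expression is applied.

Example:

import re
ls = re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084')
print(ls)
# ['BIT', ' TSU', '']
ls2 = re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084', maxsplit=1)
print(ls2)
# ['BIT', ' TSU100084']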

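5. re.finditer(pattern, string, flags=0)

Searches a string and returns an iterator over the matches; each element is a match object
pattern: the regular expression, as a string or raw string;
string: the string to search;
flags: control flags that modify how the regular expression is applied.

Example: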

import re
for m in re.finditer(r'[1-9]\d{5}', 'BIT100081 TSU100084'):
    if m:
        print(m.group(0)) 
# 100081 
# 100084

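6. re.sub(pattern, repl, string, count=0, flags=0)

Replaces every substring that matches the regular expression and returns the resulting string
pattern: the regular expression, as a string or raw string;
repl: the string that replaces each match;
string: the string to search;
count: the maximum number of replacements (0, the default, replaces every match);
flags: control flags that modify how the regular expression is applied.

Example:

import re
rst = re.sub(r'[1-9]\d{5}', ':zipcode', 'BIT 100081,TSU 100084')
print(rst)
# BIT :zipcode,TSU :zipcode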
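The count parameter is worth seeing in action. The sketch below reuses the same input; the last call additionally passes a function as repl, which the re module also accepts although the text above only describes string replacements.

```python
import re

text = 'BIT 100081,TSU 100084'
pattern = r'[1-9]\d{5}'

all_replaced = re.sub(pattern, ':zipcode', text)            # count=0: replace every match
first_only = re.sub(pattern, ':zipcode', text, count=1)     # stop after one replacement

# repl may be a function that receives the match object and returns the new text.
masked = re.sub(pattern, lambda m: m.group(0)[:3] + '***', text)

print(all_replaced)
print(first_only)
print(masked)
```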

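2.4 An Equivalent Way to Use the Re Library

import re

# Functional usage: a one-off operation
rst = re.search(r'[1-9]\d{5}', 'BIT 100081')

# Object-oriented usage: compile once, then reuse many times
pat = re.compile(r'[1-9]\d{5}')
rst = pat.search('BIT 100081')

regex = re.compile(pattern, flags=0)

Compiles the string form of a regular expression into a regular expression object
pattern: the regular expression, as a string or raw string;
flags: control flags that modify how the regular expression is applied.

Method	Description
regex.search()	Searches a string for the first position where the regular expression matches and returns a match object
regex.match()	Matches the regular expression starting at the beginning of a string and returns a match object
regex.findall()	Searches a string and returns all matching substrings as a list
regex.split()	Splits a string at the matches of the regular expression and returns a list
regex.finditer()	Searches a string and returns an iterator over the matches; each element is a match object
regex.sub()	Replaces every substring that matches the regular expression and returns the resulting string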
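A compiled pattern pays off when the same expression is applied to many strings. The following is a small sketch with made-up input lines; note that the methods of the compiled object no longer take the pattern as their first argument.

```python
import re

pat = re.compile(r'[1-9]\d{5}')

# Hypothetical input lines for illustration.
lines = ['BIT 100081', 'no code here', 'TSU100084 PKU100871']

found = [pat.findall(line) for line in lines]
first_hit = pat.search(lines[0]).group(0)
cleaned = pat.sub(':zipcode', lines[2])

print(found)
print(first_hit, cleaned)
```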

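2.5 The Match Object of the Re Library

Attributes of the match object:

Attribute	Description
.string	The text being matched
.re	The pattern object (regular expression) used for the match
.pos	The position in the text where the search starts
.endpos	The position in the text where the search ends

Methods of the match object:

Method	Description
.group(0)	Returns the matched string
.start()	The start position of the match in the original string
.end()	The end position of the match in the original string
.span()	Returns (.start(), .end())

Example: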

import re
m = re.search(r'[1-9]\d{5}', 'BIT100081 TSU100084')
print(m.string) 
# BIT100081 TSU100084
print(m.re) 
# re.compile('[1-9]\\d{5}')
print(m.pos) 
# 0
print(m.endpos) 
# 19
print(m.group(0)) 
# 100081
print(m.start()) 
# 3
print(m.end()) 
# 9
print(m.span()) 
# (3, 9)
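The table above lists only group(0). When the pattern contains parentheses, the same methods also accept a group number; this goes slightly beyond what the text covers, so treat the sketch below as a supplementary illustration with a pattern made up for the purpose.

```python
import re

# Two capture groups: the letters, then the six digits.
m = re.search(r'([A-Z]+)(\d{6})', 'BIT100081 TSU100084')

whole = m.group(0)       # the entire match
school = m.group(1)      # first parenthesized group
code = m.group(2)        # second parenthesized group
groups = m.groups()      # all groups as a tuple
code_span = m.span(2)    # position of the second group in the original string

print(whole, school, code)
print(groups, code_span)
```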

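2.6 Greedy and Minimal Matching in the Re Library

Greedy matching:

By default the Re library matches greedily, that is, it returns the longest matching substring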

import re
match = re.search(r'PY.*N', 'PYANBNCNDN')
print(match.group(0)) 
# PYANBNCNDN

Minimal matching:

import re
match = re.search(r'PY.*?N', 'PYANBNCNDN')
print(match.group(0)) 
# PYAN

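Minimal-match operators:

Operator	Description
*?	Zero or more repetitions of the previous character, minimal match
+?	One or more repetitions of the previous character, minimal match
??	Zero or one repetition of the previous character, minimal match
{m,n}?	From m to n repetitions of the previous character (n included), minimal match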
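Minimal matching matters most when pulling tags or quoted values out of HTML. The sketch below compares the greedy and minimal forms on a made-up fragment and on a run of digits.

```python
import re

html = '<b>bold</b><i>italic</i>'

greedy = re.findall(r'<.*>', html)    # .* runs to the last '>' in the string
lazy = re.findall(r'<.*?>', html)     # .*? stops at the first '>' it can

digits_greedy = re.search(r'\d{2,4}', '12345').group(0)   # takes as many as allowed
digits_lazy = re.search(r'\d{2,4}?', '12345').group(0)    # takes as few as allowed

print(greedy)
print(lazy)
print(digits_greedy, digits_lazy)
```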

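2.7 The Re Library in Practice: A Focused Crawler for Taobao Price Comparison

Functional description:

Goal: fetch Taobao search result pages and extract the product names and prices
Things to understand: the Taobao search interface
How pagination is handled
Technical approach: requests-re

Program structure:

Step 1: submit the product search request and fetch the pages in a loop
Step 2: for each page, extract the product names and prices
Step 3: print the information to the screen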

import requests
import re

def getHTMLText(url):
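    # Request headers: User-Agent identifies the client making the request;
    # the cookie comes from your own logged-in browser (see the note below)
    headers={
                "user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
                "cookie": "<cookie copied from your own browser>"
            }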
    try:
        r = requests.get(url, headers=headers, timeout = 30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

# Parse the fetched page and extract each product's price and title
def parsePage(ilt, html):
    try:
        plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"', html)
        tlt = re.findall(r'\"raw_title\"\:\".*?\"', html)
        for i in range(len(plt)):
            price = eval(plt[i].split(':')[1])
            title = eval(tlt[i].split(':')[1])
            ilt.append([price, title])
    except:
        print("")

def printGoodsList(ilt):
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("序号", "价格", "商品名称"))  # headers: index, price, product name
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count, g[0], g[1]))


def main():
    goods = '书包'  # search keyword ("schoolbag")
    depth = 2  # crawl depth: 2 means fetch two result pages
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(44*i)
            html = getHTMLText(url)
            parsePage(infoList, html)
        except:
            continue
    printGoodsList(infoList)

main()

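Note: Taobao has anti-crawler measures, so when fetching pages with the get() method of requests you need to send your local cookie; otherwise no data is returned. To obtain it, open Taobao in a browser, log in, and search for "书包" (schoolbag). Press F12 to open the developer tools, click Network at the top of the panel, and refresh the page. Click the request URL (usually the first one), find Request Headers under Headers on the right, locate the cookie field there, and paste its value into the corresponding place in the code.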
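The extraction step can be tested without touching Taobao at all. The sketch below runs on a made-up fragment shaped like the embedded data that parsePage() looks for (the products and prices are invented), and it uses capture groups instead of split(':') plus eval(). This is a variation on the code above, not the original approach: capture groups avoid evaluating scraped text and keep titles that contain a colon intact, whereas split(':')[1] would cut such a title at the colon.

```python
import re

# Hypothetical fragment; real pages may be structured differently.
sample = ('{"raw_title":"双肩书包:学生款","view_price":"129.00"},'
          '{"raw_title":"旅行背包","view_price":"59.90"}')

# The parentheses capture only the value between the quotes.
prices = re.findall(r'"view_price":"([\d.]*)"', sample)
titles = re.findall(r'"raw_title":"(.*?)"', sample)

goods = list(zip(prices, titles))
for index, (price, title) in enumerate(goods, start=1):
    print(index, price, title)
```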

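2.8 Beautiful Soup and Re Together in Practice: A Focused Crawler for Stock Data

Functional description:

Goal: fetch the names and trading information of all stocks listed on the Shanghai and Shenzhen stock exchanges
Output: saved to a file
Technical approach: requests-bs4-re

Choosing candidate data sites:

Sina Stock: https://finance.sina.com.cn/stock/
Baidu Stock: https://gupiao.baidu.com/stock/
Selection criteria: the stock information is statically present in the HTML page, is not generated by JavaScript, and is not restricted by the Robots protocol.

Program structure:

Step 1: get the stock list from East Money (quote.eastmoney.com)
Step 2: for each stock in the list, fetch its details from Baidu Stock
Step 3: store the results in a file

First draft of the code: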

import requests
from bs4 import BeautifulSoup
import traceback
import re
 
def getHTMLText(url):
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""
 
def getStockList(lst, stockURL):
    html = getHTMLText(stockURL)
    soup = BeautifulSoup(html, 'html.parser') 
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except:
            continue
 
def getStockInfo(lst, stockURL, fpath):
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html=="":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div',attrs={'class':'stock-bets'})
 
            name = stockInfo.find_all(attrs={'class':'bets-name'})[0]
            infoDict.update({'股票名称': name.text.split()[0]})
             
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val
             
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write( str(infoDict) + '\n' )
        except:
            traceback.print_exc()
            continue
 
def main():
    stock_list_url = 'https://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist=[]
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)
 
main()

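Optimized code:

Faster: skip encoding detection by passing in the known encoding
Better user experience: show the progress dynamically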

import requests
from bs4 import BeautifulSoup
import traceback
import re
 
def getHTMLText(url, code="utf-8"): 
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = code  # optimization: use the known encoding instead of detecting it
        return r.text
    except:
        return ""
 
def getStockList(lst, stockURL):
    html = getHTMLText(stockURL, "GB2312")
    soup = BeautifulSoup(html, 'html.parser') 
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except:
            continue
 
def getStockInfo(lst, stockURL, fpath):
    count = 0
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html=="":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div',attrs={'class':'stock-bets'})
 
            name = stockInfo.find_all(attrs={'class':'bets-name'})[0]
            infoDict.update({'股票名称': name.text.split()[0]})
             
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val
             
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write( str(infoDict) + '\n' )
                count = count + 1
                print("\r当前进度: {:.2f}%".format(count*100/len(lst)),end="")  # dynamic progress display ("当前进度" means "current progress")
        except:
            count = count + 1
            print("\r当前进度: {:.2f}%".format(count*100/len(lst)),end="")
            continue
 
def main():
    stock_list_url = 'https://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist=[]
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)
 
main()
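The two ideas that carry this crawler, pulling stock codes out of link targets and reporting progress as a percentage, can be checked in isolation. The sketch below uses made-up href values (the real list page may use a different URL layout), and the progress() helper is a hypothetical name introduced only for this example.

```python
import re

# Hypothetical href values, including ones without a stock code.
hrefs = [
    'http://quote.eastmoney.com/sh600000.html',
    'http://quote.eastmoney.com/sz000001.html',
    'http://quote.eastmoney.com/center/',
    None,
]

code_pattern = re.compile(r's[hz]\d{6}')   # 'sh' or 'sz' followed by six digits

codes = []
for href in hrefs:
    m = code_pattern.search(href or '')    # treat a missing href as an empty string
    if m:
        codes.append(m.group(0))


def progress(done, total):
    """Format the share of finished items as a percentage with two decimals."""
    return "{:.2f}%".format(done * 100 / total)


print(codes)
print(progress(1, 3))
```

Testing the match result explicitly, rather than indexing into findall() inside a try block, makes it clear which links are skipped and why.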