Machine Learning: Python Language Basics, Day 11 - CSDN Blog

Article link: https://blog.csdn.net/lyckid/article/details/101827888

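Regular Expressions

Concepts

A regular expression is a text pattern that describes one or more strings to match when searching text. Regular expressions are typically used for:

Data validation
Text scanning
Text extraction
Text replacement
Text splitting

Syntax

Regular expression syntax falls into two categories: literals and metacharacters.

Literals

Literals include ordinary characters. With ordinary characters, square brackets [] define a character class that matches one character from a range: [0-9a-zA-Z_] matches a single digit, letter, or underscore.
Literals also include escaped characters: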

Escaped form	Character matched
\\	\
\^	^
\$	$
\.	.
\|	|
\*	*
\+	+
\[\]	[]
\{\}	{}
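
The escapes in the table can be tried out with a short script. This is a minimal sketch; the sample strings are made up for illustration and are not from the original article.

```python
import re

# Hypothetical sample text containing several regex special characters.
price_text = 'Total: $3.50 (approx.) [a+b]*2'

# "$" and "." are escaped so that they match themselves.
dollar_amounts = re.findall(r'\$\d+\.\d+', price_text)

# Without the backslash, "." matches any character, so "3x50" matches too.
loose = re.findall(r'3.50', '3.50 3x50')
strict = re.findall(r'3\.50', '3.50 3x50')

# Brackets, plus and star need a backslash as well.
bracket_expr = re.findall(r'\[a\+b\]\*2', price_text)

print(dollar_amounts)  # ['$3.50']
print(loose)           # ['3.50', '3x50']
print(strict)          # ['3.50']
print(bracket_expr)    # ['[a+b]*2']
```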

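Metacharacters

Python's re module supports a set of metacharacters with special meanings, each of which can match a whole class of characters:

Metacharacter	Meaning
.	Any character except \n
\d	A digit, equivalent to [0-9]
\D	Any non-digit, equivalent to [^0-9]
\s	Any whitespace character, equivalent to [ \t\n\r\f\v]
\S	Any non-whitespace character, equivalent to [^ \t\n\r\f\v]
\w	A letter, digit, or underscore, equivalent to [0-9a-zA-Z_]
\W	Any character other than a letter, digit, or underscore, equivalent to [^0-9a-zA-Z_]

Note: the bracketed equivalents hold for ASCII text. For str patterns in Python 3, \d, \s and \w also match the corresponding Unicode characters unless the re.ASCII flag is set.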
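
A quick way to see what each class matches is to run findall over one small string. This is a minimal sketch; the sample string is an assumption chosen to contain digits, letters, an underscore, a tab, a newline, and a hyphen.

```python
import re

# Hypothetical sample: "id_42", a tab, "ok", a newline, then "B-7".
sample = 'id_42\tok\nB-7'

digits = re.findall(r'\d', sample)       # single digits
words = re.findall(r'\w+', sample)       # runs of letters, digits, underscore
whitespace = re.findall(r'\s', sample)   # tab and newline
non_word = re.findall(r'\W', sample)     # everything \w does not match
dot = re.findall(r'.', 'a\nb')           # "." skips the newline

print(digits)      # ['4', '2', '7']
print(words)       # ['id_42', 'ok', 'B', '7']
print(whitespace)  # ['\t', '\n']
print(non_word)    # ['\t', '\n', '-']
print(dot)         # ['a', 'b']
```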

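Python's matching rules have a few other features:

Alternation: A|B matches either A or B.
Quantifiers: a metacharacter or ordinary character can be followed by a quantifier that says how many times it repeats. ? means 0 or 1 time, * means 0 or more times, + means 1 or more times, {n,m} means between n and m times, {n,} means at least n times, {,m} means at most m times, and {n} means exactly n times.
Greedy and non-greedy matching: greedy mode is Python's default and matches as much text as possible. Non-greedy mode matches as little as possible and is enabled by adding ? after the quantifier: ??, *?, +?, and so on.
Boundary matching: ^ matches the start of the string (and of each line in multi-line mode), $ the end of the string (and of each line in multi-line mode), \b a word boundary, \B a non-word boundary, \A the start of the input, and \Z the end of the input.
Note: some of these behave differently depending on context, for example on the flags in effect.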
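
The rules above can be sketched in a few lines. The sample strings below are assumptions chosen only to make each rule visible.

```python
import re

html = '<b>one</b><b>two</b>'

# Greedy: ".*" grabs as much as it can, so both elements end up in one match.
greedy = re.findall(r'<b>.*</b>', html)
# Non-greedy: ".*?" stops at the first closing tag.
lazy = re.findall(r'<b>.*?</b>', html)

# "?" makes the preceding character optional.
optional_u = re.findall(r'colou?r', 'color colour')

# {2,4} limits the repeat count; \b keeps longer digit runs from matching.
bounded = re.findall(r'\b\d{2,4}\b', '7 42 2019 123456')

# \b matches only at word boundaries, so "concat" and "cats" are skipped.
whole_words = re.findall(r'\bcat\b', 'cat concat cats cat.')

# Alternation: either spelling is accepted.
either = re.findall(r'gray|grey', 'gray grey groy')

print(greedy)       # ['<b>one</b><b>two</b>']
print(lazy)         # ['<b>one</b>', '<b>two</b>']
print(optional_u)   # ['color', 'colour']
print(bounded)      # ['42', '2019']
print(whole_words)  # ['cat', 'cat']
print(either)       # ['gray', 'grey']
```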

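Regular Expressions in Python

Pattern objects
A pattern object (RegexObject) is a compiled regular expression (compiled to bytecode and cached). Pattern objects are usually created with the compile function of the re module. Prefixing a string literal with r turns off Python's string escaping, but not the regular expression's own escaping.
A pattern object has several main methods, which are also available as functions in the re module:

findall(): finds all non-overlapping matches and returns a list.
match(string[, pos[, endpos]]): matches from the starting position and returns a MatchObject, or None if there is no match.
search(string[, pos[, endpos]]): searches from any position and returns a MatchObject, or None if there is no match.
finditer(): finds all matches and returns an iterator of MatchObject objects.

An example of findall: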

>>> import re
>>> text = 'Tom is 8 years old. Mike is 25 years old.'
>>> pattern = re.compile('\d+')
>>> pattern.findall(text)
['8', '25']
>>> re.findall('\d+',text)
['8', '25']
>>> s = '\\author:Tom'
>>> s
'\\author:Tom'
>>> print(s)
\author:Tom
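>>> patter = re.compile('\\author')    # the regex sees \author, where \a is the bell character
>>> patter.findall(s)
[]
>>> patter = re.compile('\\\\author')  # the regex sees \\author: a literal backslash, then "author"
>>> patter.findall(s)
['\\author']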
>>> pattern=re.compile(r'\\author')
>>> pattern.findall(s)
['\\author']
>>> text = 'Tom is 8 years old. Mike is 25 years old.Peter is 87 years old'
>>> pattern=re.compile(r'\d+')
>>> pattern.findall(text)
['8', '25', '87']
>>> pattern = re.compile(r'[A-Z]\w+')
>>> pattern.findall(text)
['Tom', 'Mike', 'Peter']

An example of match:

>>> pattern = re.compile(r'<html>')
>>> text = '<html><head></head><body></body></html>'
>>> pattern.match(text)
<_sre.SRE_Match object; span=(0, 6), match='<html>'>
>>> text2='a<html><head></head><body></body></html>'
>>> pattern.match(text2)
>>> pattern.match(text2,1)
<_sre.SRE_Match object; span=(1, 7), match='<html>'>

Examples of search and finditer:

>>> pattern.search(text2)
<_sre.SRE_Match object; span=(1, 7), match='<html>'>
>>> 
>>> 
>>> text = 'Tom is 8 years old. Mike is 25 years old.Peter is 87 years old'
>>> pattern=re.compile(r'\d+')
>>> it = pattern.finditer(text)
>>> for m in it:
...  print(m)
... 
<_sre.SRE_Match object; span=(7, 8), match='8'>
<_sre.SRE_Match object; span=(28, 30), match='25'>
<_sre.SRE_Match object; span=(50, 52), match='87'>
>>> text = 'Beautiful is better than ugly.'
>>> pattern = re.compile(r'(\w+) (\w+)')
>>> pattern.findall(text)
[('Beautiful', 'is'), ('better', 'than')]
>>> it = pattern.finditer(text)
>>> for m in it:
...  print(m.group())
... 
Beautiful is
better than

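Match objects
A match object describes the result of a successful match.
Match objects also have several methods:

group(): with 0 or no argument returns the whole match; with an argument returns that group's match; the argument may also be a group name.
groups(): returns a tuple of all subgroups.
start(): returns the start index of a given group.
end(): returns the end index of a given group.
span(): returns a (start, end) tuple for a given group.
groupdict(): returns the named groups and their matches as a dictionary.

>>> text = 'Tom is 8 years old. Mike is 25 years old.Peter is 87 years old'
>>> pattern=re.compile(r'(\d+).*?(\d+)')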
>>> pattern.search(text)
<_sre.SRE_Match object; span=(7, 30), match='8 years old. Mike is 25'>
>>> m=pattern.search(text)
>>> m.group()
'8 years old. Mike is 25'
>>> m.group(0)
'8 years old. Mike is 25'
>>> m.group(1)
'8'
>>> m.start(1)
7
>>> m.end(1)
8
>>> m.groups()
('8', '25')
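
The session above does not exercise span() or groupdict(). The following minimal sketch fills that gap. It uses named groups, written (?P<name>...), and the group names name and age are assumptions made for this example.

```python
import re

# Named groups make groupdict() meaningful; the names are illustrative.
pattern = re.compile(r'(?P<name>[A-Z]\w+) is (?P<age>\d+)')
m = pattern.search('Tom is 8 years old. Mike is 25 years old.')

whole = m.group()          # the entire match
groups = m.groups()        # all subgroups as a tuple
by_name = m.groupdict()    # named groups as a dictionary
age_span = m.span('age')   # (start, end) of the "age" group
name_bounds = (m.start('name'), m.end('name'))

print(whole)        # Tom is 8
print(groups)       # ('Tom', '8')
print(by_name)      # {'name': 'Tom', 'age': '8'}
print(age_span)     # (7, 8)
print(name_bounds)  # (0, 3)
```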

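Groups
Main uses: extracting information from a match, creating sub-patterns that a quantifier can apply to, limiting the scope of alternation, and reusing content captured earlier in the pattern.

>>> re.search(r'ab+c','ababc')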
<_sre.SRE_Match object; span=(2, 5), match='abc'>
>>> re.search(r'(ab)+c','ababc')
<_sre.SRE_Match object; span=(0, 5), match='ababc'>
>>> re.search(r'Cent(re|er)','Centre')
<_sre.SRE_Match object; span=(0, 6), match='Centre'>
>>> re.search(r'Cent(re|er)','Center')
<_sre.SRE_Match object; span=(0, 6), match='Center'>
>>> re.search(r'(\w+) \1','hello world')
>>> re.search(r'(\w+) \1','hello hello world')
<_sre.SRE_Match object; span=(0, 11), match='hello hello'>

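A group can be declared in two ways: (pattern), or (?P<name>pattern), which gives the group a name. A named group can be referenced as m.group('name') on a match object, as (?P=name) inside the pattern, and as \g<name> inside a replacement string.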

>>> text = 'Tom:98'
>>> pattern = re.compile(r'(?P<name>\w+):(?P<scores>\d+)')
>>> m = pattern.search(text)
>>> m.group()
'Tom:98'
>>> m.group(1)
'Tom'
>>> m.group('name')
'Tom'
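
The other two ways of referring to a named group, (?P=name) inside the pattern and \g<name> inside a replacement string, can be sketched as follows. The group names and sample strings are assumptions made for illustration.

```python
import re

# (?P=word) refers back to whatever the group named "word" captured.
repeated = re.search(r'(?P<word>\w+) (?P=word)', 'hello hello world')
repeated_text = repeated.group()
repeated_word = repeated.group('word')

# \g<name> inserts a named group's text into the replacement string.
swapped = re.sub(r'(?P<name>\w+):(?P<score>\d+)',
                 r'\g<score>:\g<name>',
                 'Tom:98 Ann:87')

print(repeated_text)  # hello hello
print(repeated_word)  # hello
print(swapped)        # 98:Tom 87:Ann
```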

The re module also supports string operations, some of which work together with groups:

>>> text = 'Beautiful is better than ugly.\nExplicit is better than implicit.\nSimple is better than complex.'
>>> p = re.compile(r'\n')
>>> p.split(text)
['Beautiful is better than ugly.', 'Explicit is better than implicit.', 'Simple is better than complex.']
>>> re.split(r'\n',text)
['Beautiful is better than ugly.', 'Explicit is better than implicit.', 'Simple is better than complex.']
>>> re.split(r'-','Good-Morning')
['Good', 'Morning']
>>> re.split(r'(-)','Good-Morning')
['Good', '-', 'Morning']
>>> 
>>> 
>>> text
'Beautiful is better than ugly.\nExplicit is better than implicit.\nSimple is better than complex.'
>>> re.split(r'\n',text,2)
['Beautiful is better than ugly.', 'Explicit is better than implicit.', 'Simple is better than complex.']
>>> re.split(r'\n',text,1)
['Beautiful is better than ugly.', 'Explicit is better than implicit.\nSimple is better than complex.']
>>> 
>>> 
>>> ords ='ORD000\nORD001\nORD003'
>>> re.sub(r'\nd+','-',ords)
'ORD000\nORD001\nORD003'
>>> re.sub(r'\d+','-',ords)
'ORD-\nORD-\nORD-'
>>> 
>>> 
>>> text='Beautiful is *better* than ugly'
>>> re.sub(r'\*(.*?)\*','<strong>\g<1></strong>',text)
'Beautiful is <strong>better</strong> than ugly'
>>> re.sub(r'\*(.*?)\*','<strong></strong>',text)
'Beautiful is <strong></strong> than ugly'
>>> re.sub(r'\*(?P<html>.*?)\*','<strong>\g<html></strong>',text)
'Beautiful is <strong>better</strong> than ugly'
>>> re.sub(r'([A-Z]+)(\d+)','\g<2>-\g<1>',ords)
'000-ORD\n001-ORD\n003-ORD'
>>> re.subn(r'([A-Z]+)(\d+)','\g<2>-\g<1>',ords)
('000-ORD\n001-ORD\n003-ORD', 3)

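Here \g<2> is the replacement-string syntax for referring to a group captured by the pattern.

split(string, maxsplit=0): splits a string.
sub(repl, string, count=0): replaces matches in a string.
subn(repl, string, count=0): replaces matches and also returns the number of replacements.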
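
The count and maxsplit parameters in these signatures limit how many matches are processed. A minimal sketch, reusing the ords string from the session above and calling the methods on a compiled pattern:

```python
import re

ords = 'ORD000\nORD001\nORD003'
digits = re.compile(r'\d+')

# count=1 replaces only the first match.
first_only = digits.sub('-', ords, count=1)

# subn returns the new string together with the number of replacements.
replaced, how_many = digits.subn('-', ords)

# maxsplit=1 splits at the first separator only.
parts = re.split(r'\n', ords, maxsplit=1)

print(repr(first_only))  # 'ORD-\nORD001\nORD003'
print(repr(replaced))    # 'ORD-\nORD-\nORD-'
print(how_many)          # 3
print(parts)             # ['ORD000', 'ORD001\nORD003']
```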

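Compilation flags change the default behavior of a regular expression:

re.I: ignore case.
re.M: multi-line mode; ^ and $ also match at the start and end of each line.
re.S: makes "." match every character, including \n.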

>>> re.findall(r'^<html>','\n<html>')
[]
>>> re.findall(r'^<html>','\n<html>',re.I)
[]
>>> re.findall(r'^<html>','\n<html>',re.M)
['<html>']
>>> re.findall(r'\d(.)','1\ne')
[]
>>> re.findall(r'\d(.)','1\ne',re.S)
['\n']
>>> re.purge()
>>> re.findall(r'^','^python^')
['']
>>> re.findall(re.escape(r'^'),'^python^')
['^', '^']

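re.purge() clears the regular expression cache. re.escape() escapes special characters so that they lose their special meaning and are matched literally.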
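
The session above passes re.I only in a case where it makes no difference. The sketch below shows it changing a result, combines two flags with |, and uses re.escape on a string that contains special characters. All sample strings are assumptions.

```python
import re

text = 'Python\npython\nPYTHON'

case_sensitive = re.findall(r'python', text)
ignore_case = re.findall(r'python', text, re.I)

# Flags are combined with "|": ignore case and treat each line separately.
whole_lines = re.findall(r'^python$', text, re.I | re.M)

# re.escape turns "." and "*" into literals.
literal = 'a.b*c'
escaped = re.escape(literal)
unescaped_hits = re.findall(literal, 'a.b*c axbbc')
escaped_hits = re.findall(escaped, 'a.b*c axbbc')

print(case_sensitive)  # ['python']
print(ignore_case)     # ['Python', 'python', 'PYTHON']
print(whole_lines)     # ['Python', 'python', 'PYTHON']
print(unescaped_hits)  # ['axbbc']
print(escaped_hits)    # ['a.b*c']
```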

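System Tools

Commands for interacting with the operating system are issued through the Command Prompt on Windows, shell scripts on Linux, and system administration tools on Unix. In Python, the modules for interacting with the system and the interpreter are mainly sys and os.

sys exposes information about the running interpreter and the platform Python runs on.
os provides a portable, cross-platform programming interface to the operating system, and os.path provides portable tools for files and directories.

sys module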

>>> from pprint import pprint
>>> import sys
>>> pprint(dir(sys))

>>> sys.platform
'win32'
>>> sys.version
'3.6.7 (v3.6.7:6ec5cf24b7, Oct 20 2018, 13:35:33) [MSC v.1900 64 bit (AMD64)]'
>>> sys.path          # module search path
>>> sys.modules	      # modules that have already been imported

The sys module can also be used to inspect exception details:

>>> try:
...  raise KeyError
... except:
...  print(sys.exc_info())
... 
(<class 'KeyError'>, KeyError(), <traceback object at 0x000001C5ED6BB5C8>)
>>> import traceback
>>> try:
...  raise KeyError
... except:
...  print(sys.exc_info())
...  traceback.print_tb(sys.exc_info()[2])
... 
(<class 'KeyError'>, KeyError(), <traceback object at 0x000001C5ED6C73C8>)
  File "<stdin>", line 2, in <module>

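sys.exc_info() returns the details of the exception currently being handled as a (type, value, traceback) tuple. The traceback element is an object holding the call stack at the point where the exception was raised.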
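
The tuple returned by sys.exc_info() can be unpacked and turned into strings. This is a minimal sketch; the helper name describe_current_exception is hypothetical, and traceback.format_tb is used instead of print_tb so the frames come back as a list.

```python
import sys
import traceback


def describe_current_exception():
    """Return (type name, message, formatted frames) for the active exception."""
    exc_type, exc_value, tb = sys.exc_info()
    return exc_type.__name__, str(exc_value), traceback.format_tb(tb)


try:
    {}['missing']  # looking up a missing key raises KeyError
except KeyError:
    name, message, frames = describe_current_exception()

print(name)     # KeyError
print(message)  # 'missing'
print(''.join(frames))
```

Outside an except block no exception is being handled, so sys.exc_info() returns (None, None, None).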

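import sys


def add(a, b):
    return a + b


# sys.argv[0] is the script name; the command-line arguments start at index 1.
if len(sys.argv) >= 3:
    a = int(sys.argv[1])
    b = int(sys.argv[2])
    print(add(a, b))
else:
    print('usage:', sys.argv[0], 'A B')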

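When the module is run from the command prompt, the arguments typed after the module name become sys.argv[1] and the elements that follow; sys.argv[0] is the script name itself.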
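
Argument handling is easier to check when it lives in a function that takes the argument list as a parameter. This is a minimal sketch; the function name parse_two_ints and the script name add.py are hypothetical.

```python
def parse_two_ints(argv):
    """Return (a, b) parsed from argv[1] and argv[2], or None if unusable."""
    if len(argv) < 3:
        return None
    try:
        return int(argv[1]), int(argv[2])
    except ValueError:
        # One of the arguments was not an integer.
        return None


# Simulated argument lists; a real script would pass sys.argv instead.
good = parse_two_ints(['add.py', '3', '4'])
too_few = parse_two_ints(['add.py'])
not_numbers = parse_two_ints(['add.py', 'x', '4'])

print(good)         # (3, 4)
print(too_few)      # None
print(not_numbers)  # None
```
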
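The sys module also exposes the standard streams:

sys.stdin: the standard input stream, which input() reads from by default.
sys.stdout: the standard output stream, which print() writes to by default.
sys.stderr: the standard error stream.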

>>> import sys
>>> sys.stdout.write('hello')
hello5
>>> print('Enter some text');sys.stdin.readline()[:]
Enter some text
Python
'Python\n'
>>> print('Enter some text');x=sys.stdin.readline()[:]
Enter some text
python
>>> x
'python\n'
>>> sys.stderr.write('error message')
13
error message>>> sys.stderr.flush()  # flush the output buffer
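
Because the streams are ordinary file-like objects, they can be swapped for in-memory buffers. This is a minimal sketch that uses io.StringIO and contextlib.redirect_stdout from the standard library, neither of which appears in the original article.

```python
import contextlib
import io
import sys

# Capture everything written to standard output.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    print('hello')               # print() writes to sys.stdout
    sys.stdout.write('world\n')  # same stream, written to directly
captured = buffer.getvalue()

# Feed sys.stdin from a string, then restore the real stream.
real_stdin = sys.stdin
sys.stdin = io.StringIO('Python\n')
try:
    line = sys.stdin.readline()
finally:
    sys.stdin = real_stdin

print(repr(captured))  # 'hello\nworld\n'
print(repr(line))      # 'Python\n'
```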

os module

>>> import os
>>> os.environ

The environ attribute is a mapping that holds the current environment variables.
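
os.environ behaves like a dictionary of strings. A minimal sketch follows; the variable names DEMO_SETTING and DEMO_NOT_SET are hypothetical and are assumed not to be set already.

```python
import os

# Set a variable for this process (and any child processes it starts).
os.environ['DEMO_SETTING'] = 'on'
value = os.environ.get('DEMO_SETTING')

# get() with a default avoids a KeyError for variables that are not set.
missing = os.environ.get('DEMO_NOT_SET', 'default')

# Remove the variable again.
del os.environ['DEMO_SETTING']
still_set = 'DEMO_SETTING' in os.environ

print(value)      # on
print(missing)    # default
print(still_set)  # False
```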

>>> os.getcwd()
'D:\\Sublime Text 3'
>>> os.listdir()
['changelog.txt', 'crash_reporter.exe', 'date.txt', 'date2.txt', 'msvcr100.dll', 'Packages', 'plugin_host.exe', 'python3.3.zip', 'python33.dll', 'subl.exe', 'sublime.py', 'sublime_plugin.py', 'sublime_text.exe', 'unins000.dat', 'unins000.exe', 'unins000.msg', 'update_installer.exe']
>>> os.chdir('Packages')
>>> os.getcwd()
'D:\\Sublime Text 3\\Packages'
>>> os.getpid()
10480
>>> os.getppid()
7244

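The functions above are process and directory management tools. The main ones are:

getcwd(): returns the current working directory.
listdir(path): lists the contents of a directory.
chdir(path): changes the working directory.
getpid(): returns the current process ID, which can be checked in Task Manager.
getppid(): returns the parent process ID.

Running shell commands

system(): runs a shell command from a Python script.
popen(): runs a command and connects to its input/output streams.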

>>> os.system('dir /b')
>>> os.system('dir')
>>> os.popen('dir /b').read()
>>> os.popen('dir /b').readlines()
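
The dir command above exists only on Windows. The sketch below uses echo, which is assumed to be available in both the Windows Command Prompt and Unix shells, to show the difference between the two calls.

```python
import os

# popen returns a file-like object connected to the command's output.
with os.popen('echo hello') as stream:
    output = stream.read()

# system runs the command, lets it print directly, and returns its status.
status = os.system('echo hello')

print(repr(output.strip()))  # 'hello'
print(status)                # 0 when the command succeeds
```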

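The os module also handles files and directories:

mkdir('directory name'): creates a directory.
rmdir('directory name'): removes an empty directory.
rename('old name', 'new name'): renames a file or directory.
remove('file name'): deletes a file.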

>>> os.rename('info.txt','detail.txt')
>>> os.listdir()
['changelog.txt', 'crash_reporter.exe', 'date.txt', 'date2.txt', 'detail.txt', 'msvcr100.dll', 'Packages', 'plugin_host.exe', 'python3.3.zip', 'python33.dll', 'subl.exe', 'sublime.py', 'sublime_plugin.py', 'sublime_text.exe', 'unins000.dat', 'unins000.exe', 'unins000.msg', 'update_installer.exe']
>>> os.remove('detail.txt')
>>> os.listdir()
['changelog.txt', 'crash_reporter.exe', 'date.txt', 'date2.txt', 'msvcr100.dll', 'Packages', 'plugin_host.exe', 'python3.3.zip', 'python33.dll', 'subl.exe', 'sublime.py', 'sublime_plugin.py', 'sublime_text.exe', 'unins000.dat', 'unins000.exe', 'unins000.msg', 'update_installer.exe']
>>> os.mkdir('test')
>>> os.listdir()
['changelog.txt', 'crash_reporter.exe', 'date.txt', 'date2.txt', 'msvcr100.dll', 'Packages', 'plugin_host.exe', 'python3.3.zip', 'python33.dll', 'subl.exe', 'sublime.py', 'sublime_plugin.py', 'sublime_text.exe', 'test', 'unins000.dat', 'unins000.exe', 'unins000.msg', 'update_installer.exe']
>>> os.rmdir('test')
>>> os.listdir()
['changelog.txt', 'crash_reporter.exe', 'date.txt', 'date2.txt', 'msvcr100.dll', 'Packages', 'plugin_host.exe', 'python3.3.zip', 'python33.dll', 'subl.exe', 'sublime.py', 'sublime_plugin.py', 'sublime_text.exe', 'unins000.dat', 'unins000.exe', 'unins000.msg', 'update_installer.exe']
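
The same sequence can be run safely inside a throwaway directory. This is a minimal sketch that assumes tempfile.mkdtemp from the standard library for the scratch location; the file names mirror the session above.

```python
import os
import tempfile

base = tempfile.mkdtemp()          # scratch directory for the demo
work = os.path.join(base, 'test')
os.mkdir(work)

old_name = os.path.join(work, 'info.txt')
new_name = os.path.join(work, 'detail.txt')
with open(old_name, 'w') as f:
    f.write('demo')

os.rename(old_name, new_name)
after_rename = os.listdir(work)

os.remove(new_name)
after_remove = os.listdir(work)

# rmdir only works on empty directories, so the file had to go first.
os.rmdir(work)
os.rmdir(base)
base_exists = os.path.exists(base)

print(after_rename)  # ['detail.txt']
print(after_remove)  # []
print(base_exists)   # False
```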

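os also exposes portability constants:

sep: the separator between path components.
pathsep: the separator between entries in a search path list.
curdir: the symbol for the current directory.
pardir: the symbol for the parent directory.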

>>> os.sep
'\\'
>>> os.pathsep
';'
>>> os.curdir
'.'
>>> os.pardir
'..'
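
These constants let the same code run on Windows and Unix. A minimal sketch; the path components are made up for illustration.

```python
import os

parts = ['data', 'temp', 'data.txt']

# Joining with os.sep gives the same result as os.path.join here.
joined_by_sep = os.sep.join(parts)
joined_by_path = os.path.join(*parts)
same = joined_by_sep == joined_by_path

# PATH is a list of directories separated by os.pathsep.
search_dirs = os.environ.get('PATH', '').split(os.pathsep)

# os.pardir names the parent directory relative to the current one.
parent = os.path.abspath(os.pardir)
expected_parent = os.path.dirname(os.getcwd())

print(same)                       # True
print(len(search_dirs) >= 1)      # True
print(parent == expected_parent)  # True
```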

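os also includes the path submodule, os.path, for working with file and directory paths:

isdir(path): whether the path is a directory.
isfile(path): whether the path is a file.
exists(path): whether the path exists.
split(path): splits a path into its directory and file name.
splitext(path): splits off the file extension.
join(): joins path components.
normpath(): normalizes a path.
abspath(): converts a path to an absolute path.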

>>> os.path.isdir(r'D:\Sublime Text 3\test')
True
>>> os.path.isfile(r'D:\Sublime Text 3\sublime.py')
True
>>> os.path.exists(r'D:\Sublime Text 3\sublime.py')
True
>>> name = r'c:\data\temp\data.txt'    # raw string, so \t is not read as a tab
>>> os.path.split(r'c:\data\temp\data.txt')
('c:\\data\\temp', 'data.txt')
>>> os.path.dirname(name)
'c:\\data\\temp'
>>> os.path.basename(name)
'data.txt'
>>> os.path.splitext(name)
('c:\\data\\temp\\data', '.txt')
>>> os.path.join(r'c:\temp','product.csv')
'c:\\temp\\product.csv'
>>> p = 'd:\\app\\db/files/data/csv'
>>> p
'd:\\app\\db/files/data/csv'
>>> os.path.normpath(p)
'd:\\app\\db\\files\\data\\csv'
>>> 
>>> 
>>> 
>>> os.path.abspath('..')
'D:\\'
>>> os.path.abspath('.')
'D:\\Sublime Text 3'
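
The results above depend on running under Windows. On other systems the same Windows-style behavior can be reproduced with the standard library's ntpath module, which is what os.path refers to on Windows. This is a minimal sketch; the paths are the ones from the session, written as raw strings.

```python
import ntpath  # Windows path rules, importable on any platform

name = r'c:\data\temp\data.txt'

head_tail = ntpath.split(name)
directory = ntpath.dirname(name)
filename = ntpath.basename(name)
root_ext = ntpath.splitext(name)
joined = ntpath.join(r'c:\temp', 'product.csv')
normalized = ntpath.normpath('d:\\app\\db/files/data/csv')

# Without the r prefix, "\t" in a path literal becomes a tab character.
broken = 'c:\temp'
has_tab = '\t' in broken

print(head_tail)   # ('c:\\data\\temp', 'data.txt')
print(directory)   # c:\data\temp
print(filename)    # data.txt
print(root_ext)    # ('c:\\data\\temp\\data', '.txt')
print(joined)      # c:\temp\product.csv
print(normalized)  # d:\app\db\files\data\csv
print(has_tab)     # True
```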