Regex: Regular Expressions

w55100

Last modified: 2024-03-18 11:32:12

Categories: python, fe. Tags: regular expressions, javascript, python

First published: 2023-05-21 01:24:39

Article link: https://blog.csdn.net/w55100/article/details/130787507

Preface

The goal of these notes is fluency with regular expressions.

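Basic syntax

1. Basic characters

\d matches any decimal digit; equivalent to the class [0-9].
\D matches any non-digit character; equivalent to the class [^0-9].
\s matches any whitespace character; equivalent to the class [ \t\n\r\f\v].
\S matches any non-whitespace character; equivalent to the class [^ \t\n\r\f\v].
\w matches any letter, digit, or underscore; equivalent to the class [a-zA-Z0-9_].
\W matches any character that is not a letter, digit, or underscore; equivalent to the class [^a-zA-Z0-9_].
(These equivalences are exact for ASCII matching. On Python 3 str patterns the shorthands also cover Unicode digits, letters, and whitespace unless re.ASCII is set.)

\n matches a newline.
\t matches a tab.
^ matches the start of the string (and the start of each line in multiline mode).
$ matches the end of the string (and the end of each line in multiline mode).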
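
As a quick check of the shorthands above, here is a minimal sketch using Python's re module. The sample string and variable names are made up for illustration.

```python
import re

# Hypothetical sample text: one tab, one newline, one space.
text = "id_42\tok\nline 2"

digits = re.findall(r"\d", text)    # single decimal digits
words = re.findall(r"\w+", text)    # runs of letters, digits, underscore
spaces = re.findall(r"\s", text)    # whitespace characters

# With re.MULTILINE, $ also matches just before each newline,
# so the first word that ends a line is "ok".
line_end_word = re.search(r"\w+$", text, flags=re.MULTILINE).group()

print(digits)         # ['4', '2', '2']
print(words)          # ['id_42', 'ok', 'line', '2']
print(spaces)         # ['\t', '\n', ' ']
print(line_end_word)  # ok
```

Note that the underscore counts as a \w character, which is why id_42 comes back as a single word.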

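2. Character classes

A pair of square brackets [] defines a character class,
also called a character set (an informal name, although it is easier to grasp).

Common ranges inside a class:

a-z
A-Z
0-9

You can also define your own class.

[abc9] matches exactly one character out of these four.

A ^ at the start of a class means "not".

[^abc9] matches any one character other than these four.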
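
A small sketch of a custom class and its negation; the sample string is an assumption chosen so that every character falls clearly inside or outside the class.

```python
import re

sample = "cab9-xyz"  # hypothetical input

in_class = re.findall(r"[abc9]", sample)       # characters that are a, b, c or 9
not_in_class = re.findall(r"[^abc9]", sample)  # everything else
lowercase = re.findall(r"[a-z]", sample)       # a range inside a class

print(in_class)      # ['c', 'a', 'b', '9']
print(not_in_class)  # ['-', 'x', 'y', 'z']
print(lowercase)     # ['c', 'a', 'b', 'x', 'y', 'z']
```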

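3. Repeating a character class

* zero or more times (greedy).
+ one or more times.
? zero or one time.
{m,n} between m and n times, inclusive.

Appending ? switches a quantifier such as * to non-greedy mode; it is written *?.

[abc9]+ matches one or more characters drawn from these four.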
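
The quantifiers can be compared side by side. This is a minimal sketch; the input strings are invented for the demonstration.

```python
import re

tags = "<a><b>"  # hypothetical input

greedy = re.search(r"<.*>", tags).group()   # * grabs as much as possible
lazy = re.search(r"<.*?>", tags).group()    # *? stops at the first chance

bounded = re.findall(r"\d{2,3}", "1 22 333 4444")    # 2 to 3 digits per match
optional = re.findall(r"colou?r", "color colour")    # ? makes the u optional
one_or_more = re.findall(r"[abc9]+", "abc9 xyz cab")

print(greedy)       # <a><b>
print(lazy)         # <a>
print(bounded)      # ['22', '333', '444']
print(optional)     # ['color', 'colour']
print(one_or_more)  # ['abc9', 'cab']
```

In the bounded case, 4444 yields only 444: the greedy {2,3} takes three digits and the single remaining digit is too short to match again.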

4. Alternation

Logical operator

| or

[abc]+|[456]+ matches either one or more characters from {a,b,c}, or one or more characters from {4,5,6}.
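
A short sketch of how this alternation behaves; the test string is an assumption. Each match comes entirely from one side of the |, so a mixed token such as a4 is split in two.

```python
import re

pattern = re.compile(r"[abc]+|[456]+")

tokens = pattern.findall("abca 4455 a4 xyz")  # hypothetical input
mixed = pattern.fullmatch("a4")               # neither branch covers "a4"

print(tokens)  # ['abca', '4455', 'a', '4']
print(mixed)   # None
```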

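Python

import re

match() checks whether the regex matches at the start of the string.
search() scans the string for the first location where the regex matches.
findall() finds all substrings the regex matches and returns them as a list.
finditer() finds all substrings the regex matches and returns them as an iterator.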
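
To see the four functions next to each other, here is a minimal sketch that applies one pattern to one string. Both the pattern and the string are illustrative assumptions.

```python
import re

text = "a1 b22 c333"  # hypothetical input
pattern = r"\d+"

at_start = re.match(pattern, text)           # None: the text starts with "a"
first = re.search(pattern, text).group()     # first match anywhere
everything = re.findall(pattern, text)       # all matches as a list of strings
spans = [m.span() for m in re.finditer(pattern, text)]  # Match objects, lazily

print(at_start)    # None
print(first)       # 1
print(everything)  # ['1', '22', '333']
print(spans)       # [(1, 2), (4, 6), (8, 11)]
```

finditer() is the one to reach for when positions are needed, since findall() returns plain strings.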

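1. match and fullmatch

re.match() matches at the start of the string.

If zero or more characters at the beginning of string match the pattern, it returns the corresponding match object. If there is no match, it returns None; note that this is different from a zero-length match.

It returns an re.Match object.

re.fullmatch() matches the whole string.

It returns None when the whole string does not match.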
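
The difference between the two can be sketched as follows, assuming a file name similar to the one used later in this post.

```python
import re

name = "compound_60.sdf"

prefix = re.match(r"[a-z]+", name)                  # matches only the leading letters
whole_fail = re.fullmatch(r"[a-z]+", name)          # None: the rest is not covered
whole_ok = re.fullmatch(r"[a-z]+_\d+\.sdf", name)   # the pattern spans the string
zero_length = re.match(r"\d*", name)                # a match of length 0, not None

print(prefix.group())       # compound
print(whole_fail)           # None
print(whole_ok.group())     # compound_60.sdf
print(zero_length.span())   # (0, 0)
```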

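2. search

Partial matching, also called substring matching.

It scans the whole string for the first location where the pattern matches and returns the corresponding match object. If no position matches, it returns None; note that this is different from finding a zero-length match.

It returns an re.Match object.

Example

content = "./output/compound_60.sdf"
pattern = r"[\S]*\.sdf"  # \. matches a literal dot
result = re.search(pattern, content)
print(result)
print(result.group())
print(result.span())

Output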

<re.Match object; span=(0, 24), match='./output/compound_60.sdf'>
./output/compound_60.sdf
(0, 24)

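start() and end() return the two halves of span():

result.start()
result.end()

A pattern that can match the empty string always succeeds, so search() returns a zero-length match rather than None. (match() behaves the same way; None is returned only when the pattern cannot match at all.)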

content = "./output/compound_60.sdf"
pattern = r"[a-z]*"
result = re.search(pattern,content)
print(result)
print(result.group())
print(result.span())

Output

<re.Match object; span=(0, 0), match=''>

(0, 0)

Confirming that the match is the empty string:

print(result.group()=='')
True
print(result.group() is None)
False
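
The two outcomes, a zero-length match and no match at all, can be told apart by testing for None. A minimal sketch follows; the helper name first_or_none is hypothetical and not part of the re module.

```python
import re

content = "./output/compound_60.sdf"

zero_length = re.search(r"[a-z]*", content)  # can match "", so it succeeds at index 0
no_match = re.search(r"[A-Z]+", content)     # needs an uppercase letter: None


def first_or_none(pattern, text):
    """Return the first matched substring, or None when nothing matches."""
    m = re.search(pattern, text)
    return m.group() if m is not None else None


print(zero_length.group() == "")           # True
print(no_match)                            # None
print(first_or_none(r"\d+", content))      # 60
print(first_or_none(r"[A-Z]+", content))   # None
```

Calling .group() directly on the result of a failed search raises AttributeError, which is why the None check comes first.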

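2. findall()

Exhaustive substring matching: it returns every non-overlapping match
as a list.

content = "./output/compound_60.sdf"
pattern = r"[\S]*\.sdf"
print(re.findall(pattern, content))
# Output
['./output/compound_60.sdf']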

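3. Replacement

re.sub(pattern, repl, string, count=0, flags=0) returns the string with matches replaced.
re.subn takes the same arguments and returns a tuple (new_string, number_of_replacements).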
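
A minimal sketch of both functions on the path used elsewhere in this post. The replacement strings are arbitrary choices for illustration.

```python
import re

content = "./output/compound_60.sdf"

# Replace the extension; \. keeps the dot literal, $ pins it to the end.
renamed = re.sub(r"\.sdf$", ".mol", content)

# repl may be a function that receives the Match object.
padded = re.sub(r"_(\d+)", lambda m: "_" + m.group(1).zfill(4), content)

# subn also reports how many replacements were made.
masked, n_replaced = re.subn(r"\d", "#", content)

# count limits the number of replacements.
first_only = re.sub(r"/", "-", content, count=1)

print(renamed)             # ./output/compound_60.mol
print(padded)              # ./output/compound_0060.sdf
print(masked, n_replaced)  # ./output/compound_##.sdf 2
print(first_only)          # .-output/compound_60.sdf
```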

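4. Group extraction

The functions above only locate the substring that matches the whole pattern.
Often we need just one part of that substring.
That is what group extraction is for.

A pair of parentheses () defines a capturing group.
When the pattern contains groups, findall() returns the group contents instead of the whole match.

If nothing matches, it returns an empty list [].

Note that a single group can look "lazy" in some cases,
because the engine always lets the earlier * consume as much as possible first.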

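content = "./output/compound_60.sdf"
pattern = r"[\S]*(\d*)\.sdf"
print(re.findall(pattern, content))
# Output
['']

content = "./output/compound_60.sdf"
pattern = r"[\S]*(\d+)\.sdf"
print(re.findall(pattern, content))
# Output
['0']

In the examples above the group (\d+) comes later in the pattern, so the preceding * is satisfied first and the group ends up looking lazy.
In such cases you have to steer the match explicitly.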

content = "./output/compound_60.sdf"
pattern = r"[\S]*_(\d+)\.sdf"
print(re.findall(pattern, content))
# Output
['60']

Alternatively, use ? to switch * to non-greedy mode.

content = "./output/compound_60.sdf"
pattern = r"[\S]*?(\d+)\.sdf"
print(re.findall(pattern, content))
# Output
['60']

With two groups, the first one is greedy and the second one looks lazy.

content = "./output/compound_60.sdf"
pattern = r"([\S]*)(\d+)\.sdf"
print(re.findall(pattern, content))
# Output
[('./output/compound_6', '0')]

content = "./output/compound_60.sdf"
pattern = r"([\S]*)(\d*)\.sdf"
print(re.findall(pattern, content))
# Output
[('./output/compound_60', '')]

Conjecture: when greedy quantifiers compete, the last group ends up with the least content it can match.

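Note that in Python, re.match() and re.search() return a Match object rather than a list, so the parentheses do not show up in the return value directly. Captured groups are read with group(n) or groups(); findall() is the call that returns group contents as a list.
This differs from JS, where match returns an array that already contains the groups.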
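
Captured groups can also be read from the Match object that re.search() or re.match() returns. The following is a minimal sketch; the group names stem and num are made up for the example.

```python
import re

content = "./output/compound_60.sdf"

m = re.search(r"_(\d+)\.sdf", content)
whole = m.group(0)     # the entire matched substring
number = m.group(1)    # the first capturing group
all_groups = m.groups()

# Named groups make the extraction self-documenting.
named = re.search(r"(?P<stem>[a-z]+)_(?P<num>\d+)\.sdf", content)
fields = named.groupdict()

print(whole)       # _60.sdf
print(number)      # 60
print(all_groups)  # ('60',)
print(fields)      # {'stem': 'compound', 'num': '60'}
```

group(0) plus the numbered groups corresponds to what the JS match array holds in items 0, 1, and so on.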

Ref:

https://docs.python.org/zh-cn/3/howto/regex.html

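JavaScript

Regexes in JS are typically used in one of two ways: through string methods,

string.match(regex): substring matching, returns an Array.
string.matchAll(global_regex): the regex must carry the g flag; returns an iterator.
string.search(regexp): returns the index of the first match, or -1.
string.replace: replacement.

or through RegExp methods:

regex.exec(string): equivalent to string.match(regex) when the regex has no g flag.
regex.test(string): returns true or false.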

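1. match: substring matching + group extraction

match in JS resembles findall in Python:
it does substring matching plus group extraction.

It is not the same as match in Python, though. Python's match anchors at the start of the string, whereas JS's match looks for a substring, which is closer to Python's re.search().

Return format

Item 0: the matched substring.
Items 1 onward: the captured groups, one item per group.

The main difference between match and matchAll is whether every match is returned.
match (without the g flag) returns as soon as it finds the first match.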

const str1 = 'compound_60.sdf'
console.log([...str1.matchAll(/[\S]*_(\d+)\.sdf/g)])
// [['compound_60.sdf', '60', index: 0,
//   input: 'compound_60.sdf', groups: undefined]]

console.log(str1.match(/[\S]*_(\d+)\.sdf/))
// ['compound_60.sdf', '60', index: 0,
//  input: 'compound_60.sdf', groups: undefined]

That does not make match lazy: quantifiers such as ? are still greedy and are satisfied as fully as possible.

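2. Full-string matching

JS has no built-in full-string match method,
nor a dedicated method for matching at a given position (such as the start-anchored match in Python).

To match the whole string you have to combine ^ and $.
Similarly, matching at a given position means spelling the position out in the pattern.
It is a little more work, but positional matching is still doable:
^[\S]{n} skips exactly n non-whitespace characters, which pins the start position.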

const str2 = '/aa/compound_60.sdf'
console.log(str2.match(/^\/[a-z]+\/[a-z]+_(\d+)\.sdf$/))

// ['/aa/compound_60.sdf', '60', index: 0,
//  input: '/aa/compound_60.sdf', groups: undefined]
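
For comparison on the Python side, the same anchoring idea can be sketched as below. It assumes the path from the JS example; \A and \Z are Python's strict start-of-string and end-of-string anchors.

```python
import re

path = "/aa/compound_60.sdf"
body = r"/[a-z]+/[a-z]+_(\d+)\.sdf"

# Anchoring a search by hand gives the same result as fullmatch().
anchored = re.search(r"\A" + body + r"\Z", path)
full = re.fullmatch(body, path)

# Positional matching: skip exactly 4 non-whitespace characters, then capture.
skipped = re.match(r"\S{4}([a-z]+)", path).group(1)

# A compiled pattern can also start matching at a given index.
from_index = re.compile(r"[a-z]+").match(path, 4).group()

print(anchored.group(1), full.group(1))  # 60 60
print(skipped)                           # compound
print(from_index)                        # compound
```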