Python Data Analysis, Chapter 2: Regular Expressions, Anonymous Functions, partial and Currying, Iterators (iter, yield), and Generator Expressions

Latest recommended article published on 2020-12-09 13:51:58

G8_xixi

Category column: [Chewing Through Books Together] Python Data Analysis

Article link: https://blog.csdn.net/weixin_42277616/article/details/107811670

Contents

re functions
re symbols
Flags
Regular expression examples
Anonymous functions and partial argument application
Generators
Generator expressions

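re functions

1. re.match: match at the start of the string

import re

index = re.match('what', 'whatff i whatfffff')

if index:
    print(index.start())   # start position of the match
    print(index.end())     # end position: last matched index 3 + 1 = 4, where the unmatched 'f' begins
    print(index.span())    # (start, end) tuple
    print(index.group(0))  # the matched string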

0
4
(0, 4)
what

2. re.search: scan the whole string and return only the first successful match

import re

list_not_in_set = ['abc?123','book','change','aaaa']

for value in list_not_in_set:  
    index_not_in_set = re.search('[^a]',value)
    if index_not_in_set:
        print(index_not_in_set.group(0))

b
b
c

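Return values of a match

start() returns the position where the match begins.

end() returns the position where the match ends (one past the last matched character).

span() returns a tuple holding the (start, end) positions of the match. "Span" means range, as in "life span".

group() returns the string matched by the RE. Look at the pattern to determine how many groups there are: group(0) is the whole match, and group(n) is the n-th parenthesized group.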
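As a rough illustration of how the number of groups follows from the pattern, here is a minimal sketch. The sample string and the year-month pattern are made up for this example and are not from the original post.

```python
import re

# Hypothetical sample text, chosen only for illustration.
m = re.search(r'(\d{4})-(\d{2})', 'released 2020-08 in beta')

print(m.group())    # the whole match (same as group(0))
print(m.group(1))   # text captured by the first pair of parentheses
print(m.group(2))   # text captured by the second pair of parentheses
print(m.groups())   # all captured groups as a tuple
print(m.span(2))    # (start, end) of the second group only
```

The pattern has two pairs of parentheses, so there are two numbered groups in addition to group(0).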

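3. findall: return a list of all matches

import re

p_findll = re.compile(r'\d+')   # one or more digits
result1 = p_findll.findall('123abc456')
# find every run of digits and return them as a list
result2 = p_findll.findall('123abc456', 3, 8)
# search only from index 3 (the 'a', inclusive) up to index 8 (the '6', exclusive)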
 
print(result1)
print(result2)

['123', '456']
['45']

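4. re.compile: compile a pattern

The r prefix creates a raw string, so \ is kept as an ordinary backslash character rather than starting a Python escape sequence.

import re

pattern = re.compile(r'\d+')  # one or more digits

m = pattern.match('one12twothree34four')  # match only checks the start of the string, so there is no match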
n = pattern.search('one12twothree34four').group(0)

print(m)
print(n)

None
12

5. sub: replace or delete

import re

s_sub = "123 abc 456 456 456"  # the input string
p_sub = '456'  # the pattern to match
r_sub = '789'  # the replacement text

s_subed = re.sub(p_sub, r_sub, s_sub, count=1, flags=0)
print("count = 1:", s_subed)
# count=1: replace only the first match

s_subed_ed = re.sub(p_sub, r_sub, s_sub, count=0, flags=0)
print("count = 0:", s_subed_ed)
# count=0: replace every match (no limit)

print(s_subed_ed)

count = 1: 123 abc 789 456 456
count = 0: 123 abc 789 789 789
123 abc 789 789 789

6. finditer: return an iterator of match objects

import re

it = re.finditer(r"\d+", "123abc456efg789")

for match in it:
    print(match.group())

123
456
789

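7. re.split: split and return a list

import re

re.split(r'\W+', '，runoob, runoob,    runoob.')

# \W matches any character that is not a letter, digit, or underscore.
# Word characters are kept; every run of commas, spaces, or the final period acts as one separator.
# The string both starts and ends with a separator, so the list has an empty string at each end.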

['', 'runoob', 'runoob', 'runoob', '']

8. (?P<name>…): named groups

import re

s = '1102231990xxxxxxxx'

res = re.search(r'(?P<province>\d{3})(?P<city>\d{3})(?P<born_year>\d{4})', s)

print(res.groupdict())

{'province': '110', 'city': '223', 'born_year': '1990'}

import re
 
# Double every number that matches

def double(matched):
    value = int(matched.group('value'))
    return str(value * 2)

s = 'A23G4HFD567'
s_2 = re.sub(r'(?P<value>\d+)', double, s)

print(s_2)

A46G8HFD1134

re symbols

Symbol: meaning
.: any character except the newline \n
{}: explicit quantifier notation, for example {3} or {2,4}
[]: explicit set of characters to match; any one of them may match
(): logical grouping of part of an expression
*: 0 or more of the previous expression (greedy)
?: 0 or 1 of the previous expression; placed after another quantifier (*?, +?, ??) it makes that quantifier non-greedy
+: 1 or more of the previous expression (greedy)
\: escape character; preceding one of the symbols above, it makes that symbol a literal instead of a special character

^: start of a string
$: end of a string
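To make a few of these symbols concrete, here is a small sketch. The word lists are invented for this example; only the re calls themselves are standard.

```python
import re

# Hypothetical test words, chosen only for illustration.
words = ['ct', 'cat', 'caat', 'caaat', 'cart']

# {1,2}: between one and two 'a' characters; ^ and $ anchor the whole string
quantified = [w for w in words if re.search(r'^ca{1,2}t$', w)]

# ?: the 'r' may appear zero or one time
optional = [w for w in words if re.search(r'^car?t$', w)]

# . matches any character, while \. matches only a literal dot
samples = ['a.b', 'axb']
any_char = [s for s in samples if re.search(r'a.b', s)]
literal_dot = [s for s in samples if re.search(r'a\.b', s)]

print(quantified)    # ['cat', 'caat']
print(optional)      # ['cat', 'cart']
print(any_char)      # ['a.b', 'axb']
print(literal_dot)   # ['a.b']
```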

## Greedy vs. non-greedy: * takes as many as possible, *? takes as few as possible
import re

list_more = ['acd','abbb','123abbbb','123abb123']

print('ab* (greedy):')
for value in list_more:
    index_more_1 = re.search('ab*', value)
    if index_more_1:
        print(index_more_1.group(0))

print('\nab*? (non-greedy):')
for value in list_more:
    index_more_3 = re.search('ab*?', value)
    if index_more_3:
        print(index_more_3.group(0))

ab* (greedy):
a
abbb
abbbb
abb

ab*? (non-greedy):
a
a
a
a

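Symbol: meaning
\d: a decimal digit. With re.ASCII (or in bytes patterns) this is exactly [0-9]; for str patterns in Python 3 it also matches other Unicode decimal digits. Hexadecimal letters such as a-f are never included.
\D: any non-digit character, the complement of \d
\w: any word character, equivalent to [a-zA-Z0-9_] with re.ASCII; for str patterns it also matches Unicode letters and digits
\W: any non-word character, the complement of \w
\s: any whitespace character (space, form feed, newline, carriage return, tab, vertical tab), equivalent to [ \f\n\r\t\v] with re.ASCII
\S: any non-whitespace character, the complement of \s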

import re

list_any_word = ['abc','123','ABC','!#','*&']

for value in list_any_word:  
    index_any_word = re.search(r'\w', value)
    if index_any_word:
        print(index_any_word.group(0))

a
1
A
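One detail worth checking in Python 3 is that \d on a str pattern is not limited to [0-9] unless re.ASCII is passed. The sketch below uses a made-up string containing full-width digits to show the difference.

```python
import re

# Hypothetical text: ASCII digits followed by full-width (Unicode) digits.
text = '42 ４２'

unicode_digits = re.findall(r'\d+', text)           # default behaviour for str patterns
ascii_digits = re.findall(r'\d+', text, re.ASCII)   # restrict \d to [0-9]

print(unicode_digits)   # ['42', '４２']
print(ascii_digits)     # ['42']
```

If only ASCII digits should be accepted, passing re.ASCII or writing [0-9] explicitly are both reasonable choices.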

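Flags

The most commonly used flags (modifiers) are the following six:

re.I: ignore case

re.L: locale-dependent matching; \w, \W, \b, \B and case-insensitive matching follow the current locale. In Python 3 it works only with bytes patterns and is rarely needed.

re.M: multi-line matching; affects ^ and $, which then also match at the start and end of each line

re.S: affects ., which then also matches the newline character

re.U: interpret characters according to Unicode; affects \w, \W, \b, \B. In Python 3 this is already the default for str patterns.

re.X: verbose mode; whitespace and # comments inside the pattern are ignored, which makes the pattern easier to read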
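The effect of the flags is easiest to see side by side. This is a minimal sketch with a made-up two-line string; the variable names are illustrative only.

```python
import re

# Hypothetical two-line text, chosen only for illustration.
text = 'First line\nsecond LINE'

# re.I: 'line' matches regardless of case
ignore_case = re.findall(r'line', text, re.I)

# re.M: ^ matches at the start of every line, not just the start of the string
line_starts = re.findall(r'^\w+', text, re.M)

# re.S: . also matches '\n', so .* can run across the line break
across_lines = re.search(r'First.*LINE', text, re.S)
same_line_only = re.search(r'First.*LINE', text)   # no re.S: . stops at '\n'

# re.X: whitespace and comments inside the pattern are ignored
verbose = re.compile(r'''
    (\w+)   # a word
    \s+     # one or more whitespace characters
    (\w+)   # another word
''', re.X)

print(ignore_case)                   # ['line', 'LINE']
print(line_starts)                   # ['First', 'second']
print(across_lines is not None)      # True
print(same_line_only)                # None
print(verbose.search(text).groups()) # ('First', 'line')
```

Flags can be combined with |, for example re.M | re.I.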

Regular expression examples

A small example

states = ['   ?alabama!  ', 'FlOrIda', 'south  carolina##', 'West virginia?']

Strip the whitespace, remove the punctuation, and normalize the capitalization:

import re

def clean_strings(strings):
    result = []
    for value in strings:
        print('0 : ' + value)
        value = value.strip()  # remove leading and trailing whitespace, including newlines
        print('1 strip: ' + value)
        value = re.sub('[!#?]', '', value)  # replace !, # or ? with an empty string, i.e. delete them
        print('2 re: '+ value)
        value = value.title()
        print('3 title: '+ value)
        result.append(value)
        
    return result

import re

line = "Cats are smarter than dogs"

matchObj = re.match(r'(.*) are (.*?) .*', line, re.M | re.I)

## re.M: multi-line matching
## re.I: ignore case

if matchObj:

    print("matchObj.group() : ", matchObj.group())
    # the whole match: anything, ' are ', anything, a space, anything
    print("matchObj.group(1) : ", matchObj.group(1))
    # group(1): from the leading 'C' up to ' are '
    print("matchObj.group(2) : ", matchObj.group(2))
    # group(2): starts right after 'are ', at the 's', and stops before the ' .*' part
    # ' .*' matches ' than dogs'
    # (.*?) is non-greedy, so group(2) = 'smarter'
else:
    print("No match!!")

matchObj.group() :  Cats are smarter than dogs
matchObj.group(1) :  Cats
matchObj.group(2) :  smarter

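Anonymous functions and partial argument application

This simply means deriving a new function from an existing one by fixing some of its arguments.

## Add two numbers together
def add_numbers(x, y):
    return x + y

## Turn it into a function that takes only the single argument y
add_five = lambda y: add_numbers(5, y)

## Note that add_five is a function, so call it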
add_five(8)
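The same derivation can also be written with functools.partial from the standard library instead of a lambda. The sketch below redefines add_numbers so it runs on its own; add_five_partial is a name chosen for this example.

```python
from functools import partial

def add_numbers(x, y):
    return x + y

# partial fixes the first positional argument (x=5) and returns a new callable
add_five_partial = partial(add_numbers, 5)

print(add_five_partial(8))   # 13
```

Compared with the lambda, partial keeps a reference to the original function and the fixed arguments (add_five_partial.func, add_five_partial.args), which can help when debugging.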

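Generators

Advantages of iterators

An index describes a position relative to the start of a sequence, while an iterator remembers the position it has currently reached. An iterator is like a single step of a for loop: after the third step, the next call simply produces the fourth item.

For structures without random access (a linked list, a file, a stream), reaching item 4 by position means walking from item 0 every time, and those repeated steps cost performance.

A Python list does support constant-time indexing, so for a list of 1,000,000 elements the gain from iteration is not faster lookup. The gain is that values can be produced one at a time, on demand, without holding all of them in memory.

In short: an index remembers where the sequence starts; an iterator remembers where you currently are.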
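The idea that an iterator remembers its current position can be seen directly with the built-in iter and next. This is a minimal sketch; the list contents are arbitrary.

```python
numbers = [10, 20, 30]

it = iter(numbers)          # create an iterator over the list

first = next(it)            # 10: the iterator now remembers it is past item 0
second = next(it)           # 20: continues from where it stopped
third = next(it)            # 30
exhausted = next(it, None)  # None: the default is returned instead of raising StopIteration

print(first, second, third, exhausted)
```

A for loop does the same thing internally: it calls iter once and then next repeatedly until StopIteration is raised.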

To create a generator, simply use yield instead of return in a function.

def squares(n=10):
    print('Generating squares from 1 to {0}'.format(n ** 2))
    for i in range(1, n + 1):
        yield i ** 2

gen = squares()
gen

<generator object squares at 0x00000132FE5552A0>

for x in gen:
    print(x, end=' ')

Generating squares from 1 to 100
1 4 9 16 25 36 49 64 81 100
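Note that the "Generating squares" message appears only once iteration starts, not when squares() is called. The sketch below repeats the function so it runs on its own and steps through the generator by hand with next.

```python
def squares(n=10):
    print('Generating squares from 1 to {0}'.format(n ** 2))
    for i in range(1, n + 1):
        yield i ** 2

gen = squares(3)     # nothing is printed yet: the function body has not started
first = next(gen)    # the body runs up to the first yield: prints the message, returns 1
rest = list(gen)     # consumes the remaining values
again = list(gen)    # the generator is exhausted, so this is empty

print(first, rest, again)   # 1 [4, 9] []
```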

def yield_test(n):
    for i in range(n):
        yield change(i) 
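        ## Builds a string of firecrackers: each one calls change once,
        ## then the generator pauses here and saves its state until the next value is requested
        #print("i = ",i)
    print("end.")

def change(i):
    return i*2

print(yield_test(5))  # only creates the generator (the string of firecrackers change(0) ... change(4)); nothing runs yet

for i in yield_test(5):  # light the firecrackers with a for loop
    print(i, "in for")

sum(yield_test(6))  # light the firecrackers with sum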

<generator object yield_test at 0x00000132FE555480>

0 in for
2 in for
4 in for
6 in for
8 in for
end.

end.
30

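Generator expressions

Replace the [ ] of a list comprehension with ( ) to get a generator expression.
The itertools module provides related tools such as groupby.
For permutations and combinations, see permutations, combinations, and product.
All of these can be consumed only once: after the firecrackers have gone off, they are gone.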

from itertools import permutations
test = permutations([1, 2, 3, 4], 3)
for n in test:
    print(n)

from itertools import combinations
test = combinations([1, 2, 3, 4], 3)
for n in test:
    print(n)
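The remaining items mentioned above (a generator expression, groupby, and product) can be sketched as follows. The names list and the grouping key are illustrative choices, not part of the original post.

```python
from itertools import groupby, product

# Generator expression: like a list comprehension, but with ( ) and lazy
gen = (x ** 2 for x in range(5))
total = sum(gen)        # consumes the generator
leftover = list(gen)    # already used up, so nothing is left

# groupby groups only consecutive items that share the same key
names = ['Alan', 'Adam', 'Wes', 'Will', 'Albert', 'Steven']
grouped = [(letter, list(group))
           for letter, group in groupby(names, key=lambda s: s[0])]

# product yields the Cartesian product of its inputs
pairs = list(product('ab', [1, 2]))

print(total, leftover)   # 30 []
print(grouped)
print(pairs)             # [('a', 1), ('a', 2), ('b', 1), ('b', 2)]
```

Because groupby only merges adjacent items, 'Albert' ends up in its own 'A' group here; sorting by the key first would give one group per letter.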