【python学习8】字符串处理,数据验证和处理,正则表达式

香蕉你个banna

已于 2022-07-30 18:39:24 修改

阅读量813

点赞数 2

分类专栏： python学习文章标签： python 学习 pycharm

于 2022-07-22 12:39:54 首次发布

本文链接：https://blog.csdn.net/qq_56769994/article/details/125918128

版权

python学习专栏收录该内容

13 篇文章 6 订阅

订阅专栏

（一）字符串处理的相关方法

注意是相关方法,不是函数

#大小写转换
s1="HelloWorld"
new_s1=s1.lower()    #全部转为小写，并产生新的字符串赋给new_s1
new__s1=s1.upper()    #全部转为大写，并产生新的字符串赋给new__s1

#字符串的分割
s2="cxk@163.com"
new_s2=s2.split("@")    #分割之后是列表类型
print("邮箱名",new_s2[0],"邮箱域名",new_s2[1])

#统计子串在指定字符串的次数
print(s1.count("o"))

#检索操作
print(s1.find("o"))    #"o"首次出现的位置索引
print(s1.index("o"))

print(s1.find("p"))    #没找到，故结果为-1
print(s1.find("p"))    #该语句会报错，ValueError,因为没找到

#判断前缀和后缀
print(s1.startswith("H"))    #输出True
print(s1.startswith("p"))    #输出False

print("demo.py".endswith(".py"))    #输出True

s="HelloWorld"
#替换
new_s=s.replace("0","你好")    #替换后产生新的字符串，赋给new_s

#字符串在指定宽度范围内居中
print(s.center(20,*))

#去除字符串的左右空格
s1="   Hello   World  "
print(s.strip())    #默认去除空格，去除之后会产生新的字符串
print(s.lstrip())    #去除左侧空格
print(s.rstrip())    #去除右侧空格

#去除指定的字符，与指定字符的顺序无关
s3="dl-helloworld"
print(s3.strip("ld"))    #输出是"-hellowor",删除指定的“字符”，与“ld”和“dl”的顺序无关

（二）格式化字符串

#(1)使用占位符格式化字符串
name="马冬梅"
age=18
score=98.3
print("姓名:%s,年龄:%d,成绩:%f" %(name,age,score))    #注意写法，后面是%元组形式
print("姓名:%s,年龄:%d,成绩:%.2f" %(name,age,score))    #score的输出保留两位小数

#(2)使用f-string格式化字符串
print(f"姓名：{name},年龄：{age},成绩：{score}")    #""之前必须要有参数f,否则就是简单的字符串，且以{}表明被替换的字符串

#(3)使用字符串的format()方法
print("姓名:{0},年龄:{1},成绩:{2}".format(name,age,score))#注意前面花括号{}中的数字对应format()中的参数位置(位置从0开始排)，卡槽对应相应位置

print("姓名:{2},年龄:{0},成绩:{1}".format(age,score,name))

#方法format()的格式控制，前三位控制
s="helloworld"
print("{0:*<20}".format(s))    #0对应format()中的s(卡槽对应位置)，接着按照标准的格式控制写

#格式控制中的千位分隔符,(只适用整数和浮点数)
print("{0:,}".format(9873256))    #输出9,873,256
print("{0:,}".format(9873256.236))    #输出9,873,256.236

#浮点数小数部分的精度
print("{0:.2f}".format(3.1415926535))    #输出3.14,是四舍五入的保留
#或者字符串的最大显示长度
print("{0:.5}".format("helloworld"))    #输出hello,输出宽度为5

#最后一位控制:类型控制
#整数类型
a=425
print("{0:b},{0:c},{0:d},{0:o},{0:x},{0:X}".format(a))#模板中的卡槽是0所以format()只需要一个参数
#类型c是对应unicode字符，类型x对应16进制小写字符，类型X对应16进制大写字符

#浮点数类型
b=3.141592
print("{0:.2f},{0:.2E},{0:.2e},{0:.2%}".format(b))
#E和e是科学计数法，%是以百分数形式输出且以百分数保留小数

（三）字符串的编码与解码

errors有三个可选：

strict指严格操作，不符合直接报错

ignore指忽略错误

replace指替换，不认识直接替换成 "?" 号

s="伟大的中国"
#编码
scode=s.encode("gbk")
s_utf-8=s.encode("utf-8")

#解码
print(bytes.decode(scode,"gbk"))
print(bytes.decode(s_utf-8,"utf-8"))    #必须这种写法?,不能写成scode.decode()吗?

(四)数据的验证

str.isdigit(): 验证所有字符都是 十进制的阿拉伯数字

str.isnumeric(): 验证所有字符都是数字,包括阿拉伯数字，罗马数字，汉字大写数字(壹贰)，汉字数字(一二)，二进制数字不行

str.isalpha(): 判断所有字符都是字母(包括中文)，中文数字也行，但阿拉伯数字不行

str.isalnum: 判断所有字符都是数字或字母(包括中文)，中文数字也行，阿拉伯数字也行

str.islower(),str.isupper(): 判断所有字符都是大写或小写,由于中文不分大小写，故只判断英文字母的大小写

str.istitle(): 单词之间用空格分割，否则认为是一个单词，由于中文不分大小写，故只判断英文字母的大小写

#判断所有字符都是数字(十进制的阿拉伯数字)
print("123".isdigit())    #True
print("一二三".isdigit())    #False
print("0b1001".isdigit())    #False
print("IIIIII".isdigit())    #False罗马数字
print("壹贰叄"isdigit())    #False

#判断所有字符都是数字(罗马数字，十进制阿拉伯，汉字数字(一二三和壹贰叄))
print("123".isnumeric())    #True
print("一二三".isnumeric())    #True
print("0b1001".isnumeric())    #False
print("IIIIII".isnumeric())    #True 罗马数字
print("壹贰叄"isnumeric())    #True

#判断都是字母(英文中文)
print("hello你好".isalpha())    #True
print("hello你好123".isalpha())    #False
print("hello你好一二三".isalpha())    #True
print("hello你好IIIIII".isalpha())    #False

#判断所有字符都是数字和字母(英文和中文)
print("hello你好123".isalnum())    #True
print("hello你好123...".isalnum())    #Falde
print("hello你好一二三".isalnum())    #True
print("hello你好壹贰叄".isalnum())    #True
print("hello你好IIIIII".isalnum())    #True

#判断首字母大写
print("Hello".istitle())    #True
print("HelloWorld".istitle())    #False
print("Helloworld".istitle())    #True
print("Hello world".istitle())    #False
print("Hello World".istitle())    #True
print("Hello你好".istitle())    #True



#判断所有字符都是小写
print("Hello".islower())    #False
print("hello".islower())    #True
print("hello你好".islower())    #True,因为中文没有大小写，故只判断字母大小写
#同理可判断所有字符都是大写

#判断是否都是空白字符
print("\t".isspace())      #True  
print("\n".isspace())     #True
print("    ".isspace())     #True

（五）数据处理

1，字符串的拼接操作

使用“+”拼接 和 使用join()方法拼接 是最常用的，join()方法也用于 列表和元组的拼接

s1="hello"
s2="world"

#(1)使用“+”拼接
print(s1+s2)

#(2)使用join()方法拼接,使用列表进行拼接
print("".join["hello","world"])    #对空字符串使用join()方法，输出"hello world"
#join()方法是在列表中的每个字符串添加一个符号拼接在一起
print("*".join["hello","world","php"])    #输出“hello*world*php”,首个字符串没有符号

#(3)直接拼接
print("hello""world")

#(4)使用格式化字符串拼接
print("%s%s" %(s1,s2))    #输出“helloworld”
print(f"{s1}{s2}")    #输出“helloworld”
print("{0}{1}".format(s1,s2))    #输出“helloworld”

2，字符串的去重

s="alknvlakgakdvnpawirugsagkj"
#(1)使用for循环和not in方法去重
new_s1=""
for item in s:
    if item not in new_s1: #判断s中的字符是否在new_s1中存在(或者说是重复)
        new_s1+=item    #因为s和new_s1都是字符串，所以“+”进行拼接操作
print(new_s1)

#(2)使用索引，range()函数，for循环，not in
new_s2=""
for i in range(len(s)):
    if s[i] not in new_s2:    #索引元素进行判断
        new_s2+=s[i]
print(new_s2)

#(3)通过集合去重+列表排序+join()方法拼接
new_s3=set(s)    #new_s3是集合类型
lst=list(new_s3)    #转为列表
lst.sort(key=s.index)    #sort()方法排序，使用参数key(指定比较排序的键)
print("".join(lst))    #join()方法拼接

3，列表元素的去重

lst=["金星","木星","水星","火星","土星","金星","木星","水星","火星","土星"]
new_lst=[]
#(1)用for循环遍历+ not in
for item in lst:
    if item not in new_lst:

        new_lst.append(item)  #添加item到new_lst中

#(2)for+range()+not in
new_lst2=[]
for i in range(len(lst)):
    if lst[i] not in new_lst2:
        new_lst2.append(lst[i])

#(3)利用集合去重，转为列表排序(用key参数)
s_lst=set(lst)
new_lst3=list(s_lst)
new_lst3.sort(key=lst.index)

(六)正则表达式

1，正则表达式的初次认识

（七）内置模块re的使用(还是处理字符串)

①pattern是模式字符串,也可以说 匹配规则，string是待匹配的字符串，flag是标志位(控制匹配的方式，例如是否区分大小写，是否利用多行模式)

②用前面的 pattern匹配string,是否满足规则

③pattern 是 按照前面正则表达式的格式书写

#re模块的使用，导入re模块
import re
pattern=r"\d\.\d+"    #r表示元字符，指python中的转义字符不起作用
s="i study python everyday"
match=re.match(pattern,s,re.I)#参数re.I是忽略大小写
print(match)    #输出是None

s2="3.10python i study"
match2=re.match(pattern,s2,re.I)
print(match2)    #输出<re.match.object>
print("匹配的起始位置",match2.start())    #输出0
print("匹配的结束位置"，match2.end())    #输出4，因为在位置4没匹配到，故结束
print("匹配的位置区间",match2.span)    #输出（0.4）
print("待匹配的字符串",match2.string)    #输出3.10python i study，即match方法中的string 
print("匹配的数据",match2.group())    #输出3.10

findall()方法的结果是列表，如果没有匹配项则结果为空列表

import re
pattern=r"\d\.\d+"
s="i study python 3.10 every day python2.1 i love u"
s2="4.10python i study"
s3="i study python every day"
match=re.search(pattern.s)    #3.10
match2=re.search(pattern.s2)    #4.10
match3=re.search(pattern.s3)    #None

lst1=re.findall(pattern,s)    #["3.10","2.1"]
lst2=re.findall(pattern,s2)    #["4.10"]
lst3=re.findall(pattern,s3)    #[]是空列表

re.sub()方法的结果是 字符串， re.split方法的结果是列表

import re
pattern="黑客|破解|反爬"
s="我想学python,像破解VIP视频，python无限反爬"
new_s=re.sub(pattern,"***",s)    #替换后的结果是字符串
print(new_s)    #s中符合pattern的替换为***

s2="https://www.baidu/s?wd=cij&ie=utf-8&tn=baidu"
pattern2="[?|&]"
lst=re.split(pattern2,s2)
print(lst)    #结果是列表

部分实战代码

#车牌归属地
lst=["京A446262","粤C562394"，"津B123965"]
for item in lst:
    s=item[0:1]
    print(item,"归属地",s)

#统计字符串中出现指定字符的次数,只能统计字符不能统计字符串
s="Hellopython,Hellojava,hellophp"
word=input("要统计的字符")
print("{0}在{1}中出现的次数{2}".format(word,s,s.upper().count(word))

#格式化输出商品的名称和单价
lst=[
    ["01","电风扇","美的",500],
    ["02","洗衣机","TCL",1000],
    ["03","微波炉","老板",400]
]
print("编号\t\t名称\t\t品牌\t\t价格")
for item in lst:
    for i in item:
        print(i,end="\t\t")
    print()
#对列表内容格式化输出
for item in lst:
    item[0]="000"+item[0]
    item[3]="${:.2f}".format(item[3])

#正则表达式提取有效数据
import re
s="akjfbakjsfkx cjefsdncskdjnjnskdjf"#总之就是从网上复制来的一大串字符，会包含网站信息

pattern="https://img\d{1}.baidu.com/it/u=\d*,\d*&fm=\d*&fmt=auto"
lst=re.findall(pattern,s)  #结果是列表
for item in lst:
    print(item)