Regular Expression Basics

mockingbirds

Category: python. Tags: regular expressions, python

Article link: https://blog.csdn.net/mockingbirds/article/details/72078407

Regular expressions are a prerequisite for writing Python web crawlers, so it pays to build a solid foundation first. Let's get started.

# -*- coding: utf-8 -*-

import re

line = "helloworld123"
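# ^ anchors the match at the start of the string
# . matches any single character (except a newline by default)
# * means the preceding item may repeat zero or more times
# $ anchors the match at the end of the string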

regexStr = "^h.*3$"

if re.match(regexStr, line):
    print("yes")  # prints yes
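
One detail worth seeing in action: re.match only ever tries the pattern at the very start of the string, so the ^ above is redundant there, while re.search scans forward and does need it. The snippet below is a minimal sketch of that difference; the variable names are made up for illustration.

```python
import re

line = "helloworld123"

# re.match tries the pattern only at position 0, so "^" is implied.
anchored = re.match(r"h.*3$", line)

# re.search scans forward through the string, so it finds "world" in the middle.
unanchored = re.search(r"world", line)

# With "^", re.search is pinned to the start and "world" no longer matches.
pinned = re.search(r"^world", line)

print(anchored is not None)    # True
print(unanchored is not None)  # True
print(pinned is None)          # True
```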

? makes a quantifier non-greedy

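First, let's look at what greedy matching is. By default, quantifiers such as * and + are greedy: each one consumes as many characters as it can and gives characters back only when the rest of the pattern would otherwise fail.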

# Extract the substring between two h characters
line = "heeeeeeeellohhaa"
regexStr = ".*(h.*h).*"  # the parentheses mark the group we want to extract
match_obj = re.match(regexStr, line)

if match_obj:
    print(match_obj.group(1))  # prints hh: the greedy leading .* consumes as much as it can

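As the output shows, the greedy leading .* pushes the captured group as far to the right as possible. Appending ? to a quantifier makes it non-greedy (lazy): it consumes as few characters as possible and stops at the first position where the rest of the pattern can match, whereas the greedy default extends to the last such position.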

line = "heeeeeeeellohhaa"
regexStr = ".*?(h.*?h).*"  # ? after a quantifier makes it non-greedy: it stops at the first possible match, not the last
match_obj = re.match(regexStr, line)
if match_obj:
    print(match_obj.group(1))  # prints heeeeeeeelloh
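
The same greedy versus non-greedy behavior is easier to see on tag-like text, which is also where it tends to bite when scraping pages. This is only a sketch: the sample string is hypothetical, and regular expressions are not a substitute for a real HTML parser.

```python
import re

# Hypothetical sample text, chosen only to make the difference visible.
html = "<b>one</b><b>two</b>"

# Greedy: .* runs to the end, then backs off to the LAST "</b>".
greedy = re.search(r"<b>(.*)</b>", html).group(1)

# Non-greedy: .*? grows one character at a time and stops at the FIRST "</b>".
lazy = re.search(r"<b>(.*?)</b>", html).group(1)

# findall with the non-greedy form returns each tag body separately.
all_lazy = re.findall(r"<b>(.*?)</b>", html)

print(greedy)    # one</b><b>two
print(lazy)      # one
print(all_lazy)  # ['one', 'two']
```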

+ means the preceding item occurs at least once

line = "heeeeeeeellohhhdhaa"
regexStr = ".*(h.+h).*"  # the parentheses mark the group we want to extract
match_obj = re.match(regexStr, line)
if match_obj:
    print(match_obj.group(1))  # prints hdh: the greedy leading .* leaves the group only the rightmost match
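
To isolate what + changes compared with *, the sketch below runs both quantifiers against a few tiny strings with re.fullmatch, so the leading and trailing .* do not get in the way. The sample strings and the results dictionary are made up for illustration.

```python
import re

samples = ["hh", "hdh", "h"]

# For each sample, record whether "h.*h" and "h.+h" match the whole string.
results = {}
for s in samples:
    star = re.fullmatch(r"h.*h", s) is not None  # zero or more characters in between
    plus = re.fullmatch(r"h.+h", s) is not None  # at least one character in between
    results[s] = (star, plus)
    print(s, star, plus)

# hh  True  False  (nothing between the two h, so + fails)
# hdh True  True
# h   False False  (there is no second h at all)
```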

{2} specifies how many times the preceding item repeats

line = "heeeeeeeellohhhdhaa"
#regexStr = ".*(h.{3}h).*"  # {3}: exactly 3 characters in between
#regexStr = ".*(h.{4,}h).*"  # {4,}: 4 or more characters in between
regexStr = ".*(h.{2,4}h).*"  # {2,4}: at least 2 and at most 4 characters in between
match_obj = re.match(regexStr, line)
if match_obj:
    print(match_obj.group(1))  # prints hhdh
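
To compare the three quantifier forms side by side, the sketch below runs each one against the same line. The loop and the results dictionary are only for illustration.

```python
import re

line = "heeeeeeeellohhhdhaa"

# Try each counted quantifier in the same surrounding pattern.
results = {}
for quantifier in ("{3}", "{4,}", "{2,4}"):
    pattern = ".*(h." + quantifier + "h).*"
    results[quantifier] = re.match(pattern, line).group(1)
    print(quantifier, results[quantifier])

# {3}   -> hhhdh  (exactly three characters between the two h)
# {4,}  -> everything from the first h up to the last h, the only place
#          where four or more characters fit between two h
# {2,4} -> hhdh   (two characters between the two h)
```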

| means alternation (or)

line = "helloworld123"
regexStr = "(helloworld123|hello)"
match_obj = re.match(regexStr, line)
if match_obj:
    print(match_obj.group(1))  # helloworld123
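
The order of the alternatives matters: Python tries them left to right and keeps the first one that lets the overall pattern succeed, not the longest one. A minimal sketch, with variable names invented for the example:

```python
import re

line = "helloworld123"

# The longer alternative comes first, so it wins.
long_first = re.match(r"(helloworld123|hello)", line).group(1)

# The shorter alternative comes first and already matches, so the engine stops there.
short_first = re.match(r"(hello|helloworld123)", line).group(1)

# With "$" after the group, "hello" fails at the end anchor and the engine
# backtracks to the second alternative.
anchored = re.match(r"(hello|helloworld123)$", line).group(1)

print(long_first)   # helloworld123
print(short_first)  # hello
print(anchored)     # helloworld123
```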

[] matches any one of the characters inside the brackets

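Note that most regex metacharacters (such as . * + ?) lose their special meaning inside []. The exceptions are ^ as the first character (negation), - between two characters (a range), ] and \.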

line = "18700987865"
regexStr = "(1[48357][0-9]{9})"
match_obj = re.match(regexStr, line)
if match_obj:
    print(match_obj.group(1))  # 18700987865
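
A few quick checks of how characters behave inside a character class. This is only a sketch; the sample strings are made up.

```python
import re

# "." and "*" are ordinary characters inside a class.
literals = re.findall(r"[.*]", "a.b*c")

# A leading "^" negates the class: anything that is NOT a digit.
non_digits = re.findall(r"[^0-9]", "a1b2")

# "-" between two characters defines a range ...
in_range = re.findall(r"[a-c]", "a-d")

# ... but at the end of the class it is a literal hyphen.
with_hyphen = re.findall(r"[ac-]", "a-d")

print(literals)     # ['.', '*']
print(non_digits)   # ['a', 'b']
print(in_range)     # ['a']
print(with_hyphen)  # ['a', '-']
```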

\s matches a single whitespace character (space, tab, newline, and so on)

\S matches a single non-whitespace character

line = "hello world"
regexStr = r"(hello\sworld)"  # raw string, so \s reaches the regex engine unchanged
match_obj = re.match(regexStr, line)
if match_obj:
    print(match_obj.group(1))  # hello world
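
Since \s covers more than the space character, here is a small sketch that tries a tab and a newline as well, and uses \S+ to split text into non-whitespace runs. The sample strings are hypothetical.

```python
import re

pattern = r"hello\sworld"

# Separator between the two words -> does the pattern match?
matches = {
    "space": re.match(pattern, "hello world") is not None,
    "tab": re.match(pattern, "hello\tworld") is not None,
    "newline": re.match(pattern, "hello\nworld") is not None,
    "underscore": re.match(pattern, "hello_world") is not None,
}
print(matches)
# {'space': True, 'tab': True, 'newline': True, 'underscore': False}

# \S+ grabs each run of non-whitespace characters.
tokens = re.findall(r"\S+", "hello \t world\n123")
print(tokens)  # ['hello', 'world', '123']
```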

[\u4E00-\u9FA5] matches a single Chinese character (a CJK unified ideograph in that range)

\d matches a digit

line = "xxx出生于1992年"
regexStr = r".*?(\d+)年"  # raw string, so \d reaches the regex engine unchanged
match_obj = re.match(regexStr, line)
if match_obj:
    print(match_obj.group(1))  # 1992
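
The [\u4E00-\u9FA5] class has no example of its own above, so here is a minimal sketch that pulls the Chinese runs and the digit runs out of the same line. It also shows one thing to be aware of in Python 3: on str patterns \d matches any Unicode decimal digit unless re.ASCII is passed. The fullwidth sample string is an assumption added for illustration.

```python
import re

line = "xxx出生于1992年"

# Runs of characters in the \u4E00-\u9FA5 range.
han = re.findall(r"[\u4E00-\u9FA5]+", line)

# Runs of digits.
digits = re.findall(r"\d+", line)

print(han)     # ['出生于', '年']
print(digits)  # ['1992']

# Fullwidth digits are Unicode decimal digits too, so \d matches them by default.
fullwidth = "１９９２"
unicode_digits = re.findall(r"\d+", fullwidth)
ascii_digits = re.findall(r"\d+", fullwidth, re.ASCII)

print(unicode_digits)  # ['１９９２']
print(ascii_digits)    # []
```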

Matching a date of birth

line = "xxx出生于1992年3月22日"
# line = "xxx出生于1992/3/22"
# line = "xxx出生于1992-3-22"
# line = "xxx出生于1992-03-22"
# line = "xxx出生于1992-03"
regexStr = r".*出生于(\d{4}[年/-]\d{1,2}([月/-]\d{1,2}|[月/-]$|$))"
match_obj = re.match(regexStr, line)
if match_obj:
    print(match_obj.group(1))  # prints 1992年3月22
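
Rather than uncommenting the sample lines one at a time, the sketch below runs the same pattern over all five of them. The list and the extracted dictionary are just scaffolding for the demonstration.

```python
import re

pattern = r".*出生于(\d{4}[年/-]\d{1,2}([月/-]\d{1,2}|[月/-]$|$))"

samples = [
    "xxx出生于1992年3月22日",
    "xxx出生于1992/3/22",
    "xxx出生于1992-3-22",
    "xxx出生于1992-03-22",
    "xxx出生于1992-03",
]

# Map each input line to the date captured by group 1 (None if nothing matched).
extracted = {}
for sample in samples:
    match_obj = re.match(pattern, sample)
    extracted[sample] = match_obj.group(1) if match_obj else None
    print(sample, "->", extracted[sample])
```

Note that the trailing 日 in the first sample is not part of the captured group, because the pattern stops after the day digits.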
