CS50P week7 regular expressions

Article link: https://blog.csdn.net/weixin_49350481/article/details/136660838

Lecture 7 - CS50’s Introduction to Programming with Python (harvard.edu)

Notes

Regular expressions

The re library

validate

import re

email = input("What's your email? ").strip()

if re.search(r"^\w+@\w+\.(com|edu|gov|net|org)$", email):
    print("Valid")
else:
    print("Invalid")

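Note:

In ^.+@.+\.edu$, the leading ^ matches the start of the string and the trailing $ matches the end; both are written inside the quoted pattern string.
\. matches a literal dot; the backslash removes the special meaning of .
Do not confuse "^x" with [^x]: the former anchors x at the start, the latter matches any single character other than x.
In a regular expression, + matches the preceding element one or more times, and * matches it zero or more times.
\w matches any letter, digit, or underscore. With the re.ASCII flag it is equivalent to the character class [a-zA-Z0-9_]; without that flag, Python also matches Unicode letters and digits.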

    \d    decimal digit
    \D    not a decimal digit
    \s    whitespace character, equivalent to [ \t\n\r\f\v] for ASCII text
    \S    not a whitespace character
    \w    word character (letters, digits, and the underscore)
    \W    not a word character

    A|B     either A or B
    (...)   a group
    (?:...) non-capturing version
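
The shorthand classes and the two kinds of group above can be tried out with a short sketch. The sample strings here are made up purely for illustration and are not from the lecture.

```python
import re

# Hypothetical sample text, chosen only to exercise each shorthand class.
text = "Room 42, floor_3\tEast"

digits = re.findall(r"\d+", text)   # runs of decimal digits
words = re.findall(r"\w+", text)    # runs of letters, digits, and underscores
spaces = re.findall(r"\s", text)    # individual whitespace characters

# findall() returns the captured group when the pattern has one,
# and the whole match when the group is non-capturing.
captured = re.findall(r"(cat|dog)s", "cats dogs")
uncaptured = re.findall(r"(?:cat|dog)s", "cats dogs")

print(digits)      # ['42', '3']
print(words)       # ['Room', '42', 'floor_3', 'East']
print(spaces)      # [' ', ' ', '\t']
print(captured)    # ['cat', 'dog']
print(uncaptured)  # ['cats', 'dogs']
```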

flag

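Flag	Meaning
`ASCII`, `A`	Makes escapes such as `\w`, `\b`, `\s`, and `\d` match only ASCII characters with the respective property.
`DOTALL`, `S`	Makes `.` match any character, including newlines.
`IGNORECASE`, `I`	Performs case-insensitive matching.
`LOCALE`, `L`	Performs locale-aware matching.
`MULTILINE`, `M`	Multi-line matching, affecting `^` and `$`.
`VERBOSE`, `X` (for 'extended')	Enables verbose regular expressions, which can be laid out more clearly and are easier to understand.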
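
A minimal sketch of how some of these flags change matching. The sample strings and variable names are assumptions made for the example; the flags themselves are the ones in the table.

```python
import re

# IGNORECASE: letters match regardless of case.
upper = re.search(r"\.edu$", "MALAN@HARVARD.EDU", re.IGNORECASE)

# MULTILINE: ^ and $ also match at the start and end of each line.
lines = "first\nsecond\nthird"
starts_default = re.findall(r"^\w+", lines)
starts_multiline = re.findall(r"^\w+", lines, re.MULTILINE)

# DOTALL: . also matches the newline character.
without_dotall = re.search(r"a.b", "a\nb")
with_dotall = re.search(r"a.b", "a\nb", re.DOTALL)

# VERBOSE: whitespace and # comments inside the pattern are ignored,
# so a long pattern can be split over several annotated lines.
# Flags are combined with the | operator.
email_pattern = re.compile(
    r"""
    ^\w+        # user name
    @
    (\w+\.)?    # optional subdomain
    \w+\.edu$   # domain ending in .edu
    """,
    re.VERBOSE | re.IGNORECASE,
)

print(starts_default)    # ['first']
print(starts_multiline)  # ['first', 'second', 'third']
print(bool(email_pattern.search("Malan@CS50.Harvard.EDU")))  # True
```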

Accepting `malan@cs50.harvard.edu` as well

import re

email = input("What's your email? ").strip()

if re.search(r"^\w+@(\w+\.)?\w+\.edu$", email, re.IGNORECASE):
    print("Valid")
else:
    print("Invalid")

Note:

In (\w+\.)?, the ? means that the expression in parentheses may appear once or not at all.
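
To see what the optional group accepts and rejects, the same pattern can be run against a few addresses. The helper name is_valid and the sample addresses other than malan@cs50.harvard.edu are assumptions made for this sketch.

```python
import re

PATTERN = r"^\w+@(\w+\.)?\w+\.edu$"


def is_valid(email):
    """Return True if email matches the pattern, ignoring case."""
    return re.search(PATTERN, email, re.IGNORECASE) is not None


samples = [
    "malan@harvard.edu",       # no subdomain: the optional group matches nothing
    "malan@cs50.harvard.edu",  # one subdomain: the optional group matches "cs50."
    "malan@a.b.harvard.edu",   # two subdomains: ? allows at most one
    "malan@@harvard.edu",      # @ is not a word character
    "malan@.edu",              # the domain name needs at least one character
]

for sample in samples:
    print(sample, is_valid(sample))
```

Only the first two addresses are accepted. Allowing any number of subdomains would take (\w+\.)* instead of (\w+\.)?.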

Cleaning Up User Input

import re

name = input("What's your name? ").strip()
matches = re.search(r"^(.+), *(.+)$", name)
if matches:
    name = matches.group(2) + " " + matches.group(1)
print(f"hello, {name}")

This is equivalent to

import re

name = input("What's your name? ").strip()
if matches := re.search(r"^(.+), *(.+)$", name):
    name = matches.group(2) + " " + matches.group(1)
print(f"hello, {name}")

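Note:

:= is the walrus operator (an assignment expression, available since Python 3.8). It assigns a value to a variable as part of a larger expression, which makes it handy in if statements, while loops, comprehensions, and conditional expressions. It saves a line of code and avoids repeating a call, but it does not otherwise make the program faster.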
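
Beyond the if statement shown above, here is a small sketch of the walrus operator in a comprehension and in a while loop. The data and variable names are invented for the example.

```python
import re

# In a comprehension: call re.search() once per item, keep only the matches,
# and reuse the match object m in the output expression.
lines = ["malan@harvard.edu", "not an email", "carter@harvard.edu"]
users = [m.group(1) for line in lines if (m := re.search(r"^(\w+)@", line))]

# In a while loop: take two items at a time until the slice comes back empty.
queue = [1, 2, 3, 4, 5]
chunks = []
while chunk := queue[:2]:
    chunks.append(chunk)
    queue = queue[2:]

print(users)   # ['malan', 'carter']
print(chunks)  # [[1, 2], [3, 4], [5]]
```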

Extracting User Input

replace

url = input("URL: ").strip()

username = url.replace("https://twitter.com/", "")
print(f"Username: {username}")

Note:

str.replace(a, b) returns a copy of the string in which every occurrence of the substring a is replaced with b.
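
A short sketch of where str.replace falls short for this task, which is what motivates switching to a regular expression. The username some_user is a hypothetical placeholder.

```python
prefix = "https://twitter.com/"

# The exact prefix is removed as expected.
exact = "https://twitter.com/some_user".replace(prefix, "")

# A different scheme or a www. subdomain does not contain the prefix,
# so the string comes back unchanged.
http = "http://twitter.com/some_user".replace(prefix, "")
www = "https://www.twitter.com/some_user".replace(prefix, "")

# replace() is not anchored to the start of the string either:
# the prefix is removed wherever it appears.
embedded = "My URL is https://twitter.com/some_user".replace(prefix, "")

print(exact)     # some_user
print(http)      # http://twitter.com/some_user
print(www)       # https://www.twitter.com/some_user
print(embedded)  # My URL is some_user
```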

re.sub

import re

url = input("URL: ").strip()

username = re.sub(r"^(https?://)?(www\.)?twitter\.com/", "", url)
print(f"Username: {username}")

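Note:

re.sub(a, b, c) replaces every match of the pattern a in the string c with b.
https? means that the s is optional.
The same result can be obtained with re.search():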

import re

url = input("URL: ").strip()

if matches := re.search(r"^https?://(?:www\.)?twitter\.com/([a-z0-9_]+)", url, re.IGNORECASE):
    print("Username:", matches.group(1))

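Note:

The walrus operator is described above.
- Groups written with ( and ) also capture the start and end indexes of the text they match;
- Groups are numbered from 0. Group 0 always exists and stands for the whole match; subgroups are numbered from 1, from left to right by their opening parenthesis
- To group without capturing, use (?:...)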
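
The group numbering rules can be checked with a small sketch. The URL and the extra capturing group around https? are assumptions added for illustration; the lecture's pattern captures only the username.

```python
import re

url = "https://www.twitter.com/some_user"  # hypothetical input
m = re.search(
    r"^(https?)://(?:www\.)?twitter\.com/([a-z0-9_]+)", url, re.IGNORECASE
)

whole = m.group(0)     # group 0 is the entire match
scheme = m.group(1)    # first capturing group
username = m.group(2)  # second capturing group; (?:www\.) is not numbered
both = m.groups()      # all capturing groups, starting from group 1

# span(n) gives the start and end indexes of group n within the string.
start, end = m.span(2)

print(whole)            # https://www.twitter.com/some_user
print(scheme)           # https
print(username)         # some_user
print(both)             # ('https', 'some_user')
print(url[start:end])   # some_user
```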

Problem set

numb3rs

numb3rs.py

# Validate an IPv4 address
import re


def main():
    print(validate(input("IPv4 Address: ")))


def validate(ip):
    # Three octets each followed by a dot, then a fourth octet;
    # each octet is 0-99, 100-199, 200-249, or 250-255
    if re.search(r'^(([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])\.){3}([1-9]?\d|1\d\d|2[0-4]\d|25[0-5])$', ip):
        return True
    else:
        return False


if __name__ == "__main__":
    main()

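Key points:

Matching 0-255 with a regular expression: matching works character by character, not on numeric value, so the range has to be split into four parts: 0-99, 100-199, 200-249, and 250-255.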
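
One way to convince yourself that the four alternatives cover exactly 0-255 is to test every candidate number. This is a sketch, not the problem set's required solution; the names OCTET, IPV4, and is_octet are assumptions.

```python
import re

# One octet: 0-99, 100-199, 200-249, or 250-255.
OCTET = r"(?:[1-9]?\d|1\d\d|2[0-4]\d|25[0-5])"

# Four octets separated by dots; {{3}} becomes {3} inside the f-string.
IPV4 = re.compile(rf"{OCTET}(?:\.{OCTET}){{3}}")


def is_octet(s):
    """Return True if the whole string s is a valid octet."""
    return re.fullmatch(OCTET, s) is not None


# Every number from 0 to 999 that the octet pattern accepts.
accepted = [n for n in range(1000) if is_octet(str(n))]

print(accepted == list(range(256)))                # True
print(is_octet("01"))                              # False: leading zero
print(bool(IPV4.fullmatch("255.255.255.255")))     # True
print(bool(IPV4.fullmatch("1.2.3.1000")))          # False
```

re.fullmatch() requires the pattern to cover the entire string, which plays the same role as wrapping the pattern in ^ and $.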

test_numb3rs.py

from numb3rs import validate

def test_validate():
    assert validate("127.0.0.1") == True
    assert validate("255.255.255.255") == True
    assert validate("512.512.512.512") == False
    assert validate("1.2.3.1000") == False
    assert validate("cat") == False

watch.py

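import re


def main():
    print(parse(input("HTML: ")))


def parse(s):
    # Capture the video ID after /embed/ in a quoted URL; IDs may contain letters, digits, _ and -
    if matches := re.search(r"\"https?://(?:www\.)?youtube\.com/embed/([a-z0-9_-]+)", s, re.IGNORECASE):
        return f"https://youtu.be/{matches.group(1)}"
    # No embedded YouTube URL was found
    return None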

if __name__ == "__main__":
    main()

Key points:

Ignoring case: re.IGNORECASE
group(): group(1) returns the captured video ID

working

working.py

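import re


def main():
    print(convert(input("Hours: ")))


def convert(s):
    # Hour 1-12, optional ":MM" with minutes 00-59, then AM or PM, on both sides of " to ";
    # ^ and $ anchor the pattern so that extra characters are rejected
    if matches := re.search(r"^(1[0-2]|[1-9])(?::([0-5]\d))? (AM|PM) to (1[0-2]|[1-9])(?::([0-5]\d))? (AM|PM)$", s):
        # Note: arr[] starts at arr[0], while group() starts at group(1)
        arr = list(matches.groups())
        # A missing minutes group is None, which int() cannot convert, so default it to 0 first
        if arr[1] is None:
            arr[1] = 0
        if arr[4] is None:
            arr[4] = 0
        arr[0], arr[1], arr[3], arr[4] = map(int, [arr[0], arr[1], arr[3], arr[4]])

        # 12 AM is 00 and 12 PM is 12, so reduce the hour modulo 12 before adding 12 for PM
        arr[0] %= 12
        arr[3] %= 12
        if arr[2] == "PM":
            arr[0] += 12
        if arr[5] == "PM":
            arr[3] += 12

        return f"{arr[0]:02}:{arr[1]:02} to {arr[3]:02}:{arr[4]:02}"

    else:
        raise ValueError


if __name__ == "__main__":
    main()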

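Key points:

Converting 12-hour time to 24-hour time
- The groups of a match object cannot be assigned to, so a separate list is defined to receive and modify the values
- matches.group() numbering starts at 1; arr[] starts at 0
Converting strings to int:
- map() is not applied to the raw groups, because a missing optional group is None and int(None) raises a TypeError that mentions 'NoneType'
Output format:
- f"{arr[0]:02}" pads a single digit with a leading 0 to make it two digits
- Minutes 00-59 are matched by [0-5]\d: a tens digit from 0 to 5 followed by any digit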
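
The hour conversion and the :02 format can be isolated in a small helper. This is a sketch under the usual 12-hour convention (12 AM is 00 and 12 PM is 12); the function names to_24_hour and format_time are assumptions, not part of the problem set.

```python
def to_24_hour(hour, meridiem):
    """Convert a 12-hour clock hour (1-12) plus "AM" or "PM" to 0-23."""
    hour %= 12              # 12 becomes 0; 1-11 are unchanged
    if meridiem == "PM":
        hour += 12          # afternoon hours are shifted by 12
    return hour


def format_time(hour, minute):
    """Format as HH:MM, padding single digits with a leading zero."""
    return f"{hour:02}:{minute:02}"


print(format_time(to_24_hour(9, "AM"), 0))    # 09:00
print(format_time(to_24_hour(5, "PM"), 30))   # 17:30
print(format_time(to_24_hour(12, "AM"), 0))   # 00:00
print(format_time(to_24_hour(12, "PM"), 0))   # 12:00
```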

test_working.py

from working import convert
import pytest

def test_convert():
    assert convert("9 AM to 5 PM")=="09:00 to 17:00"
    assert convert("9:00 AM to 5:00 PM")=="09:00 to 17:00"
    assert convert("10 PM to 8 AM")=="22:00 to 08:00"
    assert convert("10:30 PM to 8:50 AM")=="22:30 to 08:50"
    with pytest.raises(ValueError):
        convert("9:60 AM to 5:60 PM")
    with pytest.raises(ValueError):
        convert("9 AM - 5 PM")
    with pytest.raises(ValueError):
        convert("09:00 AM - 17:00 PM")

um

um.py

import re


def main():
    print(count(input("Text: ")))


def count(s):
    # findall() returns a list of every standalone "um"; an empty list has length 0
    return len(re.findall(r"\b(um)\b", s, re.IGNORECASE))

if __name__ == "__main__":
    main()

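Key points:

re.findall(): finds every match of the pattern and returns them as a list
- The table below compares re.match, re.search, and the related functions
- If no match is found, match() and search() return None. If they succeed, a match object is returned

Method / Attribute	Purpose
`match()`	Determines whether the RE matches at the beginning of the string.
`search()`	Scans through the string, looking for any location where the RE matches.
`findall()`	Finds all substrings where the RE matches and returns them as a list.
`finditer()`	Finds all substrings where the RE matches and returns them as an iterator.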
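\b: word boundary
- Matches only at the beginning or end of a word; the end of a word is marked by whitespace or a non-alphanumeric character;
- Matches when "um" is a complete word, and does not match when it is contained in another word
- re.findall(r"\b(um)\b",s,re.IGNORECASE) matches every "um" that stands alone as a word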
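
A quick sketch of how \b separates standalone words from substrings. The helper name count_um and the sample sentences are assumptions made for the example.

```python
import re


def count_um(s):
    """Count occurrences of "um" as a whole word, ignoring case."""
    return len(re.findall(r"\bum\b", s, re.IGNORECASE))


print(count_um("um"))                         # 1
print(count_um("Um, thanks for the album."))  # 1: the "um" in "album" is not counted
print(count_um("um, hello, um?"))             # 2: punctuation counts as a boundary
print(count_um("yummy"))                      # 0
```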

response

import validators

if validators.email(input("What's your email? ").strip()):
    print("Valid")
else:
    print("Invalid")

Key points:

validators
- Install it with pip install validators
