Task 2: Writing Regular Expressions with ChatGPT

Orpheus2333

Last modified 2023-08-24 22:43:59

Tags: chatgpt, artificial intelligence

First published 2023-08-24 21:45:38

Article link: https://blog.csdn.net/Orpheus2333/article/details/132483161

Task description

Task statement: write and use regular expressions in ChatGPT to perform text matching and pattern extraction.
Text to match:

Enron Dataset: Over half a million anonymized emails from over 100 users. It’s one of the few publically available collections of “real” emails available for study and training sets.

Google Blogger Corpus: Nearly 700,000 blog posts from blogger.com. The meat of the blogs contain commonly occurring English words, at least 200 of them in each entry.

SMS Spam Collection: Excellent dataset focused on spam. Nearly 6000 messages tagged as legitimate or spam messages with a useful subset extracted directly from Grumbletext.

Recommender Systems Datasets: Datasets from a variety of sources, including fitness tracking, video games, song data, and social media. Labels include star ratings, time stamps, social networks, and images.

Project Gutenberg: Extensive collection of book texts. These are public domain and available in a variety of languages, spanning a long period of time.

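Practice steps:

Write a prompt that has ChatGPT produce a regular expression matching words whose first letter is capitalized.
Write a prompt that has ChatGPT produce a regular expression matching capitalized words with fewer than 10 characters.
Write a prompt that has ChatGPT produce a regular expression matching words that end with a punctuation mark.
Take screenshots of the experiments above and verify the regular expressions ChatGPT produced with Python code.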
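As a rough illustration of what steps 1 and 2 are asking for, here is a minimal sketch run on two sentences adapted from the text above. It assumes "fewer than 10 characters" means at most 9 characters in total, so the quantifier after the leading capital allows at most 8 further letters. The names `capitalized` and `short_capitalized` are chosen for this example only and are not necessarily what ChatGPT returns.

```python
import re

# Two sentences adapted from the text to be matched.
sample = (
    "SMS Spam Collection: Excellent dataset focused on spam. "
    "Project Gutenberg: Extensive collection of book texts."
)

# Step 1: one capital letter followed by any number of letters.
capitalized = re.compile(r"\b[A-Z][a-zA-Z]*\b")

# Step 2 (assumption: "fewer than 10" = at most 9 characters in total):
# one capital letter plus at most 8 further letters.
short_capitalized = re.compile(r"\b[A-Z][a-zA-Z]{0,8}\b")

all_caps = capitalized.findall(sample)
short_caps = short_capitalized.findall(sample)

print(all_caps)
print(short_caps)
```

"Collection" has exactly 10 letters, so it shows up in the first list but not in the second, while 9-letter words such as "Gutenberg" appear in both.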

ChatGPT Prompt

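1. Have ChatGPT write a regular expression that matches words whose first letter is capitalized.

2. Have ChatGPT write a regular expression that matches capitalized words with fewer than 10 characters.

3. Have ChatGPT write a regular expression that matches words ending with a punctuation mark.

Prompt: Write a regular expression that matches words ending with a punctuation mark. English has the following 15 punctuation marks: the period (.), question mark (?), exclamation point (!), comma (,), semicolon (;), colon (:), dash (—), hyphen (-), parentheses (()), brackets ([]), braces ({}), apostrophe ('), quotation marks (''), double quotation marks (""), and ellipses (...). Please add escape characters to the result so that no punctuation mark fails to parse correctly.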
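The escaping that the prompt asks for can also be done programmatically. The sketch below is one possible approach, not ChatGPT's output: it builds a character class from the single-character marks in the prompt using `re.escape`, and handles the three-character ellipsis as a separate alternative. The names `PUNCTUATION`, `char_class`, and `pattern` are made up for this example.

```python
import re

# Single-character marks listed in the prompt. The ellipsis is three
# characters long, so it is handled as its own alternative below.
PUNCTUATION = ".?!,;:—-()[]{}'\""

# re.escape adds a backslash in front of characters that are special in a
# regular expression (for example ".", "-", "[" and "]").
char_class = "[" + "".join(re.escape(ch) for ch in PUNCTUATION) + "]"

# A word followed by either an ellipsis or a single punctuation mark.
pattern = re.compile(r"\b\w+(?:\.\.\.|" + char_class + ")")

sample = "Wait... really? Yes (sure), fine."
matches = pattern.findall(sample)
print(matches)  # ['Wait...', 'really?', 'sure)', 'fine.']
```

Note that this simple version would also match a hyphen or apostrophe in the middle of a word (as in "well-known"), which is one reason a trailing condition on the following character can be useful.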

Verification with Python code

Source code:

import re

# Read the file contents (UTF-8, since the text contains curly quotes)
def read_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
    return content

# Regex that matches capitalized words; returns the matching words
def find_capitalized_words(text):
    pattern = r'\b[A-Z][a-zA-Z]*\b'
    capitalized_words = re.findall(pattern, text)
    return capitalized_words

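# Regex that matches capitalized words with fewer than 10 characters; returns the matching words
def find_short_capitalized_words(text):
    # One capital letter plus at most 8 more letters = at most 9 characters
    pattern = r'\b[A-Z][a-zA-Z]{0,8}\b'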
    short_capitalized_words = re.findall(pattern, text)
    return short_capitalized_words

# Regex that matches words ending with a punctuation mark; returns the words with that trailing mark stripped
def find_words_with_punctuation(text):
    pattern = r'\b\w+[.,?!;:—\(\)\[\]{}\'\"“”‘’"](?=[a-zA-Z]|\s|$)'
    words_with_punctuation = re.findall(pattern, text)
    words_without_punctuation = [word[:-1] for word in words_with_punctuation]
    return words_without_punctuation

# Main function
def main():
    # file_path = 'your_file.txt'  # replace with your own file path
    file_path = '/home/aistudio/data/input.txt'
    content = read_file(file_path)
    
    capitalized_words = find_capitalized_words(content)
    print("Capitalized words:", capitalized_words)
    
    short_capitalized_words = find_short_capitalized_words(content)
    print("Short capitalized words:", short_capitalized_words)
    
    words_with_punctuation = find_words_with_punctuation(content)
    print("Words with punctuation:", words_with_punctuation)

if __name__ == "__main__":
    main()

Verification screenshot:

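For task 3, the regular expression that ChatGPT produced matched incorrectly; after manual tuning, the following regular expression was obtained: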

\b\w+[.,?!;:—\[\]{}\'\"“”‘’"](?=[a-zA-Z]|\s|$)

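(?=[a-zA-Z]|\s|$): this is a positive lookahead (a zero-width assertion, not a non-capturing group). It requires the next character to be a letter (upper or lower case), a whitespace character (\s), or the end of the string ($), without including that character in the match.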
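To see what the trailing `(?=...)` condition does in practice, here is a small sketch that applies the tuned expression to a made-up sample string. The sample and the variable names are illustrative assumptions and are not part of the original experiment.

```python
import re

# The manually tuned expression from above.
pattern = re.compile(r'\b\w+[.,?!;:—\[\]{}\'\"“”‘’"](?=[a-zA-Z]|\s|$)')

# Hypothetical sample: two ordinary words, a parenthesized word,
# a decimal number, and a word at the very end of the string.
sample = "emails. spam, (legit) 3.14 end!"

found = pattern.findall(sample)
print(found)  # ['emails.', 'spam,', 'end!']

# The character checked by (?=...) is not part of the match:
first = pattern.search(sample)
print(repr(first.group()))        # 'emails.'
print(repr(sample[first.end()]))  # ' ' (the space is left unconsumed)
```

"3." is rejected because the character after the period is a digit, and "legit)" is rejected because the tuned character class no longer contains parentheses.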

Output after running on the Baidu PaddlePaddle platform: