A One-Article Introduction to Python String Handling, Regular Expressions, and the NLTK Toolkit

一只楚楚猫

Article link: https://blog.csdn.net/julac/article/details/127374101

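Python String Handling

Stripping whitespace or special characters

input_str='今天天气不错，风和日丽'
# strip() with no arguments removes leading and trailing whitespace;
# this string has none, so it is printed unchanged
print(input_str.strip())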
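
To make the effect of strip() visible, here is a minimal sketch using a hypothetical padded string (the name `padded` and its contents are assumptions, not part of the original example). It also shows lstrip(), rstrip(), and passing a set of characters to strip.

```python
# Hypothetical sample string with padding on both sides.
padded = "  **hello, world**  \n"

print(repr(padded.strip()))    # whitespace removed from both ends
print(repr(padded.lstrip()))   # leading whitespace only
print(repr(padded.rstrip()))   # trailing whitespace only

# The argument is treated as a set of characters: every leading or trailing
# character found in the set is removed, not the argument as a substring.
print(repr(padded.strip(" *\n")))
```

Note that strip() never touches characters in the middle of the string.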

Replacing

print(input_str.replace("今天","昨天"))

Searching

print(input_str.index('今天'))
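
index() raises an exception when the substring is missing, which is easy to overlook. The sketch below contrasts it with find() and the `in` operator; the variable name `text` and the absent substring "明天" are chosen only for illustration.

```python
text = "今天天气不错，风和日丽"

print(text.index("天气"))   # position of the first occurrence
print(text.find("明天"))    # find() returns -1 when the substring is absent

try:
    text.index("明天")      # index() raises ValueError instead
except ValueError as err:
    print(f"not found: {err}")

print("天气" in text)       # membership test when the position is not needed
```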

Checking character classes

print(input_str.isalpha())

print(input_str.isdigit())
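
Both calls above print False, because the full-width comma is neither a letter nor a digit. A small sketch with a few hypothetical sample strings shows how these predicates behave on other inputs:

```python
# Hypothetical samples covering letters, digits, mixed content, and the empty string.
samples = ["abc", "今天", "123", "abc123", "今天，", "3.14", ""]

# Map each sample to (isalpha, isdigit).
results = {s: (s.isalpha(), s.isdigit()) for s in samples}

for s, (alpha, digit) in results.items():
    print(repr(s), "isalpha:", alpha, "isdigit:", digit)
```

Chinese characters count as alphabetic, punctuation makes both checks fail, and the empty string returns False for both.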

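Splitting and joining

# split() returns a list of the pieces between the separators
input_str=input_str.split("，")
print(input_str) # ['今天天气不错', '风和日丽']

# join() concatenates the list items with the given separator
input_str=" ".join(input_str)
print(input_str) # 今天天气不错 风和日丽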
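
As a variation, the sketch below keeps the original string and the pieces in separate variables and limits the number of splits. The three-clause `sentence` is a made-up sample.

```python
sentence = "今天天气不错，风和日丽，适合出门"   # hypothetical three-clause sample

parts = sentence.split("，")            # split on every full-width comma
first, rest = sentence.split("，", 1)   # maxsplit=1: split only once
rejoined = "，".join(parts)             # joining with the same separator restores the text

print(parts)
print(first, "|", rest)
print(rejoined == sentence)
```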

Regular Expressions

re.compile

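Compiles a regular expression pattern into a regular expression object, which can then be used for matching through its match(), search(), and other methods.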
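
A compiled pattern can be called directly instead of being passed to the module-level functions. A minimal sketch, with sample strings chosen only for illustration:

```python
import re

digits = re.compile(r"\d+")   # compile once, reuse many times

# The compiled object offers the same operations as the module-level functions.
first = digits.search("abc123def")     # first run of digits anywhere in the string
at_start = digits.match("abc123def")   # None: the string does not start with a digit
all_runs = digits.findall("a1b22c333") # every run of digits

print(first.group(), at_start, all_runs)
```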

re.search

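Scans the string for the first position where the pattern produces a match and returns the corresponding match object, or None if no position matches.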

import re

string = '123'
pattern = re.compile(r'\d+') # greedy: match as many digits as possible

search=re.search(pattern,string).group() # group() returns the matched string
print(search) # 123

pattern = re.compile(r'\d+?') # non-greedy: match as few digits as possible
search=re.search(pattern,string).group()
print(search) # 1
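
The difference between greedy and non-greedy matching matters most when a delimiter appears more than once. A small sketch on a hypothetical HTML-like string:

```python
import re

html = "<b>bold</b> and <i>italic</i>"   # hypothetical sample input

greedy = re.search(r"<.+>", html).group()   # runs on to the last '>' in the string
lazy = re.search(r"<.+?>", html).group()    # stops at the first '>'

print(greedy)
print(lazy)
```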

re.match

If zero or more characters at the beginning of the string match the regular expression pattern, returns the corresponding match object.

import re

string='a123'
pattern=re.compile(r'\w+')

# Like search(), but the match must start at the beginning of the string
match=re.match(pattern,string).group()
print(match) # a123

re.fullmatch

If the whole string matches the regular expression pattern, returns the corresponding match object.

import re

string='a123'
pattern=re.compile(r'\w+')

full_match=re.fullmatch(pattern,string).group()
print(full_match) # a123
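
With 'a123' both match() and fullmatch() succeed, so the two look identical. The sketch below uses a hypothetical string with a trailing '!' to show where they differ:

```python
import re

word = re.compile(r"\w+")
s = "a123!"   # the trailing '!' is not a word character

prefix = word.match(s)       # succeeds: a matching prefix is enough
whole = word.fullmatch(s)    # None: the entire string must match

print(prefix.group(), whole)
```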

re.split

Finds the substrings of the target string that match the given pattern and uses them as separators to split the string into a list.
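
Two options of re.split are worth knowing alongside the basic call: a capturing group keeps the separators in the result, and maxsplit limits the number of splits. A minimal sketch (the variable names are illustrative):

```python
import re

s = "自然语言处理123深度学习456机器学习"

plain = re.split(r"\d+", s)                # separators are dropped
with_seps = re.split(r"(\d+)", s)          # a capturing group keeps the separators
once = re.split(r"\d+", s, maxsplit=1)     # split at most once

print(plain)
print(with_seps)
print(once)
```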

import re

string = "自然语言处理123深度学习456机器学习"
pattern = re.compile(r'\d+')
split = re.split(pattern, string)
print(split) # ['自然语言处理', '深度学习', '机器学习']

"(?P<name>...)" named groups

import re

string = "自然语言处理123深度学习456机器学习"
pattern = re.compile(r'(?P<digital>\d+)(?P<NoDigital>\D+)')
search = re.search(pattern, string).group("digital")
print(search)
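
Named groups can also be read all at once with groupdict(), and combined with finditer() to walk through every match. A short sketch reusing the same pattern:

```python
import re

s = "自然语言处理123深度学习456机器学习"
pattern = re.compile(r"(?P<digital>\d+)(?P<NoDigital>\D+)")

m = pattern.search(s)
print(m.groupdict())       # all named groups of the first match as a dict
print(m.span("digital"))   # start and end offsets of one named group

# Collect (digits, non-digits) pairs for every match in the string.
pairs = [(hit.group("digital"), hit.group("NoDigital")) for hit in pattern.finditer(s)]
print(pairs)
```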

re.findall

Returns all non-overlapping matches of pattern in string as a list of strings.

import re

string='this is a beautiful place'
pattern=re.compile('a')

findall=re.findall(pattern,string)
print(findall) # ['a', 'a', 'a']

re.finditer

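Returns an iterator that yields a match object for each non-overlapping match (for background on iterators, see the article 一篇文章入门python基础).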

import re

string='this is a beautiful place'
pattern=re.compile('[ab]')

finditer=re.finditer(pattern,string)
# iterate over the match objects and print each matched string
for i in finditer:
    print(i.group())
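
Unlike findall(), finditer() yields match objects, so the position of each match is available too. A small sketch that records both the text and the start offset:

```python
import re

s = "this is a beautiful place"

# Each item is (matched text, start offset in s).
hits = [(m.group(), m.start()) for m in re.finditer(r"[ab]", s)]
print(hits)
```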

re.sub

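re.sub(pattern, repl, string, count=0, flags=0)

Returns the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string with repl. If the pattern is not found, the string is returned unchanged. repl can be a string or a function; count is the maximum number of matches to replace, and the default of 0 replaces all of them.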

import re

string = '123'
pattern = re.compile(r'\d')
sub = re.sub(pattern, "数字", string)
print(sub) # 数字数字数字
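
The count argument and a function-valued repl are not exercised by the example above. A minimal sketch of both, plus a backreference; the helper name `double` and the sample strings are assumptions made for illustration:

```python
import re

s = "a1b22c333"

# count=1 replaces only the first match.
first_only = re.sub(r"\d+", "#", s, count=1)

def double(match):
    # repl may be a function: it receives the match object
    # and returns the replacement text.
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, s)

# In a string repl, \1 and \2 refer to the captured groups.
swapped = re.sub(r"(\w+)@(\w+)", r"\2@\1", "user@host")

print(first_only, doubled, swapped)
```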

re.subn

re.subn(pattern, repl, string, count=0, flags=0)

Performs the same operation as sub(), but returns a tuple (new_string, number_of_substitutions).

import re

string = '123'
pattern = re.compile(r'\d')
subn = re.subn(pattern, "digital", string)
print(subn) # ('digitaldigitaldigital', 3)

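search() vs. match()

re.match() matches only at the beginning of the string, while re.search() looks for a match anywhere in the string.

import re

re.match("c", "abcdef") # no match: returns None
re.search("c", "abcdef") # match

A pattern that begins with '^' restricts search() to matching at the beginning of the string.

re.match("c", "abcdef") # no match
re.search("^c", "abcdef") # no match
re.search("^a", "abcdef") # match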
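
Because a failed match returns None, calling .group() on the result directly raises AttributeError. One common way to guard against that is sketched below; the helper name `first_number` is hypothetical.

```python
import re

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    m = re.search(r"\d+", text)
    return m.group() if m else None

print(first_number("abc123"))   # a match object was found
print(first_number("abcdef"))   # no digits, so None is returned
```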

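The NLTK Toolkit

import nltk

# opens the interactive NLTK downloader for corpora and models
nltk.download()

from nltk.tokenize import word_tokenize
from nltk.text import Text

input_str = "Today's weather is good, very windy and sunny, we have no classes in the afternoon."
# word_tokenize relies on tokenizer data fetched with the downloader
tokens = word_tokenize(input_str)

# Create a Text object
text=Text(tokens)

# plot the frequencies of the 8 most common tokens (requires matplotlib)
text.plot(8)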
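
For readers without NLTK data or a plotting backend, the numbers behind such a frequency plot can be approximated with the standard library. This is only a rough stand-in: the regex tokenizer below is an assumption and splits differently from word_tokenize (for instance, it keeps "today's" as one token and drops punctuation).

```python
from collections import Counter
import re

input_str = "Today's weather is good, very windy and sunny, we have no classes in the afternoon."

# Crude stand-in tokenizer (assumption): lowercase, then runs of letters and apostrophes.
tokens = re.findall(r"[a-z']+", input_str.lower())

counts = Counter(tokens)
top8 = counts.most_common(8)   # the 8 most frequent tokens with their counts

for token, n in top8:
    print(token, n)
```

In this short sentence every token occurs once, so the ranking is not very informative; on a longer text the most common tokens stand out clearly.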

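Stop words

Stop words are words that contribute little to the meaning of a sentence. Take "今天打篮球" (play basketball today) and "今天看电影" (watch a movie today): "打篮球" and "看电影" carry the meaning, while "今天" (today) matters much less, so "今天" can be treated as a stop word.

The nltk.corpus module includes stop word lists.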

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# words() returns the stop words as a list of strings
# (raw() returns one long string, which breaks the set intersection below)
stopwords_list=stopwords.words("english")
print(stopwords_list)

input_str="Today's weather is good, very windy and sunny, we have no classes in the afternoon."
tokens = word_tokenize(input_str)
tokens_lower=[token.lower() for token in tokens]
tokens_lower_set=set(tokens_lower)

# Find the stop words that occur in the sentence
filters=tokens_lower_set.intersection(stopwords_list)

# Filter them out, iterating over the list so the word order is preserved
tokens_filter=[token for token in tokens_lower if token not in filters]

print(tokens_filter)
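
The filtering step itself does not depend on NLTK. The sketch below isolates it using a tiny hand-written stop word set and a pre-tokenized sentence; both are assumptions for illustration, and NLTK's English list is much longer.

```python
# Tiny hand-written stop word set (assumption, not NLTK's list).
STOP_WORDS = {"is", "and", "we", "have", "no", "in", "the", "very"}

# Pre-tokenized, lowercased sentence (written out by hand for this sketch).
tokens = ["today", "'s", "weather", "is", "good", ",", "very", "windy",
          "and", "sunny", ",", "we", "have", "no", "classes", "in",
          "the", "afternoon", "."]

# Keep alphabetic tokens that are not stop words, preserving their order.
kept = [t for t in tokens if t not in STOP_WORDS and t.isalpha()]
print(kept)
```

Using a set for the stop words keeps each membership test cheap, and filtering the list rather than a set of tokens preserves order and duplicates.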

Part-of-speech tagging

import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize

input_str="Today's weather is good, very windy and sunny, we have no classes in the afternoon."
tokens = word_tokenize(input_str)
tags=pos_tag(tokens) # list of (token, tag) pairs
print(tags)

# Chunking
sentence=[('the','DT'),('little','JJ'),('yellow','JJ'),('dog','NN'),('died',"VBD")]
# an optional determiner, any number of adjectives, then a noun
grammar="MY_NP:{<DT>?<JJ>*<NN>}"
cp=nltk.RegexpParser(grammar) # build a chunk parser from the grammar
result=cp.parse(sentence)
print(result)

result.draw() # opens a window showing the parse tree
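
To build intuition for what a tag pattern such as <DT>?<JJ>*<NN> selects, the idea can be imitated with the standard re module by matching over the sequence of tags. This is an illustrative sketch only, not NLTK's implementation, and the names `tag_string` and `np_pattern` are made up here.

```python
import re

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
            ("dog", "NN"), ("died", "VBD")]

# Encode the tag sequence as one string: "<DT><JJ><JJ><NN><VBD>".
tag_string = "".join(f"<{tag}>" for _, tag in sentence)

# Plain-regex version of the tag pattern <DT>?<JJ>*<NN>.
np_pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")

chunks = []
for m in np_pattern.finditer(tag_string):
    # Convert character offsets back to token indices by counting '<'.
    start = tag_string[:m.start()].count("<")
    end = start + m.group().count("<")
    chunks.append([word for word, _ in sentence[start:end]])

print(chunks)
```

The single chunk found covers "the little yellow dog", while "died" (VBD) stays outside it.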

Named entity recognition

Pipeline: an article -> sentences -> words -> a part-of-speech tag for each word -> named entity recognition

# -*- coding: utf-8 -*-
# @Time    : 2022/10/17 18:48
# @Author  : 楚楚
# @File    : 06命名实体识别.py
# @Software: PyCharm
from nltk import ne_chunk
from nltk import pos_tag
from nltk.tokenize import word_tokenize

sentence = "Edison went to Tsinghua University today."

# Tokenize
tokens = word_tokenize(sentence)

# Part-of-speech tagging
tags = pos_tag(tokens)

# Named entity recognition
identifier = ne_chunk(tags)
print(identifier)

Data cleaning

# -*- coding: utf-8 -*-
# @Time    : 2022/10/17 19:06
# @Author  : 楚楚
# @File    : 07数据清洗.py
# @Software: PyCharm
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

string = "  Edison RT @amila #test Co went to &amp; http://www.baidu.com/ Tsinghua University today."

# Stop words
cache_english = stopwords.words("english")


def clean(text):
    print(f"raw data: {text}")

    # Remove HTML entities (&amp;), hashtags (#test), and mentions (@amila)
    text_no_special_entity = re.sub(r"&\w*;|#\w*|@\w*", ' ', text)
    print(f"after removing entities, hashtags, and mentions: {text_no_special_entity}")

    # Remove ticker-style symbols: a $ followed by word characters
    text_no_tickers = re.sub(r'\$\w*', ' ', text_no_special_entity)
    print(f"after removing tickers: {text_no_tickers}")

    # Remove hyperlinks; \S+ stops at the first whitespace after the URL
    text_no_hyperlinks = re.sub(r'https?://\S+', '', text_no_tickers)
    print(f"after removing hyperlinks: {text_no_hyperlinks}")

    # Remove short abbreviations: two-letter words starting with a capital letter (RT, Co)
    text_no_small_word = re.sub(r'\b[A-Z][a-zA-Z]\b', '', text_no_hyperlinks)
    print(f"after removing short abbreviations: {text_no_small_word}")

    # Collapse runs of whitespace and trim both ends
    text_no_whitespace = re.sub(r'\s\s+', " ", text_no_small_word)
    text_no_whitespace = text_no_whitespace.strip()
    print(f"after removing extra whitespace: {text_no_whitespace}")

    # Tokenize
    tokens = word_tokenize(text_no_whitespace)

    # Remove stop words
    tokens_set = set(tokens)
    cache_english_list = tokens_set.intersection(cache_english)

    filters = [token for token in tokens if token not in cache_english_list]
    print(f"after removing stop words: {filters}")

    filters_text = " ".join(filters)
    print(f"after filtering: {filters_text}")

    return filters_text


# use a separate name so the clean() function is not overwritten
cleaned = clean(string)
print(cleaned)
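
One detail worth checking in any URL-stripping step is greediness. The sketch below compares a pattern that ends in a greedy `.*/` with one based on `\S+`, using a hypothetical line that contains two links:

```python
import re

text = "see http://a.com/x and later http://b.com/y for details"   # hypothetical input

# '.*' is greedy, so the match extends to the last '/' on the line
# and swallows the words between the two URLs.
greedy = re.sub(r"https?://.*/", "", text)

# '\S+' stops at the first whitespace, so each URL is removed on its own.
per_url = re.sub(r"https?://\S+", "", text)

print(repr(greedy))
print(repr(per_url))
```

The second form leaves doubled spaces behind, which a later whitespace-collapsing step can clean up.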