NLP Preprocessing

Week 1

1. Language Models: CBOW and Skip-gram

The CBOW model takes the context of a target word as input and predicts the target word itself. That is, given the word vectors of the context, it infers the center word.

an efficient method for learning high quality distributed vector

Take this phrase as an example: with a context window of 4 words on each side, "learning" is the center word, so the word vectors of the 8 surrounding words are the model's input, and the word vector for "learning" is the model's output.
CBOW uses a bag-of-words assumption, so the distance between each of these 8 words and the center word is ignored; every word counts equally as long as it falls within the context window.

In this CBOW example, the input is the 8 context word vectors and the output is a softmax probability distribution over the entire vocabulary; training drives the softmax probability of each sample's target word to be the largest. Correspondingly, the input layer takes the 8 word vectors, the output layer has one neuron per word in the vocabulary, and the hidden layer size can be chosen freely. A DNN is built on top of this and its parameters are optimized with backpropagation.

The Skip-gram model is designed the opposite way around from CBOW. In the example above, the input would be "learning" and the outputs the other 8 words. That is: the input is the target word, and the output is the 8 words ranked highest by softmax probability.
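Both architectures are implemented in gensim's Word2Vec class, where the sg parameter switches between them. Below is a minimal sketch, assuming gensim 4.x (where the embedding dimension parameter is called vector_size); the toy corpus is invented for illustration:

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (invented for this sketch).
sentences = [
    ['an', 'efficient', 'method', 'for', 'learning',
     'high', 'quality', 'distributed', 'vector'],
    ['word', 'vectors', 'capture', 'distributional', 'semantics'],
]

# sg=0 selects CBOW (predict the center word from its context);
# sg=1 selects Skip-gram (predict the context from the center word).
# window=4 matches the context size used in the example above.
cbow = Word2Vec(sentences, vector_size=100, window=4, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=100, window=4, min_count=1, sg=1)

print(cbow.wv['learning'][:5])                       # first few embedding dimensions
print(skipgram.wv.most_similar('learning', topn=3))  # nearest neighbors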

2. Exploring How TF-IDF Works

TF-IDF measures how important a word is to a single document within a collection or corpus, and is commonly used to extract features, i.e. keywords, from text. A word's importance grows in proportion to how often it appears in the document, but falls in proportion to how often it appears across the corpus. (In other words: the more times a word appears in the document at hand, the more important it is; but if the word also appears in many documents of the corpus, it becomes less important, because it is common to everyone and thus less distinctive.)

TF-IDF is computed as follows:

tfidf = tf * idf

Here tf is the term frequency, i.e. how frequently a word occurs within one document. For example, if a document has 100 words in total and a given word occurs 12 times, then tf = 12/100 = 0.12.
idf is the inverse document frequency, which reflects how distinctive a word is. If the corpus contains n documents and the word occurs in k of them, then
idf = log2(n/k).
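To make the arithmetic concrete, here is a minimal sketch; the values n = 3 and k = 2 are invented for illustration:

import math

tf = 12 / 100              # the word occurs 12 times in a 100-word document
idf = math.log2(3 / 2)     # corpus of n = 3 documents, the word occurs in k = 2
print(round(tf * idf, 4))  # tfidf = 0.12 * log2(1.5) ≈ 0.0702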

___Implementing it ourselves:
text1 ="""
Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. 
Unqualified, the word football is understood to refer to whichever form of football is the most popular 
in the regional context in which the word appears. Sports commonly called football in certain places 
include association football (known as soccer in some countries); gridiron football (specifically American 
football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union); 
and Gaelic football. These different variations of football are known as football codes.
"""

text2 = """
Basketball is a team sport in which two teams of five players, opposing one another on a rectangular court, 
compete with the primary objective of shooting a basketball (approximately 9.4 inches (24 cm) in diameter) 
through the defender's hoop (a basket 18 inches (46 cm) in diameter mounted 10 feet (3.048 m) high to a backboard 
at each end of the court) while preventing the opposing team from shooting through their own hoop. A field goal is 
worth two points, unless made from behind the three-point line, when it is worth three. After a foul, timed play stops 
and the player fouled or designated to shoot a technical foul is given one or more one-point free throws. The team with 
the most points at the end of the game wins, but if regulation play expires with the score tied, an additional period 
of play (overtime) is mandated.
"""

text3 = """
Volleyball, game played by two teams, usually of six players on a side, in which the players use their hands to bat a 
ball back and forth over a high net, trying to make the ball touch the court within the opponents’ playing area before 
it can be returned. To prevent this a player on the opposing team bats the ball up and toward a teammate before it touches 
the court surface—that teammate may then volley it back across the net or bat it to a third teammate who volleys it across 
the net. A team is allowed only three touches of the ball before it must be returned over the net.
"""
import math
import string

import nltk
from nltk.corpus import stopwords     # stop word list

# Text preprocessing: split the text into sentences, tokenize,
# and drop punctuation.
def get_tokens(text):
    text = text.replace('\n', '')
    sents = nltk.sent_tokenize(text)  # sentence segmentation
    tokens = []
    for sent in sents:
        for word in nltk.word_tokenize(sent):  # word tokenization
            if word not in string.punctuation: # drop punctuation
                tokens.append(word)
    return tokens

def get_words(text):
    tokens = get_tokens(text)
    # remove stop words
    filtered = [w for w in tokens if w not in stopwords.words('english')]
    return filtered

# tokenized, stop-word-filtered word lists for each document
count1, count2, count3 = get_words(text1), get_words(text2), get_words(text3)
countlist = [count1, count2, count3]

# print(countlist)

def tf(word, text):
    # term frequency: occurrences of `word` divided by the document length
    return text.count(word) / len(text)

def idf(word, countlist):
    # inverse document frequency: log2(n / k), where k is the number of
    # documents that contain `word`
    sum_ = sum(1 for text in countlist if word in text)
    return math.log2(len(countlist) / sum_)

def tfidf(word, text, countlist):
    return tf(word, text) * idf(word, countlist)

for i, doc in enumerate(countlist, 1):
    print(f"Top words in document {i}")
    scores = {word: tfidf(word, doc, countlist) for word in doc}
    sorted_scores = sorted(scores.items(), key=lambda x: x[1], reverse=True)

    for word, score in sorted_scores[:3]:
        print(f"    Word: {word}, TF-IDF: {round(score, 5)}")
        
___Using the gensim package:
import string

import nltk
from nltk.corpus import stopwords     # stop word list
from gensim import corpora, models


# Text preprocessing
# Split the text into sentences, tokenize, and remove punctuation.
def get_tokens(text):
    text = text.replace('\n', '')
    sents = nltk.sent_tokenize(text)  # sentence segmentation
    tokens = []
    for sent in sents:
        for word in nltk.word_tokenize(sent):  # word tokenization
            if word not in string.punctuation: # drop punctuation
                tokens.append(word)
    return tokens

print(get_tokens(text1))

# training with gensim's TfidfModel
def get_words(text):
    tokens = get_tokens(text)
    # remove stop words
    filtered = [w for w in tokens if w not in stopwords.words('english')]
    return filtered

# tokenized, stop-word-filtered word lists for each document
count1, count2, count3 = get_words(text1), get_words(text2), get_words(text3)
countlist = [count1, count2, count3]

# 1. Build the dictionary
dictionary = corpora.Dictionary(countlist)

# Invert the mapping to id -> word, so words can be looked up by id later.
new_dict = {v: k for k, v in dictionary.token2id.items()}

# 2. Build the corpus as bag-of-words vectors of (word id, count) pairs
corpus2 = [dictionary.doc2bow(count) for count in countlist]

# 3. Fit the TfidfModel on the corpus
tfidf2 = models.TfidfModel(corpus2)

# 4. Apply the model to the corpus to get each document's tf-idf vector
corpus_tfidf = tfidf2[corpus2]

# output
print("\nTraining by gensim Tfidf Model.......\n")
for i, doc in enumerate(corpus_tfidf, 1):
    print(f"Top words in document {i}")
    sorted_words = sorted(doc, key=lambda x: x[1], reverse=True)  # list of (word id, score)
    for num, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s" % (new_dict[num], round(score, 5)))

3. Lemmatization

Lemmatization, simply put, strips a word's affixes to recover its stem, i.e. it finds the word's base form.

For example, "cars" lemmatizes to "car", and "ate" lemmatizes to "eat".

Python's nltk module, via WordNet, provides a robust lemmatization function:

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
# lemmatize nouns
print(wnl.lemmatize('cars', 'n'))     # car
print(wnl.lemmatize('men', 'n'))      # man

# lemmatize verbs
print(wnl.lemmatize('running', 'v'))  # run
print(wnl.lemmatize('ate', 'v'))      # eat

# lemmatize adjectives
print(wnl.lemmatize('saddest', 'a'))  # sad
print(wnl.lemmatize('fancier', 'a'))  # fancy

In the code above, the wnl.lemmatize() method performs the lemmatization. Its first argument is the word and its second is the word's part of speech, such as noun 'n', verb 'v', or adjective 'a'. The part of speech must be correct, otherwise you will not get the expected base form.

___So how do we determine a word's part of speech within a sentence?

In NLP this is done with part-of-speech (POS) tagging. In nltk, nltk.pos_tag() returns each word's part of speech in a sentence, as in the following Python code:

import nltk

sentence = 'The brown fox is quick and he is jumping over the lazy dog'
tokens = nltk.word_tokenize(sentence)
tagged_sent = nltk.pos_tag(tokens)
print(tagged_sent)

[('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'), ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]

Tags starting with N are nouns, J adjectives, V verbs, and R adverbs.

___Now that we can tokenize a sentence and obtain each word's part of speech, combining the two with lemmatization completes the lemmatization pipeline.
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer


sentence = 'The brown fox is quick and he is jumping over the lazy dog'

# Map a Penn Treebank POS tag to the corresponding WordNet POS constant.
def get_wordtag(tag):
    if tag.startswith('J'):
        return  wordnet.ADJ
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return None
    
tokens = word_tokenize(sentence)
tagged_sent = pos_tag(tokens)

wnl = WordNetLemmatizer()
lemmas_sent = []
for word, tag in tagged_sent:
    wordtag = get_wordtag(tag) or wordnet.NOUN  # fall back to noun for unmapped tags
    lemmas_sent.append(wnl.lemmatize(word, wordtag))
print(tokens)
print(lemmas_sent)

['The', 'brown', 'fox', 'is', 'quick', 'and', 'he', 'is', 'jumping', 'over', 'the', 'lazy', 'dog']
['The', 'brown', 'fox', 'be', 'quick', 'and', 'he', 'be', 'jump', 'over', 'the', 'lazy', 'dog']

As you can see, we can now tokenize a sentence and then lemmatize each word according to its part of speech, producing the list of base forms.

4. Named Entity Recognition (NER)

The task of named entity recognition is to identify, in the text being processed, named entities of three broad categories (entities, times, and numbers) and seven subcategories (person names, organization names, place names, times, dates, currencies, and percentages).
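As a minimal sketch of NER in Python, nltk's built-in chunker can be used (this assumes the maxent_ne_chunker and words resources have been downloaded; the sentence is invented for illustration, and the chunker emits labels such as PERSON, ORGANIZATION, and GPE rather than the full seven subcategories above):

import nltk

sentence = 'Tim Cook is the CEO of Apple and lives in California'
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)

# ne_chunk groups POS-tagged tokens into named-entity subtrees.
tree = nltk.ne_chunk(tagged)
for subtree in tree:
    if isinstance(subtree, nltk.Tree):  # a named-entity chunk
        entity = ' '.join(word for word, tag in subtree.leaves())
        print(subtree.label(), entity)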