Deep Learning Task 2 - Text Pre-processing, Language Models, and Recurrent Neural Network Basics

Latest recommended article published 2020-02-18 16:56:18

Can_9420

Category: Deep Learning

Original link: https://www.boyuai.com/elites/course/cZu18YmweLv10OeV/video/5SqCT0gzwEjcnnlqbUvTq

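Text pre-processing

Clean the text (remove line breaks, numbers, currency symbols, hyperlinks, HTML tags, abbreviations, extra whitespace, and punctuation)
Split the text into words (tokenize)
Get each word's part of speech (tag) and lemmatize it (lemmatizer)
Remove stop words and short words (e.g. len < 3)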

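import re
import string

from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

# Assumes the required NLTK data (tokenizer, POS tagger, wordnet, stopwords)
# has already been fetched with nltk.download(); English stop words are assumed.
stop_words = set(stopwords.words('english'))

def text_clean(text):
    # replace line breaks with spaces
    text = text.replace('\n', ' ')
    # lowercase
    text = text.lower()
    # remove hyperlinks first, while the URLs are still intact;
    # \S+ stops at the first whitespace instead of running to the last slash
    text = re.sub(r'https?://\S+', ' ', text)
    # remove html tags
    text = re.sub(r'</?\w+[^>]*>', ' ', text)
    # remove $-prefixed entities
    text = re.sub(r'\$\w*', ' ', text)
    # remove numbers
    text = re.sub(r'\d+\.?\d*', ' ', text)
    # remove punctuation (escaped so that ], \ and ^ are treated literally)
    text = re.sub(r'[{}]+'.format(re.escape(string.punctuation)), ' ', text)
    # remove abbreviations (one- and two-character words)
    text = re.sub(r'\b\w{1,2}\b', ' ', text)
    # collapse extra whitespace last, after all the substitutions above
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def get_wordnet_pos(tag):
    # map a Penn Treebank POS tag to the matching WordNet POS constant
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

def filter_words(tokens, stop_words):
    """
    Drop NLTK stop words and words shorter than 3 characters.
    """
    return [w for w in tokens if w not in stop_words and len(w) >= 3]

def ntlkLemmatizer(review):
    # tokenize, POS-tag, lemmatize, then filter stop words and short words
    tokens = word_tokenize(review)
    tags = pos_tag(tokens)
    wnl = WordNetLemmatizer()
    result = []
    for word, tag in tags:
        # fall back to NOUN when the tag has no WordNet equivalent
        wordnet_pos = get_wordnet_pos(tag) or wordnet.NOUN
        result.append(wnl.lemmatize(word, pos=wordnet_pos))
    result = filter_words(result, stop_words)
    return ' '.join(result)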
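
One pitfall worth checking in URL-stripping patterns: once line breaks have been replaced by spaces, the whole text is a single line, so a greedy `.*` can run all the way to the last slash in the text. The sketch below compares a greedy pattern with a whitespace-bounded one using only the standard `re` module. The sample sentence is made up for illustration.

```python
import re

# Made-up sample text containing a URL and a later, unrelated slash.
text = "see https://example.com/a/b for details and/or ask me"

# Greedy pattern: `.*` backtracks only as far as the LAST slash on the line,
# so everything between the URL and "and/or" is deleted as well.
greedy = re.sub(r'https?://.*/\w*', ' ', text)

# Whitespace-bounded pattern: the match ends where the URL ends.
bounded = re.sub(r'https?://\S+', ' ', text)

print(repr(greedy))
print(repr(bounded))
```

The greedy version silently drops the words "for details", while the bounded version removes only the URL.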

Language models

n-grams
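As the sequence length grows, the cost of computing and storing the probability of many words occurring together grows exponentially. N-grams simplify the model through the Markov assumption: the occurrence of a word depends only on the previous n words, i.e. a Markov chain of order n. With n = 1, for example, P(w3 | w1, w2) = P(w3 | w2). Based on a Markov chain of order n - 1, the language model can be rewritten as:

$$P(w_1, w_2, \ldots, w_T) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-(n-1)}, \ldots, w_{t-1})$$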

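For n = 1, 2, and 3, the model is called a unigram, bigram, and trigram model respectively. For example, the probability of a sequence of length 4 under the unigram, bigram, and trigram models is, respectively:

$$P(w_1, w_2, w_3, w_4) = P(w_1) P(w_2) P(w_3) P(w_4)$$
$$P(w_1, w_2, w_3, w_4) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_2) P(w_4 \mid w_3)$$
$$P(w_1, w_2, w_3, w_4) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1, w_2) P(w_4 \mid w_2, w_3)$$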
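
To make the factorization concrete, here is a small sketch that lists the conditional terms for a sequence of length T under an n-gram model, where each word is conditioned on at most the n - 1 words before it. The function name `ngram_factors` is made up for this illustration.

```python
def ngram_factors(T, n):
    """List the terms of P(w1, ..., wT) under an n-gram model.

    Each word wt is conditioned on at most the n - 1 preceding words.
    """
    factors = []
    for t in range(1, T + 1):
        # The context starts n - 1 words back, but never before w1.
        start = max(1, t - (n - 1))
        context = [f"w{i}" for i in range(start, t)]
        if context:
            factors.append(f"P(w{t} | {', '.join(context)})")
        else:
            factors.append(f"P(w{t})")
    return factors

for n in (1, 2, 3):
    print(n, " ".join(ngram_factors(4, n)))
# 1 P(w1) P(w2) P(w3) P(w4)
# 2 P(w1) P(w2 | w1) P(w3 | w2) P(w4 | w3)
# 3 P(w1) P(w2 | w1) P(w3 | w1, w2) P(w4 | w2, w3)
```
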
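When n is small, an n-gram model is often inaccurate. For example, a unigram model assigns the same probability to the three-word sentences "你走先" (you, go, first) and "你先走" (you, first, go), because they contain the same words and only the order differs. When n is large, however, an n-gram model has to compute and store a large number of word frequencies and adjacent multi-word frequencies.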
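
The word-order point can be checked on a toy example. The sketch below estimates unigram and bigram probabilities from raw counts on a made-up, pre-tokenized English corpus. As a simplification, the bigram estimate divides by the unigram count of the previous word. The corpus and the function names are assumptions for illustration only.

```python
from collections import Counter
from fractions import Fraction

# Made-up toy corpus, already tokenized.
corpus = [
    ["you", "go", "first"],
    ["you", "go", "first"],
    ["i", "go", "first"],
    ["you", "first"],
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))
total = sum(unigrams.values())

def unigram_prob(sent):
    # Product of P(wt); the order of the words cannot matter.
    p = Fraction(1)
    for w in sent:
        p *= Fraction(unigrams[w], total)
    return p

def bigram_prob(sent):
    # P(w1) * product of P(wt | w_{t-1}), estimated as
    # count(w_{t-1}, wt) / count(w_{t-1}).
    p = Fraction(unigrams[sent[0]], total)
    for prev, cur in zip(sent, sent[1:]):
        p *= Fraction(bigrams[(prev, cur)], unigrams[prev])
    return p

a = ["you", "go", "first"]
b = ["you", "first", "go"]
print(unigram_prob(a), unigram_prob(b))  # 36/1331 36/1331
print(bigram_prob(a), bigram_prob(b))    # 2/11 0
```

The unigram model cannot tell the two orderings apart, while the bigram model gives the unseen ordering probability 0, which already hints at the sparsity problem.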
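Zipf's law: in a natural-language corpus, the frequency of a word is inversely proportional to its rank in the frequency table. Most words have a very low frequency, and some never appear in the corpus at all, so their probability estimates are very unreliable. In English, moreover, the highest-frequency words are usually stop words. An n-gram model therefore suffers from data sparsity: most of the estimated parameters end up being 0.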
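
A rough way to see both effects is to count words and adjacent word pairs directly. The sketch below does this on a tiny made-up sentence. A corpus this small cannot demonstrate Zipf's law itself, but it does show a stop word at the top of the ranking and most possible bigrams never being observed.

```python
from collections import Counter

# Tiny made-up corpus, for illustration only.
text = "the cat sat on the mat and the dog sat on the log"
tokens = text.split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

# Rank-frequency table: one stop word dominates, most words appear once.
for rank, (word, freq) in enumerate(unigrams.most_common(), start=1):
    print(rank, word, freq)

# Sparsity: compare observed distinct bigrams with all possible word pairs.
vocab_size = len(unigrams)
possible = vocab_size ** 2
seen = len(bigrams)
print(f"{seen} of {possible} possible bigrams observed")
# 10 of 64 possible bigrams observed
```

Every unobserved pair gets a count-based probability of 0, which is the sparsity problem described above.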

Recurrent neural networks

I don't fully understand this part yet; to be filled in later.
