Python---Web Crawling---Data Cleaning---NLTK

Latest recommended article published on 2024-07-08 16:14:38

agsddd

Category columns: Crawler Development, Crawler Development Journey

Article link: https://blog.csdn.net/weixin_41245276/article/details/88358394

Installing the corpora:

import nltk 
nltk.download()

Corpora bundled with NLTK:

>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial',
'fiction', 'government', 'hobbies', 'humor',
'learned', 'lore', 'mystery', 'news', 'religion',
'reviews', 'romance', 'science_fiction']
>>> len(brown.sents())
57340
>>> len(brown.words())
1161192

Tokenization (splitting a long sentence into "meaningful" pieces):

>>> import nltk
>>> sentence = "hello, world"
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['hello', ',', 'world']
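To make the idea concrete, here is a rough sketch of tokenization using only a regular expression. It is not how nltk.word_tokenize works internally (that tokenizer also handles contractions, abbreviations and more); simple_tokenize is a hypothetical helper name used only for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks.

    A rough approximation for illustration only; it is not the algorithm
    behind nltk.word_tokenize.
    """
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("hello, world"))
# ['hello', ',', 'world']
```

For this simple sentence the result matches the NLTK output above; on harder input (for example "don't") the two would likely differ.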

Word-form normalization:

Stemming: generally speaking, this means chopping off the inflectional endings that do not change a word's part of speech. walking minus "ing" = walk; walked minus "ed" = walk.

>>> from nltk.stem.porter import PorterStemmer
>>> porter_stemmer = PorterStemmer()
>>> porter_stemmer.stem('maximum')
'maximum'
>>> porter_stemmer.stem('presumably')
'presum'
>>> porter_stemmer.stem('multiply')
'multipli'
>>> porter_stemmer.stem('provision')
'provis'

>>> from nltk.stem.lancaster import LancasterStemmer
>>> lancaster_stemmer = LancasterStemmer()
>>> lancaster_stemmer.stem('maximum')
'maxim'
>>> lancaster_stemmer.stem('presumably')
'presum'

>>> from nltk.stem import SnowballStemmer
>>> snowball_stemmer = SnowballStemmer("english")
>>> snowball_stemmer.stem('maximum')
'maximum'
>>> snowball_stemmer.stem('presumably')
'presum'

>>> from nltk.stem.porter import PorterStemmer
>>> p = PorterStemmer()
>>> p.stem('went')
'went'
>>> p.stem('wenting')
'went'
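The "chop off the ending" idea can be sketched in a few lines of plain Python. This is a deliberately naive illustration, not the Porter, Lancaster or Snowball algorithm (those apply ordered rule sets with extra conditions); naive_stem and its suffix list are assumptions made up for this example.

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["walking", "walked", "walks", "went"]:
    print(w, "->", naive_stem(w))
# walking -> walk
# walked -> walk
# walks -> walk
# went -> went
```

Like the real stemmers above, this sketch leaves "went" unchanged: suffix stripping alone cannot relate an irregular form to "go".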

Lemmatization: mapping all the variant forms of a word to a single form. went becomes go; are becomes be.

>>> from nltk.stem import WordNetLemmatizer
>>> wordnet_lemmatizer = WordNetLemmatizer()
>>> wordnet_lemmatizer.lemmatize('dogs')
'dog'
>>> wordnet_lemmatizer.lemmatize('churches')
'church'
>>> wordnet_lemmatizer.lemmatize('aardwolves')
'aardwolf'
>>> wordnet_lemmatizer.lemmatize('abaci')
'abacus'
>>> wordnet_lemmatizer.lemmatize('hardrock')
'hardrock'

A small problem with lemmas: "went" can be a verb (the past tense of go) or a noun (the English name Went).

Getting better lemmas from NLTK:

# Without a POS tag, the default is NN (noun)
>>> wordnet_lemmatizer.lemmatize('are')
'are'
>>> wordnet_lemmatizer.lemmatize('is')
'is'
# With a POS tag
>>> wordnet_lemmatizer.lemmatize('is', pos='v')
'be'
>>> wordnet_lemmatizer.lemmatize('are', pos='v')
'be'

POS tagging with NLTK:

>>> import nltk
>>> text = nltk.word_tokenize('what does the fox say')
>>> text
['what', 'does', 'the', 'fox', 'say']
>>> nltk.pos_tag(text)
[('what', 'WDT'), ('does', 'VBZ'), ('the', 'DT'), ('fox', 'NNS'), ('say', 'VBP')]
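The tags returned by nltk.pos_tag are Penn Treebank tags, while the lemmatizer's pos argument takes single letters such as 'n' and 'v'. A common way to bridge the two is a small mapping on the first letter of the tag. The sketch below is one such simplification; penn_to_wordnet is a hypothetical helper name, and the tagged list is copied from the output above so that the example runs without NLTK data.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (as returned by nltk.pos_tag) to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default

tagged = [('what', 'WDT'), ('does', 'VBZ'), ('the', 'DT'), ('fox', 'NNS'), ('say', 'VBP')]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
# [('what', 'n'), ('does', 'v'), ('the', 'n'), ('fox', 'n'), ('say', 'v')]
```

Each resulting letter can then be passed as the pos argument of a WordNetLemmatizer's lemmatize method, so that verbs such as "is" and "are" are reduced to "be".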

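Removing stopwords with NLTK:

from nltk.corpus import stopwords
# First tokenize to get a word_list
# ...
# Then filter it; building the set once avoids reloading the list for every word
english_stopwords = set(stopwords.words('english'))
filtered_words = [word for word in word_list if word not in english_stopwords]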
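One detail that is easy to miss: NLTK's English stopword list is all lowercase, so capitalized tokens slip through unless they are lowercased before the comparison. The sketch below shows this with a tiny hand-written stopword set, which is an assumption made so the example runs without NLTK data; in practice the set would come from stopwords.words('english').

```python
# Hypothetical mini stopword set for illustration; with NLTK data installed,
# set(stopwords.words('english')) would take its place.
english_stopwords = {"the", "is", "a", "of", "and", "this"}

word_list = ["This", "is", "the", "story", "of", "a", "fox"]

# Compare lowercased tokens, because the stopword entries are lowercase.
filtered_words = [w for w in word_list if w.lower() not in english_stopwords]
print(filtered_words)
# ['story', 'fox']
```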

Simple sentiment analysis with NLTK:

sentiment_dictionary = {}
with open('data/AFINN-111.txt') as f:
    for line in f:
        word, score = line.split('\t')
        sentiment_dictionary[word] = int(score)
# With the score table stored in a dict, go through the whole sentence and add up the matching values
total_score = sum(sentiment_dictionary.get(word, 0) for word in words)  # the dict value if the word is present, otherwise 0
# That gives you a sentiment score
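The same lookup-and-sum idea can be tried without the AFINN file. The sketch below reads a few lines in the same tab-separated "word, score" layout from an in-memory string. The words and scores are made up for illustration and are not quoted from AFINN-111.txt; sentiment_score is a hypothetical helper name.

```python
import io

# Made-up sample lines in the tab-separated "word<TAB>score" layout;
# the scores are illustrative assumptions, not quoted from AFINN-111.txt.
sample_lexicon = io.StringIO("good\t3\nbad\t-3\nawesome\t4\n")

sentiment_dictionary = {}
for line in sample_lexicon:
    word, score = line.rstrip("\n").split("\t")
    sentiment_dictionary[word] = int(score)

def sentiment_score(words):
    """Sum the lexicon scores of the words; unknown words count as 0."""
    return sum(sentiment_dictionary.get(w.lower(), 0) for w in words)

print(sentiment_score(["this", "movie", "is", "good"]))            # 3
print(sentiment_score(["bad", "plot", "but", "awesome", "cast"]))  # 1
```

A plain sum like this ignores negation ("not good") and word order, which is why it only counts as a simple baseline.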

Frequency counts:

import nltk
from nltk import FreqDist
# Build a small corpus first
corpus = 'this is my sentence ' \
           'this is my life ' \
           'this is the day'
# Tokenize it quickly
# As mentioned above, any preprocessing
# can be done here as needed:
# stopwords, lemma, stemming, etc.
tokens = nltk.word_tokenize(corpus)
print(tokens)
# This gives the tokenized word list
# ['this', 'is', 'my', 'sentence',
# 'this', 'is', 'my', 'life', 'this', 'is', 'the', 'day']

# Use NLTK's FreqDist to count how often each word occurs
fdist = FreqDist(tokens)
# It behaves much like a dict
# Look up a word to see how many times it occurs in the whole text
print(fdist['is'])
# 3



# Now we can pull out the 50 most common words
standard_freq_vector = fdist.most_common(50)
size = len(standard_freq_vector)
print(standard_freq_vector)
# [('is', 3), ('this', 3), ('my', 2),
# ('the', 1), ('day', 1), ('sentence', 1),
# ('life', 1)]
# (the order among words with equal counts may differ between Python versions)


# Func: record the position of each word, in order of frequency
def position_lookup(v):
    res = {}
    counter = 0
    for word in v:
        res[word[0]] = counter
        counter += 1
    return res
# Record the positions of the standard words
standard_position_dict = position_lookup(standard_freq_vector)
print(standard_position_dict)
# This gives a position lookup table
# {'is': 0, 'the': 3, 'day': 4, 'this': 1,
# 'sentence': 5, 'my': 2, 'life': 6}


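# Now, suppose we have a new sentence:
sentence = 'this is cool'
# First create a vector of the same size as our standard vector
freq_vector = [0] * size
# Simple preprocessing
tokens = nltk.word_tokenize(sentence)
# For each word in the new sentence
for word in tokens:
    try:
        # If it appeared in our corpus,
        # add 1 at its "standard position"
        freq_vector[standard_position_dict[word]] += 1
    except KeyError:
        # If it is a new word,
        # skip it
        continue
print(freq_vector)
# [1, 1, 0, 0, 0, 0, 0]
# The first position stands for 'is', which appears once
# The second position stands for 'this', which appears once
# Everything after that is 0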
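The steps above can be condensed into two small functions. The sketch below uses collections.Counter from the standard library in place of FreqDist and str.split in place of nltk.word_tokenize, so that it runs without NLTK; both swaps, and the function names, are assumptions made for this example.

```python
from collections import Counter

def build_position_dict(tokens, top_n=50):
    """Map each of the top_n most frequent tokens to a vector index."""
    most_common = Counter(tokens).most_common(top_n)
    return {word: index for index, (word, _count) in enumerate(most_common)}

def to_freq_vector(tokens, position_dict):
    """Count occurrences of known words at their assigned positions."""
    vector = [0] * len(position_dict)
    for word in tokens:
        index = position_dict.get(word)
        if index is not None:  # words outside the vocabulary are skipped
            vector[index] += 1
    return vector

corpus_tokens = "this is my sentence this is my life this is the day".split()
positions = build_position_dict(corpus_tokens)
print(to_freq_vector("this is cool".split(), positions))
# [1, 1, 0, 0, 0, 0, 0]
```

Using dict.get instead of try/except keeps the loop short; the behaviour for unseen words such as "cool" is the same, they are simply ignored.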

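TF-IDF with NLTK:

TF: measures how frequently a term occurs in a document. TF(t) = (number of times t appears in the document) / (total number of terms in the document).

IDF: measures how important a term is. Some words occur very often yet contribute little to classification, function words for example. To balance this, we raise the weight of rare words and lower the weight of common ones.

IDF(t) = log(total number of documents / number of documents containing t).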
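The two formulas translate almost directly into plain Python. The sketch below is a minimal illustration of the definitions above, not NLTK's implementation; documents are represented as lists of tokens, the natural logarithm is used, and returning 0.0 for a term that appears in no document is a choice made here to avoid dividing by zero.

```python
import math

def tf(term, document):
    """TF(t) = occurrences of t in the document / total terms in the document."""
    return document.count(term) / len(document)

def idf(term, documents):
    """IDF(t) = log(total documents / documents containing t); 0.0 if t is unseen."""
    containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / containing) if containing else 0.0

def tf_idf(term, document, documents):
    return tf(term, document) * idf(term, documents)

# Each document is a list of tokens.
docs = [s.split() for s in ("this is sentence one",
                            "this is sentence two",
                            "this is sentence three")]

print(tf_idf("this", docs[0], docs))  # 0.0: "this" is in every document, so IDF = log(3/3) = 0
print(tf_idf("one", docs[0], docs))   # (1/4) * log(3/1), roughly 0.2747
```

The first result shows the balancing effect described above: a word that occurs in every document gets weight zero, however often it appears.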

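from nltk.text import TextCollection
# First, put all the documents into the TextCollection class.
# It handles the counting and the calculation; each document is passed
# in as a list of tokens so that terms are matched as whole words
corpus = TextCollection(['this is sentence one'.split(),
                         'this is sentence two'.split(),
                         'this is sentence three'.split()])
# tf_idf can then be computed directly
# (term: a term in the sentence, text: the sentence)
print(corpus.tf_idf('this', 'this is sentence four'.split()))
# 0.0, because 'this' appears in all three documents, so IDF = log(3/3) = 0
# Likewise, how do we get a standard-length vector to represent every sentence?
# For each new sentence
new_sentence = 'this is sentence five'.split()
# loop over every word in the vocabulary
# (standard_vocab is assumed to be a vocabulary list built beforehand):
for word in standard_vocab:
    print(corpus.tf_idf(word, new_sentence))
    # we end up with one huge vector (length = size of the whole vocabulary)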
