Python Data Analysis (Analyzing Text Data and Social Media)

Latest recommended article published on 2023-06-05 19:49:06

wx1871428

Article link: https://blog.csdn.net/wx1871428/article/details/118675736

## 1. Installing NLTK

```shell
pip install nltk
```

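The installation is not finished yet: we also need to download the NLTK corpora, which is a very large download of about 1.8 GB. The download can be started directly from Python with the following code: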

```python
import nltk

# Start the NLTK downloader to fetch the corpora
nltk.download()
```

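With no arguments, `nltk.download()` opens an interactive downloader from which the NLTK corpora can be fetched directly.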
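The full collection is large, so it can be worth fetching only the corpora that the examples in this article read. The sketch below assumes that the package identifiers `stopwords` and `gutenberg` cover those corpora; `REQUIRED_PACKAGES` and `download_required` are hypothetical names introduced for illustration, and the model used by `nltk.pos_tag` is left out because its identifier depends on the NLTK version.

```python
# Hypothetical helper: fetch only the corpora used in this article
# instead of the full collection.
REQUIRED_PACKAGES = ("stopwords", "gutenberg")


def download_required(packages=REQUIRED_PACKAGES):
    """Download each named NLTK package and return the names that failed."""
    import nltk  # imported lazily so this module loads before NLTK is installed

    failed = []
    for name in packages:
        # nltk.download returns True on success and False otherwise
        if not nltk.download(name, quiet=True):
            failed.append(name)
    return failed
```

Calling `download_required()` once after installation should return an empty list when both downloads succeed.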

## 2. Filtering Out Stop Words, Names, and Numbers

In text analysis we often need to remove stop words (stopwords): words that are very common but carry little information.

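Code:

```python
import nltk

# Load NLTK's French stop-word list into a set for fast membership tests
sw = set(nltk.corpus.stopwords.words('french'))
# A set is unordered, so which seven words appear can vary
print("Stop words", list(sw)[:7])
```

Output:

```text
Stop words ['eûtes', 'êtes', 'aient', 'auraient', 'aurions', 'auras', 'serait']
```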

Note that all words in this corpus are lowercase.
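Because the stop-word list is stored in lowercase, each token has to be lowercased before it is compared against the list, otherwise capitalized stop words such as "The" slip through. A minimal sketch of that idea in plain Python follows; `SAMPLE_STOPWORDS` and `remove_stopwords` are hypothetical names, and the tiny word set only stands in for the NLTK list.

```python
# A tiny stand-in for an NLTK stop-word list (assumption: lowercase entries only)
SAMPLE_STOPWORDS = {"the", "by", "i", "of", "and"}


def remove_stopwords(tokens, stopwords=SAMPLE_STOPWORDS):
    """Return the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stopwords]


tokens = ["The", "Fall", "of", "Man", "by", "Milton"]
print(remove_stopwords(tokens))  # keeps only the words that carry content
```

Lowercasing only for the comparison keeps the original capitalization of the words that survive the filter.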

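NLTK also provides a Gutenberg corpus. Project Gutenberg is a digital library initiative that collects a large number of books whose copyright has expired, so that people can read them for free on the Internet. The following code loads the Gutenberg corpus and prints some of its file names: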

```python
import nltk

gb = nltk.corpus.gutenberg
# fileids() lists the files in the corpus; show the last five
print("Gutenberg files", gb.fileids()[-5:])
```

Output:

```text
Gutenberg files ['milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
```

Take the first two sentences from milton-paradise.txt and remove the stop words.

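Code:

```python
import nltk

gb = nltk.corpus.gutenberg
# The text is English, so use the English stop-word list here
sw = set(nltk.corpus.stopwords.words('english'))

text_sent = gb.sents("milton-paradise.txt")[:2]  # take the first two sentences
print("Unfiltered:", text_sent)

for sent in text_sent:  # remove stop words
    filtered = [w for w in sent if w.lower() not in sw]
    print("Filtered:", filtered)
```

Output:

```text
Unfiltered: [['[', 'Paradise', 'Lost', 'by', 'John', 'Milton', '1667', ']'], ['Book', 'I']]
Filtered: ['[', 'Paradise', 'Lost', 'John', 'Milton', '1667', ']']
Filtered: ['Book']
```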

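Compared with the unfiltered text, "by" and "I" have been removed because they appear in the stop-word corpus. Sometimes we also want to remove numbers and names from the text. This can be done by deleting words according to their part-of-speech tags: numbers correspond to the cardinal number tag (CD), and names correspond to the singular proper noun tag (NNP).

Code: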

```python
import nltk

sw = set(nltk.corpus.stopwords.words('english'))
print("Stop words", list(sw)[:7])
gb = nltk.corpus.gutenberg
print("Gutenberg files", gb.fileids()[-5:])

text_sent = gb.sents("milton-paradise.txt")[:2]  # take the first two sentences
print("Unfiltered:", text_sent)

for sent in text_sent:  # remove stop words
    filtered = [w for w in sent if w.lower() not in sw]
    print("Filtered:", filtered)

    tagged = nltk.pos_tag(filtered)  # tag each word with its part of speech
    print("Tagged:", tagged)

    # Keep only words that are neither proper nouns (NNP) nor numbers (CD)
    words = [word for word, tag in tagged if tag not in ('NNP', 'CD')]
    print(words)
```

Output:

```text
Stop words ['all', 'just', 'being', 'over', 'both', 'through', 'yourselves']
Gutenberg files ['milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
Unfiltered: [['[', 'Paradise', 'Lost', 'by', 'John', 'Milton', '1667', ']'], ['Book', 'I']]
Filtered: ['[', 'Paradise', 'Lost', 'John', 'Milton', '1667', ']']
Tagged: [('[', 'JJ'), ('Paradise', 'NNP'), ('Lost', 'NNP'), ('John', 'NNP'), ('Milton', 'NNP'), ('1667', 'CD'), (']', 'NN')]
['[', ']']
Filtered: ['Book']
Tagged: [('Book', 'NN')]
['Book']
```
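The tag-based filtering step can also be written as a small reusable function. The sketch below works on already tagged (word, tag) pairs, copied from the tagger output above, so that it runs without the tagger; `drop_tags` is a hypothetical helper name, not an NLTK function.

```python
def drop_tags(tagged_words, excluded=("NNP", "CD")):
    """Return the words whose part-of-speech tag is not in `excluded`."""
    return [word for word, tag in tagged_words if tag not in excluded]


# (word, tag) pairs taken from the tagger output shown above
tagged = [("[", "JJ"), ("Paradise", "NNP"), ("Lost", "NNP"), ("John", "NNP"),
          ("Milton", "NNP"), ("1667", "CD"), ("]", "NN")]

print(drop_tags(tagged))                    # names and numbers removed
print(drop_tags(tagged, excluded=("CD",)))  # only numbers removed
```

Passing the excluded tags as a parameter makes it easy to keep names while still dropping numbers, or the other way round.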

## 3. Bag-of-Words Model

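The bag-of-words model treats a document as an unordered collection of the words it contains: no order or sequence among the words is kept. For each word in the document we count how many times it occurs (the word count), and these counts can then be used for statistical analyses such as spam detection.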
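As a minimal sketch of the counting step, the standard library's `collections.Counter` can tally how often each word occurs. The sample sentence and the name `word_counts` are made up for illustration, and tokenization is simplified to splitting on whitespace.

```python
from collections import Counter


def word_counts(text):
    """Count lowercase, whitespace-separated words; word order is discarded."""
    return Counter(text.lower().split())


counts = word_counts("free offer click now free prize")
print(counts["free"])   # number of times "free" occurs
print(counts["hello"])  # a word that never occurs is reported as 0
```

Two texts that contain the same words in a different order produce equal counters, which is exactly the "no order" property of the model.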

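Using the counts of all words, we can build a feature vector for each document. If a word exists in the corpus but not in the document, the value of that feature is 0. NLTK does not include a utility for creating feature vectors, so we need the help of a Python machine…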
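To make the idea of a feature vector concrete, here is a small sketch in plain Python rather than with any particular library. It assumes whitespace tokenization; `build_vocabulary` and `to_feature_vector` are hypothetical names, and the two sample documents are invented.

```python
from collections import Counter


def build_vocabulary(documents):
    """Return the sorted list of distinct words across all documents."""
    return sorted({word for doc in documents for word in doc.lower().split()})


def to_feature_vector(document, vocabulary):
    """Return one count per vocabulary word; words absent from the document get 0."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


docs = ["win a free prize", "meeting notes for a project"]
vocab = build_vocabulary(docs)
print(vocab)
for doc in docs:
    print(to_feature_vector(doc, vocab))
```

Every vector has the same length as the vocabulary, so documents of different lengths become directly comparable.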
