Python -- Sentiment Classification of Movie Review Text

weixin_40822389

Category: kaggle. Tags: python

Article link: https://blog.csdn.net/weixin_40822389/article/details/79429105


These notes record what I learned while studying Kaggle.

They draw on the following references:

1. http://www.cnblogs.com/lijingpeng/p/5787549.html

2. The book python机器学习及实战

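from sklearn.datasets import fetch_20newsgroups

# Download (or load from cache) all 20 Newsgroups posts.
news = fetch_20newsgroups(subset='all')
X, y = news.data, news.target

Check the length of X and of X[0]. Each element of X is a string, so len(X[0][0]) is always 1.

print(len(X), len(X[0]), len(X[0][0]))

import re

import nltk
from bs4 import BeautifulSoup

def news_to_sentences(news):
    """Split one post into sentences, each returned as a list of lowercase words."""
    # Strip any HTML tags and keep only the text content.
    news_text = BeautifulSoup(news, 'html.parser').get_text()
    # Punkt sentence tokenizer; the NLTK punkt data must already be installed.
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    raw_sentences = tokenizer.tokenize(news_text)
    sentences = []
    for sent in raw_sentences:
        # Lowercase, replace every non-letter with a space, then split into a word list.
        sentences.append(re.sub('[^a-zA-Z]', ' ', sent.lower().strip()).split())
    return sentences

sentences = []

for x in X:
    sentences += news_to_sentences(x)

from gensim.models import word2vec

num_features = 300    # dimensionality of the word vectors
min_word_count = 20   # ignore words seen fewer than this many times
num_workers = 2       # number of training threads
context = 5           # context window size
downsampling = 1e-3   # downsampling threshold for frequent words

# gensim >= 4.0 names the dimensionality parameter vector_size; releases before 4.0 call it size.
model = word2vec.Word2Vec(sentences, workers=num_workers,
                          vector_size=num_features, min_count=min_word_count,
                          window=context, sample=downsampling)

# L2-normalise the vectors in place to save memory; deprecated since gensim 4.0, where it can be omitted.
model.init_sims(replace=True)

# Query through model.wv, which works in both older and current gensim releases.
model.wv.most_similar('morning')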
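The `most_similar` query ranks vocabulary words by the cosine similarity between their vectors and the query word's vector. The following is a minimal sketch of that ranking in plain Python, not gensim's actual implementation. The helper names `cosine_similarity` and `rank_by_similarity` and the 3-dimensional `toy_vectors` are invented for illustration; a trained model would use 300-dimensional vectors learned from the corpus.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors, made up for this example only.
toy_vectors = {
    "morning": [0.9, 0.1, 0.0],
    "evening": [0.8, 0.2, 0.1],
    "night": [0.6, 0.4, 0.1],
    "computer": [0.0, 0.1, 0.9],
}

ranking = rank_by_similarity("morning", toy_vectors)
for other, score in ranking:
    print(other, round(score, 3))
```

Words whose vectors point in nearly the same direction as the query score close to 1, which is why related words such as times of day tend to appear at the top of the list for 'morning'.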

