Machine Learning Study Notes (1)
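I am starting to learn machine learning. This series records what I learn along with the code, and is updated irregularly.
The exercise for this post:
Classify reviews as positive or negative.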
Feature Engineering
Converting the data into vectors is a critical step and also a difficult one; domain knowledge helps a great deal here.
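To make "converting data into vectors" concrete, here is a minimal bag-of-words sketch using only the standard library. The helper `to_count_vector`, the toy vocabulary, and the toy sentences are assumptions made up for illustration; they are not part of the exercise data.

```python
from collections import Counter

def to_count_vector(tokens, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokens)  # Missing words count as 0
    return [counts[word] for word in vocabulary]

# Hypothetical three-word vocabulary: each word is one feature
toy_vocabulary = ["good", "bad", "boring"]

print(to_count_vector(["a", "good", "good", "film"], toy_vocabulary))  # [2, 0, 0]
print(to_count_vector(["bad", "and", "boring"], toy_vocabulary))       # [0, 1, 1]
```

Every sentence becomes a vector of the same length, which is what a classifier needs as input.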
Import the required packages (numpy, nltk, sklearn, operator, requests):
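import numpy as np
import nltk
import sklearn
import sklearn.svm  # `import sklearn` alone does not reliably load the svm submodule
import operator
import requests
nltk.download('stopwords')  # If needed
nltk.download('punkt')      # If needed: tokenizer data for word_tokenize (newer NLTK releases may ask for 'punkt_tab')
nltk.download('wordnet')    # If needed: data for WordNetLemmatizer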
Load the positive and negative reviews:
url_pos = "http://josecamachocollados.com/rt-polarity.pos.txt"  # Contains all positive reviews, one review per line
url_neg = "http://josecamachocollados.com/rt-polarity.neg.txt"  # Contains all negative reviews, one review per line
# Load positive reviews
response_pos = requests.get(url_pos)
dataset_file_pos = response_pos.text.split("\n")
# Load negative reviews (from url_neg, not url_pos)
response_neg = requests.get(url_neg)
dataset_file_neg = response_neg.text.split("\n")
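Splitting on "\n" leaves an empty string after a trailing newline, and that empty string would later become an all-zero training example. A minimal sketch of one way to drop blank lines, shown on an in-memory string rather than the real download; `parse_reviews` and `sample_text` are hypothetical names introduced here.

```python
def parse_reviews(text):
    """Split raw text into one review per line, dropping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]

# Made-up sample with a trailing newline, like a typical text file
sample_text = "a charming film .\nflat and lifeless .\n"

print(sample_text.split("\n"))     # ['a charming film .', 'flat and lifeless .', '']
print(parse_reviews(sample_text))  # ['a charming film .', 'flat and lifeless .']
```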
Define a function that splits a sentence into lemmatized, lowercased tokens:
lemmatizer = nltk.stem.WordNetLemmatizer()
def transfer_word(sentence):  # Tokenizing function
    token_list = []
    sentencetoken = nltk.tokenize.word_tokenize(sentence)
    for token in sentencetoken:
        # Lowercase first so that capitalized words are lemmatized too
        token_list.append(lemmatizer.lemmatize(token.lower()))
    return token_list
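Define a function that, given the vocabulary, counts how often each vocabulary word occurs in a tokenized sentence and returns the resulting vector: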
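def get_vector(word_dic, tokens):  # Transfer data into vectors
    # word_dic: list of vocabulary words; tokens: list of tokens (not a raw string)
    word_vector = np.zeros(len(word_dic))
    wordTree = {}
    for word in tokens:  # Count every token in the sentence
        if word in wordTree:
            wordTree[word] += 1
        else:
            wordTree[word] = 1
    for i in range(len(word_dic)):  # Visit every feature, including the last one
        if word_dic[i] in wordTree:
            word_vector[i] = wordTree[word_dic[i]]
        else:
            word_vector[i] = 0
    return word_vector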
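For comparison, scikit-learn's `CountVectorizer` can produce the same kind of count vector when it is given a fixed vocabulary. This is a swapped-in alternative, not the approach used in this post: it applies its own tokenizer (lowercasing, words of two or more characters) and does no lemmatization. The toy vocabulary and sentences are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical vocabulary: column order follows this list
toy_vocabulary = ["good", "bad", "movie"]
toy_sentences = ["A good , good movie .", "bad movie"]

# With a fixed vocabulary, no fitting step is needed before transform
vectorizer = CountVectorizer(vocabulary=toy_vocabulary)
toy_vectors = vectorizer.transform(toy_sentences).toarray()
print(toy_vectors)
```

Each row is one sentence and each column is the count of one vocabulary word.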
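Define a function that counts every word in both review sets and keeps the x most frequent ones; each of these words is one feature, and together they form the vocabulary. Despite its name, train_svm_classifier only builds the vocabulary: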
def train_svm_classifier(dataset_file_pos, dataset_file_neg, x):
    # Tokenize the positive and negative reviews.
    # Join the lines with spaces; str(list) would add brackets and quotes to the text.
    wordlist_pos = transfer_word(" ".join(dataset_file_pos))
    wordlist_neg = transfer_word(" ".join(dataset_file_neg))
    # Stopwords and punctuation tokens to skip
    stopwords = set(nltk.corpus.stopwords.words('english'))
    stopwords.update([".", ",", "--", "``", "'", "[", "]"])
    # Count every remaining word across both sets
    dictionary = {}
    for word in wordlist_pos + wordlist_neg:
        if word in stopwords:
            continue
        if word not in dictionary:
            dictionary[word] = 1
        else:
            dictionary[word] += 1
    # Sort the (word, count) items by count (the second field), descending, and keep the top x
    sortlist = sorted(dictionary.items(), key=operator.itemgetter(1), reverse=True)[:x]
    # for word, frequency in sortlist:
    #     print("word:{}-frequency:{}".format(word, frequency))
    # The words of the sorted list are the features of the vector
    vocabulary = []
    for word, frequency in sortlist:
        vocabulary.append(word)
    return vocabulary
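The same count-and-keep-the-top-x idea can be sketched more compactly with `collections.Counter`. This is an illustrative alternative under assumptions, not the function above: `build_vocabulary`, the toy tokens, and the toy stopword set are all made up here.

```python
from collections import Counter

def build_vocabulary(tokens, stopwords, size):
    """Return the `size` most frequent tokens that are not stopwords."""
    counts = Counter(token for token in tokens if token not in stopwords)
    return [word for word, _ in counts.most_common(size)]

# Hypothetical, already tokenized input
toy_tokens = ["the", "film", "is", "good", ",", "the", "cast",
              "is", "good", ",", "good", "film"]
toy_stopwords = {"the", "is", ","}

print(build_vocabulary(toy_tokens, toy_stopwords, 2))  # ['good', 'film']
```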
Train the model:
The feature vectors are the input -> X_Train
The labels are the output -> Y_Train
Train the model with an SVM, as shown below:
X_Train = []
Y_Train = []
vocabulary = train_svm_classifier(dataset_file_pos, dataset_file_neg, 1200)
# print(vocabulary)
for sentence in dataset_file_pos:
    # Tokenize first: get_vector counts tokens, and looping over a raw string yields characters
    X_Train.append(get_vector(vocabulary, transfer_word(sentence)))
    Y_Train.append(1)  # 1 = positive
for sentence in dataset_file_neg:
    X_Train.append(get_vector(vocabulary, transfer_word(sentence)))
    Y_Train.append(0)  # 0 = negative
# print(X_Train[1])
X_train_sentanalysis = np.asarray(X_Train)
Y_train_sentanalysis = np.asarray(Y_Train)
# print(Y_train_sentanalysis[34])
svm_clf_sentanalysis = sklearn.svm.SVC(gamma='auto')
svm_clf_sentanalysis.fit(X_train_sentanalysis, Y_train_sentanalysis)
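Once fitted, the classifier labels a new review by vectorizing it the same way as the training data and calling `predict`. A minimal sketch on made-up two-feature count vectors, not the real 1200-feature data; all names ending in `_toy` are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: column 0 counts a positive word, column 1 a negative word
X_toy = np.array([[2, 0], [1, 0], [0, 2], [0, 1]])
y_toy = np.array([1, 1, 0, 0])  # 1 = positive, 0 = negative

clf_toy = SVC(gamma='auto')
clf_toy.fit(X_toy, y_toy)

# Two unseen vectors: one dominated by the positive word, one by the negative word
new_vectors = np.array([[3, 0], [0, 3]])
predictions = clf_toy.predict(new_vectors)
print(predictions)
```

With the real model, the new vector would come from get_vector applied to the tokenized review and the same vocabulary used for training.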
Feature Selection
Select the most informative features to reduce overfitting; a smaller, more relevant feature set also shortens training time.
Import the required packages:
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest
From all the features, keep only the 7 that score highest against the labels (chi-squared test):
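# Fit the selector on the feature vectors and labels, keeping the k=7 highest-scoring features
fs_sentanalysis = SelectKBest(chi2, k=7).fit(X_train_sentanalysis, Y_train_sentanalysis)
# Reduce the feature vectors to the selected columns
X_train_new = fs_sentanalysis.transform(X_train_sentanalysis)
# Train a second classifier on the reduced features
svm_clf_sentanaly = sklearn.svm.SVC(gamma='auto')
svm_clf_sentanaly.fit(X_train_new, Y_train_sentanalysis)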
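To see which words a fitted selector kept, `get_support(indices=True)` returns the selected column indices, which can be mapped back to the vocabulary. New data must pass through the same fitted selector before prediction. A minimal sketch on made-up data; the toy vocabulary and counts are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical vocabulary and count vectors; "movie" appears equally in both classes
toy_vocab = ["good", "bad", "movie"]
X_toy = np.array([[3, 0, 1], [2, 0, 1], [0, 3, 1], [0, 2, 1]])
y_toy = np.array([1, 1, 0, 0])

selector = SelectKBest(chi2, k=2).fit(X_toy, y_toy)

# Map the selected column indices back to words
kept_indices = selector.get_support(indices=True)
kept_words = [toy_vocab[i] for i in kept_indices]
print(kept_words)

# A new vector is reduced with the same fitted selector, never a freshly fitted one
X_new = np.array([[1, 0, 2]])
X_new_reduced = selector.transform(X_new)
print(X_new_reduced)
```

The word that is equally frequent in both classes carries no information about the label, so it is the one dropped here.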