Machine Learning (1) | Feature Extraction

Latest recommended article published on 2024-07-28 23:59:53

Ghost1898688


Article link: https://blog.csdn.net/Ghost1898688/article/details/102645592


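This is the first article in a machine learning series. It mainly discusses the importance of feature engineering and feature selection. Using the classification of positive and negative reviews as a case study, the author stresses how challenging it is to turn data into vectors and notes that domain knowledge helps a great deal. The feature vectors are used to train an SVM model, while feature selection helps avoid overfitting and improves model efficiency.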


Machine Learning Study Notes (1)

Contents
- Feature Engineering
- Feature Selection

Feature Engineering

import operator

import nltk
import numpy as np
import requests
import sklearn.svm

url_pos = "http://josecamachocollados.com/rt-polarity.pos.txt"  # all positive reviews, one review per line
url_neg = "http://josecamachocollados.com/rt-polarity.neg.txt"  # all negative reviews, one review per line

# Load positive reviews
response_pos = requests.get(url_pos)
dataset_file_pos = response_pos.text.split("\n")

# Load negative reviews
response_neg = requests.get(url_neg)
dataset_file_neg = response_neg.text.split("\n")

Define a function that breaks a sentence into a list of lemmatized, lowercased words:

# word_tokenize and WordNetLemmatizer need NLTK's tokenizer and WordNet data (see nltk.download)
lemmatizer = nltk.stem.WordNetLemmatizer()

def transfer_word(sentence):
    """Tokenize a sentence and return its lemmatized, lowercased tokens."""
    token_list = []
    for token in nltk.tokenize.word_tokenize(sentence):
        token_list.append(lemmatizer.lemmatize(token).lower())
    return token_list

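Define a function that, given the vocabulary (the list of features), counts how often each vocabulary word occurs in a sentence and returns the resulting data vector: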

def get_vector(word_dic, tokens):
    """Turn a list of tokens into a count vector over the vocabulary word_dic."""
    word_vector = np.zeros(len(word_dic))
    word_counts = {}
    for word in tokens:
        word_counts[word] = word_counts.get(word, 0) + 1
    # Visit every vocabulary index, including the last one
    for i in range(len(word_dic)):
        word_vector[i] = word_counts.get(word_dic[i], 0)
    return word_vector
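
To make the counting step concrete, here is a minimal sketch on toy data. The names `bag_of_words`, `toy_vocabulary` and `toy_tokens` are hypothetical and chosen only for illustration; the sketch uses `collections.Counter` and plain lists in place of the hand-written dictionary and NumPy array above.

```python
from collections import Counter

def bag_of_words(vocabulary, tokens):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokens)  # token -> number of occurrences
    # A Counter returns 0 for missing keys, so unseen words need no special case
    return [counts[word] for word in vocabulary]

# Hypothetical toy data, not taken from the review dataset
toy_vocabulary = ["good", "bad", "movie"]
toy_tokens = ["good", "movie", "very", "good"]
vector = bag_of_words(toy_vocabulary, toy_tokens)
print(vector)  # [2, 0, 1]
```

Words outside the vocabulary ("very" here) are simply ignored, which is why the choice of vocabulary determines what the model can see.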

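Define a function that counts every word in the dataset. Each of the x most frequent words becomes one feature, and together they form the vocabulary behind the feature vector: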

def train_svm_classifier(dataset_file_pos, dataset_file_neg, x):
    """Despite its name, this only builds the vocabulary: the x most frequent non-stopword tokens."""
    # English stopwords plus punctuation tokens that carry no sentiment
    stopwords = set(nltk.corpus.stopwords.words('english'))
    stopwords.update([".", ",", "--", "``", "'", "[", "]"])
    # Count every remaining word across positive and negative reviews
    dictionary = {}
    for sentence in dataset_file_pos + dataset_file_neg:
        for word in transfer_word(sentence):
            if word in stopwords:
                continue
            dictionary[word] = dictionary.get(word, 0) + 1
    # Sort items by frequency (the second field), descending, and keep the top x
    sortlist = sorted(dictionary.items(), key=operator.itemgetter(1), reverse=True)[:x]
    vocabulary = [word for word, frequency in sortlist]
    return vocabulary
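
The same frequency-based vocabulary can be sketched more compactly with the standard library. This is an illustrative alternative under stated assumptions, not the code used in this post: `build_vocabulary` and the toy inputs are hypothetical names, and the sentences are assumed to be tokenized already.

```python
from collections import Counter

def build_vocabulary(tokenized_sentences, stopwords, size):
    """Return the `size` most frequent tokens that are not stopwords."""
    counts = Counter()
    for tokens in tokenized_sentences:
        counts.update(t for t in tokens if t not in stopwords)
    # most_common returns (word, count) pairs sorted by descending count
    return [word for word, _ in counts.most_common(size)]

# Hypothetical toy data, not taken from the review dataset
toy_sentences = [
    ["the", "movie", "is", "good"],
    ["the", "movie", "is", "bad"],
    ["good", "movie"],
]
toy_stopwords = {"the", "is"}
toy_vocab = build_vocabulary(toy_sentences, toy_stopwords, 2)
print(toy_vocab)  # ['movie', 'good']
```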

Training the model:
The feature vectors are the input -> X_Train
The labels are the output -> Y_Train
Train the model with an SVM, as shown below:

X_Train = []
Y_Train = []
vocabulary = train_svm_classifier(dataset_file_pos, dataset_file_neg, 1200)

# Tokenize each review before vectorizing; positive reviews get label 1, negative reviews 0
for sentence in dataset_file_pos:
    X_Train.append(get_vector(vocabulary, transfer_word(sentence)))
    Y_Train.append(1)
for sentence in dataset_file_neg:
    X_Train.append(get_vector(vocabulary, transfer_word(sentence)))
    Y_Train.append(0)

X_train_sentanalysis = np.asarray(X_Train)
Y_train_sentanalysis = np.asarray(Y_Train)

svm_clf_sentanalysis = sklearn.svm.SVC(gamma='auto')
svm_clf_sentanalysis.fit(X_train_sentanalysis, Y_train_sentanalysis)


Feature Selection

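Selecting only the most important features helps avoid overfitting. A smaller, more relevant feature set also shortens training time and makes the model more efficient.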

Import the required packages:

from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest

From all the features, keep only the 7 that score highest against the labels:

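# Score each feature against the labels with the chi-squared test and keep the 7 best
fs_sentanalysis = SelectKBest(chi2, k=7).fit(X_train_sentanalysis, Y_train_sentanalysis)
# Reduce the training matrix to the selected feature columns
X_train_new = fs_sentanalysis.transform(X_train_sentanalysis)
# Train a new SVM on the reduced features
svm_clf_sentanaly = sklearn.svm.SVC(gamma='auto')
svm_clf_sentanaly.fit(X_train_new, Y_train_sentanalysis)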
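
To see which words a selector actually keeps, the boolean mask from `get_support()` can be mapped back onto the vocabulary. The sketch below runs on a small made-up count matrix; `toy_vocabulary`, `X_toy` and `y_toy` are hypothetical and not part of the review dataset. Note that `chi2` expects non-negative feature values, which word counts satisfy.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy data: rows are reviews, columns are word counts
toy_vocabulary = ["good", "bad", "movie", "film"]
X_toy = np.array([
    [2, 0, 1, 1],
    [1, 0, 1, 0],
    [3, 0, 0, 1],
    [0, 2, 1, 1],
    [0, 1, 1, 0],
    [0, 3, 0, 1],
])
y_toy = np.array([1, 1, 1, 0, 0, 0])  # 1 = positive, 0 = negative

selector = SelectKBest(chi2, k=2).fit(X_toy, y_toy)
# get_support() returns a boolean mask over the original feature columns
kept = [word for word, keep in zip(toy_vocabulary, selector.get_support()) if keep]
print(kept)  # ['good', 'bad']
```

In this toy matrix "movie" and "film" appear equally often in both classes, so they score zero and are dropped, while "good" and "bad" each occur in only one class and are kept.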