Installing scikit-learn on Ubuntu 14.04

Latest recommended article published on 2018-12-17 20:20:28

shuifu1988

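Environment: Python 2.7
1. Install the dependencies
sudo apt-get install build-essential python-dev python-numpy python-setuptools python-scipy libatlas-dev libatlas3-base
2. Install matplotlib, which is used for plotting
sudo apt-get install python-matplotlib
3. Install scikit-learn
sudo apt-get install python-sklearn
4. Verify the installation by starting the interpreter and importing each package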
python
import numpy
import scipy
import matplotlib
import sklearn
If none of the imports raises an error, the installation succeeded.
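The same check can be scripted so that it also reports which version of each package was picked up. The following is a minimal sketch, not part of the original post; the helper name `check_packages` is hypothetical, and it relies only on the standard-library `importlib` module.

```python
import importlib


def check_packages(names):
    """Return {name: version string, or None if the import fails}."""
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            found[name] = None
        else:
            # Not every module exposes __version__, so fall back to "unknown".
            found[name] = getattr(module, "__version__", "unknown")
    return found


report = check_packages(["numpy", "scipy", "matplotlib", "sklearn"])
for name in sorted(report):
    version = report[name]
    print("%-12s %s" % (name, version if version is not None else "NOT INSTALLED"))
```

Any package listed as NOT INSTALLED points back to the corresponding apt-get step above.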
Examples:
A straight-line plot, to test matplotlib

import matplotlib
import numpy
import scipy
import matplotlib.pyplot as plt

plt.plot([1,2,3])
plt.ylabel('some numbers')
plt.show()
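`plt.show()` needs a graphical display. On a machine reached over SSH without a display, the same test can write the figure to a file instead. The sketch below assumes that situation; the helper name `save_line_plot` and the output file name are made up for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; older releases need this before pyplot is imported
import matplotlib.pyplot as plt


def save_line_plot(path):
    """Draw the same straight line as above and save it as an image file."""
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3])
    ax.set_ylabel("some numbers")
    fig.savefig(path)
    plt.close(fig)  # release the figure's memory
    return path


out = save_line_plot(os.path.join(tempfile.gettempdir(), "line_test.png"))
print("wrote %s" % out)
```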

A heart-shaped curve, to test numpy and matplotlib

import numpy as np
import matplotlib.pyplot as plt

X = np.arange(-5.0, 5.0, 0.1)
Y = np.arange(-5.0, 5.0, 0.1)

x, y = np.meshgrid(X, Y)
f = 17 * x ** 2 - 16 * np.abs(x) * y + 17 * y ** 2 - 225

fig = plt.figure()
# draw only the f = 0 level set, which traces the outline of the heart
cs = plt.contour(x, y, f, [0], colors = 'r')
plt.show()
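The contour call draws the set of points where `f` equals zero, so the heart is an implicit curve. A few spot checks of the same expression make that concrete. This is a small sketch using plain Python only; the function name `heart` is introduced here for illustration.

```python
import math


def heart(x, y):
    """Left-hand side of the implicit equation f(x, y) = 0 used above."""
    return 17 * x ** 2 - 16 * abs(x) * y + 17 * y ** 2 - 225


print(heart(0.0, 0.0))  # negative: the origin lies inside the curve
print(heart(5.0, 5.0))  # positive: this point lies outside the curve

# On the y-axis the equation reduces to 17 * y**2 = 225.
y_axis_crossing = math.sqrt(225.0 / 17.0)
print(abs(heart(0.0, y_axis_crossing)) < 1e-9)

# abs(x) makes the curve mirror-symmetric about the y-axis.
print(heart(-2.0, 1.0) == heart(2.0, 1.0))
```

The sign change between the inside and the outside is what lets `contour` locate the zero level on the grid.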

A grouped bar chart with error bars, to show matplotlib's richer plotting and interactive features

import numpy as np
import matplotlib.pyplot as plt

N = 5
menMeans = (20, 35, 30, 35, 27)
menStd =   (2, 3, 4, 1, 2)

ind = np.arange(N)  # the x locations for the groups
width = 0.35        # the width of the bars

fig, ax = plt.subplots()
rects1 = ax.bar(ind, menMeans, width, color='r', yerr=menStd)

womenMeans = (25, 32, 34, 20, 25)
womenStd =   (3, 5, 2, 3, 3)
rects2 = ax.bar(ind+width, womenMeans, width, color='y', yerr=womenStd)

# add the axis label, title, tick labels and legend
ax.set_ylabel('Scores')
ax.set_title('Scores by group and gender')
ax.set_xticks(ind+width)
ax.set_xticklabels( ('G1', 'G2', 'G3', 'G4', 'G5') )

ax.legend( (rects1[0], rects2[0]), ('Men', 'Women') )

def autolabel(rects):
    # attach some text labels
    for rect in rects:
        height = rect.get_height()
        ax.text(rect.get_x()+rect.get_width()/2., 1.05*height, '%d'%int(height),
                ha='center', va='bottom')

autolabel(rects1)
autolabel(rects2)

plt.show()

Built-in matrix datasets, to test sklearn

from sklearn import datasets
iris = datasets.load_iris()
digits = datasets.load_digits()
print digits.data
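Printing the raw array mostly shows that loading worked. Checking the array shapes is a quicker sanity test. The sketch below uses only attributes that the loaded dataset objects expose (`data`, `target`, `images`), and the print calls are written so they run on both Python 2.7 and Python 3.

```python
from sklearn import datasets

iris = datasets.load_iris()
digits = datasets.load_digits()

# Each row of .data is one sample, each column one feature.
print("iris:   %d samples x %d features" % iris.data.shape)
print("digits: %d samples x %d features" % digits.data.shape)

# Each digits sample is an 8x8 image flattened into 64 features.
print("digits image shape: %d x %d" % digits.images.shape[1:])

# .target holds the class label of every sample.
print("iris classes: %d" % len(set(iris.target)))
```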

Chinese word segmentation is done with jieba, so install the jieba package:
sudo pip install jieba
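scikit-learn's text vectorizers expect words separated by whitespace, so Chinese sentences are segmented first and re-joined with spaces. A minimal sketch of that step follows. The helper name `to_space_separated` and its `cut` parameter are hypothetical; `jieba.cut` is the segmentation function the jieba package provides.

```python
def to_space_separated(sentences, cut=None):
    """Segment each sentence and join its words with single spaces."""
    if cut is None:
        import jieba  # imported lazily so a stand-in tokenizer can be passed instead
        cut = jieba.cut
    return [" ".join(cut(sentence)) for sentence in sentences]
```

Calling `to_space_separated` on unsegmented sentences should yield strings in the space-separated form that the TF-IDF corpus uses, although the exact word boundaries depend on jieba's dictionary.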
Computing TF-IDF word weights, to test scikit-learn's text analysis

# coding:utf-8
__author__ = "liuxuejiang"
import jieba
import jieba.posseg as pseg
import os
import sys
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

if __name__ == "__main__":
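    corpus=["我 来到 北京 清华大学",     # document 0 after segmentation, words separated by spaces
        "他 来到 了 网易 杭研 大厦",     # document 1 after segmentation
        "小明 硕士 毕业 与 中国 科学院",  # document 2 after segmentation
        "我 爱 北京 天安门"]            # document 3 after segmentation

    # CountVectorizer turns the texts into a term-count matrix; element a[i][j] is the count of word j in document i.
    # Its default tokenizer ignores single-character words such as 我 and 了, so they are missing from the output.
    vectorizer=CountVectorizer()

    # TfidfTransformer computes the tf-idf weight of every word
    transformer=TfidfTransformer()

    # the inner fit_transform builds the term-count matrix; the outer fit_transform computes tf-idf from it
    tfidf=transformer.fit_transform(vectorizer.fit_transform(corpus))

    # get all the words in the bag-of-words vocabulary
    # (scikit-learn 1.0 and later provide get_feature_names_out() instead)
    word=vectorizer.get_feature_names()

    # extract the tf-idf matrix as a dense array; element a[i][j] is the tf-idf weight of word j in document i
    weight=tfidf.toarray()

    # print the tf-idf weights: the outer loop iterates over the documents, the inner loop over the words of one document
    for i in range(len(weight)):
        # the header string reads "tf-idf weights of the words in document i"
        print u"-------这里输出第",i,u"类文本的词语tf-idf权重------"
        for j in range(len(word)):
            print word[j],weight[i][j]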

Output (each header line reads "tf-idf weights of the words in document i"):

-------这里输出第 0 类文本的词语tf-idf权重------
中国 0.0
北京 0.52640543361
大厦 0.0
天安门 0.0
小明 0.0
来到 0.52640543361
杭研 0.0
毕业 0.0
清华大学 0.66767854461
硕士 0.0
科学院 0.0
网易 0.0
-------这里输出第 1 类文本的词语tf-idf权重------
中国 0.0
北京 0.0
大厦 0.525472749264
天安门 0.0
小明 0.0
来到 0.414288751166
杭研 0.525472749264
毕业 0.0
清华大学 0.0
硕士 0.0
科学院 0.0
网易 0.525472749264
-------这里输出第 2 类文本的词语tf-idf权重------
中国 0.4472135955
北京 0.0
大厦 0.0
天安门 0.0
小明 0.4472135955
来到 0.0
杭研 0.0
毕业 0.4472135955
清华大学 0.0
硕士 0.4472135955
科学院 0.4472135955
网易 0.0
-------这里输出第 3 类文本的词语tf-idf权重------
中国 0.0
北京 0.61913029649
大厦 0.0
天安门 0.78528827571
小明 0.0
来到 0.0
杭研 0.0
毕业 0.0
清华大学 0.0
硕士 0.0
科学院 0.0
网易 0.0
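The weights above can be reproduced by hand, which shows what the transformer computes. The sketch below assumes the `TfidfTransformer` defaults are in effect, since the script does not change them: smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by scaling each document to unit Euclidean length. The function names are introduced here for illustration only.

```python
# -*- coding: utf-8 -*-
import math


def smooth_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0


def tfidf_row(term_counts, doc_freqs, n_docs):
    """tf * idf for one document, scaled to unit Euclidean (L2) length."""
    raw = dict((term, count * smooth_idf(n_docs, doc_freqs[term]))
               for term, count in term_counts.items())
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return dict((term, value / norm) for term, value in raw.items())


# Document 0 keeps three words once single-character tokens are dropped.
counts_doc0 = {"北京": 1, "来到": 1, "清华大学": 1}
# Number of the four documents that contain each of those words.
doc_freqs = {"北京": 2, "来到": 2, "清华大学": 1}

row = tfidf_row(counts_doc0, doc_freqs, n_docs=4)
for term in sorted(row):
    print("%s %.6f" % (term, row[term]))
```

北京 and 来到 each occur in two documents, so they receive a lower weight than 清华大学, which occurs in only one. The same reasoning explains document 2: its five remaining words each occur in one document, so every weight is 1/sqrt(5), about 0.4472.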