Python Machine Learning - Chapter 4 Code Example

[Sebastian Raschka](http://sebastianraschka.com), 2015 https://github.com/rasbt/python-machine-learning-book

Python Machine Learning Essentials - Code Examples

Chapter 4 - Building Good Training Sets – Data Pre-Processing

Note that the optional watermark extension is a small IPython notebook plugin that I developed to make the code reproducible. You can just skip the following line(s).

%load_ext watermark
%watermark -a 'Sebastian Raschka' -u -d -v -p numpy,pandas,matplotlib,scikit-learn
The watermark extension is already loaded. To reload it, use: %reload_ext watermark

Sebastian Raschka
last updated: 2016-03-25

CPython 3.5.1
IPython 4.0.3

numpy 1.10.4
pandas 0.17.1
matplotlib 1.5.1
scikit-learn 0.17.1
# to install watermark just uncomment the following line:
#%install_ext https://raw.githubusercontent.com/rasbt/watermark/master/watermark.py
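# Note: the %install_ext magic has since been removed from IPython; with a
# recent setup the extension is typically installed from PyPI instead
# (a hedged aside, assuming a pip-based environment):
# pip install watermark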


Overview



from IPython.display import Image

Dealing with missing data

import pandas as pd
from io import StringIO

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''

# If you are using Python 2.7, you need
# to convert the string to unicode:
# csv_data = unicode(csv_data)

df = pd.read_csv(StringIO(csv_data))
df
      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN
df.isnull().sum()
A    0
B    0
C    1
D    1
dtype: int64

Eliminating samples or features with missing values

df.dropna()
     A    B    C    D
0  1.0  2.0  3.0  4.0
df.dropna(axis=1)
      A     B
0   1.0   2.0
1   5.0   6.0
2  10.0  11.0
# only drop rows where all columns are NaN
df.dropna(how='all')  
      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN
# drop rows that have fewer than 4 non-NaN values
df.dropna(thresh=4)
     A    B    C    D
0  1.0  2.0  3.0  4.0
# only drop rows where NaN appear in specific columns (here: 'C')
df.dropna(subset=['C'])
      A     B     C    D
0   1.0   2.0   3.0  4.0
2  10.0  11.0  12.0  NaN



Imputing missing values

from sklearn.preprocessing import Imputer

imr = Imputer(missing_values='NaN', strategy='mean', axis=0)  # impute NaNs with the mean; axis=0 uses column means, axis=1 would use row means
imr = imr.fit(df)  # learn the per-column means from df
imputed_data = imr.transform(df.values)  # fill in the missing values
imputed_data
array([[  1. ,   2. ,   3. ,   4. ],
       [  5. ,   6. ,   7.5,   8. ],
       [ 10. ,  11. ,  12. ,   6. ]])

For comparison, the original array still contains the missing values:

df.values
array([[  1.,   2.,   3.,   4.],
       [  5.,   6.,  nan,   8.],
       [ 10.,  11.,  12.,  nan]])
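Note: Imputer ships with the scikit-learn 0.17 used in this notebook; in modern scikit-learn (0.20 and later) it was replaced by SimpleImputer. A minimal equivalent sketch, assuming a recent scikit-learn version:

# Modern replacement for Imputer (scikit-learn >= 0.20); SimpleImputer
# has no axis parameter and always imputes column-wise:
import numpy as np
from sklearn.impute import SimpleImputer

imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_data = imr.fit_transform(df.values)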


Understanding the scikit-learn estimator API

Image(filename='./images/04_04.png', width=400) 
Image(filename='./images/04_05.png', width=400) 

Handling categorical data

import pandas as pd
df = pd.DataFrame([
            ['green', 'M', 10.1, 'class1'], 
            ['red', 'L', 13.5, 'class2'], 
            ['blue', 'XL', 15.3, 'class1']])

df.columns = ['color', 'size', 'price', 'classlabel']
df
   color size  price classlabel
0  green    M   10.1     class1
1    red    L   13.5     class2
2   blue   XL   15.3     class1

Mapping ordinal features

size_mapping = {
           'XL': 3,
           'L': 2,
           'M': 1}

df['size'] = df['size'].map(size_mapping)
df
   color  size  price classlabel
0  green     1   10.1     class1
1    red     2   13.5     class2
2   blue     3   15.3     class1
inv_size_mapping = {v: k for k, v in size_mapping.items()}
df['size'].map(inv_size_mapping)
0     M
1     L
2    XL
Name: size, dtype: object

Encoding class labels

import numpy as np

class_mapping = {label: idx for idx, label in enumerate(np.unique(df['classlabel']))}
class_mapping
{'class1': 0, 'class2': 1}
df['classlabel'] = df['classlabel'].map(class_mapping)  # convert the string class labels to integers
df
   color  size  price  classlabel
0  green     1   10.1           0
1    red     2   13.5           1
2   blue     3   15.3           0
inv_class_mapping = {v: k for k, v in class_mapping.items()}  # invert the mapping to recover the original string labels
df['classlabel'] = df['classlabel'].map(inv_class_mapping)
df
   color  size  price classlabel
0  green     1   10.1     class1
1    red     2   13.5     class2
2   blue     3   15.3     class1
from sklearn.preprocessing import LabelEncoder

class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)
y
array([0, 1, 0])
class_le.inverse_transform(y)
array(['class1', 'class2', 'class1'], dtype=object)

Performing one-hot encoding on nominal features

X = df[['color', 'size', 'price']].values

color_le = LabelEncoder()
X[:, 0] = color_le.fit_transform(X[:, 0])
X
array([[1, 1, 10.1],
       [2, 2, 13.5],
       [0, 3, 15.3]], dtype=object)
from sklearn.preprocessing import OneHotEncoder 
"""对于nominal特征,我们如果单纯的设定数值给它,会让算法误以为他们之间有大小之差。所以要采用其他办法来表示。
比如颜色特征有三种,红色、蓝色、黄色,不能单纯的设定红=0,蓝=1,黄=2,create a new dummy feature for each
unique value in the nominal feature column,蓝色我们可以认为红=0,蓝=1,黄=0,即[0,1,0]。
这种方法叫做OneHotEncoder,可以直接调用
"""
ohe = OneHotEncoder(categorical_features=[0])
ohe.fit_transform(X).toarray()
array([[  0. ,   1. ,   0. ,   1. ,  10.1],
       [  0. ,   0. ,   1. ,   2. ,  13.5],
       [  1. ,   0. ,   0. ,   3. ,  15.3]])
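As a hedged aside, the intermediate .toarray() call can be avoided by asking the encoder for a dense array up front via its sparse keyword:

# Return a regular dense array instead of a sparse matrix:
ohe = OneHotEncoder(categorical_features=[0], sparse=False)
ohe.fit_transform(X)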
pd.get_dummies(df[['price', 'color', 'size']])
   price  size  color_blue  color_green  color_red
0   10.1     1           0            1          0
1   13.5     2           0            0          1
2   15.3     3           1            0          0
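One dummy column per feature is redundant: if color_blue and color_green are both 0, the color must be red. Newer pandas versions (0.18 and later; this notebook pins 0.17.1) can drop one level per feature, sketched here under that version assumption:

# Drop the first level of each encoded feature to avoid perfectly
# correlated dummy columns (requires pandas >= 0.18):
pd.get_dummies(df[['price', 'color', 'size']], drop_first=True)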



Partitioning a dataset in training and test sets

df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None)

df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue',
                   'OD280/OD315 of diluted wines', 'Proline']

print('Class labels', np.unique(df_wine['Class label']))
df_wine.head()
Class labels [1 2 3]
   Class label  Alcohol  Malic acid   Ash  Alcalinity of ash  Magnesium  \
0            1    14.23        1.71  2.43               15.6        127
1            1    13.20        1.78  2.14               11.2        100
2            1    13.16        2.36  2.67               18.6        101
3            1    14.37        1.95  2.50               16.8        113
4            1    13.24        2.59  2.87               21.0        118

   Total phenols  Flavanoids  Nonflavanoid phenols  Proanthocyanins  \
0           2.80        3.06                  0.28             2.29
1           2.65        2.76                  0.26             1.28
2           2.80        3.24                  0.30             2.81
3           3.85        3.49                  0.24             2.18
4           2.80        2.69                  0.39             1.82

   Color intensity   Hue  OD280/OD315 of diluted wines  Proline
0             5.64  1.04                          3.92     1065
1             4.38  1.05                          3.40     1050
2             5.68  1.03                          3.17     1185
3             7.80  0.86                          3.45     1480
4             4.32  1.04                          2.93      735

Note:

If the link to the Wine dataset provided above does not work for you, you can find a local copy in this repository at ./../datasets/wine/wine.data.

Or you could fetch it via

df_wine = pd.read_csv('https://raw.githubusercontent.com/rasbt/python-machine-learning-book/master/code/datasets/wine/wine.data', header=None)

df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue',
                   'OD280/OD315 of diluted wines', 'Proline']
df_wine.head()
   Class label  Alcohol  Malic acid   Ash  Alcalinity of ash  Magnesium  \
0            1    14.23        1.71  2.43               15.6        127
1            1    13.20        1.78  2.14               11.2        100
2            1    13.16        2.36  2.67               18.6        101
3            1    14.37        1.95  2.50               16.8        113
4            1    13.24        2.59  2.87               21.0        118

   Total phenols  Flavanoids  Nonflavanoid phenols  Proanthocyanins  \
0           2.80        3.06                  0.28             2.29
1           2.65        2.76                  0.26             1.28
2           2.80        3.24                  0.30             2.81
3           3.85        3.49                  0.24             2.18
4           2.80        2.69                  0.39             1.82

   Color intensity   Hue  OD280/OD315 of diluted wines  Proline
0             5.64  1.04                          3.92     1065
1             4.38  1.05                          3.40     1050
2             5.68  1.03                          3.17     1185
3             7.80  0.86                          3.45     1480
4             4.32  1.04                          2.93      735

from sklearn.cross_validation import train_test_split

X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values

X_train, X_test, y_train, y_test = \
        train_test_split(X, y, test_size=0.3, random_state=0)  # hold out 30% of the samples as the test set
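Note: sklearn.cross_validation was renamed to sklearn.model_selection in scikit-learn 0.18. As a hedged variant for illustration only (the rest of the chapter keeps the split above), passing stratify=y preserves the class proportions in both subsets; the parameter is available from scikit-learn 0.17 on:

# Stratified variant: class frequencies of y are preserved in both splits
X_train_s, X_test_s, y_train_s, y_test_s = \
        train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)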


Bringing features onto the same scale

# almost all machine learning algorithms benefit from feature scaling; decision trees and random forests are notable exceptions
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)
from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)
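Both scalers implement simple per-feature formulas: standardization computes x_std = (x - mean) / std, yielding zero mean and unit variance, while min-max normalization computes x_norm = (x - min) / (max - min), squeezing each feature into the [0, 1] interval.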
A visual example:
ex = pd.DataFrame([0, 1, 2 ,3, 4, 5])

# standardize
ex[1] = (ex[0] - ex[0].mean()) / ex[0].std(ddof=0)

# Please note that pandas uses ddof=1 (sample standard deviation) 
# by default, whereas NumPy's std method and the StandardScaler
# uses ddof=0 (population standard deviation)

# normalize
ex[2] = (ex[0] - ex[0].min()) / (ex[0].max() - ex[0].min())
ex.columns = ['input', 'standardized', 'normalized']
ex
   input  standardized  normalized
0      0      -1.46385         0.0
1      1      -0.87831         0.2
2      2      -0.29277         0.4
3      3       0.29277         0.6
4      4       0.87831         0.8
5      5       1.46385         1.0
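In practice, standardization is often the more useful default for the gradient-based learners used in this book, since many of them initialize their weights at or near zero and zero-centered features keep the updates well behaved; min-max scaling is preferable when values are needed in a bounded interval.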



Selecting meaningful features

Sparse solutions with L1-regularization

Image(filename='./images/04_12.png', width=500) 


Image(filename='./images/04_13.png', width=500) 


from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l1', C=0.1)
lr.fit(X_train_std, y_train)
print('Training accuracy:', lr.score(X_train_std, y_train))
print('Test accuracy:', lr.score(X_test_std, y_test))
Training accuracy: 0.983870967742
Test accuracy: 0.981481481481
lr.intercept_
array([-0.38378555, -0.15806701, -0.70042279])
lr.coef_
array([[ 0.28029903,  0.        ,  0.        , -0.02796754,  0.        ,
         0.        ,  0.71009425,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  1.2359698 ],
       [-0.64405349, -0.0688069 , -0.05719933,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        , -0.9266375 ,
         0.06023396,  0.        , -0.37104604],
       [ 0.        ,  0.06148928,  0.        ,  0.        ,  0.        ,
         0.        , -0.63612386,  0.        ,  0.        ,  0.49811303,
        -0.35820494, -0.57119385,  0.        ]])
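Most entries of the weight matrix are exactly zero, the hallmark of the L1 penalty. A quick illustrative check of the per-class sparsity (not from the book):

# Fraction of zero-valued weights in each of the three one-vs-rest models
(lr.coef_ == 0).mean(axis=1)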
import matplotlib.pyplot as plt
%matplotlib inline

fig = plt.figure()
ax = plt.subplot(111)

colors = ['blue', 'green', 'red', 'cyan', 
         'magenta', 'yellow', 'black', 
          'pink', 'lightgreen', 'lightblue', 
          'gray', 'indigo', 'orange']

weights, params = [], []
for c in np.arange(-4., 6.):
    lr = LogisticRegression(penalty='l1', C=10.**c, random_state=0)
    lr.fit(X_train_std, y_train)
    weights.append(lr.coef_[1])
    params.append(10.**c)

weights = np.array(weights)

for column, color in zip(range(weights.shape[1]), colors):
    plt.plot(params, weights[:, column],
             label=df_wine.columns[column+1],
             color=color)
plt.axhline(0, color='black', linestyle='--', linewidth=3)
plt.xlim([10**(-5), 10**5])
plt.ylabel('weight coefficient')
plt.xlabel('C')
plt.xscale('log')
plt.legend(loc='upper left')
ax.legend(loc='upper center', 
          bbox_to_anchor=(1.38, 1.03),
          ncol=1, fancybox=True)
# plt.savefig('./figures/l1_path.png', dpi=300)
plt.show()

[figure: L1 regularization paths, weight coefficient of each feature vs. C]



Sequential feature selection algorithms

from sklearn.base import clone
from itertools import combinations
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.metrics import accuracy_score  # metric used to score each feature subset

class SBS():
    """Sequential Backward Selection: starting from the full feature set,
    repeatedly remove the feature whose removal hurts the validation
    accuracy the least, until only k_features remain."""
    def __init__(self, estimator, k_features, scoring=accuracy_score,
                 test_size=0.25, random_state=1):
        self.scoring = scoring
        self.estimator = clone(estimator)
        self.k_features = k_features
        self.test_size = test_size
        self.random_state = random_state

    def fit(self, X, y):

        X_train, X_test, y_train, y_test = \
                train_test_split(X, y, test_size=self.test_size, 
                                 random_state=self.random_state)

        dim = X_train.shape[1]
        self.indices_ = tuple(range(dim))
        self.subsets_ = [self.indices_]
        score = self._calc_score(X_train, y_train, 
                                 X_test, y_test, self.indices_)
        self.scores_ = [score]

        while dim > self.k_features:
            scores = []
            subsets = []

            # score every subset that leaves exactly one feature out
            for p in combinations(self.indices_, r=dim-1):
                score = self._calc_score(X_train, y_train,
                                         X_test, y_test, p)
                scores.append(score)
                subsets.append(p)

            # keep the best-scoring subset and shrink the dimensionality
            best = np.argmax(scores)
            self.indices_ = subsets[best]
            self.subsets_.append(self.indices_)
            dim -= 1

            self.scores_.append(scores[best])
        self.k_score_ = self.scores_[-1]

        return self

    def transform(self, X):
        return X[:, self.indices_]

    def _calc_score(self, X_train, y_train, X_test, y_test, indices):
        self.estimator.fit(X_train[:, indices], y_train)
        y_pred = self.estimator.predict(X_test[:, indices])
        score = self.scoring(y_test, y_pred)
        return score
%matplotlib inline
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

knn = KNeighborsClassifier(n_neighbors=2)

# selecting features
sbs = SBS(knn, k_features=1)
sbs.fit(X_train_std, y_train)

# plotting performance of feature subsets
k_feat = [len(k) for k in sbs.subsets_]

plt.plot(k_feat, sbs.scores_, marker='o')
plt.ylim([0.7, 1.1])
plt.ylabel('Accuracy')
plt.xlabel('Number of features')
plt.grid()
plt.tight_layout()
# plt.savefig('./sbs.png', dpi=300)
plt.show()

[figure: SBS validation accuracy vs. number of features]

k5 = list(sbs.subsets_[8])  # after 8 removals, 13 - 8 = 5 features remain
print(df_wine.columns[1:][k5])
Index(['Alcohol', 'Malic acid', 'Alcalinity of ash', 'Hue', 'Proline'], dtype='object')
knn.fit(X_train_std, y_train)
print('Training accuracy:', knn.score(X_train_std, y_train))
print('Test accuracy:', knn.score(X_test_std, y_test))
Training accuracy: 0.983870967742
Test accuracy: 0.944444444444
knn.fit(X_train_std[:, k5], y_train)
print('Training accuracy:', knn.score(X_train_std[:, k5], y_train))
print('Test accuracy:', knn.score(X_test_std[:, k5], y_test))
Training accuracy: 0.959677419355
Test accuracy: 0.962962962963



Assessing Feature Importances with Random Forests

from sklearn.ensemble import RandomForestClassifier

feat_labels = df_wine.columns[1:]

forest = RandomForestClassifier(n_estimators=10000,
                                random_state=0,
                                n_jobs=-1)

forest.fit(X_train, y_train)
importances = forest.feature_importances_

indices = np.argsort(importances)[::-1]

for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, 
                            feat_labels[indices[f]], 
                            importances[indices[f]]))

plt.title('Feature Importances')
plt.bar(range(X_train.shape[1]), 
        importances[indices],
        color='lightblue', 
        align='center')

plt.xticks(range(X_train.shape[1]), 
           feat_labels[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.tight_layout()
#plt.savefig('./random_forest.png', dpi=300)
plt.show()
 1) Color intensity                0.182483
 2) Proline                        0.158610
 3) Flavanoids                     0.150948
 4) OD280/OD315 of diluted wines   0.131987
 5) Alcohol                        0.106589
 6) Hue                            0.078243
 7) Total phenols                  0.060718
 8) Alcalinity of ash              0.032033
 9) Malic acid                     0.025400
10) Proanthocyanins                0.022351
11) Magnesium                      0.022078
12) Nonflavanoid phenols           0.014645
13) Ash                            0.013916

[figure: bar chart of the random forest feature importances]

X_selected = forest.transform(X_train, threshold=0.15)
X_selected.shape
/Users/Sebastian/miniconda3/lib/python3.5/site-packages/sklearn/utils/__init__.py:93: DeprecationWarning: Function transform is deprecated; Support to use estimators as feature selectors will be removed in version 0.19. Use SelectFromModel instead.
  warnings.warn(msg, category=DeprecationWarning)

(124, 3)
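As the deprecation warning suggests, the estimator-as-feature-selector shortcut was removed in scikit-learn 0.19. A minimal sketch of the SelectFromModel replacement (the class was added in scikit-learn 0.17):

# SelectFromModel wraps the already-fitted forest (prefit=True) and keeps
# the features whose importance exceeds the given threshold:
from sklearn.feature_selection import SelectFromModel

sfm = SelectFromModel(forest, threshold=0.15, prefit=True)
X_selected = sfm.transform(X_train)
X_selected.shape  # (124, 3): the same three features as above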



Summary
