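Notice: All rights reserved. To repost, please contact the author and credit the source: http://blog.csdn.net/u013719780?viewmode=contents
About the author: 风雪夜归子 (Allen) is a machine learning algorithm engineer who enjoys digging into the "black magic" of Machine Learning, is keenly interested in Deep Learning and Artificial Intelligence, and regularly follows the Kaggle data mining competition platform. Readers interested in data, Machine Learning, and Artificial Intelligence are welcome to join the discussion. Personal CSDN blog: http://blog.csdn.net/u013719780?viewmode=contents
Data visualization helps in understanding data, and it also plays an important role in the feature engineering stage of a machine learning project, so it is a tool well worth mastering. This series of posts offers a brief exploration of data visualization. This post uses Python's ggplot package to visualize data.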
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
sns.set()
sns.set_context('notebook', font_scale=1.0)
cp = sns.color_palette()
In [2]:
from ggplot import *
In [3]:
ts = pd.read_csv('data/ts.csv')
ts = ts.assign(dt = pd.to_datetime(ts.dt))
ts.head()
Out[3]:
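The file data/ts.csv is not included with this post. As a rough stand-in, a frame of the same shape can be generated. The column names dt, kind and value come from the code in this post. The helper name make_random_timeseries, the kind labels A to D, the start date and the random-walk values are assumptions made up for illustration only.

```python
import numpy as np
import pandas as pd


def make_random_timeseries(n_days=100, kinds=("A", "B", "C", "D"), seed=0):
    """Build a long-format frame with columns dt, kind, value (hypothetical data)."""
    rng = np.random.default_rng(seed)
    dates = pd.date_range("2016-01-01", periods=n_days, freq="D")
    frames = []
    for kind in kinds:
        # A cumulative sum of Gaussian noise gives a simple random walk.
        values = rng.normal(size=n_days).cumsum()
        frames.append(pd.DataFrame({"dt": dates, "kind": kind, "value": values}))
    # Stack the per-kind frames on top of each other: one row per (dt, kind).
    return pd.concat(frames, ignore_index=True)


ts_demo = make_random_timeseries()
```

ts_demo can then be used wherever ts appears, although the resulting plots will of course differ from the originals.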
In [4]:
dfp = ts.pivot(index='dt', columns='kind', values='value')
dfp.head()
Out[4]:
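The pivot call above turns the long table (one row per date and kind) into a wide one (one column per kind). A minimal sketch on a tiny hand-made frame shows the reshaping and the way back with melt. The four rows of data are invented for illustration and are not taken from data/ts.csv.

```python
import pandas as pd

# Tiny hypothetical long-format table with the same columns as ts.
long_df = pd.DataFrame({
    "dt": pd.to_datetime(["2016-01-01", "2016-01-01", "2016-01-02", "2016-01-02"]),
    "kind": ["A", "B", "A", "B"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Long -> wide: one row per date, one column per kind.
wide_df = long_df.pivot(index="dt", columns="kind", values="value")

# Wide -> long again, the shape that aes(x='dt', y='value', color='kind') expects.
back_to_long = (
    wide_df.reset_index()
    .melt(id_vars="dt", var_name="kind", value_name="value")
    .sort_values(["dt", "kind"])
    .reset_index(drop=True)
)
```

The wide form is convenient for plain pandas or matplotlib plotting, while ggplot works from the long form, which is why the plots in this post use ts rather than dfp.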
In [5]:
fig, ax = plt.subplots(1, 1, figsize=(7.5, 5))
g = ggplot(ts, aes(x='dt', y='value', color='kind')) + \
geom_line(size=2.0) + \
xlab('Date') + \
ylab('Value') + \
ggtitle('Random Timeseries')
g
Out[5]:
In [6]:
df = pd.read_csv('data/iris.csv')
df.head()
Out[6]:
In [8]:
g = (ggplot(df, aes(x='petalLength',
                    y='petalWidth',
                    color='species')) +
     xlab('petalLength') +   # optional
     ylab('petalWidth') +    # optional
     geom_point(size=40.0) +
     ggtitle('Petal Width v. Length -- by Species'))
g
Out[8]:
In [15]:
fig, ax = plt.subplots(2, 2, figsize=(10, 10))
g = ggplot(ts, aes(x='dt', y='value', color='kind')) + \
geom_line(size=2.0) + \
facet_wrap(x='kind', ncol=2) + \
ggtitle('Random Timeseries')
g
Out[15]:
In [19]:
fig, ax = plt.subplots(2, 2, figsize=(10, 10))
g = ggplot(ts, aes(x='dt', y='value', color='kind')) + \
geom_line(size=2.0) + \
facet_wrap(y='kind') + \
ggtitle('Random Timeseries')
g
Out[19]:
In [26]:
g = ggplot(df, aes(x='petalLength',
y='petalWidth',
color='species')) + \
facet_grid(y='species') + \
geom_point(size=40.0)
g
Out[26]:
In [23]:
tmp_n = df.shape[0] - df.shape[0] // 2  # integer division: a float count cannot multiply a list in Python 3
df['random_factor'] = np.random.permutation(['A'] * tmp_n + ['B'] * (df.shape[0] - tmp_n))
df.head()
Out[23]:
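The random_factor column splits the rows into two roughly equal groups, A and B, in random order. A small self-contained sketch of the same idea on a toy frame follows. The frame demo, the variable names and the fixed seed are assumptions for illustration, and a seeded NumPy generator is used here in place of np.random.permutation so the result is repeatable.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"x": range(7)})  # toy frame with an odd number of rows

n_rows = len(demo)
n_a = n_rows - n_rows // 2  # floor division keeps the count an integer
n_b = n_rows - n_a          # the remaining rows go to group B

# Shuffle the labels so group membership is unrelated to row order.
rng = np.random.default_rng(42)
demo["random_factor"] = rng.permutation(["A"] * n_a + ["B"] * n_b)
```

With an odd row count, group A receives the extra row.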
In [24]:
g = ggplot(df, aes(x='petalLength',
y='petalWidth',
color='species')) + \
facet_grid(x='random_factor', y='species') + \
geom_point(size=40.0)
g
Out[24]:
In [27]:
g = ggplot(df, aes(x='species',
y='petalWidth',
fill='species')) + \
geom_boxplot() + \
ggtitle('Distribution of Petal Width by Species')
g
Out[27]:
In [28]:
g = ggplot(df, aes(x='petalWidth',
fill='species')) + \
geom_histogram() + \
ylab('Frequency') + \
ggtitle('Distribution of Petal Width by Species')
g
Out[28]:
In [29]:
df = pd.read_csv('data/titanic.csv')
df.head()
Out[29]:
In [30]:
dfg = df.groupby(['survived', 'pclass']).agg({'fare': 'mean'})
dfg
Out[30]:
In [38]:
g = ggplot(df, aes(x='class', y='fare')) + \
geom_bar()
g
Out[38]:
In [39]:
g = ggplot(df, aes(x='class', weight='fare')) + \
geom_bar()
g
Out[39]:
In [40]:
df.groupby(['class', 'survived']).\
agg({'fare': 'mean'}).\
reset_index()
Out[40]:
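The groupby, agg and reset_index chain above produces one row per (class, survived) pair holding the mean fare, which is the table that the bar charts are drawn from. A minimal sketch on a hand-made frame shows the result. The five rows of data are invented for illustration and are not taken from data/titanic.csv.

```python
import pandas as pd

# Hypothetical passengers, using the same column names as the Titanic frame.
toy = pd.DataFrame({
    "class": ["First", "First", "Third", "Third", "Third"],
    "survived": [1, 0, 0, 0, 1],
    "fare": [80.0, 60.0, 8.0, 10.0, 12.0],
})

# Mean fare per (class, survived); reset_index turns the group keys
# back into ordinary columns so they can be mapped to aesthetics.
avg_fare = (
    toy.groupby(["class", "survived"])
    .agg({"fare": "mean"})
    .reset_index()
)
```

Each row of avg_fare corresponds to one bar segment, with fare giving its height.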
In [41]:
g = ggplot(df.groupby(['class', 'survived']).\
agg({'fare': 'mean'}).\
reset_index(), aes(x='class',
fill='factor(survived)',
weight='fare',
y='fare')) + \
geom_bar() + \
ylab('Avg. Fare') + \
xlab('Class') + \
ggtitle('Fare by survival and class')
g
Out[41]:
In [42]:
g = ggplot(df.groupby(['class', 'survived']).\
agg({'fare': 'mean'}).\
reset_index(), aes(x='class',
fill='factor(survived)',
y='fare')) + \
geom_bar() + \
ylab('Avg. Fare') + \
xlab('Class') + \
ggtitle('Fare by survival and class')
g
Out[42]:
In [ ]:
# In R (ggplot2), I believe you'd do something like this:
# ggplot(df, aes(x=factor(survived), y=fare)) +
#     stat_summary_bin(aes(fill=factor(survived)),
#                      fun.y="mean",
#                      geom="bar") +
#     facet_wrap(~class)
# damn, ggplot2 is awesome...