A Casual Talk on Data Visualization (Part 3)

风雪夜归子

Article link: https://blog.csdn.net/u013719780/article/details/52792955

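About the author: 风雪夜归子 (Allen) is a machine learning algorithm engineer who enjoys digging into the latest Machine Learning techniques and is keenly interested in Deep Learning and Artificial Intelligence. The author closely follows the Kaggle data mining competition platform. Readers interested in data, Machine Learning, and Artificial Intelligence are welcome to join the discussion. Personal CSDN blog: http://blog.csdn.net/u013719780?viewmode=contents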

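Data visualization helps us understand data, and it plays an important role in the feature engineering stage of a machine learning project, so it is a tool well worth mastering. This series of posts offers a brief exploration of data visualization. This post uses Python's seaborn library to visualize data.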

In [1]:

%matplotlib inline

# standard
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# I've got style,
# miles and miles
import seaborn as sns
sns.set()
sns.set_context('notebook', font_scale=1.5)
cp = sns.color_palette()

Thing 1: Line Chart (with many lines)

In [2]:

ts = pd.read_csv('data/ts.csv')

# casting to datetime is important for
# ensuring plots "just work"
ts = ts.assign(dt = pd.to_datetime(ts.dt))
ts.head()

Out[2]:

	dt	kind	value
0	2000-01-01	A	1.442521
1	2000-01-02	A	1.981290
2	2000-01-03	A	1.586494
3	2000-01-04	A	1.378969
4	2000-01-05	A	-0.277937

In [3]:

# in matplotlib-land, the notion of a "tidy"
# dataframe matters not
dfp = ts.pivot(index='dt', columns='kind', values='value')
dfp.head()

Out[3]:

kind	A	B	C	D
dt
2000-01-01	1.442521	1.808741	0.437415	0.096980
2000-01-02	1.981290	2.277020	0.706127	-1.523108
2000-01-03	1.586494	3.474392	1.358063	-3.100735
2000-01-04	1.378969	2.906132	0.262223	-2.660599
2000-01-05	-0.277937	3.489553	0.796743	-3.417402
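
As a side note on the tidy versus wide distinction in the two cells above, the following is a minimal sketch of the round trip between the two layouts. It uses a small synthetic frame as a stand-in for `data/ts.csv` (the values and the names `tidy`, `wide`, and `back` are assumptions for illustration, not the post's data).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data/ts.csv: 5 days x 4 series in tidy (long) form.
rng = np.random.default_rng(0)
dates = pd.date_range("2000-01-01", periods=5, freq="D")
tidy = pd.DataFrame({
    "dt": np.tile(dates, 4),
    "kind": np.repeat(list("ABCD"), len(dates)),
    "value": rng.normal(size=20),
})

# Long -> wide: one row per date, one column per series.
wide = tidy.pivot(index="dt", columns="kind", values="value")

# Wide -> long again: melt restores one row per (date, series) pair.
back = (wide.reset_index()
            .melt(id_vars="dt", var_name="kind", value_name="value")
            .sort_values(["kind", "dt"])
            .reset_index(drop=True))
```

The wide frame suits plain matplotlib or `DataFrame.plot()`, while the tidy frame is the shape that `FacetGrid` expects in the cells that follow.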

In [4]:

# note: seaborn 0.9 renamed the FacetGrid 'size' argument to 'height'
g = sns.FacetGrid(ts, hue='kind', height=5, aspect=1.5)
g.map(plt.plot, 'dt', 'value').add_legend()
g.ax.set(xlabel='Date',
         ylabel='Value',
         title='Random Timeseries')
g.fig.autofmt_xdate()

In [5]:

g = sns.FacetGrid(ts, row='kind', hue='kind', height=5, aspect=1.5)
g.map(plt.plot, 'dt', 'value').add_legend()

g.fig.autofmt_xdate()

Thing 2: Scatter

In [6]:

df = pd.read_csv('data/iris.csv')
df.head()

Out[6]:

	petalLength	petalWidth	sepalLength	sepalWidth	species
0	1.4	0.2	5.1	3.5	setosa
1	1.4	0.2	4.9	3.0	setosa
2	1.3	0.2	4.7	3.2	setosa
3	1.5	0.2	4.6	3.1	setosa
4	1.4	0.2	5.0	3.6	setosa

In [7]:

g = sns.FacetGrid(df, hue='species', height=7.5)
g.map(plt.scatter, 'petalLength', 'petalWidth').add_legend()
g.ax.set_title('Petal Width v. Length -- by Species')

Out[7]:

<matplotlib.text.Text at 0x1186b99d0>

Thing 3: Trellising the Above

In [8]:

g = sns.FacetGrid(ts, hue='kind',
                  col='kind', col_wrap=2, height=5)

g.map(plt.plot, 'dt', 'value')
g.fig.autofmt_xdate()
g.fig.suptitle('Random Timeseries', y=1.01)

Out[8]:

<matplotlib.text.Text at 0x11819ead0>

In [9]:

g = sns.FacetGrid(df, col='species', hue='species', height=5)
g.map(plt.scatter, 'petalLength', 'petalWidth')

Out[9]:

<seaborn.axisgrid.FacetGrid at 0x1187474d0>

In [10]:

# integer division keeps tmp_n an int, so it can be used as a list repeat count
tmp_n = df.shape[0] - df.shape[0] // 2

df['random_factor'] = np.random.permutation(['A'] * tmp_n + ['B'] * (df.shape[0] - tmp_n))
df.head()

Out[10]:

	petalLength	petalWidth	sepalLength	sepalWidth	species	random_factor
0	1.4	0.2	5.1	3.5	setosa	A
1	1.4	0.2	4.9	3.0	setosa	A
2	1.3	0.2	4.7	3.2	setosa	B
3	1.5	0.2	4.6	3.1	setosa	A
4	1.4	0.2	5.0	3.6	setosa	B
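
The random split above can also be written as a small reusable helper. This is a minimal sketch, assuming a seeded NumPy generator for reproducibility; the function name `balanced_labels` and the toy frame `demo` are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def balanced_labels(n, labels=("A", "B"), seed=0):
    """Return n labels, split as evenly as possible and shuffled."""
    rng = np.random.default_rng(seed)
    n_first = n - n // 2  # integer division, so the first label gets ceil(n / 2)
    pool = [labels[0]] * n_first + [labels[1]] * (n - n_first)
    return rng.permutation(pool)

# Toy frame with an odd number of rows to show how the remainder is handled.
demo = pd.DataFrame({"x": range(7)})
demo["random_factor"] = balanced_labels(len(demo))
counts = demo["random_factor"].value_counts()
```

Fixing the seed makes the faceted plot reproducible between runs, which the unseeded `np.random.permutation` call does not guarantee.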

In [11]:

g = sns.FacetGrid(df.assign(tmp=df.species + df.random_factor).\
                      sort_values(['species', 'random_factor']),
                  col='species', row='random_factor', hue='tmp', height=5)
g.map(plt.scatter, 'petalLength', 'petalWidth')

Out[11]:

<seaborn.axisgrid.FacetGrid at 0x117ad90d0>

Thing 4: Visualizing Distributions (Boxplot and Histogram)

In [12]:

fig, ax = plt.subplots(1, 1, figsize=(10, 10))

# pass x and y by keyword; newer seaborn releases reject positional data arguments
g = sns.boxplot(x='species', y='petalWidth', data=df, ax=ax)
g.set(title='Distribution of Petal Width by Species')

Out[12]:

[<matplotlib.text.Text at 0x11969c510>]

In [13]:

g = sns.FacetGrid(df, hue='species', height=7.5)

g.map(sns.distplot, 'petalWidth', bins=10,
      kde=False, rug=True).add_legend()

g.set(xlabel='Petal Width',
      ylabel='Frequency',
      title='Distribution of Petal Width by Species')

Out[13]:

<seaborn.axisgrid.FacetGrid at 0x11819e710>

Thing 5: Bar Chart

In [14]:

df = pd.read_csv('data/titanic.csv')
df.head()

Out[14]:

	survived	pclass	sex	age	sibsp	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	38.0	1	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	53.1000	S	First	woman	False	C	Southampton	yes	False
4	0	3	male	35.0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True

In [15]:

dfg = df.groupby(['survived', 'pclass']).agg({'fare': 'mean'})
dfg

Out[15]:

		fare
survived	pclass
0	1	64.684008
	2	19.412328
	3	13.669364
1	1	95.608029
	2	22.055700
	3	13.694887

In [16]:

died = dfg.loc[0, :]
print(died)

survived = dfg.loc[1, :]
print(survived)

             fare
pclass           
1       64.684008
2       19.412328
3       13.669364
             fare
pclass           
1       95.608029
2       22.055700
3       13.694887
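
The grouped table above has a two-level index (`survived`, `pclass`). As a minimal sketch of two ways to slice it, the example below uses a tiny made-up table; the fares in `toy` are hypothetical and chosen only so the means are easy to check by hand.

```python
import pandas as pd

# Hypothetical miniature of the Titanic table (values made up for illustration).
toy = pd.DataFrame({
    "survived": [0, 0, 0, 1, 1, 1],
    "pclass":   [1, 3, 3, 1, 1, 3],
    "fare":     [60.0, 8.0, 10.0, 90.0, 100.0, 12.0],
})

mean_fare = toy.groupby(["survived", "pclass"]).agg({"fare": "mean"})

# Take the cross-section for survived == 0; the result is indexed by pclass.
died = mean_fare.xs(0, level="survived")

# Or move 'survived' into the columns to compare both groups side by side.
wide = mean_fare["fare"].unstack("survived")
```

The unstacked form puts one row per class and one column per survival outcome, which is the same comparison the bar chart in the next cell draws.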

In [17]:

# note: seaborn 0.9 renamed factorplot to catplot and 'size' to 'height'
g = sns.catplot(x='class', y='fare', hue='survived',
                data=df, kind='bar',
                order=['First', 'Second', 'Third'],
                height=7.5, aspect=1.5, ci=None)
g.ax.set_title('Fare by survival and class')

Out[17]:

<matplotlib.text.Text at 0x11a987b10>

In [18]:

# same chart without ci=None, so seaborn draws its default error bars
g = sns.catplot(x='class', y='fare', hue='survived',
                data=df, kind='bar',
                order=['First', 'Second', 'Third'],
                height=7.5, aspect=1.5)
g.ax.set_title('Fare by survival and class')

Out[18]:

<matplotlib.text.Text at 0x11abfaa50>
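
The only difference between the last two cells is `ci=None`. Without it, seaborn draws error bars on each bar, by default a 95% bootstrap confidence interval for the mean. The sketch below shows roughly what such an interval is, using plain NumPy. The helper `bootstrap_ci_mean` and the synthetic `fares` array are assumptions for illustration, and this is not seaborn's own implementation.

```python
import numpy as np

def bootstrap_ci_mean(values, n_boot=1000, level=95, seed=0):
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement: each row of idx is one bootstrap sample.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    half = (100 - level) / 2
    lo, hi = np.percentile(boot_means, [half, 100 - half])
    return lo, hi

# Synthetic, skewed "fares" (made up; not the Titanic data).
fares = np.random.default_rng(1).exponential(scale=30.0, size=200)
lo, hi = bootstrap_ci_mean(fares)
```

Groups with few passengers or widely spread fares give wider intervals, which is why some bars in the second chart can carry much longer error bars than others.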

Copyright notice: all rights reserved. To repost this article, please contact the author and cite the source: http://blog.csdn.net/u013719780?viewmode=contents