A Tour of Data Visualization (Part 5)

风雪夜归子

Category: Data Visualization. Tags: data visualization, feature selection

Article link: https://blog.csdn.net/u013719780/article/details/52795684

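About the author: 风雪夜归子 (Allen) is a machine learning algorithm engineer who enjoys digging into the latest Machine Learning techniques, is deeply interested in Deep Learning and Artificial Intelligence, and closely follows the Kaggle data mining competition platform. Readers interested in data, Machine Learning, and Artificial Intelligence are welcome to join the discussion. Personal CSDN blog: http://blog.csdn.net/u013719780?viewmode=contents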

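Data visualization helps in understanding data and plays an important role in the feature engineering stage of a machine learning project, so it is a tool well worth mastering. This series of posts offers a brief exploration of data visualization. This post uses the Python library Altair to visualize data.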

In [2]:

%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

import seaborn as sns
sns.set()
sns.set_context('notebook', font_scale=1.5)
# Default seaborn palette; reused below as the color range for Altair charts
cp = sns.color_palette()

In [3]:

from altair import *

Thing 1: Line Chart (with many lines)

In [55]:

ts = pd.read_csv('data/ts.csv')

ts = ts.assign(dt = pd.to_datetime(ts.dt))
ts.head()

Out[55]:

	dt	kind	value
0	2000-01-01	A	1.442521
1	2000-01-02	A	1.981290
2	2000-01-03	A	1.586494
3	2000-01-04	A	1.378969
4	2000-01-05	A	-0.277937

In [56]:


dfp = ts.pivot(index='dt', columns='kind', values='value')
dfp.head()

Out[56]:

kind	A	B	C	D
dt
2000-01-01	1.442521	1.808741	0.437415	0.096980
2000-01-02	1.981290	2.277020	0.706127	-1.523108
2000-01-03	1.586494	3.474392	1.358063	-3.100735
2000-01-04	1.378969	2.906132	0.262223	-2.660599
2000-01-05	-0.277937	3.489553	0.796743	-3.417402
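The pivot above turns the long table (one row per date and kind) into a wide one (one column per kind), while the Altair charts below read the long form directly. A minimal sketch of the round trip between the two shapes follows. It uses a small made-up frame in place of data/ts.csv, and the values and the names `long_df`, `wide_df` and `back_to_long` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical stand-in for data/ts.csv: two dates, two kinds
long_df = pd.DataFrame({
    'dt': pd.to_datetime(['2000-01-01', '2000-01-01',
                          '2000-01-02', '2000-01-02']),
    'kind': ['A', 'B', 'A', 'B'],
    'value': [1.0, 2.0, 3.0, 4.0],
})

# Long -> wide: one row per date, one column per kind
wide_df = long_df.pivot(index='dt', columns='kind', values='value')

# Wide -> long again: one row per (dt, kind) pair, as used for color='kind'
back_to_long = (wide_df.reset_index()
                       .melt(id_vars='dt', var_name='kind', value_name='value')
                       .sort_values(['dt', 'kind'])
                       .reset_index(drop=True))
```

The wide form is convenient for quick inspection, while the long form keeps the grouping variable in a column that can be mapped to color, row, or column.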

In [6]:

c = Chart(ts).mark_line().encode(
    x='dt',
    y='value',
    color='kind'
)
c

In [57]:

c = Chart(ts).mark_line().encode(
    x='dt',
    y='value',
    color=Color('kind', scale=Scale(range=cp.as_hex()))
)
c

Thing 2: Scatter

In [7]:

df = pd.read_csv('data/iris.csv')
df.head()

Out[7]:

	petalLength	petalWidth	sepalLength	sepalWidth	species
0	1.4	0.2	5.1	3.5	setosa
1	1.4	0.2	4.9	3.0	setosa
2	1.3	0.2	4.7	3.2	setosa
3	1.5	0.2	4.6	3.1	setosa
4	1.4	0.2	5.0	3.6	setosa

In [8]:

c = Chart(df).mark_point(filled=True).encode(
    x='petalLength',
    y='petalWidth',
    color='species'
)
c

Thing 3: Trellising the Above

In [9]:

c = Chart(ts).mark_line().encode(
    x='dt',
    y='value',
    color='kind',
    column='kind'
)
c.configure_cell(height=200, width=200)

In [10]:

c = Chart(df).mark_point().encode(
    x='petalLength',
    y='petalWidth',
    color='species',
    column=Column('species',
                  title='Petal Width v. Length by Species')
)
c.configure_cell(height=300, width=300)

In [11]:

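# Integer division keeps tmp_n an int so it can be used as a list repeat count
tmp_n = df.shape[0] - df.shape[0] // 2

# Shuffle a roughly half-and-half mix of 'A' and 'B' labels across the rows
df['random_factor'] = np.random.permutation(
    ['A'] * tmp_n + ['B'] * (df.shape[0] - tmp_n)
)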
df.head()

Out[11]:

	petalLength	petalWidth	sepalLength	sepalWidth	species	random_factor
0	1.4	0.2	5.1	3.5	setosa	B
1	1.4	0.2	4.9	3.0	setosa	A
2	1.3	0.2	4.7	3.2	setosa	A
3	1.5	0.2	4.6	3.1	setosa	B
4	1.4	0.2	5.0	3.6	setosa	B
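The cell above splits the rows into two groups of near-equal size and shuffles the labels, giving a second factor to facet on. A self-contained sketch of the same idea follows, assuming Python 3, where `/` returns a float and `//` is needed for a repeat count. The helper name `balanced_labels` and its `seed` argument are hypothetical, added only to make the result reproducible.

```python
import numpy as np

def balanced_labels(n_rows, labels=('A', 'B'), seed=None):
    """Return n_rows shuffled labels, split as evenly as possible."""
    rng = np.random.RandomState(seed)
    n_first = n_rows - n_rows // 2          # the larger half when n_rows is odd
    pool = [labels[0]] * n_first + [labels[1]] * (n_rows - n_first)
    return rng.permutation(pool)

# 150 rows, as in the iris data: 75 'A' and 75 'B' in random order
random_factor = balanced_labels(150, seed=0)
```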

In [12]:

c = Chart(df).mark_point().encode(
    x='petalLength',
    y='petalWidth',
    color='species',
    column=Column('species',
                  title='Petal Width v. Length by Species'),
    row='random_factor'
)
c.configure_cell(height=200, width=200)

Thing 4: Visualizing Distributions (Boxplot and Histogram)

In [49]:

# please note: this code is super speculative -- I'm
# assuming there's a better way to do this and I just
# don't know it

c = Chart(df).mark_point(opacity=.5).encode(
    x='species',
    y='petalWidth'
)

# One tick layer per quartile; 'red' is a CSS color name, which is what
# Vega-Lite expects (matplotlib shorthand such as 'r' is not one)
c25 = Chart(df).mark_tick(tickThickness=3.0,
                          tickSize=20.0,
                          color='red').encode(
    x='species',
    y='q1(petalWidth)'
)
c50 = Chart(df).mark_tick(tickThickness=3.0,
                          tickSize=20.0,
                          color='red').encode(
    x='species',
    y='median(petalWidth)'
)
c75 = Chart(df).mark_tick(tickThickness=3.0,
                          tickSize=20.0,
                          color='red').encode(
    x='species',
    y='q3(petalWidth)'
)

LayeredChart(data=df, layers=[c, c25, c50, c75])
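The three tick layers mark the first quartile, median, and third quartile of petalWidth for each species. To sanity-check what those aggregates should show, the same statistics can be computed with pandas. This is only a sketch on a tiny made-up frame, not the iris data. It uses the pandas default (linear interpolation) for quantiles, and the convention used by the plotting library is not checked here.

```python
import pandas as pd

# Hypothetical toy data standing in for the iris frame
toy = pd.DataFrame({
    'species': ['setosa'] * 5 + ['virginica'] * 5,
    'petalWidth': [0.1, 0.2, 0.2, 0.3, 0.4,
                   1.8, 2.0, 2.1, 2.3, 2.5],
})

# One row per species, one column per quantile (0.25, 0.5, 0.75)
quartiles = (toy.groupby('species')['petalWidth']
                .quantile([0.25, 0.5, 0.75])
                .unstack())
quartiles.columns = ['q1', 'median', 'q3']
```

Comparing a table like `quartiles` with the tick positions is a quick way to confirm that the layered chart encodes what was intended.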

In [50]:

c = Chart(df).mark_bar(opacity=.75).encode(
    x=X('petalWidth', bin=Bin(maxbins=30)),
    y='count(*)',
    color=Color('species', scale=Scale(range=cp.as_hex()))
)
c
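The histogram above bins petalWidth and counts rows per bin, colored by species. The counting step can be sketched with NumPy as below. The sketch assumes a fixed set of equal-width bins and a made-up toy frame, so it illustrates the idea rather than reproducing how `Bin(maxbins=30)` chooses its bin boundaries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the iris frame
toy = pd.DataFrame({
    'species': ['setosa', 'setosa', 'setosa', 'virginica', 'virginica'],
    'petalWidth': [0.1, 0.2, 0.4, 1.8, 2.5],
})

# Shared bin edges so every species is counted on the same grid
edges = np.linspace(0.0, 2.5, 6)   # 5 equal-width bins: 0-0.5, 0.5-1.0, ...

# Per-species counts in each bin (the last bin includes its right edge)
counts = {
    name: np.histogram(group['petalWidth'], bins=edges)[0]
    for name, group in toy.groupby('species')
}
```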

Thing 5: Bar Chart

In [51]:

df = pd.read_csv('data/titanic.csv')
df.head()

Out[51]:

	survived	pclass	sex	age	sibsp	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	38.0	1	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	53.1000	S	First	woman	False	C	Southampton	yes	False
4	0	3	male	35.0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True

In [52]:

dfg = df.groupby(['survived', 'pclass']).agg({'fare': 'mean'})
dfg

Out[52]:

		fare
survived	pclass
0	1	64.684008
	2	19.412328
	3	13.669364
1	1	95.608029
	2	22.055700
	3	13.694887

In [53]:

died = dfg.loc[0, :]
survived = dfg.loc[1, :]
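Grouping by two keys produces a frame with a two-level index, and `.loc[0, :]` selects one value of the outer level. A small sketch of that selection follows, together with `unstack` as an alternative that places both outcomes side by side. The toy frame and the names `mean_fare` and `side_by_side` are illustrative assumptions, not the Titanic data.

```python
import pandas as pd

# Hypothetical toy data with the same three columns used above
toy = pd.DataFrame({
    'survived': [0, 0, 0, 1, 1, 1],
    'pclass':   [1, 1, 3, 1, 3, 3],
    'fare':     [60.0, 70.0, 10.0, 90.0, 12.0, 14.0],
})

mean_fare = toy.groupby(['survived', 'pclass']).agg({'fare': 'mean'})

# Selecting the outer level drops it and leaves a frame indexed by pclass
died = mean_fare.loc[0, :]

# unstack() lays both outcomes side by side: rows = pclass, columns = survived
side_by_side = mean_fare['fare'].unstack('survived')
```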

In [54]:

c = Chart(df).mark_bar().encode(
    x='survived:N',
    y='mean(fare)',
    color='survived:N',
    column='class')
c.configure(
    facet=FacetConfig(cell=CellConfig(strokeWidth=0, height=250))
)