A Tour of Data Visualization (Part 5)

Notice: All rights reserved. To repost, please contact the author and credit the source: http://blog.csdn.net/u013719780?viewmode=contents


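About the author: 风雪夜归子 (Allen) is a machine learning engineer who enjoys digging into the latest Machine Learning techniques, is keenly interested in Deep Learning and Artificial Intelligence, and regularly follows the Kaggle data mining competition platform. Readers interested in data, Machine Learning, and Artificial Intelligence are welcome to get in touch and discuss. Personal CSDN blog: http://blog.csdn.net/u013719780?viewmode=contents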



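Data visualization helps in understanding data, and it also plays an important role in the feature engineering stage of a machine learning project, so it is a skill well worth mastering. This series of posts offers a brief exploration of data visualization. This post uses the Python library Altair to visualize data.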



In [2]:
%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

import seaborn as sns
sns.set()
sns.set_context('notebook', font_scale=1.5)
cp = sns.color_palette()
In [3]:
from altair import *


Thing 1: Line Chart (with many lines)


In [55]:
ts = pd.read_csv('data/ts.csv')

ts = ts.assign(dt = pd.to_datetime(ts.dt))
ts.head()
Out[55]:
  dt kind value
0 2000-01-01 A 1.442521
1 2000-01-02 A 1.981290
2 2000-01-03 A 1.586494
3 2000-01-04 A 1.378969
4 2000-01-05 A -0.277937
In [56]:

dfp = ts.pivot(index='dt', columns='kind', values='value')
dfp.head()
Out[56]:
kind A B C D
dt        
2000-01-01 1.442521 1.808741 0.437415 0.096980
2000-01-02 1.981290 2.277020 0.706127 -1.523108
2000-01-03 1.586494 3.474392 1.358063 -3.100735
2000-01-04 1.378969 2.906132 0.262223 -2.660599
2000-01-05 -0.277937 3.489553 0.796743 -3.417402
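
Altair works most naturally with long-form ("tidy") data such as `ts`, where each row is one observation, so the charts in this post are built from `ts` and not from the pivoted `dfp`. If only a wide table like `dfp` were available, it could be turned back into long form with `melt`. The following is a minimal sketch on a small made-up frame; the name `wide` and its values are illustrative assumptions, not data from ts.csv.

```python
import pandas as pd

# Small made-up wide table shaped like dfp (values are illustrative only)
wide = pd.DataFrame(
    {'A': [1.0, 2.0], 'B': [3.0, 4.0]},
    index=pd.to_datetime(['2000-01-01', '2000-01-02']),
)
wide.index.name = 'dt'
wide.columns.name = 'kind'

# Move the date index back into a column, then stack the kind columns into rows
long_form = wide.reset_index().melt(id_vars='dt',
                                    var_name='kind',
                                    value_name='value')
print(long_form)
```

The result has the same three columns as `ts` (`dt`, `kind`, `value`), which is the shape the `x`, `y`, and `color` encodings expect.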
In [6]:
c = Chart(ts).mark_line().encode(
    x='dt',
    y='value',
    color='kind'
)
c

In [57]:
c = Chart(ts).mark_line().encode(
    x='dt',
    y='value',
    color=Color('kind', scale=Scale(range=cp.as_hex()))
)
c


Thing 2: Scatter


In [7]:
df = pd.read_csv('data/iris.csv')
df.head()
Out[7]:
  petalLength petalWidth sepalLength sepalWidth species
0 1.4 0.2 5.1 3.5 setosa
1 1.4 0.2 4.9 3.0 setosa
2 1.3 0.2 4.7 3.2 setosa
3 1.5 0.2 4.6 3.1 setosa
4 1.4 0.2 5.0 3.6 setosa
In [8]:
c = Chart(df).mark_point(filled=True).encode(
    x='petalLength',
    y='petalWidth',
    color='species'
)
c


Thing 3: Trellising the Above


In [9]:
c = Chart(ts).mark_line().encode(
    x='dt',
    y='value',
    color='kind',
    column='kind'
)
c.configure_cell(height=200, width=200)

In [10]:
c = Chart(df).mark_point().encode(
    x='petalLength',
    y='petalWidth',
    color='species',
    column=Column('species',
                  title='Petal Width v. Length by Species')
)
c.configure_cell(height=300, width=300)

In [11]:
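# Integer division keeps tmp_n an int, so it can be used to repeat a list
tmp_n = df.shape[0] - df.shape[0] // 2

# Randomly assign each row to group 'A' or 'B' (about half each)
df['random_factor'] = np.random.permutation(['A'] * tmp_n +
                                            ['B'] * (df.shape[0] - tmp_n))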
df.head()
Out[11]:
  petalLength petalWidth sepalLength sepalWidth species random_factor
0 1.4 0.2 5.1 3.5 setosa B
1 1.4 0.2 4.9 3.0 setosa A
2 1.3 0.2 4.7 3.2 setosa A
3 1.5 0.2 4.6 3.1 setosa B
4 1.4 0.2 5.0 3.6 setosa B
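
The cell above splits the rows into two roughly equal random groups. The same idea can be wrapped in a small helper, sketched below. The function name `balanced_labels` and the `seed` parameter are hypothetical and not part of the original notebook; the seed is only there to make the shuffle reproducible.

```python
import numpy as np

def balanced_labels(n, seed=None):
    """Return n shuffled labels, split as evenly as possible between 'A' and 'B'."""
    n_a = n - n // 2                      # 'A' gets the extra label when n is odd
    labels = ['A'] * n_a + ['B'] * (n - n_a)
    rng = np.random.RandomState(seed)     # local generator, leaves global state alone
    return rng.permutation(labels)

labels = balanced_labels(5, seed=0)
print(labels)
```

With a helper like this, the column could be assigned as `df['random_factor'] = balanced_labels(len(df))`.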
In [12]:
c = Chart(df).mark_point().encode(
    x='petalLength',
    y='petalWidth',
    color='species',
    column=Column('species',
                  title='Petal Width v. Length by Species'),
    row='random_factor'
)
c.configure_cell(height=200, width=200)


Thing 4: Visualizing Distributions (Boxplot and Histogram)


In [49]:
# please note: this code is super speculative -- I'm
# assuming there's a better way to do this and I just
# don't know it

c = Chart(df).mark_point(opacity=.5).encode(
    x='species',
    y='petalWidth'
)

# 'red' is a CSS color name; matplotlib-style shorthand such as 'r'
# is not a CSS color
c25 = Chart(df).mark_tick(tickThickness=3.0,
                          tickSize=20.0,
                          color='red').encode(
    x='species',
    y='q1(petalWidth)'
)
c50 = Chart(df).mark_tick(tickThickness=3.0,
                          tickSize=20.0,
                          color='red').encode(
    x='species',
    y='median(petalWidth)'
)
c75 = Chart(df).mark_tick(tickThickness=3.0,
                          tickSize=20.0,
                          color='red').encode(
    x='species',
    y='q3(petalWidth)'
)

LayeredChart(data=df, layers=[c, c25, c50, c75])
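
The three tick layers mark the first quartile, median, and third quartile of each group. One way to sanity-check where those ticks should land is to compute the same statistics in pandas. The sketch below uses a made-up frame named `sample` (an assumption, not the iris data) so it runs on its own; exact quartile values can differ slightly between libraries because of different interpolation rules.

```python
import pandas as pd

# Made-up sample (not the iris data) to keep the example self-contained
sample = pd.DataFrame({
    'species': ['a'] * 5 + ['b'] * 5,
    'petalWidth': [1, 2, 3, 4, 5, 10, 20, 30, 40, 50],
})

# One row per species, one column per quantile
quartiles = (sample.groupby('species')['petalWidth']
                   .quantile([0.25, 0.5, 0.75])
                   .unstack())
quartiles.columns = ['q1', 'median', 'q3']
print(quartiles)
```

Replacing `sample` with the iris `df` would give the per-species values to compare against the chart.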

In [50]:
c = Chart(df).mark_bar(opacity=.75).encode(
    x=X('petalWidth', bin=Bin(maxbins=30)),
    y='count(*)',
    color=Color('species', scale=Scale(range=cp.as_hex()))
)
c


Thing 5: Bar Chart


In [51]:
df = pd.read_csv('data/titanic.csv')
df.head()
Out[51]:
  survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
In [52]:
dfg = df.groupby(['survived', 'pclass']).agg({'fare': 'mean'})
dfg
Out[52]:
                      fare
survived pclass
0        1       64.684008
         2       19.412328
         3       13.669364
1        1       95.608029
         2       22.055700
         3       13.694887
In [53]:
died = dfg.loc[0, :]
survived = dfg.loc[1, :]
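
Slicing the grouped result by `survived` gives one small table per outcome. An alternative that keeps both outcomes side by side is to `unstack` one level of the index. This is a minimal sketch on a made-up frame named `toy` (an assumption, not the Titanic data):

```python
import pandas as pd

# Made-up rows (not the Titanic data) with the same three columns
toy = pd.DataFrame({
    'survived': [0, 0, 1, 1, 0, 1],
    'pclass':   [1, 1, 1, 1, 3, 3],
    'fare':     [60.0, 70.0, 90.0, 100.0, 8.0, 12.0],
})

# Mean fare per (survived, pclass), then spread survived across the columns
mean_fare = toy.groupby(['survived', 'pclass'])['fare'].mean()
table = mean_fare.unstack('survived')
print(table)
```

Each row of `table` is a passenger class and each column an outcome, which makes the fare gap between the two groups easy to read before plotting it.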
In [54]:
c = Chart(df).mark_bar().encode(
    x='survived:N',
    y='mean(fare)',
    color='survived:N',
    column='class')
c.configure(
    facet=FacetConfig(cell=CellConfig(strokeWidth=0, height=250))
)

