Notes on Data Visualization (Part 4)

风雪夜归子

Article link: https://blog.csdn.net/u013719780/article/details/52793001

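About the author: 风雪夜归子 (Allen) is a machine learning algorithm engineer who enjoys digging into Machine Learning techniques, is interested in Deep Learning and Artificial Intelligence, and follows the Kaggle data mining competition platform closely. Readers interested in data, Machine Learning, and Artificial Intelligence are welcome to join the discussion. Personal CSDN blog: http://blog.csdn.net/u013719780?viewmode=contents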

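Data visualization helps in understanding data and plays an important role in the feature engineering stage of a machine learning project, so it is a tool well worth mastering. This series of posts takes a brief look at data visualization. This post uses Python's ggplot package to visualize the data.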

In [1]:

%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

import seaborn as sns
sns.set()
sns.set_context('notebook', font_scale=1.0)
cp = sns.color_palette()

In [2]:

from ggplot import *

Thing 1: Line Chart (with many lines)

In [3]:

ts = pd.read_csv('data/ts.csv')

ts = ts.assign(dt = pd.to_datetime(ts.dt))
ts.head()

Out[3]:

	dt	kind	value
0	2000-01-01	A	1.442521
1	2000-01-02	A	1.981290
2	2000-01-03	A	1.586494
3	2000-01-04	A	1.378969
4	2000-01-05	A	-0.277937

In [4]:


dfp = ts.pivot(index='dt', columns='kind', values='value')
dfp.head()

Out[4]:

kind	A	B	C	D
dt
2000-01-01	1.442521	1.808741	0.437415	0.096980
2000-01-02	1.981290	2.277020	0.706127	-1.523108
2000-01-03	1.586494	3.474392	1.358063	-3.100735
2000-01-04	1.378969	2.906132	0.262223	-2.660599
2000-01-05	-0.277937	3.489553	0.796743	-3.417402
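
The file `data/ts.csv` is not distributed with this post, so the following is a minimal sketch that builds a stand-in frame with the same three columns (`dt`, `kind`, `value`) and shows the long-to-wide reshaping that `pivot` performs. The random-walk values and the helper name `make_long_ts` are assumptions for illustration, not the original data.

```python
import numpy as np
import pandas as pd


def make_long_ts(n_days=5, kinds=("A", "B", "C", "D"), seed=0):
    """Build a hypothetical long-format frame: one row per (dt, kind)."""
    rng = np.random.RandomState(seed)
    dates = pd.date_range("2000-01-01", periods=n_days, freq="D")
    frames = []
    for kind in kinds:
        # A random walk stands in for the unknown contents of ts.csv.
        values = rng.randn(n_days).cumsum()
        frames.append(pd.DataFrame({"dt": dates, "kind": kind, "value": values}))
    return pd.concat(frames, ignore_index=True)


ts_demo = make_long_ts()

# Long format: 5 days x 4 kinds = 20 rows (the layout passed to ggplot above).
# Wide format: one row per date and one column per kind.
wide_demo = ts_demo.pivot(index="dt", columns="kind", values="value")

# melt() is the inverse operation and recovers the long layout.
long_again = wide_demo.reset_index().melt(
    id_vars="dt", var_name="kind", value_name="value"
)
```

The long layout is the one used for the ggplot calls in this post, because `color='kind'` needs `kind` to be a column; the wide layout is more convenient for calling `DataFrame.plot` directly.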

In [5]:

fig, ax = plt.subplots(1, 1, figsize=(7.5, 5))

g = ggplot(ts, aes(x='dt', y='value', color='kind')) + \
        geom_line(size=2.0) + \
        xlab('Date') + \
        ylab('Value') + \
        ggtitle('Random Timeseries')
g

Out[5]:

<ggplot: (291631493)>

Thing 2: Scatter

In [6]:

df = pd.read_csv('data/iris.csv')
df.head()

Out[6]:

	petalLength	petalWidth	sepalLength	sepalWidth	species
0	1.4	0.2	5.1	3.5	setosa
1	1.4	0.2	4.9	3.0	setosa
2	1.3	0.2	4.7	3.2	setosa
3	1.5	0.2	4.6	3.1	setosa
4	1.4	0.2	5.0	3.6	setosa

In [8]:

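# A comment after a backslash continuation is a syntax error,
# so the expression is wrapped in parentheses instead.
g = (ggplot(df, aes(x='petalLength',
                    y='petalWidth',
                    color='species'))
     + xlab('petalLength')   # can be omitted
     + ylab('petalWidth')    # can be omitted
     + geom_point(size=40.0)
     + ggtitle('Petal Width v. Length -- by Species'))
g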

Out[8]:

<ggplot: (290147369)>

Thing 3: Trellising the Above

In [15]:

fig, ax = plt.subplots(2, 2, figsize=(10, 10))

g = ggplot(ts, aes(x='dt', y='value', color='kind')) + \
        geom_line(size=2.0) + \
        facet_wrap(x='kind', ncol=2) + \
        ggtitle('Random Timeseries')
g

Out[15]:

<ggplot: (296670837)>

In [19]:

fig, ax = plt.subplots(2, 2, figsize=(10, 10))

g = ggplot(ts, aes(x='dt', y='value', color='kind')) + \
        geom_line(size=2.0) + \
        facet_wrap(y='kind') + \
        ggtitle('Random Timeseries')
g

Out[19]:

<ggplot: (302569745)>
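
Faceting splits one plot into a panel per level of a variable. As a rough cross-check that does not depend on the `ggplot` package, the sketch below draws one panel per `kind` with plain matplotlib. The frame `ts_demo` is synthetic stand-in data, and the 2 x 2 layout is an assumption that mirrors `ncol=2` with four kinds.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for ts: four random walks in long format.
rng = np.random.RandomState(0)
dates = pd.date_range("2000-01-01", periods=30, freq="D")
ts_demo = pd.concat(
    [pd.DataFrame({"dt": dates, "kind": k, "value": rng.randn(30).cumsum()})
     for k in "ABCD"],
    ignore_index=True,
)

kinds = sorted(ts_demo["kind"].unique())
fig, axes = plt.subplots(2, 2, figsize=(10, 10), sharex=True, sharey=True)

# One panel per level of `kind`, similar in spirit to facet_wrap with ncol=2.
for ax, kind in zip(axes.ravel(), kinds):
    subset = ts_demo[ts_demo["kind"] == kind]
    ax.plot(subset["dt"], subset["value"], linewidth=2.0)
    ax.set_title(kind)

fig.suptitle("Random Timeseries")
fig.autofmt_xdate()
```

Inside a notebook that already runs `%matplotlib inline`, the `matplotlib.use("Agg")` line can be dropped.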

In [26]:

g = ggplot(df, aes(x='petalLength',
                   y='petalWidth',
                   color='species')) + \
        facet_grid(y='species') + \
        geom_point(size=40.0)
g

Out[26]:

<ggplot: (304777329)>

In [23]:

# Use integer division so tmp_n stays an int under Python 3 as well.
tmp_n = df.shape[0] - df.shape[0] // 2

df['random_factor'] = np.random.permutation(['A'] * tmp_n + ['B'] * (df.shape[0] - tmp_n))
df.head()

Out[23]:

	petalLength	petalWidth	sepalLength	sepalWidth	species	random_factor
0	1.4	0.2	5.1	3.5	setosa	B
1	1.4	0.2	4.9	3.0	setosa	B
2	1.3	0.2	4.7	3.2	setosa	B
3	1.5	0.2	4.6	3.1	setosa	A
4	1.4	0.2	5.0	3.6	setosa	A
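
The cell above attaches a random two-level label to each row so that there is a second variable to facet on. A reproducible variant might look like the sketch below; the helper name `add_random_factor` and the fixed seed are assumptions added for illustration.

```python
import numpy as np
import pandas as pd


def add_random_factor(frame, seed=42):
    """Return a copy of `frame` with a shuffled, roughly balanced 'A'/'B' column."""
    n = len(frame)
    n_a = n - n // 2  # integer division keeps the count usable for list repetition
    labels = ["A"] * n_a + ["B"] * (n - n_a)
    rng = np.random.RandomState(seed)
    out = frame.copy()
    out["random_factor"] = rng.permutation(labels)
    return out


demo = add_random_factor(pd.DataFrame({"x": range(7)}))
counts = demo["random_factor"].value_counts().to_dict()
```

With an odd number of rows the extra row goes to level `A`, which matches the arithmetic in the original cell.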

In [24]:

g = ggplot(df, aes(x='petalLength',
                   y='petalWidth',
                   color='species')) + \
        facet_grid(x='random_factor', y='species') + \
        geom_point(size=40.0)
g

Out[24]:

<ggplot: (303206949)>

Thing 4: Visualizing Distributions (Boxplot and Histogram)

In [27]:

g = ggplot(df, aes(x='species',
                   y='petalWidth',
                   fill='species')) + \
        geom_boxplot() + \
        ggtitle('Distribution of Petal Width by Species')
g

Out[27]:

<ggplot: (305209657)>
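
A box plot summarizes each group by its quartiles. The sketch below computes those numbers by hand with pandas, assuming the common Tukey convention in which whiskers extend to the most extreme points within 1.5 x IQR of the box. Whether `geom_boxplot` in the Python `ggplot` package uses exactly this rule is not confirmed here, and the small frame is toy data rather than the iris file.

```python
import pandas as pd

# Toy data: group "b" contains one obvious outlier (30.0).
toy = pd.DataFrame({
    "species": ["a"] * 5 + ["b"] * 5,
    "petalWidth": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 11.0, 12.0, 13.0, 30.0],
})


def box_stats(values):
    """Quartiles, Tukey whiskers, and outlier count for one group."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    inside = values[(values >= q1 - 1.5 * iqr) & (values <= q3 + 1.5 * iqr)]
    return pd.Series({
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
        "n_outliers": len(values) - len(inside),
    })


# One row of statistics per species.
stats = pd.DataFrame({
    name: box_stats(group)
    for name, group in toy.groupby("species")["petalWidth"]
}).T
```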

In [28]:

g = ggplot(df, aes(x='petalWidth',
                   fill='species')) + \
        geom_histogram() + \
        ylab('Frequency') + \
        ggtitle('Distribution of Petal Width by Species')
g

Out[28]:

<ggplot: (305212341)>
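
A histogram counts observations per bin, and filling by species amounts to counting each group over the same bin edges. The sketch below does this with NumPy on toy numbers; the choice of 4 equal-width bins is an assumption, since the default binning of `geom_histogram` is not stated in this post.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "species": ["a", "a", "a", "b", "b", "b"],
    "petalWidth": [0.1, 0.2, 0.4, 1.0, 1.3, 2.0],
})

# Shared edges so that counts from different groups line up bin by bin.
# Bins are [0, 0.5), [0.5, 1.0), [1.0, 1.5), [1.5, 2.0] (last bin is closed).
edges = np.linspace(0.0, 2.0, 5)

counts = {
    name: np.histogram(group["petalWidth"], bins=edges)[0]
    for name, group in toy.groupby("species")
}
```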

Thing 5: Bar Chart

In [29]:

df = pd.read_csv('data/titanic.csv')
df.head()

Out[29]:

	survived	pclass	sex	age	sibsp	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	38.0	1	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	53.1000	S	First	woman	False	C	Southampton	yes	False
4	0	3	male	35.0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True

In [30]:

dfg = df.groupby(['survived', 'pclass']).agg({'fare': 'mean'})
dfg

Out[30]:

		fare
survived	pclass
0	1	64.684008
	2	19.412328
	3	13.669364
1	1	95.608029
	2	22.055700
	3	13.694887

In [38]:

g = ggplot(df, aes(x='class', y='fare')) + \
        geom_bar()
g

Out[38]:

<ggplot: (306316705)>

In [39]:

g = ggplot(df, aes(x='class', weight='fare')) + \
        geom_bar()
g

Out[39]:

<ggplot: (306376773)>
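
The bar height can represent different quantities depending on the aesthetics supplied. Assuming `weight` behaves as it does in R's ggplot2, where each row contributes its weight instead of 1 to the bar, the heights correspond to a per-class sum rather than a row count. The pandas sketch below shows these quantities on toy rows (not the Titanic file) so they can be compared directly.

```python
import pandas as pd

toy = pd.DataFrame({
    "class": ["First", "First", "Second", "Third", "Third", "Third"],
    "fare": [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})

row_counts = toy.groupby("class").size()          # unweighted count per class
fare_sums = toy.groupby("class")["fare"].sum()    # assumed height with weight='fare'
fare_means = toy.groupby("class")["fare"].mean()  # what an average-fare chart needs
```

Neither a count nor a sum is an average, which is why the later cells aggregate with `groupby(...).agg({'fare': 'mean'})` before plotting.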

In [40]:

df.groupby(['class', 'survived']).\
               agg({'fare': 'mean'}).\
               reset_index()

Out[40]:

	class	survived	fare
0	First	0	64.684008
1	First	1	95.608029
2	Second	0	19.412328
3	Second	1	22.055700
4	Third	0	13.669364
5	Third	1	13.694887
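
The aggregated table above is in long format, with one row per class and survival outcome. For reading the numbers directly, a wide layout can be easier. The sketch below reshapes the values shown in Out[40]; the `difference` column is a hypothetical addition for illustration.

```python
import pandas as pd

# Values copied from Out[40] above.
avg_fare = pd.DataFrame({
    "class": ["First", "First", "Second", "Second", "Third", "Third"],
    "survived": [0, 1, 0, 1, 0, 1],
    "fare": [64.684008, 95.608029, 19.412328, 22.055700, 13.669364, 13.694887],
})

# One row per class, one column per survival outcome.
wide = avg_fare.pivot(index="class", columns="survived", values="fare")

# Average fare of survivors minus average fare of non-survivors.
wide["difference"] = wide[1] - wide[0]
```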

In [41]:

g = ggplot(df.groupby(['class', 'survived']).\
               agg({'fare': 'mean'}).\
               reset_index(), aes(x='class',
                                  fill='factor(survived)',
                                  weight='fare',
                                  y='fare')) + \
        geom_bar() + \
        ylab('Avg. Fare') + \
        xlab('Class') + \
        ggtitle('Fare by survival and class') 
g

/Applications/anaconda/lib/python2.7/site-packages/ggplot/ggplot.py:602: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
  fill_levels = self.data[[fillcol_raw, fillcol]].sort(fillcol_raw)[fillcol].unique()

Out[41]:

<ggplot: (306505389)>

In [42]:

g = ggplot(df.groupby(['class', 'survived']).\
               agg({'fare': 'mean'}).\
               reset_index(), aes(x='class',
                                  fill='factor(survived)',
                                  y='fare')) + \
        geom_bar() + \
        ylab('Avg. Fare') + \
        xlab('Class') + \
        ggtitle('Fare by survival and class') 
g

Out[42]:

<ggplot: (306638529)>

In [ ]:

# # in R, I believe you'd do something like this:

ggplot(df, aes(x=factor(survived), y=fare)) +
    stat_summary_bin(aes(fill=factor(survived)),
                     fun.y="mean",
                     geom="bar") +
    facet_wrap(~class)
    
# # damn ggplot2 is awesome...