Visualization, Part 1 (Simple Charts)

Latest recommended article published on 2022-08-15 15:32:22

wei_liao


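Data Visualization for EDA

I did not expect to put off visualization for this long. Some Python packages are really powerful, but they are also complex and fairly hard to learn. So I plan to approach visualization from the structure we meet most often, the DataFrame.

I. Visualization with pandas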

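pandas plotting covers simple visualizations. You can either pass the chart type through kind or append the chart type after plot., then pass whatever parameters that chart needs. For a DataFrame, a scatter plot needs x and y. A box plot called on a DataFrame describes all of the continuous (numeric) columns; to split by a categorical variable first, use groupby or pass by=col. A line plot is the default for plot and is typically drawn from a Series. A bar chart draws a bar for every row and every column of the DataFrame, with the height equal to the cell value, which is not very convenient for everyday use. Histograms are also simple: the main parameters are the orientation, the ticks, and bins, all of which are easy to understand.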
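As a minimal sketch of the two calling styles described above, the following draws the same bar chart once through kind and once through the plot. accessor. The sample frame `demo` and the Agg backend (chosen so the snippet runs without a display) are assumptions for illustration and do not come from the original post.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window needed
import pandas as pd

# Hypothetical 3x2 sample frame
demo = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

ax_kind = demo.plot(kind="bar")   # chart type passed through kind
ax_accessor = demo.plot.bar()     # chart type appended after plot.

# Each call draws one bar per cell (3 rows x 2 columns) on its own axes
n_kind = len(ax_kind.patches)
n_accessor = len(ax_accessor.patches)
print(n_kind, n_accessor)
```

Both forms go through the same plotting machinery, so the choice between them is mostly a matter of taste; the accessor form tends to work better with editor auto-completion.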

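Overall, pandas plotting suits quick, simple work that does not need elaborate annotation or color schemes. It is mainly for analyzing single variables and fits low-dimensional data best.

It is built on matplotlib.

Main parameters (for details, see help(pd.Series.plot)):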

Methods defined here:

| __call__(self, kind='line', ax=None, figsize=None, use_index=True, title=None, grid=None, legend=False, style=None, logx=False, logy=False, loglog=False, xticks=None, yticks=None, xlim=None, ylim=None, rot=None, fontsize=None, colormap=None, table=False, yerr=None, xerr=None, label=None, secondary_y=False, **kwds)

| Make plots of Series using matplotlib / pylab.

Parameters

| ----------

| data : Series

| kind : str

| - 'line' : line plot (default)

| - 'bar' : vertical bar plot

| - 'barh' : horizontal bar plot

| - 'hist' : histogram

| - 'box' : boxplot

| - 'kde' : Kernel Density Estimation plot

| - 'density' : same as 'kde'

| - 'area' : area plot

| - 'pie' : pie plot

| ax : matplotlib axes object

| If not passed, uses gca()

| figsize : a tuple (width, height) in inches

| use_index : boolean, default True

| Use index as ticks for x axis

| title : string or list

| Title to use for the plot. If a string is passed, print the string at

| the top of the figure. If a list is passed and `subplots` is True,

| print each item in the list above the corresponding subplot.

| grid : boolean, default None (matlab style default)

| Axis grid lines

| legend : False/True/'reverse'

| Place legend on axis subplots

| style : list or dict

| matplotlib line style per column

| xticks : sequence

| Values to use for the xticks

| yticks : sequence

| Values to use for the yticks

| xlim : 2-tuple/list

| ylim : 2-tuple/list

| fontsize : int, default None

| Font size for xticks and yticks

| colormap : str or matplotlib colormap object, default None

| Colormap to select colors from. If string, load colormap with that name

| from matplotlib.

| colorbar : boolean, optional

| If True, plot colorbar (only relevant for 'scatter' and 'hexbin' plots)

| position : float

| Specify relative alignments for bar plot layout.

| From 0 (left/bottom-end) to 1 (right/top-end). Default is 0.5 (center)

| table : boolean, Series or DataFrame, default False

| If True, draw a table using the data in the DataFrame and the data will

| be transposed to meet matplotlib's default layout.

| If a Series or DataFrame is passed, use passed data to draw a table.

| label : label argument to provide to plot

| mark_right : boolean, default True

| When using a secondary_y axis, automatically mark the column

| labels with "(right)" in the legend

| kwds : keywords

| Options to pass to matplotlib plotting method

import matplotlib.pyplot as plt

# Note: import matplotlib.pyplot

import matplotlib

matplotlib.style.use('ggplot')

%matplotlib inline

# Note: use the ggplot style and draw figures inline in the Jupyter notebook

import pandas as pd

import numpy as np

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))

ts.plot()

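# Note: create a pandas Series of 1000 random draws from the standard normal distribution, indexed by 1000 consecutive dates starting at 2000-01-01,

# then draw it with the default plot.

# The curve looks very jagged because adjacent values are independent random draws.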

import pandas as pd

import numpy as np

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))

ts = ts.cumsum()

ts.plot()

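# Note: ts = ts.cumsum() is added here; it returns the running (cumulative) sum. In time series analysis the cumulative curve is often plotted to see the trend.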
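To make the effect of cumsum() concrete, here is a tiny sketch on a hand-written Series. The values in `steps` are made up for illustration.

```python
import pandas as pd

steps = pd.Series([1, -2, 3, 4])  # hypothetical step values
path = steps.cumsum()             # running total of the steps
print(path.tolist())              # [1, -1, 2, 6]
```

Each output value is the sum of all inputs up to that position, which is why a cumulative sum of random noise looks like a wandering trend rather than a jagged band.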

ts.plot(label='ts', legend=True, title='pandas', color='green', fontsize=12)  # label is the name shown in the legend (a string, not True); fontsize applies to the tick labels

df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=list('ABCD'))

df = df.cumsum()

plt.figure()

df.plot()

df3 = pd.DataFrame(np.random.randn(1000, 2), columns=['B', 'C']).cumsum()

df3['A'] = pd.Series(list(range(len(df))))

df3.plot(x='A', y='B')

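# Note: a DataFrame with two columns, B and C (like column names in Excel), is built from 1000 standard normal draws each and accumulated with cumsum; column A holds 0..999 and serves as the x axis

# Bar charts

# For labeled, non-time-series data, you may want a bar chart: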

import matplotlib

import matplotlib.pyplot as plt

import pandas as pd

import numpy as np

matplotlib.style.use('ggplot')  # use the ggplot style

%matplotlib inline

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))

ts = ts.cumsum()

df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=list('ABCD'))

df = df.cumsum()

plt.figure();

df.iloc[5].plot(kind='bar')

plt.axhline(0, color='k')

df2 = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])

df2.plot.bar()

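# stacked=False is the default; pass stacked=True to stack the bars

# Note: if you have read the previous part, the code above is easy to follow; here several groups of bars are drawn in the same figure for comparison

df2.plot.barh(stacked=True)

# Note: draw a horizontal stacked bar chart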

df4 = pd.DataFrame({'a': np.random.randn(1000) + 1, 'b': np.random.randn(1000),

'c': np.random.randn(1000) - 1}, columns=['a', 'b', 'c'])

df4.head()

# Note: build df4 from standard normal draws (columns a, b, c are shifted by +1, 0, and -1)

plt.figure()

df4.plot.hist(alpha=0.5)

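# Note: draw histograms; alpha=0.5 means 50% transparency

# Note: there are only three columns, yet about six colors appear; presumably the overlapping semi-transparent bars blend into new colors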

plt.figure()

df4.plot.hist(stacked=True, bins=20)

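# Note: draw a stacked histogram (bars do not overlap, so it is easier to read); bins=20 splits the range into 20 intervals

# Fewer than 20 bars are visible because values far from the center of a normal distribution are rare; the outer bins may hold only a few points and are hard to see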
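The remark about bins can be checked numerically. The sketch below uses np.histogram on fresh normal draws rather than on df4 itself; the fixed seed and the variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)        # assumption: fixed seed for repeatability
values = rng.standard_normal(1000)

counts, edges = np.histogram(values, bins=20)
print(len(counts), len(edges))        # 20 bins are delimited by 21 edges
print(counts)                         # inspect how few points fall in the outer bins
```

All 20 bins exist; the ones at the edges simply hold so few points that their bars are nearly invisible at the scale of the central bars.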

plt.figure();

df4['a'].plot.hist(orientation='horizontal', cumulative=True)

# Note: take column a and draw a horizontal cumulative histogram with the default 10 bins

plt.figure()

df.diff().hist(color='g', alpha=0.5, bins=20)

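# Note: draw a histogram of the first difference of each of the four columns of df (green, semi-transparent, 20 bins); the result is laid out as a 2x2 grid of subplots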
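Since df was built with cumsum(), taking diff() essentially undoes that step. A small sketch, with a hypothetical 5x2 frame and a fixed seed, shows the relationship:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # assumption: fixed seed for repeatability
steps = pd.DataFrame(rng.standard_normal((5, 2)), columns=["A", "B"])

walk = steps.cumsum()     # accumulate the steps
recovered = walk.diff()   # first difference: row i minus row i-1

# The first row has no predecessor, so it becomes NaN;
# every later row matches the original steps.
first_is_nan = bool(recovered.iloc[0].isna().all())
same = bool(np.allclose(recovered.iloc[1:], steps.iloc[1:]))
print(first_is_nan, same)
```

This is why the histograms of df.diff() look like plain normal distributions again.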

df = pd.DataFrame(np.random.rand(10, 5), columns=['A', 'B', 'C', 'D', 'E'])

df.head()

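# Note: use numpy to generate a 10x5 matrix of random numbers drawn uniformly from [0, 1) (np.random.rand is uniform, not normal),

# then turn it into a DataFrame with the columns named A, B, C, D, E and show the first five rows

df.plot.box()  # Note: draw a box plot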

color = dict(boxes='DarkGreen', whiskers='DarkOrange',medians='DarkBlue', caps='Gray')

df.plot.box(color=color, sym='r+')

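# Note: draw a box plot, this time setting the line color of each part of the box; the official description of sym is "specify fliers style"

# Fliers are the outlier points beyond the whiskers, so sym='r+' draws them as red plus signs; with only 10 uniform values per column there are often no outliers, which is why removing the argument made no visible difference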
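To see which points count as fliers, here is a sketch of the rule matplotlib applies by default (whiskers reach at most 1.5 times the interquartile range beyond the quartiles). The sample array is made up for illustration.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 100])   # hypothetical sample with one extreme value

q1, q3 = np.percentile(data, [25, 75])  # first and third quartiles
iqr = q3 - q1                           # interquartile range
lower = q1 - 1.5 * iqr                  # lower fence
upper = q3 + 1.5 * iqr                  # upper fence

fliers = data[(data < lower) | (data > upper)]
print(fliers.tolist())                  # [100]
```

Only points outside the two fences are drawn with the sym marker; everything else is summarized by the box and whiskers.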

df.plot.box(vert=False, positions=[1, 4, 5, 6, 8])

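# Note: vert=False draws horizontal boxes; positions gives the coordinate of each box along the vertical axis,

# so the five boxes are placed, from bottom to top, at positions 1, 4, 5, 6, and 8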

df = pd.DataFrame(np.random.rand(10,2), columns=['Col1', 'Col2'] )

df['X'] = pd.Series(['A','A','A','A','A','B','B','B','B','B'])

plt.figure()

bp = df.boxplot(by='X')

# Note: add a column X with two categories, A and B, to df, then draw box plots of Col1 and Col2 split by that category

df = pd.DataFrame(np.random.rand(10,3), columns=['Col1', 'Col2', 'Col3'])

df['X'] = pd.Series(['A','A','A','A','A','B','B','B','B','B'])

df['Y'] = pd.Series(['A','B','A','B','A','B','A','B','A','B'])

plt.figure();

bp = df.boxplot(column=['Col1','Col2'], by=['X','Y'])

# Note: two grouping keys are passed here, so there are 2x2 = four group combinations
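The four combinations can be listed without drawing anything. This sketch rebuilds only the two key columns from the example above and counts the rows in each group; the frame name `keys` is an assumption.

```python
import pandas as pd

# Same X and Y labels as in the box plot example above
keys = pd.DataFrame({
    "X": list("AAAAABBBBB"),
    "Y": list("ABABABABAB"),
})

sizes = keys.groupby(["X", "Y"]).size()  # rows per (X, Y) combination
print(sizes)
```

Each (X, Y) pair becomes one box per plotted column, so very small groups like these give boxes based on only two or three points.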

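# Compare with the following plotting code:

# (df_box is not defined in the original post; the next two lines build a hypothetical sample frame)

df_box = pd.DataFrame(np.random.randn(50, 2), columns=['Col1', 'Col2'])

df_box['g'] = np.random.choice(['A', 'B'], size=50)

bp = df_box.groupby('g').boxplot()

# Note: boxplot(by=...) draws one subplot per column with one box per group, while groupby(...).boxplot() draws one subplot per group with one box per column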

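# Scatter plots

# Scatter plots can be drawn with the DataFrame.plot.scatter() method.

# A scatter plot needs numeric columns for the x and y axes; specify them with the x and y keywords.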

df = pd.DataFrame(np.random.rand(50, 4), columns=['a', 'b', 'c', 'd'])

df.plot.scatter(x='a', y='b')

# Note: use column a as the x-axis data and column b as the y-axis data

df = pd.DataFrame(np.random.rand(50, 4), columns=['a', 'b', 'c', 'd'])

ax=df.plot.scatter(x='a', y='b', color='DarkBlue', label='Group 1')

df.plot.scatter(x='c', y='d', color='RED', label='Group 2',ax=ax)

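# Note: to plot several column groups on a single axes, repeat the plot call and pass the target axes through ax; specifying the color and label keywords is recommended to tell the groups apart.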

df.plot.scatter(x='a', y='b', c='c', s=50)

# Note: the keyword c can be given as a column name to supply the color of each point

df.plot.scatter(x='a', y='b', s=df['c']*200)

# Note: use column c (scaled by 200) as the bubble (marker) size
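A short sketch of how the s argument ends up on the plot: matplotlib treats s as the marker area in points squared, one value per point. The frame `pts` and the Agg backend are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window needed
import pandas as pd

# Hypothetical three-point frame
pts = pd.DataFrame({
    "a": [0.1, 0.5, 0.9],
    "b": [0.2, 0.4, 0.6],
    "c": [0.5, 1.0, 2.0],
})

ax = pts.plot.scatter(x="a", y="b", s=pts["c"] * 200)

# The scatter is stored as one collection; its sizes are the values passed in s
sizes = ax.collections[0].get_sizes()
print(sizes)
```

Because s is an area, doubling a value in column c doubles the area of the bubble, not its diameter.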

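II. Visualization with pyecharts (fancy tricks for simple charts)

Overall, this tool, which wraps the ECharts library developed at Baidu, suits time series data better, for example stock or sales data. It also seems to work well for maps. Note, however, that its data is always passed in as lists: bar.add('profit',x_axis=index,y_axis=df1.loc[:,0].values), bar.add('loss',index,df2.iloc[:,0].values)

In a bar chart each column represents one data series, so it is not that handy for more general bar charts. The visual effect, however, is striking.

There is no Histogram chart type. For data with a one-to-one correspondence between labels and values, pyecharts may be a good choice.

Tip: click the download button on the right of the chart to save the image locally.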

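add()

The main method; adds the chart data and sets the configuration options.

show_config()

Prints all configuration options of the chart.

render()

By default generates a render.html file in the root directory. It accepts a path parameter that sets where the file is saved, e.g. render(r"e:my_first_chart.html"); open the file in a browser.

The default encoding is UTF-8, which causes no problems in Python 3, where support for Chinese is much better. In Python 2, however, encoding is a real headache, and I have not found a perfect solution yet. For now the only way is to re-encode the file by hand in a text editor. I use Visual Studio Code: reopen the file with GBK encoding, then save it again as UTF-8. After that, the browser shows the Chinese text without garbling.

Basically every chart type is drawn the same way:

chart_name = Type() initializes a chart of the specific type.

add() adds the data and configuration options.

render() generates the .html file.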

from pyecharts import Bar

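bar = Bar("My first chart", "This is the subtitle")  # main title and subtitle

bar.add("Clothing", ["Shirt", "Cardigan", "Chiffon blouse", "Trousers", "High heels", "Socks"], [5, 20, 36, 10, 75, 90])

bar.show_config()  # print the configuration

bar.render()  # generate the render file, which can be opened directly in a browser

# The chart can be downloaded directly; the figure below is the downloaded result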

class Bar(pyecharts.chart.Chart)

| <<< Bar chart (vertical/horizontal) >>>

| Bar chart; the height of the columns (or the width of horizontal bars) represents the size of the data.

| Method resolution order:

| Bar

| pyecharts.chart.Chart

| pyecharts.base.Base

| builtins.object

| Methods defined here:

| __init__(self, title='', subtitle='', **kwargs)

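| :param title:

| Main title text; supports \n line breaks. Defaults to ""

| :param subtitle:

| Subtitle text; supports \n line breaks. Defaults to ""

| :param width:

| Canvas width, default 800 (px)

| :param height:

| Canvas height, default 400 (px)

| :param title_pos:

| Distance of the title from the left side, default 'left'; 'auto', 'left', 'right', and 'center' are available, or a percentage or an integer

| :param title_top:

| Distance of the title from the top, default 'top'; 'top', 'middle', and 'bottom' are available, or a percentage or an integer

| :param title_color:

| Main title text color, default '#000'

| :param subtitle_color:

| Subtitle text color, default '#aaa'

| :param title_text_size:

| Main title font size, default 18

| :param subtitle_text_size:

| Subtitle font size, default 12

| :param background_color:

| Canvas background color, default '#fff'

| :param page_title:

| Value of the <title> tag in the generated html file. Defaults to 'Echarts'

| :param renderer:

| Rendering mode; 'svg' and 'canvas' are available, default 'canvas'. 3D charts can only use 'canvas'.

| :param extra_html_text_label:

| Extra HTML text label (<p> tag). A list: list[0] is the text content, list[1] is the font style (optional). E.g. ["this is a p label", "color:red"]

| add(self, *args, **kwargs)

| The `add()` method here only exists to provide parameter auto-completion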

| ----------------------------------------------------------------------

| Methods inherited from pyecharts.base.Base:

| get_js_dependencies(self)

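| Declares the paths of all js files

| on(self, event_name, handler)

| print_echarts_options(self)

| Prints all configuration options of the chart

| render(self, path='render.html', template_name='simple_chart.html', object_name='chart', **kwargs)

| render_embed(self)

| Renders all configuration options of the chart for use in web pages; the required js dependency files must be provided first

| render_notebook(self)

| show_config(self)

| Prints all configuration options of the chart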

| use_theme(self, theme_name)

| ----------------------------------------------------------------------

| Static methods inherited from pyecharts.base.Base:

| cast(seq)

| Converts a data sequence: turns a sequence containing dicts or tuples into two lists, k_lst and v_lst

| List of tuples

| [(A1, B1), (A2, B2), ...] -->

| k_lst[ A[i1, i2...] ], v_lst[ B[i1, i2...] ]

| List of dicts

| [{A1: B1}, {A2: B2}, ...] -->

| k_lst[ A[i1, i2...] ], v_lst[ B[i1, i2...] ]

| Dict

| {A1: B1, A2: B2, ...} -->

| k_lst[ A[i1, i2...] ], v_lst[ B[i1, i2...] ]
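The conversion described for cast can be sketched in plain Python. This is not the pyecharts implementation; `cast_sketch` is a hypothetical helper written only to illustrate the three input shapes listed in the help text.

```python
def cast_sketch(seq):
    """Split a dict, a list of (key, value) tuples, or a list of
    dicts into two parallel lists: keys and values."""
    if isinstance(seq, dict):
        pairs = list(seq.items())
    else:
        pairs = []
        for item in seq:
            if isinstance(item, dict):
                pairs.extend(item.items())  # each dict contributes its entries
            else:
                pairs.append(tuple(item))   # (key, value) tuple
    keys = [key for key, _ in pairs]
    values = [value for _, value in pairs]
    return keys, values


from_tuples = cast_sketch([("a", 1), ("b", 2)])
from_dicts = cast_sketch([{"a": 1}, {"b": 2}])
from_dict = cast_sketch({"a": 1, "b": 2})
print(from_tuples, from_dicts, from_dict)
```

All three calls give the same pair of lists, which is the shape add() expects for its x and y data.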

from pyecharts import Bar

attr = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

v1 = [2.0, 4.9, 7.0, 23.2, 25.6, 76.7, 135.6, 162.2, 32.6, 20.0, 6.4, 3.3]

v2 = [2.6, 5.9, 9.0, 26.4, 28.7, 70.7, 175.6, 182.2, 48.7, 18.8, 6.0, 2.3]

bar = Bar("Bar chart", "precipitation and evaporation one year")

bar.add("precipitation", attr, v1, mark_line=["average"], mark_point=["max", "min"])

bar.add("evaporation", attr, v2, mark_line=["average"], mark_point=["max", "min"])

bar

import pandas as pd

import numpy as np

title='bar chart'

index=pd.date_range('3/8/2017',periods=6,freq='M')

df1=pd.DataFrame(np.random.randn(6),index=index)

df2=pd.DataFrame(np.random.randn(6),index=index)

dtvalues1=[x[0] for x in df1.values]

dtvalues2=[x[0] for x in df2.values]

bar=Bar(title,'FuBiaoTi')

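bar.add('profit', index, dtvalues1)  # same as bar.add('profit', x_axis=index, y_axis=dtvalues1)

# This style is actually rather tedious, because the data has to be passed in as lists.

# is_convert=True swaps the x and y axes to draw the chart horizontally; the default is False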

bar.add('loss',index,dtvalues2)

bar
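Since the data has to arrive as plain lists, one way to prepare a DataFrame column is sketched below using pandas only. The frame, the column name `profit`, the month-start frequency, and the label format are all assumptions for illustration; the resulting lists would then be handed to bar.add as in the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: six month-start dates with made-up values 0.0 .. 5.0
index = pd.date_range("2017-03-08", periods=6, freq="MS")
frame = pd.DataFrame(np.arange(6, dtype=float), index=index, columns=["profit"])

labels = index.strftime("%Y-%m").tolist()  # x-axis labels as plain strings
values = frame["profit"].tolist()          # y values as plain Python floats
print(labels)
print(values)
```

Formatting the dates as strings keeps the axis labels short, and tolist() avoids the list comprehension over df.values used earlier.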

# Scatter (scatter plot)

from pyecharts import Scatter

v1 =[10, 20, 30, 40, 50, 60]

v2 =[10, 20, 30, 40, 50, 60]

scatter = Scatter("Scatter plot example")

scatter.add("A", v1, v2)

scatter.add("B", v1[::-1], v2)

scatter

import random

attr = ["Day {}".format(i) for i in range(30)]

v1 = [random.randint(1, 30) for _ in range(30)]

bar = Bar("Bar - datazoom - slider example")

bar.add("", attr, v1, is_label_show=True, is_datazoom_show=True)  # is_label_show=True shows the value labels; is_datazoom_show=True enables the data zoom slider for viewing part of the data

bar

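# When the x-axis or y-axis labels are so dense that they would overlap if all were shown, rotate the labels

attr = ["Day {}".format(i) for i in range(20)]

v1 = [random.randint(1, 20) for _ in range(20)]

bar = Bar("Axis label rotation example")

bar.add("", attr, v1, xaxis_interval=5, xaxis_rotate=30, yaxis_rotate=30)  # rotate the x and y labels by 30 degrees; xaxis_interval=5 sets the interval at which x labels are displayed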

bar

A K-line (candlestick) chart can be drawn directly from the opening, closing, highest, and lowest prices.
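As a sketch of the data preparation for such a chart: an ECharts candlestick series expects each item in the order [open, close, lowest, highest]. The price table and its column names below are made up for illustration, and the call that hands the rows to a pyecharts chart is left out.

```python
import pandas as pd

# Hypothetical daily prices; the column names are assumptions
prices = pd.DataFrame(
    {
        "open": [10.0, 10.5],
        "close": [10.4, 10.2],
        "low": [9.8, 10.0],
        "high": [10.6, 10.9],
    },
    index=["2017-03-08", "2017-03-09"],
)

dates = prices.index.tolist()
# Reorder the columns to [open, close, lowest, highest], one row per day
rows = prices[["open", "close", "low", "high"]].values.tolist()
print(dates)
print(rows)
```

Selecting the columns explicitly guards against a source table that stores them in the more common open, high, low, close order.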

from pyecharts import Bar, Line, Scatter, EffectScatter, Grid

attr = ["Shirt", "Cardigan", "Chiffon blouse", "Trousers", "High heels", "Socks"]

v1 = [5, 20, 36, 10, 75, 90]

v2 = [10, 25, 8, 60, 20, 80]

bar = Bar("Bar chart example", height=720, width=1200, title_pos="65%")

bar.add("Merchant A", attr, v1, is_stack=True)

bar.add("Merchant B", attr, v2, is_stack=True, legend_pos="80%")

line = Line("Line chart example")

attr = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

line.add("Highest temperature", attr, [11, 11, 15, 13, 12, 13, 10], mark_point=["max", "min"], mark_line=["average"])

line.add("Lowest temperature", attr, [1, -2, 2, 5, 3, 2, 0], mark_point=["max", "min"],

mark_line=["average"], legend_pos="20%")

v1 = [5, 20, 36, 10, 75, 90]

v2 = [10, 25, 8, 60, 20, 80]

scatter = Scatter("Scatter plot example", title_top="50%", title_pos="65%")

scatter.add("scatter", v1, v2, legend_top="50%", legend_pos="80%")

es = EffectScatter("Animated scatter plot example", title_top="50%")

es.add("es", [11, 11, 15, 13, 12, 13, 10], [1, -2, 2, 5, 3, 2, 0], effect_scale=6,

legend_top="50%", legend_pos="20%")

grid = Grid()

grid.add(bar, grid_bottom="60%", grid_left="60%")

grid.add(line, grid_bottom="60%", grid_right="60%")

grid.add(scatter, grid_top="60%", grid_left="60%")

grid.add(es, grid_top="60%", grid_right="60%")

grid

from pyecharts import Bar, Timeline

from random import randint

attr = ["Shirt", "Cardigan", "Chiffon blouse", "Trousers", "High heels", "Socks"]

seasons = ["Spring", "Summer", "Autumn", "Winter"]

timeline = Timeline(is_auto_play=True, timeline_bottom=0)

# One bar chart per year (2012 to 2016), each with four seasonal series of random sales figures
for year in range(2012, 2017):
    bar = Bar("{} sales".format(year), "Data is purely fictional")
    for season in seasons:
        bar.add(season, attr, [randint(10, 100) for _ in range(6)], is_legend_show=True)
    timeline.add(bar, str(year))

timeline
