Data Visualization
References
I. Introduction to matplotlib and seaborn
1.1 matplotlib
1.1.1 Overview
Official documentation
1.1.2 Import convention
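import matplotlib.pyplot as plt
1.1.3 pyplot vs. pylab
matplotlib.pyplot is a collection of command-style functions that make Matplotlib work like MATLAB. Each pyplot function makes some change to a figure: for example, it creates a figure, creates a plotting area in a figure, draws some lines in a plotting area, or decorates the plot with labels.
pylab is a module that bundles matplotlib.pyplot, numpy, and some additional functions into a single namespace. Its original purpose was to mimic the MATLAB way of working by importing every function into the global namespace.
Because bulk imports into the global namespace can cause unexpected behavior, using pylab is strongly discouraged. Use matplotlib.pyplot instead.
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline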
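As a minimal sketch of the recommended style (the sine and cosine curves are made-up data, not part of the original notebook), the cell below keeps NumPy behind the np. prefix and lets each pyplot call act on the current figure. The Agg backend line is only an assumption so that the sketch also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)    # NumPy names stay behind the np. prefix

plt.figure()                          # create a figure; it becomes the current one
plt.plot(x, np.sin(x), label="sin")   # draw a line in the current axes
plt.plot(x, np.cos(x), label="cos")   # a second call adds to the same axes
plt.title("pyplot acts on the current figure")
plt.legend()

ax = plt.gca()                        # the axes the calls above were drawing into
n_lines = len(ax.get_lines())         # number of lines drawn so far
```

Nothing here is imported into the global namespace implicitly, which is the behavior that pylab was criticized for.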
1.2 seaborn
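Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
Seaborn wraps commonly used combinations of matplotlib features, so beginners can also produce reasonably practical plots.
The trade-off is that highly customized plots for specific requirements are hard to achieve.
Beginners in visualization are advised to start with seaborn, which covers most needs.
Official tutorial
Import convention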
import seaborn as sns
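To illustrate what "high-level interface" means in practice, here is a small sketch on a made-up DataFrame (the column names group and value are hypothetical). A single seaborn call groups the data, draws the boxes, and labels the axes from the column names.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
# Hypothetical toy data: three groups with 30 random values each
toy = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 30),
    "value": rng.normal(size=90),
})

# One call: grouping, drawing and axis labelling are handled by seaborn
ax = sns.boxplot(data=toy, x="group", y="value")
xlabel, ylabel = ax.get_xlabel(), ax.get_ylabel()
```

The returned object is an ordinary matplotlib Axes, so matplotlib methods can still be used for further tweaks.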
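II. Basic plotting
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires scikit-learn < 1.2.
from sklearn.datasets import load_boston
data = load_boston()
data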
{'data': array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
4.9800e+00],
[2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
9.1400e+00],
[2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
4.0300e+00],
...,
[6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
5.6400e+00],
[1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,
6.4800e+00],
[4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
7.8800e+00]]),
'target': array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,
18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6,
15.2, 14.5, 15.6, 13.9, 16.6, 14.8, 18.4, 21. , 12.7, 14.5, 13.2,
13.1, 13.5, 18.9, 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7,
21.2, 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4, 18.9,
35.4, 24.7, 31.6, 23.3, 19.6, 18.7, 16. , 22.2, 25. , 33. , 23.5,
19.4, 22. , 17.4, 20.9, 24.2, 21.7, 22.8, 23.4, 24.1, 21.4, 20. ,
20.8, 21.2, 20.3, 28. , 23.9, 24.8, 22.9, 23.9, 26.6, 22.5, 22.2,
23.6, 28.7, 22.6, 22. , 22.9, 25. , 20.6, 28.4, 21.4, 38.7, 43.8,
33.2, 27.5, 26.5, 18.6, 19.3, 20.1, 19.5, 19.5, 20.4, 19.8, 19.4,
21.7, 22.8, 18.8, 18.7, 18.5, 18.3, 21.2, 19.2, 20.4, 19.3, 22. ,
20.3, 20.5, 17.3, 18.8, 21.4, 15.7, 16.2, 18. , 14.3, 19.2, 19.6,
23. , 18.4, 15.6, 18.1, 17.4, 17.1, 13.3, 17.8, 14. , 14.4, 13.4,
15.6, 11.8, 13.8, 15.6, 14.6, 17.8, 15.4, 21.5, 19.6, 15.3, 19.4,
17. , 15.6, 13.1, 41.3, 24.3, 23.3, 27. , 50. , 50. , 50. , 22.7,
25. , 50. , 23.8, 23.8, 22.3, 17.4, 19.1, 23.1, 23.6, 22.6, 29.4,
23.2, 24.6, 29.9, 37.2, 39.8, 36.2, 37.9, 32.5, 26.4, 29.6, 50. ,
32. , 29.8, 34.9, 37. , 30.5, 36.4, 31.1, 29.1, 50. , 33.3, 30.3,
34.6, 34.9, 32.9, 24.1, 42.3, 48.5, 50. , 22.6, 24.4, 22.5, 24.4,
20. , 21.7, 19.3, 22.4, 28.1, 23.7, 25. , 23.3, 28.7, 21.5, 23. ,
26.7, 21.7, 27.5, 30.1, 44.8, 50. , 37.6, 31.6, 46.7, 31.5, 24.3,
31.7, 41.7, 48.3, 29. , 24. , 25.1, 31.5, 23.7, 23.3, 22. , 20.1,
22.2, 23.7, 17.6, 18.5, 24.3, 20.5, 24.5, 26.2, 24.4, 24.8, 29.6,
42.8, 21.9, 20.9, 44. , 50. , 36. , 30.1, 33.8, 43.1, 48.8, 31. ,
36.5, 22.8, 30.7, 50. , 43.5, 20.7, 21.1, 25.2, 24.4, 35.2, 32.4,
32. , 33.2, 33.1, 29.1, 35.1, 45.4, 35.4, 46. , 50. , 32.2, 22. ,
20.1, 23.2, 22.3, 24.8, 28.5, 37.3, 27.9, 23.9, 21.7, 28.6, 27.1,
20.3, 22.5, 29. , 24.8, 22. , 26.4, 33.1, 36.1, 28.4, 33.4, 28.2,
22.8, 20.3, 16.1, 22.1, 19.4, 21.6, 23.8, 16.2, 17.8, 19.8, 23.1,
21. , 23.8, 23.1, 20.4, 18.5, 25. , 24.6, 23. , 22.2, 19.3, 22.6,
19.8, 17.1, 19.4, 22.2, 20.7, 21.1, 19.5, 18.5, 20.6, 19. , 18.7,
32.7, 16.5, 23.9, 31.2, 17.5, 17.2, 23.1, 24.5, 26.6, 22.9, 24.1,
18.6, 30.1, 18.2, 20.6, 17.8, 21.7, 22.7, 22.6, 25. , 19.9, 20.8,
16.8, 21.9, 27.5, 21.9, 23.1, 50. , 50. , 50. , 50. , 50. , 13.8,
13.8, 15. , 13.9, 13.3, 13.1, 10.2, 10.4, 10.9, 11.3, 12.3, 8.8,
7.2, 10.5, 7.4, 10.2, 11.5, 15.1, 23.2, 9.7, 13.8, 12.7, 13.1,
12.5, 8.5, 5. , 6.3, 5.6, 7.2, 12.1, 8.3, 8.5, 5. , 11.9,
27.9, 17.2, 27.5, 15. , 17.2, 17.9, 16.3, 7. , 7.2, 7.5, 10.4,
8.8, 8.4, 16.7, 14.2, 20.8, 13.4, 11.7, 8.3, 10.2, 10.9, 11. ,
9.5, 14.5, 14.1, 16.1, 14.3, 11.7, 13.4, 9.6, 8.7, 8.4, 12.8,
10.5, 17.1, 18.4, 15.4, 10.8, 11.8, 14.9, 12.6, 14.1, 13. , 13.4,
15.2, 16.1, 17.8, 14.9, 14.1, 12.7, 13.5, 14.9, 20. , 16.4, 17.7,
19.5, 20.2, 21.4, 19.9, 19. , 19.1, 19.1, 20.1, 19.9, 19.6, 23.2,
29.8, 13.8, 13.3, 16.7, 12. , 14.6, 21.4, 23. , 23.7, 25. , 21.8,
20.6, 21.2, 19.1, 20.6, 15.2, 7. , 8.1, 13.6, 20.1, 21.8, 24.5,
23.1, 19.7, 18.3, 21.2, 17.5, 16.8, 22.4, 20.6, 23.9, 22. , 11.9]),
'feature_names': array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7'),
'DESCR': ".. _boston_dataset:\n\nBoston house prices dataset\n---------------------------\n\n**Data Set Characteristics:** \n\n :Number of Instances: 506 \n\n :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.\n\n :Attribute Information (in order):\n - CRIM per capita crime rate by town\n - ZN proportion of residential land zoned for lots over 25,000 sq.ft.\n - INDUS proportion of non-retail business acres per town\n - CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\n - NOX nitric oxides concentration (parts per 10 million)\n - RM average number of rooms per dwelling\n - AGE proportion of owner-occupied units built prior to 1940\n - DIS weighted distances to five Boston employment centres\n - RAD index of accessibility to radial highways\n - TAX full-value property-tax rate per $10,000\n - PTRATIO pupil-teacher ratio by town\n - B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town\n - LSTAT % lower status of the population\n - MEDV Median value of owner-occupied homes in $1000's\n\n :Missing Attribute Values: None\n\n :Creator: Harrison, D. and Rubinfeld, D.L.\n\nThis is a copy of UCI ML housing dataset.\nhttps://archive.ics.uci.edu/ml/machine-learning-databases/housing/\n\n\nThis dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.\n\nThe Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic\nprices and the demand for clean air', J. Environ. Economics & Management,\nvol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics\n...', Wiley, 1980. N.B. Various transformations are used in the table on\npages 244-261 of the latter.\n\nThe Boston house-price data has been used in many machine learning papers that address regression\nproblems. \n \n.. topic:: References\n\n - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 
244-261.\n - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.\n",
'filename': 'C:\\Users\\CK\\anaconda3\\lib\\site-packages\\sklearn\\datasets\\data\\boston_house_prices.csv'}
import pandas as pd
x_df = pd.DataFrame(data['data'], columns=data['feature_names'])
x_df.head()
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33
x_df.T.head()
              0         1         2         3         4         5         6         7         8         9       ...       496       497       498       499       500       501       502       503       504       505
CRIM    0.00632   0.02731   0.02729   0.03237   0.06905   0.02985   0.08829   0.14455   0.21124   0.17004       ...    0.2896   0.26838   0.23912   0.17783   0.22438   0.06263   0.04527   0.06076   0.10959   0.04741
ZN     18.00000   0.00000   0.00000   0.00000   0.00000   0.00000  12.50000  12.50000  12.50000  12.50000       ...    0.0000   0.00000   0.00000   0.00000   0.00000   0.00000   0.00000   0.00000   0.00000   0.00000
INDUS   2.31000   7.07000   7.07000   2.18000   2.18000   2.18000   7.87000   7.87000   7.87000   7.87000       ...    9.6900   9.69000   9.69000   9.69000   9.69000  11.93000  11.93000  11.93000  11.93000  11.93000
CHAS    0.00000   0.00000   0.00000   0.00000   0.00000   0.00000   0.00000   0.00000   0.00000   0.00000       ...    0.0000   0.00000   0.00000   0.00000   0.00000   0.00000   0.00000   0.00000   0.00000   0.00000
NOX     0.53800   0.46900   0.46900   0.45800   0.45800   0.45800   0.52400   0.52400   0.52400   0.52400       ...    0.5850   0.58500   0.58500   0.58500   0.58500   0.57300   0.57300   0.57300   0.57300   0.57300
5 rows × 506 columns
y_df = pd.DataFrame(data['target'])
y_df.head()
      0
0  24.0
1  21.6
2  34.7
3  33.4
4  36.2
2.1 Basic elements of a chart
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns
9.8 * 500
4900.0
plt.plot(x_df)
[<matplotlib.lines.Line2D at 0x1e844f9f2c8>,
<matplotlib.lines.Line2D at 0x1e844f9f448>,
<matplotlib.lines.Line2D at 0x1e844f9f608>,
<matplotlib.lines.Line2D at 0x1e844f9f7c8>,
<matplotlib.lines.Line2D at 0x1e844f9fa08>,
<matplotlib.lines.Line2D at 0x1e844f9fc88>,
<matplotlib.lines.Line2D at 0x1e844f9fe88>,
<matplotlib.lines.Line2D at 0x1e844fa5108>,
<matplotlib.lines.Line2D at 0x1e844f9f988>,
<matplotlib.lines.Line2D at 0x1e844f9fc08>,
<matplotlib.lines.Line2D at 0x1e844f7b208>,
<matplotlib.lines.Line2D at 0x1e844fa5948>,
<matplotlib.lines.Line2D at 0x1e844fa5b88>]
x = np.linspace(1, 10, 50)
plt.plot(x, np.sin(x))
[<matplotlib.lines.Line2D at 0x1e845024a48>]
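The basic elements are: title, x-axis label, y-axis label, legend, x-axis limits, y-axis limits, x ticks, y ticks, x tick labels, and y tick labels.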
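Each of the elements listed above has a corresponding setter on a matplotlib Axes. The following is a minimal sketch on the same sine curve as above; the concrete titles, limits, and tick labels are arbitrary choices for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(1, 10, 50)
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, np.sin(x), label="sin(x)")

ax.set_title("Sine curve")                    # title
ax.set_xlabel("x")                            # x-axis label
ax.set_ylabel("sin(x)")                       # y-axis label
ax.legend(loc="upper right")                  # legend (uses the label= given to plot)
ax.set_xlim(0, 12)                            # x-axis limits
ax.set_ylim(-1.5, 1.5)                        # y-axis limits
ax.set_xticks([0, 4, 8, 12])                  # x tick positions
ax.set_yticks([-1, 0, 1])                     # y tick positions
ax.set_xticklabels(["0", "4", "8", "12"])     # x tick labels
ax.set_yticklabels(["low", "zero", "high"])   # y tick labels

y_labels = [t.get_text() for t in ax.get_yticklabels()]
```

The pyplot equivalents (plt.title, plt.xlabel, plt.xlim, plt.xticks, and so on) do the same thing on the current axes.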
# data_df has to be defined before it can be inspected or plotted
data_df = x_df[['AGE', 'RM']]
data_df.head()
# DataFrame.plot returns a matplotlib Axes, not a Figure
ax = data_df.plot(figsize=(9, 6))
2.2 Chart styles and annotations
The main style options are: linestyle, color, marker, style (linestyle, marker, and color combined in one string), alpha, colormap (the color maps that ship with Matplotlib), grid, and text.
help(plt.plot)
df = x_df['AGE'][0:20]
df.plot(linestyle='--',
        marker='o',
        color='r',
        grid=True)
<matplotlib.axes._subplots.AxesSubplot at 0x1e845117f08>
x_df[0:20].plot(colormap='Dark2_r')
<matplotlib.axes._subplots.AxesSubplot at 0x1a28a57790>
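The combined style option can be sketched with matplotlib's format-string shorthand, where one short string sets color, marker, and line style at once. The five values are the first five AGE entries from the table above; everything else is an illustrative choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

values = np.array([65.2, 78.9, 61.1, 45.8, 54.2])  # first five AGE values

fig, ax = plt.subplots()
# "ro--" = red color, circle markers, dashed line; alpha sets transparency
(line,) = ax.plot(values, "ro--", alpha=0.6)
ax.grid(True)

style = (line.get_color(), line.get_marker(), line.get_linestyle())
```

The same string can be passed as style='ro--' to the pandas plot method.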
cmaps = [('Perceptually Uniform Sequential', [
            'viridis', 'plasma', 'inferno', 'magma', 'cividis']),
         ('Sequential', [
            'Greys', 'Purples', 'Blues', 'Greens', 'Oranges', 'Reds',
            'YlOrBr', 'YlOrRd', 'OrRd', 'PuRd', 'RdPu', 'BuPu',
            'GnBu', 'PuBu', 'YlGnBu', 'PuBuGn', 'BuGn', 'YlGn']),
         ('Sequential (2)', [
            'binary', 'gist_yarg', 'gist_gray', 'gray', 'bone', 'pink',
            'spring', 'summer', 'autumn', 'winter', 'cool', 'Wistia',
            'hot', 'afmhot', 'gist_heat', 'copper']),
         ('Diverging', [
            'PiYG', 'PRGn', 'BrBG', 'PuOr', 'RdGy', 'RdBu',
            'RdYlBu', 'RdYlGn', 'Spectral', 'coolwarm', 'bwr', 'seismic']),
         ('Cyclic', ['twilight', 'twilight_shifted', 'hsv']),
         ('Qualitative', [
            'Pastel1', 'Pastel2', 'Paired', 'Accent',
            'Dark2', 'Set1', 'Set2', 'Set3',
            'tab10', 'tab20', 'tab20b', 'tab20c']),
         ('Miscellaneous', [
            'flag', 'prism', 'ocean', 'gist_earth', 'terrain', 'gist_stern',
            'gnuplot', 'gnuplot2', 'CMRmap', 'cubehelix', 'brg',
            'gist_rainbow', 'rainbow', 'jet', 'nipy_spectral', 'gist_ncar'])]
gradient = np.linspace(0, 1, 256)
gradient = np.vstack((gradient, gradient))
def plot_color_gradients(cmap_category, cmap_list):
    nrows = len(cmap_list)
    figh = 0.35 + 0.15 + (nrows + (nrows - 1) * 0.1) * 0.22
    fig, axes = plt.subplots(nrows=nrows, figsize=(6.4, figh))
    fig.subplots_adjust(top=1 - .35 / figh, bottom=.15 / figh, left=0.2, right=0.99)
    axes[0].set_title(cmap_category + ' colormaps', fontsize=14)
    for ax, name in zip(axes, cmap_list):
        ax.imshow(gradient, aspect='auto', cmap=plt.get_cmap(name))
        ax.text(-.01, .5, name, va='center', ha='right', fontsize=10,
                transform=ax.transAxes)
    for ax in axes:
        ax.set_axis_off()
for cmap_category, cmap_list in cmaps:
    plot_color_gradients(cmap_category, cmap_list)
df.plot(style='o')
# Mark and label the maximum; idxmax() returns the index label of the largest value
plt.plot(df.idxmax(), df.max(), marker='o', color='r')
plt.text(df.idxmax(), df.max(), 'max_age', fontsize=12)
Text(8, 100.0, 'max_age')
2.3 Subplots
help(plt.figure)
fig_1 = plt.figure(num=1, figsize=(8, 6))
plt.plot(df, 'r--')
# num=1 already exists, so this call returns the same figure and the line is added to it
fig_1 = plt.figure(num=1, figsize=(8, 6))
plt.plot(x_df['AGE'][20:40])
# A new number creates a second, separate figure
fig_2 = plt.figure(num=2, figsize=(8, 6))
plt.plot(x_df['AGE'][40:60])
[<matplotlib.lines.Line2D at 0x1a27d7d3d0>]
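The role of the num argument can be checked directly. This is a small sketch with made-up data: asking for a number that already exists hands back the existing figure, while a new number creates a new one.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt

plt.close("all")                    # start from a clean state

fig_a = plt.figure(num=1, figsize=(8, 6))
plt.plot([1, 2, 3], "r--")          # first line, drawn on figure 1

fig_b = plt.figure(num=1)           # same number: the existing figure is returned
plt.plot([3, 2, 1])                 # second line, also drawn on figure 1

fig_c = plt.figure(num=2)           # new number: a separate figure
plt.plot([2, 2, 2])

lines_on_fig_1 = len(fig_a.axes[0].get_lines())
lines_on_fig_2 = len(fig_c.axes[0].get_lines())
```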
help(plt.subplots)
fig, axes = plt.subplots(2, 3, figsize=(10, 4))
fig
axes
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000001E845124B88>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001E8451A4F08>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001E8451DF748>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x000001E845215E88>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001E8452ADD88>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001E8452E6CC8>]],
dtype=object)
ax1 = axes[0, 2]
ax1.plot(df)
fig
fig, axes = plt.subplots(2, 3, figsize=(10, 4), sharex=True, sharey=True)
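Since plt.subplots returns the axes as a 2-D NumPy array, every panel can be filled in a loop instead of being indexed one by one. The sketch below uses made-up sine curves; axes.flat walks the grid row by row.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(2, 3, figsize=(10, 4), sharex=True, sharey=True)
x = np.linspace(0, 2 * np.pi, 100)

# axes has shape (2, 3); axes.flat iterates over all six panels in row order
for k, ax in enumerate(axes.flat, start=1):
    ax.plot(x, np.sin(k * x))
    ax.set_title(f"sin({k}x)", fontsize=9)

fig.tight_layout()
last_title = axes[1, 2].get_title()
```

With sharex and sharey enabled, all panels use the same axis ranges, which makes the curves directly comparable.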
df_4 = x_df[['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX']]
df_4.plot(style='-', alpha=0.4, figsize=(20, 8),
          subplots=True,
          layout=(1, 5),
          sharey=True)
plt.subplots_adjust(wspace=0, hspace=0.2)
x_df
        CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT
0    0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98
1    0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14
2    0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03
3    0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94
4    0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33
...      ...   ...    ...   ...    ...    ...   ...     ...  ...    ...      ...     ...    ...
501  0.06263   0.0  11.93   0.0  0.573  6.593  69.1  2.4786  1.0  273.0     21.0  391.99   9.67
502  0.04527   0.0  11.93   0.0  0.573  6.120  76.7  2.2875  1.0  273.0     21.0  396.90   9.08
503  0.06076   0.0  11.93   0.0  0.573  6.976  91.0  2.1675  1.0  273.0     21.0  396.90   5.64
504  0.10959   0.0  11.93   0.0  0.573  6.794  89.3  2.3889  1.0  273.0     21.0  393.45   6.48
505  0.04741   0.0  11.93   0.0  0.573  6.030  80.8  2.5050  1.0  273.0     21.0  396.90   7.88
506 rows × 13 columns
# With subplots=False, all five columns are drawn on a single axes
df_4.plot(style='-', alpha=0.4, figsize=(20, 8),
          subplots=False,
          layout=(1, 5),
          sharey=True)
plt.subplots_adjust(wspace=0, hspace=0)
III. Distribution data
x_df.head()
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33
y_df.head()
      0
0  24.0
1  21.6
2  34.7
3  33.4
4  36.2
3.1 Histograms
3.1.1 matplotlib
plt.hist(x_df['AGE'])
(array([ 14., 31., 29., 42., 32., 38., 39., 42., 71., 168.]),
array([ 2.9 , 12.61, 22.32, 32.03, 41.74, 51.45, 61.16, 70.87,
80.58, 90.29, 100. ]),
<a list of 10 Patch objects>)
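The three-part output above is the return value of plt.hist: the bin counts, the bin edges, and the drawn bar patches. As a sketch on synthetic data (a stand-in sample of 506 uniform values, not the real AGE column), the counts and edges match what np.histogram computes without drawing anything.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
sample = rng.uniform(0, 100, size=506)   # synthetic stand-in, same length as the dataset

fig, ax = plt.subplots()
counts, edges, patches = ax.hist(sample, bins=10)   # draws the bars

# Same binning, computed without plotting
np_counts, np_edges = np.histogram(sample, bins=10)
```

Ten bins always produce eleven edges, and the counts add up to the number of observations.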
3.1.2 seaborn
# Note: distplot is deprecated since seaborn 0.11; histplot and displot replace it
sns.distplot(x_df['AGE'])
<matplotlib.axes._subplots.AxesSubplot at 0x1e84721ce08>
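sns.distplot(x_df['AGE'],
             bins=10,          # number of histogram bins
             hist=True,        # draw the histogram
             kde=True,         # draw the kernel density estimate
             norm_hist=True,   # show density instead of counts
             rug=True,         # draw a tick for every observation
             vertical=False,   # keep the observed values on the x-axis
             color='y',
             axlabel='x')      # axis label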
<matplotlib.axes._subplots.AxesSubplot at 0x1a314d2f10>
sns.distplot(x_df['AGE'],
             rug=True,
             rug_kws={'color': 'g'},
             kde_kws={"color": "k", "lw": 1, "label": "AGE", 'linestyle': '--'},
             hist_kws={"histtype": "step", "linewidth": 1, "alpha": 1, "color": "g"})
<matplotlib.axes._subplots.AxesSubplot at 0x1a32621d90>
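3.1.3 Density plots
# Written against the seaborn API used in this notebook; since seaborn 0.11,
# shade, shade_lowest and n_levels are replaced by fill, thresh and levels
sns.kdeplot(x_df['AGE'], x_df['RM'],
            cbar=True,            # show a color bar
            shade=True,           # fill the contours
            cmap='Reds',
            shade_lowest=False,   # leave the lowest contour level unfilled
            n_levels=10           # number of contour levels
            )
sns.rugplot(x_df['AGE'], color="y", axis='x', alpha=0.5)
sns.rugplot(x_df['RM'], color="g", axis='y', alpha=0.5)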
<matplotlib.axes._subplots.AxesSubplot at 0x1a320a8990>
sns.kdeplot(x_df['AGE'][0:200], x_df['RM'][0:200], cmap='Greens',
            shade=True, shade_lowest=False)
sns.kdeplot(x_df['AGE'][200:400], x_df['RM'][200:400], cmap='Blues',
            shade=True, shade_lowest=False)
sns.rugplot(x_df['AGE'][0:400], color="g", axis='x', alpha=0.5)
sns.rugplot(x_df['RM'][0:400], color="r", axis='y', alpha=0.5)
<matplotlib.axes._subplots.AxesSubplot at 0x1a3254af50>
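What a density plot draws can be made concrete with scipy.stats.gaussian_kde, which estimates a smooth probability density from a sample. The sketch below uses a synthetic one-dimensional sample (made-up values) and checks that the estimated curve is non-negative and integrates to about 1. It illustrates the idea and is not necessarily the estimator seaborn uses internally.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=70, scale=15, size=500)   # synthetic data

kde = stats.gaussian_kde(sample)                  # Gaussian kernel density estimate

# Evaluate the density on a grid that extends well beyond the data range
grid = np.linspace(sample.min() - 60, sample.max() + 60, 2000)
density = kde(grid)

# Riemann-sum approximation of the area under the curve
area = float(np.sum(density) * (grid[1] - grid[0]))
```

Plotting grid against density with plt.plot gives the kind of curve that a one-dimensional KDE plot shows.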
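3.2 Scatter plots
3.2.1 matplotlib
plt.scatter(range(0, y_df.shape[0]),
            x_df['AGE'],
            marker='.',
            # marker areas must be non-negative, so use the absolute deviation from the mean price
            s=(y_df[0] - y_df[0].mean()).abs() * 10,
            c=y_df[0],        # cmap only takes effect when values are passed via c
            cmap='Reds_r',
            alpha=1)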
<matplotlib.collections.PathCollection at 0x1a33592390>
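The interplay of s, c, and cmap in plt.scatter is easier to see on a small example. This sketch uses synthetic data (target is a hypothetical stand-in for the house price): s controls the marker area and must be non-negative, c supplies the values that cmap turns into colors, and a color bar documents the mapping.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(100)
y = rng.uniform(0, 100, size=100)
target = rng.uniform(5, 50, size=100)            # hypothetical stand-in for the price

# Marker area grows with the distance from the mean; abs() keeps it non-negative
sizes = np.abs(target - target.mean()) * 10

fig, ax = plt.subplots()
sc = ax.scatter(x, y, s=sizes, c=target, cmap="Reds_r", alpha=0.8, marker=".")
fig.colorbar(sc, ax=ax, label="target")          # explains what the colors mean

n_points = len(sc.get_offsets())
```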
3.2.2 seaborn
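sns.jointplot(x=np.arange(y_df.shape[0]), y=x_df['AGE'],
              data=x_df,
              s=(y_df[0] - y_df[0].mean()).abs() * 10,   # non-negative marker sizes
              edgecolor="w", linewidth=1,
              kind='scatter',
              space=0.2,     # gap between the joint and marginal axes
              height=8,      # figure height; this parameter was formerly called size
              ratio=5,       # ratio of joint-axes height to marginal-axes height
              marginal_kws=dict(bins=15, rug=True)
              )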
<seaborn.axisgrid.JointGrid at 0x1a343e4c10>
sns.jointplot(x=x_df['LSTAT'], y=x_df['AGE'],
              data=x_df,
              s=(y_df[0] - y_df[0].mean()).abs() * 10,   # non-negative marker sizes
              edgecolor="w", linewidth=1,
              kind='scatter',
              marginal_kws=dict(bins=15, rug=True)
              )
<seaborn.axisgrid.JointGrid at 0x1a34bb3d10>
with sns.axes_style("white"):
    sns.jointplot(x=x_df['LSTAT'], y=x_df['AGE'], data=x_df, kind="hex", color="g",
                  marginal_kws=dict(bins=20))
g = sns.jointplot(x=x_df['LSTAT'], y=x_df['AGE'], data=x_df,
                  kind="kde", color="k",
                  shade_lowest=False)
g.plot_joint(plt.scatter, c="w", s=30, linewidth=1, marker="*")
<seaborn.axisgrid.JointGrid at 0x1a350b7cd0>
sns.set_style("white")
g = sns.JointGrid(x='LSTAT', y='RM', data=x_df)
g.plot_joint(plt.scatter, color='m', edgecolor='white')
g.ax_marg_x.hist(x_df['LSTAT'], color="b", alpha=.6)
g.ax_marg_y.hist(x_df['RM'], color="r", alpha=.6,
                 orientation="horizontal")
(array([ 2., 4., 14., 45., 177., 151., 69., 22., 13., 9.]),
array([3.561 , 4.0829, 4.6048, 5.1267, 5.6486, 6.1705, 6.6924, 7.2143,
7.7362, 8.2581, 8.78 ]),
<a list of 10 Patch objects>)
3.3 Scatter-plot matrices
3.3.1 matplotlib
from sklearn.datasets import load_iris
iris = load_iris()
iris
{'data': array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1],
[5.4, 3.7, 1.5, 0.2],
[4.8, 3.4, 1.6, 0.2],
[4.8, 3. , 1.4, 0.1],
[4.3, 3. , 1.1, 0.1],
[5.8, 4. , 1.2, 0.2],
[5.7, 4.4, 1.5, 0.4],
[5.4, 3.9, 1.3, 0.4],
[5.1, 3.5, 1.4, 0.3],
[5.7, 3.8, 1.7, 0.3],
[5.1, 3.8, 1.5, 0.3],
[5.4, 3.4, 1.7, 0.2],
[5.1, 3.7, 1.5, 0.4],
[4.6, 3.6, 1. , 0.2],
[5.1, 3.3, 1.7, 0.5],
[4.8, 3.4, 1.9, 0.2],
[5. , 3. , 1.6, 0.2],
[5. , 3.4, 1.6, 0.4],
[5.2, 3.5, 1.5, 0.2],
[5.2, 3.4, 1.4, 0.2],
[4.7, 3.2, 1.6, 0.2],
[4.8, 3.1, 1.6, 0.2],
[5.4, 3.4, 1.5, 0.4],
[5.2, 4.1, 1.5, 0.1],
[5.5, 4.2, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.2],
[5. , 3.2, 1.2, 0.2],
[5.5, 3.5, 1.3, 0.2],
[4.9, 3.6, 1.4, 0.1],
[4.4, 3. , 1.3, 0.2],
[5.1, 3.4, 1.5, 0.2],
[5. , 3.5, 1.3, 0.3],
[4.5, 2.3, 1.3, 0.3],
[4.4, 3.2, 1.3, 0.2],
[5. , 3.5, 1.6, 0.6],
[5.1, 3.8, 1.9, 0.4],
[4.8, 3. , 1.4, 0.3],
[5.1, 3.8, 1.6, 0.2],
[4.6, 3.2, 1.4, 0.2],
[5.3, 3.7, 1.5, 0.2],
[5. , 3.3, 1.4, 0.2],
[7. , 3.2, 4.7, 1.4],
[6.4, 3.2, 4.5, 1.5],
[6.9, 3.1, 4.9, 1.5],
[5.5, 2.3, 4. , 1.3],
[6.5, 2.8, 4.6, 1.5],
[5.7, 2.8, 4.5, 1.3],
[6.3, 3.3, 4.7, 1.6],
[4.9, 2.4, 3.3, 1. ],
[6.6, 2.9, 4.6, 1.3],
[5.2, 2.7, 3.9, 1.4],
[5. , 2. , 3.5, 1. ],
[5.9, 3. , 4.2, 1.5],
[6. , 2.2, 4. , 1. ],
[6.1, 2.9, 4.7, 1.4],
[5.6, 2.9, 3.6, 1.3],
[6.7, 3.1, 4.4, 1.4],
[5.6, 3. , 4.5, 1.5],
[5.8, 2.7, 4.1, 1. ],
[6.2, 2.2, 4.5, 1.5],
[5.6, 2.5, 3.9, 1.1],
[5.9, 3.2, 4.8, 1.8],
[6.1, 2.8, 4. , 1.3],
[6.3, 2.5, 4.9, 1.5],
[6.1, 2.8, 4.7, 1.2],
[6.4, 2.9, 4.3, 1.3],
[6.6, 3. , 4.4, 1.4],
[6.8, 2.8, 4.8, 1.4],
[6.7, 3. , 5. , 1.7],
[6. , 2.9, 4.5, 1.5],
[5.7, 2.6, 3.5, 1. ],
[5.5, 2.4, 3.8, 1.1],
[5.5, 2.4, 3.7, 1. ],
[5.8, 2.7, 3.9, 1.2],
[6. , 2.7, 5.1, 1.6],
[5.4, 3. , 4.5, 1.5],
[6. , 3.4, 4.5, 1.6],
[6.7, 3.1, 4.7, 1.5],
[6.3, 2.3, 4.4, 1.3],
[5.6, 3. , 4.1, 1.3],
[5.5, 2.5, 4. , 1.3],
[5.5, 2.6, 4.4, 1.2],
[6.1, 3. , 4.6, 1.4],
[5.8, 2.6, 4. , 1.2],
[5. , 2.3, 3.3, 1. ],
[5.6, 2.7, 4.2, 1.3],
[5.7, 3. , 4.2, 1.2],
[5.7, 2.9, 4.2, 1.3],
[6.2, 2.9, 4.3, 1.3],
[5.1, 2.5, 3. , 1.1],
[5.7, 2.8, 4.1, 1.3],
[6.3, 3.3, 6. , 2.5],
[5.8, 2.7, 5.1, 1.9],
[7.1, 3. , 5.9, 2.1],
[6.3, 2.9, 5.6, 1.8],
[6.5, 3. , 5.8, 2.2],
[7.6, 3. , 6.6, 2.1],
[4.9, 2.5, 4.5, 1.7],
[7.3, 2.9, 6.3, 1.8],
[6.7, 2.5, 5.8, 1.8],
[7.2, 3.6, 6.1, 2.5],
[6.5, 3.2, 5.1, 2. ],
[6.4, 2.7, 5.3, 1.9],
[6.8, 3. , 5.5, 2.1],
[5.7, 2.5, 5. , 2. ],
[5.8, 2.8, 5.1, 2.4],
[6.4, 3.2, 5.3, 2.3],
[6.5, 3. , 5.5, 1.8],
[7.7, 3.8, 6.7, 2.2],
[7.7, 2.6, 6.9, 2.3],
[6. , 2.2, 5. , 1.5],
[6.9, 3.2, 5.7, 2.3],
[5.6, 2.8, 4.9, 2. ],
[7.7, 2.8, 6.7, 2. ],
[6.3, 2.7, 4.9, 1.8],
[6.7, 3.3, 5.7, 2.1],
[7.2, 3.2, 6. , 1.8],
[6.2, 2.8, 4.8, 1.8],
[6.1, 3. , 4.9, 1.8],
[6.4, 2.8, 5.6, 2.1],
[7.2, 3. , 5.8, 1.6],
[7.4, 2.8, 6.1, 1.9],
[7.9, 3.8, 6.4, 2. ],
[6.4, 2.8, 5.6, 2.2],
[6.3, 2.8, 5.1, 1.5],
[6.1, 2.6, 5.6, 1.4],
[7.7, 3. , 6.1, 2.3],
[6.3, 3.4, 5.6, 2.4],
[6.4, 3.1, 5.5, 1.8],
[6. , 3. , 4.8, 1.8],
[6.9, 3.1, 5.4, 2.1],
[6.7, 3.1, 5.6, 2.4],
[6.9, 3.1, 5.1, 2.3],
[5.8, 2.7, 5.1, 1.9],
[6.8, 3.2, 5.9, 2.3],
[6.7, 3.3, 5.7, 2.5],
[6.7, 3. , 5.2, 2.3],
[6.3, 2.5, 5. , 1.9],
[6.5, 3. , 5.2, 2. ],
[6.2, 3.4, 5.4, 2.3],
[5.9, 3. , 5.1, 1.8]]),
'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]),
'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10'),
'DESCR': '.. _iris_dataset:\n\nIris plants dataset\n--------------------\n\n**Data Set Characteristics:**\n\n :Number of Instances: 150 (50 in each of three classes)\n :Number of Attributes: 4 numeric, predictive attributes and the class\n :Attribute Information:\n - sepal length in cm\n - sepal width in cm\n - petal length in cm\n - petal width in cm\n - class:\n - Iris-Setosa\n - Iris-Versicolour\n - Iris-Virginica\n \n :Summary Statistics:\n\n ============== ==== ==== ======= ===== ====================\n Min Max Mean SD Class Correlation\n ============== ==== ==== ======= ===== ====================\n sepal length: 4.3 7.9 5.84 0.83 0.7826\n sepal width: 2.0 4.4 3.05 0.43 -0.4194\n petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)\n petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)\n ============== ==== ==== ======= ===== ====================\n\n :Missing Attribute Values: None\n :Class Distribution: 33.3% for each of 3 classes.\n :Creator: R.A. Fisher\n :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)\n :Date: July, 1988\n\nThe famous Iris database, first used by Sir R.A. Fisher. The dataset is taken\nfrom Fisher\'s paper. Note that it\'s the same as in R, but not as in the UCI\nMachine Learning Repository, which has two wrong data points.\n\nThis is perhaps the best known database to be found in the\npattern recognition literature. Fisher\'s paper is a classic in the field and\nis referenced frequently to this day. (See Duda & Hart, for example.) The\ndata set contains 3 classes of 50 instances each, where each class refers to a\ntype of iris plant. One class is linearly separable from the other 2; the\nlatter are NOT linearly separable from each other.\n\n.. topic:: References\n\n - Fisher, R.A. "The use of multiple measurements in taxonomic problems"\n Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to\n Mathematical Statistics" (John Wiley, NY, 1950).\n - Duda, R.O., & Hart, P.E. 
(1973) Pattern Classification and Scene Analysis.\n (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.\n - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System\n Structure and Classification Rule for Recognition in Partially Exposed\n Environments". IEEE Transactions on Pattern Analysis and Machine\n Intelligence, Vol. PAMI-2, No. 1, 67-71.\n - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions\n on Information Theory, May 1972, 431-433.\n - See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II\n conceptual clustering system finds 3 classes in the data.\n - Many, many more ...',
'feature_names': ['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)'],
'filename': '/Users/edz/opt/anaconda3/lib/python3.7/site-packages/sklearn/datasets/data/iris.csv'}
iris_x = pd.DataFrame(iris['data'], columns=iris['feature_names'])
iris_x.head()
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
iris_y = pd.DataFrame(iris['target'])
iris_y
     0
0    0
1    0
2    0
3    0
4    0
...  ...
145  2
146  2
147  2
148  2
149  2
150 rows × 1 columns
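from pandas.plotting import scatter_matrix
scatter_matrix(iris_x, figsize=(10, 6),
               marker='o',
               diagonal='kde',          # density curves on the diagonal
               alpha=0.5,
               range_padding=0.5,
               c=iris['target'],        # color points by class; cmap has no effect without c
               cmap='summer')           # colormap names are case-sensitive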
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x1a412669d0>,
<matplotlib.axes._subplots.AxesSubplot object at 0x1a4488cf90>,
<matplotlib.axes._subplots.AxesSubplot object at 0x1a445d9f90>,
<matplotlib.axes._subplots.AxesSubplot object at 0x1a45a3f850>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x1a3d349bd0>,
<matplotlib.axes._subplots.AxesSubplot object at 0x1a42c02890>,
<matplotlib.axes._subplots.AxesSubplot object at 0x1a424ecd10>,
<matplotlib.axes._subplots.AxesSubplot object at 0x1a54eab8d0>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x1a54ebc450>,
<matplotlib.axes._subplots.AxesSubplot object at 0x1a376e5dd0>,
<matplotlib.axes._subplots.AxesSubplot object at 0x1a53ad8c90>,
<matplotlib.axes._subplots.AxesSubplot object at 0x1a4f219950>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x1a379d5cd0>,
<matplotlib.axes._subplots.AxesSubplot object at 0x1a3863b990>,
<matplotlib.axes._subplots.AxesSubplot object at 0x1a51013d10>,
<matplotlib.axes._subplots.AxesSubplot object at 0x1a376b99d0>]],
dtype=object)
3.3.2 seaborn
sns.pairplot(iris_x.join(iris_y),
             kind='reg',
             diag_kind="kde",
             hue=0,            # the joined target column is named 0
             palette="husl",
             markers=["o", "s", "D"],
             height=2,         # the size parameter has been renamed to height
             )
<seaborn.axisgrid.PairGrid at 0x1a4bb45b90>
iris['feature_names']
['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)']
sns.pairplot(iris_x.join(iris_y), vars=['sepal length (cm)', 'petal length (cm)'],
             kind='reg', diag_kind="kde",
             hue=0, palette="husl")
<seaborn.axisgrid.PairGrid at 0x1a41860a90>
IV. Visualizing categorical data
data['feature_names']
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')
4.1 Categorical scatter plots
sns.stripplot(x="CHAS",
              y=0,
              data=x_df.join(y_df),
              jitter=True,   # spread points horizontally so they overlap less
              size=5, edgecolor='w', linewidth=1, marker='o'
              )
<matplotlib.axes._subplots.AxesSubplot at 0x1a3d276910>
sns.stripplot(x="CHAS",
              y=0,
              hue="RAD",
              data=x_df.join(y_df),
              jitter=True)
<matplotlib.axes._subplots.AxesSubplot at 0x1a43defc10>
sns.stripplot(x="RAD",
              y=0,
              hue="CHAS",
              data=x_df.join(y_df),
              jitter=True,
              palette="Set2",
              dodge=True,    # separate the hue levels side by side
              )
<matplotlib.axes._subplots.AxesSubplot at 0x1a43defa10>
print(x_df['RAD'].value_counts())
sns.stripplot(x='RAD', y=0, data=x_df.join(y_df), jitter=True,
              order=[4.0, 5.0, 24.0])
24.0 132
5.0 115
4.0 110
3.0 38
6.0 26
8.0 24
2.0 24
1.0 20
7.0 17
Name: RAD, dtype: int64
<matplotlib.axes._subplots.AxesSubplot at 0x1a394e9490>
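The order list above was typed in by hand after reading the value_counts output. As a small sketch, the same list can be derived programmatically. The toy Series below is made up so that the example is self-contained; with the real data, x_df['RAD'] would take its place.

```python
import pandas as pd

# Toy stand-in for x_df['RAD']: 24.0 is most frequent, then 5.0, then 4.0
rad = pd.Series([24.0] * 6 + [5.0] * 5 + [4.0] * 4 + [3.0] * 2 + [1.0], name="RAD")

counts = rad.value_counts()        # sorted from most to least frequent
top3 = sorted(counts.index[:3])    # three most frequent categories, in ascending order
```

The resulting list can then be passed as order=top3 to sns.stripplot, so that only the most frequent categories are drawn.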