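Last week I got as far as classes in Python and then paused, because I wanted to look at data analysis with Python: my girlfriend asked me about it last week and I realized I knew nothing about the subject. Enough preamble.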
1.matplotlib
>>> import matplotlib.pyplot as plt
>>> plt.plot([2,3,4,5])
[<matplotlib.lines.Line2D object at 0x0000015B7DB51FD0>]
>>> plt.ylabel('grade')
Text(0, 0.5, 'grade')
>>> plt.show()
>>> plt.plot([1,2,3],[7,9,9])
[<matplotlib.lines.Line2D object at 0x0000015B7D5823C8>]
>>> plt.ylabel('Grade')
Text(0, 0.5, 'Grade')
>>> plt.axis([-1,5,0,10])
[-1, 5, 0, 10]
>>> plt.show()
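The pyplot plotting area: plt.subplot(nrows,ncols,plot_number)
plt.plot(x,y,format_string,**kwargs) x: x-axis data, a list or array, optional; y: same as x; format_string: a format string controlling the curve, optional. Further (x,y,format_string) groups may follow as extra positional arguments; when plotting several curves, the x of each curve cannot be omitted.
format_string: optional, made up of a color character, a line-style character and a marker character.
Colors: 'b' blue, 'm' magenta, 'g' green, 'y' yellow, 'r' red, 'k' black, 'c' cyan, 'w' white
Line styles: '-' solid; '--' dashed; '-.' dash-dot; ':' dotted; '' no line
Markers: '.' point; ',' pixel; 'o' filled circle; 'v' triangle down; '^' triangle up; '>' triangle right; '<' triangle left
**kwargs: keyword properties such as color='green', linestyle='dashed', marker='o', markerfacecolor='blue', which control the color, line style, marker style and marker face color respectively.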
plt.plot(a,a*1.5,'go-',a,a*2.5,'rx',a,a*3.5,'*')
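To make the format-string rules concrete, here is a small sketch that draws three curves in a single call and reads the styles back from the returned Line2D objects. The array a and the Agg backend (used only so the script runs without a display) are my own assumptions, not part of the original session.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no window is needed
import matplotlib.pyplot as plt
import numpy as np

a = np.arange(10)  # sample x data, chosen only for illustration

# Three (x, y, format_string) groups in one call; each format string
# combines a color character with a marker and/or a line style.
lines = plt.plot(a, a * 1.5, 'go-', a, a * 2.5, 'rx', a, a * 3.5, '*')

# Read back what each format string was parsed into.
for line in lines:
    print(line.get_color(), line.get_marker(), line.get_linestyle())
```

The first curve is green with circle markers and a solid line; the other two give only a marker, so they are drawn as unconnected points.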
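2.pyplot does not display Chinese text by default; the font has to be changed through rcParams.
matplotlib.rcParams['font.family']='SimHei' changes the font globally.
'font.family' sets the font name; 'font.style' sets the font style, 'normal' or 'italic';
'font.size' sets the font size, an integer.
'SimHei' Chinese HeiTi (sans-serif); 'KaiTi' Chinese KaiTi (regular script); 'LiSu' Chinese LiShu (clerical script)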
>>> import matplotlib
>>> import numpy as np
>>> matplotlib.rcParams['font.family']='STSong'
>>> matplotlib.rcParams['font.size']=20
>>> a = np.arange(0.0,5.0,0.02)
>>> plt.xlabel('横轴:时间')
Text(0.5, 0, '横轴:时间')
>>> plt.ylabel('纵轴:振幅')
Text(0, 0.5, '纵轴:振幅')
>>> plt.plot(a,np.cos(2*np.pi*a),'r--')
[<matplotlib.lines.Line2D object at 0x0000015B7F618518>]
>>> plt.show()
Alternatively, add a fontproperties argument wherever Chinese text is output:
>>> b = np.arange(0.0,1.0,0.2)
>>> plt.xlabel("横轴:时间",fontproperties = 'SimHei',fontsize=20)
Text(0.5, 0, '横轴:时间')
>>> plt.ylabel("纵轴:振幅",fontproperties = 'SimHei',fontsize=20)
Text(0, 0.5, '纵轴:振幅')
>>> plt.xlabel("横轴:时间",fontproperties = 'SimHei',fontsize=20)
Text(0.5, 0, '横轴:时间')
>>> plt.plot(a,np.cos(2*np.pi*a),'r--')
[<matplotlib.lines.Line2D object at 0x0000015B0014C780>]
>>> plt.show()
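Whether a font such as SimHei is installed depends on the machine, so a script may want to check before using it. The sketch below is one way to do that; the candidate list, the fallback label and the Agg backend are my assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no window is needed
import matplotlib.pyplot as plt
from matplotlib import font_manager

# Font names taken from the notes above; which ones exist is system-dependent.
candidates = ['SimHei', 'KaiTi', 'LiSu', 'STSong']
installed = {f.name for f in font_manager.fontManager.ttflist}
chosen = next((name for name in candidates if name in installed), None)

fig, ax = plt.subplots()
if chosen is not None:
    # Local override: only this label uses the Chinese font.
    ax.set_xlabel('横轴:时间', fontproperties=chosen, fontsize=20)
else:
    # None of the candidates is installed; fall back to an ASCII label.
    ax.set_xlabel('x axis: time', fontsize=20)
print('font used:', chosen)
```

Unlike the rcParams approach, this only affects the one label, so the rest of the figure keeps its default font.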
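3.pyplot text functions
plt.xlabel() adds a text label to the x-axis; plt.ylabel() adds a text label to the y-axis
plt.title() adds a title to the whole figure; plt.text() adds text at an arbitrary position
plt.annotate() adds an annotation with an arrow
plt.annotate(s,xy = arrow_crd,xytext = text_crd, arrowprops = dict)
s: the annotation string
xy: the position the arrow points to
xytext: the position of the text
arrowprops: a dict of arrow display properties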
>>> plt.plot(a,np.cos(2*np.pi*a),'r--')
[<matplotlib.lines.Line2D object at 0x0000015B001A0358>]
>>> plt.xlabel("横轴:时间",fontproperties = 'SimHei',fontsize=20)
Text(0.5, 0, '横轴:时间')
>>> plt.ylabel("纵轴:振幅",fontproperties = 'SimHei',fontsize=20)
Text(0, 0.5, '纵轴:振幅')
>>> plt.title(r'正弦波实例 $y=cos(2\pi x)$',fontproperties = 'SimHei',fontsize=25)
Text(0.5, 1.0, '正弦波实例 $y=cos(2\\pi x)$')
>>> plt.annotate(r'$\mu=100$',xy=(0.5,0.5),xytext = (0.8,1.0),arrowprops(facecolor = 'black',shrink=0.1,width=2))
SyntaxError: positional argument follows keyword argument
>>> plt.annotate(r'$\mu=100$',xy=(0.5,0.5),xytext = (0.8,1.0),arrowprops=dict(facecolor = 'black',shrink=0.1,width=2))
>>> plt.axis([-1,1,-1,1])
[-1, 1, -1, 1]
>>> plt.grid(True)
>>> plt.show()
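The SyntaxError in the session comes from writing arrowprops(...) as if it were a function call; it has to be a keyword argument holding a dict. A minimal script version of the same annotation might look like the following, with the Agg backend being my assumption.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no window is needed
import matplotlib.pyplot as plt
import numpy as np

a = np.arange(0.0, 5.0, 0.02)
fig, ax = plt.subplots()
ax.plot(a, np.cos(2 * np.pi * a), 'r--')

# xy is where the arrow points, xytext is where the text sits,
# and arrowprops is a dict passed by keyword.
ann = ax.annotate(r'$\mu=100$', xy=(0.5, 0.5), xytext=(0.8, 1.0),
                  arrowprops=dict(facecolor='black', shrink=0.1, width=2))
ax.axis([-1, 1, -1, 1])
ax.grid(True)
print(ann.xy, ann.get_text())
```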
4.plt.subplot2grid()
plt.subplot2grid(GridSpec,CurSpec,colspan=1,rowspan =1)
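Idea: define a grid, select a cell, and specify how many rows and columns the selected region spans; numbering starts from 0.
plt.subplot2grid((3,3),(1,0),colspan=2): (3,3) divides the figure into a 3*3 grid, (1,0) is the selected cell, and colspan is the number of columns the region extends over.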
plt.subplot2grid((3,3),(0,0),colspan =3);
plt.subplot2grid((3,3),(1,0),colspan=2);
plt.subplot2grid((3,3),(1,2),rowspan=2);
plt.subplot2grid((3,3),(2,0))
plt.subplot2grid((3,3),(2,1))
Equivalent to:
import matplotlib.gridspec as gridspec
gs = gridspec.GridSpec(3,3)
ax1 = plt.subplot(gs[0,:])
ax2 = plt.subplot(gs[1,:-1])
ax3 = plt.subplot(gs[1:,-1])
ax4 = plt.subplot(gs[2,0])
ax5 = plt.subplot(gs[2,1])
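As a quick check of the layout, here is a sketch that builds the five regions with subplot2grid and labels each one so the spans are visible. The titles and the Agg backend are my additions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no window is needed
import matplotlib.pyplot as plt

fig = plt.figure()
ax1 = plt.subplot2grid((3, 3), (0, 0), colspan=3)  # top row, all three columns
ax2 = plt.subplot2grid((3, 3), (1, 0), colspan=2)  # middle row, first two columns
ax3 = plt.subplot2grid((3, 3), (1, 2), rowspan=2)  # right column, lower two rows
ax4 = plt.subplot2grid((3, 3), (2, 0))             # bottom-left cell
ax5 = plt.subplot2grid((3, 3), (2, 1))             # bottom-middle cell

# Label every region so the layout is easy to read in the saved figure.
for i, ax in enumerate(fig.axes, start=1):
    ax.set_title('ax%d' % i)
print(len(fig.axes))
```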
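4.plt.plot(x,y,fmt,…) draws a line plot;
plt.boxplot(data,notch,positions) draws a box plot;
plt.bar(x,height,width,bottom) draws a bar chart;
plt.barh(y,width,height,left) draws a horizontal bar chart;
plt.polar(theta,r) draws a polar plot;
plt.pie(data,explode) draws a pie chart;
plt.psd(x,NFFT=256,pad_to,Fs) draws a power spectral density plot;
plt.specgram(x,NFFT=256,pad_to,Fs) draws a spectrogram;
plt.cohere(x,y,NFFT=256,Fs) draws the coherence between x and y;
plt.scatter(x,y) draws a scatter plot, where x and y have the same length;
plt.step(x,y,where) draws a step plot;
plt.hist(x,bins,density) draws a histogram (older Matplotlib versions called the density argument normed).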
5.Drawing a pie chart
>>> labels = 'Frogs','Hogs','Dog','Logs'
>>> sizes = [15,35,45,5]
>>> explode=(0,0.1,0,0)
>>> plt.pie(sizes,explode=explode,labels=labels,autopct='%1.1f%%',shadow=False,startangle=90)
([<matplotlib.patches.Wedge object at 0x0000015B7F618358>, <matplotlib.patches.Wedge object at 0x0000015B7F6187F0>, <matplotlib.patches.Wedge object at 0x0000015B7D8CF588>, <matplotlib.patches.Wedge object at 0x0000015B7D8CFEB8>], [Text(-0.4993895680663527, 0.9801071672559598, 'Frogs'), Text(-1.0692078188246834, -0.5447886197087484, 'Hogs'), Text(1.086457168210212, -0.17207795223283906, 'Dog'), Text(0.1720779903783871, 1.0864571621685486, 'Logs')], [Text(-0.2723943098543742, 0.5346039094123416, '15.0%'), Text(-0.6237045609810652, -0.31779336149676984, '35.0%'), Text(0.5926130008419337, -0.0938607012179122, '45.0%'), Text(0.09386072202457478, 0.592612997546481, '5.0%')])
>>> plt.show()
Drawing a histogram
>>> np.random.seed(0)
>>> mu,sigma = 100,20
>>> a = np.random.normal(mu,sigma,size=100)
>>> # The original session passed normed=1 here and got a MatplotlibDeprecationWarning: 'normed' was deprecated in Matplotlib 2.1 in favour of 'density'.
>>> plt.hist(a,20,density=True,histtype = 'stepfilled',facecolor='b',alpha =0.75)
(array([0.00186103, 0. , 0. , 0.00744411, 0.00558308,
0.00186103, 0.00930513, 0.00558308, 0.01302719, 0.02419335,
0.01861027, 0.02233232, 0.00930513, 0.01674924, 0.01861027,
0.00930513, 0.01116616, 0.00558308, 0.00186103, 0.00372205]), array([ 37.6140161 , 42.98739418, 48.36077226, 53.73415033,
59.10752841, 64.48090649, 69.85428456, 75.22766264,
80.60104072, 85.97441879, 91.34779687, 96.72117495,
102.09455302, 107.4679311 , 112.84130918, 118.21468725,
123.58806533, 128.96144341, 134.33482148, 139.70819956,
145.08157764]), <a list of 1 Patch objects>)
>>> plt.title('Histogram')
Text(0.5, 1.0, 'Histogram')
>>> plt.show()
>>> plt.hist(a,10,density=True,histtype = 'stepfilled',facecolor='b',alpha =0.75)  # 10 is the number of bins
(array([0.00093051, 0.00372205, 0.00372205, 0.00744411, 0.01861027,
0.02047129, 0.01302719, 0.0139577 , 0.00837462, 0.00279154]), array([ 37.6140161 , 48.36077226, 59.10752841, 69.85428456,
80.60104072, 91.34779687, 102.09455302, 112.84130918,
123.58806533, 134.33482148, 145.08157764]), <a list of 1 Patch objects>)
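# normed=1 (density=True in current Matplotlib) normalizes the bar heights to a probability density; 0 (False) gives the raw count in each bin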
>>> plt.show()
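To see the difference between counts and a normalized histogram in numbers, here is a short sketch using the same seed and distribution as above. It uses the density keyword, and the Agg backend is my assumption.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no window is needed
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(0)
a = np.random.normal(100, 20, size=100)

# density=False: each bar height is the number of samples in that bin.
counts, edges, _ = plt.hist(a, 10, density=False)
# density=True: heights are scaled so that the total bar area equals 1.
heights, _, _ = plt.hist(a, 10, density=True)

area = (heights * np.diff(edges)).sum()
print(counts.sum(), area)
```

So the normalized values are not percentages per bin; they only become probabilities once multiplied by the bin width.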
Drawing a polar plot
>>> N = 20
>>> theta = np.linspace(0.0,2*np.pi,N,endpoint=False)
>>> radii = 10*np.random.rand(N)
>>> width = np.pi/4*np.random.rand(N)
>>> ax = plt.subplot(111,projection='polar')
>>> bars = ax.bar(theta,radii,width=width,bottom=0.0)
>>> for r,bar in zip(radii,bars):
...     bar.set_facecolor(plt.cm.viridis(r/10.))
...     bar.set_alpha(0.5)
...
>>> plt.show()  # pay close attention to how theta, radii and width are used
Drawing a scatter plot
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> fig,ax = plt.subplots()
>>> ax.plot(10*np.random.randn(100),10*np.random.randn(100),'o')
>>> ax.set_title('Simple Scatter')
Text(0.5, 1.0, 'Simple Scatter')
>>> plt.show()
6.pandas
>>> import pandas as pd
>>> a =pd.Series([9,6,7,8])
>>> a
0 9
1 6
2 7
3 8
dtype: int64
>>> b= pd.Series([9,8,74],index=['a','b','c'])
>>> b
a 9
b 8
c 74
dtype: int64
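NumPy, covered earlier, is about storing and accessing data and then operating on it, whereas pandas is about making full use of the data's index. The pandas data types are Series and DataFrame.
7.A Series can be created from the following types:
a list, a scalar value, a Python dict, an ndarray,
or other functions.
From a scalar value:
>>> s = pd.Series(25,index=['a','b'])
>>> s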
a 25
b 25
dtype: int64
From a dict:
>>> d = pd.Series({'a':9,'b':10})
>>> d
a 9
b 10
dtype: int64
>>> d = pd.Series({'a':9,'b':10},index=['b','a','f'])
>>> d
b 10.0
a 9.0
f NaN
dtype: float64
From an ndarray:
>>> import numpy as np
>>> n = pd.Series(np.arange(6))
>>> n
0 0
1 1
2 2
3 3
4 4
5 5
dtype: int32
>>> n = pd.Series(np.arange(6),index=['c','f','g','r','y','u'])
>>> n
c 0
f 1
g 2
r 3
y 4
u 5
dtype: int32
8.Series operations: indexing. A Series keeps a default positional index alongside any custom labels.
>>> n.index
Index(['c', 'f', 'g', 'r', 'y', 'u'], dtype='object')
>>> n.values
array([0, 1, 2, 3, 4, 5])
>>> n[2]
2
>>> n['g']
2
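Because n[2] and n['g'] mix positional and label lookup in the same bracket syntax, it can help to be explicit. The sketch below uses .iloc for positions and .loc for labels, which are accessors I am adding here; the session itself only used plain brackets.

```python
import numpy as np
import pandas as pd

n = pd.Series(np.arange(6), index=['c', 'f', 'g', 'r', 'y', 'u'])

by_label = n.loc['g']      # lookup by custom label
by_position = n.iloc[2]    # lookup by default positional index
first_three = n.iloc[:3]   # positional slice keeps the labels

print(by_label, by_position)
print(first_three.index.tolist())
```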
>>> b = pd.Series([9,8,7,6],index = ['a','b','c','d'])
>>> b
a 9
b 8
c 7
d 6
dtype: int64
>>> b[3]
6
>>> b[:3]
a 9
b 8
c 7
dtype: int64
>>> b[b>b.median()]
a 9
b 8
dtype: int64
>>> np.exp(b)
a 8103.083928
b 2980.957987
c 1096.633158
d 403.428793
dtype: float64
>>> 'd' in b
True
>>> 'p' in b  # checks whether the label exists in the index, much like with a dict
False
>>> b.get('f',100)
100
>>> b.get('c',100)  # dict-style lookup with a default value
7
9.Series alignment:
>>> import pandas as pd
>>> a = pd.Series([1,2,3],['c','d','e'])
>>> b = pd.Series([9,8,7,6],['a','b','c','d'])
>>> a+b
a NaN
b NaN
c 8.0
d 8.0
e NaN
dtype: float64
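The NaN values appear because labels present in only one Series have nothing to be added to. If treating the missing side as 0 is acceptable, the .add() method with fill_value is one option. This is a sketch of that alternative, not something from the original session.

```python
import pandas as pd

a = pd.Series([1, 2, 3], ['c', 'd', 'e'])
b = pd.Series([9, 8, 7, 6], ['a', 'b', 'c', 'd'])

plain = a + b                    # unmatched labels become NaN
total = a.add(b, fill_value=0)   # unmatched labels are treated as 0

print(plain)
print(total)
```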
10.The name attribute: both a Series object and its index can have a name, stored in the .name attribute.
>>> b.name = 'Series 对象'  # "Series object"
>>> b.index.name = '索引列'  # "index column"
>>> b
索引列
a 9
b 8
c 7
d 6
Name: Series 对象, dtype: int64
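Modifying a Series: a Series object can be modified at any time, and the change takes effect immediately. A Series is a one-dimensional array with "labels"; it is used much like a dict, but also has the alignment property.
11.DataFrame: several columns of data sharing one index, i.e. a table type. It can be created from a two-dimensional ndarray, or from a dict of one-dimensional ndarrays, lists, dicts, tuples or Series.
From a two-dimensional ndarray: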
>>> import pandas as pd
>>> import numpy as np
>>> d = pd.DataFrame(np.arange(10).reshape(2,5))
>>> d
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
From a dict of one-dimensional ndarray (Series) objects:
>>> dt = {'one':pd.Series([1,2,3],index=['a','b','c']),'two':pd.Series([9,8,7,6],index=['a','b','c','d'])}
>>> d = pd.DataFrame(dt)
>>> d
one two
a 1.0 9
b 2.0 8
c 3.0 7
d NaN 6
>>> pd.DataFrame(dt,index=['b','c','d'],columns=['two','three'])
two three
b 8 NaN
c 7 NaN
d 6 NaN
From a dict of lists:
>>> d1 = {'one':[1,2,3,4],'two':[9,8,7,6]}
>>> d = pd.DataFrame(d1,index = ['a','b','c','d'])
>>> d
one two
a 1 9
b 2 8
c 3 7
d 4 6
>>> d1 = {'城市':['北京','上海','广州','深圳','沈阳'],'环比':[101.5,101.2,101.3,102.0,100.1],'同比':[120.7,127.3,119.4,140.9,101.4]}
>>> d = pd.DataFrame(d1,index=['c1','c2','c3','c4','c5'])
>>> d
城市 环比 同比
c1 北京 101.5 120.7
c2 上海 101.2 127.3
c3 广州 101.3 119.4
c4 深圳 102.0 140.9
c5 沈阳 100.1 101.4
>>> d['c4']['环比']
Traceback (most recent call last):
File "D:\Python36\lib\site-packages\pandas\core\indexes\base.py", line 2646, in get_loc
return self._engine.get_loc(key)
File "pandas\_libs\index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1622, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'c4'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<pyshell#84>", line 1, in <module>
d['c4']['环比']
File "D:\Python36\lib\site-packages\pandas\core\frame.py", line 2800, in __getitem__
indexer = self.columns.get_loc(key)
File "D:\Python36\lib\site-packages\pandas\core\indexes\base.py", line 2648, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas\_libs\index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1622, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'c4'
>>> d['环比']['c4']
102.0
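The KeyError above happens because plain brackets on a DataFrame select a column first, so the order has to be column and then row. With .loc the order is row and then column, which I find easier to read; the sketch below shows both on the same data. For reference, the column names mean 城市 = city, 环比 = month-on-month, 同比 = year-on-year.

```python
import pandas as pd

d1 = {'城市': ['北京', '上海', '广州', '深圳', '沈阳'],
      '环比': [101.5, 101.2, 101.3, 102.0, 100.1],
      '同比': [120.7, 127.3, 119.4, 140.9, 101.4]}
d = pd.DataFrame(d1, index=['c1', 'c2', 'c3', 'c4', 'c5'])

v1 = d['环比']['c4']       # brackets: column first, then row label
v2 = d.loc['c4', '环比']   # .loc: row label first, then column
row = d.loc['c4']          # a whole row, returned as a Series

print(v1, v2)
print(row['城市'])
```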
>>>
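10.The .reindex() method changes or reorders the index of a Series or DataFrame;
.reindex(index=[],columns = []…) parameters:
index,columns: the new custom row and column indexes;
fill_value: the value used to fill missing positions created by reindexing;
method: the fill method; ffill fills forward from the previous value, bfill fills backward from the next value;
limit: the maximum number of consecutive fills; copy: True by default, producing a new object; when False, no copy is made if the new and old indexes are equal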
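>>> newd = d.columns.insert(3,'定基')  # reconstructed: the input line is missing from the original session, and this is one call that yields the output below
>>> newd
Index(['城市', '环比', '同比', '定基'], dtype='object')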
>>> newd = d.reindex(columns=newd,fill_value=[100,200,300,400,500])
>>> newd
城市 环比 同比 定基
c5 沈阳 100.1 101.4 [100, 200, 300, 400, 500]
c4 深圳 102.0 140.9 [100, 200, 300, 400, 500]
c3 广州 101.3 119.4 [100, 200, 300, 400, 500]
c2 上海 101.2 127.3 [100, 200, 300, 400, 500]
c1 北京 101.5 120.7 [100, 200, 300, 400, 500]
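Passing a list as fill_value puts the whole list into every cell, as the output above shows. A scalar fill_value, or method='ffill' on an ordered index, is probably closer to what one usually wants. The small objects below are made up for illustration.

```python
import pandas as pd

# Forward fill: new positions take the last known value before them.
s = pd.Series([1, 2, 3], index=[0, 2, 4])
filled = s.reindex(range(5), method='ffill')

# Reorder columns and add a new one filled with a single scalar.
df = pd.DataFrame({'one': [1, 2], 'two': [3, 4]}, index=['a', 'b'])
wider = df.reindex(columns=['two', 'one', 'three'], fill_value=0)

print(filled.tolist())
print(wider)
```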
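Operations on Index objects:
.append(idx) concatenates another Index object, producing a new Index object;
.difference(idx) computes the set difference, producing a new Index object;
.intersection(idx) computes the intersection;
.union(idx) computes the union;
.delete(loc) deletes the element at position loc;
.insert(loc,e) inserts element e at position loc.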
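A quick sketch of these Index operations on two small made-up indexes. Every call returns a new Index and leaves the original unchanged.

```python
import pandas as pd

idx = pd.Index(['a', 'b', 'c'])
other = pd.Index(['b', 'd'])

appended = idx.append(other)        # plain concatenation, duplicates kept
diff = idx.difference(other)        # labels in idx but not in other
common = idx.intersection(other)    # labels in both
merged = idx.union(other)           # labels in either
shorter = idx.delete(0)             # drop the element at position 0
longer = idx.insert(1, 'x')         # insert 'x' at position 1

print(appended.tolist(), diff.tolist(), common.tolist())
print(merged.tolist(), shorter.tolist(), longer.tolist())
```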
.drop() removes the specified row or column labels from a Series or DataFrame
>>> a = pd.Series([9,8,7,6],index=['a','b','c','d'])
>>> a
a 9
b 8
c 7
d 6
dtype: int64
>>> a.drop(['a','c'])
b 8
d 6
dtype: int64
>>> d.drop('c5')
城市 环比 同比
c4 深圳 102.0 140.9
c3 广州 101.3 119.4
c2 上海 101.2 127.3
c1 北京 101.5 120.7
>>> d.drop('环比')
Traceback (most recent call last):
File "<pyshell#112>", line 1, in <module>
d.drop('环比')
File "D:\Python36\lib\site-packages\pandas\core\frame.py", line 3994, in drop
errors=errors,
File "D:\Python36\lib\site-packages\pandas\core\generic.py", line 3935, in drop
obj = obj._drop_axis(labels, axis, level=level, errors=errors)
File "D:\Python36\lib\site-packages\pandas\core\generic.py", line 3969, in _drop_axis
new_axis = axis.drop(labels, errors=errors)
File "D:\Python36\lib\site-packages\pandas\core\indexes\base.py", line 5018, in drop
raise KeyError(f"{labels[mask]} not found in axis")
KeyError: "['环比'] not found in axis"
>>> d.drop('环比','c5')
Traceback (most recent call last):
File "<pyshell#113>", line 1, in <module>
d.drop('环比','c5')
File "D:\Python36\lib\site-packages\pandas\core\frame.py", line 3994, in drop
errors=errors,
File "D:\Python36\lib\site-packages\pandas\core\generic.py", line 3922, in drop
axis_name = self._get_axis_name(axis)
File "D:\Python36\lib\site-packages\pandas\core\generic.py", line 420, in _get_axis_name
raise ValueError(f"No axis named {axis} for object type {cls}")
ValueError: No axis named c5 for object type <class 'pandas.core.frame.DataFrame'>
>>> d.drop('环比',axis=1)
城市 同比
c5 沈阳 101.4
c4 深圳 140.9
c3 广州 119.4
c2 上海 127.3
c1 北京 120.7
>>>
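11.Data operations in pandas:
Arithmetic operations align on the row and column indexes, pad where needed, and then compute; results are floating point by default.
Positions that are missing during alignment are filled with NaN;
operations between 2-D and 1-D objects, or 1-D and 0-D objects, are broadcast.
Binary operations using the + - * / operators produce a new object.
Rules for comparison operations:
Comparisons only compare elements with the same index and do no padding; operations between 2-D and 1-D, or 1-D and 0-D objects, are broadcast; binary operations using > < >= <= == != produce Boolean objects.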
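These rules are easier to see on small objects. The sketch below uses made-up frames of different shapes; the ValueError on comparing differently-labeled frames is the behavior I know from recent pandas versions, so treat it as an assumption for other versions.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(12).reshape(3, 4))
b = pd.DataFrame(np.arange(20).reshape(4, 5))

total = a + b                      # aligned on both axes, gaps become NaN
filled = a.add(b, fill_value=0)    # same, but a missing side counts as 0

c = pd.Series(np.arange(4))
diff = a - c                       # 2-D minus 1-D: c is applied to every row
mask = a > c                       # comparison broadcasts too, giving booleans

# Comparing frames with different labels is refused rather than padded.
try:
    a > b
    comparable = True
except ValueError:
    comparable = False

print(total.shape, int(total.isna().sum().sum()))
print(diff.iloc[2].tolist(), comparable)
```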
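12.Sorting data in pandas
The .sort_index() method sorts along the given axis by index, ascending by default:
.sort_index(axis=0,ascending=True)
The .sort_values() method sorts along the given axis by value, ascending by default:
Series.sort_values(axis=0,ascending=True)
DataFrame.sort_values(by,axis=0,ascending=True)
NaN values are always placed at the end.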
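A small sketch of both sorting methods on a made-up frame with one NaN, showing that the NaN row ends up last whether the sort is ascending or descending.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [3.0, np.nan, 1.0]}, index=['c', 'a', 'b'])

by_index = df.sort_index()                    # sort rows by label
asc = df.sort_values('x')                     # sort rows by column x, ascending
desc = df.sort_values('x', ascending=False)   # descending; NaN still last

print(by_index.index.tolist())
print(asc.index.tolist(), desc.index.tolist())
```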
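Statistical functions
.sum(); .count(), the number of non-NaN values; .mean(), .median(), .var(), .std(), .min(), .max()
.argmin(), .argmax(), the integer positions of the minimum and maximum
.describe() outputs all the summary statistics at once; for a Series the result is itself a Series.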
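Here are the same functions on the four-element Series used earlier in these notes. The .idxmax() call is my addition, included to contrast a label result with the positional result of .argmax().

```python
import pandas as pd

b = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

total = b.sum()
count = b.count()        # number of non-NaN values
mean = b.mean()
median = b.median()
variance = b.var()       # sample variance (divides by n - 1)
pos = b.argmax()         # integer position of the maximum
label = b.idxmax()       # index label of the maximum
summary = b.describe()   # a Series of summary statistics

print(total, count, mean, median, pos, label)
print(summary)
```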
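Cumulative statistics
.cumsum() gives the running sum of the first 1, 2, …, n values;
.cumprod() gives the running product of the first 1, 2, …, n values;
.cummax() gives the running maximum of the first 1, 2, …, n values;
.cummin() gives the running minimum of the first 1, 2, …, n values.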
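A tiny worked example of the four cumulative functions on a made-up Series.

```python
import pandas as pd

s = pd.Series([3, 1, 4, 2])

running_sum = s.cumsum()
running_prod = s.cumprod()
running_max = s.cummax()
running_min = s.cummin()

print(running_sum.tolist(), running_prod.tolist())
print(running_max.tolist(), running_min.tolist())
```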
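Rolling calculations, similar to a moving average; each window covers only the current and preceding values, never later ones
.rolling(w).sum(); .rolling(w).mean(); .rolling(w).var(); .rolling(w).std()
.rolling(w).min(); .rolling(w).max()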
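A short sketch of rolling windows on a made-up Series. The first w-1 results are NaN because a full window of w values is not yet available.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

r_sum = s.rolling(2).sum()     # sum of the current and previous value
r_mean = s.rolling(3).mean()   # mean of the current and two previous values

print(r_sum.tolist())
print(r_mean.tolist())
```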
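Correlation analysis: the Pearson correlation coefficient
.cov() computes the covariance matrix;
.corr() computes the correlation matrix, using the Pearson, Spearman or Kendall coefficient.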
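A minimal sketch of .cov() and .corr() on made-up data where the relationships are exactly linear, so the expected Pearson coefficients are 1 and -1. Only the default Pearson method is used here.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])
y = 2 * x + 1                      # rises exactly with x
z = pd.Series([5, 4, 3, 2, 1])     # falls exactly as x rises

cov_xy = x.cov(y)       # covariance of two Series
corr_xy = x.corr(y)     # Pearson correlation by default
corr_xz = x.corr(z)

# On a DataFrame, .corr() returns the full correlation matrix.
matrix = pd.DataFrame({'x': x, 'y': y, 'z': z}).corr()

print(cov_xy, corr_xy, corr_xz)
print(matrix)
```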
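Postscript: this post took almost a week to write. Well, not exactly a week; it's more that I hadn't written a blog post in a long time. Another week has gone by, and looking back at it, I seem to have spent it either dawdling or on the way to dawdling. After work I have to cook dinner, and after dinner I have to prepare the next day's lunch. With all this constant dawdling, my efficiency drops even once I open the computer. I need to turn this around quickly. Keep going!!!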