Machine Learning: Data Science Packages (Part 2: Pandas Quick Start)

1 Main pandas data structures: Series and DataFrame

pandas has two main data structures: 1. Series 2. DataFrame
A Series holds one-dimensional labeled data; a DataFrame holds two-dimensional labeled data.

import pandas as pd
import numpy as np
s=pd.Series([1,3,5,np.nan,8,4])
print(s)
dates=pd.date_range('20200920',periods=6)
print(dates)
data=pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
print(data)
d={'A':1,'B':pd.Timestamp('20200920'),'C':range(4),'D':np.arange(4)}
print(d)
print(pd.DataFrame(d))
data.loc['20200920':'20200922']
data.iloc[2:4]
data.loc[:,['B','C']]
data.loc['20200920':'20200922',['B','C']]
data.loc['20200920',['B']]
Output:
0    1.0
1    3.0
2    5.0
3    NaN
4    8.0
5    4.0
dtype: float64
DatetimeIndex(['2020-09-20', '2020-09-21', '2020-09-22', '2020-09-23',
               '2020-09-24', '2020-09-25'],
              dtype='datetime64[ns]', freq='D')
                   A         B         C         D
2020-09-20  0.534018  1.001680 -0.267684  0.477595
2020-09-21  1.681878 -1.461280 -0.177759 -1.196249
2020-09-22 -0.163594  0.330268 -0.498839 -0.228518
2020-09-23  0.564606  0.575666 -0.807074  1.222299
2020-09-24 -1.462290  1.272501 -0.562305  0.743939
2020-09-25  1.005620  0.252503 -1.075365 -0.639834
{'A': 1, 'B': Timestamp('2020-09-20 00:00:00'), 'C': range(0, 4), 'D': array([0, 1, 2, 3])}
   A          B  C  D
0  1 2020-09-20  0  0
1  1 2020-09-20  1  1
2  1 2020-09-20  2  2
3  1 2020-09-20  3  3
                   A         B         C         D
2020-09-20  0.447708 -0.527595  3.263154  0.857480
2020-09-21 -0.621859 -0.932621 -0.647323 -0.358653
2020-09-22  0.508061 -0.890667  0.445783  1.069888
                   B         C
2020-09-20 -0.527595  3.263154
2020-09-21 -0.932621 -0.647323
2020-09-22 -0.890667  0.445783
B   -0.527595
Name: 2020-09-20 00:00:00, dtype: float64
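
The selections above mix label-based indexing (loc) with position-based indexing (iloc). The following is a minimal sketch of the difference, assuming a small hypothetical frame named demo with fixed values instead of random ones: a loc slice includes its end label, while an iloc slice excludes its end position.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed data so that the result is reproducible
dates = pd.date_range('20200920', periods=6)
demo = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

by_label = demo.loc['20200920':'20200922']   # label slice: the end label is included
by_position = demo.iloc[2:4]                 # position slice: the end position is excluded
one_cell = demo.loc[dates[0], 'B']           # scalar lookup by row label and column label

print(len(by_label), len(by_position), one_cell)
```

The label slice returns three rows and the position slice returns two, which is an easy source of off-by-one mistakes when switching between the two styles.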

2 Jupyter Notebook

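Compared with other Python editing tools, Jupyter Notebook has the advantage of being web-based and graphical. For example, when a labeled matrix (a DataFrame) is displayed in Jupyter Notebook, it is rendered as a formatted table, and plots produced by the code appear directly in the page. NumPy and pandas, in turn, are strong at high-performance scientific computing and data analysis; most other scientific computing packages are built on NumPy.
[Image not available in the source]
As with built-in Python statements, pandas provides many functions that we can call; the figure above showed a few simple usage examples.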

3 Pandas Quick Start (Part 1)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
dates=pd.date_range('20200920',periods=6)
data=pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
data
                   A         B         C         D
2020-09-20  0.637806 -0.295362  2.543553  0.267552
2020-09-21 -0.373838  0.572171  0.234498  0.385977
2020-09-22  1.499670 -0.063495 -0.012278 -1.686829
2020-09-23 -0.249202 -0.643503  0.087328  0.159256
2020-09-24  1.487515 -0.053663 -0.045310 -2.220774
2020-09-25  0.902363 -1.078515  3.715075 -0.777603
data1=data.reindex(index=dates[2:5],columns=list(data.columns)+['E'])# keep the rows at positions 2-4 and add a new column 'E' (initially NaN)
data1
                   A         B         C         D   E
2020-09-22  1.499670 -0.063495 -0.012278 -1.686829 NaN
2020-09-23 -0.249202 -0.643503  0.087328  0.159256 NaN
2020-09-24  1.487515 -0.053663 -0.045310 -2.220774 NaN
data1.loc[dates[2:4],'E']=2# set column E to 2 for the rows at positions 2-3
data1
                   A         B         C         D    E
2020-09-22  1.499670 -0.063495 -0.012278 -1.686829  2.0
2020-09-23 -0.249202 -0.643503  0.087328  0.159256  2.0
2020-09-24  1.487515 -0.053663 -0.045310 -2.220774  NaN
data1.dropna()
                   A         B         C         D    E
2020-09-22  1.499670 -0.063495 -0.012278 -1.686829  2.0
2020-09-23 -0.249202 -0.643503  0.087328  0.159256  2.0
data1.fillna(value=3)
                   A         B         C         D    E
2020-09-22  1.499670 -0.063495 -0.012278 -1.686829  2.0
2020-09-23 -0.249202 -0.643503  0.087328  0.159256  2.0
2020-09-24  1.487515 -0.053663 -0.045310 -2.220774  3.0
pd.isnull(data1)# test each value in the table for NaN
                A      B      C      D      E
2020-09-22  False  False  False  False  False
2020-09-23  False  False  False  False  False
2020-09-24  False  False  False  False   True
pd.isnull(data1).any().any()# True if the table contains at least one NaN
True
data1.mean()# mean of each column; missing values are skipped
A    0.912661
B   -0.253553
C    0.009913
D   -1.249449
E    2.000000
dtype: float64
data1.mean(axis=1)# mean of each row
2020-09-22    0.347414
2020-09-23    0.270776
2020-09-24   -0.208058
Freq: D, dtype: float64
data1.cumsum()# cumulative sum down each column (a NaN position stays NaN)
                   A         B         C         D    E
2020-09-22  1.499670 -0.063495 -0.012278 -1.686829  2.0
2020-09-23  1.250469 -0.706997  0.075050 -1.527573  4.0
2020-09-24  2.737984 -0.760660  0.029740 -3.748347  NaN

4 Pandas Quick Start (Part 3)

4.1 Data merging

# render plots inline in the notebook
%matplotlib inline
# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
tuples=list(zip(*[['bar','bar','baz','baz','foo','foo','qux','qux'],['one','two','one','two','one','two','one','two']]))# build a list of (first, second) label tuples
tuples
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]
index=pd.MultiIndex.from_tuples(tuples,names=['first','second'])# two-level row index built from the tuples
df=pd.DataFrame(np.random.randn(8,2),index=index,columns=['A','B'])# 8x2 DataFrame that uses the two-level row index
df
                     A         B
first second                    
bar   one     0.973814  0.860556
      two    -0.334303  0.930545
baz   one     0.518520 -1.376495
      two     0.809283 -2.076021
foo   one     0.919809  0.330664
      two    -0.979260  2.022618
qux   one    -0.215954 -0.485860
      two     0.550512 -1.902934
df.loc['bar']# select the rows whose first-level label is 'bar'
               A         B
second                    
one     0.973814  0.860556
two    -0.334303  0.930545
df.loc['bar'].loc['one'].loc['A']
0.973813613008033
stacked=df.stack()# move the column labels into the innermost level of the row index
stacked
first  second   
bar    one     A    0.973814
               B    0.860556
       two     A   -0.334303
               B    0.930545
baz    one     A    0.518520
               B   -1.376495
       two     A    0.809283
               B   -2.076021
foo    one     A    0.919809
               B    0.330664
       two     A   -0.979260
               B    2.022618
qux    one     A   -0.215954
               B   -0.485860
       two     A    0.550512
               B   -1.902934
dtype: float64
df
                     A         B
first second                    
bar   one     0.973814  0.860556
      two    -0.334303  0.930545
baz   one     0.518520 -1.376495
      two     0.809283 -2.076021
foo   one     0.919809  0.330664
      two    -0.979260  2.022618
qux   one    -0.215954 -0.485860
      two     0.550512 -1.902934
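
Since stack() leaves df unchanged and returns a new Series, the reshaping can be reversed with unstack(). The following is a minimal sketch, assuming a smaller hypothetical frame named small with fixed integer values; it also shows a single tuple lookup as an alternative to the chained .loc calls used earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical 4x2 frame with a two-level row index and fixed values
index = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                   names=['first', 'second'])
small = pd.DataFrame(np.arange(8).reshape(4, 2), index=index, columns=['A', 'B'])

stacked = small.stack()                  # Series indexed by (first, second, column label)
restored = stacked.unstack()             # move the innermost index level back to the columns
cell = small.loc[('bar', 'one'), 'A']    # one lookup instead of chained .loc calls

print(stacked.index.nlevels, list(restored.columns), cell)
```

Depending on the installed pandas version, stack() may print a FutureWarning about its implementation; the reshaped values are the same either way for a frame without missing data.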

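4.2 Pivot tables

A pivot table shows a selected part of a table: chosen columns become the row and column labels, and the values that fall into each cell are aggregated.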

import pandas as pd
import numpy as np
df=pd.DataFrame({'A':['one','one','two','three']*3,
                'B':['A','B','C']*4,
                'C':['foo','foo','foo','bar','bar','bar']*2,
                'D':np.random.randn(12),
                'E':np.random.randn(12)})# build a labeled table from a dict: the keys become column labels, the default integer index labels the rows, and all columns must have the same length
df
        A  B    C         D         E
0     one  A  foo -1.105957 -0.747081
1     one  B  foo  0.090460  1.635942
2     two  C  foo  0.520190  1.534468
3   three  A  bar -0.217067 -1.269738
4     one  B  bar -0.452289  0.576699
5     one  C  bar -0.149624 -0.019415
6     two  A  foo  0.695994 -2.089837
7   three  B  foo  0.377911  2.042727
8     one  C  foo -0.195365 -1.065980
9     one  A  bar  1.970489  1.340846
10    two  B  bar -0.919174  0.920231
11  three  C  bar  0.076588  1.971652
pd.pivot_table(df,values='D',index=['A','B'],columns=['C'])# rows labeled by A and B, columns labeled by C, cell values taken from D
C             bar       foo
A     B                    
one   A  1.970489 -1.105957
      B -0.452289  0.090460
      C -0.149624 -0.195365
three A -0.217067       NaN
      B       NaN  0.377911
      C  0.076588       NaN
two   A       NaN  0.695994
      B -0.919174       NaN
      C       NaN  0.520190
pd.pivot_table(df,values='E',index=['A'],columns=['C'])# rows labeled by A, columns labeled by C, cell values taken from E
C           bar       foo
A                        
one    0.632710 -0.059040
three  0.350957  2.042727
two    0.920231 -0.277684
df[df.A=='one'].groupby('C')[['D','E']].mean()# when several rows fall into one cell, pivot_table reports their mean; this groupby reproduces the 'one' row of the table above
            D        E
C                     
bar  0.456192  0.63271
foo -0.403621 -0.05904
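
The mean is only the default aggregation of pivot_table. The sketch below, which assumes a tiny hypothetical table t in which two rows share one cell, compares the default with an explicit aggfunc and with the equivalent groupby.

```python
import pandas as pd

# Hypothetical table: the two 'one'/'bar' rows land in the same pivot cell
t = pd.DataFrame({'A': ['one', 'one', 'two'],
                  'C': ['bar', 'bar', 'foo'],
                  'D': [1.0, 3.0, 5.0]})

mean_table = pd.pivot_table(t, values='D', index='A', columns='C')                # default: mean
sum_table = pd.pivot_table(t, values='D', index='A', columns='C', aggfunc='sum')  # explicit sum
by_group = t[t.A == 'one'].groupby('C')['D'].mean()                               # same cell via groupby

print(mean_table.loc['one', 'bar'], sum_table.loc['one', 'bar'], by_group['bar'])
```

Combinations with no rows at all, such as A='one' with C='foo' here, show up as NaN in the default table, which matches the NaN cells in the output above.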

4.3 Time series

import pandas as pd 
import numpy as np
rng=pd.date_range('20200927',periods=600,freq='s')# 600 timestamps spaced one second apart
rng
DatetimeIndex(['2020-09-27 00:00:00', '2020-09-27 00:00:01',
               '2020-09-27 00:00:02', '2020-09-27 00:00:03',
               '2020-09-27 00:00:04', '2020-09-27 00:00:05',
               '2020-09-27 00:00:06', '2020-09-27 00:00:07',
               '2020-09-27 00:00:08', '2020-09-27 00:00:09',
               ...
               '2020-09-27 00:09:50', '2020-09-27 00:09:51',
               '2020-09-27 00:09:52', '2020-09-27 00:09:53',
               '2020-09-27 00:09:54', '2020-09-27 00:09:55',
               '2020-09-27 00:09:56', '2020-09-27 00:09:57',
               '2020-09-27 00:09:58', '2020-09-27 00:09:59'],
              dtype='datetime64[ns]', length=600, freq='S')
ts=pd.Series(np.random.randint(0,500,len(rng)),index=rng)# random integers in [0, 500), one per timestamp, indexed by rng
ts
2020-09-27 00:00:00    390
2020-09-27 00:00:01    259
2020-09-27 00:00:02    326
2020-09-27 00:00:03    102
2020-09-27 00:00:04     78
                      ... 
2020-09-27 00:09:55    252
2020-09-27 00:09:56     40
2020-09-27 00:09:57    448
2020-09-27 00:09:58    273
2020-09-27 00:09:59    262
Freq: S, Length: 600, dtype: int32
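
A common next step with a per-second series like this is to aggregate it into coarser time bins. The article does not do this, so the following is only a sketch under stated assumptions: a hypothetical series named ones holds the value 1 for every second, and resample('1min') groups the seconds into one-minute bins.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('20200927', periods=600, freq='s')
# Hypothetical data: one event per second, so every full minute should sum to 60
ones = pd.Series(np.ones(len(rng), dtype=int), index=rng)

per_minute = ones.resample('1min').sum()   # group the seconds into 1-minute bins and add them up

print(per_minute)
```

The 600 seconds collapse into 10 rows, one per minute, each labeled with the start of its bin.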

pd.Timestamp('20210212')-pd.Timestamp('20200927')# difference between the two dates: a Timedelta of 138 days
df=pd.DataFrame({'id':[1,2,3,4,5,6],'rawgrade':['a','b','a','c','b','a']})
df['grade']=df.rawgrade.astype('category')
df
   id rawgrade grade
0   1        a     a
1   2        b     b
2   3        a     a
3   4        c     c
4   5        b     b
5   6        a     a
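
The grade column looks identical to rawgrade when printed, but it has a categorical dtype. The short sketch below rebuilds the same frame and inspects the column through the .cat accessor and value_counts().

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6],
                   'rawgrade': ['a', 'b', 'a', 'c', 'b', 'a']})
df['grade'] = df.rawgrade.astype('category')   # store the strings as a categorical column

categories = list(df.grade.cat.categories)     # the distinct labels found in the column
counts = df.grade.value_counts()               # number of rows per category

print(df.grade.dtype, categories)
print(counts)
```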
ts=pd.Series(np.random.randn(1000),index=pd.date_range('20000101',periods=1000))# 1000 random numbers indexed by consecutive dates
ts
2000-01-01    0.531670
2000-01-02   -1.082891
2000-01-03   -0.160001
2000-01-04   -0.813438
2000-01-05    0.383497
                ...   
2002-09-22   -0.737740
2002-09-23   -1.852576
2002-09-24    0.767750
2002-09-25    1.456197
2002-09-26    0.843263
Freq: D, Length: 1000, dtype: float64
ts=ts.cumsum()# cumulative sum
ts
2000-01-01      0.531670
2000-01-02     -0.019551
2000-01-03     -0.730773
2000-01-04     -2.255433
2000-01-05     -3.396595
                 ...    
2002-09-22    379.164625
2002-09-23    414.699047
2002-09-24    451.001219
2002-09-25    488.759588
2002-09-26    527.361220
Freq: D, Length: 1000, dtype: float64
ts.plot()# plot the series
<matplotlib.axes._subplots.AxesSubplot at 0x19b65a05948>

[Figure: line plot produced by ts.plot() (output_9_1.png); image not available in the source]

df=pd.DataFrame(np.random.randn(100,4),columns=list('ABCD'))
df
           A         B         C         D
0   0.144789  1.336277  0.028186 -0.436991
1  -1.594937  0.264321 -0.846422  1.064381
2   0.332121 -0.059087 -1.649410  1.969228
3  -0.147231 -0.981841 -0.359944  1.087520
4  -0.906667 -0.336189  0.865446 -1.639325
..       ...       ...       ...       ...
95 -0.115463  1.348053 -0.374024 -1.214077
96  0.383091 -0.349718 -0.563905  0.744625
97  1.146921 -0.420216 -0.699582  0.223157
98 -0.022097 -0.333514 -1.650627 -0.370759
99 -0.820843  1.002717 -1.012966  0.479410

100 rows × 4 columns

4.4 Loading and saving data

import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.randn(100,4),columns=list('ABCD'))# create a 100x4 DataFrame of random numbers
df
           A         B         C         D
0  -0.271088 -0.667500 -0.239791 -2.332722
1  -0.585185  0.020578 -0.162059 -2.674975
2   1.581472  0.107970  0.670809 -2.321305
3   1.254169  0.879124 -0.006355 -0.297117
4   1.455440 -1.949867 -1.530375  0.474945
..       ...       ...       ...       ...
95  0.514449  0.283788  0.872928 -0.010145
96 -0.990031  0.320961 -0.137568 -2.621296
97 -1.038143  0.399518 -0.850558 -0.647029
98 -0.003421 -1.709836  0.414999 -0.025183
99 -0.024097 -1.192359  0.359018  0.393891

100 rows × 4 columns

df.to_csv('data.csv')# write the DataFrame to data.csv
%ls
 Volume in drive C is Windows10
 Volume Serial Number is BE1F-0990

 Directory of C:\Users\Administrator

...
2020/09/28 Mon  21:17             8,277 data.csv
...
              15 File(s)        182,447 bytes
              29 Dir(s)  79,681,208,320 bytes free
# show the contents of data.csv
%more data.csv
pd.read_csv('data.csv',index_col=0)# read data.csv back, using the first column as the index
           A         B         C         D
0  -0.271088 -0.667500 -0.239791 -2.332722
1  -0.585185  0.020578 -0.162059 -2.674975
2   1.581472  0.107970  0.670809 -2.321305
3   1.254169  0.879124 -0.006355 -0.297117
4   1.455440 -1.949867 -1.530375  0.474945
..       ...       ...       ...       ...
95  0.514449  0.283788  0.872928 -0.010145
96 -0.990031  0.320961 -0.137568 -2.621296
97 -1.038143  0.399518 -0.850558 -0.647029
98 -0.003421 -1.709836  0.414999 -0.025183
99 -0.024097 -1.192359  0.359018  0.393891

100 rows × 4 columns
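
The index_col=0 argument matters because to_csv writes the row index as the first, unnamed column. The sketch below illustrates this with an in-memory buffer (io.StringIO) standing in for data.csv, so that it touches no files; the frame named frame and its values are hypothetical.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical 2x4 frame with fixed values
frame = pd.DataFrame(np.arange(8).reshape(2, 4), columns=list('ABCD'))

buffer = io.StringIO()     # in-memory stand-in for data.csv
frame.to_csv(buffer)       # the row index is written as the first, unnamed column
buffer.seek(0)

with_index = pd.read_csv(buffer, index_col=0)   # first column restored as the index
buffer.seek(0)
without_index = pd.read_csv(buffer)             # first column kept as ordinary data

print(list(with_index.columns))
print(list(without_index.columns))
```

Without index_col=0 the old index comes back as an extra data column, so a save and load cycle would add one column each time.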

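5 Core pandas data structures

(1) Series
The basic form for building a Series is:
s=pd.Series(data,index=index)
This creates a one-dimensional labeled array that can hold any kind of data (integers, floats, strings, Python objects).
index is a list that supplies the labels for the data. data can take several forms:
a Python dict, an ndarray, or a scalar value

(2) DataFrame
A DataFrame is a two-dimensional array with row labels and column labels. Its basic form is:
df=pd.DataFrame(data,index=index,columns=columns)
Here index holds the row labels, columns holds the column labels, and data can be, for example, a NumPy array passed to the DataFrame constructor.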

import pandas as pd
import numpy as np
s=pd.Series(np.random.randn(5),index=['a','b','c','d','e'])# labeled array built from random values
s
a    0.044486
b   -0.335713
c    0.361505
d    0.459095
e   -0.068854
dtype: float64
s.iloc[0]# look up a value by position
0.044486489059218585
s.iloc[1]
-0.33571303035986566
s.iloc[2]
0.3615048285007503
s=pd.Series({'a':1,'b':2,'c':3,'d':'4'},index=['a','b','c','h'])# labeled array built from a dict; 'h' has no entry in the dict and becomes NaN, and 'd' is dropped
s
a      1
b      2
c      3
h    NaN
dtype: object
s=pd.Series(0,index=list('abcd'))# labeled array built from a scalar, repeated for every label
s
a    0
b    0
c    0
d    0
dtype: int64
np.cos(s)
a    1.0
b    1.0
c    1.0
d    1.0
dtype: float64
np.exp(s)
a    1.0
b    1.0
c    1.0
d    1.0
dtype: float64
s1=pd.Series(np.random.randn(5),index=list('abcde'))
s2=pd.Series(np.random.randn(5),index=list('abcdf'))
print('{}\n{}'.format(s1,s2))
a   -0.210622
b    1.579821
c    0.590158
d    0.822265
e   -0.674539
dtype: float64
a    1.027758
b    0.238535
c   -0.524706
d    0.046141
f   -0.279048
dtype: float64
s1+s2# add two Series; values are aligned by label, and a label missing from either side gives NaN
a    0.817136
b    1.818356
c    0.065452
d    0.868407
e         NaN
f         NaN
dtype: float64
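
The NaN results for e and f come from label alignment: each label exists in only one of the two Series. If a missing entry should count as 0 instead, Series.add accepts a fill_value. A minimal sketch, assuming two small hypothetical Series (left and right) with fixed values:

```python
import pandas as pd

# Hypothetical Series that share the labels a and b only
left = pd.Series([1.0, 2.0, 3.0], index=list('abe'))
right = pd.Series([10.0, 20.0, 30.0], index=list('abf'))

plain = left + right                     # e and f exist on one side only, so they become NaN
filled = left.add(right, fill_value=0)   # a missing side is treated as 0 before adding

print(plain)
print(filled)
```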
d={'one':pd.Series([1,2,3,4],index=['a','b','c','d']),'two':pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])}# a dict whose values are Series
d
{'one': a    1
 b    2
 c    3
 d    4
 dtype: int64,
 'two': a    1
 b    2
 c    3
 d    4
 e    5
 dtype: int64}
pd.DataFrame(d)
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  4.0    4
e  NaN    5
pd.DataFrame(d,index=['a','b'])# keep only the rows labeled a and b
   one  two
a    1    1
b    2    2
pd.DataFrame(d,columns=['one'])# keep only the column 'one'
   one
a    1
b    2
c    3
d    4

6 Basic pandas operations

import pandas as pd
import numpy as np
s=pd.Series([1,3,5,6,8],index=list('acefh'))
s
a    1
c    3
e    5
f    6
h    8
dtype: int64
s.index
Index(['a', 'c', 'e', 'f', 'h'], dtype='object')
s.reindex(list('abcdefh'))# returns a new Series (NaN for the new labels b and d); s itself is not modified
s
a    1
c    3
e    5
f    6
h    8
dtype: int64
s.reindex(list('abcdefh'),fill_value=0)# returns a new Series with the new labels filled with 0; s itself is still unchanged
s
a    1
c    3
e    5
f    6
h    8
dtype: int64
a=pd.DataFrame([1,2,3],index=list('abc'),columns=list('a'))
a
   a
a  1
b  2
c  3
s.reindex(list('abcdefh'),method='ffill')# add row labels; each new label takes the value of the nearest preceding label (forward fill)
   0
a  1
b  1
c  3
d  3
e  5
f  6
h  8
df=pd.DataFrame(np.random.randn(4,6),index=[1,2,3,4],columns=list('ABCDEF'))
df
          A         B         C         D         E         F
1  0.029065 -1.430084  1.129595 -0.752968 -1.547312 -0.488080
2 -1.653225 -0.773183 -1.076198  2.002175 -0.412101 -1.152921
3 -0.090254  0.064623 -0.414685  0.585919 -0.855479  1.761726
4  0.609468 -0.559947  0.646330 -0.720229  1.554926 -0.822189
df2=df.reindex([1,2,3,4,5,6],fill_value=0)
df2
          A         B         C         D         E         F
1  0.029065 -1.430084  1.129595 -0.752968 -1.547312 -0.488080
2 -1.653225 -0.773183 -1.076198  2.002175 -0.412101 -1.152921
3 -0.090254  0.064623 -0.414685  0.585919 -0.855479  1.761726
4  0.609468 -0.559947  0.646330 -0.720229  1.554926 -0.822189
5  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
6  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
df.drop('A',axis=1)
          B         C         D         E         F
1 -1.430084  1.129595 -0.752968 -1.547312 -0.488080
2 -0.773183 -1.076198  2.002175 -0.412101 -1.152921
3  0.064623 -0.414685  0.585919 -0.855479  1.761726
4 -0.559947  0.646330 -0.720229  1.554926 -0.822189
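
Both reindex and drop return new objects and leave the original untouched, so the result must be assigned to be kept. The sketch below reuses the Series from this section and adds a small hypothetical frame named frame to show this.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, 6, 8], index=list('acefh'))

expanded = s.reindex(list('abcdefh'), fill_value=0)     # new labels b and d get 0
forward = s.reindex(list('abcdefh'), method='ffill')    # new labels copy the previous label's value

# Hypothetical 2x3 frame with fixed values
frame = pd.DataFrame(np.arange(6).reshape(2, 3), columns=list('ABC'))
trimmed = frame.drop('A', axis=1)                       # new frame without column A

print(list(expanded), list(forward))
print(len(s), list(frame.columns), list(trimmed.columns))
```

Forward filling requires the original index to be sorted, which holds for 'acefh'.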