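I. Introduction to Pandas
Pandas is a core data-analysis library for Python. Its main data structures are the Series (one-dimensional data) and the DataFrame (two-dimensional data).
Series: pd.Series(data, index=index)
A dict can be instantiated as a Series: pd.Series(dict)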
In[1]:d=np.random.sample(size=3)
In[2]:f=pd.Series(d,index=[1,2,3])
Out[2]:
1 0.671804
2 0.710016
3 0.540742
dtype: float64
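The dict form mentioned above has no example of its own, so here is a minimal sketch. The names and ages are made-up sample data, and `s` and `s2` are illustrative variable names.

```python
import pandas as pd

# Hypothetical sample data: the dict keys become the index labels.
ages = {"Tom": 21, "Vera": 23, "Jerry": 20}
s = pd.Series(ages)

# Passing an explicit index selects and reorders the values;
# a label that is missing from the dict ("Anna") gets NaN.
s2 = pd.Series(ages, index=["Vera", "Tom", "Anna"])

print(s)
print(s2)
```

Because NaN is a floating-point value, `s2` ends up with dtype float64 even though the dict holds integers.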
DataFrame: a two-dimensional labeled data structure whose columns are Series that may hold different types
In[1]:d={"name":pd.Series(['Tom','Vera','Jerry'],index=['1','2','3']),"age":pd.Series([21,23,20],index=['1','2','3'])}
In[2]:f=pd.DataFrame(d)
Out[2]:
name age
1 Tom 21
2 Vera 23
3 Jerry 20
In[3]:f['sex']=['M','F','M']  # add a column
Out[3]:
name age sex
1 Tom 21 M
2 Vera 23 F
3 Jerry 20 M
In[1]:date=pd.date_range('20210121',periods=6)  # build a date index for time-indexed data
In[2]:f=pd.DataFrame(np.random.randn(6,4),index=date,columns=list('abcd'))
Out[2]:
a b c d
2021-01-21 0.675991 0.682510 1.762832 -0.656490
2021-01-22 -0.013504 1.274871 -0.445489 1.110992
2021-01-23 0.359835 1.382179 -1.121142 0.508935
2021-01-24 -0.465819 -0.297656 -0.762458 -1.133300
2021-01-25 -2.504541 0.932503 0.194885 -0.206092
2021-01-26 -1.628215 -0.798137 -0.092357 -1.876833
In[6]:f=pd.DataFrame({'A':[1,2,3],'B':pd.Series(range(4,7)),'C':'统计'})  # build a DataFrame from a dict
Out[6]:
A B C
0 1 4 统计
1 2 5 统计
2 3 6 统计
NaN (Not a Number) denotes missing data.
II. Basic Pandas Operations
1. Reading Data
1.1 Importing Libraries
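To illustrate where NaN typically comes from, the sketch below builds a DataFrame from two Series whose indexes only partly overlap. The data is hypothetical; `isna`, `fillna` and `dropna` are common ways to inspect and handle the gaps.

```python
import pandas as pd

# Two columns whose indexes only partly overlap (hypothetical data).
df = pd.DataFrame({
    "A": pd.Series([1, 2, 3], index=[0, 1, 2]),
    "B": pd.Series([4, 5], index=[1, 2]),
})

missing = df.isna()      # boolean mask marking the missing cells
filled = df.fillna(0)    # replace NaN with a chosen value
dropped = df.dropna()    # drop every row that contains a NaN

print(df)
```

Row 0 has no value for column B, so pandas fills that cell with NaN during index alignment.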
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
path='E:/data.csv'
pd.read_csv(path,encoding='utf-8')
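Common read_csv() parameters
Parameter | Meaning |
---|---|
header=0 | Use the first row as the column names |
header=None | The file has no column names; supply them explicitly via names |
skip_blank_lines=True | Skip blank lines instead of reading them as NaN |
names | List of column names |
index_col | Column(s) to use as the row index |
usecols | Return only a subset of the columns |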
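The parameters in the table can be tried without a file on disk. The sketch below feeds `read_csv` an in-memory string via `io.StringIO`; the CSV content and the column names `id`, `name` and `age` are assumptions made for illustration.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file without a header row (hypothetical data).
# Note the blank line between the second and third records.
raw = "1,Tom,21\n2,Vera,23\n\n3,Jerry,20\n"

df = pd.read_csv(
    io.StringIO(raw),
    header=None,                  # the file has no header row
    names=["id", "name", "age"],  # so the column names are supplied here
    index_col="id",               # use the id column as the row index
    usecols=["id", "age"],        # keep only a subset of the columns
    skip_blank_lines=True,        # the default: blank lines are skipped
)

print(df)
```

With these settings the result has the ids as its index and a single `age` column.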
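1.2 Previewing Data
df.head(n)  # preview the first n rows (5 by default)
df.tail(n)  # preview the last n rows
df.describe()  # descriptive statistics
df.index  # show the index
df.columns  # show the column names
df.shape  # dimensions as (rows, columns)
df.T  # transpose the data
df.dtypes  # data type of each column
df.sort_index(axis=0,ascending=False)  # sort by index along an axis
df.sort_values(by='columns')  # sort by the values of a column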
In[3]:f.describe()
Out[3]:
a b c d
count 6.000000 6.000000 6.000000 6.000000
mean -0.596042 0.529378 -0.077288 -0.375465
std 1.232544 0.884903 1.015383 1.088834
min -2.504541 -0.798137 -1.121142 -1.876833
25% -1.337616 -0.052614 -0.683216 -1.014098
50% -0.239662 0.807507 -0.268923 -0.431291
75% 0.266500 1.189279 0.123075 0.330178
max 0.675991 1.382179 1.762832 1.110992
In[4]:f.sort_index(axis=0,ascending=False)
Out[4]:
a b c d
2021-01-26 -1.628215 -0.798137 -0.092357 -1.876833
2021-01-25 -2.504541 0.932503 0.194885 -0.206092
2021-01-24 -0.465819 -0.297656 -0.762458 -1.133300
2021-01-23 0.359835 1.382179 -1.121142 0.508935
2021-01-22 -0.013504 1.274871 -0.445489 1.110992
2021-01-21 0.675991 0.682510 1.762832 -0.656490
In[5]:f.sort_values(by='b',ascending=False)
Out[5]:
a b c d
2021-01-23 0.359835 1.382179 -1.121142 0.508935
2021-01-22 -0.013504 1.274871 -0.445489 1.110992
2021-01-25 -2.504541 0.932503 0.194885 -0.206092
2021-01-21 0.675991 0.682510 1.762832 -0.656490
2021-01-24 -0.465819 -0.297656 -0.762458 -1.133300
2021-01-26 -1.628215 -0.798137 -0.092357 -1.876833
2. Selecting Data
“Column selection”
f.loc[:,['a','b']]  # select several columns; returns a DataFrame
Out[30]:
a b
2021-01-21 -1.906501 -0.189475
2021-01-22 -0.592633 -0.139068
2021-01-23 -0.796397 -0.791442
2021-01-24 -1.573897 -1.600496
2021-01-25 -1.090613 -1.757387
2021-01-26 -0.480951 0.886467
In[34]:f['a']  # selecting a single column yields a Series; equivalent to f.a
Out[34]:
2021-01-21 0.675991
2021-01-22 -0.013504
2021-01-23 0.359835
2021-01-24 -0.465819
2021-01-25 -2.504541
2021-01-26 -1.628215
Freq: D, Name: a, dtype: float64
“Row slicing”
f.loc['20210125']  # select a row by label
Out[35]:
a -1.090613
b -1.757387
c 0.720606
d 2.176183
Name: 2021-01-25 00:00:00, dtype: float64
f.iloc[3]  # extract a row by position
Out[56]:
a -1.573897
b -1.600496
c 0.144396
d -0.376654
Name: 2021-01-24 00:00:00, dtype: float64
In[35]:f[0:3]  # row slicing with []
Out[35]:
a b c d
2021-01-21 0.675991 0.682510 1.762832 -0.656490
2021-01-22 -0.013504 1.274871 -0.445489 1.110992
2021-01-23 0.359835 1.382179 -1.121142 0.508935
In[37]:f['20210121':'20210123']  # row slicing by label (both endpoints included)
Out[37]:
a b c d
2021-01-21 0.675991 0.682510 1.762832 -0.656490
2021-01-22 -0.013504 1.274871 -0.445489 1.110992
2021-01-23 0.359835 1.382179 -1.121142 0.508935
f.iloc[3:5, 0:2]  # select a contiguous block of rows and columns
Out[60]:
a b
2021-01-24 -1.573897 -1.600496
2021-01-25 -1.090613 -1.757387
f.iloc[[1, 2, 4], [0, 2]]  # select specific rows and specific columns
Out[61]:
a c
2021-01-22 -0.592633 0.920918
2021-01-23 -0.796397 0.621148
2021-01-25 -1.090613 0.720606
f.loc[date[0],'a']  # extract a scalar value
Out[54]: -1.906500966022293
f.iloc[1, 1]  # extract a scalar value
Out[62]: -0.1390677596717857
Indexer | Slices by | para1 (locates rows) | para2 (locates columns) |
---|---|---|---|
f.loc[para1,para2] | label | a:b | [a,b] |
f.iloc[para1,para2] | position | num1:numn or [num3,num4] | num1:numn or [num5,num6] |
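One difference between the two indexers is easy to miss: a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its end position. The sketch below uses a small integer-valued frame (an assumption, chosen so the result is deterministic) to show both forms selecting the same block.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20210121", periods=6)
# Deterministic values 0..23 instead of random numbers.
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("abcd"))

# loc: label-based; the end label '20210123' is included.
by_label = df.loc["20210121":"20210123", ["a", "b"]]

# iloc: position-based; the end position 3 is excluded.
by_position = df.iloc[0:3, 0:2]

print(by_label)
print(by_position)
```

Both selections return the first three rows of columns a and b.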
“Conditional selection”
f[f.d > 0]  # select rows using the values of a single column
Out[66]:
a b c d
2021-01-21 -1.906501 -0.189475 -0.172290 2.381249
2021-01-22 -0.592633 -0.139068 0.920918 1.585921
2021-01-23 -0.796397 -0.791442 0.621148 0.227825
2021-01-25 -1.090613 -1.757387 0.720606 2.176183
2021-01-26 -0.480951 0.886467 0.868596 0.111140
f[f > 0]  # select the values in the DataFrame that meet the condition; the rest become NaN
Out[68]:
a b c d
2021-01-21 NaN NaN NaN 2.381249
2021-01-22 NaN NaN 0.920918 1.585921
2021-01-23 NaN NaN 0.621148 0.227825
2021-01-24 NaN NaN 0.144396 NaN
2021-01-25 NaN NaN 0.720606 2.176183
2021-01-26 NaN 0.886467 0.868596 0.111140
f2=f.copy()
f2['e']=['one', 'one', 'two', 'three', 'four', 'three']
f2[f2['e'].isin(['two','four'])]  # filter with isin()
Out[71]:
a b c d e
2021-01-23 -0.796397 -0.791442 0.621148 0.227825 two
2021-01-25 -1.090613 -1.757387 0.720606 2.176183 four
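Conditions like the ones above can also be combined. A minimal sketch, assuming a small hypothetical frame with a numeric column `a` and a label column `e`: use `&` for AND, `|` for OR and `~` for NOT, and wrap each comparison in parentheses because these operators bind more tightly than `>` or `==`.

```python
import pandas as pd

# Hypothetical data with one numeric and one label column.
df = pd.DataFrame({
    "a": [-1.0, 0.5, 2.0, -3.0],
    "e": ["one", "two", "three", "two"],
})

both = df[(df["a"] > 0) & (df["e"] == "two")]    # AND: both conditions hold
either = df[(df["a"] > 0) | (df["e"] == "two")]  # OR: at least one holds
not_two = df[~df["e"].isin(["two"])]             # NOT: invert an isin() mask

print(both)
print(either)
print(not_two)
```

Python's `and`, `or` and `not` keywords do not work element-wise on Series, which is why the operator forms are used here.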
“Assignment”
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20210121', periods=6))
f['f'] = s1  # add a column; the data is aligned automatically by index
f
Out[78]:
a b c d f
2021-01-21 -1.906501 -0.189475 -0.172290 2.381249 1
2021-01-22 -0.592633 -0.139068 0.920918 1.585921 2
2021-01-23 -0.796397 -0.791442 0.621148 0.227825 3
2021-01-24 -1.573897 -1.600496 0.144396 -0.376654 4
2021-01-25 -1.090613 -1.757387 0.720606 2.176183 5
2021-01-26 -0.480951 0.886467 0.868596 0.111140 6
f.at[date[0], 'a'] = 0  # assign by label
f
Out[81]:
a b c d f
2021-01-21 0.000000 -0.189475 -0.172290 2.381249 1
2021-01-22 -0.592633 -0.139068 0.920918 1.585921 2
2021-01-23 -0.796397 -0.791442 0.621148 0.227825 3
2021-01-24 -1.573897 -1.600496 0.144396 -0.376654 4
2021-01-25 -1.090613 -1.757387 0.720606 2.176183 5
2021-01-26 -0.480951 0.886467 0.868596 0.111140 6
f.iat[0, 1] = 0  # assign by position
f
Out[83]:
a b c d f
2021-01-21 0.000000 0.000000 -0.172290 2.381249 1
2021-01-22 -0.592633 -0.139068 0.920918 1.585921 2
2021-01-23 -0.796397 -0.791442 0.621148 0.227825 3
2021-01-24 -1.573897 -1.600496 0.144396 -0.376654 4
2021-01-25 -1.090613 -1.757387 0.720606 2.176183 5
2021-01-26 -0.480951 0.886467 0.868596 0.111140 6
f.loc[:, 'd'] = np.array([5] * len(f))  # assign with a NumPy array
f
Out[85]:
a b c d f
2021-01-21 0.000000 0.000000 -0.172290 5 1
2021-01-22 -0.592633 -0.139068 0.920918 5 2
2021-01-23 -0.796397 -0.791442 0.621148 5 3
2021-01-24 -1.573897 -1.600496 0.144396 5 4
2021-01-25 -1.090613 -1.757387 0.720606 5 5
2021-01-26 -0.480951 0.886467 0.868596 5 6
f2=f.copy()  # conditional assignment with a where-style boolean mask
f2[f2>0]=-f2
f2
Out[89]:
a b c d f
2021-01-21 0.000000 0.000000 -0.172290 -5 -1
2021-01-22 -0.592633 -0.139068 -0.920918 -5 -2
2021-01-23 -0.796397 -0.791442 -0.621148 -5 -3
2021-01-24 -1.573897 -1.600496 -0.144396 -5 -4
2021-01-25 -1.090613 -1.757387 -0.720606 -5 -5
2021-01-26 -0.480951 -0.886467 -0.868596 -5 -6
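The same conditional replacement can be written without in-place assignment. The sketch below, on a small hypothetical frame, compares the boolean-mask assignment used above with `DataFrame.where` and `DataFrame.mask`, which return a new object instead of modifying the original.

```python
import pandas as pd

# Hypothetical data with mixed signs.
df = pd.DataFrame({"x": [-1.0, 2.0, -3.0], "y": [4.0, -5.0, 6.0]})

# Boolean-mask assignment on a copy, as in the snippet above.
by_assignment = df.copy()
by_assignment[by_assignment > 0] = -by_assignment

# where() keeps values where the condition holds and replaces the rest.
by_where = df.where(df <= 0, -df)

# mask() is the inverse: it replaces values where the condition holds.
by_mask = df.mask(df > 0, -df)

print(by_where)
```

All three variants flip the positive values to negative and leave the others unchanged.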
“The apply function”
f.apply(np.cumsum)
Out[4]:
a b c d
2021-01-21 -0.197920 0.377108 0.395226 0.806494
2021-01-22 1.315655 -1.364816 2.054561 -0.057621
2021-01-23 0.329961 -1.400136 1.442244 -0.928451
2021-01-24 1.039853 -1.661710 2.715270 0.071924
2021-01-25 0.875996 -1.787579 3.221671 -0.386586
2021-01-26 1.255793 0.052887 4.751565 -0.030912
f.apply(lambda x: x.max()-x.min())
Out[5]:
a 2.499268
b 3.582389
c 2.271652
d 1.871204
dtype: float64
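By default `apply` passes each column to the function. A short sketch on hypothetical integer data shows the `axis=1` variant, which passes each row instead, and checks that `apply(np.cumsum)` gives the same numbers as the built-in `cumsum()` method.

```python
import numpy as np
import pandas as pd

# Hypothetical data, small enough to check by hand.
df = pd.DataFrame({"a": [1, 5, 3], "b": [10, 2, 7]})

# Default axis=0: the function receives one column at a time.
col_range = df.apply(lambda x: x.max() - x.min())

# axis=1: the function receives one row at a time.
row_range = df.apply(lambda x: x.max() - x.min(), axis=1)

# Cumulative sum down each column, two equivalent ways.
via_apply = df.apply(np.cumsum)
via_method = df.cumsum()

print(col_range)
print(row_range)
```

For common reductions and cumulative operations the built-in methods are usually the simpler choice; `apply` is most useful for custom functions.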
“Combining”
f=pd.Series(np.random.randint(1,6,6),index=range(0,6))
Out[30]:
0 3
1 3
2 3
3 1
4 1
5 2
dtype: int32
f.value_counts()  # count the occurrences of each value in a Series
Out[31]:
3 3
1 2
2 1
dtype: int64
f=pd.Series(['aa','bb','cc'])  # convert every string in the Series to upper case
f.str.upper()
Out[34]:
0 AA
1 BB
2 CC
dtype: object
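pd.concat()  # concatenate DataFrames (row-wise by default)
pd.merge(dataframe1,dataframe2,on='col')  # merge DataFrames on a column
pd.MultiIndex.from_tuples()  # build a multi-level index
pd.pivot_table()  # build a pivot table
df.append()  # append rows to a DataFrame (removed in pandas 2.0; use pd.concat() instead)
df.groupby().sum()  # group by the values of a column, then sum
df.stack()  # pivot the columns into the inner level of the row index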
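The calls listed above are shown without arguments, so the sketch below runs each of them on small hypothetical frames. The names `left`, `right` and `sales` and their columns are assumptions made for illustration. Rows are appended with `pd.concat`, since `DataFrame.append` is not available in pandas 2.0 and later.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "y": [10, 20, 40]})

# Row-wise concatenation; ignore_index renumbers the rows 0..n-1.
appended = pd.concat([left, left], ignore_index=True)

# Inner join on the shared column: only keys present in both frames remain.
merged = pd.merge(left, right, on="key")

sales = pd.DataFrame({
    "region": ["N", "N", "S", "S"],
    "year": [2020, 2021, 2020, 2021],
    "amount": [5, 7, 3, 4],
})

# Group by one column, then sum the numeric column of interest.
totals = sales.groupby("region")["amount"].sum()

# Pivot table: regions as rows, years as columns, summed amounts as cells.
pivot = pd.pivot_table(sales, values="amount", index="region",
                       columns="year", aggfunc="sum")

# stack() moves the column labels into an inner level of the row index.
long_form = pivot.stack()

# A MultiIndex can also be built directly from tuples.
idx = pd.MultiIndex.from_tuples([("N", 2020), ("N", 2021)],
                                names=["region", "year"])

print(merged)
print(pivot)
print(long_form)
```

`pd.merge` performs an inner join by default; passing `how='left'`, `how='right'` or `how='outer'` keeps unmatched keys and fills the gaps with NaN.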
Reference websites: