DataFrame
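A DataFrame represents a rectangular table of data. The data is stored in two-dimensional blocks, and the table has both a row index and a column index.
1. Constructing a DataFrame: the most common way is to pass a dict of equal-length lists or NumPy arrays (all columns must have the same length):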
In [28]: data={'state':['one','two','three','four','five'],
...: 'year':[2000,2001,202,2003,2004],
...: 'pop':[1.5,4,5,9,10.2]}
In [29]: frame=pd.DataFrame(data)
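Another way to construct one is from a nested dict. The outer keys become the columns and the inner keys become the row index: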
In [56]: data1={'NAV':{1:2000,2:2020,3:2016},
...: 'ONI':{1:'HELLO',2:'ARE',3:'YOU',4:'BRO'}}
In [57]: frame3=pd.DataFrame(data1)
In [58]: frame3
Out[58]:
NAV ONI
1 2000.0 HELLO
2 2020.0 ARE
3 2016.0 YOU
4 NaN BRO
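The two construction styles above can be sketched in a standalone script. This is a minimal illustration rather than the original session: the three-row sample data and the `mismatch_rejected` flag are assumptions added here to show the equal-length requirement.

```python
import numpy as np
import pandas as pd

# Dict of equal-length columns: one plain list and two NumPy arrays.
data = {
    "state": ["one", "two", "three"],
    "year": np.array([2000, 2001, 2002]),
    "pop": np.array([1.5, 4.0, 5.0]),
}
frame = pd.DataFrame(data)

# Columns of different lengths are rejected with a ValueError.
try:
    pd.DataFrame({"a": [1, 2, 3], "b": [1, 2]})
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

# Nested dict: outer keys become columns, inner keys become row labels.
# Row 4 has no 'NAV' entry, so that cell is filled with NaN.
nested = {
    "NAV": {1: 2000, 2: 2020, 3: 2016},
    "ONI": {1: "HELLO", 2: "ARE", 3: "YOU", 4: "BRO"},
}
frame3 = pd.DataFrame(nested)
print(frame)
print(frame3)
```

Because NaN is a floating-point value, the NAV column is stored as floats, which is why the integers print as 2000.0 and so on in the output above.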
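2. The resulting DataFrame is automatically given an index, numbered sequentially from 0, just as a Series is. (A custom index can be supplied with pd.DataFrame(data, index=[...]).)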
In [30]: frame
Out[30]:
state year pop
0 one 2000 1.5
1 two 2001 4.0
2 three 202 5.0
3 four 2003 9.0
4 five 2004 10.2
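A short sketch of the default index versus a custom one. The labels 'a', 'b', 'c' and the shortened data are assumptions chosen for illustration.

```python
import pandas as pd

data = {"state": ["one", "two", "three"], "year": [2000, 2001, 2002]}

# Without index=, rows are labeled 0, 1, 2, ...
default = pd.DataFrame(data)

# With index=, the given labels are used; there must be one label per row.
labeled = pd.DataFrame(data, index=["a", "b", "c"])

print(default.index)
print(labeled.index)
```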
3. If a column order is specified, the DataFrame's columns are arranged in that order:
In [31]: pd.DataFrame(data,columns=['year','state','pop'])
Out[31]:
year state pop
0 2000 one 1.5
1 2001 two 4.0
2 202 three 5.0
3 2003 four 9.0
4 2004 five 10.2
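As a small sketch, the columns argument can also pick out a subset of the dict keys. The subset case is an addition to the notes, and the variable names `ordered` and `subset` are hypothetical.

```python
import pandas as pd

data = {
    "state": ["one", "two", "three"],
    "year": [2000, 2001, 2002],
    "pop": [1.5, 4.0, 5.0],
}

# All three keys, in a chosen order.
ordered = pd.DataFrame(data, columns=["year", "state", "pop"])

# Only two of the keys: 'state' is left out of the result.
subset = pd.DataFrame(data, columns=["pop", "year"])

print(ordered.columns.tolist())
print(subset.columns.tolist())
```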
4. A column can be retrieved as a Series, either by attribute access or with dict-like notation:
In [32]: frame.year
Out[32]:
0 2000
1 2001
2 202
3 2003
4 2004
Name: year, dtype: int64
In [33]: frame['year']
Out[33]:
0 2000
1 2001
2 202
3 2003
4 2004
Name: year, dtype: int64
A column can also be modified by assignment:
In [34]: frame['year']=[2010,2012,2014,2016,2020]
In [35]: frame
Out[35]:
state year pop
0 one 2010 1.5
1 two 2012 4.0
2 three 2014 5.0
3 four 2016 9.0
4 five 2020 10.2
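Note: frame[column] works for any column name, but frame.column works only when the column name is a valid Python variable name and does not collide with an existing DataFrame attribute or method. For example, frame.pop refers to the DataFrame's pop method, not to the 'pop' column.
5. Rows can be retrieved by label with loc: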
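The difference between the two access styles can be sketched as follows. The 'net change' column is a hypothetical name added here only because it contains a space.

```python
import pandas as pd

frame = pd.DataFrame({
    "state": ["one", "two"],
    "year": [2000, 2001],
    "pop": [1.5, 4.0],
    "net change": [0.1, 0.2],  # hypothetical column; not a valid identifier
})

# Both styles return the same Series for an ordinary name.
same = frame.year.equals(frame["year"])

# 'pop' is also the name of a DataFrame method, so attribute access
# finds the method first; brackets are needed to reach the column.
pop_is_method = callable(frame.pop)
pop_column = frame["pop"]

# A name containing a space cannot be written as frame.<name>;
# bracket notation works for any column name.
net = frame["net change"]
print(same, pop_is_method, pop_column.tolist(), net.tolist())
```

Bracket notation is the safer default in scripts, since it keeps working when a column name later clashes with a method.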
In [38]: frame.loc[2]
Out[38]:
state three
year 2014
pop 5
Name: 2, dtype: object
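With the default integer index, a label and a position look the same. The sketch below uses string labels (an assumption for illustration) so the two can be told apart, and also shows iloc, which the notes do not cover, for position-based lookup.

```python
import pandas as pd

frame = pd.DataFrame(
    {"state": ["one", "two", "three"], "year": [2010, 2012, 2014]},
    index=["a", "b", "c"],
)

# loc selects by index label and returns the row as a Series.
row_by_label = frame.loc["c"]

# iloc selects by integer position; position 2 is the same row here.
row_by_position = frame.iloc[2]

print(row_by_label)
```

The returned Series takes the row label as its name and the column names as its index.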
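6. Adding a column: if the columns argument names a column that is not in the dict, that column is created and filled with missing values:
In [39]: frame2=pd.DataFrame(data,columns=['year','state','pop','debt'])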
In [40]: frame2
Out[40]:
year state pop debt
0 2000 one 1.5 NaN
1 2001 two 4.0 NaN
2 202 three 5.0 NaN
3 2003 four 9.0 NaN
4 2004 five 10.2 NaN
NaN is the default marker for missing values.
7. Assigning a Series to the debt column:
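A brief sketch of checking for the missing values and then filling the column. The scalar assignment is an addition not shown in the notes, and 16.5 is an arbitrary value.

```python
import pandas as pd

data = {"year": [2000, 2001], "state": ["one", "two"]}

# 'debt' is not a key in data, so the column is created as all-missing.
frame2 = pd.DataFrame(data, columns=["year", "state", "debt"])
all_missing = frame2["debt"].isna().all()

# Assigning a scalar fills every row of the column with that value.
frame2["debt"] = 16.5
print(all_missing)
print(frame2)
```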
In [44]: val=pd.Series(['x','y','z'],index=[1,3,4])
In [45]: frame2['debt']=val
In [46]: frame2
Out[46]:
year state pop debt
0 2000 one 1.5 NaN
1 2001 two 4.0 x
2 202 three 5.0 NaN
3 2003 four 9.0 y
4 2004 five 10.2 z
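The Series's labels are aligned to the DataFrame's index, and any row without a matching label is filled with a missing value.
8. Assigning to a column that does not exist creates a new column automatically; the del keyword deletes a column.
For example, add a boolean column that indicates whether year equals 2016: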
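The alignment rule can be sketched with a smaller frame. The 'other' and 'flag' columns are hypothetical additions that show two edge cases: a Series whose labels match nothing, and a plain list, which is assigned by position and must match the frame's length.

```python
import pandas as pd

frame2 = pd.DataFrame({"year": [2000, 2001, 2002, 2003, 2004]})

# Labels 1, 3, 4 match rows of the frame; rows 0 and 2 get NaN.
val = pd.Series(["x", "y", "z"], index=[1, 3, 4])
frame2["debt"] = val

# A label that is absent from the frame's index (99) is simply ignored,
# so the whole column ends up missing.
frame2["other"] = pd.Series(["q"], index=[99])

# A plain list has no labels: values are assigned by position.
frame2["flag"] = [True, False, True, False, True]
print(frame2)
```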
In [54]: frame['bool']=frame.year==2016
In [55]: frame
Out[55]:
state year pop bool
0 one 2010 1.5 False
1 two 2012 4.0 False
2 three 2014 5.0 False
3 four 2016 9.0 True
4 five 2020 10.2 False
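Note: the frame.column syntax cannot create a new column.
A column selected from a DataFrame is a view on the underlying data, not a copy, so in-place changes to that Series can show up in the DataFrame. Use the Series's copy method when an independent copy is needed. (Newer pandas versions with Copy-on-Write enabled behave differently.)
9. The values attribute of a DataFrame returns the data as a two-dimensional ndarray: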
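The points in item 8 can be sketched together. The column names `is_2016` and `total` are hypothetical, and the warning filter is there only because pandas emits a UserWarning when a list is assigned to a new attribute name.

```python
import warnings

import pandas as pd

frame = pd.DataFrame({"year": [2014, 2016, 2020], "pop": [5.0, 9.0, 10.2]})

# Bracket assignment to a new name creates the column.
frame["is_2016"] = frame["year"] == 2016
flags = frame["is_2016"].tolist()

# Attribute assignment only sets a Python attribute on the object;
# no column named 'total' appears.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    frame.total = [1, 2, 3]
attr_made_column = "total" in frame.columns

# del removes a column in place.
del frame["is_2016"]

# An explicit copy is independent of the frame in every pandas version.
year_copy = frame["year"].copy()
year_copy.iloc[0] = 0
print(frame)
```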
In [59]: frame3.values
Out[59]:
array([[2000.0, 'HELLO'],
[2020.0, 'ARE'],
[2016.0, 'YOU'],
[nan, 'BRO']], dtype=object)
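The dtype=object in the output above comes from mixing numbers and strings in one array. A minimal sketch of the contrast, using made-up frames named `mixed` and `numeric`:

```python
import pandas as pd

# Numeric and string columns: the array must hold arbitrary Python objects.
mixed = pd.DataFrame({"NAV": [2000.0, 2020.0], "ONI": ["HELLO", "ARE"]})
mixed_arr = mixed.values

# Integer and float columns: a common numeric dtype (float) is used.
numeric = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
numeric_arr = numeric.values

print(mixed_arr.dtype, numeric_arr.dtype)
```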
10. pandas Index objects: any array or sequence of labels can be turned into an Index.
In [60]: label=pd.Index(np.arange(3))
In [61]: obj_1=pd.Series(['ONI','ENV','FRE'],index=label)
In [62]: obj_1
Out[62]:
0 ONI
1 ENV
2 FRE
dtype: object
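A short sketch showing that a NumPy array and a plain list of strings both work as the source of an Index, and that passing a list straight to index= builds an equivalent Index internally. The string labels are an assumption for illustration.

```python
import numpy as np
import pandas as pd

from_array = pd.Index(np.arange(3))
from_list = pd.Index(["a", "b", "c"])

# An explicit Index object and a plain list give the same result.
obj = pd.Series(["ONI", "ENV", "FRE"], index=from_list)
obj2 = pd.Series(["ONI", "ENV", "FRE"], index=["a", "b", "c"])

print(obj.index)
print(obj2.index)
```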
The append method joins one Index onto the end of another and returns the result as a new Index:
In [63]: label_2=pd.Index(np.arange(3,6))
In [65]: label_2
Out[65]: Int64Index([3, 4, 5], dtype='int64')
In [67]: label_1=pd.Index(np.arange(3))
In [68]: label_1
Out[68]: Int64Index([0, 1, 2], dtype='int64')
In [69]: label_0=label_1.append(label_2)
In [70]: label_0
Out[70]: Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')
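The same steps as a standalone script. Index.append returns a new Index and leaves both inputs unchanged. How the result prints depends on the pandas version: pandas 2.0 and later show Index([...], dtype='int64') rather than Int64Index.

```python
import numpy as np
import pandas as pd

label_1 = pd.Index(np.arange(3))
label_2 = pd.Index(np.arange(3, 6))

# The result is a new Index; label_1 and label_2 are not modified.
label_0 = label_1.append(label_2)

print(label_0)
print(label_1)
```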