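Tabular data structure
A DataFrame is a tabular data structure. It contains an ordered collection of columns, and each column can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and it can be thought of as a dict of Series that all share the same index. Internally, the data is stored as one or more two-dimensional blocks rather than as a list, dict, or some other collection of one-dimensional structures.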
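The "dict of Series sharing one index" picture can be sketched directly. This is a minimal illustration with made-up values; the names pop, state, and frame are assumptions for the example only.

```python
import pandas as pd

# Two Series of different value types that share the same row labels
pop = pd.Series([1.5, 1.7, 3.6], index=["x", "y", "z"])
state = pd.Series(["a", "a", "b"], index=["x", "y", "z"])

# Building a DataFrame from a dict of Series: keys become column names
frame = pd.DataFrame({"pop": pop, "state": state})

# Each column comes back as a Series carrying the frame's row index
pop_column = frame["pop"]
shares_index = pop_column.index.equals(frame.index)
print(frame)
print(shares_index)
```

Each column keeps its own value type while the row labels are common to all of them.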
Constructing a DataFrame
1. The most common way is to pass in a dict of equal-length lists or NumPy arrays directly.
from pandas import DataFrame
data = {
"state": ['a', 'a', 'a', 'b', 'b'],
"year": [2000, 2001, 2002, 2001, 2002],
"pop": [1.5, 1.7, 3.6, 2.4, 2.9]
}
frame = DataFrame(data)
print(frame)
# pop state year
# 0 1.5 a 2000
# 1 1.7 a 2001
# 2 3.6 a 2002
# 3 2.4 b 2001
# 4 2.9 b 2002
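The printed column order (pop, state, year) differs from the key order in data because older pandas versions sorted the dict keys alphabetically. Recent versions (pandas 0.23+ on Python 3.6+) keep the dict's insertion order instead. Either way, the columns of a DataFrame are ordered.
If you pass columns explicitly, the columns are arranged in exactly that order, just as an explicitly specified index fixes the order of a Series.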
from pandas import DataFrame
data = {
"state": ['a', 'a', 'a', 'b', 'b'],
"year": [2000, 2001, 2002, 2001, 2002],
"pop": [1.5, 1.7, 3.6, 2.4, 2.9]
}
frame2 = DataFrame(data, columns=['year', 'state', 'pop'])
print(frame2)
# year state pop
# 0 2000 a 1.5
# 1 2001 a 1.7
# 2 2002 a 3.6
# 3 2001 b 2.4
# 4 2002 b 2.9
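An existing frame can also be reordered after construction by indexing it with a list of column names. A minimal sketch, assuming the same data dict as above; the name reordered is only illustrative.

```python
import pandas as pd

data = {
    "state": ["a", "a", "a", "b", "b"],
    "year": [2000, 2001, 2002, 2001, 2002],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9],
}
frame = pd.DataFrame(data)

# Indexing with a list of names returns a new frame in that column order
reordered = frame[["year", "state", "pop"]]
print(list(reordered.columns))
```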
Missing values
As with Series, if a column you pass in is not found in the data, it is filled with NaN values.
from pandas import DataFrame
data = {
"state": ['a', 'a', 'a', 'b', 'b'],
"year": [2000, 2001, 2002, 2001, 2002],
"pop": [1.5, 1.7, 3.6, 2.4, 2.9]
}
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'deep'],
index=['one', 'two', 'three', 'four', 'five'])
print(frame2)
# year state pop deep
# one 2000 a 1.5 NaN
# two 2001 a 1.7 NaN
# three 2002 a 3.6 NaN
# four 2001 b 2.4 NaN
# five 2002 b 2.9 NaN
Getting the values, columns, or index
from pandas import DataFrame
data = {
"state": ['a', 'a', 'a', 'b', 'b'],
"year": [2000, 2001, 2002, 2001, 2002],
"pop": [1.5, 1.7, 3.6, 2.4, 2.9]
}
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'deep'],
index=['one', 'two', 'three', 'four', 'five'])
print(frame2.values)
# [[2000 'a' 1.5 nan]
# [2001 'a' 1.7 nan]
# [2002 'a' 3.6 nan]
# [2001 'b' 2.4 nan]
# [2002 'b' 2.9 nan]]
print(frame2.columns)
# Index(['year', 'state', 'pop', 'deep'], dtype='object')
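The row labels are available in the same way through the index attribute. A short sketch using the same frame2 as above; the helper names row_labels and col_labels are assumptions for illustration.

```python
import pandas as pd

data = {
    "state": ["a", "a", "a", "b", "b"],
    "year": [2000, 2001, 2002, 2001, 2002],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9],
}
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "deep"],
                      index=["one", "two", "three", "four", "five"])

# index holds the row labels, columns holds the column labels
row_labels = list(frame2.index)
col_labels = list(frame2.columns)
print(row_labels)
print(col_labels)
print(frame2.shape)  # (number of rows, number of columns)
```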
A DataFrame column is retrieved as a Series
from pandas import DataFrame
data = {
"state": ['a', 'a', 'a', 'b', 'b'],
"year": [2000, 2001, 2002, 2001, 2002],
"pop": [1.5, 1.7, 3.6, 2.4, 2.9]
}
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'deep'],
index=['one', 'two', 'three', 'four', 'five'])
print(frame2['state'])
# one a
# two a
# three a
# four b
# five b
# Name: state, dtype: object
print(type(frame2['state']))
# <class 'pandas.core.series.Series'>
print(frame2.year)
# one 2000
# two 2001
# three 2002
# four 2001
# five 2002
# Name: year, dtype: int64
The printed output shows that the returned Series keeps the same index as the original DataFrame, and that its name attribute has been set accordingly.
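One caveat with attribute-style access: it only works when the column name is a valid Python identifier that does not collide with an existing DataFrame attribute. The pop column used in these examples is such a collision, because DataFrame already has a pop method. A minimal sketch with a smaller, made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({"state": ["a", "b"], "pop": [1.5, 2.4]})

by_key = frame["pop"]   # bracket access always returns the column
by_attr = frame.pop     # attribute access finds the DataFrame.pop method instead

is_series = isinstance(by_key, pd.Series)
is_method = callable(by_attr) and not isinstance(by_attr, pd.Series)
print(is_series, is_method)
```

Bracket access is therefore the safer default when column names are not under your control.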
The examples above retrieve columns. Rows can be retrieved as well; the following gets the row labeled three:
from pandas import DataFrame
data = {
"state": ['a', 'a', 'a', 'b', 'b'],
"year": [2000, 2001, 2002, 2001, 2002],
"pop": [1.5, 1.7, 3.6, 2.4, 2.9]
}
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'deep'],
index=['one', 'two', 'three', 'four', 'five'])
print(frame2)
# year state pop deep
# one 2000 a 1.5 NaN
# two 2001 a 1.7 NaN
# three 2002 a 3.6 NaN
# four 2001 b 2.4 NaN
# five 2002 b 2.9 NaN
print("========")
print(frame2.loc['three'])  # .ix was removed in pandas 1.0; .loc selects by label
# year 2002
# state a
# pop 3.6
# deep NaN
# Name: three, dtype: object
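Rows can be selected either by label or by integer position. The sketch below uses loc and iloc on the same frame2; the names by_label and by_position are illustrative only.

```python
import pandas as pd

data = {
    "state": ["a", "a", "a", "b", "b"],
    "year": [2000, 2001, 2002, 2001, 2002],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9],
}
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "deep"],
                      index=["one", "two", "three", "four", "five"])

by_label = frame2.loc["three"]   # select the row by its index label
by_position = frame2.iloc[2]     # select the same row by 0-based position

print(by_label["year"], by_position["state"])
print(by_label.name)  # the row label becomes the name of the returned Series
```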
Columns can be modified by assignment
from pandas import DataFrame
data = {
"state": ['a', 'a', 'a', 'b', 'b'],
"year": [2000, 2001, 2002, 2001, 2002],
"pop": [1.5, 1.7, 3.6, 2.4, 2.9]
}
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'deep'],
index=['one', 'two', 'three', 'four', 'five'])
frame2['deep'] = 16.6
print(frame2)
# year state pop deep
# one 2000 a 1.5 16.6
# two 2001 a 1.7 16.6
# three 2002 a 3.6 16.6
# four 2001 b 2.4 16.6
# five 2002 b 2.9 16.6
A column can also be assigned a sequence of values
import numpy as np
frame2['deep'] = np.arange(5.)
print(frame2)
# year state pop deep
# one 2000 a 1.5 0.0
# two 2001 a 1.7 1.0
# three 2002 a 3.6 2.0
# four 2001 b 2.4 3.0
# five 2002 b 2.9 4.0
frame2['deep'] = [2, 3, 10, 2, 99]
print(frame2)
# year state pop deep
# one 2000 a 1.5 2
# two 2001 a 1.7 3
# three 2002 a 3.6 10
# four 2001 b 2.4 2
# five 2002 b 2.9 99
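Important note
When you assign a list or array to a column, its length must match the length of the DataFrame. If you assign a Series instead, it is aligned exactly to the DataFrame's index, and every position without a matching label is filled with a missing value.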
from pandas import DataFrame, Series
data = {
"state": ['a', 'a', 'a', 'b', 'b'],
"year": [2000, 2001, 2002, 2001, 2002],
"pop": [1.5, 1.7, 3.6, 2.4, 2.9]
}
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'deep'],
index=['one', 'two', 'three', 'four', 'five'])
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['deep'] = val
print(frame2)
# year state pop deep
# one 2000 a 1.5 NaN
# two 2001 a 1.7 -1.2
# three 2002 a 3.6 NaN
# four 2001 b 2.4 -1.5
# five 2002 b 2.9 -1.7
This assigns values to the rows two, four, and five of the deep column.
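The two assignment rules can be contrasted in one short sketch. It uses a small made-up frame; the variable error exists only to capture the exception for inspection.

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001, 2002]},
                     index=["one", "two", "three"])

# A list must have exactly one value per row; 2 values for 3 rows fails
try:
    frame["deep"] = [1.0, 2.0]
    error = None
except ValueError as exc:
    error = exc

# A Series may be shorter: it is aligned by label, and gaps become NaN
frame["deep"] = pd.Series([9.9], index=["two"])
missing_count = int(frame["deep"].isna().sum())
print(type(error).__name__, missing_count)
```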
Assigning to a column that does not exist creates a new column. The del keyword deletes a column.
from pandas import DataFrame, Series
data = {
"state": ['a', 'a', 'a', 'b', 'b'],
"year": [2000, 2001, 2002, 2001, 2002],
"pop": [1.5, 1.7, 3.6, 2.4, 2.9]
}
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'deep'],
index=['one', 'two', 'three', 'four', 'five'])
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['deep'] = val
frame2['eastern'] = frame2.state == 'b'
print(frame2)
# year state pop deep eastern
# one 2000 a 1.5 NaN False
# two 2001 a 1.7 -1.2 False
# three 2002 a 3.6 NaN False
# four 2001 b 2.4 -1.5 True
# five 2002 b 2.9 -1.7 True
This creates the new column eastern by assigning to a column that did not exist before.
Deleting a column
from pandas import DataFrame, Series
data = {
"state": ['a', 'a', 'a', 'b', 'b'],
"year": [2000, 2001, 2002, 2001, 2002],
"pop": [1.5, 1.7, 3.6, 2.4, 2.9]
}
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'deep'],
index=['one', 'two', 'three', 'four', 'five'])
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['deep'] = val
frame2['eastern'] = frame2.state == 'b'
print(frame2)
del frame2['eastern']
print(frame2.columns)
# year state pop deep eastern
# one 2000 a 1.5 NaN False
# two 2001 a 1.7 -1.2 False
# three 2002 a 3.6 NaN False
# four 2001 b 2.4 -1.5 True
# five 2002 b 2.9 -1.7 True
# Index(['year', 'state', 'pop', 'deep'], dtype='object')
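del modifies the frame in place. If the original frame should stay untouched, the drop method returns a new frame without the column. A minimal sketch with a smaller, made-up frame; the name trimmed is illustrative.

```python
import pandas as pd

frame = pd.DataFrame({"state": ["a", "b"], "year": [2000, 2001]})
frame["eastern"] = frame["state"] == "b"

# drop returns a new DataFrame; the original keeps its columns
trimmed = frame.drop(columns=["eastern"])
print(list(frame.columns))
print(list(trimmed.columns))
```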
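Modification behavior
In the pandas versions this text was written against, the column returned by indexing a DataFrame is a view on the underlying data, not a copy, so any in-place modification of that Series is reflected in the source DataFrame. Under Copy-on-Write (opt-in in pandas 2.x, the default in pandas 3.0) such modifications no longer propagate. To get an independent column in any version, copy it explicitly with the Series copy method.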
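The explicit copy can be sketched as follows. The example only demonstrates the copy case, which behaves the same across pandas versions; the frame and the name year_copy are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001, 2002]})

# copy() produces an independent Series
year_copy = frame["year"].copy()
year_copy.iloc[0] = 1999

# The source frame is unaffected by changes to the copy
print(frame["year"].iloc[0], year_copy.iloc[0])
```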
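Nested dicts
If you pass a dict of dicts to DataFrame, the keys of the outer dict are interpreted as the columns and the keys of the inner dicts as the row index.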
from pandas import DataFrame, Series
pop = {'aa': {2001: 2.4, 2002: 2.9}, 'bb': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame = DataFrame(pop)
print(frame)
# aa bb
# 2000 NaN 1.5
# 2001 2.4 1.7
# 2002 2.9 3.6
Transpose the result with T
frame2 = frame.T
print(frame2)
# 2000 2001 2002
# aa NaN 2.4 2.9
# bb 1.5 1.7 3.6
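As the output above shows, the keys of the inner dicts are merged and sorted to form the final index. To fix the index yourself, specify it explicitly: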
from pandas import DataFrame, Series
pop = {'aa': {2001: 2.4, 2002: 2.9}, 'bb': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame = DataFrame(pop)
frame2 = DataFrame(pop, index=[2002, 2001, 2003])
print(frame)
print("========")
print(frame2)
# aa bb
# 2000 NaN 1.5
# 2001 2.4 1.7
# 2002 2.9 3.6
# ========
# aa bb
# 2002 2.9 3.6
# 2001 2.4 1.7
# 2003 NaN NaN
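The name attribute
If the name attributes of a DataFrame's index and columns are set, they are displayed as well. The output below shows the frame before and after setting them.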
from pandas import DataFrame, Series
pop = {'aa': {2001: 2.4, 2002: 2.9}, 'bb': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame = DataFrame(pop)
print(frame)
print("=========")
frame.index.name = 'year'
frame.columns.name = 'state'
print(frame)
# aa bb
# 2000 NaN 1.5
# 2001 2.4 1.7
# 2002 2.9 3.6
# =========
# state aa bb
# year
# 2000 NaN 1.5
# 2001 2.4 1.7
# 2002 2.9 3.6
Getting the values
As with Series, the values attribute returns the data in the DataFrame as a two-dimensional ndarray.
frame.values
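The dtype of the returned array depends on the columns: when they all share a numeric type the array keeps it, and when they are mixed the array falls back to a dtype that can hold every column. A small sketch with made-up frames named numeric and mixed:

```python
import pandas as pd

numeric = pd.DataFrame({"aa": [2.4, 2.9], "bb": [1.7, 3.6]})
mixed = pd.DataFrame({"year": [2000, 2001], "state": ["a", "b"]})

# All-float columns give a float array; int plus string gives object
numeric_dtype = str(numeric.values.dtype)
mixed_dtype = str(mixed.values.dtype)
print(numeric.values.shape, numeric_dtype, mixed_dtype)
```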