1.Series
import numpy as np
from pandas import DataFrame,Series
a = np.array([1,2,3,4])
obj = Series(a)
print(obj)
0 1
1 2
2 3
3 4
When no index is specified, a default integer index from 0 to N-1 is generated automatically.
import numpy as np
from pandas import DataFrame,Series
a = np.array([1,2,3,4])
obj = Series(a)
print(obj.values)  # values is an attribute, not a method, so it is not called with ()
print(obj.index)
[1 2 3 4]
RangeIndex(start=0, stop=4, step=1)
You can also supply your own index:
import numpy as np
from pandas import DataFrame,Series
a = np.array([1,2,3,4])
obj = Series(a, index=['d', 'b', 'a', 'c'])
print(obj)
print(obj.values)
print(obj.index)
d 1
b 2
a 3
c 4
[1 2 3 4]
Index(['d', 'b', 'a', 'c'], dtype='object')
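Values can be selected by index label:
print(obj['c'])  # prints 4
print(obj[ ['d','b','a'] ])
To select several labels at once, pass them as a list, as in obj[ ['d','b','a'] ]; writing obj['d','b','a'] raises an error.
d 1
b 2
a 3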
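As a small sketch of the selection rules above: the list form returns a new Series in the order requested, and .loc is the explicit label-based spelling of the same lookup. The variable names here are only for illustration.

```python
import numpy as np
from pandas import Series

obj = Series(np.array([1, 2, 3, 4]), index=['d', 'b', 'a', 'c'])

single = obj['c']                # scalar lookup by one label
several = obj[['d', 'b', 'a']]   # a list of labels returns a new Series
same = obj.loc[['d', 'b', 'a']]  # .loc is the explicit label-based form

print(single)
print(several)
```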
Array operations preserve the link between each index label and its value:
print( obj[obj > 1] )
b 2
a 3
c 4
print( obj ** 2 )
d 1
b 4
a 9
c 16
A Series can also be thought of as a dict, with the index labels playing the role of the keys:
print( 'b' in obj )
print('e' in obj )
True
False
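The dict analogy goes a little further. The sketch below assumes the same obj as before and shows that membership only looks at the labels (not the values), and that get and to_dict behave much like their dict counterparts.

```python
from pandas import Series

obj = Series([1, 2, 3, 4], index=['d', 'b', 'a', 'c'])

has_label = 'b' in obj        # membership tests the index labels
has_value = 3 in obj          # values are not searched, so this is False
fallback = obj.get('e', -1)   # like dict.get: default for a missing label
as_dict = obj.to_dict()       # back to a plain dict

print(has_label, has_value, fallback)
print(as_dict)
```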
If the data is already in a dict, it can be converted to a Series directly:
import numpy as np
from pandas import DataFrame,Series
info = {'name':'jack', 'age':28, 'sex':'m'}
obj = Series(info)
print(obj)
age 28
name jack
sex m
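Note that the output above is sorted by key rather than following the order in the dict. This is how older pandas versions behave: they sorted the keys because dicts had no guaranteed order. With Python 3.7+ and a recent pandas, the dict's insertion order is kept instead. You can also pass an explicit index along with the dict: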
from pandas import DataFrame,Series
info = {'name':'jack', 'age':28, 'sex':'m'}
obj = Series(info, ['vocation', 'name', 'age'])
print(obj)
vocation NaN
name jack
age 28
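When an index is given together with a dict, the result follows the order of that index, and only labels that match a dict key receive a value. Labels with no matching key show NaN, which marks a missing value.
The pandas functions isnull() and notnull() can be used to detect missing data: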
import numpy as np
from pandas import DataFrame,Series
import pandas as pd
info = {'name':'jack', 'age':28, 'sex':'m'}
obj = Series(info, ['vocation', 'name', 'age'])
print(pd.isnull(obj))
vocation True
name False
age False
print(pd.notnull(obj))
vocation False
name True
age True
Series also provides the same checks as instance methods:
print(obj.isnull())
print(obj.notnull())
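A short sketch of how the instance methods relate to the module-level functions: the mask from obj.isnull() matches pd.isnull(obj), and the notnull() mask can be used to keep only the entries that have a value. The names mask and present are just illustrative.

```python
import pandas as pd
from pandas import Series

info = {'name': 'jack', 'age': 28, 'sex': 'm'}
obj = Series(info, index=['vocation', 'name', 'age'])

mask = obj.isnull()                     # True where the value is missing
same_as_function = mask.equals(pd.isnull(obj))
present = obj[obj.notnull()]            # keep only the non-missing entries

print(mask)
print(present)
```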
from pandas import DataFrame,Series
import pandas as pd
info1 = {'beijing':10, 'suzhou':30, 'yangzhou':20,'hangzhou':40}
info2 = {'nanjing':10, 'yangzhou':20, 'suzhou':30, 'shanghai':40}
obj1 = Series(info1)
obj2 = Series(info2)
print(obj1 + obj2)
beijing NaN
hangzhou NaN
nanjing NaN
shanghai NaN
suzhou 60.0
yangzhou 40.0
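As the output shows, adding two Series aligns them by index label automatically. Labels that appear in only one of the two Series yield NaN, and the result is converted to float so that it can hold NaN.
Both the Series object itself and its index have a name attribute, which ties in closely with other pandas functionality: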
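If NaN for the unmatched labels is not what you want, one option is the add method with fill_value, which treats a label missing from one side as that value. This is a sketch using the same two dicts as above; total and filled are illustrative names.

```python
from pandas import Series

obj1 = Series({'beijing': 10, 'suzhou': 30, 'yangzhou': 20, 'hangzhou': 40})
obj2 = Series({'nanjing': 10, 'yangzhou': 20, 'suzhou': 30, 'shanghai': 40})

total = obj1 + obj2                     # unmatched labels become NaN
filled = obj1.add(obj2, fill_value=0)   # a missing side counts as 0

print(total)
print(filled)
```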
from pandas import DataFrame,Series
import pandas as pd
info1 = {'beijing':10, 'yangzhou':20, 'suzhou':30, 'hangzhou':40}
obj1 = Series(info1)
obj1.name = 'population'
obj1.index.name = 'city'
print(obj1)
city
beijing 10
hangzhou 40
suzhou 30
yangzhou 20
Name: population
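The index can be replaced in place by assignment, but doing so discards the name of the index (the name of the Series itself is kept):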
from pandas import DataFrame,Series
import pandas as pd
info1 = {'beijing':10, 'yangzhou':20, 'suzhou':30, 'hangzhou':40}
obj1 = Series(info1)
obj1.name = 'population'
obj1.index.name = 'city'
print(obj1)
obj1.index = ['nanjing', 'wuxi', 'nantong', 'changzhou']
print(obj1)
city
beijing 10
hangzhou 40
suzhou 30
yangzhou 20
Name: population, dtype: int64
After the index is replaced, the index name (city) is gone, while the Series name is still there:
nanjing 10
wuxi 40
nantong 30
changzhou 20
Name: population, dtype: int64
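The two name attributes can be checked directly. In this sketch, index_name_after is an illustrative variable that captures the index name right after the replacement; the index name is then set again by hand.

```python
from pandas import Series

obj1 = Series({'beijing': 10, 'yangzhou': 20, 'suzhou': 30, 'hangzhou': 40})
obj1.name = 'population'
obj1.index.name = 'city'

# Replacing the index installs a brand-new Index object without a name
obj1.index = ['nanjing', 'wuxi', 'nantong', 'changzhou']
index_name_after = obj1.index.name
series_name_after = obj1.name

# The index name can simply be assigned again
obj1.index.name = 'city'
print(index_name_after, series_name_after, obj1.index.name)
```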
2.DataFrame
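In a DataFrame, each column holds a different feature and each row is identified by an index label. The most common way to build one is from a dict of equal-length lists: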
from pandas import DataFrame,Series
import pandas as pd
data = { 'name':['Jack','Mike','Jone','Lily','Lucy'],
'age':['24','25','24','23','22'],
'sex':['m','m','m','w','w'] }
frame = DataFrame(data)
print(frame)
age name sex
0 24 Jack m
1 25 Mike m
2 24 Jone m
3 23 Lily w
4 22 Lucy w
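Here the columns appear in alphabetical order because no column order was specified (this is how older pandas versions behave; recent versions keep the dict's insertion order). To set the order explicitly: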
frame = DataFrame(data, columns=['name', 'age', 'sex'])
name age sex
0 Jack 24 m
1 Mike 25 m
2 Jone 24 m
3 Lily 23 w
4 Lucy 22 w
This time the columns follow the order we specified.
We can also specify the index:
frame = DataFrame(data, columns=['name', 'age', 'sex'], index=['one', 'two', 'three', 'four', 'five'])
name age sex
one Jack 24 m
two Mike 25 m
three Jone 24 m
four Lily 23 w
five Lucy 22 w
If we list a column that has no data in the dict, it is filled with NaN:
frame = DataFrame(data, columns=['name', 'age', 'sex', 'tel'], index=['one', 'two', 'three', 'four', 'five'])
name age sex tel
one Jack 24 m NaN
two Mike 25 m NaN
three Jone 24 m NaN
four Lily 23 w NaN
five Lucy 22 w NaN
You can list all the column names, and just as easily look at all the data in a single column:
print(frame.columns)
print(frame['name'])
print(frame['name'][:3])
Index(['name', 'age', 'sex', 'tel'], dtype='object')
one Jack
two Mike
three Jone
four Lily
five Lucy
Name: name
one Jack
two Mike
three Jone
Name: name
A column can also be accessed as an attribute:
print(frame.age[:2])
one 24
two 25
Name: age
You can also retrieve all the data for one sample (row) by its label, using loc:
print(frame.loc['three'])
name Jone
age 24
sex m
tel NaN
Name: three
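For row access, loc selects by label and iloc selects by integer position. The sketch below rebuilds the same frame and fetches the third row both ways; by_label and by_position are illustrative names.

```python
from pandas import DataFrame

data = {'name': ['Jack', 'Mike', 'Jone', 'Lily', 'Lucy'],
        'age': ['24', '25', '24', '23', '22'],
        'sex': ['m', 'm', 'm', 'w', 'w']}
frame = DataFrame(data, columns=['name', 'age', 'sex', 'tel'],
                  index=['one', 'two', 'three', 'four', 'five'])

by_label = frame.loc['three']       # row selected by its index label
by_position = frame.iloc[2]         # same row selected by position
one_cell = frame.loc['three', 'age']  # a single cell: row label, column name

print(by_label)
print(one_cell)
```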
A column can also be assigned to:
frame['tel'] = 178  # sets tel to 178 for every row
name age sex tel
one Jack 24 m 178
two Mike 25 m 178
three Jone 24 m 178
four Lily 23 w 178
five Lucy 22 w 178
frame['tel'] = np.arange(5)  # when assigning an array, its length must equal the number of rows
name age sex tel
one Jack 24 m 0
two Mike 25 m 1
three Jone 24 m 2
four Lily 23 w 3
five Lucy 22 w 4
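Assigning a Series to a DataFrame column aligns on the index:
val = Series( [132, 178, 151], index=['two', 'four', 'five'] )  # without an explicit index the labels would default to 0, 1, 2, which match none of the DataFrame's labels, so every row would end up as NaN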
frame['tel'] = val
print(frame)
name age sex tel
one Jack 24 m NaN
two Mike 25 m 132.0
three Jone 24 m NaN
four Lily 23 w 178.0
five Lucy 22 w 151.0
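To make the alignment rule concrete, this sketch assigns one Series with matching labels and one with the default 0, 1, 2 index. The column name tel2 is made up for the comparison.

```python
from pandas import DataFrame, Series

frame = DataFrame({'name': ['Jack', 'Mike', 'Jone', 'Lily', 'Lucy']},
                  index=['one', 'two', 'three', 'four', 'five'])

# Labels match three of the five rows; the other rows get NaN
frame['tel'] = Series([132, 178, 151], index=['two', 'four', 'five'])

# Default labels 0, 1, 2 match no row label, so the whole column is NaN
frame['tel2'] = Series([132, 178, 151])

print(frame)
```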
Assigning to a column that does not exist creates a new column:
frame['val'] = frame['name'] == 'Jack'
print(frame)
name age sex tel val
one Jack 24 m NaN True
two Mike 25 m NaN False
three Jone 24 m NaN False
four Lily 23 w NaN False
five Lucy 22 w NaN False
Use del to delete a column:
del frame['val']
name age sex tel
one Jack 24 m NaN
two Mike 25 m NaN
three Jone 24 m NaN
four Lily 23 w NaN
five Lucy 22 w NaN
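del modifies the DataFrame in place. When the original should stay untouched, the drop method returns a new DataFrame instead. A small sketch with a two-row frame made up for the purpose:

```python
from pandas import DataFrame

frame = DataFrame({'name': ['Jack', 'Mike'], 'val': [True, False]})

without_val = frame.drop(columns=['val'])  # new DataFrame, frame is unchanged
still_there = 'val' in frame.columns

del frame['val']                           # removes the column in place
print(list(without_val.columns), still_there, list(frame.columns))
```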
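Another common form of input is a nested dict. The outer keys become the columns and the inner keys become the row index: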
from pandas import DataFrame,Series
import pandas as pd
population = { 'nanjing':{2001: 2.4, 2002: 2.9},
'yangzhou':{2000: 1.7, 2001:1.8, 2002: 1.9} }
frame2 = DataFrame(population)
print(frame2)
nanjing yangzhou
2000 NaN 1.7
2001 2.4 1.8
2002 2.9 1.9
print(frame2.T)
2000 2001 2002
nanjing NaN 2.4 2.9
yangzhou 1.7 1.8 1.9
You can also specify the index explicitly:
frame2 = DataFrame(population, index=[2001, 2002, 2003])
nanjing yangzhou
2001 2.4 1.8
2002 2.9 1.9
2003 NaN NaN
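Passing index to the constructor here acts much like building the frame first and then calling reindex: rows for labels that are not in the data come out as NaN, and labels that are not requested (2000) are dropped. A sketch comparing the two, with direct and reindexed as illustrative names:

```python
from pandas import DataFrame

population = {'nanjing': {2001: 2.4, 2002: 2.9},
              'yangzhou': {2000: 1.7, 2001: 1.8, 2002: 1.9}}

direct = DataFrame(population, index=[2001, 2002, 2003])
reindexed = DataFrame(population).reindex([2001, 2002, 2003])

print(direct)
print(reindexed)
```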
from pandas import DataFrame,Series
import pandas as pd
population = { 'nanjing':{2001: 2.4, 2002: 2.9},
'yangzhou':{2000: 1.7, 2001:1.8, 2002: 1.9} }
frame2 = DataFrame(population)
print(frame2)
pdata = { 'nanjing': frame2['nanjing'][:2],
'yangzhou':frame2['yangzhou'][:-1] }
print( DataFrame(pdata) )
frame2:
nanjing yangzhou
2000 NaN 1.7
2001 2.4 1.8
2002 2.9 1.9
Build a dict from slices of frame2's columns:
{'nanjing': 2000 NaN
2001 2.4
Name: nanjing, dtype: float64,
'yangzhou': 2000 1.7
2001 1.8
Name: yangzhou, dtype: float64}
Then convert it back into a DataFrame:
nanjing yangzhou
2000 NaN 1.7
2001 2.4 1.8
frame2.index.name = 'year'
frame2.columns.name = 'city'
print(frame2)
city nanjing yangzhou
year
2000 NaN 1.7
2001 2.4 1.8
2002 2.9 1.9
print(frame2.values)
array( [[ nan 1.7]
[ 2.4 1.8]
[ 2.9 1.9]] )
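The values attribute returns the data as a two-dimensional NumPy array. As a sketch: with all-float columns the array is float64, while columns of different types fall back to a common dtype, typically object. The mixed frame below is made up for the comparison.

```python
import numpy as np
from pandas import DataFrame

frame2 = DataFrame({'nanjing': {2001: 2.4, 2002: 2.9},
                    'yangzhou': {2000: 1.7, 2001: 1.8, 2002: 1.9}})

arr = frame2.values                                        # all-float columns
mixed = DataFrame({'name': ['Jack'], 'age': [24]}).values  # str and int columns

print(type(arr), arr.shape, arr.dtype)
print(mixed.dtype)
```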