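A DataFrame is made up of Series, one per column; pandas is roughly the Excel of Python.
1. In pandas, the index of a Series or a DataFrame may contain duplicate labels (the example below shows this for column labels)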
import numpy as np
import pandas as pd
date={'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada','Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
df1=pd.DataFrame(date,columns=['year','year','pop','debt'],index=['one','two','three','four','five','six'])
df1
year year pop debt
one 2000 2000 1.5 NaN
two 2001 2001 1.7 NaN
three 2002 2002 3.6 NaN
four 2001 2001 2.4 NaN
five 2002 2002 2.9 NaN
six 2003 2003 3.2 NaN
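The example above repeats a column label. As a minimal sketch of the same idea on the row index, the following builds a small Series with a repeated label; the names s, dup and single are made up for illustration.

```python
import pandas as pd

# A Series whose index repeats the label 'a'
s = pd.Series([1, 2, 3], index=['a', 'a', 'b'])

# is_unique reports whether every label occurs only once
has_unique_index = s.index.is_unique

# Selecting a duplicated label returns a Series with all matches,
# while selecting a unique label returns a scalar
dup = s['a']
single = s['b']

print(has_unique_index)
print(dup)
print(single)
```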
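2. Selecting values from a Series and a DataFrame
# A Series can be selected by index label or by integer position 0, 1, 2, ... (recent pandas versions prefer .iloc for positional access)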
d={'name':'zy','sex':'female','age':12}
pd2=pd.Series(d)
pd2[2]
'female'
pd2['age']
12
# select several values
pd2[['age','sex']]
age 12
sex female
dtype: object
pd2[[1,0]]
name zy
age 12
dtype: object
#DataFrame
df1
year year pop debt
one 2000 2000 1.5 NaN
two 2001 2001 1.7 NaN
three 2002 2002 3.6 NaN
four 2001 2001 2.4 NaN
five 2002 2002 2.9 NaN
six 2003 2003 3.2 NaN
df1['year'] # a column (both columns labelled 'year' are returned)
year year
one 2000 2000
two 2001 2001
three 2002 2002
four 2001 2001
five 2002 2002
six 2003 2003
df1[['year','pop']] # several columns
year year pop
one 2000 2000 1.5
two 2001 2001 1.7
three 2002 2002 3.6
four 2001 2001 2.4
five 2002 2002 2.9
six 2003 2003 3.2
df1.loc['one'] # a row, by index label
year 2000
year 2000
pop 1.5
debt NaN
Name: one, dtype: object
df1.loc[['one','two']] # several rows
year year pop debt
one 2000 2000 1.5 NaN
two 2001 2001 1.7 NaN
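df1[:4] # first four rows; a single integer such as df1[1] does not select a row
df3=pd.DataFrame(np.arange(16).reshape(4,4),index=['a','b','c','d'],columns=['one','two','three','four'])
df3[df3['three']>2] # a boolean mask also operates on the index, i.e. on rows
# reindex() can also be used to select rows; labels that do not exist produce NaN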
df1.reindex(['one','four','seven'])
year year pop debt
one 1.0 1.0 1.0 1
four 2001.0 2001.0 2.4 NaN
seven NaN NaN NaN NaN
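# selecting columns
df2=df1[['pop','debt']] # df1 has duplicate column labels, so reindex cannot select columns on it; take a subset without duplicates first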
df2.reindex(columns=['debt','pop'])
debt pop
one 1 1.0
two NaN 1.7
three NaN 3.6
four NaN 2.4
five NaN 2.9
six NaN 3.2
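As a self-contained sketch of reindex(), the following uses a small hypothetical frame (the names frame, rows and cols are not from the notes above) to show how a missing row label yields NaN and how columns= reorders columns.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6).reshape(3, 2),
                     index=['a', 'b', 'c'], columns=['x', 'y'])

# 'z' is not an existing row label, so its row is filled with NaN
rows = frame.reindex(['a', 'c', 'z'])

# columns= reorders the columns by label in the same way
cols = frame.reindex(columns=['y', 'x'])

print(rows)
print(cols)
```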
3. Slicing by label differs from ordinary Python slicing: the end point is included
df4=pd.Series(np.arange(4),index=['a','b','c','d'])
df4
a 0
b 1
c 2
d 3
df4.loc['a':'c']
a 0
b 1
c 2
dtype: int64
df4[1:3]
b 1
c 2
dtype: int64
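# the outputs below appear to come from a Series with the default integer index 0..3
df4.iloc[:2] # slice by integer position, end excluded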
0 0
1 1
dtype: int64
df4.loc[:2] # slice by index label, end included
0 0
1 1
2 2
dtype: int64
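To put the two slicing rules side by side on one object, here is a short sketch with a string-labelled Series; the names s, by_label and by_pos are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])

# Label slice: both end points are included, so 'c' is part of the result
by_label = s.loc['a':'c']

# Position slice: follows Python rules, so position 2 is excluded
by_pos = s.iloc[0:2]

print(by_label)
print(by_pos)
```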
4. value_counts counts how often each value occurs in a Series; it can also be applied to any other one-dimensional array or sequence, such as a list or a 1-D ndarray
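A minimal sketch of value_counts follows. It assumes the list or array is first wrapped in a Series, which works across pandas versions; the sample values are made up.

```python
import numpy as np
import pandas as pd

values = ['a', 'b', 'a', 'c', 'a', 'b']

# Wrap the list in a Series, then count occurrences of each value;
# the result is sorted by frequency, most common first
counts = pd.Series(values).value_counts()

# The same works for a one-dimensional ndarray
arr_counts = pd.Series(np.array([1, 1, 2])).value_counts()

print(counts)
print(arr_counts)
```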
5. Filtering with isin()
#Series
pd1=pd.Series(np.random.randint(0,10,(6,)),index=list('abcdef'))
pd1
a 5
b 3
c 9
d 8
e 6
f 8
dtype: int64
pd1.isin([1,2,3,4,5])
a True
b True
c False
d False
e False
f False
dtype: bool
pd1[pd1.isin([1,2,3,4,5])]
a 5
b 3
dtype: int64
#DataFrame
pd2=pd.DataFrame(np.random.randint(0,10,(4,4)),columns=list('abcd'),index=list('abcd'))
pd2
a b c d
a 1 7 6 1
b 3 1 5 7
c 2 8 4 0
d 8 7 7 0
pd2.isin([1,2,3]) # values given as a list
a b c d
a True False False True
b True True False False
c True False False False
d False False False False
# filtering with the mask leaves NaN where the condition is False
pd2[pd2.isin([1,2,3])]
a b c d
a 1.0 NaN NaN 1.0
b 3.0 1.0 NaN NaN
c 2.0 NaN NaN NaN
d NaN NaN NaN NaN
# values given as a dict: each key is matched against the column of the same name
value1={'a':[1,2,3],'b':[1,2,3]}
pd2.isin(value1)
a b c d
a True False False False
b True True False False
c True False False False
d False False False False
# values given as a DataFrame: matched element by element, so both index and columns must line up (two-dimensional)
value2=pd.DataFrame(value1)
pd2.isin(value2)
a b c d
a False False False False
b False False False False
c False False False False
d False False False False
value2
a b
0 1 1
1 2 2
2 3 3
# the index must match as well
value2=pd.DataFrame(value1,index=list('abc'))
pd2.isin(value2)
a b c d
a True False False False
b False False False False
c False False False False
d False False False False
6. df2.drop(axis=) deletes rows or columns of a DataFrame; note that axis=1 deletes a column and axis=0 deletes a row
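A short sketch of drop() on a hypothetical 3x3 frame (the names frame, no_row and no_col are illustrative only):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['a', 'b', 'c'], columns=['x', 'y', 'z'])

# axis=0 drops by row label
no_row = frame.drop('a', axis=0)

# axis=1 drops by column label
no_col = frame.drop('y', axis=1)

# drop() returns a new object; frame itself is unchanged
print(no_row)
print(no_col)
```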
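7. The missing value in pandas is np.nan, a floating-point value, so a Series that contains it gets dtype float64, e.g. pd1=pd.Series([1,np.nan,3])
dropna() by default drops every row that contains a missing value; pass axis=1 to drop columns instead. A single missing value is enough to remove the whole row or column, while dropna(how='all') removes only rows or columns in which every value is missing. fillna() replaces NaN with another value.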
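The behaviours described above can be sketched on a small made-up frame; the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                      'b': [4.0, 5.0, np.nan]})

# Default (how='any'): a row is dropped if it has at least one NaN
any_rows = frame.dropna()

# how='all': a row is dropped only if every value in it is NaN
all_rows = frame.dropna(how='all')

# axis=1 applies the same rule to columns; both columns contain NaN here
any_cols = frame.dropna(axis=1)

# fillna replaces every NaN with the given value
filled = frame.fillna(0)

print(any_rows)
print(all_rows)
print(any_cols.shape)
print(filled)
```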
8. df.loc can select both rows and columns: the part before the comma selects rows, the part after the comma selects columns
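A minimal sketch of row and column selection with loc, using a hypothetical 3x3 frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['a', 'b', 'c'], columns=['x', 'y', 'z'])

# Rows only: one label returns that row as a Series
row = frame.loc['a']

# Columns only: ':' keeps every row, the label after the comma picks the column
col = frame.loc[:, 'y']

# Rows and columns together
block = frame.loc[['a', 'c'], ['x', 'z']]

# A single cell
cell = frame.loc['b', 'z']

print(row)
print(col)
print(block)
print(cell)
```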