创建一个DataFrame
(1)用字典dict,字典值value是列表list
(2)用Series构建DataFrame
(3)用一个字典构成的列表list of dicts来构建DataFrame
广播特性
定位DataFrame里的元素
(1)利用表达式boolean定位
(2)利用loc,iloc,ix函数定位
可以定位数字,就可以赋值
info()和describe()
缺省值填充fillna,ffill,bfill
层次化的index
unstack:Series转化成DataFrame
csv文件读写read_ csv/to_csv
一个DataFrame就是一张表格,Series可以理解成一维数据,DataFrame就是一个二维数据,DataFrame可以由多个Series组成(DataFrame可以理解成Series的一个集合)
创建一个DataFrame
(1)用字典dict,字典值value是列表list
population={'city':['Beijing','Shanghai','Guangzhou','Shenzhen','Hangzhou','Chongqing'],
'year':[2016,2017,2016,2017,2016,2016],
'population':[2100,2300,1000,700,500,500]
}
population=pd.DataFrame(population) ###
print(population)
- 1
- 2
- 3
- 4
- 5
- 6
city population year
0 Beijing 2100 2016
1 Shanghai 2300 2017
2 Guangzhou 1000 2016
3 Shenzhen 700 2017
4 Hangzhou 500 2016
5 Chongqing 500 2016
- 1
- 2
- 3
- 4
- 5
- 6
- 7
pdc=pd.DataFrame(population,columns=['year','city','population']) #columns参数改变列名
print(pdc)
- 1
- 2
year city population
0 2016 Beijing 2100
1 2017 Shanghai 2300
2 2016 Guangzhou 1000
3 2017 Shenzhen 700
4 2016 Hangzhou 500
5 2016 Chongqing 500
- 1
- 2
- 3
- 4
- 5
- 6
- 7
tmp={'city':['Beijing','Shanghai','Guangzhou','Shenzhen','Hangzhou','Chongqing'],
'year':[2016,2017,2016,2017,2016,2016],
'population':[2100,2300,1000,700,500,500]
}
pdci=pd.DataFrame(tmp,columns=['year','city','population'],
index=['one','two','three','four','five','six']) #改变行index索引和列名columns
print(pdci)
- 1
- 2
- 3
- 4
- 5
- 6
- 7
year city population
one 2016 Beijing 2100
two 2017 Shanghai 2300
three 2016 Guangzhou 1000
four 2017 Shenzhen 700
five 2016 Hangzhou 500
six 2016 Chongqing 500
- 1
- 2
- 3
- 4
- 5
- 6
- 7
(2)用Series构建DataFrame
cities={'Beijing':55000,'Shanghai':60000,'shenzhen':50000,'Hangzhou':20000,'Guangzhou':45000,'Suzhou':None}
apts=pd.Series(cities,name='income')
apts['shenzhen']=70000
less_than_50000=(apts<50000)
apts[less_than_50000]=40000
apts2=pd.Series({'Beijing':10000,'Shanghai':8000,'shenzhen':6000,'Tianjin':40000,'Guangzhou':7000,'Chongqing':30000})
# print(apts2)
apts=apts+apts2
apts[apts.isnull()]=apts.mean()
# print(apts)
df=pd.DataFrame({'apts':apts,'apts2':apts2}) ###
print(df)
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
apts apts2
Beijing 65000.0 10000.0
Chongqing 64000.0 30000.0
Guangzhou 47000.0 7000.0
Hangzhou 64000.0 NaN
Shanghai 68000.0 8000.0
Suzhou 64000.0 NaN
Tianjin 64000.0 40000.0
shenzhen 76000.0 6000.0
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
(3)用一个字典构成的列表list of dicts来构建DataFrame
data=[{'JackMa':99999999999,'Han':5000,'David':10000},
{'JackMa':99999999998,'Han':4000,'David':11000}]
pdl=pd.DataFrame(data,index=['salary1','salary2'])
print(pdl)
- 1
- 2
- 3
- 4
David Han JackMa
salary1 10000 5000 99999999999
salary2 11000 4000 99999999998
- 1
- 2
- 3
广播特性
cities={'Beijing':55000,'Shanghai':60000,'shenzhen':50000,'Hangzhou':20000,'Guangzhou':45000,'Suzhou':None}
apts=pd.Series(cities,name='income')
apts['shenzhen']=70000
less_than_50000=(apts<50000)
apts[less_than_50000]=40000
apts2=pd.Series({'Beijing':10000,'Shanghai':8000,'shenzhen':6000,'Tianjin':40000,'Guangzhou':7000,'Chongqing':30000})
apts=apts+apts2
apts[apts.isnull()]=apts.mean()
df=pd.DataFrame({'apts':apts,'apts2':apts2})
#print(df)
df['bonus']=2000 #增加一个新列bonus,并且都赋值2000
print(df)
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
apts apts2 bonus
Beijing 65000.0 10000.0 2000
Chongqing 64000.0 30000.0 2000
Guangzhou 47000.0 7000.0 2000
Hangzhou 64000.0 NaN 2000
Shanghai 68000.0 8000.0 2000
Suzhou 64000.0 NaN 2000
Tianjin 64000.0 40000.0 2000
shenzhen 76000.0 6000.0 2000
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
df['income']=df['apts']*2+df['apts2']*1.5+df['bonus']
print(df)
- 1
- 2
apts apts2 bonus income
Beijing 65000.0 10000.0 2000 147000.0
Chongqing 64000.0 30000.0 2000 175000.0
Guangzhou 47000.0 7000.0 2000 106500.0
Hangzhou 64000.0 NaN 2000 NaN
Shanghai 68000.0 8000.0 2000 150000.0
Suzhou 64000.0 NaN 2000 NaN
Tianjin 64000.0 40000.0 2000 190000.0
shenzhen 76000.0 6000.0 2000 163000.0
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
print(df.index)
- 1
Index(['Beijing', 'Chongqing', 'Guangzhou', 'Hangzhou', 'Shanghai', 'Suzhou',
'Tianjin', 'shenzhen'],
dtype='object')
- 1
- 2
- 3
定位DataFrame里的元素
(1)利用表达式boolean定位
import pandas as pd
cities={'Beijing':55000,'Shanghai':60000,'shenzhen':50000,'Hangzhou':20000,'Guangzhou':45000,'Suzhou':None}
apts=pd.Series(cities,name='income')
apts['shenzhen']=70000
less_than_50000=(apts<50000)
apts[less_than_50000]=40000
apts2=pd.Series({'Beijing':10000,'Shanghai':8000,'shenzhen':6000,'Tianjin':40000,'Guangzhou':7000,'Chongqing':30000})
apts=apts+apts2
apts[apts.isnull()]=apts.mean()
df=pd.DataFrame({'apts':apts,'apts2':apts2})
df['bonus']=2000 #增加一个新列bonus,并且都赋值2000
df['income']=df['apts']*2+df['apts2']*1.5+df['bonus']
#print(df)
# apts apts2 bonus income
#Beijing 65000.0 10000.0 2000 147000.0
#Chongqing 64000.0 30000.0 2000 175000.0
#Guangzhou 47000.0 7000.0 2000 106500.0
#Hangzhou 64000.0 NaN 2000 NaN
#Shanghai 68000.0 8000.0 2000 150000.0
#Suzhou 64000.0 NaN 2000 NaN
#Tianjin 64000.0 40000.0 2000 190000.0
#shenzhen 76000.0 6000.0 2000 163000.0
print(df.apts==64000)
print(df['apts']==64000) #boolean条件
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
- 16
- 17
- 18
- 19
- 20
- 21
- 22
- 23
- 24
Beijing False
Chongqing True
Guangzhou False
Hangzhou True
Shanghai False
Suzhou True
Tianjin True
shenzhen False
Name: apts, dtype: bool
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
print(df[df['apts']==64000]) #对行做选择,就是把apts列等于64000的行取出来
- 1
apts apts2 bonus income
Chongqing 64000.0 30000.0 2000 175000.0
Hangzhou 64000.0 NaN 2000 NaN
Suzhou 64000.0 NaN 2000 NaN
Tianjin 64000.0 40000.0 2000 190000.0
- 1
- 2
- 3
- 4
- 5
df[df.apts==64000]['income']=200000 #报错,在复制片段上赋值,原来的df没被改变
- 1
(2)利用loc,iloc,ix函数定位
loc:通过“行标签”索引行数据
print(df.loc['Hangzhou']) #定位选某一行
- 1
apts 64000.0
apts2 NaN
bonus 2000.0
income NaN
Name: Hangzhou, dtype: float64
- 1
- 2
- 3
- 4
- 5
print(df.loc[['Hangzhou','Shanghai']])
- 1
apts apts2 bonus income
Hangzhou 64000.0 NaN 2000 NaN
Shanghai 68000.0 8000.0 2000 150000.0
- 1
- 2
- 3
print(df.loc[df['apts']==64000,['apts2','apts','bonus']])
#前面的部分是对行做选择,后面的部分是对列做选择
- 1
- 2
apts2 apts bonus
Chongqing 30000.0 64000.0 2000
Hangzhou NaN 64000.0 2000
Suzhou NaN 64000.0 2000
Tianjin 40000.0 64000.0 2000
- 1
- 2
- 3
- 4
- 5
iloc:通过“行号”索引行数据
print(df.iloc[0:5])
- 1
apts apts2 bonus income
Beijing 65000.0 10000.0 2000 147000.0
Chongqing 64000.0 30000.0 2000 175000.0
Guangzhou 47000.0 7000.0 2000 106500.0
Hangzhou 64000.0 NaN 2000 NaN
Shanghai 68000.0 8000.0 2000 150000.0
- 1
- 2
- 3
- 4
- 5
- 6
ix:通过行标签或者行号索引行数据(基于loc和iloc 的混合)
print(df.ix[1:4,1:3]) #用行号和列号做数据选择
- 1
apts2 bonus
Chongqing 30000.0 2000
Guangzhou 7000.0 2000
Hangzhou NaN 2000
- 1
- 2
- 3
- 4
可以定位数字,就可以赋值
df.loc[:,'income']=5000
print(df)
- 1
- 2
apts apts2 bonus income
Beijing 65000.0 10000.0 2000 5000
Chongqing 64000.0 30000.0 2000 5000
Guangzhou 47000.0 7000.0 2000 5000
Hangzhou 64000.0 NaN 2000 5000
Shanghai 68000.0 8000.0 2000 5000
Suzhou 64000.0 NaN 2000 5000
Tianjin 64000.0 40000.0 2000 5000
shenzhen 76000.0 6000.0 2000 5000
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
info()和describe()
info
print(df.info())
- 1
<class 'pandas.core.frame.DataFrame'>
Index: 8 entries, Beijing to shenzhen
Data columns (total 4 columns):
apts 8 non-null float64
apts2 6 non-null float64
bonus 8 non-null int64
income 8 non-null int64
dtypes: float64(2), int64(2)
memory usage: 320.0+ bytes
None
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
describe
print(df.describe())
- 1
apts apts2 bonus income
count 8.000000 6.000000 8.0 8.0
mean 64000.000000 16833.333333 2000.0 5000.0
std 8017.837257 14483.323744 0.0 0.0
min 47000.000000 6000.000000 2000.0 5000.0
25% 64000.000000 7250.000000 2000.0 5000.0
50% 64000.000000 9000.000000 2000.0 5000.0
75% 65750.000000 25000.000000 2000.0 5000.0
max 76000.000000 40000.000000 2000.0 5000.0
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
head
print(df.head(2))
- 1
apts apts2 bonus income
Beijing 65000.0 10000.0 2000 5000
Chongqing 64000.0 30000.0 2000 5000
- 1
- 2
- 3
tail
print(df.tail(2))
- 1
apts apts2 bonus income
Tianjin 64000.0 40000.0 2000 5000
shenzhen 76000.0 6000.0 2000 5000
- 1
- 2
- 3
条件判断与条件组合
#df2.loc[((df2['dow']==0)|(df2['dow']==2)|(df2['dow']==4)),:]
#df2.loc[ df2['dow'].isin([0,2,4]) , : ] #可以是一个列表,numpy array,Series
#~(df2['dow'].isin([0,2,4]))
- 1
- 2
- 3
缺省值填充fillna,ffill,bfill
fillna
import pandas as pd
cities={'Beijing':55000,'Shanghai':60000,'shenzhen':50000,'Hangzhou':20000,'Guangzhou':45000,'Suzhou':None}
apts=pd.Series(cities,name='income')
apts['shenzhen']=70000
less_than_50000=(apts<50000)
apts[less_than_50000]=40000
apts2=pd.Series({'Beijing':10000,'Shanghai':8000,'shenzhen':6000,'Tianjin':40000,'Guangzhou':7000,'Chongqing':30000})
apts=apts+apts2
apts[apts.isnull()]=apts.mean()
df=pd.DataFrame({'apts':apts,'apts2':apts2})
df['bonus']=2000
df['income']=df['apts']*2+df['apts2']*1.5+df['bonus']
#print(df)
# apts apts2 bonus income
#Beijing 65000.0 10000.0 2000 147000.0
#Chongqing 64000.0 30000.0 2000 175000.0
#Guangzhou 47000.0 7000.0 2000 106500.0
#Hangzhou 64000.0 NaN 2000 NaN
#Shanghai 68000.0 8000.0 2000 150000.0
#Suzhou 64000.0 NaN 2000 NaN
#Tianjin 64000.0 40000.0 2000 190000.0
#shenzhen 76000.0 6000.0 2000 163000.0
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
- 16
- 17
- 18
- 19
- 20
- 21
- 22
dff=df.fillna(value=0) #df没变
print(dff)
- 1
- 2
apts apts2 bonus income
Beijing 65000.0 10000.0 2000 147000.0
Chongqing 64000.0 30000.0 2000 175000.0
Guangzhou 47000.0 7000.0 2000 106500.0
Hangzhou 64000.0 0.0 2000 0.0
Shanghai 68000.0 8000.0 2000 150000.0
Suzhou 64000.0 0.0 2000 0.0
Tianjin 64000.0 40000.0 2000 190000.0
shenzhen 76000.0 6000.0 2000 163000.0
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
inplace
dff=df.fillna(value=0, inplace=True)
print(df);print(dff) #inplace参数True,df改变,没有新的dff拷贝
- 1
- 2
apts apts2 bonus income
Beijing 65000.0 10000.0 2000 147000.0
Chongqing 64000.0 30000.0 2000 175000.0
Guangzhou 47000.0 7000.0 2000 106500.0
Hangzhou 64000.0 0.0 2000 0.0
Shanghai 68000.0 8000.0 2000 150000.0
Suzhou 64000.0 0.0 2000 0.0
Tianjin 64000.0 40000.0 2000 190000.0
shenzhen 76000.0 6000.0 2000 163000.0
None
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
ffill
dffr=df.fillna(method='ffill') #新生成的补NaN前向拷贝,df没变
print(dffr)
- 1
- 2
apts apts2 bonus income
Beijing 65000.0 10000.0 2000 147000.0
Chongqing 64000.0 30000.0 2000 175000.0
Guangzhou 47000.0 7000.0 2000 106500.0
Hangzhou 64000.0 7000.0 2000 106500.0
Shanghai 68000.0 8000.0 2000 150000.0
Suzhou 64000.0 8000.0 2000 150000.0
Tianjin 64000.0 40000.0 2000 190000.0
shenzhen 76000.0 6000.0 2000 163000.0
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
bfill
dfba=df.fillna(method='bfill') #新生成的补NaN后向拷贝,df没变
print(dfba)
- 1
- 2
apts apts2 bonus income
Beijing 65000.0 10000.0 2000 147000.0
Chongqing 64000.0 30000.0 2000 175000.0
Guangzhou 47000.0 7000.0 2000 106500.0
Hangzhou 64000.0 8000.0 2000 150000.0
Shanghai 68000.0 8000.0 2000 150000.0
Suzhou 64000.0 40000.0 2000 190000.0
Tianjin 64000.0 40000.0 2000 190000.0
shenzhen 76000.0 6000.0 2000 163000.0
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
层次化的index
import pandas as pd
import numpy as np
data=pd.Series(np.random.randn(10),index=[['a','a','a','b','b','c','c','d','d','d'],
[1,2,3,1,2,1,2,1,2,3]])
print(data)
print(type(data))
- 1
- 2
- 3
- 4
- 5
- 6
a 1 0.346467
2 -0.043077
3 0.043878
b 1 0.107763
2 -0.175726
c 1 -1.833683
2 0.033884
d 1 -1.807021
2 0.819740
3 0.294679
dtype: float64
<class 'pandas.core.series.Series'>
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
print(data.index)
- 1
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 0, 1, 0, 1, 2]])
- 1
- 2
print(data['b':'c'])
- 1
b 1 0.353241
2 0.379744
c 1 -0.860706
2 -0.795483
dtype: float64
- 1
- 2
- 3
- 4
- 5
print(data[:2])
- 1
a 1 0.763116
2 0.058009
dtype: float64
- 1
- 2
- 3
unstack:Series转化成DataFrame
unstack=data.unstack() #将层级数据横向拉开,不够长的补NaN
print(unstack)
print(type(unstack))
- 1
- 2
- 3
1 2 3
a -0.637935 -0.104897 -1.536381
b 2.448302 1.679833 NaN
c -0.845155 0.829459 NaN
d 0.597535 -0.464255 -0.898994
<class 'pandas.core.frame.DataFrame'> #对比data的类型
- 1
- 2
- 3
- 4
- 5
- 6
csv文件读写read_csv/to_csv
import pandas as pd
cities={'Beijing':55000,'Shanghai':60000,'shenzhen':50000,'Hangzhou':20000,'Guangzhou':45000,'Suzhou':None}
apts=pd.Series(cities,name='income')
apts['shenzhen']=70000
less_than_50000=(apts<50000)
apts[less_than_50000]=40000
apts2=pd.Series({'Beijing':10000,'Shanghai':8000,'shenzhen':6000,'Tianjin':40000,'Guangzhou':7000,'Chongqing':30000})
apts=apts+apts2
apts[apts.isnull()]=apts.mean()
df=pd.DataFrame({'apts':apts,'apts2':apts2})
df['bonus']=2000
df['income']=df['apts']*2+df['apts2']*1.5+df['bonus']
#print(df)
df.to_csv('df.csv')
df.to_csv('df2.csv',index=False) #去掉第一列,行索引列
import os
df2_site = r"D:\PYTHON35\idle\df2.csv"
pwd = os.getcwd() #获取当前工作目录
os.chdir(os.path.dirname(df2_site))
tmp_df = pd.read_csv(os.path.basename(df2_site)) ###
print(tmp_df)
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
- 16
- 17
- 18
- 19
- 20
- 21
- 22
apts apts2 bonus income
0 65000.0 10000.0 2000 147000.0
1 64000.0 30000.0 2000 175000.0
2 47000.0 7000.0 2000 106500.0
3 64000.0 NaN 2000 NaN
4 68000.0 8000.0 2000 150000.0
5 64000.0 NaN 2000 NaN
6 64000.0 40000.0 2000 190000.0
7 76000.0 6000.0 2000 163000.0
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
tmp_df_index = pd.Index(['Beijing','Shanghai',"Suzhou",'Hangzhou','Tianjin','Chongqing','Nanjing','Shenzhen'])
tmp_df.index=tmp_df_index #修改索引
print(tmp_df)
- 1
- 2
- 3
apts apts2 bonus income
Beijing 65000.0 10000.0 2000 147000.0
Shanghai 64000.0 30000.0 2000 175000.0
Suzhou 47000.0 7000.0 2000 106500.0
Hangzhou 64000.0 NaN 2000 NaN
Tianjin 68000.0 8000.0 2000 150000.0
Chongqing 64000.0 NaN 2000 NaN
Nanjing 64000.0 40000.0 2000 190000.0
Shenzhen 76000.0 6000.0 2000 163000.0
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
df.to_csv('df3.csv',sep='\t')