Three ways to create a DataFrame
1. From a dict whose values are lists
import pandas as pd

population = {"city": ["beijing", "shanghai", "guangzhou", "shenzhen", "hangzhou", "chongqing"],
              "year": [2016, 2017, 2016, 2017, 2017, 2016],
              "population": [2100, 2300, 1000, 700, 500, 500]}  # every list must have the same length, otherwise pandas raises an error
population = pd.DataFrame(population)
print(population)
        city  population  year
0    beijing        2100  2016
1   shanghai        2300  2017
2  guangzhou        1000  2016
3   shenzhen         700  2017
4   hangzhou         500  2017
5  chongqing         500  2016
pdc = pd.DataFrame(population, columns=["year", "city", "population"])  # set the column order with the columns argument
print(pdc)
   year       city  population
0  2016    beijing        2100
1  2017   shanghai        2300
2  2016  guangzhou        1000
3  2017   shenzhen         700
4  2017   hangzhou         500
5  2016  chongqing         500
temp = {"city": ["beijing", "shanghai", "guangzhou", "shenzhen", "hangzhou", "chongqing"],
        "year": [2016, 2017, 2016, 2017, 2017, 2016],
        "population": [2100, 2300, 1000, 700, 500, 500]}
# set both the column order and custom index labels
pdci = pd.DataFrame(temp, columns=["year", "city", "population"],
                    index=["one", "two", "three", "four", "five", "six"])
print(pdci)
       year       city  population
one    2016    beijing        2100
two    2017   shanghai        2300
three  2016  guangzhou        1000
four   2017   shenzhen         700
five   2017   hangzhou         500
six    2016  chongqing         500
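The comment in the first example, that every list in the dict must have the same length, can be checked directly. This is a minimal sketch; `bad` is a hypothetical dict made up for illustration, not part of the data above.

```python
import pandas as pd

# Hypothetical data: "year" has one entry fewer than "city".
bad = {"city": ["beijing", "shanghai", "guangzhou"],
       "year": [2016, 2017]}

try:
    pd.DataFrame(bad)
    raised = False
except ValueError as err:
    # pandas refuses to build a frame from columns of unequal length
    raised = True
    print("ValueError:", err)
```

If the lengths really do differ in your data, one option is to wrap each list in a Series first, since Series are aligned by index and the missing positions become NaN.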
2. Build a DataFrame from Series
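import pandas as pd

cities = {'Beijing': 55000, 'Shanghai': 60000, 'shenzhen': 50000, 'Hangzhou': 20000, 'Guangzhou': 45000, 'Suzhou': None}
apts = pd.Series(cities, name='income')
apts['shenzhen'] = 70000
less_than_50000 = (apts < 50000)  # boolean mask
apts[less_than_50000] = 40000
apts2 = pd.Series({'Beijing': 10000, 'Shanghai': 8000, 'shenzhen': 6000, 'Tianjin': 40000, 'Guangzhou': 7000, 'Chongqing': 30000})
# print(apts2)
apts = apts + apts2  # aligned by index label; labels missing from either Series give NaN
apts[apts.isnull()] = apts.mean()  # fill missing values with the mean (64000 here), not the median
# print(apts)
# combine the two Series into one DataFrame: labels present in a Series show its value, labels absent from it show NaN
df = pd.DataFrame({'apts': apts, 'apts2': apts2})
print(df)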
              apts    apts2
Beijing    65000.0  10000.0
Chongqing  64000.0  30000.0
Guangzhou  47000.0   7000.0
Hangzhou   64000.0      NaN
Shanghai   68000.0   8000.0
Suzhou     64000.0      NaN
Tianjin    64000.0  40000.0
shenzhen   76000.0   6000.0
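The key behaviour in this section is that adding two Series aligns them by index label, so labels found in only one of them come out as NaN. The sketch below isolates that step with small made-up Series (`a` and `b` are hypothetical, not the data above) and compares filling with the mean against filling with the median, using `fillna`.

```python
import pandas as pd

# Hypothetical Series: "Suzhou" exists only in a, "Tianjin" only in b.
a = pd.Series({"Beijing": 10, "Shanghai": 20, "Shenzhen": 60, "Suzhou": 5})
b = pd.Series({"Beijing": 1, "Shanghai": 2, "Shenzhen": 3, "Tianjin": 4})

total = a + b  # aligned by label; Suzhou and Tianjin become NaN

# mean() and median() skip NaN by default
filled_mean = total.fillna(total.mean())
filled_median = total.fillna(total.median())

print(total)
print(filled_mean)
print(filled_median)
```

The two fills differ whenever the values are skewed, so it is worth being explicit about which one the code uses.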
3. Build a DataFrame from a list of dicts
data = [{'lucy': 9999, 'linus': 8888, 'curry': 100000}, {'lucy': 9998, 'linus': 8887, 'curry': 1000000}]
pd2 = pd.DataFrame(data, index=['salary1', 'salary2'])  # open question: why does lucy end up last?
print(pd2)
           curry  linus  lucy
salary1   100000   8888  9999
salary2  1000000   8887  9998
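On the question of why lucy ends up last: as far as I know, older pandas releases sorted dict keys alphabetically when building columns, while newer releases on Python 3.6+ keep the insertion order. So the output above most likely reflects the version it was run on. A sketch that does not depend on the version is to pass `columns` explicitly:

```python
import pandas as pd

data = [{"lucy": 9999, "linus": 8888, "curry": 100000},
        {"lucy": 9998, "linus": 8887, "curry": 1000000}]

# Passing columns= fixes the column order regardless of how
# the installed pandas version orders dict keys by default.
ordered = pd.DataFrame(data, index=["salary1", "salary2"],
                       columns=["lucy", "linus", "curry"])
print(ordered)
```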
Broadcasting
from pandas import pan