Python 3 pandas DataFrame usage (basic notes)

Creating a DataFrame:

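A DataFrame is a table. A Series is one-dimensional data and a DataFrame is two-dimensional data; a DataFrame can be built from several Series (think of it as a collection of Series, one per column, that share the same row index).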
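
To make the "collection of Series" idea concrete, here is a minimal sketch. The names `height`, `weight`, and `people` and the sample values are made up for illustration; they do not come from the examples in this article.

```python
import pandas as pd

# Two Series that share the same index labels (hypothetical sample data).
height = pd.Series([1.70, 1.82], index=['ann', 'bob'], name='height')
weight = pd.Series([60, 75], index=['ann', 'bob'], name='weight')

# Put them side by side: each Series becomes one column of the DataFrame.
people = pd.concat([height, weight], axis=1)

print(people)
print(type(people['height']))   # selecting one column gives back a Series
print(people.shape)             # (number of rows, number of columns)
```

Selecting a single column with `people['height']` returns a Series again, which is why column arithmetic later in this article behaves like Series arithmetic.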

1) From a dict whose values are lists

import pandas as pd
population={'city':['Beijing','Shanghai','Guangzhou','Shenzhen','Hangzhou','Chongqing'],
            'year':[2016,2017,2016,2017,2016,2016],
            'population':[2100,2300,1000,700,500,500]
            }
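population=pd.DataFrame(population)   # build the DataFrame from the dict
print(population)   # pandas before 0.23 sorted the dict keys alphabetically, as in the output below; newer versions keep insertion order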
        city  population  year
0    Beijing        2100  2016
1   Shanghai        2300  2017
2  Guangzhou        1000  2016
3   Shenzhen         700  2017
4   Hangzhou         500  2016
5  Chongqing         500  2016
pdc=pd.DataFrame(population,columns=['year','city','population'])   # columns selects and orders the columns (it does not rename them)
print(pdc)
   year       city  population
0  2016    Beijing        2100
1  2017   Shanghai        2300
2  2016  Guangzhou        1000
3  2017   Shenzhen         700
4  2016   Hangzhou         500
5  2016  Chongqing         500
tmp={'city':['Beijing','Shanghai','Guangzhou','Shenzhen','Hangzhou','Chongqing'],
     'year':[2016,2017,2016,2017,2016,2016],
     'population':[2100,2300,1000,700,500,500]
     }
pdci=pd.DataFrame(tmp,columns=['year','city','population'],
                  index=['one','two','three','four','five','six'])  # set the row index labels and the column order
print(pdci)
      year       city  population
one    2016    Beijing        2100
two    2017   Shanghai        2300
three  2016  Guangzhou        1000
four   2017   Shenzhen         700
five   2016   Hangzhou         500
six    2016  Chongqing         500

2) From Series

cities={'Beijing':55000,'Shanghai':60000,'shenzhen':50000,'Hangzhou':20000,'Guangzhou':45000,'Suzhou':None}
apts=pd.Series(cities,name='income')
apts['shenzhen']=70000
less_than_50000=(apts<50000)
apts[less_than_50000]=40000
apts2=pd.Series({'Beijing':10000,'Shanghai':8000,'shenzhen':6000,'Tianjin':40000,'Guangzhou':7000,'Chongqing':30000})
# print(apts2)
apts=apts+apts2
apts[apts.isnull()]=apts.mean()
# print(apts)

df=pd.DataFrame({'apts':apts,'apts2':apts2})   ###
print(df)
              apts    apts2
Beijing    65000.0  10000.0
Chongqing  64000.0  30000.0
Guangzhou  47000.0   7000.0
Hangzhou   64000.0      NaN
Shanghai   68000.0   8000.0
Suzhou     64000.0      NaN
Tianjin    64000.0  40000.0
shenzhen   76000.0   6000.0

3) From a list of dicts

data=[{'JackMa':99999999999,'Han':5000,'David':10000},
   {'JackMa':99999999998,'Han':4000,'David':11000}]
pdl=pd.DataFrame(data,index=['salary1','salary2'])
print(pdl)
         David   Han       JackMa
salary1  10000  5000  99999999999
salary2  11000  4000  99999999998
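
In the example above every dict has the same keys. As a small sketch of what happens when they differ (the names `rows` and `frame` and the keys are hypothetical), each dict becomes one row, the columns are the union of all keys, and missing entries become NaN:

```python
import pandas as pd

# Each dict is one row; the dicts do not need identical keys.
rows = [{'a': 1, 'b': 2},
        {'a': 3, 'c': 4}]
frame = pd.DataFrame(rows)

print(frame)
# Column 'b' is missing in the second row, column 'c' in the first.
print(frame.isna())
```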

Broadcasting

cities={'Beijing':55000,'Shanghai':60000,'shenzhen':50000,'Hangzhou':20000,'Guangzhou':45000,'Suzhou':None}
apts=pd.Series(cities,name='income')
apts['shenzhen']=70000
less_than_50000=(apts<50000)
apts[less_than_50000]=40000
apts2=pd.Series({'Beijing':10000,'Shanghai':8000,'shenzhen':6000,'Tianjin':40000,'Guangzhou':7000,'Chongqing':30000})
apts=apts+apts2
apts[apts.isnull()]=apts.mean()
df=pd.DataFrame({'apts':apts,'apts2':apts2})
#print(df)
df['bonus']=2000  # add a new column bonus; the scalar 2000 is broadcast to every row
print(df)
              apts    apts2  bonus
Beijing    65000.0  10000.0   2000
Chongqing  64000.0  30000.0   2000
Guangzhou  47000.0   7000.0   2000
Hangzhou   64000.0      NaN   2000
Shanghai   68000.0   8000.0   2000
Suzhou     64000.0      NaN   2000
Tianjin    64000.0  40000.0   2000
shenzhen   76000.0   6000.0   2000
df['income']=df['apts']*2+df['apts2']*1.5+df['bonus']
print(df)
              apts    apts2  bonus    income
Beijing    65000.0  10000.0   2000  147000.0
Chongqing  64000.0  30000.0   2000  175000.0
Guangzhou  47000.0   7000.0   2000  106500.0
Hangzhou   64000.0      NaN   2000       NaN
Shanghai   68000.0   8000.0   2000  150000.0
Suzhou     64000.0      NaN   2000       NaN
Tianjin    64000.0  40000.0   2000  190000.0
shenzhen   76000.0   6000.0   2000  163000.0
print(df.index)
Index(['Beijing', 'Chongqing', 'Guangzhou', 'Hangzhou', 'Shanghai', 'Suzhou',
       'Tianjin', 'shenzhen'],
      dtype='object')
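
The NaN values in the income column come from label alignment: column arithmetic matches rows by index label, and a missing operand yields NaN. A minimal sketch with made-up data (`pay` and `extra` are hypothetical names):

```python
import pandas as pd

pay = pd.DataFrame({'base': [100.0, 200.0]}, index=['x', 'y'])

# A scalar is broadcast to every row.
pay['bonus'] = 10

# A Series is aligned by label, not by position; 'x' has no entry in extra.
extra = pd.Series({'y': 5.0, 'z': 7.0})
pay['total'] = pay['base'] + pay['bonus'] + extra

print(pay)
```

Row `x` ends up with NaN in `total` because `extra` has no value for that label, while label `z` is dropped on assignment because `pay` has no such row.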

Locating elements in a DataFrame

1) Locating with boolean expressions

import pandas as pd
cities={'Beijing':55000,'Shanghai':60000,'shenzhen':50000,'Hangzhou':20000,'Guangzhou':45000,'Suzhou':None}
apts=pd.Series(cities,name='income')
apts['shenzhen']=70000
less_than_50000=(apts<50000)
apts[less_than_50000]=40000
apts2=pd.Series({'Beijing':10000,'Shanghai':8000,'shenzhen':6000,'Tianjin':40000,'Guangzhou':7000,'Chongqing':30000})
apts=apts+apts2
apts[apts.isnull()]=apts.mean()
df=pd.DataFrame({'apts':apts,'apts2':apts2})
df['bonus']=2000  # add a new column bonus; the scalar 2000 is broadcast to every row
df['income']=df['apts']*2+df['apts2']*1.5+df['bonus']
#print(df)
#              apts    apts2  bonus    income
#Beijing    65000.0  10000.0   2000  147000.0
#Chongqing  64000.0  30000.0   2000  175000.0
#Guangzhou  47000.0   7000.0   2000  106500.0
#Hangzhou   64000.0      NaN   2000       NaN
#Shanghai   68000.0   8000.0   2000  150000.0
#Suzhou     64000.0      NaN   2000       NaN
#Tianjin    64000.0  40000.0   2000  190000.0
#shenzhen   76000.0   6000.0   2000  163000.0
print(df.apts==64000)      # attribute-style column access
print(df['apts']==64000)   # boolean condition; same result as the line above, shown once below
Beijing      False
Chongqing     True
Guangzhou    False
Hangzhou      True
Shanghai     False
Suzhou        True
Tianjin       True
shenzhen     False
Name: apts, dtype: bool
print(df[df['apts']==64000]) # row selection: keep the rows whose apts equals 64000
              apts    apts2  bonus    income
Chongqing  64000.0  30000.0   2000  175000.0
Hangzhou   64000.0      NaN   2000       NaN
Suzhou     64000.0      NaN   2000       NaN
Tianjin    64000.0  40000.0   2000  190000.0
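df[df.apts==64000]['income']=200000 # chained assignment: pandas typically emits SettingWithCopyWarning and writes to a temporary copy, so the original df is not changed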
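
A common way to avoid the chained-assignment problem is to select rows and column in a single `.loc` call. The sketch below uses a small stand-in frame (`scores` and its values are hypothetical) rather than the df built above:

```python
import pandas as pd

scores = pd.DataFrame({'apts': [64000.0, 65000.0, 64000.0],
                       'income': [1.0, 2.0, 3.0]},
                      index=['r1', 'r2', 'r3'])

# One .loc call: the boolean mask picks the rows, 'income' picks the column.
mask = scores['apts'] == 64000
scores.loc[mask, 'income'] = 200000.0

print(scores)
```

Because the selection and the assignment happen in one indexing operation, the write goes to `scores` itself rather than to a temporary copy.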

2) Locating with loc, iloc, and ix

loc: index rows by row label
print(df.loc['Hangzhou'])  # select a single row
apts      64000.0
apts2         NaN
bonus      2000.0
income        NaN
Name: Hangzhou, dtype: float64
print(df.loc[['Hangzhou','Shanghai']])
             apts   apts2  bonus    income
Hangzhou  64000.0     NaN   2000       NaN
Shanghai  68000.0  8000.0   2000  150000.0
print(df.loc[df['apts']==64000,['apts2','apts','bonus']])
# the first part selects rows, the second part selects columns
             apts2     apts  bonus
Chongqing  30000.0  64000.0   2000
Hangzhou       NaN  64000.0   2000
Suzhou         NaN  64000.0   2000
Tianjin    40000.0  64000.0   2000
iloc: index rows by integer position
print(df.iloc[0:5])
              apts    apts2  bonus    income
Beijing    65000.0  10000.0   2000  147000.0
Chongqing  64000.0  30000.0   2000  175000.0
Guangzhou  47000.0   7000.0   2000  106500.0
Hangzhou   64000.0      NaN   2000       NaN
Shanghai   68000.0   8000.0   2000  150000.0
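ix: index rows by label or by position (a mix of loc and iloc). It was deprecated in pandas 0.20 and removed in 1.0; on current versions use iloc for positions:
print(df.iloc[1:4,1:3])  # select by row and column position; the old spelling was df.ix[1:4,1:3]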
             apts2  bonus
Chongqing  30000.0   2000
Guangzhou   7000.0   2000
Hangzhou       NaN   2000
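
One difference between the two indexers is easy to miss: a label slice with `loc` includes both endpoints, while a position slice with `iloc` excludes the end, like ordinary Python slicing. A minimal sketch on made-up data (`t` is a hypothetical frame):

```python
import pandas as pd

t = pd.DataFrame({'v': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

by_label = t.loc['b':'c']   # label slice: 'b' and 'c' are both included
by_pos = t.iloc[1:2]        # position slice: row 1 only, the end is excluded
cell = t.iloc[1, 0]         # row 1, column 0: the same cell as t.loc['b', 'v']

print(by_label)
print(by_pos)
print(cell)
```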

Anything you can locate, you can assign to

df.loc[:,'income']=5000
print(df)
              apts    apts2  bonus  income
Beijing    65000.0  10000.0   2000    5000
Chongqing  64000.0  30000.0   2000    5000
Guangzhou  47000.0   7000.0   2000    5000
Hangzhou   64000.0      NaN   2000    5000
Shanghai   68000.0   8000.0   2000    5000
Suzhou     64000.0      NaN   2000    5000
Tianjin    64000.0  40000.0   2000    5000
shenzhen   76000.0   6000.0   2000    5000

info(), describe(), head(), tail()

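print(df.info())   # info() prints its report itself and returns None, hence the trailing None; calling df.info() alone is enough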
<class 'pandas.core.frame.DataFrame'>
Index: 8 entries, Beijing to shenzhen
Data columns (total 4 columns):
apts      8 non-null float64
apts2     6 non-null float64
bonus     8 non-null int64
income    8 non-null int64
dtypes: float64(2), int64(2)
memory usage: 320.0+ bytes
None
print(df.describe())
               apts         apts2   bonus  income
count      8.000000      6.000000     8.0     8.0
mean   64000.000000  16833.333333  2000.0  5000.0
std     8017.837257  14483.323744     0.0     0.0
min    47000.000000   6000.000000  2000.0  5000.0
25%    64000.000000   7250.000000  2000.0  5000.0
50%    64000.000000   9000.000000  2000.0  5000.0
75%    65750.000000  25000.000000  2000.0  5000.0
max    76000.000000  40000.000000  2000.0  5000.0
print(df.head(2))
              apts    apts2  bonus  income
Beijing    65000.0  10000.0   2000    5000
Chongqing  64000.0  30000.0   2000    5000
print(df.tail(2))
             apts    apts2  bonus  income
Tianjin   64000.0  40000.0   2000    5000
shenzhen  76000.0   6000.0   2000    5000

Conditions and combining conditions

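# df2 stands for any DataFrame with a 'dow' column
#df2.loc[((df2['dow']==0)|(df2['dow']==2)|(df2['dow']==4)),:]   # combine conditions with | (or); parenthesize each one
#df2.loc[ df2['dow'].isin([0,2,4]) , : ]  # isin accepts a list, a NumPy array, or a Series
#df2.loc[~(df2['dow'].isin([0,2,4])), :]  # ~ negates the condition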
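
The commented lines can be tried out with a small stand-in frame. This is only a sketch: the `sales` column and all values are invented for illustration, and `dow` is assumed to hold integer day numbers.

```python
import pandas as pd

# Hypothetical data: 'dow' as integers 0..6, 'sales' as arbitrary numbers.
df2 = pd.DataFrame({'dow': [0, 1, 2, 3, 4, 5, 6],
                    'sales': [5, 3, 8, 2, 9, 4, 7]})

with_or = df2.loc[(df2['dow'] == 0) | (df2['dow'] == 2) | (df2['dow'] == 4), :]
with_isin = df2.loc[df2['dow'].isin([0, 2, 4]), :]      # same rows, shorter
others = df2.loc[~df2['dow'].isin([0, 2, 4]), :]        # ~ negates the mask
both = df2.loc[df2['dow'].isin([0, 2, 4]) & (df2['sales'] > 5), :]  # & is "and"

print(with_isin)
print(others)
print(both)
```

Each comparison needs its own parentheses because `|`, `&`, and `~` bind more tightly than `==` and `>` in Python.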

Filling missing values: fillna, ffill, bfill

fillna
import pandas as pd
cities={'Beijing':55000,'Shanghai':60000,'shenzhen':50000,'Hangzhou':20000,'Guangzhou':45000,'Suzhou':None}
apts=pd.Series(cities,name='income')
apts['shenzhen']=70000
less_than_50000=(apts<50000)
apts[less_than_50000]=40000
apts2=pd.Series({'Beijing':10000,'Shanghai':8000,'shenzhen':6000,'Tianjin':40000,'Guangzhou':7000,'Chongqing':30000})
apts=apts+apts2
apts[apts.isnull()]=apts.mean()
df=pd.DataFrame({'apts':apts,'apts2':apts2})
df['bonus']=2000 
df['income']=df['apts']*2+df['apts2']*1.5+df['bonus']
#print(df)
#              apts    apts2  bonus    income
#Beijing    65000.0  10000.0   2000  147000.0
#Chongqing  64000.0  30000.0   2000  175000.0
#Guangzhou  47000.0   7000.0   2000  106500.0
#Hangzhou   64000.0      NaN   2000       NaN
#Shanghai   68000.0   8000.0   2000  150000.0
#Suzhou     64000.0      NaN   2000       NaN
#Tianjin    64000.0  40000.0   2000  190000.0
#shenzhen   76000.0   6000.0   2000  163000.0
dff=df.fillna(value=0)   # returns a new DataFrame; df is unchanged
print(dff)
              apts    apts2  bonus    income
Beijing    65000.0  10000.0   2000  147000.0
Chongqing  64000.0  30000.0   2000  175000.0
Guangzhou  47000.0   7000.0   2000  106500.0
Hangzhou   64000.0      0.0   2000       0.0
Shanghai   68000.0   8000.0   2000  150000.0
Suzhou     64000.0      0.0   2000       0.0
Tianjin    64000.0  40000.0   2000  190000.0
shenzhen   76000.0   6000.0   2000  163000.0
inplace
dff=df.fillna(value=0, inplace=True)
print(df);print(dff)  # with inplace=True df itself is modified and fillna returns None, so dff is None rather than a new copy
              apts    apts2  bonus    income
Beijing    65000.0  10000.0   2000  147000.0
Chongqing  64000.0  30000.0   2000  175000.0
Guangzhou  47000.0   7000.0   2000  106500.0
Hangzhou   64000.0      0.0   2000       0.0
Shanghai   68000.0   8000.0   2000  150000.0
Suzhou     64000.0      0.0   2000       0.0
Tianjin    64000.0  40000.0   2000  190000.0
shenzhen   76000.0   6000.0   2000  163000.0
None
ffill
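dffr=df.ffill()   # forward fill: each NaN takes the value from the row above; returns a new DataFrame, df is unchanged
# (same result as the older df.fillna(method='ffill'), which recent pandas deprecates; df here must still contain NaN, i.e. before the inplace fill above)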
print(dffr)
              apts    apts2  bonus    income
Beijing    65000.0  10000.0   2000  147000.0
Chongqing  64000.0  30000.0   2000  175000.0
Guangzhou  47000.0   7000.0   2000  106500.0
Hangzhou   64000.0   7000.0   2000  106500.0
Shanghai   68000.0   8000.0   2000  150000.0
Suzhou     64000.0   8000.0   2000  150000.0
Tianjin    64000.0  40000.0   2000  190000.0
shenzhen   76000.0   6000.0   2000  163000.0
bfill
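dfba=df.bfill()   # backward fill: each NaN takes the value from the row below; returns a new DataFrame, df is unchanged
# (same result as the older df.fillna(method='bfill'); again df must still contain NaN)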
print(dfba)
              apts    apts2  bonus    income
Beijing    65000.0  10000.0   2000  147000.0
Chongqing  64000.0  30000.0   2000  175000.0
Guangzhou  47000.0   7000.0   2000  106500.0
Hangzhou   64000.0   8000.0   2000  150000.0
Shanghai   68000.0   8000.0   2000  150000.0
Suzhou     64000.0  40000.0   2000  190000.0
Tianjin    64000.0  40000.0   2000  190000.0
shenzhen   76000.0   6000.0   2000  163000.0
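
Beyond a single constant, `fillna` also accepts a dict or a Series keyed by column name, which allows a different fill value per column. A minimal sketch on made-up data (`gaps` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

gaps = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                     'b': [np.nan, 4.0, 6.0]})

# A dict gives one constant per column.
filled_const = gaps.fillna({'a': 0.0, 'b': -1.0})

# gaps.mean() is a Series indexed by column name, so each column
# is filled with its own mean.
filled_mean = gaps.fillna(gaps.mean())

print(filled_const)
print(filled_mean)
```

Both calls return new frames, so `gaps` still contains its NaN values afterwards.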

Hierarchical index (MultiIndex)

import pandas as pd
import numpy as np
data=pd.Series(np.random.randn(10),index=[['a','a','a','b','b','c','c','d','d','d'],
 [1,2,3,1,2,1,2,1,2,3]])
print(data)
print(type(data))
a  1    0.346467
   2   -0.043077
   3    0.043878
b  1    0.107763
   2   -0.175726
c  1   -1.833683
   2    0.033884
d  1   -1.807021
   2    0.819740
   3    0.294679
dtype: float64
<class 'pandas.core.series.Series'>
print(data.index)
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 0, 1, 0, 1, 2]])
print(data['b':'c'])
b  1    0.353241
   2    0.379744
c  1   -0.860706
   2   -0.795483
dtype: float64
print(data[:2])
a  1    0.763116
   2    0.058009
dtype: float64
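
Besides slicing on the outer level, a two-level Series can be addressed by outer label, by a full (outer, inner) pair, or by inner label alone. The sketch below uses fixed values instead of random ones so the results are reproducible; `s` and its contents are hypothetical.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40],
              index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

outer = s['a']             # all rows under outer label 'a', indexed by the inner level
one = s.loc[('b', 2)]      # a single value addressed by the full (outer, inner) pair
inner = s.xs(1, level=1)   # rows whose inner label is 1, indexed by the outer level

print(outer)
print(one)
print(inner)
```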

unstack: turning a Series into a DataFrame

unstack=data.unstack()   # pivot the inner index level into columns; combinations that do not exist become NaN
print(unstack)
print(type(unstack))
          1         2         3
a -0.637935 -0.104897 -1.536381
b  2.448302  1.679833       NaN
c -0.845155  0.829459       NaN
d  0.597535 -0.464255 -0.898994
<class 'pandas.core.frame.DataFrame'>  # compare with the type of data
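
By default `unstack()` moves the innermost index level into the columns; passing `level=0` moves the outer level instead. A small sketch with fixed, made-up values (`s`, `wide`, and `wide0` are hypothetical names):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0],
              index=[['a', 'a', 'b'], [1, 2, 1]])

wide = s.unstack()           # rows a, b; columns 1, 2; ('b', 2) does not exist -> NaN
wide0 = s.unstack(level=0)   # rows 1, 2; columns a, b

print(wide)
print(wide0)
```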

Reading and writing CSV files: read_csv/to_csv

import pandas as pd
cities={'Beijing':55000,'Shanghai':60000,'shenzhen':50000,'Hangzhou':20000,'Guangzhou':45000,'Suzhou':None}
apts=pd.Series(cities,name='income')
apts['shenzhen']=70000
less_than_50000=(apts<50000)
apts[less_than_50000]=40000
apts2=pd.Series({'Beijing':10000,'Shanghai':8000,'shenzhen':6000,'Tianjin':40000,'Guangzhou':7000,'Chongqing':30000})
apts=apts+apts2
apts[apts.isnull()]=apts.mean()
df=pd.DataFrame({'apts':apts,'apts2':apts2})
df['bonus']=2000  
df['income']=df['apts']*2+df['apts2']*1.5+df['bonus']
#print(df)
df.to_csv('df.csv')
df.to_csv('df2.csv',index=False) # leave out the first column, i.e. the row-index column

import os
df2_site = r"D:\PYTHON35\idle\df2.csv"
pwd = os.getcwd()  # get the current working directory
os.chdir(os.path.dirname(df2_site))
tmp_df = pd.read_csv(os.path.basename(df2_site))   ###
print(tmp_df)
      apts    apts2  bonus    income
0  65000.0  10000.0   2000  147000.0
1  64000.0  30000.0   2000  175000.0
2  47000.0   7000.0   2000  106500.0
3  64000.0      NaN   2000       NaN
4  68000.0   8000.0   2000  150000.0
5  64000.0      NaN   2000       NaN
6  64000.0  40000.0   2000  190000.0
7  76000.0   6000.0   2000  163000.0
tmp_df_index = pd.Index(['Beijing','Shanghai',"Suzhou",'Hangzhou','Tianjin','Chongqing','Nanjing','Shenzhen'])
tmp_df.index=tmp_df_index   # replace the index
print(tmp_df)
              apts    apts2  bonus    income
Beijing    65000.0  10000.0   2000  147000.0
Shanghai   64000.0  30000.0   2000  175000.0
Suzhou     47000.0   7000.0   2000  106500.0
Hangzhou   64000.0      NaN   2000       NaN
Tianjin    68000.0   8000.0   2000  150000.0
Chongqing  64000.0      NaN   2000       NaN
Nanjing    64000.0  40000.0   2000  190000.0
Shenzhen   76000.0   6000.0   2000  163000.0
df.to_csv('df3.csv',sep='\t')
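
If the index is written to the file (the default), it can be restored on reading with `index_col=0`, which avoids reassigning the labels by hand. The sketch below round-trips through an in-memory string instead of a file so it runs anywhere; `src` and its values are made up for illustration.

```python
import io
import pandas as pd

src = pd.DataFrame({'apts': [65000.0, 64000.0], 'bonus': [2000, 2000]},
                   index=['Beijing', 'Chongqing'])

# to_csv returns the CSV text when no path is given.
text = src.to_csv()
restored = pd.read_csv(io.StringIO(text), index_col=0)   # first column -> index

# A tab-separated file needs the same sep on the reading side.
tsv = src.to_csv(sep='\t')
restored_tsv = pd.read_csv(io.StringIO(tsv), sep='\t', index_col=0)

print(restored)
print(restored_tsv)
```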