2020国赛C中小微企业的信贷策略数据预处理代码

本题的思路做法有很多,本文不再赘述,只进行简单的数据预处理为后面几问分析做基础
本文目标是用python提取数据整理出最基础的数据

原题一共有3个附件,我们主要针对前两个附件进行处理

  • 附件1 123家有信贷记录企业的相关数据
  • 附件2 302家无信贷记录企业的相关数据
  • 附件3 银行贷款年利率与客户流失率关系的2019年统计数据

整理目标指标体系:
进项:单位个数,月均价税合计累计,负金额个数,月均负金额累计,作废发票个数,月均作废金额累计
销项:单位个数,月均价税合计累计,负金额个数,月均负金额累计,作废发票个数,月均作废金额累计

123家进项:

import numpy as np
import pandas as pd
import os
data1=pd.read_excel('data1.xlsx','销项发票信息')
data1
企业代号发票号码开票日期购方单位代号金额税额价税合计发票状态
0E1114593562017-08-04B037119401.711598.2911000.0有效发票
1E150762392017-08-09B008448170.941389.069560.0有效发票
2E150762402017-08-09B008448170.941389.069560.0有效发票
3E150762412017-08-09B008444085.47694.534780.0有效发票
4E150762422017-08-09B008444085.47694.534780.0有效发票
...........................
162479E12388877012019-12-17B109444827.67144.834972.5有效发票
162480E12388877022019-12-17B109447412.62222.387635.0有效发票
162481E123341730852019-12-17B130931917.4757.531975.0有效发票
162482E12388877032019-12-25B130937252.42217.587470.0有效发票
162483E12388877042019-12-25B130936660.19199.816860.0有效发票

162484 rows × 8 columns

data1['发票状态'] =data1['发票状态'].map({'作废发票':0,'有效发票':1})
data1['发票状态'].value_counts()
1.0    151278
0.0     11159
Name: 发票状态, dtype: int64
data1
企业代号发票号码开票日期购方单位代号金额税额价税合计发票状态
0E1114593562017-08-04B037119401.711598.2911000.01.0
1E150762392017-08-09B008448170.941389.069560.01.0
2E150762402017-08-09B008448170.941389.069560.01.0
3E150762412017-08-09B008444085.47694.534780.01.0
4E150762422017-08-09B008444085.47694.534780.01.0
...........................
162479E12388877012019-12-17B109444827.67144.834972.51.0
162480E12388877022019-12-17B109447412.62222.387635.01.0
162481E123341730852019-12-17B130931917.4757.531975.01.0
162482E12388877032019-12-25B130937252.42217.587470.01.0
162483E12388877042019-12-25B130936660.19199.816860.01.0

162484 rows × 8 columns

data1['开票日期']=data1['开票日期'].map(lambda x: 100*x.year + x.month)
data1
企业代号发票号码开票日期购方单位代号金额税额价税合计发票状态
0E111459356201708B037119401.711598.2911000.01.0
1E15076239201708B008448170.941389.069560.01.0
2E15076240201708B008448170.941389.069560.01.0
3E15076241201708B008444085.47694.534780.01.0
4E15076242201708B008444085.47694.534780.01.0
...........................
162479E1238887701201912B109444827.67144.834972.51.0
162480E1238887702201912B109447412.62222.387635.01.0
162481E12334173085201912B130931917.4757.531975.01.0
162482E1238887703201912B130937252.42217.587470.01.0
162483E1238887704201912B130936660.19199.816860.01.0

162484 rows × 8 columns

data1.groupby(['企业代号']).agg('count')
发票号码开票日期购方单位代号金额税额价税合计发票状态
企业代号
E18110811081108110811081108110
E10568568568568568568568
E10099999999999999
E10163636363636363
E102177177177177177177177
........................
E95913913913913913913913
E9620202020202020
E97264264264264264264264
E98153153153153153153153
E998888888

123 rows × 7 columns

dd=pd.pivot_table(data1, index=['企业代号','开票日期'],aggfunc=['count'])
#dd.to_excel('dd.xlsx')
data1.groupby(['企业代号'],sort=False).agg('count')
发票号码开票日期购方单位代号金额税额价税合计发票状态
企业代号
E18110811081108110811081108110
E212707127071270712707127071270712707
E324073240732407324073240732407324073
E42231223122312231223122312231
E51060106010601060106010601060
........................
E11921212121212121
E12029292929292929
E121186186186186186186186
E122118118118118118118118
E12365656565656565

123 rows × 7 columns

index=data1['企业代号']
index=index.drop_duplicates()
index
0           E1
8110        E2
20817       E3
44890       E4
47121       E5
          ... 
162065    E119
162086    E120
162115    E121
162301    E122
162419    E123
Name: 企业代号, Length: 123, dtype: object
index.iloc[1]
'E2'
count=[0 for i in range(len(index))]
for i in range(len(data1['企业代号'])):
    for j in range(len(index)):
        if index.iloc[j] == data1['企业代号'][i]:
            count[j]+=1
            continue
count
[8110,
 12707,
 24073,
 2231,
 1060,
 1052,
 8149,
 9425,
 5906,
 568,
 1117,
 285,
 8005,
 3336,
 2412,
 439,
 676,
 381,
 3099,
 777,
 1466,
 1562,
 3570,
 1782,
 978,
 504,
 1468,
 569,
 475,
 800,
 2608,
 552,
 1073,
 707,
 770,
 1547,
 1273,
 628,
 264,
 883,
 663,
 327,
 245,
 2435,
 325,
 974,
 3518,
 1016,
 513,
 285,
 895,
 379,
 454,
 3501,
 590,
 495,
 438,
 388,
 2149,
 513,
 1992,
 219,
 673,
 1375,
 485,
 679,
 215,
 73,
 587,
 578,
 375,
 248,
 236,
 555,
 8655,
 245,
 258,
 268,
 86,
 48,
 649,
 343,
 706,
 498,
 293,
 22,
 74,
 400,
 172,
 261,
 129,
 39,
 148,
 196,
 913,
 20,
 264,
 153,
 8,
 99,
 63,
 177,
 64,
 21,
 122,
 153,
 54,
 33,
 33,
 83,
 250,
 22,
 47,
 75,
 6,
 47,
 44,
 144,
 21,
 29,
 186,
 118,
 65]
index2=index.drop_duplicates()
index2=index2.reset_index(drop=True)
unitcount=pd.Series(count)
namecop=pd.Series(index2)
df=pd.concat([namecop,unitcount],axis=1)
df.columns=['企业代号','发票数']
df
企业代号发票数
0E18110
1E212707
2E324073
3E42231
4E51060
.........
118E11921
119E12029
120E121186
121E122118
122E12365

123 rows × 2 columns

newdata=data1.set_index('企业代号')#将企业代号作为列索引
newdata
发票号码开票日期购方单位代号金额税额价税合计发票状态
企业代号
E111459356201708B037119401.711598.2911000.01.0
E15076239201708B008448170.941389.069560.01.0
E15076240201708B008448170.941389.069560.01.0
E15076241201708B008444085.47694.534780.01.0
E15076242201708B008444085.47694.534780.01.0
........................
E1238887701201912B109444827.67144.834972.51.0
E1238887702201912B109447412.62222.387635.01.0
E12334173085201912B130931917.4757.531975.01.0
E1238887703201912B130937252.42217.587470.01.0
E1238887704201912B130936660.19199.816860.01.0

162484 rows × 7 columns

index=index2
index
0        E1
1        E2
2        E3
3        E4
4        E5
       ... 
118    E119
119    E120
120    E121
121    E122
122    E123
Name: 企业代号, Length: 123, dtype: object
df
企业代号发票数
0E18110
1E212707
2E324073
3E42231
4E51060
.........
118E11921
119E12029
120E121186
121E122118
122E12365

123 rows × 2 columns

aa=[]
for i in index:
    pp=newdata.loc[i]#按照企业代号进行处理
    daytemp=pp['开票日期']
    daytemp=pd.Series(daytemp)
    day_index=daytemp.drop_duplicates()#去重
    a=day_index.count()#计算总共有多少个月份
    aa.append(a)
len(aa)

day=pd.Series(aa)
df=pd.concat([df,day],axis=1)
df.columns=['企业代号','发票数','月数']
df
企业代号发票数月数
0E1811030
1E21270737
2E32407338
3E4223135
4E5106033
............
118E1192110
119E120296
120E12118624
121E12211826
122E1236515

123 rows × 3 columns

d#f.drop(df.columns[0], axis=1, inplace=True)    # 删除第1列
#df
0
1
2
3
4
...
162479
162480
162481
162482
162483

162484 rows × 0 columns

#销售单位代号个数
ab=[]
for i in index:
    pp=newdata.loc[i]#按照企业代号进行处理
    daytemp=pp['购方单位代号']
    daytemp=pd.Series(daytemp)
    day_index=daytemp.drop_duplicates()#去重
    a=day_index.count()#计算总共有多少个月份
    ab.append(a)
len(ab)

day=pd.Series(ab)
df=pd.concat([df,day],axis=1)
df.columns=['企业代号','发票数','月数','购方单位代号个数']
df
企业代号发票数月数购方单位代号个数
0E1811030360
1E212707371611
2E32407338136
3E422313526
4E510603333
...............
118E119211017
119E12029626
120E12118624123
121E1221182670
122E12365157

123 rows × 4 columns

#价税合计累计
ac=[]
for i in index:
    pp=newdata.loc[i]#按照企业代号进行处理
    daytemp=pp['价税合计']
    daytemp=pd.Series(daytemp)
    c=daytemp.sum()
    ac.append(c)
len(ac)
day=pd.Series(ac)
df=pd.concat([df,day],axis=1)
df.columns=['企业代号','发票数','月数','购方单位代号个数','价税合计累计']
df
企业代号发票数月数购方单位代号个数价税合计累计
0E18110303604.798812e+09
1E2127073716116.947082e+08
2E324073381366.827398e+08
3E4223135262.345631e+09
4E5106033332.427882e+08
..................
118E1192110173.724200e+04
119E120296261.963934e+05
120E121186241231.177928e+05
121E12211826708.748906e+04
122E123651572.671682e+05

123 rows × 5 columns

#负金额个数
ad=[]
d=0
for i in index:
    pp=newdata.loc[i]#按照企业代号进行处理
    daytemp=pp['金额']
    daytemp=pd.Series(daytemp)
    for j in range(len(daytemp)):
        if daytemp[j]<0:
            d+=1
    ad.append(d)
    d=0
len(ad)

day=pd.Series(ad)
df=pd.concat([df,day],axis=1)
df.columns=['企业代号','发票数','月数','购方单位代号个数','价税合计累计','负金额个数']
df
企业代号发票数月数购方单位代号个数价税合计累计负金额个数
0E18110303604.798812e+09229
1E2127073716116.947082e+08360
2E324073381366.827398e+084501
3E4223135262.345631e+099
4E5106033332.427882e+0810
.....................
118E1192110173.724200e+040
119E120296261.963934e+050
120E121186241231.177928e+050
121E12211826708.748906e+048
122E123651572.671682e+050

123 rows × 6 columns

# 负金额总额
ae=[]
sum=0
for i in index:
    pp=newdata.loc[i]#按照企业代号进行处理
    daytemp=pp['金额']
    daytemp=pd.Series(daytemp)
    for j in range(len(daytemp)):
        if daytemp[j]<0:
            sum+=daytemp[j]
    ae.append(round(sum,2))
    sum=0
len(ae)
day=pd.Series(ae)
df=pd.concat([df,day],axis=1)
df.columns=['企业代号','发票数','月数','购方单位代号个数','价税合计累计','负金额个数','负金额累计']
df
企业代号发票数月数购方单位代号个数价税合计累计负金额个数负金额累计
0E18110303604.798812e+09229-1.300769e+08
1E2127073716116.947082e+08360-1.909563e+07
2E324073381366.827398e+084501-2.047618e+08
3E4223135262.345631e+099-7.221601e+06
4E5106033332.427882e+0810-9.652789e+05
........................
118E1192110173.724200e+0400.000000e+00
119E120296261.963934e+0500.000000e+00
120E121186241231.177928e+0500.000000e+00
121E12211826708.748906e+048-1.952390e+03
122E123651572.671682e+0500.000000e+00

123 rows × 7 columns

#作废发票个数
af=[]
f=0
for i in index:
    pp=newdata.loc[i]#按照企业代号进行处理
    daytemp=pp['发票状态']
    daytemp=pd.Series(daytemp)
    for j in range(len(daytemp)):
        if daytemp[j]==0:
            f+=1
    af.append(f)
    f=0
len(af)
day=pd.Series(af)
df=pd.concat([df,day],axis=1)
df.columns=['企业代号','发票数','月数','购方单位代号个数','价税合计累计','负金额个数','负金额累计','作废发票个数']
df
企业代号发票数月数购方单位代号个数价税合计累计负金额个数负金额累计作废发票个数
0E18110303604.798812e+09229-1.300769e+08224
1E2127073716116.947082e+08360-1.909563e+071042
2E324073381366.827398e+084501-2.047618e+08385
3E4223135262.345631e+099-7.221601e+06190
4E5106033332.427882e+0810-9.652789e+0555
...........................
118E1192110173.724200e+0400.000000e+003
119E120296261.963934e+0500.000000e+0020
120E121186241231.177928e+0500.000000e+0023
121E12211826708.748906e+048-1.952390e+0316
122E123651572.671682e+0500.000000e+0032

123 rows × 8 columns

#作废发票金额
ag=[]
gsum=0
for i in index:
    pp=newdata.loc[i]#按照企业代号进行处理
    daytemp=pp['发票状态']
    money=pp['金额']
    daytemp=pd.Series(daytemp)
    for j in range(len(daytemp)):
        if daytemp[j]==0:
            gsum+=money[j]
    ag.append(round(gsum,2))
    gsum=0
len(ag)    
day=pd.Series(ag)
df=pd.concat([df,day],axis=1)
df.columns=['企业代号','发票数','月数','购方单位代号个数','价税合计累计','负金额个数','负金额累计','作废发票个数','作废发票金额累计']
df
企业代号发票数月数购方单位代号个数价税合计累计负金额个数负金额累计作废发票个数作废发票金额累计
0E18110303604.798812e+09229-1.300769e+082248.715034e+07
1E2127073716116.947082e+08360-1.909563e+0710426.454014e+07
2E324073381366.827398e+084501-2.047618e+083851.860897e+07
3E4223135262.345631e+099-7.221601e+061901.712102e+08
4E5106033332.427882e+0810-9.652789e+05558.168982e+06
..............................
118E1192110173.724200e+0400.000000e+0034.854400e+02
119E120296261.963934e+0500.000000e+00200.000000e+00
120E121186241231.177928e+0500.000000e+00233.184300e+03
121E12211826708.748906e+048-1.952390e+03162.565260e+04
122E123651572.671682e+0500.000000e+00323.647341e+04

123 rows × 9 columns

df.to_excel('销项数据整理.xlsx')
  • 1
    点赞
  • 8
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值