Feature Engineering

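The dataset comes from Data Hackathon 3.x. The feature processing here is only a basic reference; feel free to try more feature engineering on your own. For ideas, see the Feature engineering material on GitHub and the Kaggle Titanic case studies.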

Load the required libraries:

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
In [2]:
# Load the data
train = pd.read_csv('Train.csv')
test = pd.read_csv('Test.csv')
In [3]:
train.shape, test.shape
Out[3]:
((87020, 26), (37717, 24))

Take a look at the basic structure of the data

In [4]:
train.dtypes
Out[4]:
ID                        object
Gender                    object
City                      object
Monthly_Income             int64
DOB                       object
Lead_Creation_Date        object
Loan_Amount_Applied      float64
Loan_Tenure_Applied      float64
Existing_EMI             float64
Employer_Name             object
Salary_Account            object
Mobile_Verified           object
Var5                       int64
Var1                      object
Loan_Amount_Submitted    float64
Loan_Tenure_Submitted    float64
Interest_Rate            float64
Processing_Fee           float64
EMI_Loan_Submitted       float64
Filled_Form               object
Device_Type               object
Var2                      object
Source                    object
Var4                       int64
LoggedIn                   int64
Disbursed                  int64
dtype: object

Look at the first 5 rows

In [5]:
train.head(5)
Out[5]:
  ID Gender City Monthly_Income DOB Lead_Creation_Date Loan_Amount_Applied Loan_Tenure_Applied Existing_EMI Employer_Name ... Interest_Rate Processing_Fee EMI_Loan_Submitted Filled_Form Device_Type Var2 Source Var4 LoggedIn Disbursed
0 ID000002C20 Female Delhi 20000 23-May-78 15-May-15 300000.0 5.0 0.0 CYBOSOL ... NaN NaN NaN N Web-browser G S122 1 0 0
1 ID000004E40 Male Mumbai 35000 07-Oct-85 04-May-15 200000.0 2.0 0.0 TATA CONSULTANCY SERVICES LTD (TCS) ... 13.25 NaN 6762.9 N Web-browser G S122 3 0 0
2 ID000007H20 Male Panchkula 22500 10-Oct-81 19-May-15 600000.0 4.0 0.0 ALCHEMIST HOSPITALS LTD ... NaN NaN NaN N Web-browser B S143 1 0 0
3 ID000008I30 Male Saharsa 35000 30-Nov-87 09-May-15 1000000.0 5.0 0.0 BIHAR GOVERNMENT ... NaN NaN NaN N Web-browser B S143 3 0 0
4 ID000009J40 Male Bengaluru 100000 17-Feb-84 20-May-15 500000.0 2.0 25000.0 GLOBAL EDGE SOFTWARE ... NaN NaN NaN N Web-browser B S134 3 1 0

5 rows × 26 columns

In [6]:
# Combine train and test into one frame, tagging each row with its origin
train['source']= 'train'
test['source'] = 'test'
data=pd.concat([train, test],ignore_index=True)
data.shape
Out[6]:
(124737, 27)

An important step in any data application or modeling task is checking for anomalies, such as missing values

In [7]:
data.apply(lambda x: sum(x.isnull()))
Out[7]:
City                      1401
DOB                          0
Device_Type                  0
Disbursed                37717
EMI_Loan_Submitted       84901
Employer_Name              113
Existing_EMI               111
Filled_Form                  0
Gender                       0
ID                           0
Interest_Rate            84901
Lead_Creation_Date           0
Loan_Amount_Applied        111
Loan_Amount_Submitted    49535
Loan_Tenure_Applied        111
Loan_Tenure_Submitted    49535
LoggedIn                 37717
Mobile_Verified              0
Monthly_Income               0
Processing_Fee           85346
Salary_Account           16801
Source                       0
Var1                         0
Var2                         0
Var4                         0
Var5                         0
source                       0
dtype: int64
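
The per-column count above can also be written without a lambda, and it is often easier to read next to a percentage. The following is a minimal sketch on a made-up toy frame (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; the values are invented for illustration.
toy = pd.DataFrame({
    'City': ['Delhi', None, 'Mumbai', None],
    'Monthly_Income': [20000, 35000, 22500, 35000],
    'Interest_Rate': [np.nan, 13.25, np.nan, np.nan],
})

# isnull() gives a boolean frame; sum() counts True values per column,
# mean() gives the fraction of True values per column.
missing = pd.DataFrame({
    'n_missing': toy.isnull().sum(),
    'pct_missing': toy.isnull().mean() * 100,
})
missing = missing.sort_values('n_missing', ascending=False)
print(missing)
```

Sorting by the count puts the columns that need the most attention at the top.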

To understand the data better, look at how many distinct values each of these fields takes (you can even inspect the distributions)

In [8]:
var = ['Gender','Salary_Account','Mobile_Verified','Var1','Filled_Form','Device_Type','Var2','Source']
for v in var:
    print('\nValue counts for column %s\n' % v)
    print(data[v].value_counts())
Value counts for column Gender

Male      71398
Female    53339
Name: Gender, dtype: int64

Value counts for column Salary_Account

HDFC Bank                                          25180
ICICI Bank                                         19547
State Bank of India                                17110
Axis Bank                                          12590
Citibank                                            3398
Kotak Bank                                          2955
IDBI Bank                                           2213
Punjab National Bank                                1747
Bank of India                                       1713
Bank of Baroda                                      1675
Standard Chartered Bank                             1434
Canara Bank                                         1385
Union Bank of India                                 1330
Yes Bank                                            1120
ING Vysya                                            996
Corporation bank                                     948
Indian Overseas Bank                                 901
State Bank of Hyderabad                              854
Indian Bank                                          773
Oriental Bank of Commerce                            761
IndusInd Bank                                        711
Andhra Bank                                          706
Central Bank of India                                648
Syndicate Bank                                       614
Bank of Maharasthra                                  576
HSBC                                                 474
State Bank of Bikaner & Jaipur                       448
Karur Vysya Bank                                     435
State Bank of Mysore                                 385
Federal Bank                                         377
Vijaya Bank                                          354
Allahabad Bank                                       345
UCO Bank                                             344
State Bank of Travancore                             333
Karnataka Bank                                       279
United Bank of India                                 276
Dena Bank                                            268
Saraswat Bank                                        265
State Bank of Patiala                                263
South Indian Bank                                    223
Deutsche Bank                                        176
Abhyuday Co-op Bank Ltd                              161
The Ratnakar Bank Ltd                                113
Tamil Nadu Mercantile Bank                           103
Punjab & Sind bank                                    84
J&K Bank                                              78
Lakshmi Vilas bank                                    69
Dhanalakshmi Bank Ltd                                 66
State Bank of Indore                                  32
Catholic Syrian Bank                                  27
India Bulls                                           21
B N P Paribas                                         15
Firstrand Bank Limited                                11
GIC Housing Finance Ltd                               10
Bank of Rajasthan                                      8
Kerala Gramin Bank                                     4
Industrial And Commercial Bank Of China Limited        3
Ahmedabad Mercantile Cooperative Bank                  1
Name: Salary_Account, dtype: int64

Value counts for column Mobile_Verified

Y    80928
N    43809
Name: Mobile_Verified, dtype: int64

Value counts for column Var1

HBXX    84901
HBXC    12952
HBXB     6502
HAXA     4214
HBXA     3042
HAXB     2879
HBXD     2818
HAXC     2171
HBXH     1387
HCXF      990
HAYT      710
HAVC      570
HAXM      386
HCXD      348
HCYS      318
HVYS      252
HAZD      161
HCXG      114
HAXF       22
Name: Var1, dtype: int64

Value counts for column Filled_Form

N    96740
Y    27997
Name: Filled_Form, dtype: int64

Value counts for column Device_Type

Web-browser    92105
Mobile         32632
Name: Device_Type, dtype: int64

Value counts for column Var2

B    53481
G    47338
C    20366
E     1855
D      918
F      770
A        9
Name: Var2, dtype: int64

Value counts for column Source

S122    55249
S133    42900
S159     7999
S143     6140
S127     2804
S137     2450
S134     1900
S161     1109
S151     1018
S157      929
S153      705
S144      447
S156      432
S158      294
S123      112
S141       83
S162       60
S124       43
S150       19
S160       11
S136        5
S138        5
S155        5
S139        4
S129        4
S135        2
S131        1
S130        1
S132        1
S125        1
S140        1
S142        1
S126        1
S154        1
Name: Source, dtype: int64
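
When the full listings get long, a compact summary can be enough: the number of distinct values per column, plus relative frequencies for one column. This is a small sketch on invented toy data, not output from the real dataset:

```python
import pandas as pd

# Toy frame standing in for `data`; the values are invented for illustration.
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Male'],
    'Source': ['S122', 'S133', 'S122', 'S159'],
    'Monthly_Income': [20000, 35000, 22500, 35000],
})

# Number of distinct values in each categorical column.
cardinality = toy[['Gender', 'Source']].nunique()
print(cardinality)

# Relative frequencies instead of raw counts.
shares = toy['Source'].value_counts(normalize=True)
print(shares)
```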

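Next, you can start processing the fields (features)

Only some simple processing is done here; feel free to build more sophisticated feature handling on top of it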

Handling the City field

In [9]:
len(data['City'].unique())
Out[9]:
724
City has a lot of distinct values (724). To keep things simple, we drop this field
In [10]:
data.drop('City',axis=1,inplace=True)

Handling the DOB field

DOB is the exact date of birth. The exact date is not very useful, but the age group may be, so we compute the age instead

In [11]:
data['DOB'].head()
Out[11]:
0    23-May-78
1    07-Oct-85
2    10-Oct-81
3    30-Nov-87
4    17-Feb-84
Name: DOB, dtype: object
In [12]:
# Create an Age field: age as of 2015, assuming every two-digit birth year falls in the 1900s
data['Age'] = data['DOB'].apply(lambda x: 115 - int(x[-2:]))
data['Age'].head()
Out[12]:
0    37
1    30
2    34
3    28
4    31
Name: Age, dtype: int64
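
The same age calculation can be done by parsing the date, which also makes malformed strings fail loudly instead of being silently sliced. This is a sketch only: the helper name `age_from_dob` and the `ref_year` parameter are hypothetical, and like the cell above it assumes every birth year is in the 1900s.

```python
import pandas as pd

def age_from_dob(dob, ref_year=2015):
    """Age in whole years at ref_year for strings such as '23-May-78'.

    Hypothetical helper. Assumes all birth years are 19xx, matching the
    `115 - yy` shortcut used in the notebook.
    """
    parsed = pd.to_datetime(dob, format='%d-%b-%y')
    # %y maps small two-digit years to 20xx, so rebuild the year as 19yy.
    birth_year = 1900 + parsed.dt.year % 100
    return ref_year - birth_year

dob = pd.Series(['23-May-78', '07-Oct-85', '15-Jan-60'])
ages = age_from_dob(dob)
print(ages.tolist())
```

Keeping the reference year as a parameter avoids the hard-coded 115 if the data is revisited later.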
In [13]:
# Drop the original DOB field
data.drop('DOB',axis=1,inplace=True)

Handling the EMI_Loan_Submitted field

In [14]:
data.boxplot(column=['EMI_Loan_Submitted'],return_type='axes')
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a00d550>
In [15]:
# Many values are missing, so add a new field that flags whether the value is missing
data['EMI_Loan_Submitted_Missing'] = data['EMI_Loan_Submitted'].apply(lambda x: 1 if pd.isnull(x) else 0)
data[['EMI_Loan_Submitted','EMI_Loan_Submitted_Missing']].head(10)
Out[15]:
  EMI_Loan_Submitted EMI_Loan_Submitted_Missing
0 NaN 1
1 6762.90 0
2 NaN 1
3 NaN 1
4 NaN 1
5 6978.92 0
6 NaN 1
7 NaN 1
8 30824.65 0
9 10883.38 0
In [16]:
# The original column is no longer needed
data.drop('EMI_Loan_Submitted',axis=1,inplace=True)
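
This flag-then-drop pattern recurs for several mostly-empty columns in this notebook. As a sketch, it could be written once as a small helper; `add_missing_flags` is a hypothetical name, and the toy values are invented:

```python
import numpy as np
import pandas as pd

def add_missing_flags(df, columns):
    """Return a copy of df in which each listed column is replaced by a
    0/1 `<column>_Missing` indicator (hypothetical helper)."""
    out = df.copy()
    for col in columns:
        # isnull() is vectorised, so no per-row lambda is needed.
        out[col + '_Missing'] = out[col].isnull().astype(int)
    return out.drop(columns=columns)

toy = pd.DataFrame({
    'EMI_Loan_Submitted': [np.nan, 6762.90, np.nan],
    'Interest_Rate': [np.nan, 13.25, 14.85],
})
flags = add_missing_flags(toy, ['EMI_Loan_Submitted', 'Interest_Rate'])
print(flags)
```

Note that this discards the observed values entirely; whether that loses useful signal is worth checking for each column.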

Handling the Employer_Name field

Check the number of distinct values
In [17]:
len(data['Employer_Name'].value_counts())
Out[17]:
57193
Unsurprisingly, there are far too many distinct employer names (57,193), so the lazy option is simply to drop the column
In [18]:
# Drop it
data.drop('Employer_Name',axis=1,inplace=True)

The Existing_EMI field

In [19]:
data.boxplot(column='Existing_EMI',return_type='axes')
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x109a2c310>
In [21]:
data['Existing_EMI'].describe()
Out[21]:
count    1.246260e+05
mean     3.636342e+03
std      3.369124e+04
min      0.000000e+00
25%               NaN
50%               NaN
75%               NaN
max      1.000000e+07
Name: Existing_EMI, dtype: float64
In [22]:
# Only a few values are missing; fill them with 0 (note: 0, not the mean)
data['Existing_EMI'] = data['Existing_EMI'].fillna(0)

The Interest_Rate field:

In [23]:
data.boxplot(column=['Interest_Rate'],return_type='axes')
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a05f250>
In [24]:
# Too many values are missing, so again create a field that flags whether the value is missing
data['Interest_Rate_Missing'] = data['Interest_Rate'].apply(lambda x: 1 if pd.isnull(x) else 0)
print(data[['Interest_Rate','Interest_Rate_Missing']].head(10))
   Interest_Rate  Interest_Rate_Missing
0            NaN                      1
1          13.25                      0
2            NaN                      1
3            NaN                      1
4            NaN                      1
5          13.99                      0
6            NaN                      1
7            NaN                      1
8          14.85                      0
9          18.25                      0
In [25]:
data.drop('Interest_Rate',axis=1,inplace=True)

The Lead_Creation_Date field

In [26]:
# Not needed; drop it
data.drop('Lead_Creation_Date',axis=1,inplace=True)
data.head()
Out[26]:
  Device_Type Disbursed Existing_EMI Filled_Form Gender ID Loan_Amount_Applied Loan_Amount_Submitted Loan_Tenure_Applied Loan_Tenure_Submitted ... Salary_Account Source Var1 Var2 Var4 Var5 source Age EMI_Loan_Submitted_Missing Interest_Rate_Missing
0 Web-browser 0.0 0.0 N Female ID000002C20 300000.0 NaN 5.0 NaN ... HDFC Bank S122 HBXX G 1 0 train 37 1 1
1 Web-browser 0.0 0.0 N Male ID000004E40 200000.0 200000.0 2.0 2.0 ... ICICI Bank S122 HBXA G 3 13 train 30 0 0
2 Web-browser 0.0 0.0 N Male ID000007H20 600000.0 450000.0 4.0 4.0 ... State Bank of India S143 HBXX B 1 0 train 34 1 1
3 Web-browser 0.0 0.0 N Male ID000008I30 1000000.0 920000.0 5.0 5.0 ... State Bank of India S143 HBXX B 3 10 train 28 1 1
4 Web-browser 0.0 25000.0 N Male ID000009J40 500000.0 500000.0 2.0 2.0 ... HDFC Bank S134 HBXX B 3 17 train 31 1 1

5 rows × 24 columns

The Loan Amount and Tenure applied fields

In [27]:
# Fill the missing values with the median (only a few values are missing)
data['Loan_Amount_Applied'] = data['Loan_Amount_Applied'].fillna(data['Loan_Amount_Applied'].median())
data['Loan_Tenure_Applied'] = data['Loan_Tenure_Applied'].fillna(data['Loan_Tenure_Applied'].median())
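
One reason to prefer the median here is that loan amounts are heavily skewed, and a single extreme value pulls the mean far from a typical application. A small sketch with invented numbers shows the difference:

```python
import numpy as np
import pandas as pd

# Invented amounts: three ordinary loans, one missing value, one extreme outlier.
amounts = pd.Series([100000.0, 200000.0, 300000.0, np.nan, 15000000.0])

mean_filled = amounts.fillna(amounts.mean())
median_filled = amounts.fillna(amounts.median())

print(mean_filled[3])    # value imputed with the mean
print(median_filled[3])  # value imputed with the median
```

With these numbers the mean-imputed value is larger than every ordinary loan in the series, while the median stays in the typical range.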
In [28]:
data.head()
Out[28]:
  Device_Type Disbursed Existing_EMI Filled_Form Gender ID Loan_Amount_Applied Loan_Amount_Submitted Loan_Tenure_Applied Loan_Tenure_Submitted ... Salary_Account Source Var1 Var2 Var4 Var5 source Age EMI_Loan_Submitted_Missing Interest_Rate_Missing
0 Web-browser 0.0 0.0 N Female ID000002C20 300000.0 NaN 5.0 NaN ... HDFC Bank S122 HBXX G 1 0 train 37 1 1
1 Web-browser 0.0 0.0 N Male ID000004E40 200000.0 200000.0 2.0 2.0 ... ICICI Bank S122 HBXA G 3 13 train 30 0 0
2 Web-browser 0.0 0.0 N Male ID000007H20 600000.0 450000.0 4.0 4.0 ... State Bank of India S143 HBXX B 1 0 train 34 1 1
3 Web-browser 0.0 0.0 N Male ID000008I30 1000000.0 920000.0 5.0 5.0 ... State Bank of India S143 HBXX B 3 10 train 28 1 1
4 Web-browser 0.0 25000.0 N Male ID000009J40 500000.0 500000.0 2.0 2.0 ... HDFC Bank S134 HBXX B 3 17 train 31 1 1

5 rows × 24 columns

Loan Amount and Tenure selected

In [29]:
# Too many missing values; just flag whether each value is missing
data['Loan_Amount_Submitted_Missing'] = data['Loan_Amount_Submitted'].apply(lambda x: 1 if pd.isnull(x) else 0)
data['Loan_Tenure_Submitted_Missing'] = data['Loan_Tenure_Submitted'].apply(lambda x: 1 if pd.isnull(x) else 0)
In [45]:
data.head()
Out[45]:
  Device_Type Disbursed Employer_Name Existing_EMI Filled_Form Gender ID Loan_Amount_Applied Loan_Amount_Submitted Loan_Tenure_Applied ... Var1 Var2 Var4 Var5 source Age EMI_Loan_Submitted_Missing Interest_Rate_Missing Loan_Amount_Submitted_Missing Loan_Tenure_Submitted_Missing
0 Web-browser 0.0 CYBOSOL 0.0 N Female ID000002C20 300000.0 NaN 5.0 ... HBXX G 1 0 train 37 1 1 1 1
1 Web-browser 0.0 TATA CONSULTANCY SERVICES LTD (TCS) 0.0 N Male ID000004E40 200000.0 200000.0 2.0 ... HBXA G 3 13 train 30 0 0 0 0
2 Web-browser 0.0 ALCHEMIST HOSPITALS LTD 0.0 N Male ID000007H20 600000.0 450000.0 4.0 ... HBXX B 1 0 train 34 1 1 0 0
3 Web-browser 0.0 BIHAR GOVERNMENT 0.0 N Male ID000008I30 1000000.0 920000.0 5.0 ... HBXX B 3 10 train 28 1 1 0 0
4 Web-browser 0.0 GLOBAL EDGE SOFTWARE 25000.0 N Male ID000009J40 500000.0 500000.0 2.0 ... HBXX B 3 17 train 31 1 1 0 0

5 rows × 27 columns

In [30]:
# The original fields are no longer needed
data.drop(['Loan_Amount_Submitted','Loan_Tenure_Submitted'],axis=1,inplace=True)

LoggedIn

In [31]:
# Not sure how to use this field yet, so drop it
data.drop('LoggedIn',axis=1,inplace=True)

Salary_Account

In [32]:
# An applicant may deal with several banks, so drop this field as well
data.drop('Salary_Account',axis=1,inplace=True)

Processing_Fee

In [33]:
# Same treatment as before: flag whether the value is missing
data['Processing_Fee_Missing'] = data['Processing_Fee'].apply(lambda x: 1 if pd.isnull(x) else 0)
# Drop the old field
data.drop('Processing_Fee',axis=1,inplace=True)

Source

In [34]:
data['Source'] = data['Source'].apply(lambda x: 'others' if x not in ['S122','S133'] else x)
data['Source'].value_counts()
Out[34]:
S122      55249
S133      42900
others    26588
Name: Source, dtype: int64
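
The cell above hard-codes the two most frequent sources. A sketch of the same idea with the frequent categories picked from the data is shown below; `keep_top_categories` is a hypothetical helper and the toy values are invented:

```python
import pandas as pd

def keep_top_categories(series, top_n=2, other_label='others'):
    """Keep the top_n most frequent values and map the rest to other_label
    (hypothetical helper)."""
    top = series.value_counts().index[:top_n]
    # where() keeps values for which the condition holds, else substitutes.
    return series.where(series.isin(top), other_label)

source = pd.Series(['S122', 'S133', 'S122', 'S159', 'S143', 'S133', 'S122'])
collapsed = keep_top_categories(source, top_n=2)
print(collapsed.value_counts())
```

In a real pipeline the set of top categories should be chosen from the training rows only and then reused on the test rows.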

What the final data looks like

In [35]:
data.head()
Out[35]:
  Device_Type Disbursed Existing_EMI Filled_Form Gender ID Loan_Amount_Applied Loan_Tenure_Applied Mobile_Verified Monthly_Income ... Var2 Var4 Var5 source Age EMI_Loan_Submitted_Missing Interest_Rate_Missing Loan_Amount_Submitted_Missing Loan_Tenure_Submitted_Missing Processing_Fee_Missing
0 Web-browser 0.0 0.0 N Female ID000002C20 300000.0 5.0 N 20000 ... G 1 0 train 37 1 1 1 1 1
1 Web-browser 0.0 0.0 N Male ID000004E40 200000.0 2.0 Y 35000 ... G 3 13 train 30 0 0 0 0 1
2 Web-browser 0.0 0.0 N Male ID000007H20 600000.0 4.0 Y 22500 ... B 1 0 train 34 1 1 0 0 1
3 Web-browser 0.0 0.0 N Male ID000008I30 1000000.0 5.0 Y 35000 ... B 3 10 train 28 1 1 0 0 1
4 Web-browser 0.0 25000.0 N Male ID000009J40 500000.0 2.0 Y 100000 ... B 3 17 train 31 1 1 0 0 1

5 rows × 22 columns

In [36]:
data.describe()
Out[36]:
  Disbursed Existing_EMI Loan_Amount_Applied Loan_Tenure_Applied Monthly_Income Var4 Var5 Age EMI_Loan_Submitted_Missing Interest_Rate_Missing Loan_Amount_Submitted_Missing Loan_Tenure_Submitted_Missing Processing_Fee_Missing
count 87020.000000 1.247370e+05 1.247370e+05 124737.000000 1.247370e+05 124737.000000 124737.000000 124737.000000 124737.000000 124737.000000 124737.000000 124737.000000 124737.000000
mean 0.014629 3.633107e+03 2.298744e+05 2.138075 5.309073e+04 2.950560 4.964774 30.906996 0.680640 0.680640 0.397116 0.397116 0.684208
std 0.120062 3.367642e+04 3.539938e+05 2.014874 1.823394e+06 1.695261 5.669784 7.137860 0.466231 0.466231 0.489302 0.489302 0.464833
min 0.000000 0.000000e+00 0.000000e+00 0.000000 0.000000e+00 0.000000 0.000000 18.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% NaN 0.000000e+00 0.000000e+00 0.000000 1.650000e+04 1.000000 0.000000 26.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% NaN 0.000000e+00 1.000000e+05 2.000000 2.500000e+04 3.000000 2.000000 29.000000 1.000000 1.000000 0.000000 0.000000 1.000000
75% NaN 3.500000e+03 3.000000e+05 4.000000 4.000000e+04 5.000000 11.000000 34.000000 1.000000 1.000000 1.000000 1.000000 1.000000
max 1.000000 1.000000e+07 1.500000e+07 10.000000 4.445544e+08 7.000000 18.000000 100.000000 1.000000 1.000000 1.000000 1.000000 1.000000
In [37]:
data.apply(lambda x: sum(x.isnull()))
Out[37]:
Device_Type                          0
Disbursed                        37717
Existing_EMI                         0
Filled_Form                          0
Gender                               0
ID                                   0
Loan_Amount_Applied                  0
Loan_Tenure_Applied                  0
Mobile_Verified                      0
Monthly_Income                       0
Source                               0
Var1                                 0
Var2                                 0
Var4                                 0
Var5                                 0
source                               0
Age                                  0
EMI_Loan_Submitted_Missing           0
Interest_Rate_Missing                0
Loan_Amount_Submitted_Missing        0
Loan_Tenure_Submitted_Missing        0
Processing_Fee_Missing               0
dtype: int64
In [38]:
data.dtypes
Out[38]:
Device_Type                       object
Disbursed                        float64
Existing_EMI                     float64
Filled_Form                       object
Gender                            object
ID                                object
Loan_Amount_Applied              float64
Loan_Tenure_Applied              float64
Mobile_Verified                   object
Monthly_Income                     int64
Source                            object
Var1                              object
Var2                              object
Var4                               int64
Var5                               int64
source                            object
Age                                int64
EMI_Loan_Submitted_Missing         int64
Interest_Rate_Missing              int64
Loan_Amount_Submitted_Missing      int64
Loan_Tenure_Submitted_Missing      int64
Processing_Fee_Missing             int64
dtype: object

Label encoding

In [39]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
var_to_encode = ['Device_Type','Filled_Form','Gender','Var1','Var2','Mobile_Verified','Source']
for col in var_to_encode:
    data[col] = le.fit_transform(data[col])
In [40]:
data.head()
Out[40]:
  Device_Type Disbursed Existing_EMI Filled_Form Gender ID Loan_Amount_Applied Loan_Tenure_Applied Mobile_Verified Monthly_Income ... Var2 Var4 Var5 source Age EMI_Loan_Submitted_Missing Interest_Rate_Missing Loan_Amount_Submitted_Missing Loan_Tenure_Submitted_Missing Processing_Fee_Missing
0 1 0.0 0.0 0 0 ID000002C20 300000.0 5.0 0 20000 ... 6 1 0 train 37 1 1 1 1 1
1 1 0.0 0.0 0 1 ID000004E40 200000.0 2.0 1 35000 ... 6 3 13 train 30 0 0 0 0 1
2 1 0.0 0.0 0 1 ID000007H20 600000.0 4.0 1 22500 ... 1 1 0 train 34 1 1 0 0 1
3 1 0.0 0.0 0 1 ID000008I30 1000000.0 5.0 1 35000 ... 1 3 10 train 28 1 1 0 0 1
4 1 0.0 25000.0 0 1 ID000009J40 500000.0 2.0 1 100000 ... 1 3 17 train 31 1 1 0 0 1

5 rows × 22 columns
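
Because a single `LabelEncoder` is refit for each column in the loop above, only the last column's mapping survives in `le`. If the code-to-label mapping matters later, it has to be stored per column. The sketch below uses pandas categoricals rather than scikit-learn (a deliberate substitution, to keep the example dependency-light); for string columns both assign integer codes in sorted order of the labels. The toy values are invented:

```python
import pandas as pd

toy = pd.DataFrame({
    'Gender': ['Female', 'Male', 'Male'],
    'Var2': ['G', 'B', 'G'],
})

mappings = {}
for col in ['Gender', 'Var2']:
    cat = toy[col].astype('category')
    # Remember which integer code stands for which original label.
    mappings[col] = dict(enumerate(cat.cat.categories))
    toy[col] = cat.cat.codes

print(toy)
print(mappings)

# Decode a column back to its labels when needed.
decoded = toy['Gender'].map(mappings['Gender'])
```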

In [41]:
data.dtypes
Out[41]:
Device_Type                        int64
Disbursed                        float64
Existing_EMI                     float64
Filled_Form                        int64
Gender                             int64
ID                                object
Loan_Amount_Applied              float64
Loan_Tenure_Applied              float64
Mobile_Verified                    int64
Monthly_Income                     int64
Source                             int64
Var1                               int64
Var2                               int64
Var4                               int64
Var5                               int64
source                            object
Age                                int64
EMI_Loan_Submitted_Missing         int64
Interest_Rate_Missing              int64
Loan_Amount_Submitted_Missing      int64
Loan_Tenure_Submitted_Missing      int64
Processing_Fee_Missing             int64
dtype: object

One-hot encoding of the categorical features

In [42]:
data = pd.get_dummies(data, columns=var_to_encode)
data.columns
Out[42]:
Index([u'Disbursed', u'Existing_EMI', u'ID', u'Loan_Amount_Applied',
       u'Loan_Tenure_Applied', u'Monthly_Income', u'Var4', u'Var5', u'source',
       u'Age', u'EMI_Loan_Submitted_Missing', u'Interest_Rate_Missing',
       u'Loan_Amount_Submitted_Missing', u'Loan_Tenure_Submitted_Missing',
       u'Processing_Fee_Missing', u'Device_Type_0', u'Device_Type_1',
       u'Filled_Form_0', u'Filled_Form_1', u'Gender_0', u'Gender_1', u'Var1_0',
       u'Var1_1', u'Var1_2', u'Var1_3', u'Var1_4', u'Var1_5', u'Var1_6',
       u'Var1_7', u'Var1_8', u'Var1_9', u'Var1_10', u'Var1_11', u'Var1_12',
       u'Var1_13', u'Var1_14', u'Var1_15', u'Var1_16', u'Var1_17', u'Var1_18',
       u'Var2_0', u'Var2_1', u'Var2_2', u'Var2_3', u'Var2_4', u'Var2_5',
       u'Var2_6', u'Mobile_Verified_0', u'Mobile_Verified_1', u'Source_0',
       u'Source_1', u'Source_2'],
      dtype='object')
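
Since the columns were label-encoded first, the dummy columns above carry integer suffixes such as `Var1_7`. `pd.get_dummies` also accepts string columns directly, which gives self-describing names. A minimal sketch on invented toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    'Gender': ['Female', 'Male', 'Male'],
    'Source': ['S122', 'others', 'S133'],
    'Age': [37, 30, 34],
})

# One indicator column per category; non-listed columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=['Gender', 'Source'])
print(list(encoded.columns))
```

Readable column names make feature-importance output much easier to interpret later on.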

Split back into training and test data

In [43]:
# Take explicit copies so the in-place drops below modify independent frames
# rather than slices of `data` (this avoids pandas' SettingWithCopyWarning)
train = data.loc[data['source']=='train'].copy()
test = data.loc[data['source']=='test'].copy()
In [44]:
train.drop('source',axis=1,inplace=True)
test.drop(['source','Disbursed'],axis=1,inplace=True)
In [45]:
train.to_csv('train_modified.csv',index=False)
test.to_csv('test_modified.csv',index=False)
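
The overall pattern of this notebook (tag each row with its origin, concatenate, transform once, split back) can be checked end to end on a tiny example. This is a sketch with invented frames; it uses the non-in-place form of `drop`, which returns new frames.

```python
import pandas as pd

# Invented stand-ins for Train.csv and Test.csv; the test set has no target column.
train = pd.DataFrame({'ID': ['a', 'b'], 'Age': [37, 30], 'Disbursed': [0, 1]})
test = pd.DataFrame({'ID': ['c'], 'Age': [34]})

train['source'] = 'train'
test['source'] = 'test'
data = pd.concat([train, test], ignore_index=True)

# ... shared feature engineering on `data` would go here ...

train_out = data.loc[data['source'] == 'train'].drop(columns='source')
test_out = data.loc[data['source'] == 'test'].drop(columns=['source', 'Disbursed'])

print(train_out.shape, test_out.shape)
```

Comparing the row counts after the split with the original shapes is a cheap way to confirm that no rows were lost or duplicated along the way.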