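The data is shown below. There are two tables in total.
The two DataFrames (both called df in these notes) are different.
df = pd.read_csv('./long-customer-train.csv')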
CustomerId CreditScore Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited
0 15553251 713 1 52 0 185891.54 1 1 1 46369.57 1
1 15553256 619 1 41 8 0.00 3 1 1 79866.73 1
2 15553283 603 1 42 8 91611.12 1 0 0 144675.30 1
3 15553308 589 1 61 1 0.00 1 1 0 61108.56 1
4 15553387 687 1 39 2 0.00 3 0 0 188150.60 1
... ... ... ... ... ... ... ... ... ... ... ...
9295 15815628 711 1 37 8 113899.92 1 0 0 80215.20 0
9296 15815645 481 0 37 8 152303.66 2 1 1 175082.20 0
9297 15815656 541 1 39 9 100116.67 1 1 1 199808.10 1
9298 15815660 758 1 34 1 154139.45 1 1 1 60728.89 0
9299 15815690 614 1 40 3 113348.50 1 1 1 77789.01 0
df.head()
user_id age job marital education default housing loan contact month day_of_week duration poutcome y
0 BA2200001 56 housemaid married postgraduate no no no telephone may mon 261 nonexistent no
1 BA2200077 37 services married high school no yes no telephone may mon 226 nonexistent no
2 BA2200004 40 admin. married postgraduate no no no telephone may mon 151 nonexistent no
3 BA2200005 56 services married high school no no yes telephone may mon 307 nonexistent no
4 BA2200007 59 admin. married junior college no no no telephone may mon 139 nonexistent no
drop function
Function overview:
df.drop(labels=[1,2,3],axis=0) # delete several rows; labels takes the row index labels
df.drop(columns=['A','B']) # columns takes the column names
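The two calls above can be tried on a small frame. This is a minimal sketch: the frame `toy` and its columns `A`, `B`, `C` are hypothetical and exist only for illustration. Note that `drop` returns a new DataFrame and leaves the original unchanged unless the result is assigned back.

```python
import pandas as pd

# Hypothetical toy frame; the column names A, B, C are illustrative only.
toy = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": list("vwxyz"),
    "C": [0.1, 0.2, 0.3, 0.4, 0.5],
})

# Remove the rows whose index labels are 1, 2 and 3.
rows_dropped = toy.drop(labels=[1, 2, 3], axis=0)

# Remove the columns named A and B.
cols_dropped = toy.drop(columns=["A", "B"])

print(rows_dropped.index.tolist())    # remaining row labels
print(cols_dropped.columns.tolist())  # remaining column names
```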
Practical code: delete the rows where Age is -1
for i in list(data_train.index):
    if data_train.loc[i, 'Age'] == '-1':  # Age is compared as a string here
        data_train = data_train.drop([i])
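The same filtering can probably be written without a loop by using a boolean mask, which is usually faster on large frames. The sketch below uses a hypothetical stand-in for `data_train` and assumes, as the string comparison above suggests, that `Age` is stored as text.

```python
import pandas as pd

# Hypothetical stand-in for data_train; Age is assumed to hold strings.
data_train = pd.DataFrame({
    "CustomerId": [1, 2, 3, 4],
    "Age": ["52", "-1", "41", "-1"],
})

# Keep only the rows whose Age is not the placeholder '-1'.
cleaned = data_train[data_train["Age"] != "-1"]

print(cleaned)
```

If `Age` is numeric in the real data, the comparison would need to be against the number -1 instead of the string.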
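drop_duplicates / duplicated functions
df.user_id.duplicated() # returns a boolean Series for the user_id column: True if the value repeats an earlier one, False otherwise
df.drop_duplicates(subset='user_id') # drop the rows with a duplicated user_id, keeping the first occurrence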
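A small sketch of both calls follows. The frame `users` is hypothetical; only the `user_id` column name comes from the data above. With the default `keep='first'`, the first occurrence of each value is treated as the original and later ones as duplicates.

```python
import pandas as pd

# Hypothetical frame with one repeated user_id.
users = pd.DataFrame({
    "user_id": ["BA2200001", "BA2200077", "BA2200001", "BA2200004"],
    "age": [56, 37, 56, 40],
})

# True marks a value that already appeared in an earlier row.
dup_mask = users["user_id"].duplicated()

# Drop the later rows that repeat a user_id.
deduped = users.drop_duplicates(subset="user_id")

print(dup_mask.tolist())
print(deduped)
```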
dropna function
Delete the rows that have a missing value in any of these columns
df = df.dropna(axis=0,subset = ["job", "marital","education","default","housing","loan"])
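To see how `subset` limits the check, here is a minimal sketch on a hypothetical frame `bank`. A missing value in a column outside `subset` (here `duration`) does not cause the row to be dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; only job and marital are checked for missing values.
bank = pd.DataFrame({
    "job": ["housemaid", None, "admin."],
    "marital": ["married", "married", np.nan],
    "duration": [np.nan, 226, 151],
})

# Rows 1 and 2 each miss a value in the subset, so only row 0 survives,
# even though its duration is missing.
kept = bank.dropna(axis=0, subset=["job", "marital"])

print(kept)
```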
value_counts function
df['default'].value_counts() # view the distribution of values, similar to a word-frequency count
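As an illustration, the sketch below counts values in a hypothetical Series named `default`. The `normalize=True` variant, which returns proportions instead of counts, is an addition not used in the notes above.

```python
import pandas as pd

# Hypothetical values for the default column.
default = pd.Series(["no", "no", "yes", "no", "unknown"], name="default")

# Absolute frequency of each distinct value, most frequent first.
counts = default.value_counts()

# Relative frequency (share of rows) of each distinct value.
shares = default.value_counts(normalize=True)

print(counts)
print(shares)
```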
cut function
bins: the interval edges, left-open and right-closed by default; labels: the label for each interval; x: the column to bin
df['age'] = pd.cut(x=df["age"],bins=[16,20,40,64,95],labels=["17-20","21-40","41-64","65-95"])
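The boundary behaviour is easiest to check on a few edge values. In this sketch the Series `ages` is hypothetical, and the labels are chosen so that they match the right-closed intervals (16, 20], (20, 40], (40, 64] and (64, 95].

```python
import pandas as pd

# Hypothetical ages sitting on the bin edges.
ages = pd.Series([17, 20, 21, 40, 41, 64, 65, 95])
edges = [16, 20, 40, 64, 95]
names = ["17-20", "21-40", "41-64", "65-95"]

# Each age falls into the interval whose right edge it does not exceed.
binned = pd.cut(x=ages, bins=edges, labels=names)

# A value equal to the lowest edge lies outside every interval and becomes NaN.
outside = pd.cut(x=pd.Series([16]), bins=edges, labels=names)

print(binned.tolist())
print(outside.isna().tolist())
```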
groupby function
groupby takes the column(s) to group by; agg takes the aggregation to apply, such as count, mean or sum
df.groupby(['age','y_no']).agg('count')
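A minimal sketch of grouping and aggregating, using a hypothetical frame `calls`. It groups by `age` and `y` because those columns appear in the sample data; the `y_no` column used above is not shown in these notes, so it is left out here.

```python
import pandas as pd

# Hypothetical frame with an already binned age column.
calls = pd.DataFrame({
    "age": ["21-40", "21-40", "41-64", "41-64", "41-64"],
    "y": ["no", "yes", "no", "no", "yes"],
    "duration": [226, 151, 261, 307, 139],
})

# Count the non-null values of every remaining column per (age, y) pair.
counts = calls.groupby(["age", "y"]).agg("count")

# Average call duration per age group.
means = calls.groupby("age")["duration"].agg("mean")

print(counts)
print(means)
```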
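get_dummies function
One-hot encoding
# Categorical features with no inherent order are one-hot encoded
object2=['job', 'marital', 'poutcome']
for j in object2:
    df=pd.get_dummies(df,columns=[j],prefix_sep='_') # new column names are built as <column>_<value>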
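The resulting column names can be inspected on a small example. This sketch uses a hypothetical frame `bank` and passes all columns in a single call instead of a loop, which should give an equivalent encoding.

```python
import pandas as pd

# Hypothetical frame with two categorical columns and one numeric column.
bank = pd.DataFrame({
    "job": ["housemaid", "services", "admin."],
    "marital": ["married", "single", "married"],
    "age": [56, 37, 40],
})

# Each listed column is replaced by one indicator column per distinct value.
encoded = pd.get_dummies(bank, columns=["job", "marital"], prefix_sep="_")

print(encoded.columns.tolist())
```

Columns not listed in `columns` (here `age`) are passed through unchanged.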
isnull() and notnull()
Return boolean values that indicate whether each entry is missing
df = df[df['Exited'].notnull()] # keep the rows where Exited is not null
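As a quick illustration on a hypothetical frame `customers`, the two methods return complementary masks, and the `notnull()` mask can be used to filter rows.

```python
import numpy as np
import pandas as pd

# Hypothetical frame where one Exited label is missing.
customers = pd.DataFrame({
    "CustomerId": [1, 2, 3],
    "Exited": [1.0, np.nan, 0.0],
})

# True where the value is missing.
missing = customers["Exited"].isnull()

# Keep only the rows that have an Exited label.
labelled = customers[customers["Exited"].notnull()]

print(missing.tolist())
print(labelled)
```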
pivot_table (pivot tables)
Aggregate the values of the Unnamed: 0 column with aggfunc, using the categories of Exited as rows and the categories of Tenure as columns
df3.pivot_table('Unnamed: 0',index='Exited',columns='Tenure',aggfunc='count')
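A small sketch of the same call on hypothetical data. The frame below is only a stand-in for `df3`, whose real contents are not shown in these notes. Combinations of Exited and Tenure that never occur appear as NaN in the result.

```python
import pandas as pd

# Hypothetical stand-in for df3.
df3 = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3, 4],
    "Exited": [1, 1, 0, 0, 0],
    "Tenure": [0, 8, 8, 8, 1],
})

# Count the rows for every (Exited, Tenure) combination.
table = df3.pivot_table("Unnamed: 0", index="Exited", columns="Tenure", aggfunc="count")

print(table)
```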
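apply and map functions
They apply a function to every element (a broadcast-like mechanism) and are usually paired with a lambda function
df['user_id'] = df['user_id'].apply(lambda x:int(x[2:])) # drop the first two characters and convert the rest to an integer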
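The sketch below shows `apply` with a lambda and, as an addition not in the notes above, `map` with a dictionary to recode a column. The frame `users` and the 0/1 encoding of `y` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical frame modelled on the second table.
users = pd.DataFrame({
    "user_id": ["BA2200001", "BA2200077"],
    "y": ["no", "yes"],
})

# apply: strip the two-letter prefix and convert the rest to an integer.
users["user_id"] = users["user_id"].apply(lambda x: int(x[2:]))

# map: replace each value by looking it up in a dictionary.
users["y"] = users["y"].map({"no": 0, "yes": 1})

print(users)
```

Values that are missing from the dictionary passed to `map` become NaN, so the mapping should cover every category.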