pandas in Practice

a尼

Last modified: 2022-11-17 23:30:44

Category: python. Tags: pandas, python, data analysis

First published: 2022-11-17 23:12:03

Article link: https://blog.csdn.net/m0_56094505/article/details/127912043

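The data is shown below. There are two tables in total.

Note that the two df objects shown here are different DataFrames.
import pandas as pd
df = pd.read_csv('./long-customer-train.csv')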

      CustomerId  CreditScore  Gender  Age  Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary  Exited
0       15553251          713       1   52       0  185891.54              1          1               1         46369.57       1
1       15553256          619       1   41       8       0.00              3          1               1         79866.73       1
2       15553283          603       1   42       8   91611.12              1          0               0        144675.30       1
3       15553308          589       1   61       1       0.00              1          1               0         61108.56       1
4       15553387          687       1   39       2       0.00              3          0               0        188150.60       1
...          ...          ...     ...  ...     ...        ...            ...        ...             ...              ...     ...
9295    15815628          711       1   37       8  113899.92              1          0               0         80215.20       0
9296    15815645          481       0   37       8  152303.66              2          1               1        175082.20       0
9297    15815656          541       1   39       9  100116.67              1          1               1        199808.10       1
9298    15815660          758       1   34       1  154139.45              1          1               1         60728.89       0
9299    15815690          614       1   40       3  113348.50              1          1               1         77789.01       0

df.head()

     user_id  age        job  marital       education  default  housing  loan    contact  month  day_of_week  duration     poutcome   y
0  BA2200001   56  housemaid  married    postgraduate       no       no    no  telephone    may          mon       261  nonexistent  no
1  BA2200077   37   services  married     high school       no      yes    no  telephone    may          mon       226  nonexistent  no
2  BA2200004   40     admin.  married    postgraduate       no       no    no  telephone    may          mon       151  nonexistent  no
3  BA2200005   56   services  married     high school       no       no   yes  telephone    may          mon       307  nonexistent  no
4  BA2200007   59     admin.  married  junior college       no       no    no  telephone    may          mon       139  nonexistent  no

The drop function

Function overview:

df.drop(labels=[1, 2, 3], axis=0)  # drop several rows; labels takes the row index labels
df.drop(columns=['A', 'B'])  # columns takes the column names
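
To make the two calling styles concrete, here is a minimal sketch on a made-up DataFrame (the name toy and its values are illustrative only, not the author's data). It also shows that drop returns a new DataFrame and leaves the original untouched unless you reassign the result.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]})

# Drop the rows whose index labels are 1 and 2.
rows_dropped = toy.drop(labels=[1, 2], axis=0)

# Drop columns A and B by name.
cols_dropped = toy.drop(columns=["A", "B"])

print(rows_dropped.index.tolist())    # [0, 3]
print(cols_dropped.columns.tolist())  # ['C']
print(toy.shape)                      # (4, 3): the original is unchanged
```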

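Practical code: delete the rows where Age is -1

# Boolean mask instead of dropping rows one at a time in a loop.
# The original compares against the string '-1', which only matches if Age holds strings;
# use the integer -1 instead if the column is numeric.
data_train = data_train[data_train['Age'] != '-1']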
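
The same row filter can be written without a loop by using a boolean mask. The sketch below assumes Age is stored as integers with -1 as a placeholder for "unknown"; toy_train and its values are hypothetical. If the column was read in as strings, compare against "-1" instead.

```python
import pandas as pd

# Hypothetical training data; -1 marks an unknown age.
toy_train = pd.DataFrame({"CustomerId": [1, 2, 3, 4], "Age": [52, -1, 41, -1]})

# Keep only the rows whose Age is not the placeholder, then renumber the index.
cleaned = toy_train[toy_train["Age"] != -1].reset_index(drop=True)

print(cleaned["CustomerId"].tolist())  # [1, 3]
```

A mask evaluates the whole column at once, which is usually much faster than calling drop once per row.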

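The drop_duplicates / duplicated functions

df.user_id.duplicated()  # boolean Series for user_id: True if the value already appeared in an earlier row, False otherwise
df.drop_duplicates(subset='user_id')  # drop rows with a repeated user_id, keeping the first occurrence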
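
A small sketch of how the two functions relate, using invented user IDs. By default the first occurrence is not flagged as a duplicate; passing keep=False flags every member of a repeated group.

```python
import pandas as pd

# Invented IDs for illustration; "BA1" appears twice.
toy = pd.DataFrame({"user_id": ["BA1", "BA2", "BA1", "BA3"], "age": [30, 41, 30, 25]})

dup_mask = toy["user_id"].duplicated()            # first occurrence counts as not duplicated
all_dups = toy["user_id"].duplicated(keep=False)  # flag every row in a repeated group
deduped = toy.drop_duplicates(subset="user_id")   # keeps the first row per user_id

print(dup_mask.tolist())       # [False, False, True, False]
print(all_dups.tolist())       # [True, False, True, False]
print(deduped.index.tolist())  # [0, 1, 3]
```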

The dropna function

Drop the rows that have a missing value in any of the listed columns:

df = df.dropna(axis=0, subset=["job", "marital", "education", "default", "housing", "loan"])
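
The effect of subset is easiest to see next to a plain dropna() call. In this sketch (toy data, made up for illustration) a row that is missing only duration survives the subset version but not the unrestricted one.

```python
import numpy as np
import pandas as pd

# Toy data: row 1 lacks job, row 2 lacks loan, row 3 lacks only duration.
toy = pd.DataFrame({
    "job": ["admin.", None, "services", "housemaid"],
    "loan": ["no", "yes", None, "no"],
    "duration": [261, 226, 151, np.nan],
})

# Only missing values in job or loan cause a row to be dropped.
subset_dropped = toy.dropna(axis=0, subset=["job", "loan"])

# Without subset, a missing value in any column drops the row.
any_dropped = toy.dropna(axis=0)

print(subset_dropped.index.tolist())  # [0, 3]
print(any_dropped.index.tolist())     # [0]
```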

The value_counts function

df['default'].value_counts()  # show how often each value occurs, similar to a word-frequency count
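
As a quick illustration on an invented Series, value_counts returns the counts sorted from most to least frequent, and normalize=True turns them into proportions.

```python
import pandas as pd

# Invented values for illustration.
s = pd.Series(["no", "no", "no", "yes", "yes", "unknown"], name="default")

counts = s.value_counts()                # absolute frequencies, most common first
shares = s.value_counts(normalize=True)  # relative frequencies that sum to 1

print(counts.to_dict())  # {'no': 3, 'yes': 2, 'unknown': 1}
print(shares["no"])      # 0.5
```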

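The cut function

bins: the interval edges, left-open and right-closed by default; labels: the label for each interval; x: the column to bin

df['age'] = pd.cut(x=df["age"], bins=[16, 20, 40, 64, 95], labels=["17-20", "21-40", "41-64", "65-95"])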
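
To check the left-open, right-closed behaviour, the sketch below bins a few boundary ages with the same edges. The sample ages are made up, and the label strings are chosen here so that each one names the integer ages its interval actually contains.

```python
import pandas as pd

# Boundary values chosen to show which side of each edge is included.
ages = pd.Series([17, 20, 21, 40, 64, 65, 95], name="age")
edges = [16, 20, 40, 64, 95]

# Intervals are (16, 20], (20, 40], (40, 64], (64, 95].
age_group = pd.cut(x=ages, bins=edges, labels=["17-20", "21-40", "41-64", "65-95"])
print(age_group.tolist())
# ['17-20', '17-20', '21-40', '21-40', '41-64', '65-95', '65-95']

# Values outside every interval become NaN: 16 sits on the open left edge, 96 is past the last edge.
outside = pd.cut(pd.Series([16, 96]), bins=edges)
print(outside.isna().tolist())  # [True, True]
```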

The groupby function

groupby takes the column or columns to group by; agg takes the aggregation to apply, such as count, mean, or sum

df.groupby(['age', 'y_no']).agg('count')
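
A minimal sketch of grouping by two keys and then aggregating, on made-up data. The column names age, y, and duration mirror the second table, but the values are invented. Note that count reports the number of non-missing values in each remaining column.

```python
import pandas as pd

# Invented rows; age is already binned into labels.
toy = pd.DataFrame({
    "age": ["21-40", "21-40", "41-64", "41-64", "41-64"],
    "y": ["no", "yes", "no", "no", "yes"],
    "duration": [100, 200, 300, 500, 400],
})

# One row per (age, y) combination, counting non-missing values per column.
counts = toy.groupby(["age", "y"]).agg("count")
print(counts.loc[("41-64", "no"), "duration"])  # 2

# Several aggregations at once on a single column.
stats = toy.groupby("age")["duration"].agg(["mean", "sum"])
print(stats.loc["41-64", "mean"])  # 400.0
```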

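The get_dummies function

One-hot encoding

# Features with no ordinal relationship between their categories are one-hot encoded
object2 = ['job', 'marital', 'poutcome']
# columns accepts the whole list, so no loop is needed; new names are built as <column>_<category>
df = pd.get_dummies(df, columns=object2, prefix_sep='_')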
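
The following sketch shows what the encoded frame looks like on three invented rows. The listed columns are replaced by one indicator column per category, while the other columns pass through unchanged. Depending on the pandas version, the indicator columns hold booleans or 0/1 integers.

```python
import pandas as pd

# Three invented rows with two categorical columns.
toy = pd.DataFrame({
    "age": [56, 37, 40],
    "job": ["housemaid", "services", "admin."],
    "marital": ["married", "single", "married"],
})

encoded = pd.get_dummies(toy, columns=["job", "marital"], prefix_sep="_")

print(encoded.columns.tolist())
# ['age', 'job_admin.', 'job_housemaid', 'job_services', 'marital_married', 'marital_single']

# Exactly one job indicator is set in every row.
job_cols = [c for c in encoded.columns if c.startswith("job_")]
print(encoded[job_cols].sum(axis=1).tolist())  # [1, 1, 1]
```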

isnull() and notnull()

Both return boolean values that indicate whether each value is missing.

df = df[df['Exited'].notnull()]  # keep the rows where Exited is not missing
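
A short sketch of both methods on a toy frame (values invented): isnull marks the missing entries, and the notnull mask selects the rows that have a label.

```python
import numpy as np
import pandas as pd

# Toy data: the second customer has no Exited label.
toy = pd.DataFrame({"CustomerId": [1, 2, 3], "Exited": [1.0, np.nan, 0.0]})

print(toy["Exited"].isnull().tolist())  # [False, True, False]

# Use the boolean mask to keep only the labelled rows.
labelled = toy[toy["Exited"].notnull()]
print(labelled["CustomerId"].tolist())  # [1, 3]
```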

pivot_table

Aggregate the values of the Unnamed: 0 column with aggfunc, using the categories of Exited as the rows and the categories of Tenure as the columns.

df3.pivot_table(values='Unnamed: 0', index='Exited', columns='Tenure', aggfunc='count')
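
Here is a small sketch of the same pattern on invented data. The row_id column is a hypothetical stand-in for Unnamed: 0, since any column without missing values works when the goal is simply to count rows. For plain counts, pd.crosstab gives the same numbers without needing a values column.

```python
import pandas as pd

# Invented rows; row_id stands in for the 'Unnamed: 0' column.
toy = pd.DataFrame({
    "row_id": [0, 1, 2, 3, 4],
    "Exited": [0, 0, 1, 1, 1],
    "Tenure": [1, 2, 1, 1, 2],
})

# Rows are Exited values, columns are Tenure values, cells count the matching rows.
table = toy.pivot_table(values="row_id", index="Exited", columns="Tenure", aggfunc="count")
print(table)

# Two rows have Exited == 1 and Tenure == 1.
assert table.loc[1, 1] == 2

# The same counts via crosstab.
cross = pd.crosstab(toy["Exited"], toy["Tenure"])
```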

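The apply and map functions

Both apply a function to every element of a Series and are often paired with a lambda function.

df['user_id'] = df['user_id'].apply(lambda x: int(x[2:]))  # strip the first two characters (the "BA" prefix) and convert the rest to an integer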
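
The sketch below puts apply and map side by side on two IDs taken from the sample table. The y_flag mapping and the column names user_num and y_flag are illustrative additions, not part of the original post. The last line shows a vectorized alternative using string methods, which avoids the lambda.

```python
import pandas as pd

toy = pd.DataFrame({"user_id": ["BA2200001", "BA2200077"], "y": ["no", "yes"]})

# apply with a lambda: strip the two-letter prefix and convert to int.
toy["user_num"] = toy["user_id"].apply(lambda x: int(x[2:]))

# map with a dict: translate each value through a lookup table.
toy["y_flag"] = toy["y"].map({"no": 0, "yes": 1})

# Vectorized alternative to the lambda above.
alt = toy["user_id"].str[2:].astype(int)

print(toy["user_num"].tolist())  # [2200001, 2200077]
print(toy["y_flag"].tolist())    # [0, 1]
```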
