Machine Learning Notes

费克纽斯-耐克斯特

Last modified: 2022-09-23 11:04:05


First published: 2022-09-16 11:46:39

Article link: https://blog.csdn.net/fakenews_next/article/details/126699324


Reading files

CSV files

# Read train.csv
import pandas as pd
import numpy as np
data = pd.read_csv("C:\\Users\\Administrator\\Desktop\\train.csv", index_col='Loan_ID')
# index_col sets which column is used as the row index

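Excel files

You can choose which sheet to read with sheet_name.

data.to_excel('train.xlsx')  # write the data out in Excel format
data2 = pd.read_excel('train.xlsx', sheet_name='Sheet1')

Encoding errors

If pd.read_csv fails with a decoding error, pass encoding='gbk' or encoding='utf-8'.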
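One way to apply the encoding tip is to try several encodings in turn. The sketch below is only an illustration: the helper name read_csv_with_fallback and the sample file are hypothetical, and the file is written first so the example runs on its own.

```python
import os
import tempfile

import pandas as pd

# Write a small GBK-encoded CSV so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "gbk_sample.csv")
with open(csv_path, "w", encoding="gbk") as f:
    f.write("Loan_ID,City\nLP001,北京\nLP002,上海\n")


def read_csv_with_fallback(path, encodings=("utf-8", "gbk"), **kwargs):
    """Hypothetical helper: try each encoding and return the first successful read."""
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc, **kwargs)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode {path}")


sample = read_csv_with_fallback(csv_path, index_col="Loan_ID")
print(sample)
```

Trying utf-8 first is a reasonable default because GBK bytes usually fail to decode as UTF-8, so the wrong guess is detected rather than silently producing garbled text.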

pandas data types

DataFrame

Rows: index
Columns: columns
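A minimal sketch of how index and columns label a DataFrame. The column names mirror the loan dataset, but the values are made up for illustration.

```python
import pandas as pd

# Toy data: two rows labelled by Loan_ID, two columns.
df = pd.DataFrame(
    {"Gender": ["Male", "Female"], "LoanAmount": [128.0, 66.0]},
    index=pd.Index(["LP001", "LP002"], name="Loan_ID"),
)

print(df.index)    # row labels
print(df.columns)  # column labels
print(df.shape)    # (number of rows, number of columns)
```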

Selecting data

loc

data.loc[(data['Education']=='Not Graduate') & (data['Loan_Status']=='Y') & (data['Gender']=='Female'),['Gender','Education','Loan_Status'] ]

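This keeps only the rows where Education is 'Not Graduate', Loan_Status is 'Y' and Gender is 'Female', and returns just the Gender, Education and Loan_Status columns.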
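The same kind of selection on a small made-up frame, with the boolean mask built as a separate step so it is easier to read. The data values are assumptions for illustration only.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Gender": ["Female", "Male", "Female", "Female"],
        "Education": ["Not Graduate", "Not Graduate", "Graduate", "Not Graduate"],
        "Loan_Status": ["Y", "Y", "Y", "N"],
        "LoanAmount": [100.0, 120.0, 90.0, 150.0],
    },
    index=["LP001", "LP002", "LP003", "LP004"],
)

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds more tightly than ==.
mask = (
    (toy["Education"] == "Not Graduate")
    & (toy["Loan_Status"] == "Y")
    & (toy["Gender"] == "Female")
)

# First argument of loc selects rows, second selects columns.
selected = toy.loc[mask, ["Gender", "Education", "Loan_Status"]]
print(selected)
```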

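Iterating over a DataFrame

apply + a custom function

Checking for missing values

# Apply a custom function to the dataset with apply
def num_missing(x):
    return x.isnull().sum()
# Use apply to run num_missing on every column and count its missing values
# axis=0 passes each column to the function; axis=1 passes each row
print(data.apply(num_missing, axis=0))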
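To make the effect of axis concrete, here is a sketch on a small made-up frame. It also shows isnull().sum(), which gives the same per-column counts without a custom function.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "Gender": ["Male", None, "Female"],
        "LoanAmount": [128.0, np.nan, np.nan],
    }
)


def num_missing(x):
    # x is one column (axis=0) or one row (axis=1), passed in as a Series.
    return x.isnull().sum()


per_column = toy.apply(num_missing, axis=0)  # one count per column
per_row = toy.apply(num_missing, axis=1)     # one count per row
shortcut = toy.isnull().sum()                # same result as per_column

print(per_column)
print(per_row)
```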


iterrows

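iterrows() is a generator that iterates over the rows of a DataFrame. For each row it yields the row's index label and a Series holding the row's values.

for index, row in data.iterrows():
    print(index)
    print(row['Gender'])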
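A small runnable sketch of what iterrows() yields, using made-up data. It records the type of each row to show that the second element of each pair is a Series.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Gender": ["Male", "Female"], "LoanAmount": [128.0, 66.0]},
    index=["LP001", "LP002"],
)

seen = []
for index, row in toy.iterrows():
    # index is the row label; row is a Series keyed by column name.
    seen.append((index, row["Gender"], type(row).__name__))

print(seen)
```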

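Filling missing values

Gender, Married and Self_Employed are categorical variables, so fill the missing values in each with that column's most frequent category (its mode).

# Assign the result back: fillna(..., inplace=True) on a selected column may not update data in recent pandas versions
data['Gender'] = data['Gender'].fillna(data['Gender'].mode().iloc[0])
data['Married'] = data['Married'].fillna(data['Married'].mode().iloc[0])
data['Self_Employed'] = data['Self_Employed'].fillna(data['Self_Employed'].mode().iloc[0])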

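Filling missing values in LoanAmount:
First compute the mean LoanAmount for every combination of Gender, Married and Self_Employed.
Then fill each missing LoanAmount with the mean of the group that row belongs to.

# Mean LoanAmount for each (Gender, Married, Self_Employed) group
impute_grps = data.pivot_table(values=["LoanAmount"], index=["Gender","Married","Self_Employed"], aggfunc="mean")
# Visit only the rows where LoanAmount is missing
for i, row in data.loc[data['LoanAmount'].isnull(), :].iterrows():
    ind = tuple([row['Gender'], row['Married'], row['Self_Employed']])
    data.loc[i, 'LoanAmount'] = impute_grps.loc[ind].values[0]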
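The group-mean fill can also be written without an explicit loop. The sketch below swaps in groupby().transform("mean") for the pivot_table plus iterrows approach; it is an alternative, not the method used above, and the data is made up. It assumes the grouping columns have no missing values, as is the case once the categorical fill has been done.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "Gender": ["Male", "Male", "Male", "Female", "Female"],
        "Married": ["Yes", "Yes", "Yes", "No", "No"],
        "Self_Employed": ["No", "No", "No", "No", "No"],
        "LoanAmount": [100.0, 200.0, np.nan, 80.0, np.nan],
    }
)

keys = ["Gender", "Married", "Self_Employed"]

# transform("mean") returns a Series aligned with toy that holds,
# for every row, the mean LoanAmount of that row's group.
group_mean = toy.groupby(keys)["LoanAmount"].transform("mean")

# Only the missing entries are replaced; existing values are kept.
toy["LoanAmount"] = toy["LoanAmount"].fillna(group_mean)
print(toy)
```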

Statistical analysis

min max mean

Available on both Series and DataFrame

mode

Returns a Series
Use iloc to get the value

describe

corr (correlation matrix)
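A short sketch that exercises the methods listed above on made-up data. The column names follow the loan dataset; the values are assumptions chosen so the results are easy to check by hand.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "ApplicantIncome": [1000, 2000, 3000, 4000],
        "LoanAmount": [10.0, 20.0, 30.0, 40.0],
        "Gender": ["Male", "Male", "Female", "Male"],
    }
)

print(toy["LoanAmount"].min(), toy["LoanAmount"].max(), toy["LoanAmount"].mean())

# mode() returns a Series because ties are possible; iloc[0] takes the first value.
most_common_gender = toy["Gender"].mode().iloc[0]

# describe() summarises numeric columns: count, mean, std, min, quartiles, max.
summary = toy.describe()

# corr() computes pairwise correlation (Pearson by default) between numeric columns.
corr_matrix = toy[["ApplicantIncome", "LoanAmount"]].corr()

print(most_common_gender)
print(summary)
print(corr_matrix)
```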

Grouped statistics / pivot tables

pivot_table

values

A list of column names

index

A list of column names

aggfunc

A dict of the form {"column name": function} or {"column name": [function1, function2]}

Sample code

impute_grps = data.pivot_table(values=["LoanAmount","ApplicantIncome"],
                               index=["Gender","Married","Self_Employed"],
                               aggfunc={"LoanAmount":"mean","ApplicantIncome":["sum","mean"]})
impute_grps
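The following sketch runs the same kind of pivot_table call on a small made-up frame so the output can be checked by hand. Because one column gets a list of functions, the resulting columns form a two-level index of (column, function).

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Gender": ["Male", "Male", "Female", "Female"],
        "Married": ["Yes", "Yes", "No", "No"],
        "LoanAmount": [100.0, 200.0, 80.0, 120.0],
        "ApplicantIncome": [1000, 3000, 2000, 4000],
    }
)

table = toy.pivot_table(
    values=["LoanAmount", "ApplicantIncome"],
    index=["Gender", "Married"],
    aggfunc={"LoanAmount": "mean", "ApplicantIncome": ["sum", "mean"]},
)
print(table)

# Rows are addressed by the (Gender, Married) pair,
# columns by the (column name, function name) pair.
male_mean_loan = table.loc[("Male", "Yes"), ("LoanAmount", "mean")]
print(male_mean_loan)
```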

groupby

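groupby(by='column name')
Returns a GroupBy object (DataFrameGroupBy)

groupby().agg({'col1': 'mean', 'col2': 'max'})
Returns a DataFrame

# df stands for any DataFrame with company, salary and age columns
df.groupby('company').agg({'salary': 'median', 'age': 'mean'})

groupby().apply()
The custom function is passed each group as a DataFrame

def func1(x):
    print(type(x))
    print(x)
df.groupby('company').apply(func1)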
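A runnable sketch of groupby followed by agg with a dict. The staff frame and its values are made up for illustration.

```python
import pandas as pd

staff = pd.DataFrame(
    {
        "company": ["A", "A", "B", "B", "B"],
        "salary": [10, 30, 20, 40, 60],
        "age": [30, 40, 20, 30, 40],
    }
)

grouped = staff.groupby(by="company")
print(type(grouped).__name__)  # DataFrameGroupBy

# The dict maps each column to the aggregation applied within every group.
result = grouped.agg({"salary": "median", "age": "mean"})
print(result)
```

The result is a DataFrame indexed by company, with one column per key in the dict.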
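A sketch of groupby().apply() with a function that returns one value per group. The function name salary_range and the data are hypothetical. Selecting the columns with [["salary"]] before apply is a choice made here so that each group passed to the function contains only the data columns.

```python
import pandas as pd

staff = pd.DataFrame(
    {
        "company": ["A", "A", "B", "B", "B"],
        "salary": [10, 30, 20, 40, 60],
    }
)

received_types = []


def salary_range(group):
    # group is a DataFrame holding the rows of one company.
    received_types.append(type(group).__name__)
    return group["salary"].max() - group["salary"].min()


ranges = staff.groupby("company")[["salary"]].apply(salary_range)
print(ranges)
print(set(received_types))
```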

crosstab

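The row and column arguments are Series; pass a list of Series to use several columns
margins: whether to add totals
normalize: whether to return proportions instead of counts

pd.crosstab(data["Credit_History"], [data['Gender'], data['Loan_Status']], margins=True, normalize=True)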
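The three options can be seen separately on a small made-up frame: margins adds an "All" row and column, normalize=True divides every cell by the grand total, and a list of Series produces a two-level column index.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Credit_History": [1.0, 1.0, 0.0, 1.0],
        "Gender": ["Male", "Female", "Male", "Male"],
        "Loan_Status": ["Y", "Y", "N", "N"],
    }
)

# Counts with row and column totals labelled "All".
counts = pd.crosstab(toy["Credit_History"], toy["Loan_Status"], margins=True)

# Proportions of the grand total instead of counts.
shares = pd.crosstab(toy["Credit_History"], toy["Loan_Status"], normalize=True)

# A list of Series gives hierarchical columns: (Gender, Loan_Status).
nested = pd.crosstab(toy["Credit_History"], [toy["Gender"], toy["Loan_Status"]])

print(counts)
print(shares)
print(nested)
```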

Merging datasets

merge

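1. right: the other DataFrame to merge with
2. how: the type of join
3. left_on / right_on: the key column on each side
   Do the column names have to match? No.
   Rows are matched on the values in those columns.
4. left_index / right_index: True / False, whether to use that side's index as the key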

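# prop_rates is assumed to be a second DataFrame indexed by property area, with a rates column
data_merged = data.merge(right=prop_rates, how='inner', left_on='Property_Area', right_index=True, sort=False)
# Count the rows in each (Property_Area, rates) group to check the merge
data_merged.pivot_table(values='Credit_History', index=['Property_Area','rates'], aggfunc=len)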
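Since prop_rates is not defined in these notes, the sketch below builds a hypothetical one to show the merge end to end. The rate values and the loans frame are made up; only the shape (area names as the index, a rates column) follows from the merge call above.

```python
import pandas as pd

loans = pd.DataFrame(
    {
        "Property_Area": ["Urban", "Rural", "Urban", "Semiurban"],
        "Credit_History": [1.0, 0.0, 1.0, 1.0],
    },
    index=["LP001", "LP002", "LP003", "LP004"],
)

# Hypothetical lookup table: one rate per property area, keyed by the index.
prop_rates = pd.DataFrame(
    {"rates": [1000, 5000, 12000]},
    index=["Rural", "Semiurban", "Urban"],
)

# Match the Property_Area column on the left against the index on the right.
merged = loans.merge(
    right=prop_rates,
    how="inner",
    left_on="Property_Area",
    right_index=True,
    sort=False,
)
print(merged)

# Counting rows per (Property_Area, rates) pair confirms each area got one rate.
counts = merged.pivot_table(
    values="Credit_History", index=["Property_Area", "rates"], aggfunc=len
)
print(counts)
```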