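Handling Discrete Features
Today's task breaks down into the following steps:
1. Read the data
2. Find all discrete features
3. Pick one discrete feature and one-hot encode it
4. Use a loop to one-hot encode all discrete features
5. Add yesterday's material and fill in all missing values
Today I worked in a .py file; I am not yet comfortable with the debugger.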
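Step 3 (one-hot encoding a single discrete feature) can be sketched on its own before moving to the full dataset. The small DataFrame below is invented for illustration and is not taken from data.csv; only the column name 'Term' mirrors the real data. Passing dtype=int is an optional shortcut that makes the dummy columns integers directly.

```python
import pandas as pd

# Toy data, invented for illustration (not rows from data.csv)
toy = pd.DataFrame({
    "Term": ["Short Term", "Long Term", "Short Term"],
    "Annual Income": [50000.0, 62000.0, 48000.0],
})

# One-hot encode only the 'Term' column; other columns are left untouched.
# drop_first=True drops the first category level ('Long Term' here).
encoded = pd.get_dummies(toy, columns=["Term"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded["Term_Short Term"].tolist())
```

With two levels and drop_first=True, a single column 'Term_Short Term' remains, which matches the one Term column in the real output.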
Input:
import pandas as pd
# Read the data; data is now a DataFrame
data = pd.read_csv('data.csv')
# Collect the names of the discrete (object-dtype) columns
discrete_lists = []
for discrete_features in data.columns:
    if data[discrete_features].dtype == 'object':
        discrete_lists.append(discrete_features)
print(discrete_lists)  # print the discrete column names
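# One-hot encode the discrete columns
# drop_first=True drops the first category level of each encoded column to avoid multicollinearity
# data is a DataFrame; columns takes the list of discrete column names
data = pd.get_dummies(data, columns=discrete_lists, drop_first=True)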
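data2 = pd.read_csv("data.csv")  # re-read the original data to compare column names
list_final = []  # names of the columns added by one-hot encoding
for i in data.columns:
    if i not in data2.columns:
        list_final.append(i)
print(list_final)  # print the newly added column names
# Convert the newly added columns to int
for i in list_final:
    data[i] = data[i].astype(int)  # i is the name of a one-hot encoded column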
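print(data.isnull().sum())  # count the missing values in each column
# Fill missing values with the column mean
for i in data.columns:
    if data[i].isnull().sum() > 0:  # this column has missing values
        mean_value = data[i].mean()  # mean of this column
        # Chained inplace fill; this is the line that raises the FutureWarning in the output
        data[i].fillna(mean_value, inplace=True)
print(data.isnull().sum())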
Output:
['Home Ownership', 'Years in current job', 'Purpose', 'Term']
['Home Ownership_Home Mortgage', 'Home Ownership_Own Home', 'Home Ownership_Rent', 'Years in current job_10+ years', 'Years in current job_2 years', 'Years in current job_3 years', 'Years in current job_4 years', 'Years in current job_5 years', 'Years in current job_6 years', 'Years in current job_7 years', 'Years in current job_8 years', 'Years in current job_9 years', 'Years in current job_< 1 year', 'Purpose_buy a car', 'Purpose_buy house', 'Purpose_debt consolidation', 'Purpose_educational expenses', 'Purpose_home improvements', 'Purpose_major purchase', 'Purpose_medical bills', 'Purpose_moving', 'Purpose_other', 'Purpose_renewable energy', 'Purpose_small business', 'Purpose_take a trip', 'Purpose_vacation', 'Purpose_wedding', 'Term_Short Term']
Id 0
Annual Income 1557
Tax Liens 0
Number of Open Accounts 0
Years of Credit History 0
Maximum Open Credit 0
Number of Credit Problems 0
Months since last delinquent 4081
Bankruptcies 14
Current Loan Amount 0
Current Credit Balance 0
Monthly Debt 0
Credit Score 1557
Credit Default 0
Home Ownership_Home Mortgage 0
Home Ownership_Own Home 0
Home Ownership_Rent 0
Years in current job_10+ years 0
Years in current job_2 years 0
Years in current job_3 years 0
Years in current job_4 years 0
Years in current job_5 years 0
Years in current job_6 years 0
Years in current job_7 years 0
Years in current job_8 years 0
Years in current job_9 years 0
Years in current job_< 1 year 0
Purpose_buy a car 0
Purpose_buy house 0
Purpose_debt consolidation 0
Purpose_educational expenses 0
Purpose_home improvements 0
Purpose_major purchase 0
Purpose_medical bills 0
Purpose_moving 0
Purpose_other 0
Purpose_renewable energy 0
Purpose_small business 0
Purpose_take a trip 0
Purpose_vacation 0
Purpose_wedding 0
Term_Short Term 0
dtype: int64
d:\Python\代码\python60-days-challenge-4c74c320ea49739f5c13fd7caa1aed012d629c6c\python60-days-challenge-4c74c320ea49739f5c13fd7caa1aed012d629c6c\FiveDay.py:30: FutureWarning: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.
For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.
data[i].fillna(mean_value, inplace=True)
Id 0
Annual Income 0
Tax Liens 0
Number of Open Accounts 0
Years of Credit History 0
Maximum Open Credit 0
Number of Credit Problems 0
Months since last delinquent 0
Bankruptcies 0
Current Loan Amount 0
Current Credit Balance 0
Monthly Debt 0
Credit Score 0
Credit Default 0
Home Ownership_Home Mortgage 0
Home Ownership_Own Home 0
Home Ownership_Rent 0
Years in current job_10+ years 0
Years in current job_2 years 0
Years in current job_3 years 0
Years in current job_4 years 0
Years in current job_5 years 0
Years in current job_6 years 0
Years in current job_7 years 0
Years in current job_8 years 0
Years in current job_9 years 0
Years in current job_< 1 year 0
Purpose_buy a car 0
Purpose_buy house 0
Purpose_debt consolidation 0
Purpose_educational expenses 0
Purpose_home improvements 0
Purpose_major purchase 0
Purpose_medical bills 0
Purpose_moving 0
Purpose_other 0
Purpose_renewable energy 0
Purpose_small business 0
Purpose_take a trip 0
Purpose_vacation 0
Purpose_wedding 0
Term_Short Term 0
dtype: int64
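Note: yesterday's FutureWarning is still present. It is raised by the chained call data[i].fillna(mean_value, inplace=True).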
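One way to silence this warning, following the suggestion in the warning text itself, is to assign the filled column back rather than calling fillna with inplace=True on data[i]. This is a minimal sketch on made-up data (the values below are not from data.csv); the column names are borrowed from the real dataset only for readability.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration (not rows from data.csv)
data = pd.DataFrame({
    "Annual Income": [50000.0, np.nan, 70000.0],
    "Credit Score": [700.0, 720.0, np.nan],
    "Tax Liens": [0, 0, 1],
})

for col in data.columns:
    if data[col].isnull().sum() > 0:  # column has missing values
        mean_value = data[col].mean()  # NaN is skipped when computing the mean
        # Assign the result back instead of using inplace=True on data[col]
        data[col] = data[col].fillna(mean_value)

print(data.isnull().sum())
```

Assigning back writes to the original DataFrame explicitly, so it does not depend on whether data[col] behaves as a view or a copy.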