Predicting who survived the Titanic is the simplest competition on Kaggle. Below is the code I wrote for it; the workflow is simple enough that no flowchart is needed to explain it.
The code is as follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from MachineLearning import Flu  # a module of mine from the DrivenData vaccination-prediction task
# Load the data
data = pd.read_csv(r"C:\Users\20349\Desktop\ArtificialIntelligence\kaggle\titanic\train.csv")  # training set
pre = pd.read_csv(r"C:\Users\20349\Desktop\ArtificialIntelligence\kaggle\titanic\test.csv")  # set to predict
# Separate features and target variable
df = pd.DataFrame(data)
X = df.drop(columns=['Survived'])
y = df['Survived']
#print(X)
#print(y)
# Fill missing values with each column's mode
X.fillna(X.mode().iloc[0], inplace=True)
pre.fillna(pre.mode().iloc[0], inplace=True)
#print(pre.iloc[0,:])
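The mode-based fill works column by column: `df.mode().iloc[0]` is the first row of per-column modes, and `fillna` substitutes it into each column's missing cells. A pure-Python sketch of the same idea for a single column (the values here are made up for illustration):

```python
from collections import Counter

def fill_with_mode(column):
    """Replace None entries with the column's most frequent value."""
    mode = Counter(v for v in column if v is not None).most_common(1)[0][0]
    return [mode if v is None else v for v in column]

embarked = ['S', 'C', None, 'S', 'Q', None]
print(fill_with_mode(embarked))  # missing ports filled with 'S'
```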
# One-hot encode the categorical features
categorical_features = [
'Sex', 'Embarked', 'Pclass', 'Cabin'
]
numeric_features = [
'Age', 'SibSp', 'Parch', 'Fare'
]
X_cate = X[categorical_features]
pre_cate = pre[categorical_features]
X_num = X[numeric_features]
pre_num = pre[numeric_features]
# Initialize the encoder
encoder = OneHotEncoder(sparse_output=False, drop='if_binary', handle_unknown='ignore')
# Fit and transform
X_cate_encoder = encoder.fit_transform(X_cate)# ①
pre_cate_encoder = encoder.transform(pre_cate)# ②
# Get the encoded feature names
feature_name = encoder.get_feature_names_out(categorical_features)
# Convert to DataFrames
X_cate_df = pd.DataFrame(X_cate_encoder, columns=feature_name)
pre_cate_df = pd.DataFrame(pre_cate_encoder, columns=feature_name)
# Standardize numeric features (fit on the training set, reuse the same scaler for test)
scaler = StandardScaler()
X_num_scaled = scaler.fit_transform(X_num)
pre_num_scaled = scaler.transform(pre_num)
# Combine numeric and one-hot encoded categorical features
X_encoder = pd.concat([pd.DataFrame(X_num_scaled, columns=numeric_features), X_cate_df], axis=1)
pre_encoder = pd.concat([pd.DataFrame(pre_num_scaled, columns=numeric_features), pre_cate_df], axis=1)
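StandardScaler standardizes each numeric column with the training set's mean and population standard deviation; the same training statistics must be reused on the test set, otherwise the two sets live on different scales. A tiny sketch with made-up numbers:

```python
import statistics

train_fare = [10.0, 20.0, 30.0]
mu = statistics.mean(train_fare)       # 20.0
sigma = statistics.pstdev(train_fare)  # population std, as StandardScaler uses

def standardize(values):
    """Scale values with the *training* mean and std."""
    return [(v - mu) / sigma for v in values]

print(standardize(train_fare))  # roughly [-1.22, 0.0, 1.22]
print(standardize([40.0]))      # a test value scaled with the training statistics
```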
# Predict
result = Flu.pre_logistic(X_encoder, y, pre_encoder)
#coefficients = model.coef_[0]  # logistic regression has a single coefficient vector
#intercept = model.intercept_[0]
#print("Model function: logit(p) =", ' + '.join([f'{c:.2f} * x{i}' for i, c in enumerate(coefficients)]), f'+ {intercept:.2f}')
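The commented-out lines would print the fitted model as a linear function of the features; logistic regression turns that linear score into a probability with the sigmoid. A minimal sketch (the coefficients here are invented, not the fitted ones):

```python
import math

def predict_proba(x, coef, intercept):
    """Sigmoid of the linear score w.x + b, as in logistic regression."""
    z = sum(w * xi for w, xi in zip(coef, x)) + intercept
    return 1.0 / (1.0 + math.exp(-z))

print(predict_proba([1.0, 0.5], [0.0, 0.0], 0.0))  # 0.5 when the score is zero
print(predict_proba([1.0, 0.5], [2.0, -1.0], 0.3))  # probability of class 1
```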
# Write the submission file
with open(r"C:\Users\20349\Desktop\ArtificialIntelligence\kaggle\titanic\submission.csv", 'w',
          encoding='utf-8') as file:
    file.write("PassengerId,Survived\n")
    for i, prediction in enumerate(result, start=892):  # PassengerId in test.csv starts at 892
        file.write(f"{i},{int(prediction)}\n")
print("Finished!!!")
# Column reference
# train.csv: PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
# test.csv:  PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
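The hand-rolled file writing above can equally be done with the stdlib csv module; a sketch with dummy predictions (a StringIO stands in for the submission file):

```python
import csv
import io

predictions = [0, 1, 1]  # dummy values, not the real model output
buf = io.StringIO()      # stands in for the submission file
writer = csv.writer(buf)
writer.writerow(["PassengerId", "Survived"])
for pid, pred in enumerate(predictions, start=892):  # test.csv ids start at 892
    writer.writerow([pid, int(pred)])
print(buf.getvalue())
```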
In the code, I use a heatmap-plotting helper from a book I studied on my own, which helps me observe how the predictions behave under different parameter settings.
You will also notice a custom function of mine, pre_logistic: the logistic-regression routine I wrote for the vaccination-prediction task on DrivenData. Its code is as follows:
import mglearn  # plotting helpers, the companion package of the book mentioned above
from IPython.display import display

def pre_logistic(X_encoded, y, submission_data):
    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X_encoded, y,
                                                        test_size=0.2, random_state=42)
    '''import model and predict'''
    # Tune the C parameter
    param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
                  'solver': ['liblinear', 'lbfgs']}
    # Grid search
    grid_search = GridSearchCV(LogisticRegression(random_state=42,
                               max_iter=1000), param_grid, cv=5, scoring='accuracy')
    grid_search.fit(X_train, y_train)
    # Print the best parameters
    print("Best parameters:", grid_search.best_params_)
    print("Best score:", grid_search.best_score_)
    # The best model (GridSearchCV has already refit it on the whole training split)
    best_model = grid_search.best_estimator_
    # Heatmap
    result = pd.DataFrame(grid_search.cv_results_)
    display(result.head())
    # cv_results_ enumerates the grid with sorted keys ('C' before 'solver') and the
    # last key varying fastest, so reshape by C first and transpose to match the labels
    scores = np.array(result.mean_test_score).reshape(len(param_grid['C']),
                                                      len(param_grid['solver'])).T
    mglearn.tools.heatmap(scores, xlabel='C', xticklabels=param_grid['C'],
                          ylabel='solver', yticklabels=param_grid['solver'], cmap="viridis")
    # Predict
    y_pred_best = best_model.predict(X_test)
    pre_result_best = best_model.predict(submission_data).astype('float64')
    # Evaluate
    # Accuracy
    accuracy_best = accuracy_score(y_test, y_pred_best)
    print(f"Best Tuned Accuracy: {accuracy_best:.2f}")
    # Classification report
    report_best = classification_report(y_test, y_pred_best)
    print("Best Classification Report:")
    print(report_best)
    # Confusion matrix
    conf_matrix_best = confusion_matrix(y_test, y_pred_best)
    print("Best Confusion Matrix:")
    print(conf_matrix_best)
    plt.show()
    # Return the array of predictions
    return pre_result_best
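The reshape of mean_test_score only works if it matches the order in which GridSearchCV enumerates parameter combinations: as I understand sklearn's ParameterGrid, keys are sorted alphabetically and the last key varies fastest, so with this grid 'solver' cycles inside each value of 'C'. A stdlib-only sketch of that ordering:

```python
from itertools import product

param_grid = {'C': [0.001, 0.01, 0.1], 'solver': ['liblinear', 'lbfgs']}
keys = sorted(param_grid)  # ['C', 'solver']: alphabetical, like ParameterGrid
combos = [dict(zip(keys, values))
          for values in product(*(param_grid[k] for k in keys))]
print(combos[0])  # {'C': 0.001, 'solver': 'liblinear'}
print(combos[1])  # {'C': 0.001, 'solver': 'lbfgs'} -- solver varies fastest
```

Because of this ordering, the flat score array groups by C first, so it is reshaped to (len(C), len(solver)) and transposed before labelling the heatmap rows with solvers.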
The heatmap it generates is shown below:
While making predictions, I ran into a problem with the two lines marked ① and ②. Written as in the code above, everything downstream works; but if they are changed to:
X_cate_encoder = encoder.fit_transform(X_cate)# ①
pre_cate_encoder = encoder.fit_transform(pre_cate)# ②
this error is raised:
ValueError: Shape of passed values is (891, 154), indices imply (891, 83)
The explanation is this: calling fit_transform on the training set X_cate makes the encoder learn the categories present in the training data. Calling fit_transform again on the test set pre_cate re-fits the encoder, which may learn a different set of categories, so the two encodings end up with different numbers of columns and are no longer consistent. The test set must instead be encoded with transform, reusing the categories learned from the training set.
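The mismatch can be reproduced without sklearn: a one-hot encoder's output width equals the number of categories it learned, so fitting separately on train and test yields different widths. A toy sketch (category values invented for illustration; an unseen value encodes to all zeros, like handle_unknown='ignore'):

```python
def fit_categories(column):
    """'fit': learn the sorted set of categories."""
    return sorted(set(column))

def transform(column, categories):
    """'transform': one row of 0/1 indicators per value."""
    return [[1 if v == c else 0 for c in categories] for v in column]

train_col = ['S', 'C', 'Q', 'S']
test_col = ['S', 'S', 'X']

cats = fit_categories(train_col)     # ['C', 'Q', 'S'] -> 3 columns
ok = transform(test_col, cats)       # consistent 3-column encoding
bad_cats = fit_categories(test_col)  # ['S', 'X'] -> only 2 columns
print(len(cats), len(bad_cats))      # 3 2: the width mismatch behind the ValueError
```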
That is all.