Note: this article is quoted from the professional AI community Venus AI.
For more AI material, see the original site ([www.aideeplearning.cn])
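Problem statement:
Airline passenger satisfaction
Many factors affect a company's viability, from competitiveness to reputation and customer satisfaction.
This study aims to determine passengers' satisfaction levels, understand the quality of service that airlines provide, identify the key factors behind customer satisfaction, and work out how the airline industry can improve its service quality.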
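Models and dependencies used in the project:
- Random Forest classifier
- Support Vector Machine
- Decision Tree classifier
- K-Nearest Neighbors classifier
- Gaussian Naive Bayes
Libraries used during development: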
pandas==2.0.2
scikit_learn==1.2.2
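Project structure
1) Import all libraries
2) Read the training/test data from CSV files
3) Data analysis
4) Data cleaning/preprocessing
5) Model training
6) Model evaluation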
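Project conclusion
We compared the following models: Random Forest Classifier, Support Vector Machines, Decision Tree Classifier, KNeighbors Classifier, and Gaussian Naive Bayes. Of these, the Random Forest Classifier fit this dataset best, reaching the highest test accuracy.
Project details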
Import the required libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
Read the data
data=pd.read_csv('train.csv')
data.head()
5 rows × 25 columns
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103904 entries, 0 to 103903
Data columns (total 25 columns):
 #   Column                             Non-Null Count   Dtype
---  ------                             --------------   -----
 0   Unnamed: 0                         103904 non-null  int64
 1   id                                 103904 non-null  int64
 2   Gender                             103904 non-null  object
 3   Customer Type                      103904 non-null  object
 4   Age                                103904 non-null  int64
 5   Type of Travel                     103904 non-null  object
 6   Class                              103904 non-null  object
 7   Flight Distance                    103904 non-null  int64
 8   Inflight wifi service              103904 non-null  int64
 9   Departure/Arrival time convenient  103904 non-null  int64
 10  Ease of Online booking             103904 non-null  int64
 11  Gate location                      103904 non-null  int64
 12  Food and drink                     103904 non-null  int64
 13  Online boarding                    103904 non-null  int64
 14  Seat comfort                       103904 non-null  int64
 15  Inflight entertainment             103904 non-null  int64
 16  On-board service                   103904 non-null  int64
 17  Leg room service                   103904 non-null  int64
 18  Baggage handling                   103904 non-null  int64
 19  Checkin service                    103904 non-null  int64
 20  Inflight service                   103904 non-null  int64
 21  Cleanliness                        103904 non-null  int64
 22  Departure Delay in Minutes         103904 non-null  int64
 23  Arrival Delay in Minutes           103594 non-null  float64
 24  satisfaction                       103904 non-null  object
dtypes: float64(1), int64(19), object(5)
memory usage: 19.8+ MB
data.shape
(103904, 25)
data.describe()
Data cleaning
data.isna().sum()
Unnamed: 0                             0
id                                     0
Gender                                 0
Customer Type                          0
Age                                    0
Type of Travel                         0
Class                                  0
Flight Distance                        0
Inflight wifi service                  0
Departure/Arrival time convenient      0
Ease of Online booking                 0
Gate location                          0
Food and drink                         0
Online boarding                        0
Seat comfort                           0
Inflight entertainment                 0
On-board service                       0
Leg room service                       0
Baggage handling                       0
Checkin service                        0
Inflight service                       0
Cleanliness                            0
Departure Delay in Minutes             0
Arrival Delay in Minutes             310
satisfaction                           0
dtype: int64
data.dropna(axis=0, inplace=True)
data.isna().sum()
Unnamed: 0                           0
id                                   0
Gender                               0
Customer Type                        0
Age                                  0
Type of Travel                       0
Class                                0
Flight Distance                      0
Inflight wifi service                0
Departure/Arrival time convenient    0
Ease of Online booking               0
Gate location                        0
Food and drink                       0
Online boarding                      0
Seat comfort                         0
Inflight entertainment               0
On-board service                     0
Leg room service                     0
Baggage handling                     0
Checkin service                      0
Inflight service                     0
Cleanliness                          0
Departure Delay in Minutes           0
Arrival Delay in Minutes             0
satisfaction                         0
dtype: int64
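In the output above, only Arrival Delay in Minutes has missing values (310 of 103904 rows, roughly 0.3%), so dropping those rows costs little data. As a minimal sketch of how that cost could be checked before committing to `dropna`, the toy frame below (made-up values, with two column names borrowed from the dataset) counts the rows that would be lost:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration.
df = pd.DataFrame({
    "Departure Delay in Minutes": [0, 12, 5, 30],
    "Arrival Delay in Minutes": [0.0, np.nan, 3.0, 28.0],
})

before = len(df)
cleaned = df.dropna(axis=0)        # returns a new frame; df itself is unchanged
dropped = before - len(cleaned)

print(f"dropped {dropped} of {before} rows ({dropped / before:.1%})")
# dropped 1 of 4 rows (25.0%)
```

If the dropped share were large, imputing the missing delays might be worth considering instead; with a share this small on the real data, dropping is a reasonable choice.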
### Encoding ###
le = LabelEncoder()
data["Gender"] = le.fit_transform(data["Gender"])
data["Customer Type"] = le.fit_transform(data["Customer Type"])
data["Type of Travel"] = le.fit_transform(data["Type of Travel"])
data["satisfaction"] = le.fit_transform(data["satisfaction"])
### Labeling ###
data["Class"] = data["Class"].replace({"Eco":1,"Eco Plus":2,"Business":3})
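Instantiating the LabelEncoder:
le = LabelEncoder()
This step creates a LabelEncoder object, le, which is used for the label encoding that follows.
Encoding each feature:
For each categorical feature in data, LabelEncoder's fit_transform method converts its values.
data["Gender"] = le.fit_transform(data["Gender"]): encodes the Gender feature, turning text labels into integers. LabelEncoder numbers the classes in sorted order, so "Female" becomes 0 and "Male" becomes 1.
data["Customer Type"] = le.fit_transform(data["Customer Type"]): applies the same treatment to the Customer Type feature.
data["Type of Travel"] = le.fit_transform(data["Type of Travel"]): encodes the Type of Travel feature.
data["satisfaction"] = le.fit_transform(data["satisfaction"]): encodes the satisfaction column, which is the prediction target.
fit_transform first learns the set of distinct labels in the column (fit) and then replaces each label with its integer code (transform). Because the same le object is refitted on every column, it only remembers the mapping of the last column it was fitted on.
Manual label replacement
Besides using LabelEncoder, the code also replaces the labels in the Class feature by hand.
data["Class"] = data["Class"].replace({"Eco":1,"Eco Plus":2,"Business":3})
This line maps each category in Class ("Eco", "Eco Plus", "Business") to a specific number (1, 2, 3). It is a more direct way to encode categories, particularly when there are only a few of them and you want to choose the values yourself, here so that the numbers follow the natural order of the cabin classes.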
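To make the fit-then-transform behaviour concrete, here is a small sketch on a toy column standing in for data["Gender"] (the four values are invented). It shows that LabelEncoder assigns codes by sorted class order, which can be read back from the `classes_` attribute:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column standing in for data["Gender"]; the values are illustrative.
gender = pd.Series(["Male", "Female", "Female", "Male"])

le = LabelEncoder()
encoded = le.fit_transform(gender)

# classes_ is sorted, so a label's position in it is the code it receives.
mapping = {str(label): code for code, label in enumerate(le.classes_)}

print(mapping)            # {'Female': 0, 'Male': 1}
print(encoded.tolist())   # [1, 0, 0, 1]
```

Keeping one encoder per column (rather than refitting a single object) would preserve every mapping, which is useful when the codes need to be translated back with `inverse_transform`.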
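One caveat with `replace` is that a category missing from the dictionary is left unchanged as a string. A possible alternative, sketched below on a toy series, is `Series.map`, which turns unmapped values into NaN so they are easy to spot. The "First" value is a made-up example of an unexpected category and does not appear in the article's dataset:

```python
import pandas as pd

# Same ordinal mapping as in the article.
class_order = {"Eco": 1, "Eco Plus": 2, "Business": 3}

# Toy column; "First" is a hypothetical category not covered by the mapping.
cabin = pd.Series(["Eco", "Business", "Eco Plus", "First"])

encoded = cabin.map(class_order)              # unmapped labels become NaN
unmapped = cabin[encoded.isna()].unique().tolist()

print(unmapped)   # ['First']
```

An empty `unmapped` list confirms that every category was covered before the column is passed to a model.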
Model training
X_train = data.drop("satisfaction", axis=1)
y_train= data["satisfaction"]
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
models = pd.DataFrame(columns=["Model Name","Accuracy Score"])
model_list = [("Random Forest Classifier",RandomForestClassifier(random_state=42)),
("Support Vector Machines",SVC(random_state=42)),
("Decision Tree Classifier", DecisionTreeClassifier(random_state=42)),
("KNeighbors Classifier",KNeighborsClassifier(n_neighbors=2)),
("Gaussian Naive Bayes", GaussianNB())]
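A detail of the scaling step that is easy to overlook: `fit_transform` stores the training mean and standard deviation inside the scaler, and `transform` reuses them on any later data. The sketch below uses tiny made-up arrays to show why a held-out set should go through `transform` rather than a fresh `fit_transform`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data: training mean is 5, standard deviation is 5.
train = np.array([[0.0], [10.0]])
test = np.array([[10.0], [20.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)   # learns mean/std from train

# Reusing the fitted scaler keeps raw value 10 at the same scaled value (1.0).
test_scaled = scaler.transform(test)

# Refitting on the test set instead maps the same raw value 10 to -1.0.
test_refit = StandardScaler().fit_transform(test)

print(train_scaled.ravel())  # [-1.  1.]
print(test_scaled.ravel())   # [1. 3.]
print(test_refit.ravel())    # [-1.  1.]
```

With a refitted scaler the model would see the same raw value under two different encodings, so its learned decision boundaries would no longer line up with the test features.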
Model validation
testData=pd.read_csv('test.csv')
testData.dropna(axis=0, inplace=True)
### Encoding ###
le = LabelEncoder()
testData["Gender"] = le.fit_transform(testData["Gender"])
testData["Customer Type"] = le.fit_transform(testData["Customer Type"])
testData["Type of Travel"] = le.fit_transform(testData["Type of Travel"])
testData["satisfaction"] = le.fit_transform(testData["satisfaction"])
### Labeling ###
testData["Class"] = testData["Class"].replace({"Eco":1,"Eco Plus":2,"Business":3})
Xtest = testData.drop("satisfaction", axis=1)
ytest= testData["satisfaction"]
# Scale the test set with the scaler already fitted on the training data
Xtest = scaler.transform(Xtest)
# Create an empty DataFrame to store the results
models = pd.DataFrame(columns=["Model Name", "Accuracy Score"])
# Loop over the models
for algoName, model in model_list:
    model.fit(X_train, y_train)
    predictions = model.predict(Xtest)
    score = accuracy_score(ytest, predictions)
    new_row = {"Model Name": algoName, "Accuracy Score": score}
    # Use pd.concat rather than append (DataFrame.append was removed in pandas 2.0)
    models = pd.concat([models, pd.DataFrame([new_row])], ignore_index=True)
# Sort the models by accuracy in descending order
models = models.sort_values(by="Accuracy Score", ascending=False)
# Display the models and their accuracy
models
Index | Model Name | Accuracy Score
---|---|---
0 | Random Forest Classifier | 0.963890
1 | Support Vector Machines | 0.956050
2 | Decision Tree Classifier | 0.945700
3 | KNeighbors Classifier | 0.909319
4 | Gaussian Naive Bayes | 0.861970
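Since the study also sets out to identify the key factors behind satisfaction, one possible follow-up (not part of the original walkthrough) is to inspect the `feature_importances_` attribute of a fitted random forest. The sketch below runs on synthetic data: the three column names are borrowed from the dataset, but the values and the labelling rule are invented, so the ranking it prints says nothing about the real survey.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 500

# Synthetic ratings in the range 0..5; not the article's data.
X = pd.DataFrame({
    "Online boarding": rng.integers(0, 6, n),
    "Seat comfort": rng.integers(0, 6, n),
    "Gate location": rng.integers(0, 6, n),
})
# Invented rule: the label depends only on "Online boarding".
y = (X["Online boarding"] >= 3).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Impurity-based importances sum to 1; larger means the feature was
# more useful for splitting across the trees.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

On the real data the same two lines could be applied to the fitted Random Forest Classifier, provided the feature names are kept alongside the scaled array, because `StandardScaler` returns a plain NumPy array without column labels.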