Airline Passenger Satisfaction Prediction Project

Article link: https://blog.csdn.net/qq_39297053/article/details/137208462

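This article explores how machine learning models such as random forests, support vector machines, and decision trees can be used to analyze the factors that influence airline passenger satisfaction, with the goal of improving service quality. The project uses the Python libraries pandas and scikit-learn for data preprocessing and model training, and the random forest classifier performed best among the models tested.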

Note: this article is reproduced from Venus AI, a professional artificial intelligence community.

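Problem statement:

Airline passenger satisfaction
Many factors affect a company's viability, from competitiveness to reputation and customer satisfaction.
The aim of this study is to determine passengers' satisfaction levels, understand the quality of service that airlines provide and the key drivers of customer satisfaction, and identify how the airline industry can improve its service quality.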

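Models and dependencies used in the project:

Random forest classifier
Support vector machine
Decision tree classifier
K-nearest neighbors classifier
Gaussian naive Bayes

Libraries used to develop the project: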

pandas==2.0.2
scikit_learn==1.2.2

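Project structure

1) Import all libraries
2) Read the training/test data from CSV files
3) Data analysis
4) Data cleaning/preprocessing
5) Model training
6) Model evaluation

Conclusion: we trained the following models: random forest classifier, support vector machine, decision tree classifier, KNeighbors classifier, and Gaussian naive Bayes. Of these, the random forest classifier reached the highest test accuracy on this dataset (about 0.964).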

Project conclusion

[Image 1: Airline satisfaction prediction project, VenusAI]

Project details

Importing the required libraries

import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

Reading the data

data=pd.read_csv('train.csv')

data.head()

5 rows × 25 columns

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103904 entries, 0 to 103903
Data columns (total 25 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   Unnamed: 0                         103904 non-null  int64  
 1   id                                 103904 non-null  int64  
 2   Gender                             103904 non-null  object 
 3   Customer Type                      103904 non-null  object 
 4   Age                                103904 non-null  int64  
 5   Type of Travel                     103904 non-null  object 
 6   Class                              103904 non-null  object 
 7   Flight Distance                    103904 non-null  int64  
 8   Inflight wifi service              103904 non-null  int64  
 9   Departure/Arrival time convenient  103904 non-null  int64  
 10  Ease of Online booking             103904 non-null  int64  
 11  Gate location                      103904 non-null  int64  
 12  Food and drink                     103904 non-null  int64  
 13  Online boarding                    103904 non-null  int64  
 14  Seat comfort                       103904 non-null  int64  
 15  Inflight entertainment             103904 non-null  int64  
 16  On-board service                   103904 non-null  int64  
 17  Leg room service                   103904 non-null  int64  
 18  Baggage handling                   103904 non-null  int64  
 19  Checkin service                    103904 non-null  int64  
 20  Inflight service                   103904 non-null  int64  
 21  Cleanliness                        103904 non-null  int64  
 22  Departure Delay in Minutes         103904 non-null  int64  
 23  Arrival Delay in Minutes           103594 non-null  float64
 24  satisfaction                       103904 non-null  object 
dtypes: float64(1), int64(19), object(5)
memory usage: 19.8+ MB

data.shape

(103904, 25)

data.describe()

Data cleaning

data.isna().sum()

Unnamed: 0                             0
id                                     0
Gender                                 0
Customer Type                          0
Age                                    0
Type of Travel                         0
Class                                  0
Flight Distance                        0
Inflight wifi service                  0
Departure/Arrival time convenient      0
Ease of Online booking                 0
Gate location                          0
Food and drink                         0
Online boarding                        0
Seat comfort                           0
Inflight entertainment                 0
On-board service                       0
Leg room service                       0
Baggage handling                       0
Checkin service                        0
Inflight service                       0
Cleanliness                            0
Departure Delay in Minutes             0
Arrival Delay in Minutes             310
satisfaction                           0
dtype: int64

data.dropna(axis=0, inplace=True)

data.isna().sum()

Unnamed: 0                           0
id                                   0
Gender                               0
Customer Type                        0
Age                                  0
Type of Travel                       0
Class                                0
Flight Distance                      0
Inflight wifi service                0
Departure/Arrival time convenient    0
Ease of Online booking               0
Gate location                        0
Food and drink                       0
Online boarding                      0
Seat comfort                         0
Inflight entertainment               0
On-board service                     0
Leg room service                     0
Baggage handling                     0
Checkin service                      0
Inflight service                     0
Cleanliness                          0
Departure Delay in Minutes           0
Arrival Delay in Minutes             0
satisfaction                         0
dtype: int64

### Encoding ###
le = LabelEncoder()
data["Gender"] = le.fit_transform(data["Gender"])
data["Customer Type"] = le.fit_transform(data["Customer Type"])
data["Type of Travel"] = le.fit_transform(data["Type of Travel"])
data["satisfaction"] = le.fit_transform(data["satisfaction"])

### Labeling ###
data["Class"] = data["Class"].replace({"Eco":1,"Eco Plus":2,"Business":3})

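Instantiating LabelEncoder:

le = LabelEncoder()

This creates a LabelEncoder object, le, which is used for the label encoding that follows.

Encoding each feature:

For each of the categorical columns listed below, LabelEncoder's fit_transform method converts the string values to integers.

data["Gender"] = le.fit_transform(data["Gender"]): encodes the Gender feature, turning text labels into integers. LabelEncoder assigns codes in sorted label order, so "Female" becomes 0 and "Male" becomes 1.

data["Customer Type"] = le.fit_transform(data["Customer Type"]): applies the same treatment to the Customer Type feature.

data["Type of Travel"] = le.fit_transform(data["Type of Travel"]): encodes the Type of Travel feature.

data["satisfaction"] = le.fit_transform(data["satisfaction"]): encodes the satisfaction column, which is the prediction target.

fit_transform first learns the distinct labels present in the column (fit) and then replaces each label with its integer code (transform). Because le is refitted on every call, it only remembers the mapping for the most recently encoded column.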
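
To make the label-to-integer mapping concrete, here is a minimal sketch on a toy column. The series `gender` and its values are illustrative stand-ins and are not taken from the dataset; only `LabelEncoder`, `classes_`, and `inverse_transform` are real scikit-learn names.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column standing in for data["Gender"]; the values are illustrative.
gender = pd.Series(["Male", "Female", "Female", "Male"])

le = LabelEncoder()
codes = le.fit_transform(gender)

# classes_ holds the learned labels in sorted order; a label's position is its code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)
print(codes.tolist())

# inverse_transform turns the integer codes back into the original labels.
restored = [str(label) for label in le.inverse_transform(codes)]
print(restored)
```

Inspecting `classes_` right after each `fit_transform` call is one way to record which integer corresponds to which label before the encoder is refitted on the next column.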

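Manual label replacement

In addition to using LabelEncoder, the code replaces the labels of the Class feature by hand.

data["Class"] = data["Class"].replace({"Eco":1,"Eco Plus":2,"Business":3})

This line maps each category of the Class feature ("Eco", "Eco Plus", "Business") to a specific number (1, 2, 3). It is a more direct way to encode categories, especially when there are only a few of them and you want to choose the number each one receives. LabelEncoder would instead assign codes in alphabetical order ("Business" 0, "Eco" 1, "Eco Plus" 2), which does not follow the ordering of the cabin classes.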
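
As a variation on the manual mapping, the sketch below uses `Series.map` instead of `replace`. This is a swapped-in technique, not the article's code: `replace` leaves values that are missing from the dictionary unchanged, whereas `map` turns them into NaN, which makes unexpected categories easy to detect. The toy series `cabin` and the extra value "First" are hypothetical.

```python
import pandas as pd

# Toy column standing in for data["Class"]; "First" is a deliberately unknown value.
cabin = pd.Series(["Eco", "Business", "Eco Plus", "First"])

# Same ordinal mapping as in the article.
class_order = {"Eco": 1, "Eco Plus": 2, "Business": 3}

# map() yields NaN for any value that is not a key of the dictionary.
encoded = cabin.map(class_order)

# Collect the original values that could not be mapped.
unmapped = cabin[encoded.isna()].tolist()
print(encoded.tolist())
print(unmapped)
```

If `unmapped` is non-empty, the mapping dictionary probably needs another entry before the column is passed to a model.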

Model training

X_train = data.drop("satisfaction", axis=1)
y_train= data["satisfaction"]
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
models = pd.DataFrame(columns=["Model Name","Accuracy Score"])

model_list = [("Random Forest Classifier",RandomForestClassifier(random_state=42)),
             ("Support Vector Machines",SVC(random_state=42)),
             ("Decision Tree Classifier", DecisionTreeClassifier(random_state=42)),
             ("KNeighbors Classifier",KNeighborsClassifier(n_neighbors=2)),
             ("Gaussian Naive Bayes", GaussianNB())]
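
The StandardScaler step in the training code rescales every column to zero mean and unit variance. A common convention is to learn those statistics from the training data only and then reuse them on any other data through `transform`. The sketch below illustrates this on toy arrays; `train` and `test` and their numbers are illustrative and unrelated to the airline data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature arrays; the numbers are illustrative only.
train = np.array([[0.0], [10.0], [20.0]])
test = np.array([[10.0], [30.0]])

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training data.
train_scaled = scaler.fit_transform(train)

# transform reuses those training statistics instead of recomputing them.
test_scaled = scaler.transform(test)

print(scaler.mean_)
print(train_scaled.ravel())
print(test_scaled.ravel())
```

With this convention, a test value equal to the training mean is mapped to 0, so both sets end up on the same scale.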

Model validation

testData=pd.read_csv('test.csv')
testData.dropna(axis=0, inplace=True)

### Encoding ###
le = LabelEncoder()
testData["Gender"] = le.fit_transform(testData["Gender"])
testData["Customer Type"] = le.fit_transform(testData["Customer Type"])
testData["Type of Travel"] = le.fit_transform(testData["Type of Travel"])
testData["satisfaction"] = le.fit_transform(testData["satisfaction"])

### Labeling ###
testData["Class"] = testData["Class"].replace({"Eco":1,"Eco Plus":2,"Business":3})

Xtest = testData.drop("satisfaction", axis=1)
ytest= testData["satisfaction"]
# Reuse the scaler fitted on the training data so both sets are scaled with the same statistics
Xtest = scaler.transform(Xtest)

# Create an empty DataFrame to store the results
models = pd.DataFrame(columns=["Model Name", "Accuracy Score"])

# Loop over the models: fit on the training data, score on the test data
for algoName, model in model_list:
    model.fit(X_train, y_train)
    predictions = model.predict(Xtest)
    score = accuracy_score(ytest, predictions)
    new_row = {"Model Name": algoName, "Accuracy Score": score}

    # Use pd.concat rather than append (DataFrame.append was removed in pandas 2.0)
    models = pd.concat([models, pd.DataFrame([new_row])], ignore_index=True)

# Sort the models by accuracy in descending order
models = models.sort_values(by="Accuracy Score", ascending=False)

# Display the models and their accuracy scores
models

	Model Name	Accuracy Score
0	Random Forest Classifier	0.963890
1	Support Vector Machines	0.956050
2	Decision Tree Classifier	0.945700
3	KNeighbors Classifier	0.909319
4	Gaussian Naive Bayes	0.861970
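
Since the stated aim is to find the factors that drive satisfaction, one possible follow-up is to read the impurity-based `feature_importances_` of a fitted random forest. The sketch below is an assumption-laden illustration and not part of the original project: the data is synthetic, the two column names are merely borrowed from the dataset, and the target `y` is a hypothetical rule that depends on a single column.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Synthetic stand-in data: column names are borrowed from the dataset, values are random.
X = pd.DataFrame({
    "Online boarding": rng.integers(0, 6, size=500),
    "Gate location": rng.integers(0, 6, size=500),
})

# Hypothetical target that depends only on "Online boarding".
y = (X["Online boarding"] >= 3).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# feature_importances_ sums to 1; larger values mean the feature reduced impurity more.
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

On the real data, the same two lines applied to the fitted RandomForestClassifier would rank all input columns, provided the column names are kept alongside the scaled feature matrix.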