Random Forest for Classification and Regression

Part 1: Basic Concept

Random Forest is an ensemble learning algorithm used for both classification and regression. It builds many decision trees and combines their results to improve the model's accuracy and robustness. Typical uses include:

  • Regression: predicting a continuous value (e.g. house prices, temperature).
  • Classification: predicting a category, or the probability of an event (e.g. whether a patient has a disease).
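As a quick sketch of the two uses (scikit-learn's built-in estimators on synthetic data; this is not the fraud dataset used later), both tasks share essentially the same API:

```python
# Minimal sketch: the same ensemble API covers regression and classification.
# The synthetic data here is for illustration only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))

# Regression: predict a continuous value
y_reg = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y_reg)
print(reg.predict(X[:2]))   # continuous predictions

# Classification: predict a discrete class
y_clf = (X[:, 0] > 0).astype(int)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y_clf)
print(clf.predict(X[:2]))   # class labels (0 or 1)
```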

Part 2: How It Works (Classification)

For classification, the core idea of a random forest is to build many decision trees and combine their outputs into a final prediction. Each tree produces its own prediction, and the final class is decided by a voting mechanism (the class chosen by the majority of trees).

The workflow step by step:

1. Data preparation: start with a set of labeled training samples, each consisting of a feature vector and a class label. These samples are used to build the random forest model.
2. Random sampling: generate multiple different training sets by sampling the original data with replacement (bootstrap sampling). Each bootstrap set contains the same number of samples as the original training data.
3. Tree construction: build an independent decision tree on each bootstrap set. Construction works the same as for an ordinary decision-tree classifier, except that at each node only a randomly selected subset of the features is considered for the split.
4. Ensemble decision: once the trees are built, classify a new sample by voting. Each tree's predicted class is tallied, and the class with the most votes becomes the final prediction.
5. Output: assign the sample to the majority-vote class as the final classification result.

Metrics for evaluating the model are covered in Part 3.
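The bootstrap-and-vote flow above can be sketched from scratch (a simplified illustration built on scikit-learn's DecisionTreeClassifier; the function names are made up, and for real work you would use RandomForestClassifier directly):

```python
# Simplified from-scratch sketch of the bootstrap + majority-vote workflow.
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=25, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)            # step 2: bootstrap sample (with replacement)
        tree = DecisionTreeClassifier(
            max_features="sqrt",                     # step 3: random feature subset at each split
            random_state=int(rng.integers(1 << 30)),
        ).fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    votes = np.array([t.predict(X) for t in trees])  # one row of predictions per tree
    # steps 4-5: majority vote per sample
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])

X = np.random.default_rng(1).normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
trees = fit_forest(X, y)
print((predict_forest(trees, X) == y).mean())  # training accuracy
```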

Part 3: Code Implementation (Classification)

(1) Example

Here I take "telecom bank-card fraud" data analysis as the example, and evaluate the model with accuracy, precision, recall, F1 score, and specificity.
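As a reference for what these metrics mean, they can all be computed directly from confusion-matrix counts (the numbers below are made up for illustration):

```python
# Toy confusion-matrix counts (made up) to show the metric formulas.
tp, fp, fn, tn = 40, 10, 5, 45

accuracy    = (tp + tn) / (tp + tn + fp + fn)   # fraction of all predictions that are correct
precision   = tp / (tp + fp)                    # of predicted positives, how many were right
recall      = tp / (tp + fn)                    # of actual positives, how many were found
f1          = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
specificity = tn / (tn + fp)                    # of actual negatives, how many were found

print(accuracy, precision, recall, f1, specificity)
```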

(2) Code

① Train the model

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Load the dataset
data = pd.read_csv('电信银行卡诈骗的数据分析.csv')

# Split the data into features (X) and target (y)
X = data.drop(columns=['Fraud'])
y = data['Fraud']

# Split the data into training, validation, and test sets (60% train, 20% validation, 20% test)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Train a Random Forest classifier
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_train, y_train)

# Predict on training, validation, and test sets
y_train_pred = rf_classifier.predict(X_train)
y_val_pred = rf_classifier.predict(X_val)
y_test_pred = rf_classifier.predict(X_test)

# Define a function to calculate metrics
def calculate_metrics(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    specificity = tn / (tn + fp)
    return accuracy, precision, recall, f1, specificity

# Metrics for training, validation, and test sets
train_metrics = calculate_metrics(y_train, y_train_pred)
val_metrics = calculate_metrics(y_val, y_val_pred)
test_metrics = calculate_metrics(y_test, y_test_pred)

# Display the results in a DataFrame
metrics_df = pd.DataFrame({
    'Dataset': ['Train', 'Validation', 'Test'],
    'Accuracy': [train_metrics[0], val_metrics[0], test_metrics[0]],
    'Precision': [train_metrics[1], val_metrics[1], test_metrics[1]],
    'Recall': [train_metrics[2], val_metrics[2], test_metrics[2]],
    'F1 Score': [train_metrics[3], val_metrics[3], test_metrics[3]],
    'Specificity': [train_metrics[4], val_metrics[4], test_metrics[4]]
})

# Display the DataFrame
print(metrics_df)

② Test case


# ---- Testing on New Data ----
# Example of a new sample to test
new_sample = pd.DataFrame({
    'Distance1': [10],    # Replace with appropriate values
    'Distance2': [0.5],   # Replace with appropriate values
    'Ratio': [1.5],       # Replace with appropriate values
    'Repeat': [1],        # Replace with appropriate values
    'Card': [1],          # Replace with appropriate values
    'Pin': [0],           # Replace with appropriate values
    'Online': [1]         # Replace with appropriate values
})

# Predict the probability for the new sample
probabilities = rf_classifier.predict_proba(new_sample)

# Output the predicted class and probabilities
predicted_class = rf_classifier.predict(new_sample)
print(f"Predicted Class: {predicted_class[0]}")
print(f"Probabilities (class 0, class 1): {probabilities[0]}")

  • Predicted Class: 0: the model predicts that the new sample belongs to class 0. Given the dataset, class 0 presumably corresponds to "not fraud" and class 1 to "fraud".

  • Probabilities (class 0, class 1): [1. 0.]:

    • 1.0: the predicted probability that the sample belongs to class 0 (not fraud) is 1.0 (100%).
    • 0.0: the predicted probability that the sample belongs to class 1 (fraud) is 0.0 (0%).

So [1.0, 0.0] means the model is entirely confident that this sample is not fraud, assigning no probability at all to the fraud class.
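Those probabilities come from the trees themselves: scikit-learn's `predict_proba` averages the per-tree class probabilities, and with fully grown trees (pure leaves, the default) this equals the fraction of trees voting for each class. A small sketch on synthetic data, inspecting the fitted trees via the `estimators_` attribute:

```python
# Sketch: predict_proba as an average over the individual trees' votes.
# Synthetic data; with pure leaves the averaged probability equals the vote fraction.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.default_rng(0).normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

sample = X[:1]
tree_votes = np.array([t.predict(sample)[0] for t in rf.estimators_])
print("fraction of trees voting class 1:", tree_votes.mean())
print("predict_proba:", rf.predict_proba(sample)[0])
```

An output of [1. 0.] therefore simply means that every tree in the forest voted for class 0.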

Combining ① and ②:

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Load the dataset
data = pd.read_csv('电信银行卡诈骗的数据分析.csv')

# Split the data into features (X) and target (y)
X = data.drop(columns=['Fraud'])
y = data['Fraud']

# Split the data into training, validation, and test sets (60% train, 20% validation, 20% test)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Train a Random Forest classifier
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_train, y_train)

# Predict on training, validation, and test sets
y_train_pred = rf_classifier.predict(X_train)
y_val_pred = rf_classifier.predict(X_val)
y_test_pred = rf_classifier.predict(X_test)

# Define a function to calculate metrics
def calculate_metrics(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    specificity = tn / (tn + fp)
    return accuracy, precision, recall, f1, specificity

# Metrics for training, validation, and test sets
train_metrics = calculate_metrics(y_train, y_train_pred)
val_metrics = calculate_metrics(y_val, y_val_pred)
test_metrics = calculate_metrics(y_test, y_test_pred)

# Display the results in a DataFrame
metrics_df = pd.DataFrame({
    'Dataset': ['Train', 'Validation', 'Test'],
    'Accuracy': [train_metrics[0], val_metrics[0], test_metrics[0]],
    'Precision': [train_metrics[1], val_metrics[1], test_metrics[1]],
    'Recall': [train_metrics[2], val_metrics[2], test_metrics[2]],
    'F1 Score': [train_metrics[3], val_metrics[3], test_metrics[3]],
    'Specificity': [train_metrics[4], val_metrics[4], test_metrics[4]]
})

# Display the DataFrame
print(metrics_df)

# ---- Testing on New Data ----
# Example of a new sample to test
new_sample = pd.DataFrame({
    'Distance1': [10],    # Replace with appropriate values
    'Distance2': [0.5],   # Replace with appropriate values
    'Ratio': [1.5],       # Replace with appropriate values
    'Repeat': [1],        # Replace with appropriate values
    'Card': [1],          # Replace with appropriate values
    'Pin': [0],           # Replace with appropriate values
    'Online': [1]         # Replace with appropriate values
})

# Predict the probability for the new sample
probabilities = rf_classifier.predict_proba(new_sample)

# Output the predicted class and probabilities
predicted_class = rf_classifier.predict(new_sample)
print(f"Predicted Class: {predicted_class[0]}")
print(f"Probabilities (class 0, class 1): {probabilities[0]}")

③ Upgrade: export the trained model so it can be reused later without retraining

Save the model as "random_forest_model.pkl":

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import joblib  # For saving and loading the model

# Load the dataset
data = pd.read_csv('电信银行卡诈骗的数据分析.csv')

# Split the data into features (X) and target (y)
X = data.drop(columns=['Fraud'])
y = data['Fraud']

# Split the data into training, validation, and test sets (60% train, 20% validation, 20% test)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Train a Random Forest classifier
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_train, y_train)

# Save the trained model to a file
model_filename = "random_forest_model.pkl"
joblib.dump(rf_classifier, model_filename)

# ---- Optional: You can print or calculate model metrics if necessary ----
def calculate_metrics(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    specificity = tn / (tn + fp)
    return accuracy, precision, recall, f1, specificity

# Predict and calculate metrics
y_train_pred = rf_classifier.predict(X_train)
train_metrics = calculate_metrics(y_train, y_train_pred)
print("Training Set Metrics:", train_metrics)

# Model is saved; ready for later use

④ Upgrade: load the saved model and run a test

import joblib
import pandas as pd

# Load the trained model from the file
model_filename = "random_forest_model.pkl"
rf_classifier = joblib.load(model_filename)

# ---- Testing on New Data ----
# Example of a new sample to test
new_sample = pd.DataFrame({
    'Distance1': [10],    # Replace with appropriate values
    'Distance2': [0.5],   # Replace with appropriate values
    'Ratio': [1.5],       # Replace with appropriate values
    'Repeat': [1],        # Replace with appropriate values
    'Card': [1],          # Replace with appropriate values
    'Pin': [0],           # Replace with appropriate values
    'Online': [1]         # Replace with appropriate values
})

# Predict the probability for the new sample
probabilities = rf_classifier.predict_proba(new_sample)

# Output the predicted class and probabilities
predicted_class = rf_classifier.predict(new_sample)
print(f"Predicted Class: {predicted_class[0]}")
print(f"Probabilities (class 0, class 1): {probabilities[0]}")

Part 4: How It Works (Regression)

Not yet written.

Part 5: Code Implementation (Regression)

Not yet written.
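Until this part is written, here is a minimal sketch of what the regression counterpart might look like (RandomForestRegressor on synthetic data, evaluated with MSE and R²; the data and column meanings are illustrative only):

```python
# Minimal regression sketch: each tree predicts a number, the forest averages them.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real regression dataset (e.g. house prices).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf_reg = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)
y_pred = rf_reg.predict(X_test)

print("MSE:", mean_squared_error(y_test, y_pred))
print("R2:", r2_score(y_test, y_pred))
```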

Part 6: Resources

(1) The "telecom bank-card fraud" dataset used above:

Shared via netdisk: 电信银行卡诈骗的数据分析.zip
Link: https://pan.baidu.com/s/1Y25ZddKQ8YFgd31y96al0g?pwd=2dxk  Extraction code: 2dxk

That's all; I hope this helps!
