Random Forest for Classification and Regression

Part 1: Basic Concept

Random Forest is an ensemble learning algorithm used for both classification and regression. It builds many decision trees and combines their results to improve the model's accuracy and robustness. Typical uses include:

  • Regression: predicting a continuous value (e.g. house prices, temperature).
  • Classification: predicting a category, or the probability of an event (e.g. whether a patient has a disease).
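As a quick sketch of the two uses (scikit-learn's built-in estimators on synthetic data; this is not the fraud dataset used later), both tasks share essentially the same API:

```python
# Minimal sketch: the same ensemble API covers regression and classification.
# The synthetic data here is for illustration only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))

# Regression: predict a continuous value
y_reg = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y_reg)
print(reg.predict(X[:2]))   # continuous predictions

# Classification: predict a discrete class
y_clf = (X[:, 0] > 0).astype(int)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y_clf)
print(clf.predict(X[:2]))   # class labels (0 or 1)
```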

Part 2: How It Works (Classification)

For classification, the core idea of a random forest is to build many decision trees and combine their outputs into a final prediction. Each tree produces its own prediction, and the final class is decided by a voting mechanism (the class chosen by the majority of trees).

The workflow step by step:

1. Data preparation: start with a set of labeled training samples, each consisting of a feature vector and a class label. These samples are used to build the random forest model.
2. Random sampling: generate multiple different training sets by sampling the original data with replacement (bootstrap sampling). Each bootstrap set contains the same number of samples as the original training data.
3. Tree construction: build an independent decision tree on each bootstrap set. Construction works the same as for an ordinary decision-tree classifier, except that at each node only a randomly selected subset of the features is considered for the split.
4. Ensemble decision: once the trees are built, classify a new sample by voting. Each tree's predicted class is tallied, and the class with the most votes becomes the final prediction.
5. Output: assign the sample to the majority-vote class as the final classification result.

Metrics for evaluating the model are covered in Part 3.
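The bootstrap-and-vote flow above can be sketched from scratch (a simplified illustration built on scikit-learn's DecisionTreeClassifier; the function names are made up, and for real work you would use RandomForestClassifier directly):

```python
# Simplified from-scratch sketch of the bootstrap + majority-vote workflow.
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=25, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)            # step 2: bootstrap sample (with replacement)
        tree = DecisionTreeClassifier(
            max_features="sqrt",                     # step 3: random feature subset at each split
            random_state=int(rng.integers(1 << 30)),
        ).fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    votes = np.array([t.predict(X) for t in trees])  # one row of predictions per tree
    # steps 4-5: majority vote per sample
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])

X = np.random.default_rng(1).normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
trees = fit_forest(X, y)
print((predict_forest(trees, X) == y).mean())  # training accuracy
```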

Part 3: Code Implementation (Classification)

(1) Example

Here I take "telecom bank-card fraud" data analysis as the example, and evaluate the model with accuracy, precision, recall, F1 score, and specificity.
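As a reference for what these metrics mean, they can all be computed directly from confusion-matrix counts (the numbers below are made up for illustration):

```python
# Toy confusion-matrix counts (made up) to show the metric formulas.
tp, fp, fn, tn = 40, 10, 5, 45

accuracy    = (tp + tn) / (tp + tn + fp + fn)   # fraction of all predictions that are correct
precision   = tp / (tp + fp)                    # of predicted positives, how many were right
recall      = tp / (tp + fn)                    # of actual positives, how many were found
f1          = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
specificity = tn / (tn + fp)                    # of actual negatives, how many were found

print(accuracy, precision, recall, f1, specificity)
```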

(2) Code

① Train the model

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Load the dataset
data = pd.read_csv('电信银行卡诈骗的数据分析.csv')

# Split the data into features (X) and target (y)
X = data.drop(columns=['Fraud'])
y = data['Fraud']

# Split the data into training, validation, and test sets (60% train, 20% validation, 20% test)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Train a Random Forest classifier
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_train, y_train)

# Predict on training, validation, and test sets
y_train_pred = rf_classifier.predict(X_train)
y_val_pred = rf_classifier.predict(X_val)
y_test_pred = rf_classifier.predict(X_test)

# Define a function to calculate metrics
def calculate_metrics(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    specificity = tn / (tn + fp)
    return accuracy, precision, recall, f1, specificity

# Metrics for training, validation, and test sets
train_metrics = calculate_metrics(y_train, y_train_pred)
val_metrics = calculate_metrics(y_val, y_val_pred)
test_metrics = calculate_metrics(y_test, y_test_pred)

# Display the results in a DataFrame
metrics_df = pd.DataFrame({
    'Dataset': ['Train', 'Validation', 'Test'],
    'Accuracy': [train_metrics[0], val_metrics[0], test_metrics[0]],
    'Precision': [train_metrics[1], val_metrics[1], test_metrics[1]],
    'Recall': [train_metrics[2], val_metrics[2], test_metrics[2]],
    'F1 Score': [train_metrics[3], val_metrics[3], test_metrics[3]],
    'Specificity': [train_metrics[4], val_metrics[4], test_metrics[4]]
})

# Display the DataFrame
print(metrics_df)

② Test case


# ---- Testing on New Data ----
# Example of a new sample to test
new_sample = pd.DataFrame({
    'Distance1': [10],    # Replace with appropriate values
    'Distance2': [0.5],   # Replace with appropriate values
    'Ratio': [1.5],       # Replace with appropriate values
    'Repeat': [1],        # Replace with appropriate values
    'Card': [1],          # Replace with appropriate values
    'Pin': [0],           # Replace with appropriate values
    'Online': [1]         # Replace with appropriate values
})

# Predict the probability for the new sample
probabilities = rf_classifier.predict_proba(new_sample)

# Output the predicted class and probabilities
predicted_class = rf_classifier.predict(new_sample)
print(f"Predicted Class: {predicted_class[0]}")
print(f"Probabilities (class 0, class 1): {probabilities[0]}")

  • Predicted Class: 0: the model predicts that the new sample belongs to class 0. Given the dataset, class 0 presumably corresponds to "not fraud" and class 1 to "fraud".

  • Probabilities (class 0, class 1): [1. 0.]:

    • 1.0: the predicted probability that the sample belongs to class 0 (not fraud) is 1.0 (100%).
    • 0.0: the predicted probability that the sample belongs to class 1 (fraud) is 0.0 (0%).

So [1.0, 0.0] means the model is entirely confident that this sample is not fraud, assigning no probability at all to the fraud class.
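Those probabilities come from the trees themselves: scikit-learn's `predict_proba` averages the per-tree class probabilities, and with fully grown trees (pure leaves, the default) this equals the fraction of trees voting for each class. A small sketch on synthetic data, inspecting the fitted trees via the `estimators_` attribute:

```python
# Sketch: predict_proba as an average over the individual trees' votes.
# Synthetic data; with pure leaves the averaged probability equals the vote fraction.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.default_rng(0).normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

sample = X[:1]
tree_votes = np.array([t.predict(sample)[0] for t in rf.estimators_])
print("fraction of trees voting class 1:", tree_votes.mean())
print("predict_proba:", rf.predict_proba(sample)[0])
```

An output of [1. 0.] therefore simply means that every tree in the forest voted for class 0.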

Combining ① and ②:

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Load the dataset
data = pd.read_csv('电信银行卡诈骗的数据分析.csv')

# Split the data into features (X) and target (y)
X = data.drop(columns=['Fraud'])
y = data['Fraud']

# Split the data into training, validation, and test sets (60% train, 20% validation, 20% test)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Train a Random Forest classifier
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_train, y_train)

# Predict on training, validation, and test sets
y_train_pred = rf_classifier.predict(X_train)
y_val_pred = rf_classifier.predict(X_val)
y_test_pred = rf_classifier.predict(X_test)

# Define a function to calculate metrics
def calculate_metrics(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    specificity = tn / (tn + fp)
    return accuracy, precision, recall, f1, specificity

# Metrics for training, validation, and test sets
train_metrics = calculate_metrics(y_train, y_train_pred)
val_metrics = calculate_metrics(y_val, y_val_pred)
test_metrics = calculate_metrics(y_test, y_test_pred)

# Display the results in a DataFrame
metrics_df = pd.DataFrame({
    'Dataset': ['Train', 'Validation', 'Test'],
    'Accuracy': [train_metrics[0], val_metrics[0], test_metrics[0]],
    'Precision': [train_metrics[1], val_metrics[1], test_metrics[1]],
    'Recall': [train_metrics[2], val_metrics[2], test_metrics[2]],
    'F1 Score': [train_metrics[3], val_metrics[3], test_metrics[3]],
    'Specificity': [train_metrics[4], val_metrics[4], test_metrics[4]]
})

# Display the DataFrame
print(metrics_df)

# ---- Testing on New Data ----
# Example of a new sample to test
new_sample = pd.DataFrame({
    'Distance1': [10],    # Replace with appropriate values
    'Distance2': [0.5],   # Replace with appropriate values
    'Ratio': [1.5],       # Replace with appropriate values
    'Repeat': [1],        # Replace with appropriate values
    'Card': [1],          # Replace with appropriate values
    'Pin': [0],           # Replace with appropriate values
    'Online': [1]         # Replace with appropriate values
})

# Predict the probability for the new sample
probabilities = rf_classifier.predict_proba(new_sample)

# Output the predicted class and probabilities
predicted_class = rf_classifier.predict(new_sample)
print(f"Predicted Class: {predicted_class[0]}")
print(f"Probabilities (class 0, class 1): {probabilities[0]}")

③ Upgrade: export the trained model so it can be reused later without retraining

Save the model as "random_forest_model.pkl":

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import joblib  # For saving and loading the model

# Load the dataset
data = pd.read_csv('电信银行卡诈骗的数据分析.csv')

# Split the data into features (X) and target (y)
X = data.drop(columns=['Fraud'])
y = data['Fraud']

# Split the data into training, validation, and test sets (60% train, 20% validation, 20% test)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Train a Random Forest classifier
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_train, y_train)

# Save the trained model to a file
model_filename = "random_forest_model.pkl"
joblib.dump(rf_classifier, model_filename)

# ---- Optional: You can print or calculate model metrics if necessary ----
def calculate_metrics(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    specificity = tn / (tn + fp)
    return accuracy, precision, recall, f1, specificity

# Predict and calculate metrics
y_train_pred = rf_classifier.predict(X_train)
train_metrics = calculate_metrics(y_train, y_train_pred)
print("Training Set Metrics:", train_metrics)

# Model is saved; ready for later use

④ Upgrade: load the saved model and run a test

import joblib
import pandas as pd

# Load the trained model from the file
model_filename = "random_forest_model.pkl"
rf_classifier = joblib.load(model_filename)

# ---- Testing on New Data ----
# Example of a new sample to test
new_sample = pd.DataFrame({
    'Distance1': [10],    # Replace with appropriate values
    'Distance2': [0.5],   # Replace with appropriate values
    'Ratio': [1.5],       # Replace with appropriate values
    'Repeat': [1],        # Replace with appropriate values
    'Card': [1],          # Replace with appropriate values
    'Pin': [0],           # Replace with appropriate values
    'Online': [1]         # Replace with appropriate values
})

# Predict the probability for the new sample
probabilities = rf_classifier.predict_proba(new_sample)

# Output the predicted class and probabilities
predicted_class = rf_classifier.predict(new_sample)
print(f"Predicted Class: {predicted_class[0]}")
print(f"Probabilities (class 0, class 1): {probabilities[0]}")

Part 4: How It Works (Regression)

Not yet written.

Part 5: Code Implementation (Regression)

Not yet written.
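Until this part is written, here is a minimal sketch of what the regression counterpart might look like (RandomForestRegressor on synthetic data, evaluated with MSE and R²; the data and column meanings are illustrative only):

```python
# Minimal regression sketch: each tree predicts a number, the forest averages them.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real regression dataset (e.g. house prices).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf_reg = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)
y_pred = rf_reg.predict(X_test)

print("MSE:", mean_squared_error(y_test, y_pred))
print("R2:", r2_score(y_test, y_pred))
```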

Part 6: Resources

(1) The "telecom bank-card fraud" dataset used above:

Shared via netdisk: 电信银行卡诈骗的数据分析.zip
Link: https://pan.baidu.com/s/1Y25ZddKQ8YFgd31y96al0g?pwd=2dxk  Extraction code: 2dxk

That's all; I hope this helps!
