Auto Insurance Fraud Prediction


Tags: python, machine learning

Original article: https://medium.com/@neel.roy/auto-insurance-fraud-prediction-f3e7cfba8f1d


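This article walks through an analysis for predicting auto insurance fraud. The dataset contains information on fraudulent claims from January to March 2015 in Ohio, Indiana, and Illinois. The data is imbalanced: non-fraudulent cases far outnumber fraudulent ones. EDA shows that fraud is especially prevalent in certain states and that particular hobbies are associated with fraudulent behavior. Missing values are handled with MICE imputation. Finally, a model built with XGBoost performs well at identifying fraud, with an F1 score and recall of 63.55 and 61.99 respectively, an AUC of 85.51%, and a test-set accuracy of 82%.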


Dear Readers, this is my very first article on Medium. It is about auto insurance fraud prediction. Fraud prediction datasets are usually imbalanced, with far more legitimate claims than fraudulent ones.


Problem Statement:


These days, many insurance companies have to deal with fraudulent claims. Fraud can occur at different stages: either when the proposal is filled out or at the time of the claim, for example by staging an accident or claiming for pre-existing damage. Fraud is committed for personal gain. The dataset I worked on is imbalanced, with legitimate claims far outnumbering fraudulent ones. According to the FBI, non-health insurance fraud costs an estimated $40 billion per year, which raises premiums for the average U.S. family by between $400 and $700 annually.


About the Dataset:


The dataset has 1000 observations with 39 features. It contains information about fraudulent claims from 01-Jan-2015 to 01-March-2015 in the states of Ohio, Indiana, and Illinois. The data does not name the insurance company, so we do not know whether it comes from a single insurer or from several. The obvious drawback of this dataset is that it has only 1000 observations.


EDA (Exploratory Data Analysis)


The dataset has 1000 observations and 39 features, with the fraud reported column as the dependent variable (the variable we wish to predict). The dependent variable has 753 non-fraudulent cases and 247 fraudulent cases.
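To make the imbalance concrete, here is a minimal sketch that rebuilds the target column from the two counts given above and reports each class's share. The `"N"`/`"Y"` labels and the helper name `class_balance` are assumptions made for illustration; the real column may encode the classes differently.

```python
from collections import Counter

# Counts reported in the article: 753 non-fraudulent and 247 fraudulent claims.
# The "N"/"Y" labels are an assumption made for this illustration.
labels = ["N"] * 753 + ["Y"] * 247


def class_balance(values):
    """Return {label: (count, share_of_total)} for a sequence of class labels."""
    counts = Counter(values)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}


balance = class_balance(labels)
for label, (n, share) in sorted(balance.items()):
    print(f"{label}: {n} claims ({share:.1%})")

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(share for _, share in balance.values())
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
```

A model that labels every claim as non-fraudulent would already be right about three times out of four while catching no fraud at all, which is why recall and F1 are more informative than accuracy alone on data like this.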
