[Python] Machine-Learning-Based Financial Data Analysis: Detecting Financial Fraud


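Preface:

This article uses the official data released for the 2021 Teddy Cup (泰迪杯) competition.

The code in Chapter 1 shows how to derive the data set used in the case study from the competition data,

while Chapter 2 focuses on how to analyze the financial data produced in Chapter 1.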


Chapter 1 Generating the Financial Analysis Data

1.1 Loading the official competition data

# load libraries
import pandas as pd
import numpy as np


# load data
df = pd.read_csv('附件2(样例数据).csv')
df.info()
df
[Output: preview of df. Leading columns: TICKER_SYMBOL, ACT_PUBTIME, PUBLISH_DATE, END_DATE_REP, END_DATE, REPORT_TYPE, FISCAL_PERIOD, MERGED_FLAG, ACCOUTING_STANDARDS, CURRENCY_CD. The trailing columns of the preview are shown below.]

      ...  CA_TURNOVER  OPER_CYCLE  INVEN_TURNOVER  FA_TURNOVER  TFA_TURNOVER   DAYS_AP  DAYS_INVEN  TA_TURNOVER  AR_TURNOVER  FLAG
0     ...          NaN         NaN             NaN          NaN           NaN       NaN         NaN          NaN          NaN     0
1     ...          NaN         NaN             NaN          NaN           NaN       NaN         NaN          NaN          NaN     0
2     ...          NaN         NaN             NaN          NaN           NaN       NaN         NaN          NaN          NaN     0
3     ...       5.3401      3.2947             NaN    4009.2584     4402.6179    1.2731         NaN       4.6236     109.2667     0
4     ...          NaN         NaN             NaN          NaN           NaN       NaN         NaN          NaN          NaN     0
...   ...          ...         ...             ...          ...           ...       ...         ...          ...          ...   ...
3558  ...       0.0601   6785.7245          0.0535      32.8540       94.7623   76.3016   6732.3615       0.0563       6.7463     0
3559  ...          NaN         NaN             NaN          NaN           NaN       NaN         NaN          NaN          NaN     0
3560  ...          NaN         NaN             NaN          NaN           NaN       NaN         NaN          NaN          NaN     0
3561  ...          NaN         NaN             NaN          NaN           NaN       NaN         NaN          NaN          NaN     0
3562  ...       1.2648     96.8794          4.7511       0.2166        0.2019  169.4288     75.7725       0.1492      17.0560     0

3563 rows × 363 columns

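1.2 Data processing

df1 = df.copy()
# Drop every column that contains at least one NaN (363 -> 34 columns)
df2 = df1.dropna(axis=1)
# No NaN is left, so dropping rows with NaN keeps all 3563 rows; this call only displays the result
df2.dropna(axis=0)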
[Output: preview of the NaN-free frame. Leading columns: TICKER_SYMBOL, ACT_PUBTIME, PUBLISH_DATE, END_DATE_REP, END_DATE, REPORT_TYPE, FISCAL_PERIOD, MERGED_FLAG, ACCOUTING_STANDARDS, CURRENCY_CD; trailing columns: REVENUE, T_REVENUE, T_PROFIT, OPERATE_PROFIT, COMPR_INC_ATTR_P, T_COMPR_INCOME, N_INCOME_ATTR_P, N_INCOME, T_COGS, FLAG.]

3563 rows × 34 columns
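The two dropna calls above behave quite differently, which is easy to miss. The following is a minimal sketch on a made-up three-row frame (the column names "a" and "b" and all values are hypothetical, not taken from the competition data) showing what each axis removes and that dropna returns a new frame instead of modifying the original.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values: column "b" has one missing entry.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [4.0, np.nan, 6.0],
    "FLAG": [0, 1, 0],
})

# axis=1 drops every column that contains a NaN: "b" disappears, all rows stay.
cols_dropped = toy.dropna(axis=1)

# axis=0 drops every row that contains a NaN: row 1 disappears, all columns stay.
rows_dropped = toy.dropna(axis=0)

print(cols_dropped.shape)
print(rows_dropped.shape)

# dropna returns a new object; toy itself is unchanged unless it is reassigned.
print(toy.shape)
```

Dropping columns is the more aggressive choice on a wide frame: a single missing cell removes the whole indicator for every company, which is how 363 columns shrink to 34 here.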

df2.to_excel('winner.xlsx', index=False)
# Drop the first 11 columns and keep the remaining 23 (FLAG is the last one)
df3 = df2.iloc[:, 11:]
df3
       FIXED_ASSETS       T_ASSETS         T_LIAB  T_EQUITY_ATTR_P    T_SH_EQUITY  T_LIAB_EQUITY  N_CF_OPERATE_A  C_PAID_FOR_OTH_OP_A  C_INF_FR_OPERATE_A  N_CHANGE_IN_CASH  ...
0      1.384757e+09   3.360603e+09   9.086384e+08     2.451965e+09   2.451965e+09   3.360603e+09   -1.531725e+08         8.092480e+07        5.484389e+08     -5.863646e+07  ...
1      3.550098e+08   1.801029e+09   4.622409e+08     1.338788e+09   1.338788e+09   1.801029e+09    1.530939e+08         3.725100e+07        1.059183e+09     -2.576530e+07  ...
2      1.014511e+09   9.033789e+09   4.241620e+09     4.792168e+09   4.792168e+09   9.033789e+09    8.985383e+08         1.011182e+09        7.778336e+09     -7.102159e+08  ...
3      2.592562e+09   1.445027e+10   9.246911e+09     5.049558e+09   5.203356e+09   1.445027e+10    4.287553e+08         3.640225e+09        4.261880e+10      2.915535e+08  ...
4      4.234430e+07   5.389908e+08   2.419290e+08     2.970618e+08   2.970618e+08   5.389908e+08    3.351069e+07         2.934517e+07        4.872505e+08      1.073503e+07  ...
...             ...            ...            ...              ...            ...            ...             ...                  ...                 ...               ...  ...
3558   1.542683e+08   7.881017e+08   3.139182e+08     4.168727e+08   4.741834e+08   7.881017e+08   -5.238869e+06         1.838851e+07        1.844894e+08     -9.722585e+07  ...
3559   2.945962e+08   8.894905e+08   2.485487e+08     6.370977e+08   6.409417e+08   8.894905e+08    8.333100e+07         7.960547e+07        8.047397e+08     -8.111294e+07  ...
3560   3.435635e+09   5.628529e+09   2.599046e+09     2.774693e+09   3.029483e+09   5.628529e+09    4.344513e+08         2.430308e+07        1.084828e+09      8.119653e+07  ...
3561   1.963836e+09   4.936854e+09   3.018880e+09     1.878145e+09   1.917974e+09   4.936854e+09    7.512584e+07         2.379392e+08        1.873552e+09      1.619822e+08  ...
3562   4.760795e+09   3.820173e+10   1.753208e+10     1.818157e+10   2.066965e+10   3.820173e+10    1.621028e+09         3.193049e+09        1.472222e+10      3.384381e+08  ...

      ...        REVENUE      T_REVENUE       T_PROFIT  OPERATE_PROFIT  COMPR_INC_ATTR_P  T_COMPR_INCOME  N_INCOME_ATTR_P       N_INCOME         T_COGS  FLAG
0     ...   6.144704e+08   6.144704e+08  -1.081696e+08   -1.229547e+08     -9.021130e+07   -9.021130e+07    -9.021130e+07  -9.021130e+07   7.362240e+08     0
1     ...   1.103340e+09   1.103340e+09   1.440264e+08    1.417321e+08      1.173408e+08    1.173408e+08     1.173408e+08   1.173408e+08   9.609019e+08     0
2     ...   6.768557e+09   6.768557e+09   5.501612e+08    4.160823e+08      4.427106e+08    4.427106e+08     4.406464e+08   4.406464e+08   6.374538e+09     0
3     ...   3.649249e+10   3.649249e+10   4.442545e+08    4.178863e+08      3.603373e+08    3.590776e+08     3.582523e+08   3.569925e+08   3.610338e+10     0
4     ...   4.090022e+08   4.090022e+08   5.522668e+07    5.422161e+07      4.059508e+07    4.059508e+07     4.059508e+07   4.059508e+07   3.547806e+08     0
...   ...            ...            ...            ...             ...               ...             ...              ...            ...            ...   ...
3558  ...   1.916369e+08   1.916369e+08  -5.854943e+07   -6.549837e+07     -5.698415e+07   -5.888314e+07    -5.698415e+07  -5.888314e+07   2.563552e+08     0
3559  ...   7.584324e+08   7.584324e+08   4.562972e+07    4.016267e+07      3.719758e+07    3.704166e+07     3.719758e+07   3.704166e+07   7.174872e+08     0
3560  ...   8.707040e+08   8.707040e+08   1.657144e+08    1.204156e+08      1.428394e+08    1.623189e+08     1.422310e+08   1.615584e+08   7.586088e+08     0
3561  ...   1.345086e+09   1.345086e+09   8.585497e+07    7.220415e+07      7.851344e+07    7.895948e+07     7.500529e+07   7.545319e+07   1.275262e+09     0
3562  ...   1.260865e+10   1.260865e+10   3.371832e+09    3.297172e+09      2.258473e+09    2.677414e+09     2.460094e+09   2.870661e+09   1.165588e+10     0

3563 rows × 23 columns

df3.to_excel('raw_data.xlsx',index = False)

Chapter 2 Machine-Learning-Based Financial Data Analysis: Detecting Financial Fraud

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_excel('raw_data.xlsx')
df
[Output: preview of the reloaded frame, identical to the df3 preview at the end of section 1.2: columns FIXED_ASSETS through N_CHANGE_IN_CASH, then REVENUE through T_COGS, with FLAG last.]

3563 rows × 23 columns

# Features are every column except the last; the label is the last column, FLAG
X, y = df.iloc[:, :-1], df.iloc[:, -1]
y
0       0
1       0
2       0
3       0
4       0
       ..
3558    0
3559    0
3560    0
3561    0
3562    0
Name: FLAG, Length: 3563, dtype: int64
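Every FLAG value visible in the preview is 0, so it may be worth checking how rare the positive class is before splitting. The sketch below uses a synthetic label column with a made-up 95:5 ratio (the real ratio is not stated in this article) and shows how the stratify argument of train_test_split keeps that ratio in both parts. The names X_demo and y_demo are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for FLAG: 95 zeros and 5 ones (hypothetical ratio).
y_demo = pd.Series([0] * 95 + [1] * 5, name="FLAG")
X_demo = pd.DataFrame({"x": np.arange(100)})

# Share of each class; on the real data this would be y.value_counts(normalize=True).
print(y_demo.value_counts(normalize=True))

# stratify=y_demo asks the splitter to preserve the class proportions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Number of positive samples that ended up in each part.
print(int(y_tr.sum()), int(y_te.sum()))
```

Without stratification, a random 20% test set drawn from a very imbalanced label can end up with few or no positive samples, which makes the test metrics unstable.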
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize the features; the scaler is fitted on X_train only
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train_std = ss.fit_transform(X_train)
# Reuse the training mean and variance on the test set instead of refitting
X_test_std = ss.transform(X_test)
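The reason the scaler should learn its statistics from the training set alone can be seen on synthetic data. This is a small sketch under assumed inputs (train_demo and test_demo are random arrays invented for the illustration): transforming the test set with the training statistics leaves its mean close to, but not exactly, zero, while refitting on the test set forces the mean to zero and so hides any difference between the two sets.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features (hypothetical): same distribution, different sample sizes.
rng = np.random.default_rng(0)
train_demo = rng.normal(loc=100.0, scale=10.0, size=(200, 3))
test_demo = rng.normal(loc=100.0, scale=10.0, size=(50, 3))

# Fit on the training data only, then apply the same transform to both sets.
scaler = StandardScaler().fit(train_demo)
train_std = scaler.transform(train_demo)
test_std = scaler.transform(test_demo)

# Training columns have mean 0 and unit variance by construction.
print(train_std.mean(axis=0).round(6), train_std.std(axis=0).round(6))

# Test columns are only approximately centered, which is expected.
print(test_std.mean(axis=0).round(3))

# Refitting on the test set uses test statistics and always centers it exactly.
refit = StandardScaler().fit_transform(test_demo)
print(refit.mean(axis=0).round(6))
```

In a real workflow the test set plays the role of unseen data, so any statistic computed from it leaks information into the evaluation.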
#---------------  Modeling

# SVM Classifier  
def svm_classifier(train_x, train_y):  
    from sklearn.svm import SVC  
    model = SVC(kernel='rbf', probability=True)  
    model.fit(train_x, train_y)  
    return model 


# KNN Classifier  
def knn_classifier(train_x, train_y):  
    from sklearn.neighbors import KNeighborsClassifier  
    model = KNeighborsClassifier()  
    model.fit(train_x, train_y)  
    return model  
    
# Logistic Regression Classifier  
def logistic_regression_classifier(train_x, train_y):  
    from sklearn.linear_model import LogisticRegression  
    model = LogisticRegression(penalty='l2')  
    model.fit(train_x, train_y)  
    return model 

# DT
def dt(train_x, train_y):  
    from sklearn import tree
    model = tree.DecisionTreeClassifier()
    model.fit(train_x, train_y)  
    return model

# nn
def nn(train_x, train_y):  
    from sklearn.neural_network import MLPClassifier
    model = MLPClassifier(solver='lbfgs', alpha=1e-5,hidden_layer_sizes=(5, 2), random_state=1)
    model.fit(train_x, train_y)  
    return model


train_x = X_train_std
train_y = y_train
model_svc = svm_classifier(train_x, train_y)
model_knn = knn_classifier(train_x, train_y)
model_logistic =  logistic_regression_classifier(train_x, train_y)
model_dt = dt(train_x, train_y)
model_nn = nn(train_x, train_y)
# ----------
y_svc = model_svc.predict(X_test_std)
y_knn = model_knn.predict(X_test_std)
y_logistic = model_logistic.predict(X_test_std)
y_dt = model_dt.predict(X_test_std)
y_nn = model_nn.predict(X_test_std)
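The SVC above is created with probability=True, but only predict is called, which applies the default decision rule. As a hedged illustration of what the probability output could be used for, the sketch below trains an SVC on synthetic data from make_classification (not the competition data) and compares two decision thresholds; the 0.2 threshold is an arbitrary value chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic imbalanced data (hypothetical): roughly 90% class 0, 10% class 1.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, weights=[0.9, 0.1], random_state=0
)

# probability=True enables predict_proba (fitted with internal cross-validation).
clf = SVC(kernel="rbf", probability=True, random_state=0).fit(X_demo, y_demo)

# Column 1 holds the estimated probability of class 1.
proba = clf.predict_proba(X_demo)[:, 1]

# A lower threshold flags at least as many samples as the 0.5 threshold.
flag_default = (proba >= 0.5).astype(int)
flag_sensitive = (proba >= 0.2).astype(int)
print(flag_default.sum(), flag_sensitive.sum())
```

Lowering the threshold trades precision for recall, which can matter when missing a fraud case is more costly than a false alarm.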
# Result analysis

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Test-set predictions of the five models fitted above
predictions = {'SVC': y_svc, 'KNN': y_knn, 'Logistic': y_logistic, 'DT': y_dt, 'NN': y_nn}

for name, y_pred in predictions.items():
    print(name, 'accuracy:', accuracy_score(y_test, y_pred))
    for avg in ('macro', 'micro'):
        print('  ', avg, 'precision:', precision_score(y_test, y_pred, average=avg, zero_division=0))
        print('  ', avg, 'recall:', recall_score(y_test, y_pred, average=avg, zero_division=0))
        print('  ', avg, 'F1:', f1_score(y_test, y_pred, average=avg, zero_division=0))
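Macro and micro averages can tell very different stories on imbalanced labels. The following worked example uses ten made-up labels (8 negative, 2 positive) and a classifier that always predicts 0, so the arithmetic can be checked by hand; the variable names are illustrative.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 8 non-fraud, 2 fraud; the "model" predicts 0 for everyone.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

# 8 of 10 predictions are correct.
acc = accuracy_score(y_true, y_pred)

# Micro averaging pools all samples, so micro recall equals accuracy here.
micro_recall = recall_score(y_true, y_pred, average="micro")

# Macro averaging weights each class equally: (recall_0 + recall_1) / 2 = (1 + 0) / 2.
macro_recall = recall_score(y_true, y_pred, average="macro")

# Class 0 precision is 8/10; class 1 has no predictions and counts as 0.
macro_precision = precision_score(y_true, y_pred, average="macro", zero_division=0)

# Class 0 F1 is 2 * 0.8 * 1.0 / 1.8; class 1 F1 is 0.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(acc, micro_recall, macro_recall, macro_precision, round(macro_f1, 4))
```

The accuracy and micro scores look acceptable even though no fraud case is found, while the macro scores expose the failure on the minority class.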
# Error evaluation
from sklearn.metrics import confusion_matrix

# svc
C=confusion_matrix(y_test, y_svc)
# df=pd.DataFrame(C,index=["No fraud", "Fraud"],columns=["No fraud", "Fraud"])
# sns.heatmap(df,annot=True)
plt.matshow(C, cmap=plt.cm.Greens) 
plt.colorbar()
for i in range(len(C)): 
    for j in range(len(C)):
        # matshow puts row i on the y-axis and column j on the x-axis
        plt.annotate(str(C[i, j]), xy=(j, i), horizontalalignment='center', verticalalignment='center')
plt.ylabel('True label')
plt.xlabel('Predicted label') 
plt.title('SVC')
plt.show()

# knn
C=confusion_matrix(y_test, y_knn)
# df=pd.DataFrame(C,index=["No fraud", "Fraud"],columns=["No fraud", "Fraud"])
# sns.heatmap(df,annot=True)
plt.matshow(C, cmap=plt.cm.Greens) 
plt.colorbar()
for i in range(len(C)): 
    for j in range(len(C)):
        # matshow puts row i on the y-axis and column j on the x-axis
        plt.annotate(str(C[i, j]), xy=(j, i), horizontalalignment='center', verticalalignment='center')
plt.ylabel('True label')
plt.xlabel('Predicted label') 
plt.title('KNN')
plt.show()


# log
C=confusion_matrix(y_test, y_logistic)
# df=pd.DataFrame(C,index=["No fraud", "Fraud"],columns=["No fraud", "Fraud"])
# sns.heatmap(df,annot=True)
plt.matshow(C, cmap=plt.cm.Greens) 
plt.colorbar()
for i in range(len(C)): 
    for j in range(len(C)):
        # matshow puts row i on the y-axis and column j on the x-axis
        plt.annotate(str(C[i, j]), xy=(j, i), horizontalalignment='center', verticalalignment='center')
plt.ylabel('True label')
plt.xlabel('Predicted label') 
plt.title('Logistic')
plt.show()


# dt
C=confusion_matrix(y_test, y_dt)
# df=pd.DataFrame(C,index=["No fraud", "Fraud"],columns=["No fraud", "Fraud"])
# sns.heatmap(df,annot=True)
plt.matshow(C, cmap=plt.cm.Greens) 
plt.colorbar()
for i in range(len(C)): 
    for j in range(len(C)):
        # matshow puts row i on the y-axis and column j on the x-axis
        plt.annotate(str(C[i, j]), xy=(j, i), horizontalalignment='center', verticalalignment='center')
plt.ylabel('True label')
plt.xlabel('Predicted label') 
plt.title('dt')
plt.show()



# nn
C=confusion_matrix(y_test, y_nn)
# df=pd.DataFrame(C,index=["No fraud", "Fraud"],columns=["No fraud", "Fraud"])
# sns.heatmap(df,annot=True)
plt.matshow(C, cmap=plt.cm.Greens) 
plt.colorbar()
for i in range(len(C)): 
    for j in range(len(C)):
        # matshow puts row i on the y-axis and column j on the x-axis
        plt.annotate(str(C[i, j]), xy=(j, i), horizontalalignment='center', verticalalignment='center')
plt.ylabel('True label')
plt.xlabel('Predicted label') 
plt.title('nn')
plt.show()
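The four cells of a binary confusion matrix can also be read off numerically, which is handy when comparing the five plots. Below is a minimal sketch on eight made-up labels, assuming FLAG = 1 marks a fraud case; with labels=[0, 1], scikit-learn places true labels on the rows and predicted labels on the columns, so ravel() yields TN, FP, FN, TP in that order.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 is assumed to mean fraud, 0 no fraud.
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 1, 0, 0, 1]

# Rows are true labels, columns are predicted labels, both ordered 0 then 1.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

# Flattening row by row gives TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

# Recall: share of real fraud cases that were caught.
recall = tp / (tp + fn)
# Precision: share of flagged cases that were really fraud.
precision = tp / (tp + fp)
print(tn, fp, fn, tp, round(recall, 3), round(precision, 3))
```

For fraud detection the bottom row (FN and TP) is usually the part to watch, since a model can reach high accuracy while leaving TP at zero.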

[Figures: confusion-matrix plots produced by the code above, one each for SVC, KNN, Logistic, dt, and nn]
