Open Release: Hands-On Code for Automated Data Mining and Analysis (Free to Claim)

Contents

Project Introduction

Project Guide

Data Reading

Data Exploration

Data Processing

Data Splitting

Data Modeling


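Project Introduction

This project aims to automate the full data mining and analysis workflow, helping users uncover the value in their data faster and more accurately.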

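Readers who subscribe to any one of the following columns:

Natural Language Processing & Large Models

Data Analysis & Large Models

Machine Learning & Large Models

can claim the complete hands-on code for automated data mining and analysis for free. The Git repository is:

https://github.com/JavaLeb/auto_cda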

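The automated data analysis workflow is divided into the following modules:

1. Data ingestion

2. Data exploration

3. Data processing

4. Data splitting

5. Data modeling (model selection, model building, model evaluation, model tuning, and model prediction).

Part of the project structure is shown in the following figure (image not preserved):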


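Project Guide

The project ships with several examples for readers to use. The PM2.5 prediction example is used below. The complete code comes first (see example/pm25_example.py):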

from data_reader import DataReader
from data_explorer import DataExplorer
from data_splitter import DataSplitter
from data_processor import DataProcessor
from data_modeler import DataModeler
from data_configuration import Configuration


def auto_pm25():
    conf = Configuration(conf_path=r'../conf/pm25_ml_config.yml')

    # Read the training data.
    data_reader = DataReader(ds_type='file', conf=conf)
    data = data_reader.read()

    # Explore the raw data.
    data_explorer = DataExplorer(data, conf=conf)
    data_explorer.explore()

    # Process the data.
    data_processor = DataProcessor(conf=conf)
    data = data_processor.process(data)

    # Explore the data again after processing.
    data_explorer = DataExplorer(data, conf=conf)
    data_explorer.explore()

    # Split the data; the splitter returns lists, so take the first pair.
    data_splitter = DataSplitter(conf=conf)
    train_data_list, valid_data_list = data_splitter.split(data)
    train_data, valid_data = train_data_list[0], valid_data_list[0]

    # Build the models.
    data_modeler = DataModeler(conf=conf)
    data_modeler.model(train_data, valid_data)

    # Read the test data.
    test_data = data_reader.read(train=False)
    # Explore the test data.
    data_explorer = DataExplorer(test_data, conf=conf)
    data_explorer.explore()

    # Process the test data with the same processor.
    test_processed_data = data_processor.process(test_data)

    # Predict with the model and save the predictions.
    data_modeler.save_predict(test_data, test_processed_data)


if __name__ == '__main__':
    auto_pm25()

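Data Reading

The data lives in the project directory data/pm25, which contains the training data pm25_train.csv and the test data pm25_test.csv. The content of the training file is shown in the following figure (image not preserved):

With this project, reading the data takes three lines of code: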

    conf = Configuration(conf_path=r'../conf/pm25_ml_config.yml')

    # Read the training data.
    data_reader = DataReader(ds_type='file', conf=conf)
    data = data_reader.read()
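
To make the role of the `train` flag in `read()` more concrete, here is a minimal sketch of a file-based reader. It is not the project's `DataReader`: the class name `CsvReader`, its constructor parameters, and the toy rows are all assumptions made for illustration, and the standard `csv` module stands in for whatever the project uses internally. The toy test file omits the target column, which is also an assumption.

```python
import csv
import io


class CsvReader:
    """Hypothetical stand-in for a file-based reader (not the project's DataReader)."""

    def __init__(self, train_source, test_source):
        # Each source is a callable that returns an open text stream,
        # e.g. lambda: open('data/pm25/pm25_train.csv', newline='').
        self.train_source = train_source
        self.test_source = test_source

    def read(self, train=True):
        # Pick the training or the test source depending on the flag.
        source = self.train_source if train else self.test_source
        with source() as stream:
            return list(csv.DictReader(stream))


# Toy in-memory CSV content (made-up rows) so the sketch runs without the real files.
TRAIN_CSV = "hour,TEMP,pm2.5\n0,-4.0,129\n1,-4.0,148\n"
TEST_CSV = "hour,TEMP\n2,-5.0\n"

reader = CsvReader(lambda: io.StringIO(TRAIN_CSV), lambda: io.StringIO(TEST_CSV))
train_rows = reader.read()
test_rows = reader.read(train=False)
print(len(train_rows), len(test_rows))
```

Keeping one reader object for both files mirrors the full example, where the same `data_reader` is reused later with `train=False`.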

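Data Exploration

Data exploration analyzes distributions and relationships between variables before the data is processed. With this project it takes two lines of code: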

    data_explorer = DataExplorer(data, conf=conf)
    data_explorer.explore()

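Partial output is shown below. The feature columns in it are already standardized and cbwd is already one-hot encoded, so this output appears to come from the exploration run after data processing: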

========================================================================================================================================================================================================
First 10 rows (shape: (33746, 13))
|    |        date |       hour |   pm2.5 |      DEWP |       TEMP |       PRES |        Iws |         Is |        Ir |   cbwd_NE |   cbwd_NW |   cbwd_SE |   cbwd_cv |
|---:|------------:|-----------:|--------:|----------:|-----------:|-----------:|-----------:|-----------:|----------:|----------:|----------:|----------:|----------:|
|  0 | -0.093972   |  1.22689   |      15 |  0.849475 |  0.62083   | -0.330651  |  0.180401  | -0.0725599 | -0.136961 |  -0.3597  |   1.4593  | -0.734081 | -0.524471 |
|  1 |  0.516237   | -1.66044   |      50 | -0.196481 | -0.283649  | -1.30562   |  0.780893  | -0.0725599 | -0.136961 |  -0.3597  |  -0.68526 |  1.36225  | -0.524471 |
|  2 | -1.10593    | -1.37171   |     112 | -0.893786 | -1.4348    | -0.0381611 | -0.447357  | -0.0725599 | -0.136961 |  -0.3597  |  -0.68526 | -0.734081 |  1.90668  |
|  3 |  0.394954   | -1.22734   |      20 | -1.45163  | -1.76371   |  1.71678   | -0.310817  | -0.0725599 | -0.136961 |   2.78009 |  -0.68526 | -0.734081 | -0.524471 |
|  4 |  0.823237   |  1.08252   |      19 | -0.266212 |  0.62083   |  0.0593356 |  0.0985994 | -0.0725599 | -0.136961 |  -0.3597  |  -0.68526 |  1.36225  | -0.524471 |
|  5 |  0.571194   | -1.51607   |      79 |  0.152171 |  0.538605  | -0.330651  | -0.465468  | -0.0725599 | -0.136961 |   2.78009 |  -0.68526 | -0.734081 | -0.524471 |
|  6 |  1.55284    | -0.361142  |      62 |  0.221901 |  0.0452525 | -0.233154  | -0.447154  | -0.0725599 | -0.136961 |   2.78009 |  -0.68526 | -0.734081 | -0.524471 |
|  7 | -0.00490415 |  1.08252   |      32 |  1.05867  |  1.44308   | -1.20812   | -0.310614  | -0.0725599 | -0.136961 |  -0.3597  |  -0.68526 |  1.36225  | -0.524471 |
|  8 |  1.13403    | -0.0724098 |     196 | -0.475403 | -0.612551  |  0.741812  | -0.429043  | -0.0725599 | -0.136961 |  -0.3597  |  -0.68526 | -0.734081 |  1.90668  |
|  9 |  0.093639   | -0.938607  |      25 |  0.710014 |  0.0452525 |  0.0593356 | -0.41073   | -0.0725599 | -0.136961 |  -0.3597  |   1.4593  | -0.734081 | -0.524471 |
========================================================================================================================================================================================================
Variables with |correlation| >= 0.5:
    level_0  level_1     value
2     TEMP     PRES -0.825338
0     DEWP     TEMP  0.822793
1     DEWP     PRES -0.775286
3  cbwd_NW  cbwd_SE -0.503037
========================================================================================================================================================================================================
Data summary:
|    |   total-columns |   total-lines | dtypes      | ftypes               | memory-usage   |   total-missing-field |   total-duplicate-value |
|---:|----------------:|--------------:|:------------|:---------------------|:---------------|----------------------:|------------------------:|
|  0 |              13 |         33746 | float64(13) | 13(VALUE:3,CLASS:10) | 3.3 MB         |                     0 |                       0 |
========================================================================================================================================================================================================
Column summary:
|         |   Non-Null-Count | Dtype   |   Total-Count |   Unique-Count |   Unique-Count-Ratio |   count_desc |    mean_desc |   std_desc |   min_desc |   25%_desc |   50%_desc |    75%_desc |   max_desc | Ftype   |   Missing-Value-Count | First-Missing-Index   |
|:--------|-----------------:|:--------|--------------:|---------------:|---------------------:|-------------:|-------------:|-----------:|-----------:|-----------:|-----------:|------------:|-----------:|:--------|----------------------:|:----------------------|
| date    |            33746 | float64 |         33746 |           1533 |          0.0454276   |        33746 | -1.29713e-15 |    1.00001 | -1.75783   | -0.853891  |  0.0121514 |   0.866823  |    1.69875 | VALUE   |                     0 | None                  |
| hour    |            33746 | float64 |         33746 |             24 |          0.000711195 |        33746 | -2.41087e-17 |    1.00001 | -1.66044   | -0.938607  |  0.0719563 |   0.938153  |    1.65998 | CLASS   |                     0 | None                  |
| pm2.5   |            33746 | float64 |         33746 |            568 |          0.0168316   |        33746 | 98.6642      |   91.9102  |  0         | 29         | 72         | 137         |  994       | VALUE   |                     0 | None                  |
| DEWP    |            33746 | float64 |         33746 |             69 |          0.00204469  |        33746 | -8.94864e-19 |    1.00001 | -2.91597   | -0.824055  |  0.01271   |   0.919206  |    1.8257  | CLASS   |                     0 | None                  |
| TEMP    |            33746 | float64 |         33746 |             63 |          0.00186689  |        33746 |  2.31612e-18 |    1.00001 | -2.58596   | -0.859227  |  0.127478  |   0.867506  |    2.34756 | CLASS   |                     0 | None                  |
| PRES    |            33746 | float64 |         33746 |             58 |          0.00171872  |        33746 |  1.03341e-15 |    1.00001 | -2.37808   | -0.818135  | -0.0381611 |   0.839309  |    2.88674 | CLASS   |                     0 | None                  |
| Iws     |            33746 | float64 |         33746 |           2452 |          0.0726605   |        33746 |  8.35908e-17 |    1.00001 | -0.474421  | -0.447154  | -0.374305  |  -0.0377372 |   11.0234  | VALUE   |                     0 | None                  |
| Is      |            33746 | float64 |         33746 |             27 |          0.000800095 |        33746 |  5.68502e-18 |    1.00001 | -0.0725599 | -0.0725599 | -0.0725599 |  -0.0725599 |   33.4877  | CLASS   |                     0 | None                  |
| Ir      |            33746 | float64 |         33746 |             36 |          0.00106679  |        33746 |  2.21084e-18 |    1.00001 | -0.136961  | -0.136961  | -0.136961  |  -0.136961  |   24.3427  | CLASS   |                     0 | None                  |
| cbwd_NE |            33746 | float64 |         33746 |              2 |          5.92663e-05 |        33746 |  2.73197e-17 |    1.00001 | -0.3597    | -0.3597    | -0.3597    |  -0.3597    |    2.78009 | CLASS   |                     0 | None                  |
| cbwd_NW |            33746 | float64 |         33746 |              2 |          5.92663e-05 |        33746 |  6.41143e-17 |    1.00001 | -0.68526   | -0.68526   | -0.68526   |   1.4593    |    1.4593  | CLASS   |                     0 | None                  |
| cbwd_SE |            33746 | float64 |         33746 |              2 |          5.92663e-05 |        33746 |  5.34813e-17 |    1.00001 | -0.734081  | -0.734081  | -0.734081  |   1.36225   |    1.36225 | CLASS   |                     0 | None                  |
| cbwd_cv |            33746 | float64 |         33746 |              2 |          5.92663e-05 |        33746 |  6.96941e-17 |    1.00001 | -0.524471  | -0.524471  | -0.524471  |  -0.524471  |    1.90668 | CLASS   |                     0 | None                  |
========================================================================================================================================================================================================
Detailed data information:
__________________________________________________
Frequency and proportion analysis for categorical field [hour]:
(number of categories: 24)
        hour  count  proportion
0  -1.227339   1423    0.042168
1   1.659984   1422    0.042138
2   0.360689   1420    0.042079
3   0.649421   1416    0.041961
4  -0.505508   1416    0.041961
5  -1.660438   1416    0.041961
6   1.371252   1414    0.041901
7   0.505055   1413    0.041872
8  -0.649874   1412    0.041842
9  -1.082973   1411    0.041812
10  1.082519   1406    0.041664
11  0.793787   1406    0.041664
12  0.938153   1405    0.041635
13  0.216323   1403    0.041575
14 -0.361142   1402    0.041546
15 -1.516071   1402    0.041546
16 -1.371705   1400    0.041486
17  1.226886   1399    0.041457
18 -0.216776   1399    0.041457
19 -0.794241   1396    0.041368
20 -0.938607   1392    0.041249
21  1.515618   1392    0.041249
22 -0.072410   1391    0.041220
23  0.071956   1390    0.041190
__________________________________________________
Frequency and proportion analysis for categorical field [DEWP]:
(number of categories: 69)
        DEWP  count  proportion
0   1.128397   1105    0.032745
1   1.058667   1019    0.030196
2   1.198127    993    0.029426
3   0.988936    982    0.029100
4   1.267858    929    0.027529
..       ...    ...         ...
64 -2.567316      3    0.000089
65 -2.706777      3    0.000089
66 -2.776508      2    0.000059
67 -2.846238      1    0.000030
68 -2.915969      1    0.000030

[69 rows x 3 columns]
__________________________________________________
Frequency and proportion analysis for categorical field [TEMP]:
(number of categories: 63)
        TEMP  count  proportion
0   0.949732   1190    0.035263
1   0.867506   1150    0.034078
2   0.785281   1115    0.033041
3   0.703056   1084    0.032122
4   1.031957   1062    0.031470
..       ...    ...         ...
58  2.347564      4    0.000119
59  2.183113      2    0.000059
60 -2.585960      2    0.000059
61 -0.256241      1    0.000030
62  0.182295      1    0.000030

[63 rows x 3 columns]
__________________________________________________
Frequency and proportion analysis for categorical field [PRES]:
(number of categories: 58)
        PRES  count  proportion
0  -0.233154   1124    0.033308
1  -1.013128   1118    0.033130
2   0.839309   1110    0.032893
3  -0.428148   1063    0.031500
4   0.644316   1059    0.031381
5  -0.330651   1047    0.031026
6  -0.915631   1034    0.030641
7   0.059336   1019    0.030196
8   0.351826   1019    0.030196
9  -0.135658   1015    0.030078
10 -0.525645    999    0.029604
11  0.156832    997    0.029544
12 -0.720638    997    0.029544
13  0.936806    992    0.029396
14  0.449322    982    0.029100
15 -0.818135    978    0.028981
16  0.741812    971    0.028774
17  1.034302    958    0.028389
18 -0.038161    957    0.028359
19 -1.110625    954    0.028270
20  0.254329    951    0.028181
21 -1.305618    946    0.028033
22 -0.623141    946    0.028033
23  0.546819    936    0.027737
24 -1.208121    927    0.027470
25  1.131799    884    0.026196
26 -1.403115    861    0.025514
27  1.229296    855    0.025336
28  1.326792    736    0.021810
29 -1.500611    619    0.018343
30  1.424289    586    0.017365
31  1.521786    584    0.017306
32 -1.598108    489    0.014491
33  1.619282    477    0.014135
34  1.716779    386    0.011438
35  1.814276    359    0.010638
36 -1.695605    331    0.009809
37 -1.793101    322    0.009542
38  1.911772    228    0.006756
39 -1.890598    176    0.005215
40  2.009269    168    0.004978
41 -1.988095    118    0.003497
42  2.106766    106    0.003141
43 -2.085591     92    0.002726
44  2.204263     75    0.002222
45  2.301759     70    0.002074
46 -2.183088     49    0.001452
47  2.399256     34    0.001008
48 -2.280585     15    0.000444
49  2.496753     15    0.000444
50  2.691746      2    0.000059
51 -2.378081      2    0.000059
52  2.594249      2    0.000059
53  2.886739      2    0.000059
54  0.303077      1    0.000030
55  2.789243      1    0.000030
56  1.294294      1    0.000030
57  1.554285      1    0.000030
__________________________________________________
Frequency and proportion analysis for categorical field [Is]:
(number of categories: 27)
           Is  count  proportion
0   -0.072560  33441    0.990962
1    1.218219     55    0.001630
2    2.508998     33    0.000978
3    3.799777     30    0.000889
4    5.090556     27    0.000800
5    6.381335     25    0.000741
6    7.672114     24    0.000711
7   10.253672     18    0.000533
8    8.962893     17    0.000504
9   11.544450     14    0.000415
10  12.835229     11    0.000326
11  14.126008      8    0.000237
12  15.416787      6    0.000178
13  16.707566      5    0.000148
14  17.998345      5    0.000148
15  19.289124      4    0.000119
16  20.579903      4    0.000119
17  24.452240      3    0.000089
18  21.870682      3    0.000089
19  29.615355      2    0.000059
20  27.033798      2    0.000059
21  25.743019      2    0.000059
22  23.161461      2    0.000059
23  28.324577      2    0.000059
24  32.196913      1    0.000030
25  33.487692      1    0.000030
26  30.906134      1    0.000030
__________________________________________________
Frequency and proportion analysis for categorical field [Ir]:
(number of categories: 36)
           Ir  count  proportion
0   -0.136961  32353    0.958721
1    0.543030    393    0.011646
2    1.223021    245    0.007260
3    1.903012    169    0.005008
4    2.583004    102    0.003023
5    3.262995     89    0.002637
6    3.942986     70    0.002074
7    4.622978     59    0.001748
8    5.302969     43    0.001274
9    5.982960     32    0.000948
10   6.662952     27    0.000800
11   7.342943     24    0.000711
12   8.022934     21    0.000622
13   8.702925     16    0.000474
14   9.382917     12    0.000356
15  11.422891      9    0.000267
16  12.782873      9    0.000267
17  10.742899      9    0.000267
18  13.462864      8    0.000237
19  10.062908      8    0.000237
20  14.822847      7    0.000207
21  15.502838      7    0.000207
22  14.142856      6    0.000178
23  12.102882      6    0.000178
24  16.182830      3    0.000089
25  16.862821      3    0.000089
26  17.542812      2    0.000059
27  20.262777      2    0.000059
28  20.942769      2    0.000059
29  19.582786      2    0.000059
30  18.222804      2    0.000059
31  21.622760      2    0.000059
32  23.662734      1    0.000030
33  24.342725      1    0.000030
34  18.902795      1    0.000030
35  22.982743      1    0.000030
__________________________________________________
Frequency and proportion analysis for categorical field [cbwd_NE]:
(number of categories: 2)
    cbwd_NE  count  proportion
0 -0.359700  29880    0.885438
1  2.780093   3866    0.114562
__________________________________________________
Frequency and proportion analysis for categorical field [cbwd_NW]:
(number of categories: 2)
   cbwd_NW  count  proportion
0 -0.68526  22963    0.680466
1  1.45930  10783    0.319534
__________________________________________________
Frequency and proportion analysis for categorical field [cbwd_SE]:
(number of categories: 2)
    cbwd_SE  count  proportion
0 -0.734081  21929    0.649825
1  1.362247  11817    0.350175
__________________________________________________
Frequency and proportion analysis for categorical field [cbwd_cv]:
(number of categories: 2)
    cbwd_cv  count  proportion
0 -0.524471  26466    0.784271
1  1.906683   7280    0.215729
__________________________________________________
Correlation matrix of the data:
|         |         date |        hour |      pm2.5 |       DEWP |        TEMP |       PRES |         Iws |          Is |          Ir |     cbwd_NE |    cbwd_NW |      cbwd_SE |     cbwd_cv |
|:--------|-------------:|------------:|-----------:|-----------:|------------:|-----------:|------------:|------------:|------------:|------------:|-----------:|-------------:|------------:|
| date    |  1           | -0.00147873 | -0.0255648 |  0.0538985 |  0.0870263  | -0.0258579 | -0.0603007  | -0.026741   | -0.0136163  |  0.00462793 | -0.0439984 |  0.000837741 |  0.0453233  |
| hour    | -0.00147873  |  1          | -0.0287458 | -0.025113  |  0.149418   | -0.0406005 |  0.0654733  | -0.005116   | -0.00859084 | -0.0685987  | -0.122566  |  0.211881    | -0.0536609  |
| pm2.5   | -0.0255648   | -0.0287458  |  1         |  0.16623   | -0.0943591  | -0.0443735 | -0.248153   |  0.0248505  | -0.055547   | -0.0341627  | -0.209001  |  0.0901709   |  0.158811   |
| DEWP    |  0.0538985   | -0.025113   |  0.16623   |  1         |  0.822793   | -0.775286  | -0.289432   | -0.0359728  |  0.127166   | -0.0333757  | -0.333827  |  0.27162     |  0.0892787  |
| TEMP    |  0.0870263   |  0.149418   | -0.0943591 |  0.822793  |  1          | -0.825338  | -0.149866   | -0.096841   |  0.0509266  | -0.0606246  | -0.265371  |  0.307028    | -0.00829014 |
| PRES    | -0.0258579   | -0.0406005  | -0.0443735 | -0.775286  | -0.825338   |  1         |  0.174242   |  0.0716255  | -0.0846744  |  0.0635018  |  0.225794  | -0.24731     | -0.0183287  |
| Iws     | -0.0603007   |  0.0654733  | -0.248153  | -0.289432  | -0.149866   |  0.174242  |  1          |  0.0174077  | -0.00631762 | -0.115665   |  0.360509  | -0.0771978   | -0.229599   |
| Is      | -0.026741    | -0.005116   |  0.0248505 | -0.0359728 | -0.096841   |  0.0716255 |  0.0174077  |  1          | -0.00993791 | -0.00556331 | -0.0221607 |  0.0361405   | -0.012483   |
| Ir      | -0.0136163   | -0.00859084 | -0.055547  |  0.127166  |  0.0509266  | -0.0846744 | -0.00631762 | -0.00993791 |  1          |  0.0407016  |  0.0293042 | -0.0385724   | -0.0200026  |
| cbwd_NE |  0.00462793  | -0.0685987  | -0.0341627 | -0.0333757 | -0.0606246  |  0.0635018 | -0.115665   | -0.00556331 |  0.0407016  |  1          | -0.246488  | -0.264049    | -0.188652   |
| cbwd_NW | -0.0439984   | -0.122566   | -0.209001  | -0.333827  | -0.265371   |  0.225794  |  0.360509   | -0.0221607  |  0.0293042  | -0.246488   |  1         | -0.503037    | -0.359399   |
| cbwd_SE |  0.000837741 |  0.211881   |  0.0901709 |  0.27162   |  0.307028   | -0.24731   | -0.0771978  |  0.0361405  | -0.0385724  | -0.264049   | -0.503037  |  1           | -0.385004   |
| cbwd_cv |  0.0453233   | -0.0536609  |  0.158811  |  0.0892787 | -0.00829014 | -0.0183287 | -0.229599   | -0.012483   | -0.0200026  | -0.188652   | -0.359399  | -0.385004    |  1          |
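
Two parts of the output above, the list of variable pairs with |correlation| >= 0.5 and the per-field frequency and proportion tables, can be sketched in a few lines. This is only an illustration of the idea in dependency-free Python, not the project's `DataExplorer`; the function names `pearson`, `strong_pairs`, and `frequency_table` and the toy numbers are assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strong_pairs(columns, threshold=0.5):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold, strongest first."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


def frequency_table(values):
    """Return (value, count, proportion) rows, most frequent first."""
    total = len(values)
    return [(v, c, c / total) for v, c in Counter(values).most_common()]


# Toy columns (made-up numbers, not the PM2.5 data).
toy = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.5],    # rises almost linearly with a
    "c": [1.0, -1.0, 1.0, -1.0],  # only weakly related to a and b
}
print(strong_pairs(toy))
print(frequency_table(["NW", "SE", "NW", "cv"]))
```

Each pair is reported once and the result is sorted by absolute correlation, which matches the ordering of the high-correlation list in the output (TEMP/PRES first, cbwd_NW/cbwd_SE last).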

Data Processing

The project's data processing step handles data cleaning, transformation, encoding, and so on. The code is:

    # Process the data.
    data_processor = DataProcessor(conf=conf)
    data = data_processor.process(data)
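
The exploration output above hints at what the processing step does to this dataset: cbwd shows up as 0/1 indicator columns (cbwd_NE, cbwd_NW, ...) and every feature has mean close to 0. The following sketch shows one-hot encoding followed by standardization in plain Python. It is an assumption about the kind of transformation involved, not the project's `DataProcessor`, and the helper names are made up for illustration.

```python
import math


def fit_standardizer(values):
    """Learn the mean and population standard deviation (ddof=0) from training values."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return mean, std


def standardize(values, mean, std):
    """Apply previously learned statistics; test data reuses the training ones."""
    return [(v - mean) / std for v in values]


def one_hot(values):
    """Turn a categorical column into one 0/1 column per category."""
    return {cat: [1.0 if v == cat else 0.0 for v in values]
            for cat in sorted(set(values))}


# A 0/1 indicator with the same class counts as cbwd_NE in the output above:
# 29880 zeros and 3866 ones.
indicator = [0.0] * 29880 + [1.0] * 3866
mean, std = fit_standardizer(indicator)
z_zero, z_one = standardize([0.0, 1.0], mean, std)
print(round(z_zero, 4), round(z_one, 4))

# Sample standard deviation (ddof=1) of a column that was scaled with ddof=0.
n = len(indicator)
print(math.sqrt(n / (n - 1)))
```

The two standardized values land on the -0.3597 and 2.78009 reported for cbwd_NE, and the last line gives roughly 1.0000148, which would explain the std of 1.00001 in the column summary if the summary uses the sample standard deviation. Both observations are consistent with this kind of scaling but do not prove how the project implements it.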

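Data Splitting

Data splitting divides the training data into a training set and a validation set. Several strategies are supported, including simple (hold-out) validation, K-fold cross-validation, leave-one-out cross-validation, and leave-P-out cross-validation. The code is: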

    data_splitter = DataSplitter(conf=conf)
    train_data_list, valid_data_list = data_splitter.split(data)
    train_data, valid_data = train_data_list[0], valid_data_list[0]

Split summary:
|    | splitter   |   train_size | shuffle   |   random_state |   total_count |   train_count |   valid_count |
|---:|:-----------|-------------:|:----------|---------------:|--------------:|--------------:|--------------:|
|  0 | simple     |          0.7 | True      |             42 |         33746 |         23622 |         10124 |
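
The splitter returns two lists because cross-validation strategies produce several train/validation pairs, while the simple strategy produces just one. The sketch below illustrates that shape with index lists. It is not the project's `DataSplitter`; the function names are assumptions, and the shuffled order will differ from whatever the project produces even with the same `random_state`.

```python
import math
import random


def simple_split(n_rows, train_size=0.7, shuffle=True, random_state=42):
    """Hold-out split: one (train, valid) pair, each wrapped in a list."""
    indices = list(range(n_rows))
    if shuffle:
        random.Random(random_state).shuffle(indices)
    n_train = math.floor(train_size * n_rows)
    return [indices[:n_train]], [indices[n_train:]]


def k_fold_split(n_rows, k=5, shuffle=True, random_state=42):
    """K-fold split: k (train, valid) pairs returned as two parallel lists."""
    indices = list(range(n_rows))
    if shuffle:
        random.Random(random_state).shuffle(indices)
    # Deal the indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    train_list, valid_list = [], []
    for i, valid in enumerate(folds):
        # Every fold except the i-th one forms the training part.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        train_list.append(train)
        valid_list.append(valid)
    return train_list, valid_list


train_list, valid_list = simple_split(33746)
print(len(train_list[0]), len(valid_list[0]))
```

With 33746 rows and `train_size=0.7`, the sizes agree with the train_count and valid_count columns of the split summary.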

Data Modeling

The data modeling step covers model selection, model training, model evaluation, and model tuning. The code is:

    # Build the models.
    data_modeler = DataModeler(conf=conf)
    data_modeler.model(train_data, valid_data)

The modeling step selects the best model through automatic tuning. Partial output is shown below:

Model summary:
|    | model                                                          | best_params_model                                              |   valid-root_mean_squared_error |   train-root_mean_squared_error |
|---:|:---------------------------------------------------------------|:---------------------------------------------------------------|--------------------------------:|--------------------------------:|
|  0 | LinearRegression()                                             | LinearRegression()                                             |                         79.0689 |                         78.6081 |
|  1 | LinearSVR()                                                    | LinearSVR()                                                    |                         83.4089 |                         83.0077 |
|  2 | GridSearchCV(estimator=SVR(), param_grid={'kernel': ['poly']}) | GridSearchCV(estimator=SVR(), param_grid={'kernel': ['poly']}) |                         85.683  |                         84.9634 |
========================================================================================================================================================================================================
Best model summary:
|    | assessment              | model              | model_params                                                               |
|---:|:------------------------|:-------------------|:---------------------------------------------------------------------------|
|  0 | root_mean_squared_error | LinearRegression() | {'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'positive': False} |
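
The summary above ranks candidates by root mean squared error (RMSE) on the validation set, where lower is better. A minimal sketch of that selection rule follows. The helper names `rmse` and `pick_best` are assumptions for illustration and say nothing about how `DataModeler` is implemented; the scores are copied from the model summary table.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def pick_best(valid_scores):
    """Return the model name with the lowest validation RMSE."""
    return min(valid_scores, key=valid_scores.get)


# Validation RMSE values copied from the model summary table above.
valid_scores = {
    "LinearRegression()": 79.0689,
    "LinearSVR()": 83.4089,
    "GridSearchCV(estimator=SVR(), param_grid={'kernel': ['poly']})": 85.683,
}
print(pick_best(valid_scores))
print(rmse([3.0, 5.0], [1.0, 5.0]))
```

Selecting on the validation score rather than the training score avoids rewarding a model for overfitting; in this run the two scores are close for every candidate.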
