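Project Introduction
This project aims to automate the full data mining and analysis workflow, helping users uncover the value in their data faster and more accurately.
Readers who subscribe to one of the author's related columns can obtain the complete hands-on code for automated data mining and analysis free of charge. The code repository is at: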
https://github.com/JavaLeb/auto_cda
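The automated data analysis modules are:
1. Data ingestion
2. Data exploration
3. Data processing
4. Data splitting
5. Data modeling (model selection, model building, model evaluation, model tuning, and prediction).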
Part of the project structure is shown below:
[Figure: partial project directory structure]
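Project Guide
The project ships with several examples to help readers get started. The PM2.5 prediction example is used below. The complete code is given first (see example/pm25_example.py):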
from data_reader import DataReader
from data_explorer import DataExplorer
from data_splitter import DataSplitter
from data_processor import DataProcessor
from data_modeler import DataModeler
from data_configuration import Configuration


def auto_pm25():
    conf = Configuration(conf_path=r'../conf/pm25_ml_config.yml')

    # Read the training data.
    data_reader = DataReader(ds_type='file', conf=conf)
    data = data_reader.read()

    # Explore the raw data.
    data_explorer = DataExplorer(data, conf=conf)
    data_explorer.explore()

    # Process the data.
    data_processor = DataProcessor(conf=conf)
    data = data_processor.process(data)

    # Explore the processed data.
    data_explorer = DataExplorer(data, conf=conf)
    data_explorer.explore()

    # Split the data into training and validation sets.
    data_splitter = DataSplitter(conf=conf)
    train_data_list, valid_data_list = data_splitter.split(data)
    train_data, valid_data = train_data_list[0], valid_data_list[0]

    # Build the models.
    data_modeler = DataModeler(conf=conf)
    data_modeler.model(train_data, valid_data)

    # Read the test data.
    test_data = data_reader.read(train=False)

    # Explore the test data.
    data_explorer = DataExplorer(test_data, conf=conf)
    data_explorer.explore()

    # Process the test data with the same processor.
    test_processed_data = data_processor.process(test_data)

    # Predict and save the prediction results.
    data_modeler.save_predict(test_data, test_processed_data)


if __name__ == '__main__':
    auto_pm25()
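Data Reading
The data lives in the project directory data/pm25, which contains the training data pm25_train.csv and the test data pm25_test.csv. The contents of the training data file are shown in the figure:
[Figure: contents of pm25_train.csv]
With this project, three lines of code are enough to read the data: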
conf = Configuration(conf_path=r'../conf/pm25_ml_config.yml')
# Read the data.
data_reader = DataReader(ds_type='file', conf=conf)
data = data_reader.read()
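The internals of DataReader are not shown in this article. As a rough sketch of what a file-based reader might do, the snippet below parses CSV text with Python's standard csv module. The function name read_csv_rows and the two sample rows are made up for illustration; they are not taken from the project or from pm25_train.csv.

```python
import csv
import io


def read_csv_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row.

    Hypothetical helper for illustration; not part of the project.
    """
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)


# Made-up sample rows (assumption: the real file has more columns and rows).
sample = "hour,TEMP,cbwd\n0,-4.0,NW\n1,-3.0,SE\n"
rows = read_csv_rows(sample)
print(len(rows), rows[0]["cbwd"])
```

Every value comes back as a string here, so a real reader would still need to convert numeric columns before exploration or modeling.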
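Data Exploration
Data exploration analyzes the distributions of the data, the relationships between variables, and so on before the data is processed. With this project it takes two lines of code: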
data_explorer = DataExplorer(data, conf=conf)
data_explorer.explore()
Part of the output is shown below:

========================================================================================================================================================================================================
First 10 rows of data (shape: (33746, 13))
| | date | hour | pm2.5 | DEWP | TEMP | PRES | Iws | Is | Ir | cbwd_NE | cbwd_NW | cbwd_SE | cbwd_cv |
|---:|------------:|-----------:|--------:|----------:|-----------:|-----------:|-----------:|-----------:|----------:|----------:|----------:|----------:|----------:|
| 0 | -0.093972 | 1.22689 | 15 | 0.849475 | 0.62083 | -0.330651 | 0.180401 | -0.0725599 | -0.136961 | -0.3597 | 1.4593 | -0.734081 | -0.524471 |
| 1 | 0.516237 | -1.66044 | 50 | -0.196481 | -0.283649 | -1.30562 | 0.780893 | -0.0725599 | -0.136961 | -0.3597 | -0.68526 | 1.36225 | -0.524471 |
| 2 | -1.10593 | -1.37171 | 112 | -0.893786 | -1.4348 | -0.0381611 | -0.447357 | -0.0725599 | -0.136961 | -0.3597 | -0.68526 | -0.734081 | 1.90668 |
| 3 | 0.394954 | -1.22734 | 20 | -1.45163 | -1.76371 | 1.71678 | -0.310817 | -0.0725599 | -0.136961 | 2.78009 | -0.68526 | -0.734081 | -0.524471 |
| 4 | 0.823237 | 1.08252 | 19 | -0.266212 | 0.62083 | 0.0593356 | 0.0985994 | -0.0725599 | -0.136961 | -0.3597 | -0.68526 | 1.36225 | -0.524471 |
| 5 | 0.571194 | -1.51607 | 79 | 0.152171 | 0.538605 | -0.330651 | -0.465468 | -0.0725599 | -0.136961 | 2.78009 | -0.68526 | -0.734081 | -0.524471 |
| 6 | 1.55284 | -0.361142 | 62 | 0.221901 | 0.0452525 | -0.233154 | -0.447154 | -0.0725599 | -0.136961 | 2.78009 | -0.68526 | -0.734081 | -0.524471 |
| 7 | -0.00490415 | 1.08252 | 32 | 1.05867 | 1.44308 | -1.20812 | -0.310614 | -0.0725599 | -0.136961 | -0.3597 | -0.68526 | 1.36225 | -0.524471 |
| 8 | 1.13403 | -0.0724098 | 196 | -0.475403 | -0.612551 | 0.741812 | -0.429043 | -0.0725599 | -0.136961 | -0.3597 | -0.68526 | -0.734081 | 1.90668 |
| 9 | 0.093639 | -0.938607 | 25 | 0.710014 | 0.0452525 | 0.0593356 | -0.41073 | -0.0725599 | -0.136961 | -0.3597 | 1.4593 | -0.734081 | -0.524471 |
========================================================================================================================================================================================================
Variables with |correlation| >= 0.5:
level_0 level_1 value
2 TEMP PRES -0.825338
0 DEWP TEMP 0.822793
1 DEWP PRES -0.775286
3 cbwd_NW cbwd_SE -0.503037
========================================================================================================================================================================================================
Data summary:
| | total-columns | total-lines | dtypes | ftypes | memory-usage | total-missing-field | total-duplicate-value |
|---:|----------------:|--------------:|:------------|:---------------------|:---------------|----------------------:|------------------------:|
| 0 | 13 | 33746 | float64(13) | 13(VALUE:3,CLASS:10) | 3.3 MB | 0 | 0 |
========================================================================================================================================================================================================
Column summary:
| | Non-Null-Count | Dtype | Total-Count | Unique-Count | Unique-Count-Ratio | count_desc | mean_desc | std_desc | min_desc | 25%_desc | 50%_desc | 75%_desc | max_desc | Ftype | Missing-Value-Count | First-Missing-Index |
|:--------|-----------------:|:--------|--------------:|---------------:|---------------------:|-------------:|-------------:|-----------:|-----------:|-----------:|-----------:|------------:|-----------:|:--------|----------------------:|:----------------------|
| date | 33746 | float64 | 33746 | 1533 | 0.0454276 | 33746 | -1.29713e-15 | 1.00001 | -1.75783 | -0.853891 | 0.0121514 | 0.866823 | 1.69875 | VALUE | 0 | None |
| hour | 33746 | float64 | 33746 | 24 | 0.000711195 | 33746 | -2.41087e-17 | 1.00001 | -1.66044 | -0.938607 | 0.0719563 | 0.938153 | 1.65998 | CLASS | 0 | None |
| pm2.5 | 33746 | float64 | 33746 | 568 | 0.0168316 | 33746 | 98.6642 | 91.9102 | 0 | 29 | 72 | 137 | 994 | VALUE | 0 | None |
| DEWP | 33746 | float64 | 33746 | 69 | 0.00204469 | 33746 | -8.94864e-19 | 1.00001 | -2.91597 | -0.824055 | 0.01271 | 0.919206 | 1.8257 | CLASS | 0 | None |
| TEMP | 33746 | float64 | 33746 | 63 | 0.00186689 | 33746 | 2.31612e-18 | 1.00001 | -2.58596 | -0.859227 | 0.127478 | 0.867506 | 2.34756 | CLASS | 0 | None |
| PRES | 33746 | float64 | 33746 | 58 | 0.00171872 | 33746 | 1.03341e-15 | 1.00001 | -2.37808 | -0.818135 | -0.0381611 | 0.839309 | 2.88674 | CLASS | 0 | None |
| Iws | 33746 | float64 | 33746 | 2452 | 0.0726605 | 33746 | 8.35908e-17 | 1.00001 | -0.474421 | -0.447154 | -0.374305 | -0.0377372 | 11.0234 | VALUE | 0 | None |
| Is | 33746 | float64 | 33746 | 27 | 0.000800095 | 33746 | 5.68502e-18 | 1.00001 | -0.0725599 | -0.0725599 | -0.0725599 | -0.0725599 | 33.4877 | CLASS | 0 | None |
| Ir | 33746 | float64 | 33746 | 36 | 0.00106679 | 33746 | 2.21084e-18 | 1.00001 | -0.136961 | -0.136961 | -0.136961 | -0.136961 | 24.3427 | CLASS | 0 | None |
| cbwd_NE | 33746 | float64 | 33746 | 2 | 5.92663e-05 | 33746 | 2.73197e-17 | 1.00001 | -0.3597 | -0.3597 | -0.3597 | -0.3597 | 2.78009 | CLASS | 0 | None |
| cbwd_NW | 33746 | float64 | 33746 | 2 | 5.92663e-05 | 33746 | 6.41143e-17 | 1.00001 | -0.68526 | -0.68526 | -0.68526 | 1.4593 | 1.4593 | CLASS | 0 | None |
| cbwd_SE | 33746 | float64 | 33746 | 2 | 5.92663e-05 | 33746 | 5.34813e-17 | 1.00001 | -0.734081 | -0.734081 | -0.734081 | 1.36225 | 1.36225 | CLASS | 0 | None |
| cbwd_cv | 33746 | float64 | 33746 | 2 | 5.92663e-05 | 33746 | 6.96941e-17 | 1.00001 | -0.524471 | -0.524471 | -0.524471 | -0.524471 | 1.90668 | CLASS | 0 | None |
========================================================================================================================================================================================================
Detailed data information:
__________________________________________________
Frequency and proportion analysis of categorical field [hour]:
(number of categories: 24)
hour count proportion
0 -1.227339 1423 0.042168
1 1.659984 1422 0.042138
2 0.360689 1420 0.042079
3 0.649421 1416 0.041961
4 -0.505508 1416 0.041961
5 -1.660438 1416 0.041961
6 1.371252 1414 0.041901
7 0.505055 1413 0.041872
8 -0.649874 1412 0.041842
9 -1.082973 1411 0.041812
10 1.082519 1406 0.041664
11 0.793787 1406 0.041664
12 0.938153 1405 0.041635
13 0.216323 1403 0.041575
14 -0.361142 1402 0.041546
15 -1.516071 1402 0.041546
16 -1.371705 1400 0.041486
17 1.226886 1399 0.041457
18 -0.216776 1399 0.041457
19 -0.794241 1396 0.041368
20 -0.938607 1392 0.041249
21 1.515618 1392 0.041249
22 -0.072410 1391 0.041220
23 0.071956 1390 0.041190
__________________________________________________
Frequency and proportion analysis of categorical field [DEWP]:
(number of categories: 69)
DEWP count proportion
0 1.128397 1105 0.032745
1 1.058667 1019 0.030196
2 1.198127 993 0.029426
3 0.988936 982 0.029100
4 1.267858 929 0.027529
.. ... ... ...
64 -2.567316 3 0.000089
65 -2.706777 3 0.000089
66 -2.776508 2 0.000059
67 -2.846238 1 0.000030
68 -2.915969 1 0.000030
[69 rows x 3 columns]
__________________________________________________
Frequency and proportion analysis of categorical field [TEMP]:
(number of categories: 63)
TEMP count proportion
0 0.949732 1190 0.035263
1 0.867506 1150 0.034078
2 0.785281 1115 0.033041
3 0.703056 1084 0.032122
4 1.031957 1062 0.031470
.. ... ... ...
58 2.347564 4 0.000119
59 2.183113 2 0.000059
60 -2.585960 2 0.000059
61 -0.256241 1 0.000030
62 0.182295 1 0.000030
[63 rows x 3 columns]
__________________________________________________
Frequency and proportion analysis of categorical field [PRES]:
(number of categories: 58)
PRES count proportion
0 -0.233154 1124 0.033308
1 -1.013128 1118 0.033130
2 0.839309 1110 0.032893
3 -0.428148 1063 0.031500
4 0.644316 1059 0.031381
5 -0.330651 1047 0.031026
6 -0.915631 1034 0.030641
7 0.059336 1019 0.030196
8 0.351826 1019 0.030196
9 -0.135658 1015 0.030078
10 -0.525645 999 0.029604
11 0.156832 997 0.029544
12 -0.720638 997 0.029544
13 0.936806 992 0.029396
14 0.449322 982 0.029100
15 -0.818135 978 0.028981
16 0.741812 971 0.028774
17 1.034302 958 0.028389
18 -0.038161 957 0.028359
19 -1.110625 954 0.028270
20 0.254329 951 0.028181
21 -1.305618 946 0.028033
22 -0.623141 946 0.028033
23 0.546819 936 0.027737
24 -1.208121 927 0.027470
25 1.131799 884 0.026196
26 -1.403115 861 0.025514
27 1.229296 855 0.025336
28 1.326792 736 0.021810
29 -1.500611 619 0.018343
30 1.424289 586 0.017365
31 1.521786 584 0.017306
32 -1.598108 489 0.014491
33 1.619282 477 0.014135
34 1.716779 386 0.011438
35 1.814276 359 0.010638
36 -1.695605 331 0.009809
37 -1.793101 322 0.009542
38 1.911772 228 0.006756
39 -1.890598 176 0.005215
40 2.009269 168 0.004978
41 -1.988095 118 0.003497
42 2.106766 106 0.003141
43 -2.085591 92 0.002726
44 2.204263 75 0.002222
45 2.301759 70 0.002074
46 -2.183088 49 0.001452
47 2.399256 34 0.001008
48 -2.280585 15 0.000444
49 2.496753 15 0.000444
50 2.691746 2 0.000059
51 -2.378081 2 0.000059
52 2.594249 2 0.000059
53 2.886739 2 0.000059
54 0.303077 1 0.000030
55 2.789243 1 0.000030
56 1.294294 1 0.000030
57 1.554285 1 0.000030
__________________________________________________
Frequency and proportion analysis of categorical field [Is]:
(number of categories: 27)
Is count proportion
0 -0.072560 33441 0.990962
1 1.218219 55 0.001630
2 2.508998 33 0.000978
3 3.799777 30 0.000889
4 5.090556 27 0.000800
5 6.381335 25 0.000741
6 7.672114 24 0.000711
7 10.253672 18 0.000533
8 8.962893 17 0.000504
9 11.544450 14 0.000415
10 12.835229 11 0.000326
11 14.126008 8 0.000237
12 15.416787 6 0.000178
13 16.707566 5 0.000148
14 17.998345 5 0.000148
15 19.289124 4 0.000119
16 20.579903 4 0.000119
17 24.452240 3 0.000089
18 21.870682 3 0.000089
19 29.615355 2 0.000059
20 27.033798 2 0.000059
21 25.743019 2 0.000059
22 23.161461 2 0.000059
23 28.324577 2 0.000059
24 32.196913 1 0.000030
25 33.487692 1 0.000030
26 30.906134 1 0.000030
__________________________________________________
Frequency and proportion analysis of categorical field [Ir]:
(number of categories: 36)
Ir count proportion
0 -0.136961 32353 0.958721
1 0.543030 393 0.011646
2 1.223021 245 0.007260
3 1.903012 169 0.005008
4 2.583004 102 0.003023
5 3.262995 89 0.002637
6 3.942986 70 0.002074
7 4.622978 59 0.001748
8 5.302969 43 0.001274
9 5.982960 32 0.000948
10 6.662952 27 0.000800
11 7.342943 24 0.000711
12 8.022934 21 0.000622
13 8.702925 16 0.000474
14 9.382917 12 0.000356
15 11.422891 9 0.000267
16 12.782873 9 0.000267
17 10.742899 9 0.000267
18 13.462864 8 0.000237
19 10.062908 8 0.000237
20 14.822847 7 0.000207
21 15.502838 7 0.000207
22 14.142856 6 0.000178
23 12.102882 6 0.000178
24 16.182830 3 0.000089
25 16.862821 3 0.000089
26 17.542812 2 0.000059
27 20.262777 2 0.000059
28 20.942769 2 0.000059
29 19.582786 2 0.000059
30 18.222804 2 0.000059
31 21.622760 2 0.000059
32 23.662734 1 0.000030
33 24.342725 1 0.000030
34 18.902795 1 0.000030
35 22.982743 1 0.000030
__________________________________________________
Frequency and proportion analysis of categorical field [cbwd_NE]:
(number of categories: 2)
cbwd_NE count proportion
0 -0.359700 29880 0.885438
1 2.780093 3866 0.114562
__________________________________________________
Frequency and proportion analysis of categorical field [cbwd_NW]:
(number of categories: 2)
cbwd_NW count proportion
0 -0.68526 22963 0.680466
1 1.45930 10783 0.319534
__________________________________________________
Frequency and proportion analysis of categorical field [cbwd_SE]:
(number of categories: 2)
cbwd_SE count proportion
0 -0.734081 21929 0.649825
1 1.362247 11817 0.350175
__________________________________________________
Frequency and proportion analysis of categorical field [cbwd_cv]:
(number of categories: 2)
cbwd_cv count proportion
0 -0.524471 26466 0.784271
1 1.906683 7280 0.215729
__________________________________________________
Correlation matrix of the data:
| | date | hour | pm2.5 | DEWP | TEMP | PRES | Iws | Is | Ir | cbwd_NE | cbwd_NW | cbwd_SE | cbwd_cv |
|:--------|-------------:|------------:|-----------:|-----------:|------------:|-----------:|------------:|------------:|------------:|------------:|-----------:|-------------:|------------:|
| date | 1 | -0.00147873 | -0.0255648 | 0.0538985 | 0.0870263 | -0.0258579 | -0.0603007 | -0.026741 | -0.0136163 | 0.00462793 | -0.0439984 | 0.000837741 | 0.0453233 |
| hour | -0.00147873 | 1 | -0.0287458 | -0.025113 | 0.149418 | -0.0406005 | 0.0654733 | -0.005116 | -0.00859084 | -0.0685987 | -0.122566 | 0.211881 | -0.0536609 |
| pm2.5 | -0.0255648 | -0.0287458 | 1 | 0.16623 | -0.0943591 | -0.0443735 | -0.248153 | 0.0248505 | -0.055547 | -0.0341627 | -0.209001 | 0.0901709 | 0.158811 |
| DEWP | 0.0538985 | -0.025113 | 0.16623 | 1 | 0.822793 | -0.775286 | -0.289432 | -0.0359728 | 0.127166 | -0.0333757 | -0.333827 | 0.27162 | 0.0892787 |
| TEMP | 0.0870263 | 0.149418 | -0.0943591 | 0.822793 | 1 | -0.825338 | -0.149866 | -0.096841 | 0.0509266 | -0.0606246 | -0.265371 | 0.307028 | -0.00829014 |
| PRES | -0.0258579 | -0.0406005 | -0.0443735 | -0.775286 | -0.825338 | 1 | 0.174242 | 0.0716255 | -0.0846744 | 0.0635018 | 0.225794 | -0.24731 | -0.0183287 |
| Iws | -0.0603007 | 0.0654733 | -0.248153 | -0.289432 | -0.149866 | 0.174242 | 1 | 0.0174077 | -0.00631762 | -0.115665 | 0.360509 | -0.0771978 | -0.229599 |
| Is | -0.026741 | -0.005116 | 0.0248505 | -0.0359728 | -0.096841 | 0.0716255 | 0.0174077 | 1 | -0.00993791 | -0.00556331 | -0.0221607 | 0.0361405 | -0.012483 |
| Ir | -0.0136163 | -0.00859084 | -0.055547 | 0.127166 | 0.0509266 | -0.0846744 | -0.00631762 | -0.00993791 | 1 | 0.0407016 | 0.0293042 | -0.0385724 | -0.0200026 |
| cbwd_NE | 0.00462793 | -0.0685987 | -0.0341627 | -0.0333757 | -0.0606246 | 0.0635018 | -0.115665 | -0.00556331 | 0.0407016 | 1 | -0.246488 | -0.264049 | -0.188652 |
| cbwd_NW | -0.0439984 | -0.122566 | -0.209001 | -0.333827 | -0.265371 | 0.225794 | 0.360509 | -0.0221607 | 0.0293042 | -0.246488 | 1 | -0.503037 | -0.359399 |
| cbwd_SE | 0.000837741 | 0.211881 | 0.0901709 | 0.27162 | 0.307028 | -0.24731 | -0.0771978 | 0.0361405 | -0.0385724 | -0.264049 | -0.503037 | 1 | -0.385004 |
| cbwd_cv | 0.0453233 | -0.0536609 | 0.158811 | 0.0892787 | -0.00829014 | -0.0183287 | -0.229599 | -0.012483 | -0.0200026 | -0.188652 | -0.359399 | -0.385004 | 1 |
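The list of variable pairs with |correlation| >= 0.5 near the top of the output can be derived from a correlation matrix like the one above. The article does not show how DataExplorer does this, so the following is only a minimal sketch in plain Python: it computes Pearson correlations for every column pair and keeps the strong ones, sorted by absolute value. The names pearson and strong_pairs and the toy columns are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def strong_pairs(columns, threshold=0.5):
    """Return (name_a, name_b, r) for pairs with |r| >= threshold,
    ordered from the strongest absolute correlation to the weakest."""
    pairs = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, r))
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Toy columns for illustration only (not the PM2.5 data).
toy = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with "a"
    "c": [1.0, -1.0, 1.0, -1.0],  # |r| with "a" and "b" is below 0.5
}
result = strong_pairs(toy)
print(result)
```

With pandas available, DataFrame.corr() would normally replace the hand-written pearson function; the filtering and sorting step stays the same.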
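Data Processing
The project's data processing step handles data cleaning, transformation, encoding, and similar tasks. The code:
# Process the data.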
data_processor = DataProcessor(conf=conf)
data = data_processor.process(data)
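The article does not list the exact processing steps. The exploration output shown earlier, however, has feature columns with mean close to 0 and standard deviation close to 1, plus cbwd_NE, cbwd_NW, cbwd_SE, and cbwd_cv columns, which is consistent with z-score standardization and one-hot encoding. The sketch below illustrates those two transforms in plain Python under that assumption; standardize, one_hot, and the sample values are hypothetical.

```python
import statistics


def standardize(values):
    """Z-score a column using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def one_hot(labels, prefix="cbwd"):
    """Turn a categorical column into one 0/1 column per category."""
    categories = sorted(set(labels))
    return {
        f"{prefix}_{c}": [1.0 if label == c else 0.0 for label in labels]
        for c in categories
    }


# Made-up values for illustration only.
temps = [-4.0, -3.0, 0.0, 5.0, 12.0]
z = standardize(temps)

# The population std of z is 1; the sample std (n - 1 in the denominator)
# is slightly larger.
print(statistics.pstdev(z), statistics.stdev(z))

encoded = one_hot(["NW", "SE", "NW", "cv", "NE"])
print(sorted(encoded))
```

If the project scales with the population standard deviation and the summary reports the sample standard deviation, the ratio would be sqrt(n / (n - 1)), which for n = 33746 is about 1.00001. That matches the std_desc column above, but it remains an inference rather than something the article states.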
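Data Splitting
Data splitting divides the training data into a training set and a validation set. Several strategies are supported, including simple (hold-out) validation, K-fold cross-validation, leave-one-out cross-validation, and leave-P-out cross-validation. The code: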
data_splitter = DataSplitter(conf=conf)
train_data_list, valid_data_list = data_splitter.split(data)
train_data, valid_data = train_data_list[0], valid_data_list[0]
Data split summary:
| | splitter | train_size | shuffle | random_state | total_count | train_count | valid_count |
|---:|:-----------|-------------:|:----------|---------------:|--------------:|--------------:|--------------:|
| 0 | simple | 0.7 | True | 42 | 33746 | 23622 | 10124 |
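To make the summary concrete, here is a minimal sketch of a simple hold-out split using the same parameters (train_size 0.7, shuffle, random_state 42), followed by a K-fold generator that shows why a splitter might return lists of training and validation sets. Both functions are hypothetical stand-ins, not the project's DataSplitter, and the shuffled order will differ from what scikit-learn produces for the same seed.

```python
import random


def simple_split(rows, train_size=0.7, shuffle=True, random_state=42):
    """Hold-out split: the first train_size share of (shuffled) rows trains."""
    indices = list(range(len(rows)))
    if shuffle:
        random.Random(random_state).shuffle(indices)
    n_train = int(len(rows) * train_size)
    train = [rows[i] for i in indices[:n_train]]
    valid = [rows[i] for i in indices[n_train:]]
    return train, valid


def k_fold_split(rows, n_splits=5):
    """Yield one (train, valid) pair per fold, without shuffling."""
    n = len(rows)
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n // n_splits + (1 if i < n % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        yield rows[:start] + rows[stop:], rows[start:stop]
        start = stop


rows = list(range(33746))  # stand-in for the 33746 training rows
train, valid = simple_split(rows)
print(len(train), len(valid))  # 23622 10124

folds = list(k_fold_split(list(range(10)), n_splits=3))
print([len(v) for _, v in folds])  # [4, 3, 3]
```

A simple split yields a single pair, which would explain why the example takes element 0 of train_data_list and valid_data_list; a K-fold strategy would fill those lists with one entry per fold.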
Data Modeling
The data modeling module covers model selection, model training, model evaluation, and model tuning. The code:
# Build the models.
data_modeler = DataModeler(conf=conf)
data_modeler.model(train_data, valid_data)
The data modeling step selects the best model through automatic tuning. Part of the output:
Data model summary:
| | model | best_params_model | valid-root_mean_squared_error | train-root_mean_squared_error |
|---:|:---------------------------------------------------------------|:---------------------------------------------------------------|--------------------------------:|--------------------------------:|
| 0 | LinearRegression() | LinearRegression() | 79.0689 | 78.6081 |
| 1 | LinearSVR() | LinearSVR() | 83.4089 | 83.0077 |
| 2 | GridSearchCV(estimator=SVR(), param_grid={'kernel': ['poly']}) | GridSearchCV(estimator=SVR(), param_grid={'kernel': ['poly']}) | 85.683 | 84.9634 |
========================================================================================================================================================================================================
Best data model summary:
| | assessment | model | model_params |
|---:|:------------------------|:-------------------|:---------------------------------------------------------------------------|
| 0 | root_mean_squared_error | LinearRegression() | {'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'positive': False} |
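The selection shown in the two summaries can be read as "compute RMSE on the validation set for each candidate and keep the lowest". The sketch below illustrates that rule using the validation scores from the table above; rmse and pick_best are hypothetical helpers, and the article does not confirm that DataModeler works exactly this way.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def pick_best(valid_scores):
    """Return the model name with the lowest validation error."""
    return min(valid_scores, key=valid_scores.get)


# Validation RMSE values copied from the data model summary above.
valid_scores = {
    "LinearRegression()": 79.0689,
    "LinearSVR()": 83.4089,
    "GridSearchCV(estimator=SVR(), param_grid={'kernel': ['poly']})": 85.683,
}
best = pick_best(valid_scores)
print(best)  # LinearRegression()

# Tiny worked example: errors of 2 and 0 give sqrt((4 + 0) / 2).
print(rmse([3.0, 5.0], [1.0, 5.0]))
```

Selecting on the validation score rather than the training score keeps the choice from favoring a model that merely memorizes the training set.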