Feature dimensionality reduction with scikit-learn

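Why reduce feature dimensionality

Feature dimensionality reduction is the process of reducing the number of features (dimensions) in a dataset. It simplifies the structure of the data, removes redundant features and noise, and can improve a model's runtime efficiency and predictive power. The main benefits are:

  • Less computation: fewer features mean fewer computing resources for training. For high-dimensional datasets this can cut training time significantly.

  • Less overfitting: removing unimportant or redundant features reduces the model's over-reliance on the training data, which improves generalization to new data.

  • Data visualization: high-dimensional data is very hard to visualize. After dimensionality reduction the data can be projected into two or three dimensions, which makes visualization and analysis more intuitive.

  • Noise removal: high-dimensional data often contains a lot of noise. Dimensionality reduction helps drop irrelevant features and keep the information that is most useful to the model.

  • Better model performance: some techniques, such as linear discriminant analysis (LDA), can improve a classification task by finding the feature combination that best separates the classes.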

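Dimensionality reduction

Dimensionality reduction is the process of reducing, under certain constraints, the number of random variables (features) in order to obtain a set of "uncorrelated" principal variables.

Two approaches to dimensionality reduction

  • Feature selection
  • Principal component analysis (which can be understood as a form of feature extraction)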

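What is feature selection

Definition

When the data contains redundant or correlated variables (also called features, attributes, or indicators), feature selection aims to pick out the main features from the original ones.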

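Methods

Filter methods: examine the properties of the features themselves, and the associations between features and between features and the target

  • Variance threshold: filter out low-variance features
  • Correlation coefficient: the degree of correlation between features

Embedded methods: the algorithm selects features automatically (based on the association between features and the target)

  • Decision trees: information entropy, information gain
  • Regularization: L1, L2
  • Deep learning: convolution, etc.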
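
As a rough illustration of the embedded approach, the sketch below lets a decision tree choose features through the importances it learns during training. The synthetic data, the variable names, and the choice of SelectFromModel are assumptions made for this demonstration and are not taken from the original notes.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption): only the first of three features drives the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

# The tree learns which features matter while it is being trained;
# SelectFromModel keeps features whose importance reaches the mean importance.
selector = SelectFromModel(DecisionTreeClassifier(random_state=0))
selector.fit(X, y)

importances = selector.estimator_.feature_importances_
support = selector.get_support()      # boolean mask of the selected features
X_selected = selector.transform(X)

print("importances:", importances)
print("selected mask:", support)
print("shape after selection:", X_selected.shape)
```

Here the selection is a by-product of fitting the model, which is what distinguishes embedded methods from the filter methods covered next.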

Module

sklearn.feature_selection

Test data

index,pe_ratio,pb_ratio,market_cap,return_on_asset_net_profit,du_return_on_equity,ev,earnings_per_share,revenue,total_expense,date,return
0,000001.XSHE,5.9572,1.1818,85252550922.0,0.8008,14.9403,1211444855670.0,2.01,20701401000.0,10882540000.0,2012-01-31,0.027657228229937388
1,000002.XSHE,7.0289,1.588,84113358168.0,1.6463,7.8656,300252061695.0,0.326,29308369223.2,23783476901.2,2012-01-31,0.08235182370820669
2,000008.XSHE,-262.7461,7.0003,517045520.0,-0.5678,-0.5943,770517752.56,-0.006,11679829.03,12030080.04,2012-01-31,0.09978900335112327
3,000060.XSHE,16.476,3.7146,19680455995.0,5.6036,14.617,28009159184.6,0.35,9189386877.65,7935542726.05,2012-01-31,0.12159482758620697
4,000069.XSHE,12.5878,2.5616,41727214853.0,2.8729,10.9097,81247380359.0,0.271,8951453490.28,7091397989.13,2012-01-31,-0.0026808154146886697
5,000100.XSHE,10.796,1.5219999999999998,17206724233.0,2.245,7.7394,66034033386.1,0.0974,43883757748.0,43092263405.0,2012-01-31,0.13795588072275808
6,000402.XSHE,8.1032,1.0078,18253291248.0,3.2233,10.3965,56511787330.1,0.6,7303544153.27,5440677926.67,2012-01-31,0.08457869004968227
7,000413.XSHE,192.5234,18.3692,4270450000.0,-0.6423,-1.0375,4363236328.83,-0.0059,62630800.67,64784767.7,2012-01-31,0.11926833305744075
8,000415.XSHE,18.0444,1.2301,8699264600.0,1.8112,3.7845,21846364582.5,0.3197,1466188261.63,1036862316.81,2012-01-31,0.188558875748548
9,000423.XSHE,23.3689,7.7135,26409389664.0,16.1098,20.7633,27897612682.3,0.9372,1862915702.43,1157644109.36,2012-01-31,0.08766839826138167
10,000425.XSHE,8.6344,2.1796,33004130464.0,9.9592,21.1151,53846101659.4,1.37,25233044684.7,21955101570.5,2012-01-31,0.0024943086212016125
11,000503.XSHE,226.9945,4.3887,5410909668.0,0.1796,0.3275,6249917784.47,0.0045,99694183.97,135382267.29,2012-01-31,0.16777408637873767
12,000538.XSHE,22.6896,6.1087,33935745494.0,10.2038,17.8649,43071657562.5,1.24,7741652400.02,6769161248.85,2012-01-31,0.0654679840686276
13,000540.XSHE,17.6816,4.0655,8759864200.0,2.3934,17.664,22467665936.4,0.2538,2004355082.02,1749636830.11,2012-01-31,0.19704350036003213
14,000559.XSHE,18.3311,2.3941,9256861365.0,4.7469,9.2005,14628454817.0,0.217,6200943486.15,5822509580.28,2012-01-31,0.13942307692307696
15,000568.XSHE,14.3531,7.2173,51740226954.0,24.1712,34.1465,56573570954.2,1.44,5628271589.15,2856713520.1,2012-01-31,0.0964713454704008
16,000623.XSHE,5.6704,1.7148,15460024303.0,20.7774,23.1428,18804075329.3,2.7168,809481664.08,690911564.42,2012-01-31,0.11214716945348802
17,000625.XSHE,-203.3888,1.3678,20159792217.0,2.7517,7.0257,37217163392.5,0.19,19606549988.7,20002232604.2,2012-01-31,0.1367074829931972
18,000627.XSHE,-16.8256,2.8326,3898338814.0,-2.728,-3.1988,5262950195.63,-0.036000000000000004,797135024.42,832484948.18,2012-01-31,0.1562478197167375
19,000630.XSHE,14.5249,2.8489999999999998,28446350207.0,3.2688,12.367,49261848233.6,0.74,52032837531.5,50898693656.5,2012-01-31,0.09244543062077502
20,000651.XSHE,8.8321,3.1092,54743150990.0,4.9914,25.6552,125069492621.0,1.3388,64074612708.6,61158359767.8,2012-01-31,0.09449989694667722
21,000671.XSHE,54.5916,1.7354,3484036043.0,0.8814,5.9695,11761758216.0,0.19,1828169582.43,1706703767.9,2012-01-31,0.17388986213725338
22,000686.XSHE,-10.5648,2.865,8771366787.0,-0.1107,-0.5469,20765182199.3,-0.03,677592524.5,717833927.89,2012-01-31,0.12170532632190215
23,000709.XSHE,30.7015,0.7431,31749624178.0,0.8678,2.8371,99804099745.6,0.15,107770205166.0,106446201521.0,2012-01-31,0.08023627861186305
24,000723.XSHE,83.1601,5.1914,2447173888.0,3.7628,6.7973,2643902662.49,0.23,1211363701.77,1164178657.88,2012-01-31,0.042778608589422264
25,000725.XSHE,7.8303,0.9565,24473991637.0,-3.8791,-8.9859,59972852071.9,-0.16,8514636554.0,10928202542.0,2012-01-31,0.07731582786287378
26,000728.XSHE,89.0694,1.2107,17853669000.0,1.9189,3.093,27584467396.7,0.2311,1315052595.51,759062567.21,2012-01-31,0.07700600919052664
27,000738.XSHE,48.6018,3.7755,9352957791.0,3.2868,5.6005,12090522078.9,0.1412,1350949551.65,1208450823.91,2012-01-31,0.1542375123375765
28,000750.XSHE,-25.6794,3.0579,7848747888.0,1.0709,1.2462,20293252211.3,0.05,846043430.3,770218550.33,2012-01-31,0.19820142453543432
29,000768.XSHE,-147.1629,2.0306,19102438172.0,0.1931,0.3771,32166350682.9,0.0142,5081888777.39,5059579936.33,2012-01-31,0.1465590484282072
30,000776.XSHE,728.4289,2.3480000000000003,73103249580.0,1.7866,6.13,136014621813.0,0.61,4325459005.79,2377388304.37,2012-01-31,0.07489705895839265
31,000783.XSHE,-45.0677,1.6385,18614185636.0,1.0097,3.1584,38751827912.8,0.14,1337998541.45,967274968.36,2012-01-31,0.12992671244734258
32,000792.XSHE,19.2172,3.7855,51771074558.0,10.514000000000001,19.4208,80915739549.4,1.8385,5539622881.41,2862752126.27,2012-01-31,0.15145999313841346
33,000826.XSHE,23.0132,5.7688,9086509777.0,5.8911,13.970999999999998,12253603099.7,0.462,1024996030.27,798413138.46,2012-01-31,0.08771129113758985
34,000839.XSHE,333.106,1.8536,10756003511.0,0.89,1.5777,18870505927.6,0.0583,1180163422.97,1267800326.91,2012-01-31,0.12829722318773407
35,000858.XSHE,15.3332,5.2524,121432975373.0,16.1359,24.1627,146746453414.0,1.2690000000000001,15648774490.2,9035308334.24,2012-01-31,0.1015933571894008
36,000876.XSHE,7.676,3.0819,29540383370.0,7.4287,14.0505,20968360549.5,1.03,50764868718.2,49120061669.7,2012-01-31,0.06176845044971034
37,000895.XSHE,-1338.4485,10.6955,39377548602.0,4.5209,7.8543,42437138787.3,0.4592,24239106613.4,23337169830.8,2012-01-31,0.062017805916695874
38,000938.XSHE,42.6661,2.8149,2254515200.0,1.6298,4.2112,3529790114.27,0.162,3644404043.75,3629839989.02,2012-01-31,0.1014640474425501
39,000959.XSHE,-26.1079,1.1222,8513929784.0,0.0867,2.0863,21615136271.8,0.0549,9502575535.22,9992265480.22,2012-01-31,0.08711129724636166
40,000961.XSHE,8.7321,1.8054,9809849498.0,2.4954,13.1395,31471593062.6,0.55,7903065272.29,7025337345.32,2012-01-31,0.1964263632991221
41,000963.XSHE,22.4888,6.5556,10022445192.0,7.7243,22.5682,14715366599.7,0.6757,8194944377.07,7738330113.27,2012-01-31,0.08012232979301576
42,000983.XSHE,16.8509,3.6087,50324664000.0,8.237,18.4178,85342188907.5,0.7948,22884339585.8,19205385113.0,2012-01-31,0.0776470137238366
43,002007.XSHE,103.2077,5.627999999999999,13321854976.0,11.9395,13.7822,12957134001.1,0.535,738447615.67,363505122.7,2012-01-31,0.06531210999934314
44,002008.XSHE,9.4767,2.6381,7404771894.0,5.7164,11.6348,11923045459.0,0.2756,2673513831.05,2362243445.18,2012-01-31,0.16502264030126082
45,002024.XSHE,10.8124,2.7260000000000004,60867043234.0,7.0214,17.3106,105525576525.0,0.4892,67625499000.0,63105122000.0,2012-01-31,0.034480285024270016
46,002027.XSHE,-10.2187,2.016,1218410517.0,0.1718,0.257,1805997930.82,0.0056,947381266.9,941843709.87,2012-01-31,0.09178151381773576
47,002044.XSHE,20.4515,5.1103,1935277500.0,6.5053,9.2073,2585540321.99,0.139,579326555.55,536804411.89,2012-01-31,0.11009951302138482
48,002065.XSHE,18.7782,4.3444,9850608640.0,9.892999999999999,14.5688,11373623438.7,0.5093,1606056451.02,1335501978.86,2012-01-31,0.11099090669863453
49,002074.XSHE,34.6532,3.6556,1578482640.0,5.7557,10.5459,2441487011.21,0.18,493069178.84,440614415.3,2012-01-31,0.11717714761670385
50,002081.XSHE,17.0643,5.9058,18377040393.0,9.7896,31.9623,21684655943.0,0.9086,6432177806.42,5839048120.1,2012-01-31,0.139324831207802
51,002142.XSHE,6.6771,1.4932,27944220926.0,1.0298,15.1654,240797776509.0,0.89,5706374000.0,2764872000.0,2012-01-31,0.03405619746902688
52,002146.XSHE,7.3613,2.4509,15039897600.0,3.9236,18.5606,33968718457.5,0.51,6101082478.69,4839724494.1,2012-01-31,0.19950759589290207
53,002153.XSHE,23.5333,6.7642,7397241600.0,15.0778,17.1945,8854453139.38,0.52,459764525.26,302985073.79,2012-01-31,0.11158221038069367
54,002174.XSHE,-135.1836,3.952,902386340.0,0.1001,0.77,1344172473.18,0.021,226840813.26,228325157.3,2012-01-31,0.16452606174928575
55,002230.XSHE,45.8019,6.3889,7516943625.0,6.4937,7.8906,7670455233.62,0.28,342597396.99,285178588.37,2012-01-31,0.31656126844468896
56,002236.XSHE,23.7544,7.819,11774402076.0,11.0706,16.345,12402992370.4,0.73,1460699765.06,1234300793.67,2012-01-31,0.10240671551263997
57,002241.XSHE,21.875,8.175,17053408060.0,8.8802,19.7717,20333004124.0,0.47,2753719950.56,2332580270.48,2012-01-31,0.1467561303706585
58,002252.XSHE,29.9917,7.0569,6454560000.0,15.0026,16.5791,6584240682.17,0.51,381696427.88,218311705.2,2012-01-31,0.06362922230950507
59,002292.XSHE,75.7295,7.325,10104832000.0,5.7157,6.7608,7207363453.14,0.22,727006274.15,623049692.79,2012-01-31,0.09363440005958447
60,002294.XSHE,19.4133,4.1442,8836656000.0,12.5764,14.765999999999998,11936854986.3,0.79,1077149060.94,746256481.74,2012-01-31,0.13235003881425728
61,002304.XSHE,19.6996,11.6265,114993000000.0,22.9515,37.134,127056431081.0,3.24,9697320356.4,5655841725.15,2012-01-31,0.16850501935668302
62,002310.XSHE,17.5595,6.0159,11263785183.0,13.5919,22.8921,14699468220.1,2.29,1965187858.05,1566421577.31,2012-01-31,0.2306324150615437
63,002352.XSHE,26.5542,2.1836,1569068525.0,3.2224,4.3585,2073680968.64,0.4,470163824.47,438363251.57,2012-01-31,0.08977645112098892
64,002385.XSHE,16.6065,3.3233,11563080000.0,7.7024,9.8108,14559606885.4,0.78,5279408885.11,4905932347.5,2012-01-31,0.17157582281371295
65,002411.XSHE,27.4205,3.0469,2356830000.0,5.2652,5.4527,3020826515.22,0.17,770044002.13,712509134.65,2012-01-31,0.21476753135044588
66,002415.XSHE,19.359,5.4305,36980000000.0,12.7014,14.6805,40368127810.4,0.87,3453156822.06,2470651269.54,2012-01-31,0.11356624917600534
67,002424.XSHE,25.5245,3.4475,6599712000.0,4.5786,6.8463,8407405957.77,0.27,701417429.65,554485320.02,2012-01-31,0.12688254714847808
68,002426.XSHE,1512.7409,2.2273,2886956100.0,3.7132,5.2927,4088639981.93,0.1697,1115660255.45,1062257040.89,2012-01-31,0.14153668399768915
69,002450.XSHE,35.4532,6.6061,6787200000.0,7.5141,11.0486,7508019924.18,0.3207,1098832203.2,969665965.32,2012-01-31,0.1619329045715278
70,002456.XSHE,-263.2222,2.8871,2808960000.0,-1.4387,-2.6004,4688450853.06,-0.13,648674946.38,677435315.07,2012-01-31,0.5242354899105472
71,002460.XSHE,49.731,4.5287,3279000000.0,5.1303,5.8898,4023862473.74,0.28,327739393.91,282919396.3,2012-01-31,0.08829619033278123
72,002465.XSHE,31.4086,1.655,7075738533.0,2.5843,2.9048,8694467710.3,0.3679,658569640.25,549797985.66,2012-01-31,0.15225456190619813
73,002466.XSHE,97.1092,4.1013,4036620000.0,2.9892,3.3418,5192804839.01,0.22,298312158.14,267370981.94,2012-01-31,0.126367785906334
74,002468.XSHE,-29.5372,2.2865,1735680000.0,3.0888,4.7192,2985511165.06,0.19399999999999998,813493135.78,777625233.64,2012-01-31,0.1946935798624116
75,002470.XSHE,17.198,3.1169,9618000000.0,8.6549,13.7957,13161820051.0,0.57,6334742941.75,5843040739.55,2012-01-31,0.06985058208818411
76,002475.XSHE,20.348,4.4072,8110377000.0,10.4286,11.1446,10445640552.3,0.73,1805745307.42,1506388762.01,2012-01-31,0.11219353791199868
77,002500.XSHE,-299.473,2.6293,15694692000.0,0.8005,2.2936,28146131862.0,0.06,764146400.0,593553103.0,2012-01-31,0.09173205556984823
78,002508.XSHE,17.1347,2.7193,4106240000.0,7.013999999999999,8.7418,4245859793.34,0.48,1066937023.89,928033582.34,2012-01-31,0.02867975889558632
79,002555.XSHE,44.4557,1.8685,1429780000.0,6.0338,7.8995,1915721333.26,0.2783,253102516.76,224929800.23,2012-01-31,0.09088986411463024
80,002558.XSHE,27.0092,2.2569,1331610000.0,6.2979,7.4814,1511286005.1,0.48,232718969.01,204381787.74,2012-01-31,0.1045629106528585
81,002572.XSHE,18.3723,2.3778,3242100000.0,9.0828,10.6158,4293748635.4,0.8,629835600.6,531961977.68,2012-01-31,0.1920963172804533
82,002601.XSHE,15.5863,4.3066,8685600000.0,16.0068,27.3354,8390870389.73,4.32,1510362207.38,1126040681.21,2012-01-31,0.049572435656780614
83,002602.XSHE,19.6278,1.5786,2436000000.0,9.5175,12.6759,3363782564.41,0.88,745992572.11,605838766.87,2012-01-31,0.1092073225719181
84,002608.XSHE,21.584,1.617,3198720000.0,3.6252,10.8872,6726447058.26,1.43,1953159642.57,1756742374.55,2012-01-31,0.08272847047550601
85,002624.XSHE,25.5598,3.0336,1475000000.0,12.3847,23.1718,197559732.15,0.66,332677250.53,285246943.45,2012-01-31,0.07322489313273754
86,300003.XSHE,20.9579,4.796,11229960000.0,15.6951,16.8851,13490161497.1,0.4467,678769523.57,268496848.15,2012-01-31,0.05638972305638978
87,300015.XSHE,40.59,6.6348,9073728000.0,8.1282,10.3255,9332816274.92,0.31,966750963.5,777018084.74,2012-01-31,0.029664695572409017
88,300017.XSHE,25.9591,2.3706,1847487146.0,4.5921,4.8207,1922528782.12,0.23,382167189.86,346917794.45,2012-01-31,0.1711732096390993
89,300024.XSHE,26.6647,4.7491,5637680400.0,5.7208,7.3934,6614052715.55,0.27,546261214.18,459245055.06,2012-01-31,0.008442293833007293
90,300027.XSHE,32.433,4.9322,8322048000.0,5.1512,6.5198,9548149074.81,0.17,481297343.95,379519891.28,2012-01-31,0.0915833408401646
91,300033.XSHE,62.4488,1.7983,2010624000.0,4.0142,4.5664,2717977058.33,0.37,167909628.35,118080013.56,2012-01-31,0.2179192555394784
92,300059.XSHE,25.5474,2.1984,3738000000.0,4.8088,5.0589,4223969468.4,0.4,198275042.38,115710250.2,2012-01-31,0.29160031847133777
93,300070.XSHE,21.4611,3.3669,10989132000.0,2.5622,2.7239999999999998,12240238304.7,0.25,368083629.66,275351643.46,2012-01-31,0.16540056076110135
94,300072.XSHE,22.5444,2.6189,2990079800.0,3.6867,5.5583,3763745417.33,0.31,409472534.65,340213144.32,2012-01-31,0.1340454340303897
95,300122.XSHE,61.6859,3.7147,8388000000.0,5.6875,5.9543,11197163180.7,0.32,456652112.14,309831451.78,2012-01-31,0.14450333725952091
96,300124.XSHE,22.6099,4.1521,10396080000.0,10.2343,10.9358,15352248689.5,1.19,789887920.75,498177781.4,2012-01-31,0.09703376716456807
97,300136.XSHE,19.499000000000002,2.9683,2002766800.0,8.8253,9.145,2949590988.74,0.4149,121334857.44,56642914.4,2012-01-31,0.12049127343244985
98,300144.XSHE,26.941,2.8002,7573104000.0,6.309,7.3163,7692714479.5,0.52,387976059.28,152151856.75,2012-01-31,0.112750724162431
99,300251.XSHE,22.2623,2.9725,5315600000.0,8.8568,10.3649,6925068439.68,1.14,413759118.43,290804273.67,2012-01-31,0.16700232378001545
100,600000.XSHG,4.859,1.1557,171985006446.0,0.8475,15.1658,2524302191530.0,1.067,49307426000.0,23080440000.0,2012-01-31,0.034713242753983416

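Filter methods

Low-variance feature filtering

Remove features whose variance is low.

  • Low variance: most samples have similar values for the feature
  • High variance: the values of the feature differ across many samples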
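
One caveat worth keeping in mind: variance depends on the units of a feature, so a single threshold treats features on different scales very differently. The sketch below uses made-up values (an assumption, not data from this article) to show the same measurements producing a tiny or a huge variance depending only on the unit.

```python
from statistics import pvariance

# Hypothetical feature: the same measurements expressed in two different units.
ratio_like = [1.0, 1.1, 0.9, 1.0]                  # values clustered around 1
same_times_1000 = [v * 1000 for v in ratio_like]   # identical data, other unit

var_small = pvariance(ratio_like)        # population variance (divide by n)
var_large = pvariance(same_times_1000)   # scales with the square of the unit

print(var_small)
print(var_large)
```

Rescaling by 1000 multiplies the variance by 1000 squared, so it may be worth checking feature scales before choosing a threshold.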

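API

sklearn.feature_selection.VarianceThreshold(threshold=0.0)

Removes all low-variance features.

  • VarianceThreshold.fit_transform(X)
  • X: data in numpy array format, shape [n_samples, n_features]
  • Returns: the data with every feature whose training-set variance is lower than threshold removed. The default keeps all features with non-zero variance, i.e. it removes features that have the same value in every sample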
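
To see the default behaviour in isolation, here is a minimal sketch on a toy matrix (the matrix is an assumption for illustration). Columns 0 and 3 are constant, so they should be the ones dropped.

```python
from sklearn.feature_selection import VarianceThreshold

# Toy matrix (assumption): columns 0 and 3 hold the same value in every row.
X = [[0, 2, 0, 3],
     [0, 1, 4, 3],
     [0, 1, 1, 3]]

selector = VarianceThreshold()          # default threshold=0.0
X_new = selector.fit_transform(X)

print(selector.variances_)              # per-feature variance learned by fit
print(selector.get_support())           # True for the columns that were kept
print(X_new)
```

get_support() is handy for mapping the surviving columns back to their original names when the input is a DataFrame.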

Computing on the data

from sklearn.feature_selection import VarianceThreshold
import pandas as pd


def variance_demo():
    """
    Low-variance feature filtering
    :return: None
    """
    # 1. Load the data
    data = pd.read_csv('factor_returns.csv')
    print('data:\n', data)
    # Keep only the factor columns: drop the first column and the
    # last two columns (date, return)
    data = data.iloc[:, 1:-2]
    print('data:\n', data)

    # 2. Instantiate the transformer
    # transform = VarianceThreshold()
    transform = VarianceThreshold(threshold=10)

    # 3. Call fit_transform
    data_new = transform.fit_transform(data)
    print("data_new\n", data_new, data_new.shape)

    return None

if __name__ == "__main__":
    variance_demo()


Correlation coefficient

Pearson correlation coefficient: a statistic that reflects how closely features are linearly related
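
For reference, the standard definition of the Pearson correlation coefficient for n paired samples can be written as below. The symbols are this sketch's own choice of notation: x_i and y_i are the paired values, and x-bar and y-bar are their means.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```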


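Properties

The correlation coefficient lies between -1 and +1, i.e. -1 <= r <= +1. Its properties are:

  • r > 0 means the two variables are positively correlated; r < 0 means they are negatively correlated
  • |r| = 1 means the two variables are perfectly correlated; r = 0 means there is no linear correlation between them
  • 0 < |r| < 1 means the two variables are correlated to some degree. The closer |r| is to 1, the stronger the linear relationship; the closer |r| is to 0, the weaker it is
  • A common three-level rule of thumb: |r| < 0.4 is low correlation; 0.4 <= |r| < 0.7 is significant correlation; 0.7 <= |r| < 1 is high linear correlation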
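
The properties above can be checked by hand with a few lines of plain Python. This is a minimal sketch: the function names pearson_r and strength and the sample values are hypothetical, chosen only for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def strength(r):
    """Label |r| with the three-level rule of thumb from the list above."""
    a = abs(r)
    if a < 0.4:
        return "low"
    if a < 0.7:
        return "significant"
    return "high"

x = [1, 2, 3, 4]
r_pos = pearson_r(x, [2, 4, 6, 8])    # y rises with x along a straight line
r_neg = pearson_r(x, [8, 6, 4, 2])    # y falls with x along a straight line
print(r_pos, strength(r_pos))
print(r_neg, strength(r_neg))
```

Both example pairs lie exactly on a straight line, so they sit at the two extremes of the range, one positive and one negative.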

API

from scipy.stats import pearsonr
  • x: (N,) array_like
  • y: (N,) array_like
  • Returns: (Pearson's correlation coefficient, p-value)

from sklearn.feature_selection import VarianceThreshold
from scipy.stats import pearsonr
import pandas as pd

def variance_demo():
    """
    Low-variance feature filtering, followed by a Pearson correlation check
    :return: None
    """
    # 1. Load the data
    data = pd.read_csv('factor_returns.csv')
    print('data:\n', data)
    # Keep only the factor columns: drop the first column and the
    # last two columns (date, return)
    data = data.iloc[:, 1:-2]
    print('data:\n', data)

    # 2. Instantiate the transformer
    # transform = VarianceThreshold()
    transform = VarianceThreshold(threshold=10)

    # 3. Call fit_transform
    data_new = transform.fit_transform(data)
    print("data_new\n", data_new, data_new.shape)

    # Compute the correlation coefficient between two variables
    r = pearsonr(data["pe_ratio"], data["pb_ratio"])
    print("Correlation coefficient:\n", r)
    return None


if __name__ == "__main__":
    variance_demo()

In the printed result, the first value is the correlation coefficient; the second is the p-value.
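
pearsonr handles one pair of variables at a time. To scan every pair of features at once, one option is the pandas DataFrame.corr method. The sketch below runs on a small made-up table (an assumption, not the factor data), and the 0.7 cut-off is borrowed from the rule of thumb for high correlation.

```python
import pandas as pd

# Made-up table (assumption); "b" is an exact linear function of "a".
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [3.0, 5.0, 7.0, 9.0, 11.0],
    "c": [2.0, -1.0, 0.5, 3.0, -2.0],
})

corr = df.corr(method="pearson")   # symmetric matrix of pairwise coefficients

# Collect each pair once (upper triangle) whose |r| reaches the cut-off.
cutoff = 0.7
cols = list(corr.columns)
high_pairs = [
    (cols[i], cols[j], corr.iloc[i, j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) >= cutoff
]
print(corr)
print(high_pairs)
```

Pairs flagged this way are candidates for keeping only one of the two features, or for combining them, for example with PCA.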

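Principal component analysis

What is principal component analysis (PCA)

Definition: the process of transforming high-dimensional data into lower-dimensional data; in doing so it may discard some of the original data and create new variables
Purpose: compress the dimensionality of the data, reducing the dimensionality (complexity) of the original data as far as possible while losing only a small amount of information
Applications: regression analysis or cluster analysis

Understanding PCA through a worked example

Reducing two dimensions to one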
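
A minimal NumPy sketch of the two-dimensions-to-one case is given below. The five points are made up for illustration (an assumption), and the steps follow the textbook recipe: center the data, build the covariance matrix, and project onto the eigenvector with the largest eigenvalue.

```python
import numpy as np

# Five made-up 2-D points (assumption), one row per sample.
X = np.array([[-1, -2], [-1, 0], [0, 0], [2, 1], [0, 1]], dtype=float)

Xc = X - X.mean(axis=0)                 # 1) center each feature
cov = Xc.T @ Xc / len(Xc)               # 2) covariance matrix (divide by n)
eigvals, eigvecs = np.linalg.eigh(cov)  # 3) eigenvalues in ascending order
direction = eigvecs[:, -1]              # 4) direction of largest variance
projected = Xc @ direction              # 5) 1-D coordinates along it

ratio = eigvals[-1] / eigvals.sum()     # share of the variance that is kept
print(direction)
print(projected)
print(ratio)
```

The sign of the direction vector is arbitrary, so the projected values may come out negated. scikit-learn's PCA divides by n - 1 rather than n when estimating variance, which changes the eigenvalues but not the direction or the ratio.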

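API

sklearn.decomposition.PCA(n_components=None)
  • Decomposes the data into a lower-dimensional space
  • n_components:
      • Float between 0 and 1: the fraction of the information (variance) to keep
      • Integer: the number of features to reduce to
  • PCA.fit_transform(X)
      • X: data in numpy array format, shape [n_samples, n_features]
      • Returns: an array transformed to the specified number of dimensions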
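
When n_components is a float, the number of components is chosen so that the kept components explain at least that share of the variance. The sketch below, on a toy matrix chosen for illustration (an assumption), reads this back from the fitted attributes n_components_ and explained_variance_ratio_.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy matrix (assumption): 3 samples, 4 features.
X = np.array([[2, 8, 4, 5], [6, 3, 0, 8], [5, 4, 9, 1]], dtype=float)

pca = PCA(n_components=0.95)            # keep at least 95% of the variance
X_reduced = pca.fit_transform(X)

kept = pca.explained_variance_ratio_    # variance share of each kept component
print(pca.n_components_)                # number of components actually kept
print(kept, kept.sum())
print(X_reduced.shape)
```

Checking kept.sum() after fitting is a quick way to confirm how much information the reduced representation still carries.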

Computing on the data

from sklearn.decomposition import PCA


def pca_demo():
    """
    Dimensionality reduction with PCA
    :return: None
    """
    data = [[2,8,4,5], [6,3,0,8], [5,4,9,1]]
    # 1. Instantiate the transformer
    transform = PCA(n_components=2)  # reduce 4 features to 2

    # 2. Call fit_transform
    data_new = transform.fit_transform(data)
    print("data_new\n", data_new)

    transform2 = PCA(n_components=0.95)  # keep 95% of the information
    data_new2 = transform2.fit_transform(data)
    print("data_new2\n", data_new2)

    return None

if __name__ == "__main__":
    pca_demo()


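Case study: exploring users' preferences across item categories, with dimensionality reduction

Data:

  • order_products__prior.csv: order and product information
    Fields: order_id, product_id, add_to_cart_order, reordered
  • products.csv: product information
    Fields: product_id, product_name, aisle_id, department_id
  • orders.csv: users' order information
    Fields: order_id, user_id, eval_set, order_number, …
  • aisles.csv: the specific item category each product belongs to
    Fields: aisle_id, aisle

Requirements

  • user_id and aisle need to be in the same table: merge the tables
  • Find the relationship between user_id and aisle: crosstab and pivot table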
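
Before running on the full files, the merge-then-crosstab idea can be tried on a few in-memory rows. The tiny tables below are made up (an assumption) and only mimic the column names listed above.

```python
import pandas as pd

# Tiny stand-ins for the four CSV files (made-up rows).
aisles = pd.DataFrame({"aisle_id": [1, 2], "aisle": ["fruit", "dairy"]})
products = pd.DataFrame({"product_id": [100, 101, 102], "aisle_id": [1, 2, 2]})
order_products = pd.DataFrame({"order_id": [1, 1, 2, 3],
                               "product_id": [100, 101, 100, 102]})
orders = pd.DataFrame({"order_id": [1, 2, 3], "user_id": [10, 10, 20]})

# Chain the merges so that user_id and aisle end up in one table.
merged = (aisles.merge(products, on="aisle_id")
                .merge(order_products, on="product_id")
                .merge(orders, on="order_id"))

# Count how many items each user bought from each aisle.
table = pd.crosstab(merged["user_id"], merged["aisle"])
print(table)
```

Each row of the resulting table is one user and each column one aisle, which is the samples-by-features layout that PCA expects.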
# Steps:
# 1. Load the data
# 2. Merge the tables
# 3. Find the relationship between user_id and aisle
# 4. Reduce dimensionality with PCA
import pandas as pd
from sklearn.decomposition import PCA
# 1. Load the data
order_products = pd.read_csv('order_products__prior.csv')  # 32434489 rows × 4 columns
products = pd.read_csv('products.csv')  # (49688, 4)
orders = pd.read_csv('orders.csv')      # 3421083 rows × 7 columns
aisles = pd.read_csv('aisles.csv')      # (134, 2)
# 2. Merge the tables
# Merge aisles with products, then order_products, then orders
tab1 = pd.merge(aisles, products, on="aisle_id")        # 49688 rows × 5 columns
tab2 = pd.merge(tab1, order_products, on="product_id")  # 32434489 rows × 8 columns
tab3 = pd.merge(tab2, orders, on="order_id")            # 32434489 rows × 14 columns
# tab3.head()
# 3. Find the relationship between user_id and aisle
table = pd.crosstab(tab3["user_id"], tab3["aisle"])  # 206209 rows × 134 columns
data = table[:10000]  # first 10000 users: 10000 rows × 134 columns
print(data)
# 4. Reduce dimensionality with PCA
# 1) Instantiate the transformer
transfer = PCA(n_components=0.95)  # keep 95% of the information
# 2) Call fit_transform
data_new = transfer.fit_transform(data)  # (10000, 42): reduced from 134 features to 42
print(data_new)
print(data_new.shape)
