Alibaba Cloud Tianchi Machine Learning Training Camp: Task 3

Author: 眰恦393

Tags: python

Article link: https://blog.csdn.net/weixin_56123508/article/details/123863368

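1.1 Introduction to LightGBM

LightGBM is a scalable machine learning system from Microsoft, open-sourced in 2016 and described in a 2017 paper. It is an open-source project under Microsoft's DMTK (Distributed Machine Learning Toolkit), and its development was led by Guolin Ke (柯国霖), one of the winners of the first Alibaba Big Data Competition in 2014. It is a distributed gradient boosting framework based on the GBDT (gradient boosting decision tree) algorithm. To shorten model training time, its design focuses on reducing the memory and compute that the data requires, and on cutting the communication cost of parallel training across multiple machines.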
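To make the GBDT idea concrete, here is a minimal sketch of gradient boosting for squared loss, using one-split trees (stumps) on a single feature. This is not LightGBM's implementation: LightGBM adds histogram-based split finding, leaf-wise tree growth, and other optimizations. The function names (`fit_stump`, `fit_gbdt`) and the synthetic data are made up for illustration.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the threshold on x whose two-sided means best fit the residuals."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        loss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or loss < best[0]:
            best = (loss, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def fit_gbdt(x, y, n_rounds=50, learning_rate=0.1):
    """Boosting loop: each stump fits the residuals left by the previous ones."""
    base = y.mean()
    prediction = np.full_like(y, base, dtype=float)
    stumps, losses = [], []
    for _ in range(n_rounds):
        residual = y - prediction  # negative gradient of 0.5 * (y - f)^2
        stump = fit_stump(x, residual)
        prediction += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
        losses.append(float(np.mean((y - prediction) ** 2)))
    return base, stumps, losses

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

base, stumps, losses = fit_gbdt(x, y)
print(f"training MSE after round 1: {losses[0]:.4f}, after round 50: {losses[-1]:.4f}")
```

Each round adds a small correction in the direction that reduces the training loss, which is why the training error never increases from one round to the next in this setup.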

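LightGBM can be regarded as an upgraded version of XGBoost: it reaches accuracy close to XGBoost's while training faster and using less memory. As the "Light" in its name suggests, it runs lightly on large datasets, and soon after release it became a favorite tool for topping leaderboards in data competitions.

Main advantages of LightGBM:

Easy to use. It provides interfaces for the mainstream languages Python, C++, and R, so users can build LightGBM models easily and get good results.
Efficient and scalable. It processes large datasets quickly and accurately, with modest demands on memory and other hardware resources.
Robust. Unlike deep learning models, it can reach comparable results without fine-grained hyperparameter tuning.
Direct support for missing values and categorical features. No special preprocessing of the data is required.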

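Main disadvantages of LightGBM:

Unlike deep learning models, it cannot model spatial or temporal structure, so it does not capture high-dimensional data such as images, speech, and text well.
When massive training data is available and a suitable deep learning model can be found, deep learning can far exceed LightGBM in accuracy.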

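1.2 Applications of LightGBM

LightGBM is very widely used in machine learning and data mining. Reportedly, LightGBM models placed in the top three of Kaggle data competitions more than thirty times between 2016 and 2019, including well-known competitions such as CIKM2017 AnalytiCup and IEEE Fraud Detection. These competitions come from real business problems across many industries, and the results indicate that LightGBM scales well and performs strongly on many kinds of problems.

LightGBM has also been applied successfully to a range of problems in industry and academia, in fields such as financial risk control, purchase behavior recognition, traffic flow prediction, environmental sound classification, gene classification, and biological composition analysis. Domain-specific data analysis and feature engineering also played an important role in these solutions, but the consistent choice of LightGBM by learners and practitioners shows the influence and importance of this package.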

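2. Lab Manual

2.1 Learning Objectives

Understand LightGBM's parameters and related concepts
Learn to call LightGBM from Python and apply it to the League of Legends win/loss prediction dataset

2.2 Code Workflow

Part1 LightGBM classification practice on the League of Legends dataset

Step1: Import libraries
Step2: Read/load the data
Step3: Take a quick look at the data
Step4: Visualize the data
Step5: Train and predict with LightGBM
Step6: Select features with LightGBM
Step7: Tune parameters for better results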

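2.3 Hands-on Practice

2.3.1 LightGBM classification on the League of Legends dataset

We start by importing some basic libraries: numpy (the fundamental package for scientific computing in Python), pandas (a fast, powerful, flexible, and easy-to-use open-source tool for data analysis and manipulation), and matplotlib and seaborn for plotting.

# Download the dataset used in this exercise
!wget https://tianchi-media.oss-cn-beijing.aliyuncs.com/DSW/8LightGBM/high_diamond_ranked_10min.csv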

Step1: Import libraries

## Basic libraries
import numpy as np
import pandas as pd

## Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

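In this exercise we use a League of Legends dataset to try out LightGBM. League of Legends is a MOBA online game released in 2009 by the US developer Riot Games. In each match, a blue team and a red team fight on the same map. The goal is to destroy the enemy team's turrets and then their Nexus to win the match.

The dataset contains 9,879 ranked matches played at Diamond tier and above on the Korean server. It records the game state at the ten-minute mark, including kills, deaths, gold, experience, levels, and more. The blueWins column is the label and indicates whether the blue team won the match.

The features are described below. Each one appears twice in the data, once with a blue prefix and once with a red prefix: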

| Feature | Meaning | Value type |
|--------------------------|------------------|----------|
| WardsPlaced | Wards placed | integer |
| WardsDestroyed | Wards destroyed | integer |
| FirstBlood | Whether the team got first blood | integer |
| Kills | Champion kills | integer |
| Deaths | Deaths | integer |
| Assists | Assists | integer |
| EliteMonsters | Elite (large) monsters killed | integer |
| Dragons | Epic monsters (dragons) killed | integer |
| Heralds | Rift Heralds killed | integer |
| TowersDestroyed | Towers destroyed | integer |
| TotalGold | Total gold | integer |
| AvgLevel | Average champion level | float |
| TotalExperience | Total champion experience | integer |
| TotalMinionsKilled | Minions killed | integer |
| TotalJungleMinionsKilled | Jungle monsters killed | integer |
| GoldDiff | Gold difference | integer |
| ExperienceDiff | Experience difference | integer |
| CSPerMin | Creep score (minions killed) per minute | float |
| GoldPerMin | Gold per minute | float |
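Several of these features look like simple functions of the others. The sketch below shows the presumed relationships on a toy frame with made-up values. It assumes that GoldDiff is the team's total gold minus the opponent's, and that the per-minute columns are the ten-minute totals divided by 10. These are assumptions to verify against the real data, not documented facts about the dataset.

```python
import pandas as pd

# Made-up values for three matches, for illustration only
toy = pd.DataFrame({
    "blueTotalGold": [17210, 14712, 15113],
    "redTotalGold": [16567, 17620, 17285],
    "blueTotalMinionsKilled": [195, 174, 186],
})

# Assumed relationships between raw and derived columns
toy["blueGoldDiff"] = toy["blueTotalGold"] - toy["redTotalGold"]
toy["redGoldDiff"] = -toy["blueGoldDiff"]
toy["blueGoldPerMin"] = toy["blueTotalGold"] / 10
toy["blueCSPerMin"] = toy["blueTotalMinionsKilled"] / 10

print(toy)
```

If the same relationships hold in the real CSV, the derived columns add no new information and are natural candidates to drop before training. A check such as `(df["blueGoldDiff"] == df["blueTotalGold"] - df["redTotalGold"]).all()` would confirm or refute the assumption.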

Step2: Read/load the data

## Read the CSV with pandas' read_csv function, which returns a DataFrame
df = pd.read_csv('./high_diamond_ranked_10min.csv')
y = df.blueWins
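With the label stored in `y`, the remaining columns can be split off as the feature matrix. The sketch below shows one way to do that on a tiny inline CSV with made-up rows, so it runs without the downloaded file. Dropping `gameId` is an assumption on our part: it is an identifier rather than a game statistic, so it should carry no predictive signal.

```python
import io

import pandas as pd

# Tiny made-up CSV that mimics a few columns of the real file
csv_text = """gameId,blueWins,blueKills,blueDeaths,redKills,redDeaths
1,1,9,6,6,9
2,0,5,5,5,5
3,0,7,11,11,7
4,1,8,3,3,8
"""
demo = pd.read_csv(io.StringIO(csv_text))

y_demo = demo["blueWins"]                            # label
X_demo = demo.drop(columns=["gameId", "blueWins"])   # features only

win_rate = y_demo.mean()  # share of matches won by the blue team
print(X_demo.shape, win_rate)
```

Checking the win rate early is worthwhile: if the two classes are roughly balanced, plain accuracy is a reasonable metric for the later steps.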

Step3: Take a quick look at the data

## Use .info() to view overall information about the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9879 entries, 0 to 9878
Data columns (total 40 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   gameId                        9879 non-null   int64  
 1   blueWins                      9879 non-null   int64  
 2   blueWardsPlaced               9879 non-null   int64  
 3   blueWardsDestroyed            9879 non-null   int64  
 4   blueFirstBlood                9879 non-null   int64  
 5   blueKills                     9879 non-null   int64  
 6   blueDeaths                    9879 non-null   int64  
 7   blueAssists                   9879 non-null   int64  
 8   blueEliteMonsters             9879 non-null   int64  
 9   blueDragons                   9879 non-null   int64  
 10  blueHeralds                   9879 non-null   int64  
 11  blueTowersDestroyed           9879 non-null   int64  
 12  blueTotalGold                 9879 non-null   int64  
 13  blueAvgLevel                  9879 non-null   float64
 14  blueTotalExperience           9879 non-null   int64  
 15  blueTotalMinionsKilled        9879 non-null   int64  
 16  blueTotalJungleMinionsKilled  9879 non-null   int64  
 17  blueGoldDiff                  9879 non-null   int64  
 18  blueExperienceDiff            9879 non-null   int64  
 19  blueCSPerMin                  9879 non-null   float64
 20  blueGoldPerMin                9879 non-null   float64
 21  redWardsPlaced                9879 non-null   int64  
 22  redWardsDestroyed             9879 non-null   int64  
 23  redFirstBlood                 9879 non-null   int64  
 24  redKills                      9879 non-null   int64  
 25  redDeaths                     9879 non-null   int64  
 26  redAssists                    9879 non-null   int64  
 27  redEliteMonsters              9879 non-null   int64  
 28  redDragons                    9879 non-null   int64  
 29  redHeralds                    9879 non-null   int64  
 30  redTowersDestroyed            9879 non-null   int64  
 31  redTotalGold                  9879 non-null   int64  
 32  redAvgLevel                   9879 non-null   float64
 33  redTotalExperience            9879 non-null   int64  
 34  redTotalMinionsKilled         9879 non-null   int64  
 35  redTotalJungleMinionsKilled   9879 non-null   int64  
 36  redGoldDiff                   9879 non-null   int64  
 37  redExperienceDiff             9879 non-null   int64  
 38  redCSPerMin                   9879 non-null   float64
 39  redGoldPerMin                 9879 non-null   float64
dtypes: float64(6), int64(34)
memory usage: 3.0 MB

## For a quick look at the rows, use .head() for the first rows and .tail() for the last rows
df.head()

	gameId	blueWins	blueWardsPlaced	blueWardsDestroyed	blueFirstBlood	blueKills	blueDeaths	blueAssists	blueEliteMonsters	blueDragons	.
