LightGBM classification and regression examples

This post is just a collection of excerpts showing some LightGBM (lgbm) code for classification and regression, kept for learning and future reference.
LightGBM on GitHub:
https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst
Parameter explanations:
https://blog.csdn.net/ssswill/article/details/85235074
Example code: https://github.com/Microsoft/LightGBM/tree/master/examples/python-guide
Official documentation:
https://lightgbm.readthedocs.io/en/latest/Python-Intro.html

1. Regression

1.0 Regression example 0

Code: https://github.com/Microsoft/LightGBM/blob/master/examples/python-guide/simple_example.py

# coding: utf-8
# pylint: disable = invalid-name, C0111
import lightgbm as lgb
import pandas as pd
from sklearn.metrics import mean_squared_error

print('Loading data...')
# load or create your dataset
df_train = pd.read_csv('../regression/regression.train', header=None, sep='\t')
df_test = pd.read_csv('../regression/regression.test', header=None, sep='\t')

y_train = df_train[0]
y_test = df_test[0]
X_train = df_train.drop(0, axis=1)
X_test = df_test.drop(0, axis=1)

# create dataset for lightgbm
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

# specify your configurations as a dict
params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'l1'},
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

print('Starting training...')
# train
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=20,
                valid_sets=lgb_eval,
                early_stopping_rounds=5)

print('Saving model...')
# save model to file
gbm.save_model('model.txt')

print('Starting predicting...')
# predict
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
# eval
print('The rmse of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5)
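
The model saved with save_model can be reloaded later without retraining. A minimal sketch, assuming the model.txt file written above and the same X_test/y_test:

# load the booster back from the text file written by save_model
bst = lgb.Booster(model_file='model.txt')
y_pred_loaded = bst.predict(X_test)
print('The rmse of the reloaded model is:', mean_squared_error(y_test, y_pred_loaded) ** 0.5)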

1.1 Regression example 1

Code source: https://www.kaggle.com/chauhuynh/my-first-kernel-3-699

import numpy as np
import pandas as pd
import lightgbm as lgb
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import mean_squared_error

df_train_columns = [c for c in df_train.columns if c not in ['card_id', 'first_active_month','target','outliers']]
target = df_train['target']
del df_train['target']
param = {'num_leaves': 31,
         'min_data_in_leaf': 30, 
         'objective':'regression',
         'max_depth': -1,
         'learning_rate': 0.01,
         "min_child_samples": 20,
         "boosting": "gbdt",
         "feature_fraction": 0.9,
         "bagging_freq": 1,
         "bagging_fraction": 0.9 ,
         "bagging_seed": 11,
         "metric": 'rmse',
         "lambda_l1": 0.1,
         "verbosity": -1,
         "nthread": 4,
         "random_state": 4590}
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=4590)
oof = np.zeros(len(df_train))
predictions = np.zeros(len(df_test))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(df_train,df_train['outliers'].values)):
    print("fold {}".format(fold_))
    trn_data = lgb.Dataset(df_train.iloc[trn_idx][df_train_columns], label=target.iloc[trn_idx])#, categorical_feature=categorical_feats)
    val_data = lgb.Dataset(df_train.iloc[val_idx][df_train_columns], label=target.iloc[val_idx])#, categorical_feature=categorical_feats)

    num_round = 10000
    clf = lgb.train(param, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=100, early_stopping_rounds = 100)
    oof[val_idx] = clf.predict(df_train.iloc[val_idx][df_train_columns], num_iteration=clf.best_iteration)
    
    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = df_train_columns
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions += clf.predict(df_test[df_train_columns], num_iteration=clf.best_iteration) / folds.n_splits

np.sqrt(mean_squared_error(oof, target))


cols = (feature_importance_df[["Feature", "importance"]]
        .groupby("Feature")
        .mean()
        .sort_values(by="importance", ascending=False)[:1000].index)

best_features = feature_importance_df.loc[feature_importance_df.Feature.isin(cols)]

plt.figure(figsize=(14,25))
sns.barplot(x="importance",
            y="Feature",
            data=best_features.sort_values(by="importance",
                                           ascending=False))
plt.title('LightGBM Features (avg over folds)')
plt.tight_layout()
plt.savefig('lgbm_importances.png')
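
LightGBM also ships a built-in plotting helper for a single booster's importances, a lighter-weight alternative to the seaborn bar plot above. A minimal sketch, assuming clf is the booster from the last fold and matplotlib is available as plt:

# built-in importance plot for one trained booster (here the last fold's clf)
ax = lgb.plot_importance(clf, max_num_features=20)
plt.title('LightGBM feature importance (last fold)')
plt.tight_layout()
plt.savefig('lgbm_importances_builtin.png')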

Analysis:
1. The hyperparameters in the code above were tuned in advance, so if you have already settled on yours, you can follow the same pattern.
2. The code clearly uses 5-fold cross-validation: five folds means five models are trained, and each model predicts on the test set. Each prediction is divided by 5 and the five scaled predictions are summed, i.e. the fold predictions are averaged, which improves generalization (see the sketch below).
Final result:
y_pred = (y_1 + y_2 + y_3 + y_4 + y_5) / 5
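
The running sum predictions += pred / n_splits inside the loop is just an incremental way of taking the mean of the five per-fold prediction arrays. A self-contained toy check of that equivalence, with random arrays standing in for the real fold predictions:

import numpy as np

# five hypothetical per-fold prediction arrays (random placeholders)
fold_preds = [np.random.rand(10) for _ in range(5)]

# running-sum form used in the CV loop above
predictions = np.zeros(10)
for pred in fold_preds:
    predictions += pred / len(fold_preds)

# the one-shot average gives the same result
assert np.allclose(predictions, np.mean(fold_preds, axis=0))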

1.2 Regression example 2

Highly similar to regression example 1; you can skip it if you like.

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error


lgb_params = {"objective" : "regression", "metric" : "rmse", 
               "max_depth": 7, "min_child_samples": 20, 
               "reg_alpha": 1, "reg_lambda": 1,
               "num_leaves" : 64, "learning_rate" : 0.01, 
               "subsample" : 0.8, "colsample_bytree" : 0.8, 
               "verbosity": -1}

FOLDs = KFold(n_splits=5, shuffle=True, random_state=42)

oof_lgb = np.zeros(len(train_X))
predictions_lgb = np.zeros(len(test_X))

features_lgb = list(train_X.columns)
feature_importance_df_lgb = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(FOLDs.split(train_X)):
    trn_data = lgb.Dataset(train_X.iloc[trn_idx], label=train_y.iloc[trn_idx])
    val_data = lgb.Dataset(train_X.iloc[val_idx], label=train_y.iloc[val_idx])

    print("-" * 20 +"LGB Fold:"+str(fold_)+ "-" * 20)
    num_round = 10000
    clf = lgb.train(lgb_params, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=1000, early_stopping_rounds = 50)
    oof_lgb[val_idx] = clf.predict(train_X.iloc[val_idx], num_iteration=clf.best_iteration)

    fold_importance_df_lgb = pd.DataFrame()
    fold_importance_df_lgb["feature"] = features_lgb
    fold_importance_df_lgb["importance"] = clf.feature_importance()
    fold_importance_df_lgb["fold"] = fold_ + 1
    feature_importance_df_lgb = pd.concat([feature_importance_df_lgb, fold_importance_df_lgb], axis=0)
    predictions_lgb += clf.predict(test_X, num_iteration=clf.best_iteration) / FOLDs.n_splits
    

print("Best RMSE: ",np.sqrt(mean_squared_error(oof_lgb, train_y)))

In fact, the key LightGBM statements boil down to:

    clf = lgb.train(param, trn_data, num_round,
                    valid_sets=[trn_data, val_data],
                    verbose_eval=100,
                    early_stopping_rounds=100)
    oof[val_idx] = clf.predict(df_train.iloc[val_idx][df_train_columns], num_iteration=clf.best_iteration)

whereas the familiar scikit-learn style is fit and predict.
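
For comparison, the same training step with the scikit-learn style wrapper looks like the usual fit/predict pattern. A minimal sketch, assuming X_train, y_train, X_val, y_val are already prepared; the early-stopping callback interface is for recent LightGBM versions:

import lightgbm as lgb

# scikit-learn style interface: construct, fit, predict
reg = lgb.LGBMRegressor(num_leaves=31, learning_rate=0.01, n_estimators=10000)
reg.fit(X_train, y_train,
        eval_set=[(X_val, y_val)],
        eval_metric='rmse',
        callbacks=[lgb.early_stopping(100)])  # stop after 100 rounds without improvement
y_pred = reg.predict(X_val, num_iteration=reg.best_iteration_)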

2. Classification

2.1 Binary classification

Code from: https://www.kaggle.com/waitingli/combining-your-model-with-a-model-without-outlier

import time
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import KFold
from sklearn.metrics import log_loss

param = {'num_leaves': 31,
         'min_data_in_leaf': 30, 
         'objective':'binary',
         'max_depth': 6,
         'learning_rate': 0.01,
         "boosting": "rf",
         "feature_fraction": 0.9,
         "bagging_freq": 1,
         "bagging_fraction": 0.9 ,
         "bagging_seed": 11,
         "metric": 'binary_logloss',
         "lambda_l1": 0.1,
         "verbosity": -1,
         "random_state": 2333}
folds = KFold(n_splits=5, shuffle=True, random_state=15)
oof = np.zeros(len(df_train))
predictions = np.zeros(len(df_test))
feature_importance_df = pd.DataFrame()

start = time.time()


for fold_, (trn_idx, val_idx) in enumerate(folds.split(df_train.values, target.values)):
    print("fold n°{}".format(fold_))
    trn_data = lgb.Dataset(df_train.iloc[trn_idx][features], label=target.iloc[trn_idx], categorical_feature=categorical_feats)
    val_data = lgb.Dataset(df_train.iloc[val_idx][features], label=target.iloc[val_idx], categorical_feature=categorical_feats)

    num_round = 10000
    clf = lgb.train(param, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=100, early_stopping_rounds = 200)
    oof[val_idx] = clf.predict(df_train.iloc[val_idx][features], num_iteration=clf.best_iteration)
    
    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = features
    fold_importance_df["importance"] = clf.feature_importance()
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    
    predictions += clf.predict(df_test[features], num_iteration=clf.best_iteration) / folds.n_splits

print("CV score: {:<8.5f}".format(log_loss(target, oof)))


### 'target' is the probability of whether an observation is an outlier
df_outlier_prob = pd.DataFrame({"card_id":df_test["card_id"].values})
df_outlier_prob["target"] = predictions
df_outlier_prob.head()
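
Since predict with objective 'binary' returns probabilities, hard class labels have to be derived by thresholding. A minimal sketch; the 0.5 cutoff is an assumption, not part of the original kernel:

# convert the outlier probability into a hard 0/1 label with an assumed 0.5 threshold
df_outlier_prob["is_outlier"] = (df_outlier_prob["target"] > 0.5).astype(int)
print(df_outlier_prob["is_outlier"].value_counts())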

