Batched Incremental Training on Large Datasets in Data Mining

1、LightGBM

The LightGBM boosting algorithm (illustrations + theory + incremental-training Python code + LightGBM parameter tuning)

Python code for incremental training

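I think the code for this is on another computer, so this part is still to be updated; I'll tidy the code up on Monday. First, a quick explanation of incremental training: the model cannot take in all the data at once because memory would run out, yet every row still has to be read. The answer is to read the data as a stream, which is essentially an iterator.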

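Each step reads one part of the file, trains the model on it, and keeps the resulting model. The next step reads another part of the file and continues training that same model. Once the whole file has been read this way, the model has been trained on all of the data.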
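The chunked reading described above can be sketched with the standard library alone. This is a minimal illustration rather than the author's code: the function name `iter_csv_batches`, the header-less CSV layout with the label in the last column, and the temporary demo file are all assumptions made for the example.

```python
import csv
import os
import tempfile
from itertools import islice


def iter_csv_batches(path, batch_size=1000):
    """Yield (X, y) lists of at most batch_size rows without loading the whole file.

    Assumes a header-less numeric CSV whose last column is the label.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        while True:
            rows = list(islice(reader, batch_size))  # pulls only the next chunk
            if not rows:
                break
            X = [[float(v) for v in row[:-1]] for row in rows]
            y = [float(row[-1]) for row in rows]
            yield X, y


# Usage: write a small 10-row demo file, then stream it back 4 rows at a time.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    writer = csv.writer(tmp)
    for i in range(10):
        writer.writerow([i, i * 2, i * 3])
    demo_path = tmp.name

batch_sizes = [len(y) for _, y in iter_csv_batches(demo_path, batch_size=4)]
os.remove(demo_path)
print(batch_sizes)  # prints [4, 4, 2]
```

Unlike a reader that buffers the entire dataset first, this generator holds only one batch in memory at a time, and it also yields the final, shorter batch.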

1. Reading a file as a stream

 
    import numpy as np
    from sklearn.utils import shuffle

    def iter_minibatches(minibatch_size=1000):
        '''
        Iterator.
        Given a data stream (for example, a large file), yield minibatch_size rows at a time (1,000 by default).
        Each batch is converted to NumPy arrays and returned as X, y.
        '''
        X = []
        y = []
        cur_line_num = 0

        # load_data() is the author's own helper; it is not shown in this post.
        train_data, train_label, train_weight, test_data, test_label, test_file = load_data()
        # A fixed random_state makes the shuffle order the same on every run.
        train_data, train_label = shuffle(train_data, train_label, random_state=0)
        print(type(train_label), train_label)

        for data_x, label_y in zip(train_data, train_label):
            X.append(data_x)
            y.append(label_y)
            cur_line_num += 1

            if cur_line_num >= minibatch_size:
                X, y = np.array(X), np.array(y)  # convert the batch to NumPy arrays before returning it
                yield X, y
                X, y = [], []
                cur_line_num = 0
        # Note: rows left over after the last full batch are not yielded.

2. Incremental training with LightGBM (LGB)

 
    import lightgbm as lgb
    from sklearn.model_selection import train_test_split

    def lightgbmTest():
        # Step 1: start with no model and set the model parameters.
        gbm = None
        params = {
            'task': 'train',
            'application': 'regression',     # objective function
            'boosting_type': 'gbdt',         # boosting type
            'learning_rate': 0.01,           # learning rate
            'num_leaves': 50,                # number of leaves per tree
            'tree_learner': 'serial',
            'min_data_in_leaf': 100,
            'metric': ['l1', 'l2', 'rmse'],  # evaluation metrics; l1: MAE, l2: MSE
            'max_bin': 255,
            # Alias of num_boost_round. A value given here overrides the
            # num_boost_round argument (originally 1000), so it is set only here.
            'num_trees': 300
        }

        # Step 2: read the data as a stream, 10,000 rows at a time.
        minibatch_train_iterators = iter_minibatches(minibatch_size=10000)

        for i, (X_, y_) in enumerate(minibatch_train_iterators):
            # Build the LightGBM datasets for this batch.
            # y_ = list(map(float, y_))  # convert numpy.ndarray to list
            X_train, X_test, y_train, y_test = train_test_split(X_, y_, test_size=0.1, random_state=0)
            y_train = y_train.ravel()
            lgb_train = lgb.Dataset(X_train, y_train)
            lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

            # Step 3: train the model incrementally.
            # The key point: init_model and keep_training_booster together enable incremental training.
            gbm = lgb.train(params,
                            lgb_train,
                            valid_sets=lgb_eval,
                            init_model=gbm,  # if gbm is not None, training continues from the previous model
                            # feature_name=x_cols,
                            # LightGBM 4.0 removed early_stopping_rounds=10 and verbose_eval=False
                            # from train(); the callback below is the replacement.
                            callbacks=[lgb.early_stopping(stopping_rounds=10, verbose=False)],
                            keep_training_booster=True)  # incremental training

            print("{} time".format(i))  # index of the current batch

            # Print the evaluation scores of the current model.
            score_train = dict([(s[1], s[2]) for s in gbm.eval_train()])
            print('Current model scores on the training set: mae=%.4f, mse=%.4f, rmse=%.4f'
                  % (score_train['l1'], score_train['l2'], score_train['rmse']))

        return gbm

3. Calling the LightGBM (LGB) training routine and saving the trained model

 
    import joblib
    from sklearn.model_selection import train_test_split

    '''Incremental training with LightGBM'''
    print('Incremental training with LightGBM')
    # load_data() and compute_loss() are the author's own helpers; they are not shown in this post.
    train_data, train_label, train_weight, test_data, test_label, test_file = load_data()
    print(train_label.shape, train_data.shape)
    train_X, test_X, train_Y, test_Y = train_test_split(train_data, train_label, test_size=0.1, random_state=0)
    # train_X, train_Y = shuffle(train_data, train_label, random_state=0)  # a fixed random_state keeps the shuffle order reproducible
    # Note: lightgbmTest() reloads and trains on all rows from load_data(), so test_X overlaps the training data.
    gbm = lightgbmTest()
    pred_Y = gbm.predict(test_X)
    print('compute_loss:{}'.format(compute_loss(test_Y, pred_Y)))

    # gbm.save_model('lightgbmtest.model')
    # Save the model
    joblib.dump(gbm, 'loan_model.pkl')
    # Load the model
    gbm = joblib.load('loan_model.pkl')

2、CatBoost

fit - CatBoostClassifier | CatBoost 

 fit(X, y=None, cat_features=None, text_features=None, embedding_features=None, sample_weight=None, baseline=None, use_best_model=None, eval_set=None, verbose=None, logging_level=None, plot=False, column_description=None, verbose_eval=None, metric_period=None, silent=None, early_stopping_rounds=None, save_snapshot=None, snapshot_file=None, snapshot_interval=None, init_model=None, log_cout=sys.stdout, log_cerr=sys.stderr)

init_model

Description

The model to continue learning from.

Note

The initial model must have the same problem type as the one being solved in the current training (binary classification, multiclassification or regression/ranking).

Possible types

The description is different for each group of possible types.

catboost.CatBoost, catboost.CatBoostClassifier

The initial model object.

string

The path to the input file that contains the initial model.

Default value

None (incremental learning is not used)

Supported processing units

CPU

3、XGBoost

 
