encoders.kryo
In the past I’ve posted about the various categorical encoding methods one can use for machine learning tasks, like one-hot, ordinal, or binary encoding. In my OSS package, category_encoders, I’ve added a single scikit-learn-compatible encoder called BaseNEncoder, which allows the user to pick a base (2 for binary, N for ordinal, 1 for one-hot, or anything in between) and get consistently encoded categorical variables out. Note that base 1 and one-hot aren’t really the same thing, but in this case it’s convenient to consider them as such.
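The core idea behind BaseNEncoder can be sketched in a few lines of plain Python. This is an illustrative re-implementation, not the library’s actual code, and the real encoder’s column layout and digit ordering may differ: each distinct category is mapped to an ordinal index, and that index is written out in the chosen base, one output column per digit.

```python
import math

def basen_encode(values, base):
    """Illustrative base-N categorical encoding.

    Maps each distinct category to an ordinal index, then writes
    that index in the chosen base, one output column per digit.
    Base 1 is treated as one-hot, matching the convention above.
    """
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    # number of digit columns needed to represent the largest index
    if base > 1:
        n_digits = max(1, math.ceil(math.log(len(categories), base)))
    else:
        n_digits = len(categories)  # one indicator column per category
    rows = []
    for v in values:
        i = index[v]
        if base == 1:
            row = [1 if i == j else 0 for j in range(n_digits)]
        else:
            row = []
            for _ in range(n_digits):
                row.append(i % base)
                i //= base
            row.reverse()  # most significant digit first
        rows.append(row)
    return rows

data = ["a", "b", "c", "d", "e"]
print(basen_encode(data, 2))  # 5 categories -> 3 binary columns
print(basen_encode(data, 5))  # base 5 -> a single ordinal-style column
```

Notice how the output width shrinks as the base grows: five categories need five columns at base 1, three at base 2, and only one at base 5.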
Practically, this adds very little new functionality; people rarely use base 3 or base 8, or any base other than ordinal or binary, in real problems. Where it becomes useful, however, is when this encoder is coupled with a grid search.
from __future__ import print_function
from sklearn import datasets
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from category_encoders.basen import BaseNEncoder
from examples.source_data.loaders import get_mushroom_data
from sklearn.linear_model import LogisticRegression

# first we get data from the mushroom dataset
X, y, _ = get_mushroom_data()
X = X.values  # use a numpy array, not a dataframe, here
n_samples = X.shape[0]

# split the dataset into two equal parts
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# create a pipeline
ppl = Pipeline([
    ('enc', BaseNEncoder(base=2, return_df=False, verbose=True)),
    ('clf', LogisticRegression())
])

# set the parameters by cross-validation
tuned_parameters = {
    'enc__base': [1, 2, 3, 4, 5, 6]
}

scores = ['precision', 'recall']
for score in scores:
    print("# Tuning hyper-parameters for %s\n" % score)
    clf = GridSearchCV(ppl, tuned_parameters, cv=5, scoring='%s_macro' % score)
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:\n")
    print(clf.best_params_)
    print("\nGrid scores on development set:\n")
    # each entry of grid_scores_ unpacks as (params, mean score, per-fold scores)
    for params, mean_score, cv_scores in clf.grid_scores_:
        print("%s (+/-%s) for %s" % (params, mean_score * 2, cv_scores))

    print("\nDetailed classification report:\n")
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.\n")
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
This code, from HERE, uses a normal scikit-learn grid search to find the optimal base for encoding categorical variables. The trade-off between how well pairwise distances between categories are preserved and the dimensionality of the final dataset is no longer a difficult parameter to tune.
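To make that trade-off concrete: for a feature with k distinct categories, a base-N encoding needs roughly ceil(log_base(k)) output columns for base ≥ 2, versus k columns for one-hot (base 1). A quick calculation, using a hypothetical feature with 100 categories, shows how quickly the dimensionality drops as the base grows:

```python
import math

k = 100  # hypothetical number of distinct categories in one feature

for base in [1, 2, 3, 4, 5, 6]:
    # base 1 behaves like one-hot: one column per category
    cols = k if base == 1 else math.ceil(math.log(k, base))
    print("base=%d -> %d columns" % (base, cols))
```

So the grid search is really choosing a point on this curve: lower bases preserve more information per category at the cost of more columns, and higher bases compress harder.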
By running the above script we get:
# Tuning hyper-parameters for precision

Best parameters set found on development set:
{'enc__base': 1}
Grid scores on development set:
{'enc__base': 1} (+/-1.99905151856) for [ 1. 1. 1. 1. 0.9976247]
{'enc__base': 2} (+/-1.98737951324) for [ 0.9805492 0.99763033 0.99621212 0.9964455 0.9976247 ]
{'enc__base': 3} (+/-1.95968049624) for [ 0.99411765 0.98387419 0.9651717 0.96970966 0.98633155]
{'enc__base': 4} (+/-1.96534331006) for [ 0.99500636 0.96541172 0.98387419 0.99013767 0.97892831]
{'enc__base': 5} (+/-1.96034803727) for [ 0.97773263 0.97556628 0.98636545 0.97058734 0.99063232]
{'enc__base': 6} (+/-1.93791104567) for [ 0.96788716 0.95480882 0.97648608 0.97769848 0.96790524]
Detailed classification report:
The model is trained on the full development set.
The scores are computed on the full evaluation set.
precision recall f1-score support
0 1.00 1.00 1.00 2110
1 1.00 1.00 1.00 1952
avg / total 1.00 1.00 1.00 4062
# Tuning hyper-parameters for recall
Best parameters set found on development set:
{'enc__base': 1}
Grid scores on development set:
{'enc__base': 1} (+/-1.99802826596) for [ 0.99761905 1. 1. 1. 0.99744898]
{'enc__base': 2} (+/-1.98660963142) for [ 0.98904035 0.98854962 0.99745547 1. 0.99148239]
{'enc__base': 3} (+/-1.88434381179) for [ 0.95086332 0.8547619 0.94664667 0.98862857 0.97008487]
{'enc__base': 4} (+/-1.98025257596) for [ 0.99261178 0.98005271 0.98436023 0.99618321 0.99744898]
{'enc__base': 5} (+/-1.93166516505) for [ 0.98530534 0.98657761 0.89642857 0.9800385 0.98086735]
{'enc__base': 6} (+/-1.94647463413) for [ 0.96687568 0.97385496 0.99507452 0.95912053 0.97123861]
Detailed classification report:
The model is trained on the full development set.
The scores are computed on the full evaluation set.
precision recall f1-score support
0 1.00 1.00 1.00 2110
1 1.00 1.00 1.00 1952
avg / total 1.00 1.00 1.00 4062
This shows us that for this relatively simple problem, with a small dataset, using the dimension-inefficient one-hot encoding (base=1) is the best option available. We’ve got a lot of cool projects in the pipeline in preparation for the 1.3.0 release, the first release since being included in scikit-learn-contrib, so if you’re interested in this kind of work, head over to GitHub or reach out here to get involved.
Originally published at: https://www.pybloggers.com/2016/12/basen-encoding-and-grid-search-in-category_encoders/