scikit-learn wrappers for Python fastText

skift

scikit-learn wrappers for Python fastText.

>>> import pandas

>>> from skift import FirstColFtClassifier

>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])

>>> sk_clf = FirstColFtClassifier(lr=0.3, epoch=10)

>>> sk_clf.fit(df[['txt']], df['lbl'])

>>> sk_clf.predict([['woof']])

[0]

Dependencies:

numpy

scipy

scikit-learn

The fasttext Python package

pip install skift

Because fasttext reads input data from files, skift has to dump the input data into temporary files for fasttext to use. A dedicated folder is created for those files on the filesystem. By default, this folder is created in the system's temporary storage location (e.g. /tmp on *nix systems). To override this default location, set the SKIFT_TEMP_DIR environment variable:

export SKIFT_TEMP_DIR=/path/to/desired/temp/folder

NOTE: The directory will be created if it does not already exist.
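
The temporary-file behavior described above can be sketched roughly as follows. This is an illustration using only the standard library, not skift's actual code: the helper names resolve_temp_dir and dump_training_file and the skift_example fallback folder are hypothetical, while the __label__ prefix is the line format fastText expects for supervised training data.

```python
import os
import tempfile


def resolve_temp_dir(env_var="SKIFT_TEMP_DIR"):
    """Return the directory for temporary training files, creating it if needed."""
    # Hypothetical fallback: a subfolder of the system temp location.
    default_dir = os.path.join(tempfile.gettempdir(), "skift_example")
    temp_dir = os.environ.get(env_var, default_dir)
    os.makedirs(temp_dir, exist_ok=True)
    return temp_dir


def dump_training_file(texts, labels, temp_dir):
    """Write one '__label__<label> <text>' line per sample and return the file path."""
    fd, path = tempfile.mkstemp(suffix=".txt", dir=temp_dir, text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        for text, label in zip(texts, labels):
            f.write("__label__{} {}\n".format(label, text))
    return path


temp_dir = resolve_temp_dir()
train_path = dump_training_file(["woof", "meow"], [0, 1], temp_dir)
with open(train_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
os.remove(train_path)  # clean up the temporary file once it has been read
```

Setting SKIFT_TEMP_DIR before running the sketch redirects the file to that folder; otherwise the system temporary location is used.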

Adheres to the scikit-learn classifier API, including predict_proba.

Also caters to the common use case of pandas.DataFrame inputs.

Enables easy stacking of fastText with other types of scikit-learn-compliant classifiers.

Pickle-able classifier objects.

Pure Python.

Supports Python 3.5+.

fastText works only on text data, which means that it will only use a single column from a dataset which might contain many feature columns of different types. As such, a common use case is to have the fastText classifier use a single column as input, ignoring other columns. This is especially true when fastText is to be used as one of several classifiers in a stacking classifier, with other classifiers using non-textual features.

skift includes several scikit-learn-compatible wrappers (for the official fastText Python package) which cater to these use cases.

NOTICE: Any additional keyword arguments provided to the classifier constructor, besides those required, will be forwarded to the fastText.train_supervised method on every call to fit.
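
The forwarding pattern in the notice above can be illustrated with a toy wrapper. This is a sketch under stated assumptions, not skift's implementation: fake_train_supervised is a hypothetical stand-in for the fastText training function, and KwargsForwardingClassifier is an illustrative name.

```python
def fake_train_supervised(input, **kwargs):
    """Hypothetical stand-in for the fastText training function; records its arguments."""
    return {"input": input, "params": dict(kwargs)}


class KwargsForwardingClassifier:
    """Toy wrapper: keeps constructor keyword arguments and forwards them on each fit."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs
        self.model = None

    def fit(self, train_path):
        # The stored keyword arguments are passed again on every call to fit.
        self.model = fake_train_supervised(train_path, **self.kwargs)
        return self


clf = KwargsForwardingClassifier(lr=0.3, epoch=10)
clf.fit("train.txt")
params = clf.model["params"]
```

Here params holds exactly the keyword arguments given to the constructor, which is why options such as lr and epoch in the examples take effect at training time.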

The following wrappers make no assumptions about their input beyond those commonly made by scikit-learn classifiers, e.g. that the input is a 2D ndarray object.

FirstColFtClassifier - An sklearn classifier adapter for fasttext that takes the first column of input ndarray objects as input.

>>> from skift import FirstColFtClassifier

>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])

>>> sk_clf = FirstColFtClassifier(lr=0.3, epoch=10)

>>> sk_clf.fit(df[['txt']], df['lbl'])

>>> sk_clf.predict([['woof']])

[0]

IdxBasedFtClassifier - An sklearn classifier adapter for fasttext that takes input by column index. This is set on object construction by providing the input_ix parameter to the constructor.

>>> from skift import IdxBasedFtClassifier

>>> df = pandas.DataFrame([[5, 'woof', 0], [83, 'meow', 1]], columns=['count', 'txt', 'lbl'])

>>> sk_clf = IdxBasedFtClassifier(input_ix=1, lr=0.4, epoch=6)

>>> sk_clf.fit(df[['count', 'txt']], df['lbl'])

>>> sk_clf.predict([[5, 'woof']])

[0]
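
Selecting the input column by position can be sketched in plain Python as below. This is a simplified illustration, not skift's code, and select_column is a hypothetical helper. It suggests why rows passed to predict should have the same column layout as the rows passed to fit.

```python
def select_column(rows, input_ix):
    """Return the value at position input_ix from each row of a 2D input."""
    return [row[input_ix] for row in rows]


train_rows = [[5, "woof"], [83, "meow"]]
texts = select_column(train_rows, input_ix=1)

# A row with a single column has no position 1, so selection fails.
try:
    select_column([["woof"]], input_ix=1)
    single_column_ok = True
except IndexError:
    single_column_ok = False
```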

The following wrappers assume that the X parameter passed to the fit, predict, and predict_proba methods is a pandas.DataFrame object:

FirstObjFtClassifier - An sklearn adapter for fasttext using the first column of dtype == object as input.

>>> from skift import FirstObjFtClassifier

>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])

>>> sk_clf = FirstObjFtClassifier(lr=0.2)

>>> sk_clf.fit(df[['txt']], df['lbl'])

>>> sk_clf.predict(pandas.DataFrame([['woof']], columns=['txt']))

[0]

ColLblBasedFtClassifier - An sklearn adapter for fasttext taking input by column label. This is set on object construction by providing the input_col_lbl parameter to the constructor.

>>> from skift import ColLblBasedFtClassifier

>>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])

>>> sk_clf = ColLblBasedFtClassifier(input_col_lbl='txt', epoch=8)

>>> sk_clf.fit(df[['txt']], df['lbl'])

>>> sk_clf.predict(pandas.DataFrame([['woof']], columns=['txt']))

[0]

The package author and current maintainer is Shay Palachy (shay.palachy@gmail.com). You are more than welcome to approach him for help. Contributions are very welcome.

Clone:

git clone git@github.com:shaypal5/skift.git

Install in development mode, including test dependencies:

cd skift

pip install -e '.[test]'

To also install fasttext, see the installation instructions above.

To run the tests use:

cd skift

pytest

The project is documented using the numpy docstring conventions, which were chosen because they are perhaps the most widespread conventions that are both supported by common tools such as Sphinx and result in human-readable docstrings. When documenting code you add to this project, follow these conventions.

Additionally, if you update this README.rst file, use python setup.py checkdocs to validate it compiles.

Created by Shay Palachy (shay.palachy@gmail.com).
