Computing information gain in Python

I am currently using scikit-learn for text classification on the 20 newsgroups dataset. I want to calculate the information gain for a vectorized dataset. It has been suggested to me that this can be accomplished using mutual_info_classif from sklearn. However, this method is really slow, so I was trying to implement information gain myself, based on this post.

I came up with the following solution:

from scipy.stats import entropy
import numpy as np

def information_gain(X, y):

    def _entropy(labels):
        counts = np.bincount(labels)
        # scipy's entropy normalizes the counts; base=None means natural log
        return entropy(counts, base=None)

    def _ig(x, y):
        # indices of documents where the feature is set/not set
        x_set = np.nonzero(x)[1]
        x_not_set = np.delete(np.arange(x.shape[1]), x_set)
        h_x_set = _entropy(y[x_set])
        h_x_not_set = _entropy(y[x_not_set])
        # IG = H(y) minus the weighted conditional entropies
        return entropy_full - (((len(x_set) / f_size) * h_x_set)
                               + ((len(x_not_set) / f_size) * h_x_not_set))

    entropy_full = _entropy(y)
    f_size = float(X.shape[0])  # number of documents
    scores = np.array([_ig(x, y) for x in X.T])
    return scores
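Since speed was my motivation in the first place, here is a vectorized sketch of the same presence/absence information gain that avoids the Python-level loop over feature columns. The name information_gain_fast and the one-hot construction are my own, not from the code above; it assumes y is a non-negative integer label array (as returned by fetch_20newsgroups):

import numpy as np

def information_gain_fast(X, y):
    # Vectorized presence/absence information gain in nats,
    # intended to match information_gain() above.
    def _H(counts):
        # row-wise entropy of a (n_rows, n_classes) count matrix
        totals = counts.sum(axis=1, keepdims=True)
        p = np.divide(counts, totals,
                      out=np.zeros_like(counts, dtype=float),
                      where=totals > 0)
        with np.errstate(divide='ignore'):
            logp = np.where(p > 0, np.log(p), 0.0)
        return -(p * logp).sum(axis=1)

    B = (X > 0).astype(np.float64)             # binary presence indicators
    Y = np.zeros((X.shape[0], y.max() + 1))
    Y[np.arange(X.shape[0]), y] = 1.0          # one-hot class matrix
    set_counts = np.asarray(B.T @ Y)           # per-feature class counts where present
    class_totals = Y.sum(axis=0)
    unset_counts = class_totals - set_counts   # class counts where feature absent
    p_set = set_counts.sum(axis=1) / float(X.shape[0])
    h_full = _H(class_totals[None, :])[0]
    return h_full - p_set * _H(set_counts) - (1 - p_set) * _H(unset_counts)

The scores should match information_gain() up to floating-point noise, since both use natural-log entropies over the same binary split.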

Using a very small dataset, most scores from sklearn and my implementation are equal. However, sklearn seems to take term frequencies into account, whereas my algorithm clearly doesn't. For example:

from time import time

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

categories = ['talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=categories)
X, y = newsgroups_train.data, newsgroups_train.target

cv = CountVectorizer(max_df=0.95, min_df=2,
                     max_features=100,
                     stop_words='english')
X_vec = cv.fit_transform(X)

t0 = time()
res_sk = mutual_info_classif(X_vec, y, discrete_features=True)
print("Time passed for sklearn method: %.3f" % (time() - t0))

t0 = time()
res_ig = information_gain(X_vec, y)
print("Time passed for ig: %.3f" % (time() - t0))

for name, mi, ig in zip(cv.get_feature_names(), res_sk, res_ig):
    print("%s: mi=%f, ig=%f" % (name, mi, ig))

Sample output:

center: mi=0.011824, ig=0.003548
christian: mi=0.128629, ig=0.127122
color: mi=0.028413, ig=0.026397
com: mi=0.041184, ig=0.030458
computer: mi=0.020590, ig=0.012327
cs: mi=0.007291, ig=0.001574
data: mi=0.020734, ig=0.008986
did: mi=0.035613, ig=0.024604
different: mi=0.011432, ig=0.005492
distribution: mi=0.007175, ig=0.004675
does: mi=0.019564, ig=0.006162
don: mi=0.024000, ig=0.017605
earth: mi=0.039409, ig=0.032981
edu: mi=0.023659, ig=0.008442
file: mi=0.048056, ig=0.045746
files: mi=0.041367, ig=0.037860
ftp: mi=0.031302, ig=0.026949
gif: mi=0.028128, ig=0.023744
god: mi=0.122525, ig=0.113637
good: mi=0.016181, ig=0.008511
gov: mi=0.053547, ig=0.048207

So I was wondering whether my implementation is wrong, or whether it is correct but scikit-learn uses a different variant of the mutual information algorithm.
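One way I can think of to probe the difference (an assumption on my part, not something I have confirmed against the scikit-learn source): if the gap really comes from term frequencies, binarizing the counts before calling mutual_info_classif should make both methods estimate the same presence/absence quantity, in nats:

# Hypothetical check: collapse counts to presence/absence so that
# mutual_info_classif sees the same binary feature as information_gain
X_bin = (X_vec > 0).astype(int)
res_sk_bin = mutual_info_classif(X_bin, y, discrete_features=True)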

Solution

A little late with my answer, but you should look at Orange's implementation. Within their application it is used as a behind-the-scenes processor that helps inform the dynamic model parameter building process.

The implementation itself looks fairly straightforward and could most likely be ported out. The entropy calculation comes first:

def _entropy(dist):
    """Entropy of class-distribution matrix"""
    p = dist / np.sum(dist, axis=0)
    pc = np.clip(p, 1e-15, 1)
    return np.sum(np.sum(- p * np.log2(pc), axis=0) *
                  np.sum(dist, axis=0) / np.sum(dist))


class GainRatio(ClassificationScorer):
    """
    Information gain ratio is the ratio between information gain and
    the entropy of the feature's value distribution. The score was
    introduced in [Quinlan1986]_ to alleviate overestimation for
    multi-valued features. See the Wikipedia entry on gain ratio.

    .. [Quinlan1986] J R Quinlan: Induction of Decision Trees,
       Machine Learning, 1986.
    """
    def from_contingency(self, cont, nan_adjustment):
        h_class = _entropy(np.sum(cont, axis=1))
        h_residual = _entropy(np.compress(np.sum(cont, axis=0),
                                          cont, axis=1))
        h_attribute = _entropy(np.sum(cont, axis=0))
        if h_attribute == 0:
            h_attribute = 1
        return nan_adjustment * (h_class - h_residual) / h_attribute
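A minimal sketch of how the ported pieces might be applied to the vectorized data from the question. The helper names gain_ratio_from_contingency and binary_contingency are mine, the binary presence/absence encoding is an assumption, and nan_adjustment defaults to 1 since CountVectorizer output has no missing values:

import numpy as np

def gain_ratio_from_contingency(cont, nan_adjustment=1.0):
    # Standalone version of GainRatio.from_contingency, reusing the
    # _entropy function above; cont is a (n_classes, n_values) count matrix.
    h_class = _entropy(np.sum(cont, axis=1))
    h_residual = _entropy(np.compress(np.sum(cont, axis=0), cont, axis=1))
    h_attribute = _entropy(np.sum(cont, axis=0))
    if h_attribute == 0:
        h_attribute = 1
    return nan_adjustment * (h_class - h_residual) / h_attribute

def binary_contingency(X, y, j):
    # (n_classes, 2) contingency for feature j: column 0 counts documents
    # per class where the feature is absent, column 1 where it is present.
    present = np.asarray((X[:, j] > 0).todense()).ravel()
    n_classes = y.max() + 1
    cont = np.zeros((n_classes, 2))
    for c in range(n_classes):
        cont[c, 0] = np.sum((y == c) & ~present)
        cont[c, 1] = np.sum((y == c) & present)
    return cont

scores = np.array([gain_ratio_from_contingency(binary_contingency(X_vec, y, j))
                   for j in range(X_vec.shape[1])])

Note that Orange's _entropy uses log2, so these scores are in bits rather than the nats returned by mutual_info_classif.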
