[MATLAB] Simple TF-IDF implementation

TF-IDF (term frequency-inverse document frequency) is one of the most widely used term weighting schemes for normalizing document-term matrices in text mining and information retrieval.

See Wikipedia for details.
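
For reference, one common form of the weighting can be written as follows. This is only a sketch of the usual definition: the symbols (c for raw counts, d for a document, t for a term, N for the number of documents) are notation introduced here, not taken from the original post, and other variants (smoothed or differently scaled) exist.

```latex
\mathrm{tf}(d,t) = \frac{c_{d,t}}{\sum_{t'} c_{d,t'}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{\left|\{\, d : c_{d,t} > 0 \,\}\right|}, \qquad
\mathrm{tfidf}(d,t) = \mathrm{tf}(d,t)\cdot\mathrm{idf}(t)
```

A term that occurs in every document gets idf = log(N/N) = 0, so its weight vanishes in all documents.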

 

tfidf

 

function Y = tfidf( X )
% FUNCTION computes TF-IDF weighted word histograms.
%
%   Y = tfidf( X );
%
% INPUT :
%   X        - document-term count matrix (documents in rows, terms in columns)
%
% OUTPUT :
%   Y        - TF-IDF weighted document-term matrix (same size as X)
%

% get term frequencies (each document row normalized by its total count)
X = tf(X);

% get inverse document frequencies (one weight per term)
I = idf(X);

% apply the idf weight of each term to its column
for j=1:size(X, 2)
    X(:, j) = X(:, j)*I(j);
end

Y = X;


function X = tf(X)
% SUBFUNCTION computes term frequencies within each document

% for every document (row)
for i=1:size(X, 1)
    
    % get the counts of all terms in document i
    x = X(i, :);
    
    % sum all term occurrences in document i
    sumX = sum( x );
    
    % compute the relative frequency of each term in document i
    if sumX ~= 0
        X(i, :) = x / sumX;
    else
        % avoiding NaNs : an empty document keeps all-zero frequencies
        X(i, :) = 0;
    end
    
end


function I = idf(X)
% SUBFUNCTION computes inverse document frequencies

% m - number of documents
% n - number of terms or words
[m, n]=size(X);

% allocate space for the idf of every term
I = zeros(n, 1);

% for every term (column)
for j=1:n
    
    % count the documents in which term j occurs
    nz = nnz( X(:, j) );
    
    % if the term occurs at all, assign a weight:
    if nz
        I(j) = log( m / nz );
    end
    
end
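
To make the weighting concrete, here is a small self-contained sketch that computes the same quantities without loops on a toy matrix. It assumes documents are stored in rows and terms in columns; the matrix values and the variable names (rowSums, TF, df, IDF) are made up for illustration and are not part of the original post.

```matlab
% Toy count matrix: 3 documents (rows) x 4 terms (columns).
X = [2 1 0 0;
     0 1 1 0;
     1 1 0 2];

% Term frequencies: divide every row by its total count.
% Empty documents would give 0/0, so their divisor is set to 1.
rowSums = sum(X, 2);
rowSums(rowSums == 0) = 1;
TF = bsxfun(@rdivide, X, rowSums);

% Inverse document frequencies: log(#documents / #documents containing the term).
% Terms that never occur keep a weight of 0.
nDocs = size(X, 1);
df = sum(X > 0, 1);
IDF = zeros(1, size(X, 2));
IDF(df > 0) = log(nDocs ./ df(df > 0));

% Scale every term column by its idf weight.
Y = bsxfun(@times, TF, IDF);
disp(Y);
```

The second term occurs in all three documents, so its idf is log(3/3) = 0 and its whole column in Y is zero. bsxfun is used instead of implicit expansion so that the sketch also runs on older MATLAB releases.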