The TF-IDF formula:
TFIDF(t, d, D) = TF(t, d) * IDF(t, D), where TF(t, d) is the frequency of term t in document d, and IDF(t, D) is obtained by dividing the total number of documents by the number of documents containing t, then taking the logarithm of the quotient.
The main idea behind inverse document frequency (IDF) is: the fewer documents contain term t, the larger its IDF, and the better the term discriminates between categories.
A high term frequency within a given document, combined with a low document frequency across the whole collection, yields a high TF-IDF weight. TF-IDF therefore tends to filter out common terms and keep the important ones.
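To make the formula concrete, here is a minimal Python sketch (function names are illustrative) that computes TF-IDF the same way the SQL below does, i.e. with the smoothed variant log10(D / (DF + 1)):

```python
import math

def idf(total_docs, doc_freq):
    # Smoothed IDF, matching the SQL below: log10(D / (DF + 1))
    return math.log10(total_docs / (doc_freq + 1))

def tfidf(tf, total_docs, doc_freq):
    # TF-IDF = term frequency * inverse document frequency
    return tf * idf(total_docs, doc_freq)

# Example: a term appearing 12 times in one document,
# present in 3 out of 1000 documents
print(round(tfidf(12, 1000, 3), 2))  # 12 * log10(1000/4) = 28.78
```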
The formula is simple. But what if you have neither a Python nor a Spark environment and still need TF-IDF scores for your samples? You can try implementing it in Hive SQL:
Example:
Suppose we have a Hive table user_tags recording how many times each user triggered each kind of event. Here the event count plays the role of the term frequency and the event plays the role of the term. The table's columns are roughly:
- month: string, the month in yyyyMM format
- userid: string, the user ID
- tags: map type, event counts such as a1:12,a2:23, where a1 is an event and 12 is the number of times it occurred
Each (month, userid) row is treated as one "document".
(1) Compute D, the total number of documents (users) per month:
select count(1) as D,month from user_tags group by month;
(2) Compute DF, the number of users per month whose tags contain a given tag:
select month,tag,count(1) as DF from (select month,tag from user_tags lateral view explode(tags) tb_tags as tag,tag_num) a group by month,tag;
(3) Join the two results to compute each tag's IDF; note the +1 added to DF in the denominator, a common smoothing variant:
select i1.month,i1.tag,log10((i2.D)/(i1.DF+1)) as IDF from
(select month,tag,count(1) as DF from (select month,tag from user_tags lateral view explode(tags) tb_tags as tag,tag_num) a group by month,tag) i1
left join (select count(1) as D,month from user_tags group by month) i2 on i1.month = i2.month;
(4) Compute the TF: explode the tags map and use the event count directly as the term frequency (no normalization by the user's total event count is applied here):
select month,userid,tag,cast(tag_num as double) as TF from user_tags lateral view explode(tags) tb_tags as tag,tag_num;
(5) Put the steps above together to compute TFIDF(t,d,D): join the TF table with the IDF table and aggregate the results per user:
create table user_tags_tfidf as
select month,userid,concat_ws(',',collect_set(concat(tag,':',TFIDF))) as tfidfs
from (
  select a.month,a.userid,a.tag,round(a.TF*b.IDF,2) as TFIDF
  from (select month,userid,tag,cast(tag_num as double) as TF
        from user_tags lateral view explode(tags) tb_tags as tag,tag_num) a
  left join (
    select i1.month,i1.tag,log10((i2.D)/(i1.DF+1)) as IDF
    from (select month,tag,count(1) as DF
          from (select month,tag from user_tags lateral view explode(tags) tb_tags as tag,tag_num) a
          group by month,tag) i1
    left join (select count(1) as D,month from user_tags group by month) i2
      on i1.month = i2.month
  ) b on a.month = b.month and a.tag = b.tag
) c
group by month,userid;
Conclusion: for a formula this simple, plain SQL is entirely up to the job.