Machine Learning: Discovering Groups

Latest recommended article published on 2022-11-24 14:59:45

lackep

Article link: https://blog.csdn.net/gesanghuakaisunshine/article/details/79395406

Clustering

Clustering is a form of unsupervised learning.

Goal: find the distinct groups within a dataset.

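Hierarchical clustering

The main idea is:

1. Find the two most similar nodes in the dataset.
2. Create a new cluster node from these two nodes; its data is the average of the two child nodes' data.
3. Remove the two child nodes from the dataset and add the new cluster node to it.
4. Go back to step 1 and repeat until only one node remains in the dataset.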
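
The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the author's implementation: the source does not name a similarity measure, so Euclidean distance is assumed here, and the names `hierarchical_cluster`, `leaves`, and the tuple layout of a node are hypothetical.

```python
import math

def euclidean(a, b):
    # Assumption: Euclidean distance; the source does not say which measure to use.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hierarchical_cluster(rows, distance=euclidean):
    # A node is a tuple (vector, left_child, right_child, row_index).
    # Leaves keep the index of their original row; merged nodes use None.
    nodes = [(list(row), None, None, i) for i, row in enumerate(rows)]
    while len(nodes) > 1:
        # Step 1: find the two most similar (closest) nodes.
        best = None
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                d = distance(nodes[i][0], nodes[j][0])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        left, right = nodes[i], nodes[j]
        # Step 2: the new node holds the average of its two children.
        merged = [(x + y) / 2.0 for x, y in zip(left[0], right[0])]
        # Step 3: remove the children (higher index first) and add the new node.
        del nodes[j]
        del nodes[i]
        nodes.append((merged, left, right, None))
    # Step 4 ends here: a single root node remains.
    return nodes[0]

def leaves(node):
    # Collect the original row indices found under a node.
    _, left, right, index = node
    if left is None:
        return [index]
    return leaves(left) + leaves(right)

rows = [[1.0, 1.0], [1.2, 0.8], [9.0, 9.0], [9.5, 9.1]]
root = hierarchical_cluster(rows)
print(leaves(root[1]), leaves(root[2]))
```

The nested loop over all pairs on every pass is what makes this approach expensive on large datasets.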

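K-means clustering

Hierarchical clustering has to compute the distance between every pair of items, and recompute distances each time a new cluster is formed, so it becomes computationally expensive on large datasets.
An alternative to hierarchical clustering is K-means clustering, whose main idea is:

1. Randomly create (or be given) k points to act as centers.
2. For every point in the dataset, find the nearest center and assign the point to that center.
3. Once every point has been assigned, move each center to the mean position of the points assigned to it.
4. Go back to step 2 and repeat until the assignments no longer change from the previous pass.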
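
As a rough sketch of these steps, the loop below implements them directly. It assumes Euclidean distance and draws random starting centers from the range of each column; both are illustrative choices, and the function name `kmeans` and its parameters (`centers`, `max_iter`, `seed`) are hypothetical rather than taken from the source.

```python
import math
import random

def euclidean(a, b):
    # Assumption: Euclidean distance; the source does not name a measure.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(rows, k, centers=None, max_iter=100, seed=None):
    # Step 1: use the given centers, or create k random ones
    # inside the range covered by each column of the data.
    if centers is None:
        rng = random.Random(seed)
        ranges = [(min(col), max(col)) for col in zip(*rows)]
        centers = [[rng.uniform(lo, hi) for lo, hi in ranges] for _ in range(k)]
    else:
        centers = [list(c) for c in centers]

    last = None
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center.
        assignment = [
            min(range(k), key=lambda c: euclidean(row, centers[c]))
            for row in rows
        ]
        # Step 4: stop once the assignment matches the previous pass.
        if assignment == last:
            break
        last = assignment
        # Step 3: move each center to the mean of its members.
        # A center with no members is left where it is.
        for c in range(k):
            members = [row for row, a in zip(rows, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / float(len(members)) for col in zip(*members)]
    return assignment, centers

rows = [[1.0, 1.0], [1.2, 0.8], [9.0, 9.0], [9.5, 9.1]]
assignment, centers = kmeans(rows, 2, centers=[[0.0, 0.0], [10.0, 10.0]])
print(assignment)
```

With random starting centers the result can differ between runs, and a center may end up with no points at all, so it is common to run the procedure several times.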

Displaying results: a dendrogram is used to show the result of the clustering.
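
A dendrogram can be approximated in plain text by indenting each level of the cluster tree. The sketch below assumes a hypothetical tree representation in which a leaf is a label string and a merged cluster is a `(left, right)` tuple; the blog labels are placeholders, not real data.

```python
def render_tree(node, depth=0):
    # Return the lines of an indented text view of the cluster tree.
    indent = '  ' * depth
    if isinstance(node, str):
        # A leaf: print its label.
        return [indent + node]
    left, right = node
    # A merged cluster: print a marker, then both branches one level deeper.
    lines = [indent + '-']
    lines += render_tree(left, depth + 1)
    lines += render_tree(right, depth + 1)
    return lines

# Hypothetical tree: 'blog A' and 'blog B' were merged first, then joined with 'blog C'.
tree = (('blog A', 'blog B'), 'blog C')
print('\n'.join(render_tree(tree)))
```

Items that were merged early appear deepest in the output, which mirrors how a graphical dendrogram places the most similar items closest together.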

import feedparser
import re

# URLs of feeds that could not be parsed into a feed with a title
error_list = []

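# Return the title of an RSS feed and a dict of word counts for that feed
def get_word_counts(url):
    # Parse the feed
    doc = feedparser.parse(url)

    # Word counts for this feed
    wc = {}

    # Walk every entry in the feed and count each word
    for entry in doc.entries:
        if 'summary' in entry:
            summary = entry.summary
        else:
            # Fall back to the description, or to an empty string if absent
            summary = entry.get('description', '')

        # Extract all words from the title and the summary
        words = get_words(entry.get('title', '') + ' ' + summary)
        # Count the occurrences of each word
        for word in words:
            wc.setdefault(word, 0)
            wc[word] += 1
    print(url)
    if hasattr(doc.feed, 'title'):
        return doc.feed.title, wc
    # No title means the feed could not be read; remember the URL
    error_list.append(url)
    return '', wc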

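# Split the text of an HTML fragment into words
def get_words(html):
    # Strip all HTML tags
    txt = re.compile(r'<[^>]+>').sub('', html)

    # Split on runs of non-letter characters
    words = re.compile(r'[^A-Za-z]+').split(txt)
    # Lowercase the words and drop empty strings
    return [word.lower() for word in words if word != '']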

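apcount = {}
word_counts = {}
# One feed URL per line; strip the trailing newline and skip blank lines
feed_list = [line.strip() for line in open('feedlist.txt') if line.strip()]
# Fetch every URL and count the words that appear in each blog
for feed_url in feed_list:
    title, wc = get_word_counts(feed_url)
    if title == '':
        continue
    # Make the title unique if another feed already uses it
    while title in word_counts:
        title += '1'
    print(title)
    word_counts[title] = wc
    # Count the number of blogs in which each word appears more than once
    for word, count in wc.items():
        apcount.setdefault(word, 0)
        if count > 1:
            apcount[word] += 1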

# Keep only words that appear in 10% to 50% of the blogs,
# which drops both very common and very rare words
word_list = []
for w, bc in apcount.items():
    frac = float(bc) / len(feed_list)
    if frac > 0.1 and frac < 0.5:
        word_list.append(w)
    
out = open('blogdata.txt', 'w')
# Write the header row
out.write('Blog')
for word in word_list:
    out.write('\t%s' % word)
out.write('\n')

# Write one row per blog
for blog, wc in word_counts.items():
    out.write(blog)

    for word in word_list:
        if word in wc:
            out.write('\t%d' % wc[word])
        else:
            out.write('\t0')
    out.write('\n')
out.close()

# Report the feeds that could not be read
print(error_list)
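
The tab-separated table written to blogdata.txt can be loaded back into row names, column names, and a numeric matrix before clustering. The helper below is a small sketch under that assumption; the name `read_table` and the sample lines are hypothetical and only mirror the layout produced by the script.

```python
def read_table(lines):
    # Accepts any iterable of text lines, such as an open file.
    lines = [line.rstrip('\n') for line in lines if line.strip()]
    # The first row is the header: 'Blog' followed by the words.
    col_names = lines[0].split('\t')[1:]
    row_names = []
    data = []
    for line in lines[1:]:
        fields = line.split('\t')
        # The first field is the blog title, the rest are word counts.
        row_names.append(fields[0])
        data.append([float(x) for x in fields[1:]])
    return row_names, col_names, data

# Made-up sample in the same layout as blogdata.txt
sample = ['Blog\tapple\tcode\n', 'Blog A\t3\t0\n', 'Blog B\t0\t5\n']
blog_names, words, data = read_table(sample)
print(blog_names, words, data)
```

To read the real file, the same function could be called as `read_table(open('blogdata.txt'))`.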

http://feeds.feedburner.com/37signals/beMH

http://feeds.feedburner.com/blogspot/bRuz

http
