Programming Collective Intelligence: Study Notes, Part 1

Author: 老头打哈哈

Tags: python, Programming Collective Intelligence, hierarchical clustering

Article link: https://blog.csdn.net/zhangjm_yb/article/details/50590593

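Learning goals:

1. Build my own dataset from blog posts;

2. Use the Pearson correlation to measure how closely two items are related;

3. Group the blog posts crawled from Sina Blog into clusters;

4. Draw a dendrogram.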
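
Since goal 2 depends on the Pearson correlation, here is a minimal sketch of how it can be computed for two equal-length count vectors. The function names `pearson` and `pearson_distance` are my own choices for illustration, and treating zero variance as "uncorrelated" is an assumption, not something stated in these notes. The code should run on both Python 2.7 and Python 3.

```python
from math import sqrt

def pearson(v1, v2):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(v1)
    if n == 0 or n != len(v2):
        raise ValueError("vectors must be non-empty and of equal length")
    mean1 = sum(v1) / float(n)
    mean2 = sum(v2) / float(n)
    # Deviations from the mean; using them keeps the denominator non-negative.
    d1 = [x - mean1 for x in v1]
    d2 = [y - mean2 for y in v2]
    num = sum(a * b for a, b in zip(d1, d2))
    den = sqrt(sum(a * a for a in d1) * sum(b * b for b in d2))
    if den == 0:
        # Assumption: a vector with no variation is treated as uncorrelated.
        return 0.0
    return num / den

def pearson_distance(v1, v2):
    """Smaller means more similar: 0 for perfectly correlated vectors."""
    return 1.0 - pearson(v1, v2)

print(pearson([1, 2, 3], [2, 4, 6]))
print(pearson_distance([1, 2, 3], [3, 2, 1]))
```

Turning the correlation into `1 - r` gives a value that shrinks as two items become more alike, which is the form a clustering step usually expects.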

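I. Building the dataset from blog posts:

I use Sina Blog here, for example http://roll.finance.sina.com.cn/blog/blogarticle/cj-bkks/inde_1.shtml, where the number 1 in the URL is the page number. Following this pattern, a large number of blog posts can be crawled to fill out the dataset.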
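
The page-number pattern can be sketched as a small helper that generates the index-page URLs. The names `BASE` and `page_urls` are hypothetical, introduced only for this illustration; the URL template is the one quoted above.

```python
# URL template taken from the example above; %d is the page number.
BASE = 'http://roll.finance.sina.com.cn/blog/blogarticle/cj-bkks/inde_%d.shtml'

def page_urls(first_page, last_page):
    """Return the index-page URLs for pages first_page..last_page (inclusive)."""
    return [BASE % n for n in range(first_page, last_page + 1)]

for u in page_urls(1, 3):
    print(u)
```

Keeping URL generation separate from fetching makes it easy to change how many pages are crawled without touching the parsing code.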

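To fetch the posts I wrote a small crawler. Since crawling is not the focus of these notes, I give the code directly, with comments:

Note: my runtime environment is Python 2.7.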

# -*- coding: utf-8 -*-
import urllib
from bs4 import BeautifulSoup
import codecs
import jieba
from collections import Counter

# Collect the URLs of all blog posts listed on one index page; returned as a list
def get_all_urls(url):
    content = urllib.urlopen(url).read()
    soup = BeautifulSoup(content, 'lxml')   # parse the HTML with BeautifulSoup
    url_list = list()
    for item in soup.find_all('ul', class_='list_009'):
        for li in item.find_all('li'):
            url_list.append(li.a['href'])
    return url_list

# Given the URL of a blog post, return the text of the post
def get_content(url):
    text = urllib.urlopen(url).read()
    soup = BeautifulSoup(text, 'lxml')
    content = soup.find('div', class_='articalContent').get_text()
    return content


words_list = list()   # list of feature words
dd = dict()
for page_no in range(1, 20):
    page = 'http://roll.finance.sina.com.cn/blog/blogarticle/cj-bkks/inde_' + str(page_no) + '.shtml'
    url_list = get_all_urls(page)
    for i in range(len(url_list)):
        url = url_list[i]
        content = get_content(url).strip()
        # prefix the page number so posts from different pages get distinct file names
        filename = str(page_no) + '_' + str(i) + '.txt'
        file = c
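
The listing imports `jieba` and `Counter`, which suggests that each post is segmented into words and the words are counted. As a rough sketch of that idea only, and not the code from these notes, the block below counts tokens per document and builds a document-by-word matrix. The names `count_words` and `build_matrix`, the `min_docs` parameter, and the sample documents are all my own assumptions. The default tokenizer splits on whitespace so the sketch runs without `jieba`; for Chinese text one would pass `jieba.cut` as `tokenize` instead.

```python
# -*- coding: utf-8 -*-
from collections import Counter

def count_words(text, tokenize=None):
    """Count how often each token occurs in text.

    tokenize: any callable mapping a string to an iterable of tokens.
    Defaults to whitespace splitting (assumption, for illustration only).
    """
    if tokenize is None:
        tokenize = lambda s: s.split()
    tokens = (t.strip() for t in tokenize(text))
    return Counter(t for t in tokens if t)

def build_matrix(counts_by_doc, min_docs=1):
    """Turn {doc_name: Counter} into (doc_names, words, rows).

    rows[i][j] is how often words[j] occurs in doc_names[i]; only words
    appearing in at least min_docs documents are kept.
    """
    doc_freq = Counter()
    for counts in counts_by_doc.values():
        doc_freq.update(counts.keys())   # +1 per document containing the word
    words = sorted(w for w, df in doc_freq.items() if df >= min_docs)
    names = sorted(counts_by_doc)
    rows = [[counts_by_doc[name][w] for w in words] for name in names]
    return names, words, rows

# Made-up sample documents, not crawled data.
docs = {
    'a.txt': 'stock market stock fund',
    'b.txt': 'fund bond market',
}
counts = dict((name, count_words(text)) for name, text in docs.items())
names, words, rows = build_matrix(counts)
print(words)   # ['bond', 'fund', 'market', 'stock']
print(rows)    # [[0, 1, 1, 2], [1, 1, 1, 0]]
```

Each row of the matrix is the kind of count vector that a Pearson-based distance can compare.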
