The Road to Mathematics (3) - Machine Learning (3) - Machine Learning Algorithms - Bayes' Theorem (2)

麦好

Article link: https://blog.csdn.net/myhaspl/article/details/11709021

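We use naive Bayes to classify text. We write a web crawler that searches several categories of news on a news site, extracts the terms from each article, and builds per-category term-probability data. The news categories and the links the crawler visits are as follows: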

Automotive http://finance.chinanews.com/auto/gd.shtml
Finance http://finance.chinanews.com/cj/gd.shtml
Health http://www.chinanews.com/jiankang.shtml
Education http://www.chinanews.com/jiaoyu.shtml
Military http://www.chinanews.com/mil/news.shtml
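
To make "term-probability data" concrete, here is a minimal sketch of how per-category term counts can drive a naive Bayes decision. It is an illustration under stated assumptions, not the code from this series: it is written for Python 3, the function names `train` and `classify` are hypothetical, the documents are hand-made token lists rather than crawled news, and add-one (Laplace) smoothing with log probabilities is our choice for handling unseen terms.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """Build priors and per-class term counts from (label, tokens) pairs."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in docs:
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    n_docs = sum(class_docs.values())
    priors = {c: n / n_docs for c, n in class_docs.items()}
    totals = {c: sum(term_counts[c].values()) for c in class_docs}
    return priors, term_counts, totals, vocab


def classify(tokens, model):
    """Return the label maximizing log P(c) + sum of log P(term | c)."""
    priors, term_counts, totals, vocab = model
    best_label, best_score = None, float("-inf")
    for label, prior in priors.items():
        score = math.log(prior)
        for t in tokens:
            # Add-one smoothing (an assumption of this sketch) keeps
            # unseen terms from producing a zero probability.
            p = (term_counts[label][t] + 1) / (totals[label] + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, hand-made token lists (not crawled data).
docs = [
    ("auto", ["engine", "sedan", "price"]),
    ("auto", ["sedan", "dealer"]),
    ("finance", ["bank", "stock", "price"]),
    ("finance", ["stock", "fund"]),
]
model = train(docs)
print(classify(["sedan", "engine"], model))
print(classify(["stock", "bank"], model))
```

Working in log space avoids numeric underflow when an article contains many terms, which is why the sketch sums logarithms instead of multiplying probabilities.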

We download the HTML parsing library Beautiful Soup.

The latest version of Beautiful Soup is available here:

http://www.crummy.com/software/BeautifulSoup/bs4/download/

Documentation:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

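First, we create a category list file (ClassList.txt in the code that follows) with the contents shown here; the labels are Chinese for automotive, finance, IT, sports, and military: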

汽车 http://finance.chinanews.com/auto/gd.shtml
财经 http://finance.chinanews.com/cj/gd.shtml
IT http://finance.chinanews.com/it/gd.shtml
体育 http://www.chinanews.com/sports.shtml
军事 http://www.chinanews.com/mil/news.shtml

Then we read the category list file to build the sample category data:

#!/usr/bin/env python
#-*- coding: utf-8 -*-
#code:myhaspl@qq.com
#http://blog.csdn.net/myhaspl
#Bayes text classification
#This program is for machine learning research only
#It crawls news the way a search engine does: it analyzes links,
#fetches the news pages directly, and computes term probabilities

import numpy as np
import jieba
import urllib2
from bs4 import BeautifulSoup
import re

#Read the list of news categories and their index-page URLs
txt_class=[]
myclassfl = open('ClassList.txt')
try:
    myclass_str = myclassfl.read()
    myclass_str=unicode(myclass_str,'utf-8')
    #Tokens alternate: category name, URL, category name, URL, ...
    myclass_text=myclass_str.split()
    for ii in xrange(0,len(myclass_text),2):
        print ".",
        txt_class.append((myclass_text[ii],myclass_text[ii+1]))
finally:
    myclassfl.close()
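
For readers on Python 3, the same parsing step can be sketched as below. This is an assumption-laden illustration rather than part of the original program: the helper name `read_class_list` is hypothetical, and the sample is an in-memory string holding two lines in the same format as the category list file, so the snippet runs without any file on disk.

```python
import io


def read_class_list(fileobj):
    """Return [(category, url), ...] from whitespace-separated text."""
    tokens = fileobj.read().split()
    # The format alternates category name and URL, so the token
    # count must be even; fail loudly on a malformed file.
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating category/URL tokens")
    return list(zip(tokens[0::2], tokens[1::2]))


# In-memory stand-in for the category list file.
sample = io.StringIO(
    "汽车 http://finance.chinanews.com/auto/gd.shtml\n"
    "财经 http://finance.chinanews.com/cj/gd.shtml\n"
)
txt_class = read_class_list(sample)
print(txt_class)
```

With a real file, the call would look like `with open('ClassList.txt', encoding='utf-8') as f: txt_class = read_class_list(f)`, which also closes the file without an explicit try/finally.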

Next, we extract the article links from each index page. For example, the following code searches the military news page:

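#Crawl the military news index page
#Extract the article links
links=[]
#Article URLs end in /<digits>.shtml
pattern = re.compile(r'(.*?)/\d+\.shtml')
purl='http://www.chinanews.com/mil/news.shtml'
page=urllib2.urlopen(purl)
soup = BeautifulSoup(page,from_encoding="gb18030")
for link in soup.find_all('a'):
    mylink=link.get('href')
    #Skip anchors without an href; pattern.match(None) would raise TypeError
    if mylink is None:
        continue
    match = pattern.match(mylink)
    if match:
        links.append(mylink)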
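
The link filter can be tried offline with the sketch below. It is not the original crawler: it targets Python 3, swaps in the standard library's `html.parser` for Beautiful Soup so that nothing needs to be installed, and runs on a hand-written HTML fragment whose article path is made up. The class name `LinkCollector` is hypothetical, and resolving relative links with `urljoin` is an addition of ours, on the assumption that an index page may use site-relative hrefs.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Same pattern as the crawler: article URLs end in /<digits>.shtml
ARTICLE_PATTERN = re.compile(r'(.*?)/\d+\.shtml')


class LinkCollector(HTMLParser):
    """Collect hrefs of <a> tags that look like article pages."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Anchors without an href yield None and are skipped.
        if href and ARTICLE_PATTERN.match(href):
            self.links.append(urljoin(self.base_url, href))


# Hand-written fragment; the article path is invented for illustration.
html = '''
<a href="/mil/2013/09-15/5287000.shtml">article</a>
<a href="/mil/news.shtml">index</a>
<a name="top">no href</a>
'''
collector = LinkCollector("http://www.chinanews.com/mil/news.shtml")
collector.feed(html)
print(collector.links)
```

Only the first anchor survives the filter: the index page has no numeric file name, and the third anchor has no href at all.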

Next, we extract the body text from each news link:

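#Extract the body text
ybtxt=[]
print u"\nExtracting body text"
for mypage in links:
    my_page=urllib2.urlopen(mypage)
    my_soup = BeautifulSoup(my_page,from_encoding="gbk")
    print ".",
 ...............
    #my_txt and my_fs are defined in the omitted code;
    #the body runs from 8 characters past the first my_fs
    #up to the next occurrence of my_fs
    zw_start=my_txt.find(my_fs)+8
    last_txt=my_txt[zw_start:len(my_txt)]
    zw_end=last_txt.find(my_fs)
    page_content=my_txt[zw_start:zw_start+zw_end]
.............................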
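
The slicing logic at the end of that loop amounts to "take the text between two occurrences of a marker". A minimal Python 3 sketch of the idea follows. The function name `extract_between`, the marker string, and the sample text are all hypothetical, and the sketch skips `len(marker)` characters instead of the fixed offset of 8 used in the listing, since the actual marker is in the omitted code.

```python
def extract_between(text, marker):
    """Return the text between the first two occurrences of marker.

    Returns an empty string when the marker appears fewer than twice,
    instead of silently slicing from a -1 index.
    """
    start = text.find(marker)
    if start == -1:
        return ""
    start += len(marker)
    end = text.find(marker, start)
    if end == -1:
        return ""
    return text[start:end]


# Made-up page text and marker, for illustration only.
page_text = "header <!--body-->First paragraph. Second paragraph.<!--body--> footer"
print(extract_between(page_text, "<!--body-->"))
```

Checking the result of `find` matters here: `str.find` returns -1 when the marker is missing, and slicing with an offset derived from -1 would return unrelated text rather than raising an error.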

All content on this blog is original. If you repost it, please credit the source:

http://blog.csdn.net/myhaspl/
