[Python Study Notes] Automatically Scraping Yahoo News Content

Latest recommended article published on 2018-12-27 09:05:37

Sidney_VonWunderland

Category: [Study Notes] Python series. Tags: python, news scraping, Yahoo

Article link: https://blog.csdn.net/Sindy_Jen/article/details/44221599

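The script searches Yahoo News (http://news.yahoo.com/), filters out stories whose source is Yahoo News itself, and extracts the article body from each page's HTML source by computing text-block density and taking the longest block as the body. It then cleans the text, removing HTML tags, useless fields and other junk, and saves the result as a txt file. Finally, it discards invalid, too-short or otherwise low-quality articles.

One known problem: as soon as any HTTP error occurs, the program terminates, which hurts efficiency badly.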
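One common way to keep a single failed request from stopping the whole run is to catch the error for each URL and move on. The sketch below is only an illustration and is not part of the original script: it is written for Python 3 (`urllib.request`) rather than the Python 2 `urllib2` used in the listing, and the names `fetch_html` and `fetch_all` are hypothetical.

```python
import urllib.error
import urllib.request


def fetch_html(url, timeout=10):
    """Return the raw page bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except (urllib.error.URLError, ValueError, OSError) as exc:
        # HTTPError is a subclass of URLError, so 4xx/5xx responses land here.
        # ValueError covers malformed URLs; OSError covers socket timeouts.
        print("skipping %s: %s" % (url, exc))
        return None


def fetch_all(urls, fetch=fetch_html):
    """Fetch every URL, keeping only the pages that downloaded successfully."""
    pages = {}
    for url in urls:
        html = fetch(url)
        if html is not None:
            pages[url] = html
    return pages
```

With this shape, a bad link costs one log line instead of the whole crawl. In Python 2 the same pattern would catch `urllib2.URLError` around `urllib2.urlopen`.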

#coding:utf-8
import re
import urllib2
import chardet
from BeautifulSoup import BeautifulSoup

# Extract the page's body text and write it to a txt file
def remove_js_css (content):
    """Remove JavaScript, stylesheets, comments, <meta> tags and <ins> blocks (<script>...</script>, <style>...</style>, <!-- xxx -->)."""
    r = re.compile(r'''<script.*?</script>''',re.I|re.M|re.S)
    s = r.sub ('',content)
    r = re.compile(r'''<style.*?</style>''',re.I|re.M|re.S)
    s = r.sub ('', s)
    r = re.compile(r'''<!--.*?-->''', re.I|re.M|re.S)
    s = r.sub('',s)
    r = re.compile(r'''<meta.*?>''', re.I|re.M|re.S)
    s = r.sub('',s)
    r = re.compile(r'''<ins.*?</ins>''', re.I|re.M|re.S)
    s = r.sub('',s)
    return s

def remove_empty_line (content):
    """Remove whitespace-only lines and collapse consecutive newlines."""
    r = re.compile(r'''^\s+$''', re.M|re.S)
    s = r.sub ('', content)
    r = re.compile(r'''\n+''',re.M|re.S)
    s = r.sub('\n',s)
    return s

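def remove_any_tag (s):
    # The source listing is truncated on this line; the substitution below
    # (delete every remaining tag) and the return are an assumed completion.
    s = re.sub(r'''<[^>]+>''', '', s)
    return s.strip()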
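The listing breaks off here, so the body-selection step described in the introduction is not shown. As a rough illustration of the "take the longest text block" idea, here is a minimal sketch. It is a simplified stand-in, not the author's density computation: it splits the page on a few container tags and keeps the chunk with the most tag-free text. The function name `longest_text_block`, the tag list and the `min_length` parameter are all assumptions made for this example, and the code targets Python 3.

```python
import re

# Scripts, styles and comments never contain article text.
_NOISE = re.compile(r"<(script|style)\b.*?</\1>|<!--.*?-->", re.I | re.S)
# Container tags used as block boundaries (an assumed, deliberately short list).
_BLOCK_BREAK = re.compile(r"</?(?:div|td|section|article)\b[^>]*>", re.I)
_ANY_TAG = re.compile(r"<[^>]+>")


def longest_text_block(html, min_length=0):
    """Return the longest run of tag-free text between container tags.

    Returns "" if no block has at least min_length characters.
    """
    html = _NOISE.sub("", html)
    best = ""
    for chunk in _BLOCK_BREAK.split(html):
        # Replace inline tags with spaces, then normalize whitespace.
        text = " ".join(_ANY_TAG.sub(" ", chunk).split())
        if len(text) > len(best):
            best = text
    return best if len(best) >= min_length else ""
```

Passing a `min_length` threshold gives a crude version of the "discard too-short articles" filter: a page whose longest block falls below it yields an empty string and can be skipped.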
