缘起
我的好朋友的毕业论文需要爬取基金经理的新闻数量,并且统计新闻数量与基金的成交率的关系,我当然义不容辞啦。
任务描述:爬取三百位基金经理“百度新闻”中的搜索结果,并且将其分别按月和按季度统计新闻数量。
使用到的技术
BeautifulSoup、urllib、requests、Python 文件 I/O
Talk is cheap. Show me the code.
主函数:GCWspider_main.py
import url_manager,html_downloader,html_parser,html_output
import xlwt
import xlrd
import urllib
class SpiderMain(object):
    """Crawl Baidu News search-result pages for one fund manager and
    tally article counts into per-month and per-quarter buckets,
    then write both tallies into the given spreadsheet sheets."""

    def __init__(self):
        # Collaborators live in sibling modules of this project.
        # NOTE: HtmlDownoader is spelled that way in html_downloader —
        # keep the name to match that module's definition.
        self.urls = url_manager.UrlManager()
        self.downloader = html_downloader.HtmlDownoader()
        self.parser = html_parser.HtmlParser()
        self.output = html_output.HtmlOutputer()

    def craw(self, sheet1, sheet2, root_url, num, name):
        """Breadth-first crawl starting from root_url.

        sheet1, sheet2: xlwt worksheets that receive the monthly and
            quarterly counts respectively.
        root_url: first Baidu News result page for this manager.
        num: row index used when writing into the sheets.
        name: the fund manager's name (written alongside the counts).
        """
        count = 1
        # One zero-initialised counter per month and per quarter for the
        # years 2000-2015 (16 years * 12 months, 16 years * 4 quarters).
        # The parser fills these in as it reads each result page.
        result_monthly = [0] * ((2016 - 2000) * 12)
        result_quarterly = [0] * ((2016 - 2000) * 4)
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                print('crawling URL => %d ... : %s' % (count, new_url))
                html_cont = self.downloader.download(new_url)
                new_urls, result_monthly, result_quarterly = self.parser.parse(
                    new_url, html_cont, result_monthly, result_quarterly)
                self.urls.add_new_urls(new_urls)
                count += 1
            except Exception as e:
                # Best-effort crawl: report the failure and continue with
                # the next queued URL rather than aborting the whole run.
                print(e)
                print('crawling failure')  # fixed typo: was 'crawing failure'
        self.output.collect_data(sheet1, result_monthly, name, num)
        self.output.collect_data(sheet2, result_quarterly, name, num)
if __name__=="__main__":
wb = xl