Python 3 crawler (1): crawling Baidu Baike pages

Latest recommended article published on 2019-11-03 15:12:00

potato_big

Category: python crawler. Tags: python, crawler

Article link: https://blog.csdn.net/Pual_wang/article/details/52746112

I am a beginner, and I plan to record my learning progress as a series of blog posts. I am using Python 3.5. Let's learn from each other and discuss.

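Crawling Baidu Baike is often the first lesson in writing a Python crawler. There is a video tutorial for it on imooc (see here); this post only records my own learning process. If your fundamentals are shaky, I suggest going through Liao Xuefeng's Python course first (link). My own fundamentals are not solid either, and I looked things up more than once while writing this code. Practicing and writing code yourself is what matters.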

Back to the point: let's start crawling Baidu Baike pages.

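The goal of this small crawler is to fetch an arbitrary set of Baidu Baike pages related to Python; here I crawl 100 pages. The title and summary of each of these 100 pages are then presented as an HTML table. The program uses only Python's built-in urllib module and the third-party BeautifulSoup library.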

The program is split into the following files: spider_main (scheduler), url_manager (URL management), html_downloder (downloader), html_parser (parser), and html_outputer (output management).

The scheduler code is as follows:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import html_parser,url_manager,html_downloder,html_outputer

class SpiderMain(object):
    def __init__(self):
        self.url_manager=url_manager.UrlManager()
        self.parser=html_parser.HtmlParser()
        self.outputer=html_outputer.HtmlOutputer()
        self.downloder=html_downloder.HtmlDownloder()

    def craw(self, root_url):
        summ=1
        # Add root_url to the url_manager
        self.url_manager.add_new_url(root_url)
        # Check whether any new URLs are waiting to be crawled
        while self.url_manager.has_new_url():
            try:
                new_url=self.url_manager.get_new_url()
                print('craw %d:%s'%(summ,new_url))
                html_cont=self.downloder.downlod(new_url)
                new_urls,new_data=self.parser.parser(new_url,html_cont)
                self.url_manager.add_new_urls(new_urls)
                self.outputer.collect_data(new_data)

                if summ==100:
                    break
                summ=summ+1
            except Exception as e:
                # Report the failure and continue with the next URL
                print('craw failed:', e)
        self.outputer.output_html()


if __name__=='__main__':
    root_url='http://baike.baidu.com/view/21087.htm'
    # Create a SpiderMain instance
    obj_spider=SpiderMain()
    obj_spider.craw(root_url)
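
To look at the scheduling loop in isolation, here is a minimal sketch that runs the same add/pop/parse cycle against an in-memory link graph instead of the network. The `FAKE_SITE` dictionary and the `crawl` function are hypothetical stand-ins, not part of the original program.

```python
# Hypothetical in-memory "site": each URL maps to the links found on that page.
FAKE_SITE = {
    '/view/1.htm': ['/view/2.htm', '/view/3.htm'],
    '/view/2.htm': ['/view/1.htm', '/view/3.htm'],
    '/view/3.htm': ['/view/4.htm'],
    '/view/4.htm': [],
}


def crawl(root_url, max_pages):
    """Visit pages reachable from root_url, stopping after max_pages."""
    new_urls = {root_url}   # URLs waiting to be crawled
    old_urls = set()        # URLs already crawled
    visited = []            # crawl order, kept for inspection

    while new_urls and len(visited) < max_pages:
        url = new_urls.pop()
        old_urls.add(url)
        visited.append(url)
        # "Parse" the page: look up its outgoing links
        for link in FAKE_SITE.get(url, []):
            if link not in new_urls and link not in old_urls:
                new_urls.add(link)
    return visited


if __name__ == '__main__':
    pages = crawl('/view/1.htm', max_pages=3)
    print(len(pages), 'pages crawled, starting from', pages[0])
```

The page limit plays the same role as the `summ==100` check in the scheduler: without it, the loop only ends when no uncrawled URLs remain.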

The URL manager code is as follows:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
class UrlManager(object):

    def __init__(self):
        # Use set() to manage the URL queue; a set never holds duplicate elements
        self.new_urls=set()
        self.old_urls=set()

    def add_new_url(self, url):
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        newurl=self.new_urls.pop()
        self.old_urls.add(newurl)
        return newurl
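
The deduplication behaviour can be checked with a short usage sketch. The class below is a condensed copy of UrlManager so that the snippet runs on its own, and the example URL is made up.

```python
class UrlManager(object):
    """Condensed copy of the URL manager above, for a standalone demo."""

    def __init__(self):
        self.new_urls = set()
        self.old_urls = set()

    def add_new_url(self, url):
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url


manager = UrlManager()
manager.add_new_url('http://example.com/view/1.htm')
manager.add_new_url('http://example.com/view/1.htm')  # duplicate, ignored
print(len(manager.new_urls))    # prints 1

url = manager.get_new_url()
manager.add_new_url(url)        # already crawled, ignored
print(manager.has_new_url())    # prints False
```

Because `set.pop()` removes an arbitrary element, the order in which pages are crawled is not deterministic and may differ between runs.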

The downloader is as follows:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import urllib.request

class HtmlDownloder(object):
    def downlod(self, new_url):
        if new_url is None:
            return None
        response = urllib.request.urlopen(new_url)
        if response.getcode()!=200:
            return None
        return response.read()
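
A request can hang or be rejected, so a slightly more defensive variant may be useful. The sketch below is an alternative under stated assumptions, not the original downloader: the function name `download`, the 10-second timeout, and the User-Agent string are all illustrative choices.

```python
import urllib.error
import urllib.request


def download(url, timeout=10):
    """Return the response body as bytes, or None if the download fails."""
    if url is None:
        return None
    try:
        # Illustrative User-Agent; some sites may reject urllib's default one
        request = urllib.request.Request(
            url, headers={'User-Agent': 'Mozilla/5.0 (tutorial crawler)'})
        with urllib.request.urlopen(request, timeout=timeout) as response:
            if response.getcode() != 200:
                return None
            return response.read()
    except (urllib.error.URLError, ValueError, OSError) as e:
        # URLError covers HTTP and network errors, ValueError malformed URLs,
        # OSError low-level socket problems such as timeouts
        print('download failed:', e)
        return None
```

Returning None on failure keeps the same contract as `downlod` above, so the caller can treat a failed download as "nothing to parse".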

The parser is here:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import re
import urllib.parse

class HtmlParser(object):
    def parser(self, new_url, html_cont):
        if new_url is None or html_cont is None:
            # Return an empty pair so the caller's tuple unpacking still works
            return set(), None
        soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')
        new_urls = self._get_new_urls(new_url, soup)
        new_datas=self._get_new_datas(new_url, soup)
        return new_urls, new_datas

    # Collect the set of new URLs to crawl
    def _get_new_urls(self, new_url, soup):
        new_urls=set()
        links=soup.find_all('a',href=re.compile(r'/view/\d+\.htm'))
        for link in links:
            a_new_url=link['href']
            new_full_url=urllib.parse.urljoin(new_url,a_new_url)
            new_urls.add(new_full_url)
        return new_urls

    # Extract the title, summary, and related data
    def _get_new_datas(self, new_url, soup):
        res_data={}

        # Store the URL
        res_data['url']=new_url
        # Store the title: <dd class="lemmaWgt-lemmaTitle-title"> <h1>Python</h1>
        title_node=soup.find('dd',class_='lemmaWgt-lemmaTitle-title').find('h1')
        res_data['title']=title_node.get_text()
        # Store the summary: <div class="lemma-summary" label-module="lemmaSummary">
        summary_node=soup.find('div',class_='lemma-summary')
        res_data['summary']=summary_node.get_text()

        return res_data
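
The link filter and URL resolution used in `_get_new_urls` can be tried without any network access or HTML parsing. This sketch applies the same regular expression and the same `urllib.parse.urljoin` call to a hand-written list of href values; the hrefs and the helper name `resolve_entry_links` are made up for illustration.

```python
import re
import urllib.parse

# Same pattern as in the parser: entry pages look like /view/<digits>.htm
ENTRY_LINK = re.compile(r'/view/\d+\.htm')


def resolve_entry_links(page_url, hrefs):
    """Keep hrefs matching the entry pattern and make them absolute."""
    full_urls = set()
    for href in hrefs:
        if ENTRY_LINK.search(href):
            full_urls.add(urllib.parse.urljoin(page_url, href))
    return full_urls


page_url = 'http://baike.baidu.com/view/21087.htm'
hrefs = ['/view/10812319.htm', '/help/index.html', '/view/10812319.htm']
print(resolve_entry_links(page_url, hrefs))
# prints {'http://baike.baidu.com/view/10812319.htm'}
```

Collecting the results in a set means a link that appears several times on one page is only queued once.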

Output management:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# Some readers may run into encoding problems here; I will update this post with an explanation
class HtmlOutputer(object):
    def __init__(self):
        self.data=[]

    def output_html(self):
        # Write the collected records as one HTML table, one row per page
        with open('output.html', 'w', encoding='utf-8') as ff:
            ff.write('<html>')
            ff.write('<head><meta charset="utf-8" /></head>')
            ff.write('<body>')
            ff.write('<table border="1">')

            for data in self.data:
                ff.write('<tr>')
                ff.write('<td>%s</td>' % data['url'])
                ff.write('<td>%s</td>' % data['title'])
                ff.write('<td>%s</td>' % data['summary'])
                ff.write('</tr>')

            ff.write('</table>')
            ff.write('</body>')
            ff.write('</html>')

    def collect_data(self, new_data):
        if new_data is None:
            return
        self.data.append(new_data)
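
One case the output code does not handle is text containing characters such as `<` or `&`, which could break the table markup. A possible refinement, shown here as a sketch rather than the original author's approach, is to pass each value through the standard library's `html.escape`. The helper name `render_row` and the sample record are hypothetical.

```python
import html


def render_row(data):
    """Build one <tr> for a record, escaping HTML special characters."""
    cells = (data['url'], data['title'], data['summary'])
    return '<tr>' + ''.join(
        '<td>%s</td>' % html.escape(value) for value in cells) + '</tr>'


record = {
    'url': 'http://baike.baidu.com/view/21087.htm',
    'title': 'Python',
    'summary': 'if a < b & c: ...',
}
print(render_row(record))
```

The summary cell then contains `&lt;` and `&amp;` instead of the raw characters, so a browser shows the text literally rather than interpreting it as markup.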