Source Code and Walkthrough for the Course "Python开发简单爬虫" (Developing a Simple Crawler in Python)

Anntonnia

Last modified 2024-03-04 16:26:11

First published 2020-09-15 11:12:31

Article link: https://blog.csdn.net/Anntonnia/article/details/108595403

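This post provides updated source code for Teacher Xu's course "Python开发简单爬虫", adapted to current library versions and to changes in Baidu Baike. It covers five modules, including spider_main, each with detailed comments, and is aimed at beginners in Python crawling. Running spider_main.py starts the crawler.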

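Teacher Xu's course "Python开发简单爬虫" (Developing a Simple Crawler in Python, hosted on 慕课网, www.imooc.com) is clearly structured, detailed in its steps, and carefully explained, which makes it a very good introduction to crawler development in Python. Its one drawback is age: some of the libraries used in the course have since been updated, and the URL format of Baidu Baike entries has changed, so code written exactly as shown in the course no longer works.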
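
To make the URL-format point concrete, here is a minimal sketch, assuming entry links now take the form /item/<name>, as in the root URL https://baike.baidu.com/item/Python/407313 used in this post. The helper names is_entry_link and to_absolute are hypothetical and are not part of the course code or of the modules in this post.

```python
import re
from urllib.parse import urljoin

# Assumption: entry pages live under /item/, as in the root URL used in this post.
ITEM_LINK = re.compile(r"^/item/.+")


def is_entry_link(href):
    """Return True if a relative href looks like a Baidu Baike entry link."""
    return bool(ITEM_LINK.match(href))


def to_absolute(page_url, href):
    """Resolve a relative href against the URL of the page it was found on."""
    return urljoin(page_url, href)
```

A parser could filter the href values it finds with is_entry_link and pass the survivors through to_absolute before handing them to the URL manager.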

The code below has been modified accordingly, with detailed comments added to make it easier for beginners to read.

There are five source files: spider_main.py, url_manager.py, html_downloader.py, html_parser.py, and html_outputer.py.
Of these, spider_main.py is the main file and calls the other four.
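
As a rough picture of how the main file leans on the other modules, the following is a minimal sketch of a URL manager exposing the three methods that spider_main.py calls (add_new_url, has_new_url, get_new_url). The two-set design and the attribute name old_urls are assumptions made for illustration; the actual url_manager.py may differ.

```python
class UrlManager:
    """Tracks URLs waiting to be crawled and URLs already handed out."""

    def __init__(self):
        self.new_urls = set()  # URLs not yet crawled
        self.old_urls = set()  # URLs already handed out (assumed name)

    def add_new_url(self, url):
        # Ignore empty values and anything that has been seen before.
        if not url:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        # set.pop() removes an arbitrary element; sets have no order.
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```

Keeping two sets means a URL that has already been crawled is never queued again, which is what lets the scheduler's while loop terminate.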

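To run: execute spider_main.py in an IDE (such as Geany), or type the following on the command line: py spider_main.py (py is the Python launcher on Windows; on other systems use python3 spider_main.py)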

The source code of the crawler scheduler module, spider_main.py, is as follows:

# -*- coding: utf-8 -*-
# Crawler scheduler module
# from: www.imooc.com course "Python开发简单爬虫": a focused crawler for 1000 links related to the Baidu Baike Python entry page
# author: Anna Yao
# date: 2020-09-14

import url_manager, html_downloader, html_parser, html_outputer

class SpiderMain(object):
	def __init__(self):
		self.urls = url_manager.UrlManager() # URL manager
		self.downloader = html_downloader.HtmlDownloader() # page downloader
		self.parser = html_parser.HtmlParser() # page parser
		self.outputer = html_outputer.HtmlOutputer() # outputs the parsed results
	
	def craw(self, root_url):
		count = 1 # counts the pages crawled
		self.urls.add_new_url(root_url) # add the root page URL "https://baike.baidu.com/item/Python/407313" to the new_urls set
		while self.urls.has_new_url():
			try:
				new_url = self.urls.get_new_url()
				print('craw %d : %s' % (count, new_url)) # show the URL of the page being crawled
				html_cont = self.downloader.download(new_url) # download the page content for the current URL
				new_urls, new_data = self.parser.parse(new_url, html_cont) # 解