I recently took an imooc course on building a simple web crawler in Python, and following the course I implemented a local crawler that scrapes data from 1000 Baidu Baike entry pages. Course URL: https://www.imooc.com/learn/563

Goal: starting from the Baidu Baike page for Python as the entry point, crawl the title and summary of 1000 related Baidu Baike pages and output them in HTML format.

Implementation:
| Module | Function |
| --- | --- |
| spider_main.py | The main scheduler. It first adds root_url (the URL of the Python entry on Baidu Baike) to the new_urls set as the first address. It then takes a new_url from the new_urls set, downloads its content with html_downloader.py, uses html_parser.py to extract the related Baidu Baike URLs and a data dictionary from the downloaded page, and appends the data dictionary to the datas list. This loop runs 1000 times, and finally the datas list is written out as HTML. |
| urlManager.py | The URL manager. It implements four functions: 1) add a single url to the new_urls set; 2) add multiple urls to the new_urls set; 3) check whether new_urls still has unread urls; 4) take a url from new_urls, remove it from new_urls, and add it to the old_urls set. |
| html_downloader.py | Downloads page data. Input: a URL; output: the downloaded page content. |
| html_parser.py | Parses page data, extracting the page's title, summary, and the Baidu Baike URLs it contains. Input: the downloaded page content; output: the set of extracted URLs and a data dictionary (containing the parsed page's URL, title, and summary). |
| html_outputer.py | Provides two functions: 1) collect the data dictionaries produced by html_parser.py into the datas list; 2) write datas out as HTML. |
spider_main.py
#coding=utf-8
import urlManager, html_downloader, html_parser, html_outputer


class spider_main(object):
    def __init__(self):
        # wire up the four helper modules
        self.urls = urlManager.urlManager()
        self.downloader = html_downloader.htmlDownloader()
        self.parser = html_parser.htmlParser()
        self.outputer = html_outputer.htmlOutputer()

    def craw(self, root_url):
        self.urls.add_new_url(root_url)
        count = 1
        while self.urls.has_new_url():
            try:
                url = self.urls.get_new_url()
                html_cont = self.downloader.download(url)
                print("craw %d: %s" % (count, url))
                new_urls, new_data = self.parser.parse(url, html_cont)
                self.urls.add_new_urls(new_urls)
                self.outputer.collect_data(new_data)
                count += 1
                if count > 1000:
                    break
            except:
                print("craw failed")
        self.outputer.output_html()


if __name__ == "__main__":
    root_url = "https://baike.baidu.com/item/Python/407313?fr=aladdin"
    objSpider = spider_main()
    objSpider.craw(root_url)
urlManager.py
#coding=utf-8
class urlManager(object):
    def __init__(self):
        self.new_urls = set()
        self.old_urls = set()

    def add_new_url(self, url):
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        if len(self.new_urls) > 0:
            new_url = self.new_urls.pop()
            self.old_urls.add(new_url)
            return new_url
html_downloader.py
#coding=utf-8
import urllib2


class htmlDownloader(object):
    def download(self, url):
        if url is None:
            return None
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        request = urllib2.Request(url, headers=headers)
        response = urllib2.urlopen(request)
        if response.getcode() != 200:
            return None
        return response.read()
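The downloader above uses urllib2, which only exists in Python 2 (the course targets Python 2). For anyone running Python 3, a rough equivalent built on the standard urllib.request module could look like the sketch below; the class and method names mirror the original, but this is my own port, not course code.

```python
# coding=utf-8
# Rough Python 3 port of html_downloader.py (my own sketch, not from the course).
from urllib import request


class htmlDownloader(object):
    def download(self, url):
        if url is None:
            return None
        headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
        req = request.Request(url, headers=headers)
        response = request.urlopen(req)
        if response.getcode() != 200:
            return None
        # In Python 3 read() returns bytes; BeautifulSoup accepts bytes directly,
        # otherwise decode with response.read().decode('utf-8').
        return response.read()
```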
html_parser.py
#coding=utf-8
from bs4 import BeautifulSoup
import urlparse
import re


class htmlParser(object):
    def _get_new_url(self, page_url, soup):
        new_urls = set()
        links = soup.find_all('a', href=re.compile(r"item/\S+"))
        for link in links:
            new_url = link['href']
            new_full_url = urlparse.urljoin(page_url, new_url)
            new_urls.add(new_full_url)
        return new_urls

    def _get_new_data(self, page_url, soup):
        new_data = {}
        new_data['url'] = page_url
        #if soup.find('h1') is None or soup.find('div', class_="lemma-summary") is None:
        #    return
        new_data['title'] = soup.find('h1').get_text()
        new_data['summary'] = soup.find('div', class_="lemma-summary").get_text()
        return new_data

    def parse(self, page_url, html_cont):
        if page_url is None or html_cont is None:
            return
        soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')
        new_urls = self._get_new_url(page_url, soup)
        new_data = self._get_new_data(page_url, soup)
        return new_urls, new_data
html_outputer.py
#coding=utf-8
class htmlOutputer(object):
    def __init__(self):
        self.datas = []

    def collect_data(self, data):
        if data is None:
            return
        self.datas.append(data)

    def output_html(self):
        fout = open('output.html', 'w')
        fout.write('<html>')
        fout.write('<body>')
        fout.write('<table style="table-layout:fixed">')
        for data in self.datas:
            #fout.write('<tr>')
            #fout.write('<td width=100px>%s</td>' % data['url'].encode('utf-8'))
            #fout.write('<td width=100px>%s</td>' % data['title'].encode('utf-8'))
            #fout.write('<td width=600px>%s</td>' % data['summary'].encode('utf-8'))
            #fout.write('</tr>')
            fout.write('<p>link: %s</p>' % data['url'])
            fout.write('<p>title: %s</p>' % data['title'].encode('utf-8'))
            fout.write('<p>summary: %s</p>' % data['summary'].encode('utf-8'))
        fout.write('</table>')
        fout.write('</body>')
        fout.write('</html>')
        fout.close()
Problems encountered:
1. How do I import the other modules into spider_main.py?
Put urlManager.py, html_downloader.py, html_parser.py, and html_outputer.py in the same directory as spider_main.py; they can then be imported directly with import, as shown in the layout below.
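For reference, the layout on disk is simply the five .py files side by side (the folder name here is my own choice, not specified in the course):

```
baike_spider/            # any directory name works
├── spider_main.py       # imports the four modules below
├── urlManager.py
├── html_downloader.py
├── html_parser.py
└── html_outputer.py
```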
2. In what order is the page data output?
The page URLs are stored in a set, and sets are unordered, so the order of the output page data does not match the order in which the Baidu Baike links appear on the pages.
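If the output order mattered, one option (my own variant, not part of the course code) would be to keep pending URLs in a list and use a set only for de-duplication, roughly like this:

```python
# coding=utf-8
# Order-preserving variant of urlManager (a sketch, not the course code).
class orderedUrlManager(object):
    def __init__(self):
        self.new_urls = []      # pending URLs, in discovery order
        self.seen_urls = set()  # every URL ever added, used only for de-duplication

    def add_new_url(self, url):
        if url is None or url in self.seen_urls:
            return
        self.seen_urls.add(url)
        self.new_urls.append(url)

    def add_new_urls(self, urls):
        for url in (urls or []):
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        # pop(0) hands back the oldest URL first, so pages are crawled breadth-first
        return self.new_urls.pop(0)
```

Note that _get_new_url in html_parser.py also collects links into a set, so it would have to return a list as well for the order of links within a single page to survive.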
3. page_url is never defined anywhere; how does it get its value?
In spider_main.py, the value of url is handed to page_url through "new_urls, new_data = self.parser.parse(url,html_cont)". url looks like "https://baike.baidu.com/item/Python/407313?fr=aladdin", while new_url is a root-relative link taken from the page, e.g. "/item/%E9%9B%86%E5%A4%96%E9%9B%86%E6%8B%BE%E9%81%97". "new_full_url = urlparse.urljoin(page_url,new_url)" then combines the two into "https://baike.baidu.com/item/%E9%9B%86%E5%A4%96%E9%9B%86%E6%8B%BE%E9%81%97": because the link starts with "/", urljoin keeps page_url's scheme and host and replaces the whole path.
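A quick way to see what urljoin does with such a root-relative link (Python 2, matching the imports used above; the example entry is the one mentioned in point 3):

```python
# coding=utf-8
# Demonstration of urlparse.urljoin resolving a root-relative Baidu Baike link.
import urlparse

page_url = "https://baike.baidu.com/item/Python/407313?fr=aladdin"
new_url = "/item/%E9%9B%86%E5%A4%96%E9%9B%86%E6%8B%BE%E9%81%97"

# Because new_url starts with "/", urljoin keeps only the scheme and host of
# page_url and replaces the whole path (the ?fr=aladdin query is dropped too).
print urlparse.urljoin(page_url, new_url)
# -> https://baike.baidu.com/item/%E9%9B%86%E5%A4%96%E9%9B%86%E6%8B%BE%E9%81%97
```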
4. How can the column widths of the HTML table that Python outputs be adjusted?
Setting width=... on the cells did not work and I have not yet found a way to adjust it, so I changed the output from a table to plain paragraphs (the result is shown in the screenshot below).
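One approach that should make the widths stick (untested against this exact page, so treat it as a sketch rather than the final code): keep table-layout:fixed, but declare the column widths in a colgroup instead of width=... attributes on the cells. A replacement for output_html along these lines:

```python
# coding=utf-8
# Sketch: writing the collected datas as a fixed-width HTML table (my own
# variant of output_html, not the course code). With table-layout:fixed the
# column widths declared in <colgroup> are respected, unlike a bare
# width=100px attribute on each <td>.
def output_html(datas, path='output.html'):
    fout = open(path, 'w')
    fout.write('<html><head><meta charset="utf-8"></head><body>')
    fout.write('<table border="1" style="table-layout:fixed; width:900px; word-wrap:break-word">')
    fout.write('<colgroup><col style="width:300px"><col style="width:150px"><col></colgroup>')
    for data in datas:
        fout.write('<tr>')
        fout.write('<td>%s</td>' % data['url'])
        fout.write('<td>%s</td>' % data['title'].encode('utf-8'))
        fout.write('<td>%s</td>' % data['summary'].encode('utf-8'))
        fout.write('</tr>')
    fout.write('</table></body></html>')
    fout.close()
```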