Build Your Own Search Engine (2): A Python Crawler


chufanwmmmz5723


Tags: crawler, python

Original link: https://my.oschina.net/u/3090863/blog/796334


A simple Python crawler. The crawler scheduler coordinates the other four components: the URL manager, the HTML downloader, the HTML parser, and the HTML outputter. Given a seed URL, it runs the crawl loop, stopping after 500 pages or when no uncrawled URLs remain.

[code lang="python"]
from baike_spider import url_manager, html_download, html_output, html_parser


class SpiderMain(object):
    def __init__(self):
        self.urls = url_manager.UrlManager()
        self.download = html_download.HtmlDownload()
        self.output = html_output.HtmlOutput()
        self.parser = html_parser.HtmlParser()

    def craw(self, root_url):
        count = 1
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                print('craw %d : %s' % (count, new_url))
                html_cont = self.download.download(new_url)
                new_urls, new_data = self.parser.parser(new_url, html_cont)
                self.urls.add_new_urls(new_urls)
                self.output.collect_data(new_data)
                # Stop after 500 successfully crawled pages.
                if count == 500:
                    break
                count = count + 1
            except Exception as e:
                # A failed page is skipped; the loop moves on to the next URL.
                print('craw failed: %s' % e)
        self.output.output_html()


if __name__ == "__main__":
    root_url = "http://baike.baidu.com/view/582.htm"
    obj_spider = SpiderMain()
    obj_spider.craw(root_url)
[/code]
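To see the control flow in isolation, here is a minimal sketch that swaps the real downloader and parser for an in-memory link graph. `FAKE_SITE`, `crawl`, and `max_pages` are hypothetical names used only for illustration; they are not part of the original project.

```python
# Hypothetical in-memory "site": each URL maps to the links found on that page.
FAKE_SITE = {
    "/view/1.htm": ["/view/2.htm", "/view/3.htm"],
    "/view/2.htm": ["/view/1.htm", "/view/3.htm"],
    "/view/3.htm": ["/view/4.htm"],
    "/view/4.htm": [],
}


def crawl(root_url, max_pages=500):
    """Visit pages reachable from root_url, at most max_pages of them."""
    new_urls = {root_url}   # discovered but not yet crawled
    old_urls = set()        # already handed out for crawling
    visited = []            # stands in for the data the outputter would collect

    while new_urls and len(visited) < max_pages:
        url = new_urls.pop()
        old_urls.add(url)
        try:
            links = FAKE_SITE[url]          # stands in for download + parse
        except KeyError:
            print("craw failed: %s" % url)  # unknown page: skip it, keep going
            continue
        visited.append(url)
        for link in links:
            if link not in new_urls and link not in old_urls:
                new_urls.add(link)
    return visited


pages = crawl("/view/1.htm")
print(len(pages))  # 4
```

Each page is visited once even though pages 1 and 2 link to each other, because a URL is only queued when it is in neither set.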

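The URL manager keeps track of the URLs found so far. Each extracted link is compared against the URLs already crawled and the URLs already waiting; only links that pass this check are queued and later handed to the downloader.

[code lang="python"]
class UrlManager(object):
    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs already crawled

    def add_new_url(self, url):
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        # set.pop() returns an arbitrary element, so crawl order is not fixed.
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url
[/code]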
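Because `set.pop()` removes an arbitrary element, the manager above does not crawl in any particular order. If breadth-first order matters, a queue can sit alongside the set. The sketch below is an alternative under that assumption, not the original author's implementation; `FifoUrlManager` is a hypothetical name.

```python
from collections import deque


class FifoUrlManager(object):
    """URL manager that hands out URLs in the order they were discovered."""

    def __init__(self):
        self.queue = deque()  # pending URLs, oldest first
        self.seen = set()     # every URL ever added, pending or crawled

    def add_new_url(self, url):
        if url is None or url in self.seen:
            return
        self.seen.add(url)
        self.queue.append(url)

    def add_new_urls(self, urls):
        for url in urls or ():
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.queue) != 0

    def get_new_url(self):
        return self.queue.popleft()


manager = FifoUrlManager()
manager.add_new_urls(["a", "b", "a"])   # the duplicate "a" is ignored
first = manager.get_new_url()
manager.add_new_url("a")                # already handed out, ignored as well
print(first, manager.get_new_url(), manager.has_new_url())  # a b False
```

It exposes the same five methods as `UrlManager`, so the scheduler would not need to change.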

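The HTML downloader fetches each URL it receives from the URL manager. It uses urllib from the Python standard library.

[code lang="python"]
from urllib import request


class HtmlDownload(object):
    def download(self, url):
        if url is None:
            return None
        response = request.urlopen(url)
        # Anything other than HTTP 200 is treated as a failed download.
        if response.getcode() != 200:
            return None
        # The body is returned as raw bytes; the parser handles decoding.
        return response.read()
[/code]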
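The downloader above has no timeout, and `urlopen` raises an exception for 4xx/5xx responses instead of returning them, so the status check rarely fires. A slightly more defensive sketch follows; the `timeout` default and the `User-Agent` value are assumptions for illustration, not settings taken from the original post.

```python
from urllib import error, request


def download(url, timeout=10):
    """Return the response body as bytes, or None if the fetch fails."""
    if url is None:
        return None
    try:
        # Example header only; some sites reject urllib's default User-Agent.
        req = request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with request.urlopen(req, timeout=timeout) as response:
            if response.status != 200:
                return None
            return response.read()
    except (error.URLError, OSError, ValueError):
        # URLError covers HTTP error statuses and connection problems,
        # OSError covers socket timeouts, ValueError covers malformed URLs.
        return None
```

Returning `None` on every failure keeps the caller's contract the same as in `HtmlDownload.download`.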

The HTML parser takes a downloaded page and pulls out the data we want. It uses the powerful BeautifulSoup library to extract the title, the summary, and the main content of each Baike entry, along with the links to other entries.

[code lang="python"]
from bs4 import BeautifulSoup
import re
import urllib.parse


class HtmlParser(object):
    def _get_new_urls(self, page_url, soup):
        new_urls = set()
        # Entry links look like /view/<digits>.htm
        links = soup.find_all('a', href=re.compile(r"/view/\d+\.htm"))
        for link in links:
            new_url = link['href']
            # Resolve the relative href against the current page URL.
            new_full_url = urllib.parse.urljoin(page_url, new_url)
            new_urls.add(new_full_url)
        return new_urls

    def _get_new_data(self, page_url, soup):
        res_data = {}
        # url
        res_data['url'] = page_url
        # title
        title_node = soup.find('dd', class_="lemmaWgt-lemmaTitle-title").find("h1")
        res_data['title'] = title_node.get_text()
        # summary
        summary_node = soup.find('div', class_="lemma-summary")
        res_data['summary'] = summary_node.get_text()
        # main content div of the Baidu Baike page
        # (the key is spelled 'miandiv' because the outputter reads it under that name)
        main_node = soup.find('div', class_="main-content")
        res_data['miandiv'] = main_node.get_text()
        # main_css = soup.find('link',ref_="stylesheet")
        # res_data['main_css']=main_css.get_all()
        return res_data

    def parser(self, page_url, html_cont):
        if page_url is None or html_cont is None:
            # Return an empty pair so the caller can still unpack two values.
            return set(), None
        soup = BeautifulSoup(html_cont, "html.parser", from_encoding='utf-8')
        new_urls = self._get_new_urls(page_url, soup)
        new_data = self._get_new_data(page_url, soup)
        return new_urls, new_data
[/code]
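The link-extraction step can be tried without installing anything. The sketch below uses `html.parser` from the standard library instead of BeautifulSoup, so it is a stand-in for the technique rather than the code above; `LinkCollector`, `ENTRY_LINK`, and the sample HTML are hypothetical.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

ENTRY_LINK = re.compile(r"/view/\d+\.htm")  # same pattern as the parser above


class LinkCollector(HTMLParser):
    """Collect absolute URLs of <a href> links that match ENTRY_LINK."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.urls = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and ENTRY_LINK.search(href):
            # Relative hrefs are resolved against the page they came from.
            self.urls.add(urljoin(self.page_url, href))


sample = (
    '<a href="/view/21087.htm">Python</a> '
    '<a href="/help">Help</a> '
    '<a href="/view/21087.htm">Python again</a>'
)
collector = LinkCollector("http://baike.baidu.com/view/582.htm")
collector.feed(sample)
print(collector.urls)  # {'http://baike.baidu.com/view/21087.htm'}
```

The `/help` link is filtered out by the pattern, and the repeated entry link collapses into one element because the results are stored in a set.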

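The HTML outputter writes each crawled page to a file with an .html extension. The main thing to watch is encoding: Python 3 strings are Unicode, but on Windows `open()` defaults to the system locale encoding (GBK on a Chinese-language system), so the encoding has to be set explicitly when writing. Also watch for page URLs that end in "?force=1": Windows file names cannot contain "?", so that suffix is stripped.

[code lang="python"]
'''
Created on 2016-07-05

@author: HCQ
'''


class HtmlOutput(object):
    def __init__(self):
        self.datas = []

    def collect_data(self, data):
        if data is None:
            return
        self.datas.append(data)

    def output_html(self):
        for data in self.datas:
            # Drop the leading "http://" (7 characters), then remove the
            # characters Windows rejects in file names.
            fname = data['url'][7:].replace("/", "_").replace('?force=1', '')
            # The URLs end in ".htm", so appending 'l' yields ".html".
            # D:\spider\newbaike is the output directory; it must already exist.
            with open(r"D:\spider\newbaike" + "\\" + fname + 'l', 'w', encoding='utf-8') as fout:
                fout.write("<html>")
                fout.write("<head>")
                fout.write("<meta charset='utf-8'>")
                fout.write('<title>' + data['title'] + '</title>')
                fout.write("</head>")
                fout.write("<body>")
                fout.write('<div class="main-content">')
                fout.write(data['miandiv'])
                fout.write("</div>")
                fout.write("<table>")
                fout.write("<tr>")
                fout.write("</tr>")
                fout.write("</table>")
                fout.write("</body>")
                fout.write("</html>")
[/code]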
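The file-name handling can be generalized a little. The sketch below replaces every character Windows forbids in file names rather than only "/" and "?force=1", and escapes the text before writing it into HTML. `url_to_filename`, `write_page`, and the `body` key are hypothetical names, and the target directory is passed in rather than hard-coded.

```python
import html
import os
import re
from urllib.parse import urlsplit

ILLEGAL = re.compile(r'[\\/:*?"<>|]')  # characters Windows forbids in file names


def url_to_filename(url):
    """Build a Windows-safe .html file name from a page URL."""
    parts = urlsplit(url)
    # Only host and path are kept; the query string (e.g. "force=1") is dropped.
    name = ILLEGAL.sub("_", parts.netloc + parts.path)
    if name.endswith(".htm"):
        name += "l"
    elif not name.endswith(".html"):
        name += ".html"
    return name


def write_page(directory, data):
    """Write one page as UTF-8 HTML and return the path of the new file."""
    path = os.path.join(directory, url_to_filename(data["url"]))
    # An explicit encoding avoids the platform default (GBK on Chinese Windows).
    with open(path, "w", encoding="utf-8") as fout:
        fout.write("<html><head><meta charset='utf-8'>")
        fout.write("<title>%s</title></head>" % html.escape(data["title"]))
        fout.write("<body>%s</body></html>" % html.escape(data["body"]))
    return path


print(url_to_filename("http://baike.baidu.com/view/582.htm?force=1"))
# baike.baidu.com_view_582.html
```

Escaping with `html.escape` matters because the parser stores plain text from `get_text()`, which may contain characters such as "<" that would otherwise be read as markup.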

Reposted from: https://my.oschina.net/u/3090863/blog/796334
