Steps
Define the target: decide which pages of which site to crawl, and which pieces of data to extract. For this example, the target is the Baidu Baike Python entry page and its related entry pages, from which we extract each entry's title and summary.
Analyze the target: work out the crawling strategy. First, analyze the URL format of the target pages, so the crawl can be restricted to the pages we want; second, analyze the format of the data to extract, which here means the tags that hold the title and the summary on each entry page; third, analyze the page encoding, because the HTML parser must be given the correct encoding to parse the pages properly (a concrete sketch of these three points follows this list).
Write the code: the parser uses the crawling strategy produced by the analysis step.
Run the crawler.
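As a concrete sketch of what the analysis step produces in this example (the constant names below are my own illustration, not part of the original code, but the values match what the parser later relies on):

import re

# 1. URL format: related entry links on Baidu Baike all contain /item/,
#    so they can be matched with a simple pattern.
ITEM_LINK_PATTERN = re.compile(r"/item/")

# 2. Data format: on an entry page the title sits inside
#    <dd class="lemmaWgt-lemmaTitle-title"><h1>...</h1></dd>
#    and the summary inside <div class="lemma-summary">...</div>.
TITLE_TAG, TITLE_CLASS = "dd", "lemmaWgt-lemmaTitle-title"
SUMMARY_TAG, SUMMARY_CLASS = "div", "lemma-summary"

# 3. Encoding: the pages are UTF-8, so the parser is created with from_encoding='utf-8'.
PAGE_ENCODING = "utf-8"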
Deciding on the framework
[Screenshot: project structure, 20171031-baike]
spider_main.py is the main body of the crawler
url_manager.py maintains two sets that record the URLs still to be crawled and the URLs already crawled
html_downloader.py uses the urllib library to download HTML documents
html_parser.py uses BeautifulSoup to parse HTML documents
html_outputer.py stores the parsed data and writes it into an output.html file
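Because spider_main.py imports the other modules with from baike_spider import ..., the project is presumably laid out as a package roughly like this (the __init__.py is my assumption; the original does not show the exact layout):

baike_spider/
    __init__.py
    spider_main.py
    url_manager.py
    html_downloader.py
    html_parser.py
    html_outputer.py
    output.html        # generated after a run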
url_manager
class UrlManager(object):
    def __init__(self):
        # initialize the two sets
        self.new_urls = set()
        self.old_urls = set()

    def add_new_url(self, url):
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            # avoid crawling the same page twice
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            # delegate to the single-URL method
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url
Explanation: the manager maintains two sets (new_urls and old_urls), recording the URLs still to crawl and the URLs already crawled. Note the first two add methods: one handles a single URL, the other a collection of URLs, and neither skips the deduplication check, so a URL that is already queued or already crawled is never added again.
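A minimal usage sketch of UrlManager (not part of the original code; the URL is just the root entry used later):

manager = UrlManager()
manager.add_new_url("https://baike.baidu.com/item/Python/407313")
manager.add_new_url("https://baike.baidu.com/item/Python/407313")  # duplicate, silently ignored
while manager.has_new_url():
    url = manager.get_new_url()  # moves the URL from new_urls to old_urls
    print(url)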
html_downloader
# coding:utf-8
import urllib.request


class HtmlDownloader(object):
    def download(self, url):
        if url is None:
            return None
        response = urllib.request.urlopen(url)
        if response.getcode() != 200:  # check whether the request succeeded
            return None
        return response.read()
Explanation: a very straightforward download; this is the simplest possible implementation.
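If the plain urlopen call ever gets blocked (some sites reject the default urllib User-Agent), a slightly more defensive variant is to send the request with a browser-like header and a timeout. This is only a sketch under that assumption, not part of the original code:

import urllib.request


class HtmlDownloader(object):
    def download(self, url, timeout=10):
        if url is None:
            return None
        # a browser-like User-Agent; the exact string is arbitrary
        request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        response = urllib.request.urlopen(request, timeout=timeout)
        if response.getcode() != 200:
            return None
        return response.read()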
html_parser
from bs4 import BeautifulSoup
import urllib.parse
import re


class HtmlParser(object):
    def _get_new_urls(self, page_url, soup):
        new_urls = set()
        # entry links on Baidu Baike all contain /item/
        links = soup.find_all('a', href=re.compile(r"/item/"))
        for link in links:
            new_url = link['href']
            # turn the (possibly relative) link into an absolute URL
            new_full_url = urllib.parse.urljoin(page_url, new_url)
            new_urls.add(new_full_url)
        return new_urls

    def _get_new_data(self, page_url, soup):
        res_data = {}
        res_data['url'] = page_url
        # <dd class="lemmaWgt-lemmaTitle-title"><h1>entry title</h1></dd>
        title_node = soup.find('dd', class_="lemmaWgt-lemmaTitle-title").find("h1")
        res_data['title'] = title_node.get_text()
        # <div class="lemma-summary">entry summary</div>
        summary_node = soup.find('div', class_="lemma-summary")
        res_data['summary'] = summary_node.get_text()
        return res_data

    def parse(self, page_url, html_cont):
        if page_url is None or html_cont is None:
            return
        soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')
        new_urls = self._get_new_urls(page_url, soup)
        new_data = self._get_new_data(page_url, soup)
        return new_urls, new_data
Explanation: in the parser, note the parse method. It finds all entry links in the HTML document, wraps them in the new_urls set and returns them; at the same time it extracts new_data, a dict holding the entry's URL, its name (title) and its summary.
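A minimal standalone sketch of how the parser is used (assuming the HtmlDownloader and HtmlParser classes above are available; the printed values depend on the live page):

downloader = HtmlDownloader()
parser = HtmlParser()
page_url = "https://baike.baidu.com/item/Python/407313"
html_cont = downloader.download(page_url)
new_urls, new_data = parser.parse(page_url, html_cont)
print(len(new_urls))              # number of related entry links found on the page
print(new_data['title'])          # 'Python' for the root entry
print(new_data['summary'][:50])   # first 50 characters of the summary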
spider_main
# coding:utf-8
from baike_spider import url_manager, html_downloader, html_parser, html_outputer
import logging


class SpiderMain(object):
    def __init__(self):
        self.urls = url_manager.UrlManager()
        self.downloader = html_downloader.HtmlDownloader()
        self.parser = html_parser.HtmlParser()
        self.outputer = html_outputer.HtmlOutputer()

    def crawl(self, root_url):
        count = 1  # number of the page currently being crawled
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                print('crawl No.%d: %s' % (count, new_url))
                html_cont = self.downloader.download(new_url)
                new_urls, new_data = self.parser.parse(new_url, html_cont)
                self.urls.add_new_urls(new_urls)
                self.outputer.collect_data(new_data)
                if count == 1000:  # stop after 1000 pages
                    break
                count += 1
            except Exception:
                logging.warning('crawl failed')
        self.outputer.output_html()


if __name__ == "__main__":
    root_url = "https://baike.baidu.com/item/Python/407313"
    obj_spider = SpiderMain()
    obj_spider.crawl(root_url)
Explanation: the main program starts from the "Python" entry page and begins crawling from there. Note that every page that is crawled may yield new URLs, which are handed back to url_manager, while each page's new_data is collected; when the while loop exits, the collected data is written out (the data volume is small, so it is simply kept in memory until then).
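One optional refinement, not in the original code: logging.warning('crawl failed') hides which URL failed and why. logging.exception also records the current traceback, so the except block could report the failure like this (a tiny self-contained demo; the raise stands in for the real download/parse calls):

import logging

try:
    raise RuntimeError("simulated download/parse failure")
except Exception:
    # unlike logging.warning, this also records the failing URL and the traceback
    logging.exception('crawl No.%d failed: %s', 1, 'https://baike.baidu.com/item/...')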
html_outputer
class HtmlOutputer(object):
    def __init__(self):
        self.datas = []  # a plain list holding the collected data dicts

    def collect_data(self, data):
        if data is None:
            return
        self.datas.append(data)

    def output_html(self):
        with open('output.html', 'w', encoding='utf-8') as fout:
            fout.write("<html>")
            fout.write("<body>")
            fout.write("<table>")
            for data in self.datas:
                fout.write("<tr>")
                fout.write("<td>%s</td>" % data["url"])
                fout.write("<td>%s</td>" % data["title"])
                fout.write("<td>%s</td>" % data["summary"])
                fout.write("</tr>")
            fout.write("</table>")
            fout.write("</body>")
            fout.write("</html>")
Explanation: just pay attention to the encoding; output.html is opened with encoding='utf-8', matching the pages.
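For reference, the generated output.html is one table with a row per crawled entry, roughly like this (whitespace added here for readability; the summary cell is truncated):

<html><body><table>
  <tr>
    <td>https://baike.baidu.com/item/Python/407313</td>
    <td>Python</td>
    <td>...the entry's summary text...</td>
  </tr>
  ...
</table></body></html>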
Output:
"C:\Program Files\Python36\python.exe" D:/PythonProject/immoc/baike_spider/spider_main.py
crawl No.1: https://baike.baidu.com/item/Python/407313
crawl No.2: https://baike.baidu.com/item/Zope
crawl No.3: https://baike.baidu.com/item/OpenCV
crawl No.4: https://baike.baidu.com/item/%E5%BA%94%E7%94%A8%E7%A8%8B%E5%BA%8F
crawl No.5: https://baike.baidu.com/item/JIT
crawl No.6: https://baike.baidu.com/item/%E9%9C%80%E6%B1%82%E9%87%8F
crawl No.7: https://baike.baidu.com/item/Linux
crawl No.8: https://baike.baidu.com/item/%E7%BD%91%E7%BB%9C%E7%BC%96%E7%A8%8B
crawl No.9: https://baike.baidu.com/item/Pylons
crawl No.10: https://baike.baidu.com/item/%E4%BA%A7%E5%93%81%E8%AE%BE%E8%AE%A1
Process finished with exit code 0
output.html
[Screenshot: crawl output, 20171031-baikeout]