Building a Simple Crawler in Python

Latest recommended article published on 2024-10-23 15:17:05

License: CC 4.0 BY-SA

Tags: #python #crawler

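This article is a set of study notes for the course "Building a Simple Crawler in Python" (Python开发简单爬虫). It covers the basic concepts and architecture of a crawler, and how to download and parse web pages with Python's urllib2 and BeautifulSoup modules. A hands-on example walks through crawling 1000 pages of Baidu Baike.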

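Personal summary notes on the imooc (慕课网) course "Building a Simple Crawler in Python"; they will be removed on request in case of infringement.
imooc video course link: https://www.imooc.com/learn/563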

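1. Course Introduction

Goal: develop a lightweight crawler (for static web pages that do not require login)
Contents: introduction to crawlers, a simple crawler architecture
The three modules of the architecture: URL manager, web page downloader, web page parser
Complete example: crawl 1000 pages related to the Python entry on Baidu Baike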

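2. What a Crawler Is and Why It Is Valuable

2.1 What is a crawler?
A program that automatically fetches information from the internet.

2.2 What is the value of crawler technology?
It puts internet data to work for you.
It lets you build products such as a news aggregation reader or a funny-stories app.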

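3. A Simple Crawler Architecture

3.1 Python crawler architecture
Components: crawler scheduler, URL manager, web page downloader, web page parser
Crawler scheduler: starts, schedules, and monitors the crawler
URL manager: manages the URLs not yet crawled and the URLs already crawled
Web page downloader: takes an uncrawled URL from the URL manager, downloads the page, and stores it as a string
Web page parser: parses the string to extract valuable information, and adds the new URLs it finds to the URL manager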

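3.2 Runtime flow of the Python crawler architecture
The scheduler asks the URL manager whether there is a URL waiting to be crawled; the answer is yes or no
If yes, the scheduler takes one pending URL and passes it to the downloader
When the download finishes, the downloader returns the page content to the scheduler
The scheduler passes the page content to the parser
When parsing finishes, the parser returns the valuable data and a list of new URLs to the scheduler
The scheduler collects the valuable data for the application and adds the new URLs to the pending set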
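
The flow above can be sketched offline with a toy example. This is only an illustration under stated assumptions: FAKE_WEB, download, and parse are hypothetical stand-ins for the real downloader and parser, and the "web" is a small dictionary rather than real pages.

```python
# A toy, offline walk-through of the scheduling loop.
# FAKE_WEB, download, and parse are hypothetical stand-ins, not real APIs.
FAKE_WEB = {
    "a": ("data-a", ["b", "c"]),
    "b": ("data-b", ["a", "c"]),   # links back to "a": must not loop
    "c": ("data-c", []),
}


def download(url):
    """Downloader stand-in: return the raw 'page' for a URL."""
    return FAKE_WEB[url]


def parse(page):
    """Parser stand-in: split a page into valuable data and new URLs."""
    data, links = page
    return data, links


def crawl(root_url):
    new_urls = [root_url]      # URLs waiting to be crawled
    old_urls = set()           # URLs already crawled
    collected = []
    while new_urls:                        # any URL left to crawl?
        url = new_urls.pop(0)              # take one pending URL
        old_urls.add(url)
        page = download(url)               # download it
        data, links = parse(page)          # parse it
        collected.append(data)             # collect the valuable data
        for link in links:                 # queue only unseen URLs
            if link not in old_urls and link not in new_urls:
                new_urls.append(link)
    return collected


print(crawl("a"))
```

Each page is visited once even though "b" links back to "a", which is the job the URL manager does in the real crawler.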

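4. The URL Manager and How to Implement It

4.1 URL management in a Python crawler
Manages the set of URLs waiting to be crawled and the set of URLs already crawled
Prevents repeated crawling and crawling in loops

The URL manager supports the following operations:
Add a new URL to the pending set
Check whether a URL to be added is already in the container
Check whether any pending URLs remain
Get a pending URL
Move a URL from pending to crawled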

4.2 Ways to implement the URL manager
In memory: Python set()
Relational database: MySQL, with a urls table
Cache database: Redis, using its set type
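
To give a feel for the relational-database option, here is a minimal sketch. It swaps in sqlite3 from the standard library for MySQL so that it runs without a server; the class name SqliteUrlManager and the table layout urls(url, is_crawled) are assumptions made for illustration, not something the course prescribes.

```python
import sqlite3


class SqliteUrlManager:
    """Hypothetical URL manager backed by a relational table.

    sqlite3 (standard library) stands in for MySQL here; the table
    layout urls(url, is_crawled) is an assumption for illustration.
    """

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS urls ("
            "url TEXT PRIMARY KEY, is_crawled INTEGER NOT NULL DEFAULT 0)"
        )

    def add_new_url(self, url):
        # The primary key turns a repeated insert into a no-op (deduplication).
        self.conn.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
        self.conn.commit()

    def has_new_url(self):
        row = self.conn.execute(
            "SELECT 1 FROM urls WHERE is_crawled = 0 LIMIT 1"
        ).fetchone()
        return row is not None

    def get_new_url(self):
        row = self.conn.execute(
            "SELECT url FROM urls WHERE is_crawled = 0 LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # Mark the URL as crawled instead of moving it between two sets.
        self.conn.execute("UPDATE urls SET is_crawled = 1 WHERE url = ?", (row[0],))
        self.conn.commit()
        return row[0]


manager = SqliteUrlManager()
manager.add_new_url("https://example.com/a")
manager.add_new_url("https://example.com/a")   # duplicate, ignored
first = manager.get_new_url()
manager.add_new_url(first)                      # already crawled, still ignored
print(first, manager.has_new_url())
```

A single table with a flag column replaces the two in-memory sets: a URL is "pending" while is_crawled is 0 and "crawled" once it is 1.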

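5. The Web Page Downloader and the urllib2 Module

5.1 Introduction to the web page downloader
A tool that downloads the web page behind a URL from the internet to the local machine
It fetches the page as HTML and stores it locally as a file or a string
Downloader options: urllib2 (urllib.request in Python 3), requests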

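5.2 Three ways to download a page with urllib2
Method 1: urllib2.urlopen(url)
Method 2: add data and an HTTP header through a Request object
Method 3: add a handler for special cases, such as HTTPCookieProcessor for cookies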

5.3 urllib2 example code
The code below has been ported to Python 3.x, where urllib2 became urllib.request.

# -*- coding: utf-8 -*-

import urllib.request
import http.cookiejar

url = "http://www.baidu.com"

# Method 1: pass the URL straight to urlopen
print('first way')
response1 = urllib.request.urlopen(url)
print(response1.getcode())
print(len(response1.read()))

# Method 2: build a Request, add a header, then open the Request itself
print('second way')
request = urllib.request.Request(url)
request.add_header("user-agent", "Mozilla/5.0")
response2 = urllib.request.urlopen(request)
print(response2.getcode())
print(len(response2.read()))

# Method 3: install an opener that stores cookies in a CookieJar
print('third way')
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
urllib.request.install_opener(opener)
response3 = urllib.request.urlopen(url)
print(response3.getcode())
# read() consumes the body, so read once and reuse the bytes
content3 = response3.read()
print(len(content3))
print(cj)
print(content3)
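
Method 2 lists data alongside the HTTP header, but the example only adds a header. The sketch below shows how form data could be attached to a Request as well. The endpoint URL and the form fields are hypothetical; nothing is sent over the network, because the request is only built and inspected.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint and form fields, used only for illustration.
url = "http://example.com/search"
form = {"q": "python", "page": "1"}

# data must be bytes, so urlencode the form and encode the result.
body = urllib.parse.urlencode(form).encode("utf-8")

request = urllib.request.Request(url, data=body)
request.add_header("User-Agent", "Mozilla/5.0")

# Nothing is sent until urllib.request.urlopen(request) is called,
# so the request can be inspected offline.
print(request.get_method())
print(request.data)
print(request.get_header("User-agent"))
```

Supplying data switches the request method from GET to POST. Note that Request stores header names in capitalized form, which is why the lookup uses "User-agent".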

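6. The Web Page Parser and the Third-Party BeautifulSoup Module

6.1 Introduction to the web page parser
A tool that extracts valuable data from web pages
Input: the downloaded HTML page string; output: the valuable data and a list of new URLs
Types: regular expressions (fuzzy matching); html.parser, BeautifulSoup, lxml (structured parsing)
Structured parsing: the page is loaded into a DOM (Document Object Model) tree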
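
As a small illustration of parsing by structure rather than by text pattern, here is a sketch that uses only html.parser from the standard library. Note that HTMLParser reports tags as events instead of building a full DOM tree; the LinkCollector class is a hypothetical helper written for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None      # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a> tag.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


collector = LinkCollector()
collector.feed("<p>See <a href='123.html' class='article_link'> Python </a></p>")
print(collector.links)
```

BeautifulSoup hides this bookkeeping behind find and find_all, which is one reason the course uses it for the real crawler.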

6.2 About BeautifulSoup and how to install it
A third-party Python library for extracting data from HTML or XML
Official site: https://www.crummy.com/software/BeautifulSoup/
Install: pip install beautifulsoup4
Test: import bs4

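6.3 BeautifulSoup syntax
Create a BeautifulSoup object, search for nodes with find_all and find, then access a node's name, attributes, and text

<a href='123.html' class='article_link'> Python </a>
Node name: a
Node attribute: href='123.html'
Node attribute: class='article_link'
Node content: Python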

Syntax:

1. Create a BeautifulSoup object
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(
                         html_doc,               # HTML document (str or bytes)
                         'html.parser',          # HTML parser
                         from_encoding='utf8'    # document encoding (applies to bytes input)
                         )


2. Search for nodes (find_all, find)
    # Signature: find_all(name, attrs, string)
    import re

    # Find all nodes whose tag is a
    soup.find_all('a')

    # Find all a nodes whose link matches the form /view/123.htm
    soup.find_all('a', href='/view/123.htm')
    soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))

    # Find all div nodes with class abc and text Python
    soup.find_all('div', class_='abc', string='Python')


3. Access node information
    # Given the node: <a href='1.html'>Python</a>

    # Tag name of the a node that was found
    node.name

    # href attribute of the a node
    node['href']

    # Link text of the a node
    node.get_text()

6.4 BeautifulSoup example test

# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup
import re

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# html_doc is already a str, so from_encoding is not needed here
soup = BeautifulSoup(html_doc, 'html.parser')

# All <a> nodes
print('get all links')
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())

# Exact match on href
print('get link of lacie')
link_node = soup.find('a', href='http://example.com/lacie')
print(link_node.name, link_node['href'], link_node.get_text())

# Regular-expression match on href
print('regular match')
link_node = soup.find('a', href=re.compile(r"ill"))
print(link_node.name, link_node['href'], link_node.get_text())

# Match on tag name plus class
print('get text of p')
p_node = soup.find('p', class_="title")
print(p_node.name, p_node.get_text())

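7. Hands-On: Crawling 1000 Pages of Baidu Baike

7.1 Crawler example: analyzing the target
Workflow: choose the target, analyze the target, write the code, run the crawler
What to analyze: URL format, data format, page encoding

Target: title and summary of the pages related to the Python entry on Baidu Baike
Entry page: https://baike.baidu.com/item/Python/407313
URL format: /item/源代码/3969
Title format: <dd class="lemmaWgt-lemmaTitle-title"><h1>***</h1></dd>
Summary format: <div class="lemma-summary">***</div>
Page encoding: UTF-8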
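
The URL format matters because entry links are relative paths, and a non-ASCII title such as 源代码 typically shows up percent-encoded in an href. The sketch below, which assumes the percent-encoded form of the example path above, shows how such a link can be recognized with the /item/% pattern used by the parser in section 7.5 and resolved against the entry page.

```python
import re
from urllib.parse import unquote, urljoin

entry_url = "https://baike.baidu.com/item/Python/407313"
# Percent-encoded form of /item/源代码/3969, as it might appear in an href.
href = "/item/%E6%BA%90%E4%BB%A3%E7%A0%81/3969"

# The parser's pattern only accepts links whose title is percent-encoded.
is_entry_link = re.search(r"/item/%", href) is not None

# Resolve the relative path against the page it was found on.
full_url = urljoin(entry_url, href)

print(is_entry_link)
print(full_url)
print(unquote(href))
```

One consequence of this pattern is that links with a plain ASCII title, such as /item/Python/407313 itself, are not matched and so are not followed.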

7.2 The scheduler

# UrlManager, HtmlDownloader, HtmlParser, and HtmlOutputer are the classes
# from sections 7.3 to 7.6; define or import them before running this code.
class SpiderMain(object):
    def __init__(self):
        self.urls = UrlManager()
        self.downloader = HtmlDownloader()
        self.parser = HtmlParser()
        self.outputer = HtmlOutputer()

    def craw(self, root_url):
        count = 1
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                print('craw %d: %s' % (count, new_url))
                html_cont = self.downloader.download(new_url)
                new_urls, new_data = self.parser.parse(new_url, html_cont)
                self.urls.add_new_urls(new_urls)
                self.outputer.collect_data(new_data)
                # Stop after 1000 successfully crawled pages
                if count == 1000:
                    break
                count = count + 1
            except Exception as e:
                # A failed page is skipped; the loop moves on to the next URL
                print('craw failed: %s' % e)
        self.outputer.output_html()


if __name__ == "__main__":
    root_url = "https://baike.baidu.com/item/python/407313"
    obj_spider = SpiderMain()
    obj_spider.craw(root_url)

7.3 URL manager

class UrlManager(object):
    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs already crawled

    def add_new_url(self, url):
        if url is None:
            return
        # Skip URLs that are already pending or already crawled
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        # Move one URL from the pending set to the crawled set
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url

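7.4 HTML downloader (html_downloader)

import urllib.request


class HtmlDownloader(object):
    def download(self, url):
        if url is None:
            return None
        # Give up on pages that take longer than 1 second to respond
        response = urllib.request.urlopen(url, timeout=1)
        if response.getcode() != 200:
            return None
        # Raw bytes; the parser handles decoding
        return response.read()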

7.5 HTML parser (html_parser)

import re
import urllib.parse

from bs4 import BeautifulSoup


class HtmlParser(object):
    def _get_new_urls(self, page_url, soup):
        new_urls = set()
        # Match entry links such as /item/%E6%BA%90%E4%BB%A3%E7%A0%81/3969 (percent-encoded title)
        links = soup.find_all('a', href=re.compile(r"/item/%"))
        for link in links:
            new_url = link['href']
            # Turn the relative path into an absolute URL
            new_full_url = urllib.parse.urljoin(page_url, new_url)
            new_urls.add(new_full_url)
        return new_urls

    def _get_new_data(self, page_url, soup):
        res_data = {}
        # url
        res_data['url'] = page_url
        # <dd class="lemmaWgt-lemmaTitle-title"> <h1>Python</h1>
        title_node = soup.find('dd', class_="lemmaWgt-lemmaTitle-title").find("h1")
        res_data['title'] = title_node.get_text()
        # <div class="lemma-summary">
        summary_node = soup.find('div', class_="lemma-summary")
        res_data['summary'] = summary_node.get_text()
        return res_data

    def parse(self, page_url, html_cont):
        if page_url is None or html_cont is None:
            # Return an empty result so the caller can still unpack two values
            return set(), None
        soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')
        new_urls = self._get_new_urls(page_url, soup)
        new_data = self._get_new_data(page_url, soup)
        return new_urls, new_data

7.6 HTML outputer

class HtmlOutputer(object):
    def __init__(self):
        self.datas = []

    def collect_data(self, data):
        if data is None:
            return
        self.datas.append(data)

    def output_html(self):
        # Write every collected record as one row of an HTML table
        with open('output.html', 'w', encoding="utf-8") as fout:
            fout.write("<html>")
            fout.write("<body>")
            fout.write("<table>")
            for data in self.datas:
                fout.write("<tr>")
                fout.write("<td>%s</td>" % data['url'])
                fout.write("<td>%s</td>" % data['title'])
                fout.write("<td>%s</td>" % data['summary'])
                fout.write("</tr>")
            fout.write("</table>")
            fout.write("</body>")
            fout.write("</html>")
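
The outputer writes the crawled values into the table as they are. If a title or summary happens to contain characters such as < or &, the resulting HTML can break. One possible hardening, shown here as a sketch and not as part of the course code, is to escape each value with html.escape from the standard library. The table_row helper and the sample record are hypothetical.

```python
import html


def table_row(data):
    """Render one crawled record as an escaped HTML table row."""
    cells = "".join(
        "<td>%s</td>" % html.escape(str(data[key]))
        for key in ("url", "title", "summary")
    )
    return "<tr>%s</tr>" % cells


# Hypothetical record; the title and summary contain characters
# that would otherwise be interpreted as markup.
record = {
    "url": "https://example.com/x",
    "title": "A & B",
    "summary": "<b>bold</b>",
}
print(table_row(record))
```

A helper like this could replace the three per-cell write calls inside the loop, with fout.write(table_row(data)) for each record.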

7.7 Running the crawler and viewing the results
[Figure: screenshot of the crawl results]