How simple can a crawler be? Let me teach you to scrape a million images in three minutes.

Latest recommended article published on 2024-09-12 18:31:52

爬遍天下无敌手

Categories: Programmer, Python. Tags: python, crawler, cats

Article link: https://blog.csdn.net/weixin_43881394/article/details/121438323

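This article shows how to quickly build a crawler in Python, using BeautifulSoup and the Scrapy framework together with the Requests and BloomFilter libraries to scrape cat images efficiently. It walks through the whole process, from analysing the page structure and configuring the initial spider to handling image Items and an IP proxy pool, and is suitable for beginners.

Summary generated by CSDN using AI.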

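What is a crawler?

If you have never worked with crawlers, you may wonder what a crawler actually is. The concept is simple. In the internet age the World Wide Web carries an enormous amount of information, and extracting and using that information effectively is a major challenge. When a browser sends a request to a website, the server responds with HTML text, which the browser renders for display. A crawler takes advantage of exactly this: a program imitates the user's request to obtain the HTML, then extracts the data it needs from it. If you picture the network as a spider web, the crawler is the spider on it, continually crawling for data and information.

The idea is easy to grasp, and a simple crawler can be written with little more than Python's built-in urllib library and an HTML parser. The code below is a very simple crawler that anyone with basic Python knowledge should be able to follow. It collects the links from every <a> tag on a page (without applying any filtering rules) and then keeps following those links in a depth-first search.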

from bs4 import BeautifulSoup
import urllib.request
import os
from datetime import datetime

# Entity class for a web page; it holds only two attributes, url and title
class Page(object):
    def __init__(self,url,title):
        self._url = url
        self._title = title

    def __str__(self):
        return '[Url]: %s [Title]: %s' %(self._url,self._title)

    __repr__ = __str__

    @property
    def url(self):
        return self._url

    @property
    def title(self):
        return self._title

    @url.setter
    def url(self,value):
        if not isinstance(value,str):
            raise ValueError('url must be a string!')
        if value == '':
            raise ValueError('url must be not empty!')
        self._url = value

    @title.setter
    def title(self,value):
        if not isinstance(value,str):
            raise ValueError('title must be a string!')
        if value == '':
            raise ValueError('title must be not empty!')
        self._title = value

class Spider(object):

    def __init__(self,init_page):
        self._init_page = init_page # seed page, i.e. the crawler's entry point
        self._pages = []
        self._soup = None # BeautifulSoup, the parser used to parse HTML

    def crawl(self):
        start_time = datetime.now()
        print('[Start Time]: %s' % start_time)
        start_timestamp = start_time.timestamp()
        tocrawl = [self._init_page] # pages waiting to be crawled
        crawled = [] # pages that have already been crawled
        # keep looping until the whole link graph has been searched
        while tocrawl:
            page = tocrawl.pop()
            if page not in crawled:
                self._init_soup(page)
                self._packaging_to_pages(page)
                links = self._extract_links()
                self._union_list(tocrawl,links)
                crawled.append(page)
        self._write_to_curdir()
        end_time = datetime.now()
        print('[End Time]: %s' % end_time)
        end_timestamp = end_time.timestamp()
        # timestamp() is already in seconds, so subtract start from end and format the result
        print('[Total Time Consuming]: %.3fs' % (end_timestamp - start_timestamp))

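    def _init_soup(self,page):
        page_content = None
        try:
            # urllib imitates a user's request and returns the HTML of the response
            page_content = urllib.request.urlopen(page).read()
        except Exception:
            # any failure (bad URL, network error) is treated as an empty page
            page_content = ''
        # initialise BeautifulSoup; the second argument is the name of the parser to use
        self._soup = BeautifulSoup(page_content,'lxml')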

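    def _extract_links(self):
        a_tags = self._soup.find_all('a') # find every <a> tag
        links = []
        # collect the link from each <a> tag
        for a_tag in a_tags:
            href = a_tag.get('href')
            # skip tags without href and links urlopen cannot fetch as-is
            # (relative paths, '#' anchors, mailto: and similar)
            if href and href.startswith(('http://','https://')):
                links.append(href)
        return links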

    def _packaging_to_pages(self,page):
        title_string = ''
        try:
            title_string = self._soup.title.string # text content of the <title> tag
        except AttributeError as e :
            print(e)
        page_obj = Page(page,title_string)
        print(page_obj)
        self._pages.append(page_obj)

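    # Write everything that was crawled to out.txt in the current directory
    def _write_to_curdir(self):
        cur_path = os.path.join(os.path.abspath('.'),'out.txt')
        print('Start write to %s' % cur_path)
        with open(cur_path,'w',encoding='utf-8') as f:
            # f.write() only accepts a string, so write one page per line
            for page in self._pages:
                f.write('%s\n' % page)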

    # Merge the elements of dest that are not already in src into src
    def _union_list(self,src,dest):
        for dest_val in dest:
            if dest_val not in src:
                src.append(dest_val)

    @property
    def init_page(self):
        return self._init_page

    @property
    def pages(self):
        return self._pages


def test():
    spider = Spider('https://sylvanassun.github.io/')
    spider.crawl()

if __name__ == '__main__':
    test()
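
The traversal logic of the crawler above can be tried without touching the network. The following is a minimal sketch, not the author's code: `FAKE_WEB` is a made-up dictionary mapping hypothetical URLs to the links found on each page, and `crawl_order` and `max_pages` are hypothetical names introduced for illustration. It keeps visited pages in a set, so membership tests stay fast as the crawl grows, and it stops after `max_pages` pages so the crawl always terminates.

```python
# Hypothetical in-memory "web": each URL maps to the links found on that page.
FAKE_WEB = {
    'http://example.com/': ['http://example.com/a', 'http://example.com/b'],
    'http://example.com/a': ['http://example.com/b', 'http://example.com/'],
    'http://example.com/b': ['http://example.com/c'],
    'http://example.com/c': [],
}

def crawl_order(seed, max_pages=100):
    """Return the URLs in the order a stack-based (depth-first) crawl visits them."""
    tocrawl = [seed]   # stack of pages waiting to be visited
    crawled = set()    # pages already visited
    order = []
    while tocrawl and len(order) < max_pages:
        page = tocrawl.pop()
        if page in crawled:
            continue
        crawled.add(page)
        order.append(page)
        # queue only links that are neither visited nor already waiting
        for link in FAKE_WEB.get(page, []):
            if link not in crawled and link not in tocrawl:
                tocrawl.append(link)
    return order

if __name__ == '__main__':
    print(crawl_order('http://example.com/'))
```

Because `tocrawl.pop()` takes the most recently added link, the crawl goes deep before it goes wide; replacing the list with `collections.deque` and calling `popleft()` would turn it into a breadth-first crawl.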

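A high-performance crawler, however, is considerably more complex. Since this article aims for a quick implementation, we will use an existing crawler framework as scaffolding and build our image crawler on top of it (if you have the time, you are of course encouraged to build your own wheel).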

BeautifulSoup

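BeautifulSoup is a Python library for extracting data from HTML and XML. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it; in that case you only need to state the original encoding.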
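
As a rough illustration of stating the original encoding, the sketch below feeds Beautiful Soup GBK-encoded bytes that carry no charset declaration and names the encoding through the `from_encoding` argument. The sample markup is invented for this example, and `html.parser` is used only so that nothing beyond `beautifulsoup4` has to be installed.

```python
from bs4 import BeautifulSoup

# Bytes encoded as GBK with no <meta charset> declaration, so the encoding
# cannot be read from the document itself (made-up sample markup).
raw = '<html><head><title>猫咪图片</title></head></html>'.encode('gbk')

# from_encoding tells Beautiful Soup how to decode the incoming bytes.
soup = BeautifulSoup(raw, 'html.parser', from_encoding='gbk')

title = soup.title.string          # decoded to a Python str (Unicode)
utf8_bytes = soup.encode('utf-8')  # serialised back out as UTF-8 bytes

print(title)
print(soup.original_encoding)
```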

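Using BeautifulSoup well saves a lot of time otherwise spent writing regular expressions, and when you need a more precise search, BeautifulSoup also accepts regular expressions in its queries.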
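
A small sketch of a regular-expression query, assuming we want only the links that point at image files. The HTML snippet and the pattern are made up for illustration; `find_all` accepts a compiled pattern as an attribute filter.

```python
import re
from bs4 import BeautifulSoup

# Made-up sample markup: two image links, one ordinary link, one <a> without href.
html = """
<a href="/cats/1.jpg">cat 1</a>
<a href="/cats/2.png">cat 2</a>
<a href="/about.html">about</a>
<a>no href</a>
"""

soup = BeautifulSoup(html, 'html.parser')

# Pass a compiled pattern as the attribute filter: only <a> tags whose href
# matches the pattern are returned, and tags without href are skipped.
image_links = soup.find_all('a', href=re.compile(r'\.(jpg|png)$'))
image_urls = [a['href'] for a in image_links]
print(image_urls)
```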

BeautifulSoup 3 is no longer maintained, and BeautifulSoup 4 is what is generally used today. Installing it is simple; just run the following command.

pip install beautifulsoup4

Then import the BeautifulSoup class from the bs4 module and create an instance.

from bs4 import BeautifulSoup

soup = BeautifulSoup(body,'lxml')

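Creating a BeautifulSoup object takes two arguments: the first is the HTML content to parse, and the second is the name of the parser. If the second argument is omitted, BeautifulSoup picks the best parser that is installed (lxml first, then html5lib, then Python's built-in html.parser) and issues a warning, so it is better to name the parser explicitly. The supported parsers include lxml, html5lib, and html.parser.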
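
Naming a parser that is not installed raises `bs4.FeatureNotFound`. The following is one possible way to prefer lxml while still running on machines that lack it; `make_soup` is a hypothetical helper written for this example, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred='lxml'):
    """Parse markup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised when the requested parser library is not installed.
        return BeautifulSoup(markup, 'html.parser')

soup = make_soup('<p>Hello, <b>cats</b></p>')
print(soup.b.string)
```

Different parsers can build slightly different trees from malformed HTML, so a fallback like this trades strict consistency for convenience.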

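Third-party parsers have to be installed by the user. This article uses the lxml parser, which is installed with the command below (lxml is built on the C libraries libxml2 and libxslt, which must be installed first if pip has no prebuilt package for your platform).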

pip install lxml

The example below demonstrates the basic usage of BeautifulSoup; for more, refer to the BeautifulSoup documentation.

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</tit
