Web Scraping Basics (3): Connecting to MongoDB

围巾的ACM

Article link: https://blog.csdn.net/qq_21057881/article/details/71079920

Connecting to MongoDB

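We wrote a fairly robust crawler in the earlier posts, but accidents still happen: if a run is interrupted, we have to start crawling and downloading the images all over again. Maddening! So how do we check whether an image has already been downloaded? Walking the download folder for every image is clearly not an option, since it would be painfully slow. Instead, we can use a database to deduplicate.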

Environment: Ubuntu 16.04, Python 3.6.1, MongoDB as the database.

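Some basic MongoDB operations were covered in an earlier post; have a look there, or search for them yourself.

Here are just a few commonly used commands: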

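show dbs: inside the mongo shell, lists the existing databases.

use xxx: inside the mongo shell, switches to the database named xxx.

show collections: lists the collections in the database you are currently using.

db.xxx.stats(): shows the statistics of the collection xxx in the current database.

db.xxx.find(): shows all documents in the collection xxx; the shell displays 20 at a time.

db.xxx.find(xxxx): finds the documents matching the query xxxx. If nothing matches, the shell prints nothing; in PyMongo, find_one() returns None in that case.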
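The shell commands above have rough counterparts in PyMongo. The following is a minimal sketch, assuming a reasonably recent PyMongo release in which `list_database_names()` and `list_collection_names()` are available. The helper name `summarize` and the layout of its return value are made up for illustration, and `client` is expected to be a `MongoClient` instance created by the caller.

```python
def summarize(client, db_name, coll_name, limit=20):
    """Collect roughly what the shell commands above would show.

    `client` is assumed to be a pymongo.MongoClient. The function name
    and the returned dictionary layout are illustrative only.
    """
    db = client[db_name]            # counterpart of: use <db_name>
    collection = db[coll_name]
    return {
        # counterpart of: show dbs
        'databases': client.list_database_names(),
        # counterpart of: show collections
        'collections': db.list_collection_names(),
        # counterpart of: db.<coll_name>.stats()
        'stats': db.command('collstats', coll_name),
        # counterpart of: db.<coll_name>.find(), capped like the shell's first page
        'documents': list(collection.find().limit(limit)),
    }
```

With a MongoDB server running locally, something like `summarize(MongoClient(), 'mzituPic', 'mzitu')` should return the four pieces of information in one dictionary.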

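Import the MongoDB driver: `from pymongo import MongoClient`

In the program, we first write an initializer that obtains the database connection. While we are at it, the jpg_download_list and the url from the earlier posts move into it as well.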

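def __init__(self):
    # MongoClient() with no arguments connects to localhost:27017
    self.client = MongoClient()
    self.db = self.client['mzituPic']        # the database
    self.collection = self.db['mzitu']       # a collection inside that database
    self.jpg_download_list = []              # image URLs still to be downloaded
    self.base_url = 'http://www.mzitu.com'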

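The first three lines of the body do the following: create a client connected to the MongoDB server, select the mzituPic database, and then get the mzitu collection inside that database. Easy enough, right? MongoDB creates the database and the collection automatically the first time something is written to them.

Next, we add the database check to the getAllJpg_Info() function: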

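    for jpgpage in range(1, int(total) + 1):
        now_url = '{}/{}'.format(url, jpgpage)
        selector = html.fromstring(self.GetRespon(now_url, 10).text)
        try:
            # The first match is the image URL on this page
            s = selector.xpath('//div[@class="main-image"]/p/a/img/@src')[0]
        except IndexError:
            # No image found on this page, move on to the next one
            continue
        if self.collection.find_one({'url': s}):
            # find_one() returns None when no document matches
            print('This image has already been crawled, skipping')
        else:
            self.jpg_download_list.append(s)
            # insert_one() replaces the deprecated insert(); the key '主题' means "topic"
            self.collection.insert_one({'主题': title, 'url': s})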
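To see the deduplication logic in isolation, here is a minimal sketch that runs without a MongoDB server. `InMemoryCollection` is a hypothetical stand-in that mimics only the `find_one()` and `insert_one()` methods of a PyMongo collection, and `select_new_urls` is an illustrative helper name, not part of the crawler above.

```python
class InMemoryCollection:
    """Hypothetical stand-in for a PyMongo collection (no server needed).

    Only find_one() and insert_one() are mimicked, and only for
    simple equality queries.
    """

    def __init__(self):
        self._docs = []

    def find_one(self, query):
        # Return the first document whose fields equal the query, else None
        for doc in self._docs:
            if all(doc.get(key) == value for key, value in query.items()):
                return doc
        return None

    def insert_one(self, doc):
        self._docs.append(dict(doc))


def select_new_urls(collection, title, urls):
    """Return the URLs not yet recorded in `collection`, recording each one."""
    new_urls = []
    for url in urls:
        if collection.find_one({'url': url}):
            continue  # already recorded, so skip it
        collection.insert_one({'主题': title, 'url': url})
        new_urls.append(url)
    return new_urls


collection = InMemoryCollection()
first = select_new_urls(collection, 'demo', ['a.jpg', 'b.jpg'])
second = select_new_urls(collection, 'demo', ['b.jpg', 'c.jpg'])
print(first)   # ['a.jpg', 'b.jpg']
print(second)  # ['c.jpg']
```

With a real server, the same helper should work if you pass `MongoClient()['mzituPic']['mzitu']` instead of the stand-in. One thing to keep in mind: the URL is recorded before the image is actually downloaded, so an interruption between the two steps would cause that image to be skipped on the next run. Recording the URL only after a successful download is one way to avoid this.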