Saving Zhihu Crawler Results to MongoDB

Latest recommended article published on 2021-08-04 10:45:55

未完成的梦orz

Category: Crawlers. Tags: mongodb, crawler, Zhihu

Article link: https://blog.csdn.net/qq_23392341/article/details/77992450

http://blog.csdn.net/lxb1022/article/details/75258021

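Following this blogger's articles, I successfully crawled content from Zhihu, with two changes along the way.
1. Where the captcha is entered, I changed raw_input to input.
Let's look at what input does in Python 2. It behaves as if it were defined like this: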

def input(prompt):
  return eval(raw_input(prompt))

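So in Python 2, input is equivalent to eval(raw_input(prompt)).
In other words, input calls raw_input and then passes the result through eval. What does eval do?
eval treats the string as a Python expression, evaluates it, and returns the result.

input (Python 2): converts what the user typed into a value of the matching type, because the text is evaluated.
raw_input (Python 2): treats all input as a string; whatever the user types is returned as a string.

Differences between input and raw_input: http://blog.csdn.net/dq_dm/article/details/45665069
Python 3 no longer has raw_input. Its input returns the typed text as a string without calling eval, just like Python 2's raw_input. I am using Python 3, so switching to input works fine.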
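To make the difference concrete, here is a small sketch that imitates both behaviours on a fixed string instead of reading from the keyboard. The helper names python2_style_input and python3_style_input are made up for this illustration and are not part of Python.

```python
def python2_style_input(typed_text):
    # Hypothetical helper: imitates Python 2's input(), which evaluated
    # the typed text as a Python expression.
    return eval(typed_text)


def python3_style_input(typed_text):
    # Hypothetical helper: imitates Python 3's input() (and Python 2's
    # raw_input()), which returns the typed text unchanged as a string.
    return typed_text


typed = "1 + 2"
print(repr(python2_style_input(typed)))  # 3
print(repr(python3_style_input(typed)))  # '1 + 2'

# A captcha such as "x7kq" looks like an undefined variable name, so the
# Python 2 style evaluation fails while the Python 3 style just returns it.
try:
    python2_style_input("x7kq")
except NameError as exc:
    print("eval failed:", exc)
print(python3_style_input("x7kq"))
```

This is why a captcha prompt wants the raw string: the text the user types should be passed on as-is, not evaluated.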

2. At the end of the third article in that series, pay attention to the comments in pipelines.py:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

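The important line is "Don't forget to add your pipeline to the ITEM_PIPELINES setting".

Set ITEM_PIPELINES in settings.py.
This setting activates the pipeline components, and the integer values determine the order in which they run (lower values run first). See the documentation linked in the comment for details.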

ITEM_PIPELINES = {
    'zhihu.pipelines.ZhihuPipeline': 300,
}

Only with this setting does the pipeline run, so that the crawled data is stored in Excel.
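As a rough illustration of how the integer values translate into a run order, the sketch below sorts a settings dict from lowest to highest value. It does not call Scrapy at all, and 'zhihu.pipelines.ExamplePipeline' is a hypothetical second pipeline that is not part of the original project.

```python
# Only ZhihuPipeline comes from the article; ExamplePipeline is a
# hypothetical entry added to show the ordering.
ITEM_PIPELINES = {
    'zhihu.pipelines.ZhihuPipeline': 300,
    'zhihu.pipelines.ExamplePipeline': 100,
}


def run_order(pipelines):
    """Return the pipeline paths sorted by their integer value, lowest first."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


print(run_order(ITEM_PIPELINES))
```

With these values the hypothetical ExamplePipeline would see each item before ZhihuPipeline does.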

Later I found a way to store the data in MongoDB instead, which makes the data easier to use. I changed the pipeline as follows:

import pymongo


class MongoPipeline(object):
    # Name of the MongoDB collection that receives the items
    collection_name = 'test'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection details from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        # Connect once when the spider starts
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Store each item as one document, then pass it on
        self.db[self.collection_name].insert_one(dict(item))
        return item

And set the following in settings.py:

ITEM_PIPELINES = {
    'zhihu.pipelines.MongoPipeline':300,
}
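The pipeline above reads MONGO_URI and MONGO_DATABASE through crawler.settings.get, so settings.py presumably needs those two entries as well. The original post does not show them; the values below are assumptions for a MongoDB server running locally on the default port, and the database name is only a placeholder.

```python
# settings.py (sketch): both values are assumptions, adjust to your setup.
MONGO_URI = 'mongodb://localhost:27017'  # assumed local server, default port
MONGO_DATABASE = 'zhihu'                 # placeholder database name
```

If either setting is missing, crawler.settings.get returns None, so it is worth defining both explicitly.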
