Python Web Scraping [Hands-On]: Crawling a Job Site with the Scrapy Framework and Storing the Data in MongoDB

Latest recommended article published on 2023-12-03 11:54:36

weixin_30446197

Tags: database, python, web scraping

Original link: http://www.cnblogs.com/tangkaishou/p/10264628.html

Create the project

scrapy startproject zhaoping

Create the spider

cd zhaoping
scrapy genspider hr zhaopingwang.com

Directory structure
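
To check the layout that the two commands above generate, a small helper can print the project tree. In a freshly generated project you would expect scrapy.cfg at the top level and, inside the zhaoping package, items.py, pipelines.py, settings.py, and spiders/hr.py among others. The helper below is a minimal sketch using only the standard library; render_tree is a hypothetical name and is not part of Scrapy.

```python
from pathlib import Path


def render_tree(root, prefix=""):
    """Return the folders and files under root as indented text lines."""
    lines = []
    # Folders are listed before files, each group in alphabetical order.
    entries = sorted(Path(root).iterdir(), key=lambda p: (p.is_file(), p.name))
    for entry in entries:
        if entry.name == "__pycache__":
            continue  # skip bytecode caches, they are not part of the project
        suffix = "/" if entry.is_dir() else ""
        lines.append(f"{prefix}{entry.name}{suffix}")
        if entry.is_dir():
            # Recurse with one extra level of indentation.
            lines.extend(render_tree(entry, prefix + "    "))
    return lines
```

Calling print("\n".join(render_tree("."))) from the outer zhaoping folder prints the tree for the current project.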

items.py

import scrapy

class TencentItem(scrapy.Item):
    title = scrapy.Field()
    position = scrapy.Field()
    publish_date = scrapy.Field()

pipelines.py

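from pymongo import MongoClient

# Connect to MongoDB and select the "hr" collection in the "zhaoping" database
mongoclient = MongoClient(host='192.168.226.150', port=27017)
collection = mongoclient['zhaoping']['hr']


class TencentPipeline(object):
    def process_item(self, item, spider):
        print(item)
        # The item must be converted to a dict before it is inserted;
        # insert_one replaces Collection.insert, which was removed in PyMongo 4
        collection.insert_one(dict(item))
        return item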
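
The pipeline above opens its MongoDB connection when the module is imported and never closes it. Scrapy also calls open_spider and close_spider on a pipeline when they are defined, so the connection can be tied to the lifetime of the crawl. The following is a minimal sketch, not the author's code: the class name MongoPipeline, the client_factory parameter (added only so the class can be exercised without a running server), and the setting names MONGO_URI, MONGO_DATABASE, and MONGO_COLLECTION are assumptions chosen for illustration.

```python
class MongoPipeline:
    """Sketch: one MongoDB connection per crawl, opened and closed with the spider."""

    def __init__(self, mongo_uri, db_name, collection_name, client_factory=None):
        self.mongo_uri = mongo_uri
        self.db_name = db_name
        self.collection_name = collection_name
        self.client_factory = client_factory  # hypothetical hook for testing
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        # Read connection details from settings.py; the setting names are assumptions.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://192.168.226.150:27017"),
            db_name=crawler.settings.get("MONGO_DATABASE", "zhaoping"),
            collection_name=crawler.settings.get("MONGO_COLLECTION", "hr"),
        )

    def open_spider(self, spider):
        factory = self.client_factory
        if factory is None:
            # Imported lazily so the module loads even where pymongo is absent.
            from pymongo import MongoClient
            factory = MongoClient
        self.client = factory(self.mongo_uri)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        # Release the connection when the crawl finishes.
        self.client.close()

    def process_item(self, item, spider):
        # Items are converted to plain dicts before being stored.
        self.collection.insert_one(dict(item))
        return item
```

Keeping the host and database names in settings.py rather than in the pipeline module makes it easier to point the same spider at a different MongoDB instance.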

spiders/hr.py

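import scrapy

from zhaoping.items import TencentItem


class HrSpider(scrapy.Spider):
    name = "hr"
    # allowed_domains and start_urls are generated by `scrapy genspider`;
    # point them at the site being crawled.

    def parse(self, response):
        # Skip the first and the last row of the table
        tr_list = response.xpath("//table[@class='tablelist']/tr")[1:-1]
        for tr in tr_list:
            item = TencentItem()
            # XPath positions start at 1, not 0
            item["title"] = tr.xpath("./td[1]/a/text()").extract_first()
            item["position"] = tr.xpath("./td[2]/text()").extract_first()
            item["publish_date"] = tr.xpath("./td[5]/text()").extract_first()
            yield item

        next_url = response.xpath("//a[@id='next']/@href").extract_first()
        # On the last page the link is "javascript:;"; it may also be missing
        if next_url and next_url != "javascript:;":
            # Build the absolute URL from the relative href (site root: https://hr.tencent.com/)
            next_url = response.urljoin(next_url)
            yield scrapy.Request(url=next_url, callback=self.parse)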
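
The row slicing, the 1-based td positions, and the next-page check can be tried offline before running a real crawl. The sketch below swaps Scrapy's selectors for the standard library's xml.etree.ElementTree, which supports a small XPath subset, and uses urllib.parse.urljoin to resolve the relative link. The sample markup, the position.php path, and the helper names parse_rows and next_page_url are all hypothetical; only the table class, the column positions, and the id of the next link come from the spider above.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, well-formed sample shaped like the table the spider expects.
SAMPLE = """
<html><body>
<table class="tablelist">
  <tr><td>Title</td><td>Category</td><td>Count</td><td>City</td><td>Date</td></tr>
  <tr><td><a href="a.html">Backend Engineer</a></td><td>Tech</td><td>2</td><td>Shenzhen</td><td>2019-01-10</td></tr>
  <tr><td><a href="b.html">Product Manager</a></td><td>Product</td><td>1</td><td>Beijing</td><td>2019-01-11</td></tr>
  <tr><td>pager</td></tr>
</table>
<a id="next" href="position.php?start=10">next</a>
</body></html>
"""


def parse_rows(root):
    """Return one dict per data row, skipping the first and last <tr>."""
    table = root.find(".//table[@class='tablelist']")
    rows = table.findall("tr")[1:-1]
    return [
        {
            "title": tr.find("./td[1]/a").text,  # positions are 1-based
            "position": tr.find("./td[2]").text,
            "publish_date": tr.find("./td[5]").text,
        }
        for tr in rows
    ]


def next_page_url(root, page_url):
    """Return the absolute next-page URL, or None when there is no real link."""
    link = root.find(".//a[@id='next']")
    href = link.get("href") if link is not None else None
    if not href or href == "javascript:;":
        return None
    # Resolve the relative href against the URL of the current page.
    return urljoin(page_url, href)


root = ET.fromstring(SAMPLE.strip())
items = parse_rows(root)
next_url = next_page_url(root, "https://hr.tencent.com/position.php")
```

Real pages are rarely well-formed XML, so this is only a way to reason about the logic; inside the spider, Scrapy's own selectors remain the right tool.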

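It really is that simple: once the spider runs (scrapy crawl hr from the project directory), the data is scraped and stored in MongoDB.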
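
For the items to reach the pipeline, it has to be registered in settings.py, since Scrapy only calls pipelines listed in ITEM_PIPELINES. The excerpt below is a sketch that assumes the default project layout, with TencentPipeline living in zhaoping/pipelines.py; the priority value 300 is an arbitrary choice.

```python
# settings.py (excerpt)
# Register the pipeline so Scrapy passes every yielded item to it.
# The integer (conventionally 0-1000) orders multiple pipelines; lower runs first.
ITEM_PIPELINES = {
    "zhaoping.pipelines.TencentPipeline": 300,
}
```

With this in place, scrapy crawl hr prints each item and inserts it into the hr collection of the zhaoping database.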

Reposted from: https://www.cnblogs.com/tangkaishou/p/10264628.html
