enlightened by 挖掘机小王子

Latest recommended article published on 2025-04-30 11:36:10

stick to initial

Category: python. Tags: python, mongodb

Article link: https://blog.csdn.net/qq_40489435/article/details/104375223

P.S. For setting up the environment, plenty of other blog posts cover it; this post only gives the approach and my own summary.

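Using the Scrapy framework takes four steps: create the project, generate a spider, edit the spider, and run the crawl. Note that genspider needs a domain argument as well as a spider name; 51job.com is used here because the start page below is on that site.

scrapy startproject jobSpider
cd jobSpider
scrapy genspider job 51job.com
# edit jobSpider/spiders/job.py
scrapy crawl job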

We start from this page:

start_urls = ['https://search.51job.com/list/020000,000000,0000,00,9,99,Python%2520%25E9%25AB%2598%25E7%25BA%25A7,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=']

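    # Methods of the spider class in jobSpider/spiders/job.py.
    # The file also needs: import scrapy; from jobSpider.items import JobspiderItem
    def parse(self, response):
        # Each job row on the listing page is a <div class="el">.
        selectors = response.xpath('//div[@class="el"]')
        for selector in selectors:
            # Link to the job's detail page; empty string if the row has none.
            url = selector.xpath('./p/span/a/@href').get(default='')
            if url:
                print(url)
                # Fetch the detail page and parse it in parseDetail.
                yield scrapy.Request(url, callback=self.parseDetail)

    def parseDetail(self, response):
        # Company name, position title and salary from the detail page.
        corporation_name = response.xpath('//p[@class="cname"]/a/@title').get(default='')
        post_name = response.xpath('//div[@class="cn"]/h1/@title').get(default='')
        post_wage = response.xpath('//div[@class="cn"]/strong/text()').get(default='')

        # Plain-dict alternative to the Item:
        # items = {
        #     'company': corporation_name,
        #     'position': post_name,
        #     'salary': post_wage
        # }
        items = JobspiderItem(name=corporation_name, post=post_name, wage=post_wage)
        # self.result.append(items)
        print(items)
        # Yielding the item hands it to the item pipeline (or to a feed export).
        yield items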
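
To see what the two XPath steps in parse select, here is a small sketch that runs without Scrapy. It uses the standard library's xml.etree.ElementTree (not Scrapy's selectors, which accept full XPath and real-world HTML) on a hypothetical, simplified fragment. The markup, the example.com URLs and the helper name extract_detail_urls are assumptions made for illustration; the real 51job page is larger and may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the listing page.
SAMPLE_LISTING = """
<html><body>
  <div class="el">
    <p><span><a href="https://example.com/job/1.html">Python Engineer</a></span></p>
  </div>
  <div class="el">
    <p><span>Header row without a link</span></p>
  </div>
  <div class="el">
    <p><span><a href="https://example.com/job/2.html">Senior Python Developer</a></span></p>
  </div>
</body></html>
"""


def extract_detail_urls(markup):
    """Return the href of the first p/span/a link in every <div class="el"> row."""
    root = ET.fromstring(markup.strip())
    urls = []
    # Step 1: select every row, like //div[@class="el"] in the spider.
    for row in root.findall(".//div[@class='el']"):
        # Step 2: relative path inside the row, like ./p/span/a/@href.
        link = row.find("./p/span/a")
        # Rows without a link are skipped, mirroring the `if url:` check.
        if link is not None and link.get("href"):
            urls.append(link.get("href"))
    return urls


detail_urls = extract_detail_urls(SAMPLE_LISTING)
print(detail_urls)
```

The row without a link shows why the spider passes default='' to get() and then checks the result before issuing a request.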

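Modify the request headers in settings.py so that requests look like they come from a browser, then write items.py according to the data you want to collect, as follows: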

import scrapy


class JobspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # pass
    name = scrapy.Field()
    post = scrapy.Field()
    wage = scrapy.Field()
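
The post does not show the request-header change to settings.py mentioned earlier. A minimal sketch of what it might look like, using Scrapy's USER_AGENT and DEFAULT_REQUEST_HEADERS settings. The header values are illustrative assumptions, not the author's; any current browser user-agent string would serve the same purpose.

```python
# settings.py (fragment). The values below are illustrative assumptions,
# not the ones used in the original project.

# Sent as the User-Agent header on every request.
USER_AGENT = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
    'AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/120.0.0.0 Safari/537.36'
)

# Extra headers added to every request unless a request sets its own.
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
}
```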

Because we want to store the data in a database, modify pipelines.py as follows:

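from pymongo import MongoClient
from jobSpider.settings import MONGO_HOST, MONGO_PORT, MONGO_DB, MONGO_COLLECTION

class JobspiderPipeline(object):
    def __init__(self):
        # Connection details come from settings.py.
        self.host = MONGO_HOST
        self.port = MONGO_PORT
        self.client = MongoClient(self.host, self.port)
        self.db = self.client[MONGO_DB]
        self.collection = self.db[MONGO_COLLECTION]

    def process_item(self, item, spider):
        # insert_one expects a plain dict, so convert Scrapy Items first.
        if not isinstance(item, dict):
            item = dict(item)
        self.collection.insert_one(item)
        # Return the item so any later pipeline can still process it.
        return item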
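
The pipeline above connects in __init__ and never closes the client. As a possible variant, the sketch below follows the from_crawler / open_spider / close_spider hooks described in Scrapy's item pipeline documentation, reading the same MONGO_* settings through crawler.settings. The class name MongoPipeline and the fallback defaults are assumptions for illustration, not part of the original project.

```python
class MongoPipeline:
    """Sketch: open the MongoDB connection with the spider and close it afterwards."""

    def __init__(self, host, port, db_name, collection_name):
        self.host = host
        self.port = port
        self.db_name = db_name
        self.collection_name = collection_name
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook with the crawler; its settings object
        # exposes the values defined in settings.py.
        settings = crawler.settings
        return cls(
            host=settings.get('MONGO_HOST', '127.0.0.1'),
            port=settings.getint('MONGO_PORT', 27017),
            db_name=settings.get('MONGO_DB', 'Job'),
            collection_name=settings.get('MONGO_COLLECTION', 'job'),
        )

    def open_spider(self, spider):
        # Imported here so the class can be defined even where the
        # pymongo driver is not installed; it is needed only when crawling.
        from pymongo import MongoClient
        self.client = MongoClient(self.host, self.port)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        # Release the connection once the crawl has finished.
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # Store a plain-dict copy and pass the original item along.
        self.collection.insert_one(dict(item))
        return item
```

To try it, the entry in ITEM_PIPELINES would point at 'jobSpider.pipelines.MongoPipeline' instead of JobspiderPipeline.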

**The following was added to settings.py so that connection details such as the port and host IP do not have to be edited inside pipelines.py every time they change:**


# Enable the item pipeline component
ITEM_PIPELINES = {
    'jobSpider.pipelines.JobspiderPipeline': 10,
}
# MongoDB connection settings used by the pipeline
MONGO_HOST = '127.0.0.1'
MONGO_PORT = 27017
MONGO_DB = 'Job'
MONGO_COLLECTION = 'job'

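With this in place, the scraped data is stored in MongoDB. Alternatively, you can export to a file by running scrapy crawl job -o job.csv or scrapy crawl job -o job.json. With this second approach there is no need to modify pipelines.py at all; yielding the data in job.py is enough.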
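
The same file exports can also be configured in settings.py rather than on the command line. A sketch, assuming a Scrapy version that supports the FEEDS setting (2.1 or later); the file names simply mirror the commands above, and the UTF-8 encoding is an assumption chosen so that Chinese text is written as-is instead of as escape sequences in the JSON output.

```python
# settings.py (fragment): write every yielded item to both files.
FEEDS = {
    'job.json': {
        'format': 'json',
        # Keep non-ASCII characters readable in the JSON output.
        'encoding': 'utf8',
    },
    'job.csv': {
        'format': 'csv',
        'encoding': 'utf8',
    },
}
```

With this in place, a plain scrapy crawl job produces both files without the -o flag.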