Goal: scrape every page of Tencent's social recruitment listings (job title and link, job type, headcount, work location, publish date) and store the results in a JSON file.
Approach:
1. Create a new scrapy project
2. Define the stored field names and types in items.py
3. Write the spider file under the spiders folder
4. Set up the pipeline file
5. Configure settings.py
6. Test run
The actual workflow:
1. Create the scrapy project tencent
2. Define the stored field names and types in items.py
3. Write the spider file under the spiders folder
4. Set up the pipeline file
5. Configure settings.py
6. Test run
7. Result: the scraped data lands in tencent.json, 373 pages and 3733 records in total.
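The pipeline below writes tencent.json as one JSON object per line (JSON Lines), not a single JSON array. A minimal sketch of reading the records back, assuming that line-per-record format (the path is illustrative):

```python
import json

def load_jobs(path='tencent.json'):
    # Each non-empty line is one job record serialized by the pipeline
    jobs = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                jobs.append(json.loads(line))
    return jobs
```

This is also why a plain `json.load()` on the whole file would fail: the file is a sequence of objects, not a single document.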
The CrawlSpider version of the Tencent recruitment scraper
Create the project tencent_CrawlSpiders:
scrapy startproject tencent_CrawlSpiders
items.py:
# -*- coding: utf-8 -*-
import scrapy


class TencentCrawlspidersItem(scrapy.Item):
    # job title
    job_Title = scrapy.Field()
    # detail link
    job_Link = scrapy.Field()
    # job type
    job_Type = scrapy.Field()
    # headcount
    job_Number = scrapy.Field()
    # work location
    job_Location = scrapy.Field()
    # publish date
    job_PublicDate = scrapy.Field()
Create the spider file tencent_shezhao.py and restrict the crawl to the target domain:
scrapy genspider -t crawl tencent_shezhao hr.tencent.com
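The `allow='start=\d+'` pattern passed to the LinkExtractor below only has to match somewhere inside a candidate URL for the link to be followed. A quick standalone check of that regex against sample pagination URLs (the URLs are illustrative):

```python
import re

# The same pattern the LinkExtractor uses to pick pagination links;
# re.search matches anywhere in the URL, so "position.php?&start=10#a" qualifies
page_pattern = re.compile(r'start=\d+')

urls = [
    'https://hr.tencent.com/position.php?&start=0#a',
    'https://hr.tencent.com/position.php?&start=10#a',
    'https://hr.tencent.com/about.php',
]
matched = [u for u in urls if page_pattern.search(u)]
```

Only the two pagination URLs survive the filter; the about page is ignored.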
Spider file tencent_shezhao.py:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from tencent_CrawlSpiders.items import TencentCrawlspidersItem


class TencentShezhaoSpider(CrawlSpider):
    name = 'tencent_shezhao'
    allowed_domains = ['tencent.com']
    start_urls = ['https://hr.tencent.com/position.php?&start=0#a']

    # LinkExtractor matching the pagination links (raw string for the regex)
    page_link = LinkExtractor(allow=r'start=\d+')

    rules = (
        Rule(page_link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Job rows alternate between class="odd" and class="even"
        job_list = response.xpath('//tr[@class="odd"] | //tr[@class="even"]')
        for each in job_list:
            # Create a fresh item per row; reusing one item across iterations
            # would yield the same mutated object for every row
            item = TencentCrawlspidersItem()
            item['job_Title'] = each.xpath('./td[1]/a/text()').extract_first()
            item['job_Link'] = each.xpath('./td[1]/a/@href').extract_first()
            item['job_Type'] = each.xpath('./td[2]/text()').extract_first()
            item['job_Number'] = each.xpath('./td[3]/text()').extract_first()
            item['job_Location'] = each.xpath('./td[4]/text()').extract_first()
            item['job_PublicDate'] = each.xpath('./td[5]/text()').extract_first()
            yield item
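The row-extraction logic in parse_item can be exercised without running Scrapy. A minimal sketch using the standard library's ElementTree on a hand-written fragment (this HTML is a simplified stand-in for the real page, not its actual markup):

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for one page of the job table
html = """<table>
  <tr class="odd">
    <td><a href="position_detail.php?id=1">Backend Engineer</a></td>
    <td>Technology</td><td>2</td><td>Shenzhen</td><td>2018-05-01</td>
  </tr>
  <tr class="even">
    <td><a href="position_detail.php?id=2">Product Manager</a></td>
    <td>Product</td><td>1</td><td>Beijing</td><td>2018-05-02</td>
  </tr>
</table>"""

root = ET.fromstring(html)
# Same selection idea as the spider: odd rows plus even rows
rows = root.findall('.//tr[@class="odd"]') + root.findall('.//tr[@class="even"]')

jobs = []
for tr in rows:
    tds = tr.findall('./td')
    jobs.append({
        'job_Title': tds[0].find('a').text,
        'job_Link': tds[0].find('a').get('href'),
        'job_Type': tds[1].text,
        'job_Number': tds[2].text,
        'job_Location': tds[3].text,
        'job_PublicDate': tds[4].text,
    })
```

Real pages are rarely well-formed XML, which is why Scrapy uses an HTML-tolerant parser; the sketch only mirrors the per-row field mapping.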
pipelines.py:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class TencentCrawlspidersPipeline(object):
    def open_spider(self, spider):
        # Open the file once for the whole crawl instead of per item
        self.filename = open('tencent.json', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps the Chinese text readable in the output;
        # write the str directly (concatenating encoded bytes with '\n'
        # would raise a TypeError on Python 3)
        jsontext = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.filename.write(jsontext)
        return item

    def close_spider(self, spider):
        self.filename.close()
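The `ensure_ascii=False` flag is what keeps the Chinese field values human-readable in tencent.json; with the default `ensure_ascii=True`, every non-ASCII character would be written as a `\uXXXX` escape. A quick illustration:

```python
import json

record = {'job_Title': '高级工程师', 'job_Location': '深圳'}

escaped = json.dumps(record)                       # default: \uXXXX escapes
readable = json.dumps(record, ensure_ascii=False)  # keeps the characters as-is
```

Both strings parse back to the same dict; the flag only affects how the file reads to a human.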
settings.py:
# -*- coding: utf-8 -*-
BOT_NAME = 'tencent_CrawlSpiders'

SPIDER_MODULES = ['tencent_CrawlSpiders.spiders']
NEWSPIDER_MODULE = 'tencent_CrawlSpiders.spiders'

# Log file and level: messages at INFO level or above go to the log file,
# so the terminal shows no output while the spider runs
LOG_FILE = 'tencent_CrawlSpiders.log'
LOG_LEVEL = 'INFO'

ITEM_PIPELINES = {
    'tencent_CrawlSpiders.pipelines.TencentCrawlspidersPipeline': 300,
}
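The 300 in ITEM_PIPELINES is a priority: Scrapy runs every enabled pipeline's process_item in ascending priority order, threading each item through the chain. A minimal sketch of that ordering logic (both pipeline classes here are hypothetical, not part of this project):

```python
# Hypothetical pipelines illustrating ITEM_PIPELINES ordering:
# lower priority numbers run earlier in the chain
class StripWhitespacePipeline:
    def process_item(self, item, spider):
        return {k: v.strip() if isinstance(v, str) else v
                for k, v in item.items()}

class TagSourcePipeline:
    def process_item(self, item, spider):
        item['source'] = 'tencent'
        return item

pipelines_config = {StripWhitespacePipeline: 100, TagSourcePipeline: 300}

def run_pipelines(item, pipelines):
    # Mirror Scrapy's behavior: sort by priority, pass the item along the chain
    for pipeline_cls in sorted(pipelines, key=pipelines.get):
        item = pipeline_cls().process_item(item, spider=None)
    return item
```

With this ordering, whitespace is stripped before the tagging pipeline sees the item; swapping the priorities would reverse that.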