Crawling nationwide data-analyst job postings from 51job with the Scrapy framework, plus a simple analysis
Tools: Scrapy, MongoDB, Excel, Tableau
1. Inspect the search-result URL. It contains the key parameters [keyword=数据分析师&keywordtype=2&curr_page=1], so turning pages only requires changing the value of curr_page.
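Step 1 can be sketched as a small helper that builds one URL per result page by varying curr_page (the host and path here are illustrative, not the exact 51job endpoint):

```python
# Illustrative base URL; only the query parameters come from the observed link.
BASE = ("https://search.51job.com/list?"
        "keyword=数据分析师&keywordtype=2&curr_page={page}")

def page_urls(last_page):
    """Return the search-result URLs for pages 1..last_page."""
    return [BASE.format(page=p) for p in range(1, last_page + 1)]
```

In a real request the Chinese keyword would be percent-encoded, but Scrapy handles that when the URL is passed to a Request.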
2. View the page source. The job listings are embedded directly in the HTML, so the fields can be extracted straight from the source.
3. First write items.py. The fields extracted here are: job title, salary, company name, company type, company size, company tags, city, education requirement, work experience, benefits, and job-requirement tags.
import scrapy

class Job61Item(scrapy.Item):
    jobname = scrapy.Field()
    salary = scrapy.Field()
    company = scrapy.Field()
    companytype = scrapy.Field()
    companyscale = scrapy.Field()
    companytag = scrapy.Field()
    city = scrapy.Field()
    record = scrapy.Field()       # education requirement
    workyear = scrapy.Field()     # work experience
    welfare = scrapy.Field()      # benefits
    requirements = scrapy.Field() # job-requirement tags
4. Then write pipelines.py to save the scraped items to MongoDB.
import pymongo

class MongoPipeline(object):
    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Write each item into the collection and pass it on.
        self.db[self.collection_name].insert_one(dict(item))
        return item
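For from_crawler to find the connection settings, they have to be declared in settings.py, and the pipeline must be registered. A minimal sketch, assuming the Scrapy project module is named job51 (the module name and database name are assumptions; MONGO_URI and MONGO_DATABASE are the keys the pipeline reads):

```python
# settings.py
MONGO_URI = "mongodb://localhost:27017"   # local MongoDB instance (assumed)
MONGO_DATABASE = "job51"                  # database name (assumed)

ITEM_PIPELINES = {
    "job51.pipelines.MongoPipeline": 300,  # project module name is assumed
}
```

With this in place, every item yielded by the spider flows through MongoPipeline and lands in the scrapy_items collection, ready to be exported for analysis in Excel or Tableau.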