Crawling Boss Zhipin with Scrapy through a proxy and storing the results in MongoDB
Screenshot of the final crawl results
Creating the project
This project was built in PyCharm on Windows, using Python 3.6.
Run the following command to create the Scrapy project:
scrapy startproject scrapy_boss
items
from scrapy import Item, Field


class BossItem(Item):
    company_name = Field()     # company name
    company_status = Field()   # company size
    company_address = Field()  # company address
    job_title = Field()        # job title
    job_salary = Field()       # salary
    job_detail = Field()       # job description and detailed requirements
    job_experience = Field()   # required work experience
    job_education = Field()    # required education
    job_url = Field()          # URL of the job posting
Spider
- Parameters
from scrapy import Spider


class BossSpider(Spider):
    name = 'boss'
    allowed_domains = ['www.zhipin.com']
    # start_urls = ['http://www.zhipin.com/']
    # Request headers copied from a browser session
    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'zh-CN,zh;q=0.9',
        'cookie': '__c=1535347437; __g=-; __l=l=%2Fwww.zhipin.com%2Fc101010100%2Fh_101010100%2F%3Fquery%3D%25E7%2588%25AC%25E8%2599%25AB%25E5%25B7%25A5%25E7%25A8%258B%25E5%25B8%2588%26page%3D4%26ka%3Dpage-4&r=; Hm_lvt_194df3105ad7148dcf2b98a91b5e727a=1533949950,1534555474,1535095432,1535347437; lastCity=101010100; toUrl=https%3A%2F%2Fwww.zhipin.com%2Fjob_detail%2F%3Fquery%3Dpython%26scity%3D101010100%26industry%3D%26position%3D; JSESSIONID=""; Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a=1535348378; __a=5534803.1512627994.1535095432.1535347437.264.8.4.260',
        'referer': 'https://www.zhipin.com/job_detail/?query=python&scity=101010100&industry=&position=',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
    }
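The headers above carry the browser cookies as one raw string. Whether Scrapy actually sends a cookie placed in the headers may depend on the Scrapy version and on the COOKIES_ENABLED setting, so it can be handy to have the same cookies as a dict as well. The helper below is a hypothetical addition, not part of the original spider, and the sample string is a shortened version of the cookie shown above.

```python
def cookie_header_to_dict(cookie_header):
    """Split a raw Cookie header string into a {name: value} dict."""
    cookies = {}
    for pair in cookie_header.split(';'):
        # Split on the first '=' only, so values that contain '=' stay intact.
        name, sep, value = pair.strip().partition('=')
        if sep and name:
            cookies[name] = value
    return cookies


# Shortened sample of the cookie string used in the headers.
raw = 'lastCity=101010100; JSESSIONID=""; __g=-'
parsed = cookie_header_to_dict(raw)
```

A dict like this can be passed to scrapy.Request through its cookies argument, as an alternative to keeping the cookie inside the headers.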
- Parsing the index page
def