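Extracting the relevant information from the web pages
**Company page**: company URL, company name, size, industry, number of open positions, number of interview invitations
1. Debugging in the scrapy shell
In a terminal (or CMD on Windows), enter: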
scrapy shell
2019-04-08 22:32:43 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x10e4f8908>
[s] item {}
[s] settings <scrapy.settings.Settings object at 0x10e4f8898>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
Once the output above appears, continue by entering:
# Send a browser-like User-Agent header with the request
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
# Build the request object, then let the shell fetch it
req = scrapy.Request(url='https://jobs.zhaopin.com/CC120752053J00179220206.htm', headers=headers)
fetch(req)
2019-04-08 22:33:56 [scrapy.core.engine] INFO: Spider opened
2019-04-08 22:33:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://jobs.zhaopin.com/CC120752053J00179220206.htm> (referer: None)
Once the lines above appear, enter `response` to check the status of the request:
In [4]: response
Out[4]: <200 https://jobs.zhaopin.com/CC120752053J00179220206.htm>
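2. Extracting data from the job page
Continuing from the previous step, this is the data we need:
Job page: job URL, job title, salary, location, education requirement, number of openings
The figure above shows the information we need and where each piece sits on the page.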
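To keep track of the fields listed above while debugging, it can help to write them down as a small record type. The following is only a sketch: the class name `JobRecord` and every field name except `update_time` are hypothetical, chosen here for illustration.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class JobRecord:
    """One scraped job page. Field names are illustrative assumptions."""
    url: str
    title: Optional[str] = None       # job title
    salary: Optional[str] = None      # salary range as shown on the page
    location: Optional[str] = None    # city or district
    education: Optional[str] = None   # education requirement
    headcount: Optional[str] = None   # number of openings
    update_time: Optional[str] = None # when the posting was last updated


# Start with only the URL known, then fill in fields as they are found.
job = JobRecord(url='https://jobs.zhaopin.com/CC120752053J00179220206.htm')
job.update_time = '4月4日'
print(asdict(job))
```

Every field except the URL defaults to `None`, so a partially scraped page still produces a valid record and the missing fields are easy to spot.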
Now we can start looking for this data in the scrapy shell.
- Update time: `update_time`
<span class="summary-plane__time"><i class="iconfont icon-update-time"></i>更新于 4月4日</span>
This value can be obtained by selecting the `summary-plane__time` class directly.
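In the scrapy shell, a lookup of this kind would typically be written as `response.css('span.summary-plane__time::text').get()`. As a sketch that runs without Scrapy or a network connection, the same class-based lookup can be reproduced on the snippet above with the standard library's `html.parser`. The class `ClassTextParser` and the prefix-stripping step are assumptions made for this illustration, not part of the original walkthrough.

```python
from html.parser import HTMLParser

# Snippet copied from the job page markup shown above.
HTML = ('<span class="summary-plane__time">'
        '<i class="iconfont icon-update-time"></i>更新于 4月4日</span>')

# Tags that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {'br', 'img', 'input', 'hr', 'meta', 'link'}


class ClassTextParser(HTMLParser):
    """Collect the text inside the first element carrying a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # > 0 while inside the target element
        self.done = False   # True once the target element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self.target_class in (dict(attrs).get('class') or '').split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


parser = ClassTextParser('summary-plane__time')
parser.feed(HTML)
raw = ''.join(parser.chunks)

# Drop the "更新于" ("updated on") prefix to keep only the date.
update_time = raw.replace('更新于', '', 1).strip()
print(update_time)  # 4月4日
```

The empty `<i>` icon element contributes no text, so the only text collected is the one directly inside the `<span>`.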