I wanted to do a data-analysis project and needed data. Having just learned web scraping a few days earlier, I decided to collect the data myself by crawling 51job (前程无忧), using the Scrapy framework. The code is below.
First, create the project: scrapy startproject <project name>
Mine is: scrapy startproject job1
Enter the project directory: cd job1
Inside the project, generate a spider: scrapy genspider <spider name> <start domain>
scrapy genspider 51job 51job.com
The generated directory layout looks like this:

job1/
├── scrapy.cfg
└── job1/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── a51job.py
Now for the code.
a51job.py
# -*- coding: utf-8 -*-
import scrapy
from ..items import Job1Item


class A51jobSpider(scrapy.Spider):
    name = '51job'
    allowed_domains = ['51job.com']

    def __init__(self, place='全国', kw='数据分析', **kwargs):
        super().__init__(**kwargs)  # initialize the base Spider first
        self.place = place
        self.kw = kw
        self.place_code = {
            # '杭州': '080200',
            # '上海': '020000',
            '全国': '000000',
        }
        self.start_urls = [
            'https://search.51job.com/list/{place_code},000000,0000,00,9,99,{kw},2,1.html'
            '?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99'
            '&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0'
            '&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00'
            '&from=&welfare='.format(
                place_code=self.place_code[self.place], kw=self.kw)]

    def parse(self, response):
        # save the raw page locally for debugging, named by page number, e.g. '1.html'
        with open(response.url.split('?')[0][-7:], 'wb') as f:
            f.write(response.body)
        jobs = response.xpath('//*[@id="resultList"]/div[@class="el"]')
        for job in jobs:
            item = Job1Item()
            item['name'] = job.xpath('string(.//p[contains(@class,"t1")])').get().strip()
            item['company'] = job.xpath('string(.//span[@class="t2"])').get().strip()
            item['place'] = job.xpath('string(.//span[@class="t3"])').get().strip()
            item['salary'] = job.xpath('string(.//span[@class="t4"])').get().strip()
            item['post_time'] = job.xpath('string(.//span[@class="t5"])').get().strip()
            yield item
        # '下一页' is the visible text of the next-page link
        next_page = response.xpath('//a[text()="下一页"]')
        if next_page:
            # build an absolute URL from the relative href and follow it
            next_page_url = response.urljoin(next_page.xpath('./@href').get())
            yield scrapy.Request(next_page_url, callback=self.parse)
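The next-page link's href is relative, so the spider has to turn it into an absolute URL before requesting it. Scrapy's response.urljoin does this using the current page as the base, with the same semantics as the standard library's urllib.parse.urljoin. A minimal sketch (the URLs here are illustrative, shortened versions of 51job's list-page format):

```python
from urllib.parse import urljoin

# current listing page (illustrative URL)
base = 'https://search.51job.com/list/000000,000000,0000,00,9,99,kw,2,1.html'
# relative href as it might appear on the next-page link
href = '/list/000000,000000,0000,00,9,99,kw,2,2.html'

# an absolute-path href replaces the path of the base URL
next_page_url = urljoin(base, href)
print(next_page_url)
# → https://search.51job.com/list/000000,000000,0000,00,9,99,kw,2,2.html
```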
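The spider imports Job1Item from the project's items.py, which isn't shown above. A minimal sketch with the five fields the spider fills (the field names come from the spider code; the class body itself is my assumption of what items.py contains):

```python
# items.py (sketch — field names taken from the spider above)
import scrapy


class Job1Item(scrapy.Item):
    name = scrapy.Field()       # job title
    company = scrapy.Field()    # company name
    place = scrapy.Field()      # work location
    salary = scrapy.Field()     # salary range
    post_time = scrapy.Field()  # posting date
```

Since place and kw are exposed as __init__ parameters, Scrapy's -a flag can set them from the command line, e.g. scrapy crawl 51job -a place=全国 -a kw=数据分析 -o jobs.csv (the -o output file is my addition, not part of the original post).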