I wanted to do a data-analysis project and needed data. Having just learned web scraping a few days earlier, I decided to collect the data myself by crawling 51job (前程无忧), using the Scrapy framework. The code is below.
First, create the project: scrapy startproject <project name>
Mine is: scrapy startproject job1
Enter the project directory: cd job1
Inside the project, generate a spider: scrapy genspider <spider name> <start domain>
scrapy genspider 51job 51job.com
The generated directory layout looks like this:

job1/
├── scrapy.cfg
└── job1/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── a51job.py
Now for the code.
a51job.py
# -*- coding: utf-8 -*-
import scrapy
from ..items import Job1Item


class A51jobSpider(scrapy.Spider):
    name = '51job'
    allowed_domains = ['51job.com']

    def __init__(self, place='全国', kw='数据分析', **kwargs):
        super().__init__(**kwargs)  # initialize the base Spider first
        self.place = place
        self.kw = kw
        self.place_code = {
            # '杭州': '080200',
            # '上海': '020000',
            '全国': '000000',
        }
        self.start_urls = [
            'https://search.51job.com/list/{place_code},000000,0000,00,9,99,{kw},2,1.html'
            '?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99'
            '&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0'
            '&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00'
            '&from=&welfare='.format(
                place_code=self.place_code[self.place], kw=self.kw)]

    def parse(self, response):
        # save the raw page locally for debugging, named by page number, e.g. '1.html'
        with open(response.url.split('?')[0][-7:], 'wb') as f:
            f.write(response.body)
        jobs = response.xpath('//*[@id="resultList"]/div[@class="el"]')
        for job in jobs:
            item = Job1Item()
            item['name'] = job.xpath('string(.//p[contains(@class,"t1")])').get().strip()
            item['company'] = job.xpath('string(.//span[@class="t2"])').get().strip()
            item['place'] = job.xpath('string(.//span[@class="t3"])').get().strip()
            item['salary'] = job.xpath('string(.//span[@class="t4"])').get().strip()
            item['post_time'] = job.xpath('string(.//span[@class="t5"])').get().strip()
            yield item
        # '下一页' is the visible text of the next-page link
        next_page = response.xpath('//a[text()="下一页"]')
        if next_page:
            # build an absolute URL from the relative href and follow it
            next_page_url = response.urljoin(next_page.xpath('./@href').get())
            yield scrapy.Request(next_page_url, callback=self.parse)
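The next-page link's href is relative, so the spider has to turn it into an absolute URL before requesting it. Scrapy's response.urljoin does this using the current page as the base, with the same semantics as the standard library's urllib.parse.urljoin. A minimal sketch (the URLs here are illustrative, shortened versions of 51job's list-page format):

```python
from urllib.parse import urljoin

# current listing page (illustrative URL)
base = 'https://search.51job.com/list/000000,000000,0000,00,9,99,kw,2,1.html'
# relative href as it might appear on the next-page link
href = '/list/000000,000000,0000,00,9,99,kw,2,2.html'

# an absolute-path href replaces the path of the base URL
next_page_url = urljoin(base, href)
print(next_page_url)
# → https://search.51job.com/list/000000,000000,0000,00,9,99,kw,2,2.html
```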
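The spider imports Job1Item from the project's items.py, which isn't shown above. A minimal sketch with the five fields the spider fills (the field names come from the spider code; the class body itself is my assumption of what items.py contains):

```python
# items.py (sketch — field names taken from the spider above)
import scrapy


class Job1Item(scrapy.Item):
    name = scrapy.Field()       # job title
    company = scrapy.Field()    # company name
    place = scrapy.Field()      # work location
    salary = scrapy.Field()     # salary range
    post_time = scrapy.Field()  # posting date
```

Since place and kw are exposed as __init__ parameters, Scrapy's -a flag can set them from the command line, e.g. scrapy crawl 51job -a place=全国 -a kw=数据分析 -o jobs.csv (the -o output file is my addition, not part of the original post).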