Scraping university information from China's gaokao (college entrance exam) application site with Scrapy

1. Environment Setup

Python 3.8.3
PyCharm
Third-party packages required by the project:

pip install scrapy fake-useragent requests virtualenv -i https://pypi.douban.com/simple

(The -i flag points pip at the Douban PyPI mirror; omit it to install from the default index.)

1.1 Create a virtual environment

Switch to the directory of your choice and create one:

virtualenv .venv

Remember to activate the virtual environment once it is created.
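On Windows the activation command is typically:

.venv\Scripts\activate

and on Linux/macOS:

source .venv/bin/activate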

1.2 Create the project

scrapy startproject <project_name>
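Judging from the imports later in this post (from lianjia.items import ...), this particular project was created under the name lianjia, so the concrete command would have been:

scrapy startproject lianjia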

1.3 Open the project in PyCharm and set the virtual environment you just created as the project interpreter.

1.4 Create the spider

scrapy genspider <spider_name> <url>

1.5 Edit the spider's allowed_domains, removing the https:// scheme: the list must contain bare domain names only.
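After that edit, the top of the generated spider looks like this (the exact genspider invocation isn't shown above, so the domains here are taken from the finished spider below):

name = 'china_school'
# bare domains only -- an entry like 'https://api.eol.cn' can cause
# OffsiteMiddleware to filter out requests to that host
allowed_domains = ['api.eol.cn', 'static-data.eol.cn']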

2. Problem Analysis

Inspect the data in the page's network responses (the screenshots from the original post are omitted here). The returned data turns out to be JSON. Checking the request URL and opening it directly confirms the payload.

From each list entry we take the detail-page request address and fetch the detail data, extracting each school's name, email, phone, address, postal code, and website; a quick manual check of the endpoint follows the list below.

school name

school email

school phone

school address

school postal code

school website
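Before wiring this into Scrapy, the detail endpoint can be sanity-checked by hand. A minimal sketch using requests, assuming the field names used in the spider below and a placeholder school_id (swap in a real id taken from the list API):

import requests

school_id = 140  # placeholder for illustration; use a real school_id from the list API
url = f'https://static-data.eol.cn/www/2.0/school/{school_id}/info.json'
data = requests.get(url, timeout=10).json().get('data', {})
for key in ('name', 'email', 'school_email', 'phone', 'school_phone',
            'address', 'postcode', 'site', 'school_site'):
    print(key, '=>', data.get(key))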

3. Spider

import json

import scrapy

from lianjia.items import china_school_Item


class ChinaSchoolSpider(scrapy.Spider):
    name = 'china_school'
    allowed_domains = ['api.eol.cn', 'static-data.eol.cn']

    # Paginated list API: 20 schools per page, pages 1-142.
    list_api = (
        'https://api.eol.cn/gkcx/api/?access_token=&admissions=&central='
        '&department=&dual_class=&f211=&f985=&is_doublehigh=&is_dual_class='
        '&keyword=&nature=&page={page}&province_id=&ranktype=&request_type=1'
        '&school_type=&signsafe=&size=20&sort=view_total&top_school_id=&type='
        '&uri=apidata/api/gk/school/lists'
    )

    def start_requests(self):
        # Schedule every list page up front instead of re-queueing pages 2-142
        # from inside parse() on every response (the original did the latter,
        # relying on Scrapy's duplicate filter to discard the repeats).
        for page in range(1, 143):
            yield scrapy.Request(url=self.list_api.format(page=page), callback=self.parse)

    def parse(self, response):
        data = json.loads(response.text).get('data')
        for school in data.get('item'):
            school_id = school.get('school_id')
            # Human-facing page: https://gkcx.eol.cn/school/<school_id>
            # The details themselves are served as static JSON:
            detail_url = f'https://static-data.eol.cn/www/2.0/school/{school_id}/info.json'
            yield scrapy.Request(url=detail_url, callback=self.detail_parse)

    def detail_parse(self, response):
        data = json.loads(response.text).get('data')
        item = china_school_Item()
        item['school_name'] = data.get('name')
        item['school_email_one'] = data.get('email')
        item['school_email_two'] = data.get('school_email')
        item['school_address'] = data.get('address')
        item['school_postcode'] = data.get('postcode')
        item['school_site_one'] = data.get('site')
        item['school_site_two'] = data.get('school_site')
        item['school_phone_one'] = data.get('phone')
        item['school_phone_two'] = data.get('school_phone')
        yield item
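With the virtual environment active, the crawl is started from the project root by spider name:

scrapy crawl china_school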

4. Item

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class china_school_Item(scrapy.Item):
    # define the fields for your item here like:
    school_name = scrapy.Field()
    school_email_one = scrapy.Field()
    school_email_two = scrapy.Field()
    school_address = scrapy.Field()
    school_postcode = scrapy.Field()
    school_site_one = scrapy.Field()
    school_site_two = scrapy.Field()
    school_phone_one = scrapy.Field()
    school_phone_two = scrapy.Field()

5. Settings

import random

from fake_useragent import UserAgent

# Drawn once at startup, so every request in this run shares one UA string.
ua = UserAgent()
USER_AGENT = ua.random

ROBOTSTXT_OBEY = False

# Also evaluated once, at startup. Scrapy's RANDOMIZE_DOWNLOAD_DELAY
# (on by default) additionally jitters each wait to 0.5x-1.5x of this value.
DOWNLOAD_DELAY = random.uniform(0.5, 1)

ITEM_PIPELINES = {
    'lianjia.pipelines.China_school_Pipeline': 300,
}
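If you want a fresh User-Agent per request instead of one per run, a minimal downloader-middleware sketch (a hypothetical addition, not part of the original project) would be:

# middlewares.py -- hypothetical addition, not in the original post
from fake_useragent import UserAgent


class RandomUserAgentMiddleware:
    def __init__(self):
        self.ua = UserAgent()

    def process_request(self, request, spider):
        # Pick a fresh User-Agent for every outgoing request.
        request.headers['User-Agent'] = self.ua.random

enabled in settings with:

DOWNLOADER_MIDDLEWARES = {
    'lianjia.middlewares.RandomUserAgentMiddleware': 400,
}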

6. Pipelines

class China_school_Pipeline:
    FIELDS = [
        'school_name', 'school_email_one', 'school_email_two', 'school_address',
        'school_postcode', 'school_site_one', 'school_site_two',
        'school_phone_one', 'school_phone_two',
    ]

    def open_spider(self, spider):
        # Tab-separated plain text, so use a .tsv extension: writing raw text
        # into a file named .xlsx (as the original did) does not make it Excel.
        self.fp = open('./china_school.tsv', mode='w', encoding='utf-8')
        self.fp.write('\t'.join(self.FIELDS) + '\n')

    def process_item(self, item, spider):
        # Detail pages may omit fields, so map missing values to '' instead of
        # letting '\t'.join() raise on None (the original swallowed that error
        # with a bare except and silently dropped the item).
        line = '\t'.join('' if item.get(f) is None else str(item.get(f)) for f in self.FIELDS)
        self.fp.write(line + '\n')
        return item

    def close_spider(self, spider):
        # Close the output file when the spider finishes.
        self.fp.close()
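For a quick export you could also skip the custom pipeline entirely and let Scrapy's built-in feed exports write the items:

scrapy crawl china_school -o schools.csv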