Scraping Lianjia agent data with Scrapy (detailed tutorial)

Latest recommended article published on 2023-04-27 10:35:32

for_syq

Category: Scrapy crawlers

Article link: https://blog.csdn.net/for_syq/article/details/105703364

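URL: https://bj.lianjia.com/jingjiren/
This is the site we are going to scrape. I scraped the first 10 pages; the version on the network drive scrapes 100 pages (that is, all of them).
A Baidu Netdisk link to the project is at the end of the article.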
1. Create the crawler project

scrapy startproject lianjia_agent    # create the project
cd lianjia_agent                     # enter the project directory
scrapy genspider agent lianjia.com   # agent = spider name, lianjia.com = domain

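Once the project has been created, Scrapy generates its standard project files, including items.py, pipelines.py, settings.py and spiders/agent.py (the last one comes from the genspider command).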
2. Modify the project settings (settings.py)
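
The original post does not list which settings were changed, so the following is only a minimal sketch of what settings.py might contain. The one entry this project clearly needs is ITEM_PIPELINES, since the pipeline in step 5 only runs once it is registered there; the DOWNLOAD_DELAY and USER_AGENT values are assumptions chosen for illustration, not the author's settings.

```python
# settings.py (sketch): assumed values, not taken from the original post.

BOT_NAME = "lianjia_agent"
SPIDER_MODULES = ["lianjia_agent.spiders"]
NEWSPIDER_MODULE = "lianjia_agent.spiders"

# Hypothetical browser-like User-Agent string; replace with your own.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

# Assumed politeness delay (seconds) between requests to the same site.
DOWNLOAD_DELAY = 1

# Register the pipeline from pipelines.py; the number (0-1000) sets its
# order relative to any other pipelines, lower runs first.
ITEM_PIPELINES = {
    "lianjia_agent.pipelines.LianjiaAgentPipeline": 300,
}
```

If the pipeline is not registered in ITEM_PIPELINES, the crawl still runs but nothing is written to agent.csv.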

3. Define the fields to scrape (items.py)

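import scrapy


class LianjiaAgentItem(scrapy.Item):
    name = scrapy.Field()               # agent name
    href = scrapy.Field()               # detail-page link
    plate = scrapy.Field()              # main business areas
    history = scrapy.Field()            # historical transaction volume
    score = scrapy.Field()              # overall score
    comments = scrapy.Field()           # comments
    contact = scrapy.Field()            # contact details
    length_of_service = scrapy.Field()  # years of service on the platform
    personal = scrapy.Field()           # personal achievements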

4. Write the spider (agent.py)

# -*- coding: utf-8 -*-
import scrapy

from ..items import LianjiaAgentItem


class AgentSpider(scrapy.Spider):
    name = 'agent'  # spider name
    allowed_domains = ['lianjia.com']  # domains the spider is allowed to crawl
    # List pages 1 through 10 (the upper bound of range is exclusive).
    start_urls = ['https://bj.lianjia.com/jingjiren/pg{}/'.format(i) for i in range(1, 11)]

    def parse(self, response):
        """Extract the list-page fields for each agent, then follow the detail page."""
        for li in response.xpath('//div[@class="list-wrap"]/ul/li'):
            item = LianjiaAgentItem()  # a fresh item per agent, so no deepcopy is needed
            item['name'] = li.xpath('.//div[@class="agent-name"]/a/h2/text()').get()
            item['href'] = li.xpath('.//div[@class="agent-name"]/a/@href').get()

            plate = li.xpath('.//div[@class="main-plate"]/span[2]//text()').getall()
            item['plate'] = 'Main areas:' + ','.join(p.strip() for p in plate)

            # As in the original: take every other text node and join the first two.
            history = li.xpath('.//div[@class="achievement"]//text()').getall()[0::2]
            item['history'] = 'Recent activity:' + ''.join(history[:2])

            # default='' avoids a TypeError when the node is missing.
            score = li.xpath('.//div[@class="high-praise"]/span/text()').get(default='')
            item['score'] = 'Overall score:' + score
            item['comments'] = li.xpath('.//div[@class="comment-num"]//a/text()').get()
            item['contact'] = li.xpath('.//div[@class="col-3"]/h2/text()').get()

            if not item['href']:
                continue  # no detail link, so there is nothing to follow
            yield scrapy.Request(response.urljoin(item['href']), callback=self.details,
                                 meta={'item': item})

    def details(self, response):
        """Add the two detail-page fields and emit the finished item."""
        item = response.meta['item']
        info = '//ul[@class="info-list is-ke"]/li[{}]/span[2]/text()'
        item['length_of_service'] = 'Years on platform:' + response.xpath(info.format(1)).get(default='')
        item['personal'] = 'Personal achievements:' + response.xpath(info.format(2)).get(default='')
        yield item
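
The spider above builds its page URLs with a list comprehension, and strips and joins text nodes before prefixing a label. The sketch below isolates those two steps as plain functions so they can be checked without running a crawl. The helper names page_urls, join_texts and labelled are hypothetical (they are not part of the original project), and join_texts also drops empty nodes, which the original code does not do.

```python
def page_urls(first, last):
    """Build the list-page URLs pg<first>/ .. pg<last>/, both ends inclusive."""
    template = "https://bj.lianjia.com/jingjiren/pg{}/"
    # range() excludes its upper bound, hence last + 1.
    return [template.format(i) for i in range(first, last + 1)]


def join_texts(texts, sep=","):
    """Strip each text node, drop the empty ones, and join the rest."""
    return sep.join(t.strip() for t in texts if t and t.strip())


def labelled(label, value):
    """Prefix a value with a label, treating a missing value (None) as empty."""
    return "{}:{}".format(label, value or "")


urls = page_urls(1, 10)
print(len(urls), urls[0], urls[-1])
print(labelled("Main areas", join_texts([" Chaoyang ", "\n", "Haidian"])))
print(labelled("Overall score", None))
```

Note that range(1, 10) would stop at page 9, so covering the first 10 pages needs an upper bound of 11.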

5. Write the pipeline that saves the data (pipelines.py)

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import csv


class LianjiaAgentPipeline(object):
    def open_spider(self, spider):
        # Open the CSV file once per crawl, in append mode.
        self.file = open('agent.csv', 'a', encoding='utf-8', newline='')
        self.writer = csv.writer(self.file)

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        row = [item['name'], item['length_of_service'], item['plate'], item['personal'],
               item['history'], item['comments'], item['score'], item['contact']]
        self.writer.writerow(row)  # save one agent per CSV row
        print('Saving: {}\t\t{}'.format(item['name'], item['contact']))
        return item  # pass the item on to any later pipeline
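
The pipeline above writes rows without a header, so the column order lives only in the code. As an alternative sketch (not the author's method), csv.DictWriter from the standard library can write a header row and pick the columns by name. The function write_rows and the sample row are hypothetical, and the example writes to an in-memory buffer rather than agent.csv so it can run on its own.

```python
import csv
import io

# Column order used by the pipeline in this tutorial.
FIELDS = ["name", "length_of_service", "plate", "personal",
          "history", "comments", "score", "contact"]


def write_rows(stream, items):
    """Write a header row followed by one row per item dict."""
    # extrasaction="ignore" skips keys that are not columns (for example href).
    writer = csv.DictWriter(stream, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for item in items:
        # Keys missing from an item are written as empty cells.
        writer.writerow(item)


buffer = io.StringIO(newline="")
write_rows(buffer, [{"name": "Agent A", "contact": "000-0000", "href": "ignored"}])
print(buffer.getvalue())
```

In a real pipeline the header would be written once in open_spider, and only when the file is new, since the tutorial opens agent.csv in append mode.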

6. Create a run.py file yourself so the spider is easy to run from PyCharm (you can of course also enter the command in cmd: scrapy crawl agent --nolog)

import os
os.system('scrapy crawl agent --nolog')
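
As a variation on the os.system call, here is a sketch that builds the same command as an argument list and runs it with subprocess, which avoids shell parsing and returns the exit code. The function names crawl_command and run_spider are hypothetical, and the sketch assumes Scrapy is installed in the same interpreter that runs the script.

```python
import subprocess
import sys


def crawl_command(spider_name, quiet=True):
    """Build the argument list for `python -m scrapy crawl <spider>`."""
    command = [sys.executable, "-m", "scrapy", "crawl", spider_name]
    if quiet:
        command.append("--nolog")  # same flag as in the tutorial
    return command


def run_spider(spider_name, quiet=True):
    """Run the crawl in a child process and return its exit code."""
    return subprocess.run(crawl_command(spider_name, quiet)).returncode


# Usage from the project root (not executed here):
# exit_code = run_spider("agent")
```

Passing quiet=False drops --nolog, which is useful when the crawl produces no output and you need to see Scrapy's log.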

Finally, run run.py.

Link: https://pan.baidu.com/s/1KrQgDLKvA0-LjxXShlLIQQ
Extraction code: y318
