Scraping Juejin User Information with Scrapy

小思非陌

Last modified: 2023-11-23 16:32:48

Tags: scrapy, crawler, python

First published: 2023-11-23 16:18:40

Article link: https://blog.csdn.net/qq_44827933/article/details/134524595

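Note: this crawler was written in March 2023. The page elements may have changed since then, so inspect the current pages yourself and adjust the selectors accordingly.

Project GitHub repository: https://github.com/xsfmGenius/juejin_spider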

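Requirements analysis

This project crawls Juejin user information; the fields collected are shown in the figure below. Crawling starts from the top-author ranking pages and then continues one level deeper through the followers of the users found there.
[Figure: the user fields to be collected]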

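Creating the project

Open a command prompt (cmd) in a suitable folder and run:

$ scrapy startproject project_name

This generates the following structure:

    project_name/
        scrapy.cfg
        project_name/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py

$ cd project_name
$ scrapy genspider spider_name domain

This creates a spider file in the spiders folder, ready to be edited.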

Modifying the settings

Edit settings.py:

USER_AGENT = '<your own User-Agent string>'  # look this up in your own browser
ROBOTSTXT_OBEY = False
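
The custom downloader middleware and the item pipeline described later only run if they are enabled in settings.py. The fragment below is a minimal sketch of what that might look like. The dotted paths assume a project package named segmentfaultspider (as in the pipeline's import) and the middleware class name that `scrapy startproject` generates by default, so adjust them to your own project. The delay values are illustrative, not the author's.

```python
# Hypothetical settings.py fragment; names and values are assumptions.
USER_AGENT = "<your own User-Agent string>"  # placeholder, not a real UA
ROBOTSTXT_OBEY = False

# Throttle requests inside Scrapy's scheduler instead of sleeping in callbacks.
DOWNLOAD_DELAY = 2                # base delay in seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True   # actual wait is 0.5x to 1.5x DOWNLOAD_DELAY

# Enable the custom downloader middleware (assumed default class name).
DOWNLOADER_MIDDLEWARES = {
    "segmentfaultspider.middlewares.SegmentfaultspiderDownloaderMiddleware": 543,
}

# Enable the pipeline that writes items to the database.
ITEM_PIPELINES = {
    "segmentfaultspider.pipelines.SegmentfaultspiderPipeline": 300,
}
```

The integers are priorities: for pipelines, lower numbers run first.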


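Defining the data

Edit items.py, which defines the data fields to collect.

import scrapy
class SegmentfaultspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()        # user name
    up = scrapy.Field()          # likes received
    read = scrapy.Field()        # article reads
    reputation = scrapy.Field()  # reputation score
    time = scrapy.Field()        # date taken from the profile's <time> element
    followA = scrapy.Field()     # number of followers of this user
    Afollow = scrapy.Field()     # number of users this user follows
    loc = scrapy.Field()         # first field of the profile's position line
    company = scrapy.Field()     # company field of the profile's position line
    introduce = scrapy.Field()   # self-introduction text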

Writing the spider

Ranking page

All functions go in the spider class of the spider file generated earlier. First set the start URLs, i.e. the one or more pages to crawl first.

import scrapy

class UserspiderSpider(scrapy.Spider):
    name = 'userspider'
    # allowed_domains expects bare domain names, not full URLs
    # allowed_domains = ['juejin.cn']
    start_urls = [
        'https://juejin.cn/hot/authors/6809635626879549454',
        'https://juejin.cn/hot/authors/6809637773935378440',
        'https://juejin.cn/hot/authors/6809637771511070734',
        'https://juejin.cn/hot/authors/6809637776263217160',
        'https://juejin.cn/hot/authors/6809637772874219534',
    ]

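Because the ranking page is loaded dynamically, its content cannot be read from the raw response. Use the WebDriver from the selenium library to create a browser object and simulate a real browser visit.
Declare the webdriver object in the spider's initializer and set its options.

# At the top of the spider file:
from selenium import webdriver

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        options = webdriver.ChromeOptions()
        options.add_argument('--no-sandbox')
        options.add_argument('--disable-dev-shm-usage')
        options.add_argument('--disable-gpu')
        options.add_argument('--headless')
        options.add_experimental_option('excludeSwitches', ['enable-automation'])
        # Selenium 4 takes the options via `options=`; `chrome_options=` was removed
        self.bro = webdriver.Chrome(options=options)

    def closed(self, reason):
        # Quit the browser when the spider finishes
        self.bro.quit()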

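A custom downloader middleware is also needed in middlewares.py to intercept and process responses before the Scrapy engine receives them.

# At the top of middlewares.py:
from time import sleep
from scrapy.http import HtmlResponse

    # Inside the downloader middleware class:
    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        bro = spider.bro
        if request.url in spider.start_urls:
            # Ranking pages are rendered in the browser; wait for the dynamic content
            bro.get(request.url)
            sleep(3)
            page_text = bro.page_source
            # Replace the original response with the rendered page source
            new_res = HtmlResponse(url=request.url, body=page_text, encoding='utf-8', request=request)
            return new_res
        else:
            # All other pages pass through unchanged
            return response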
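
A fixed sleep(3) either wastes time or is too short on a slow connection. One alternative is to poll until the content is actually present. The helper below is a minimal, library-independent sketch of that idea; wait_until and fake_list_rendered are hypothetical names introduced here for illustration, not part of the original project. Selenium also ships its own WebDriverWait class for the same purpose.

```python
import time

def wait_until(predicate, timeout=10.0, interval=0.5):
    """Call predicate() until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(interval)

# Stand-in for "the ranking list has rendered": becomes true on the third poll.
calls = {"n": 0}

def fake_list_rendered():
    calls["n"] += 1
    return calls["n"] >= 3

wait_until(fake_list_rendered, timeout=2.0, interval=0.01)
print(calls["n"])  # 3
```

In the middleware, the predicate could plausibly be something like `lambda: bro.find_elements("xpath", '//div[@class="hot-list"]/a')`, which returns a non-empty list once the links exist.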

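parse is the spider's entry callback, and it defines the overall crawl flow. Its purpose here is to go from the ranking page to the profile page of each listed user and collect their information.
Inspect the page, use XPath to extract the links from the relevant tags, and prepend the fixed "https://juejin.cn" prefix to form each ranked user's profile URL. parse_detail is then called to extract the user information from the profile page.
The user id is also passed to parse_follow, which crawls the follower data.

How to get the XPath

[Screenshot: obtaining an element's XPath]

Analyzing the page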

    def parse(self, response):
        # Relative profile links of the users on the ranking page
        name_list = response.xpath('//div[@class="hot-list"]/a/@href').extract()
        for name in name_list:
            detail_url = "https://juejin.cn" + name
            # First request: scrape the user's own profile
            yield scrapy.Request(detail_url, callback=self.parse_detail)
            # The last path segment of the link is the user id
            user_id = name.split("/")[-1]
            # Second request to the same URL: dont_filter=True keeps the
            # duplicate filter from dropping it
            yield scrapy.Request(detail_url, callback=self.parse_follow, meta={'id': user_id}, dont_filter=True)
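
The link handling in parse can be tried out without running the crawler. The sketch below uses the standard library's urljoin instead of string concatenation and assumes the links have the form "/user/<id>", which is consistent with how the rest of the spider builds profile URLs. The function name and the sample id are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://juejin.cn"

def profile_url_and_id(href):
    """Return (absolute profile URL, user id) for a relative profile link."""
    url = urljoin(BASE_URL, href)
    # The id is the last non-empty path segment
    user_id = href.rstrip("/").rsplit("/", 1)[-1]
    return url, user_id

# Hypothetical href of the assumed "/user/<id>" form
print(profile_url_and_id("/user/1234567890"))
# ('https://juejin.cn/user/1234567890', '1234567890')
```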

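Profile pages of ranked users

parse_detail extracts the user information from a profile page. Work out the XPath of each element to scrape, and take care to handle cases where the page layout differs or the content is missing. `yield newitem` hands the collected user information to pipelines.py, which processes and saves the data.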

    def parse_detail(self, response):
        newitem = SegmentfaultspiderItem()
        newitem['name'] = response.xpath('//span[@class="user-name"]/text()')[0].extract()

        # Achievement counters: three values (up, read, reputation) when all are shown,
        # two (read, reputation) when the first is absent; "-1" marks a missing value
        counts = [c.replace(',', '') for c in response.xpath(
            '//div[@class="block-body"]//span[@class="count"]/text()').extract()]
        if len(counts) == 3:
            newitem['up'], newitem['read'], newitem['reputation'] = counts
        elif len(counts) == 2:
            newitem['up'] = "-1"
            newitem['read'], newitem['reputation'] = counts
        else:
            newitem['up'] = newitem['read'] = newitem['reputation'] = "-1"

        newitem['time'] = response.xpath('//div[@class="item-count"]/time/text()')[0].extract().replace('\n', '').replace(' ', '')

        # Follow block: the first count is stored as Afollow, the second as followA
        follow_counts = [c.replace('\n', '').replace(' ', '').replace(',', '') for c in response.xpath(
            '//div[@class="follow-block block shadow"]//div[@class="item-count"]/text()').extract()]
        newitem['Afollow'] = follow_counts[0]
        newitem['followA'] = follow_counts[1]

        # Position line: child nodes 1 and 5 of the inner <span> hold the two fields;
        # an empty field is rendered as the comment placeholder <!---->
        newitem['loc'] = newitem['company'] = ""
        if response.xpath('//div[@class="position"]'):
            for field, idx in (('loc', 1), ('company', 5)):
                node_path = '//div[@class="position"]/span/node()[%d]' % idx
                if response.xpath(node_path)[0].extract() != '<!---->':
                    newitem[field] = response.xpath(node_path + '/text()')[0].extract()

        # The introduction is optional
        intro = response.xpath('//div[@class="intro"]//span[@class="content"]/text()').extract()
        newitem['introduce'] = intro[0].replace('\n', '') if intro else ""
        yield newitem
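
The fallback rules for the three counters are easy to get wrong, so it can help to isolate them in plain functions that can be tested without a live page. The following is a minimal sketch of that idea. clean_number and map_counters are hypothetical helper names, and the sample strings are invented.

```python
def clean_number(text):
    """Strip thousands separators, newlines and spaces from a scraped counter."""
    return text.replace(",", "").replace("\n", "").replace(" ", "")

def map_counters(texts, missing="-1"):
    """Map scraped counter texts to the up/read/reputation fields.

    Three values fill all fields; with two values 'up' is treated as missing;
    any other number of values marks all three fields as missing.
    """
    values = [clean_number(t) for t in texts]
    if len(values) == 3:
        up, read, reputation = values
    elif len(values) == 2:
        up = missing
        read, reputation = values
    else:
        up = read = reputation = missing
    return {"up": up, "read": read, "reputation": reputation}

print(map_counters(["1,204", "35,010", "2,887"]))
# {'up': '1204', 'read': '35010', 'reputation': '2887'}
print(map_counters(["35,010", "2,887"]))
# {'up': '-1', 'read': '35010', 'reputation': '2887'}
```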

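Getting follower information

Open a user's followers page and inspect the network traffic: the follower data can be fetched directly from a request URL. The cursor parameter is the index of the first follower requested, and limit=20 means each request returns 20 records. The follower count determines the number of loop iterations. For each request URL, parse_json is called to extract the follower ids.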

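    def parse_follow(self, response):
        # Requires `import time` and `import random` at the top of the spider file.
        # The second item-count in the follow block is the follower count.
        follownum = int(response.xpath('//div[@class="follow-block block shadow"]//div[@class="item-count"]/text()')[1].extract().replace('\n', '').replace(' ', '').replace(',', ''))
        # One request per page of 20 followers; cursor is the index of the first follower on the page
        for i in range(0, follownum, 20):
            url = ("https://api.juejin.cn/user_api/v1/follow/followers?aid=2608&uuid=7208838064973252151&spider=0"
                   "&user_id=" + str(response.meta['id']) + "&cursor=" + str(i) + "&limit=20")
            # Note: time.sleep blocks the whole Scrapy engine while it waits
            time.sleep(random.randint(1, 4))
            yield scrapy.Request(url, callback=self.parse_json)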
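
The pagination logic (one URL per block of 20 followers) can be sketched as a small generator. This version builds the query string with urlencode and, for brevity, leaves out the aid, uuid and spider parameters that appear in the original URL; whether the API accepts requests without them is not verified here. The function name and the sample values are illustrative.

```python
from urllib.parse import urlencode

API = "https://api.juejin.cn/user_api/v1/follow/followers"

def follower_page_urls(user_id, follower_count, page_size=20):
    """Yield one request URL per page of followers."""
    for cursor in range(0, follower_count, page_size):
        query = urlencode({"user_id": user_id, "cursor": cursor, "limit": page_size})
        yield f"{API}?{query}"

# A hypothetical user with 45 followers needs three pages: cursor 0, 20 and 40.
for url in follower_page_urls("42", 45):
    print(url)
```

With a follower count of 0 the generator yields nothing, so no request is sent for users without followers.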

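For each request URL, a regular expression extracts the follower ids from the response, and parse_detail is called again for each id to collect that follower's profile information.
Because the number of users collected already met the requirement, no further level was crawled. The same loop can be repeated to crawl the followers of followers.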

    def parse_json(self, response):
        # Requires `import re` at the top of the spider file
        results = response.text
        # Capture each user_id that is immediately followed by a user_name key
        pattern = r'"user_id":"(.*?)","user_name"'
        ids = re.findall(pattern, results)
        for follower_id in ids:
            url = "https://juejin.cn/user/" + follower_id
            yield scrapy.Request(url, callback=self.parse_detail)
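
A regular expression depends on the exact key order and spacing of the JSON text. As an alternative sketch, the response can be decoded and searched for "user_id" keys, which works regardless of nesting. The payload below is invented for illustration and is not the real API response. Note also that this walk collects every "user_id" key it finds, whereas the regular expression only matches ids directly followed by "user_name".

```python
import json

def collect_user_ids(node):
    """Recursively collect every "user_id" value in a decoded JSON document."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "user_id":
                found.append(str(value))
            else:
                found.extend(collect_user_ids(value))
    elif isinstance(node, list):
        for element in node:
            found.extend(collect_user_ids(element))
    return found

# Hypothetical payload; the real response structure may differ.
sample = '{"data": [{"user_id": "111", "user_name": "a"}, {"user_id": "222", "user_name": "b"}]}'
print(collect_user_ids(json.loads(sample)))  # ['111', '222']
```

Inside a Scrapy callback, recent Scrapy versions offer response.json() to obtain the decoded document.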

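Saving the data

The default class in pipelines.py processes and saves the items yielded by the spider. Its initializer declares the database connection, and process_item holds the SQL statement that saves each item.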

import pymysql
from segmentfaultspider.items import SegmentfaultspiderItem

class SegmentfaultspiderPipeline:
    def __init__(self):
        # Replace host, password and database with your own values
        self.connection = pymysql.connect(host='xxx.xxx.xxx.xxx',
                                          port=3306,
                                          user='root',
                                          password='xxxxxxx',
                                          database='xxxxxxx',
                                          charset='utf8mb4')

    def process_item(self, item, spider):
        if isinstance(item, SegmentfaultspiderItem):
            fields = ('name', 'up', 'read', 'reputation', 'followA', 'Afollow', 'time', 'loc', 'company', 'introduce')
            # Parameterized query: the driver escapes the values, so quotes in a
            # name or introduction cannot break the statement
            sql = 'INSERT INTO users (`name`,`up`,`read`,`reputation`,`followA`,`Afollow`,`time`,`loc`,`company`,`introduce`) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'
            try:
                with self.connection.cursor() as cursor:
                    cursor.execute(sql, tuple(item[f] for f in fields))
                self.connection.commit()
            except Exception as e:
                print(e)
                self.connection.rollback()
        # Return the item so that any later pipeline still receives it
        return item

    def close_spider(self, spider):
        # Close the database connection when the spider finishes
        self.connection.close()
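
User names and introductions are free text and may contain quotes, so the values should be passed to the database driver separately from the SQL string. The sketch below demonstrates the pattern with the standard library's sqlite3 as a stand-in for MySQL, using an in-memory database and a reduced, hypothetical two-column table. sqlite3 uses ? placeholders, whereas pymysql uses %s.

```python
import sqlite3

# In-memory stand-in for the MySQL `users` table (only two columns for brevity)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, introduce TEXT)")

def save_user(connection, item):
    """Insert one item using placeholders; roll back if the insert fails."""
    try:
        connection.execute(
            "INSERT INTO users (name, introduce) VALUES (?, ?)",
            (item["name"], item["introduce"]),
        )
        connection.commit()
    except sqlite3.Error:
        connection.rollback()
        raise

# Values containing both quote styles are stored unchanged.
save_user(conn, {"name": 'Ann "the coder"', "introduce": "It's a test"})
print(conn.execute("SELECT name, introduce FROM users").fetchall())
# [('Ann "the coder"', "It's a test")]
```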