Code: GitHub
Reference: blog
Crawl a given account's profile and Weibo posts with the Scrapy framework.
Weibo follower ranking as of this writing (2019-01-15):
Crawling approach: scrape the mobile-web (m.weibo.cn) Weibo API.
1. Override the start_requests method
def start_requests(self):
    weibo_id = [1195354434, ]  # seed user IDs
    for wid in weibo_id:
        url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=' + str(wid)
        print(url)
        yield Request(url, callback=self.parse_userInfo, dont_filter=True,
                      meta={'uid': str(wid)})
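String concatenation works, but the same getIndex URL can also be built with urllib.parse.urlencode, which handles escaping for you. A minimal sketch (the parameter names match the URLs used in this post; the helper name is mine):

```python
from urllib.parse import urlencode

def build_index_url(uid, containerid=None, page=None):
    """Build an m.weibo.cn getIndex API URL for a user, optionally paged."""
    params = {'type': 'uid', 'value': str(uid)}
    if containerid:
        params['containerid'] = containerid
    if page:
        params['page'] = str(page)
    return 'https://m.weibo.cn/api/container/getIndex?' + urlencode(params)
```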
2. Parse the user's profile and extract the containerid
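The parse_userInfo callback itself is not shown here; its core job is pulling the profile dict and the "weibo" tab's containerid out of the getIndex response. A sketch of that extraction follows (the field names userInfo, tabsInfo, tabs, and tab_type are assumptions based on the 2019-era m.weibo.cn response shape, not taken from this post):

```python
import json

def extract_user_and_containerid(body):
    """Pull the profile dict and the 'weibo' tab's containerid out of a
    getIndex response body. Field names are assumed from the 2019-era API."""
    data = json.loads(body).get('data', {})
    user_info = data.get('userInfo', {})
    containerid = None
    for tab in data.get('tabsInfo', {}).get('tabs', []):
        if tab.get('tab_type') == 'weibo':
            containerid = tab.get('containerid')
    return user_info, containerid
```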
3. Crawl the blogger's Weibo posts and the people they follow
# Parse the Weibo post list
def parse_weibo_list(self, response):
    # Carry the paging info forward so the next page can be requested
    next_page = str(int(response.meta['page']) + 1)
    uid = response.meta['uid']
    containerid = response.meta['containerid']
    content = json.loads(response.text).get('data')
    cards = content.get('cards')
    if len(cards) > 0:
        print("-----crawling page %s-----" % str(response.meta['page']))
        for j in range(len(cards)):
            card_type = cards[j].get('card_type')
            # Weibo post
            # if card_type == 9:
            #     mblog = cards[j].get('mblog')
            #     attitudes_count = mblog.get('attitudes_count')  # like count
            #     comments_count = mblog.get('comments_count')  # comment count
            #     created_at = self.date_format(mblog.get('created_at'))  # publish time
            #     reposts_count = mblog.get('reposts_count')  # repost count
            #     scheme = cards[j].get('scheme')  # post URL
            #     # Replace <br /> with newlines, then extract the plain text
            #     text = etree.HTML(str(mblog.get('text')).replace('<br />', '\n')).xpath('string()')  # post content
            #     pictures = mblog.get('pics')  # attached images, a list
            #     pic_urls = []  # image URLs
            #     if pictures:
            #         for picture in pictures:
            #             pic_url = picture.get('large').get('url')
            #             pic_urls.append(pic_url)
            #     uid = response.meta['uid']
            #     # Save the data
            #     sinaitem = SinaItem()
            #     sinaitem["uid"] = uid
            #     sinaitem["text"] = text
            #     sinaitem["scheme"] = scheme
            #     sinaitem["attitudes_count"] = attitudes_count
            #     sinaitem["comments_count"] = comments_count
            #     sinaitem["created_at"] = created_at
            #     sinaitem["reposts_count"] = reposts_count
            #     sinaitem["pictures"] = pic_urls
            #     yield sinaitem
            # Follow info
            if card_type == 11:
                # URL of the users this account follows; inspect the requests made by
                # https://m.weibo.cn/p/index?containerid=231051_-_followers_-_1195354434_-_1042015%3AtagCategory_050&luicode=10000011&lfid=1076031195354434
                fllow_url = str(cards[j]['card_group'][0]['scheme']).replace('https://m.weibo.cn/p/index?', 'https://m.weibo.cn/api/container/getIndex?')
                print(fllow_url, '----')
                yield Request(url=fllow_url, callback=self.parse_fllow)
    # Next page
    # weibo_list_url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=' + uid + '&containerid=' + containerid + '&page=' + next_page
    # response.meta['page'] = next_page
    # yield Request(weibo_list_url, callback=self.parse_weibo_list, meta=response.meta)
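The URL rewrite for the follow list (swapping the /p/index page URL for the getIndex API endpoint behind it) can be isolated into a small helper so the callback stays readable:

```python
def to_api_url(scheme_url):
    """Rewrite an m.weibo.cn page URL to the JSON API endpoint behind it,
    keeping the query string (containerid, luicode, etc.) intact."""
    return scheme_url.replace('https://m.weibo.cn/p/index?',
                              'https://m.weibo.cn/api/container/getIndex?')
```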
4. Repeat the process with the IDs of the users they follow
# Parse the followed users
def parse_fllow(self, response):
    content = json.loads(response.text).get('data')
    cards = content.get('cards')
    # if len(cards) > 0:
    for card in cards:
        if card.get('title') == '他的全部关注':  # the "all of their follows" card
            for tmp in card.get('card_group'):
                user = tmp.get('user')
                # ID of a followed user
                uid = user.get('id')
                yield Request('https://m.weibo.cn/api/container/getIndex?type=uid&value=' + str(uid), callback=self.parse_userInfo, dont_filter=True,
                              meta={'uid': str(uid)})
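Because parse_fllow feeds new uids back into the pipeline, the crawl walks the follow graph indefinitely. One simple control condition is a visited set with a hard cap; a sketch (the class and its limit are mine, not from the original project):

```python
class SeenFilter:
    """Track crawled uids so the follow-graph walk terminates."""

    def __init__(self, max_users=1000):
        self.seen = set()
        self.max_users = max_users

    def allow(self, uid):
        """Return True exactly once per uid, and never past the cap."""
        uid = str(uid)
        if uid in self.seen or len(self.seen) >= self.max_users:
            return False
        self.seen.add(uid)
        return True
```

Inside parse_fllow you would then guard each `yield Request(...)` with `if self.seen_filter.allow(uid):`.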
Since this process loops back on itself, you need some stopping condition for the crawl to ever finish (assuming your IP does not get banned first).
It is best to first filter down to the users you are interested in, then crawl their posts.
To avoid bans, use rotating proxy IPs, added in the downloader middleware.
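A downloader middleware that rotates proxies could look like the sketch below. PROXY_LIST is a hypothetical custom setting name (fill it with your own proxy URLs); from_crawler and process_request are the standard Scrapy middleware hooks:

```python
import random

class RandomProxyMiddleware:
    """Attach a randomly chosen proxy to every outgoing request."""

    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is an assumed setting, e.g. ['http://1.2.3.4:8080', ...]
        return cls(crawler.settings.getlist('PROXY_LIST'))

    def process_request(self, request, spider):
        if self.proxies:
            request.meta['proxy'] = random.choice(self.proxies)
```

Enable it via DOWNLOADER_MIDDLEWARES in settings.py, as with any other downloader middleware.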