Web Scraping in Practice II: Baidu Tieba All-Forum Search

Baidu Tieba's search shows at most 76 pages of results per keyword, with at most 10 results per page, so at most 760 threads are reachable per keyword. (As of 2022/11/23)

1. Scraper Requirements


import scrapy

class TiebaKeysearchItem(scrapy.Item):
    _id = scrapy.Field()                    # _id: data-tid, i.e. thread_id, the tid of the thread
    fid = scrapy.Field()                    # Forum id, needed later to fetch the JSON packets
    title = scrapy.Field()                  # Thread title / question title
    thread_url = scrapy.Field()             # URL of the thread
    content = scrapy.Field()                # Content (actually the first floor of the thread page; optional)
    forum_name = scrapy.Field()             # Name of the forum the thread belongs to
    author = scrapy.Field()                 # Original poster, i.e. the author
    create_time = scrapy.Field()            # Time the thread was posted


2. Field Analysis

1. The search keyword must be GBK-encoded in the request URL. If the keyword is concatenated into the URL directly, it ends up percent-encoded as UTF-8 and the request fails, so it has to be converted first.

# Correct: percent-encode the GBK bytes of the keyword
req_url = 'https://tieba.baidu.com/f/search/res?isnew=1&kw=&qw=' + quote(keyword.encode('gbk')) + '&rn=10&un=&only_thread=1&sm=1&sd=&ed=&pn=' + str(page)
> https://tieba.baidu.com/f/search/res?isnew=1&kw=&qw=%C2%E5%CC%EC%D2%C0&rn=10&un=&only_thread=1&sm=1&sd=&ed=&pn=1

# Wrong: concatenating the keyword directly yields UTF-8 percent-encoding, which returns no data
req_url = 'https://tieba.baidu.com/f/search/res?isnew=1&kw=&qw=' + str(keyword)  + '&rn=10&un=&only_thread=1&sm=1&sd=&ed=&pn=' + str(page)
> https://tieba.baidu.com/f/search/res?isnew=1&kw=&qw=%E6%B4%9B%E5%A4%A9%E4%BE%9D&rn=10&un=&only_thread=1&sm=1&sd=&ed=&pn=1
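You can verify the difference between the two encodings with the standard library alone; a minimal sketch (the expected outputs below are the percent-encoded values from the two URLs above):

from urllib.parse import quote

keyword = '洛天依'
print(quote(keyword.encode('gbk')))   # %C2%E5%CC%EC%D2%C0 -> accepted by Tieba search
print(quote(keyword))                 # %E6%B4%9B%E5%A4%A9%E4%BE%9D -> UTF-8, returns no data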
     

3. Code Implementation

from urllib.parse import quote

from lxml import etree
from scrapy import Request
from scrapy.http import HtmlResponse

from ..items import TiebaKeysearchItem   # adjust to your project layout

def start_requests(self):
    keyword = '洛天依'
    # One keyword search shows at most 76 pages; range() excludes the end value, so use 77
    for page in range(1, 77):
        req_url = ('https://tieba.baidu.com/f/search/res?isnew=1&kw=&qw='
                   + quote(keyword.encode('gbk'))
                   + '&rn=10&un=&only_thread=1&sm=1&sd=&ed=&pn=' + str(page))
        yield Request(url=req_url, headers=UA_header)
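The UA_header dict is not shown in the original post; any ordinary browser User-Agent should do. A hypothetical example:

# Hypothetical definition of the UA_header assumed by start_requests() above
UA_header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}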

def parse(self, response: HtmlResponse, **kwargs):
    response_data = etree.HTML(response.text)
    post_lists = response_data.xpath('//div[@class="s_post_list"]/div')
    if not post_lists:
        print('No results retrieved')
    else:
        for post in post_lists:
            # Create a fresh item per post so fields are never shared across posts
            TiebaPost = TiebaKeysearchItem()
            TiebaPost['_id'] = post.xpath('./span/a/@data-tid')[0]
            TiebaPost['fid'] = post.xpath('./span/a/@data-fid')[0]
            # Take only the text of the title link
            TiebaPost['title'] = post.xpath('./span/a')[0].xpath('string(.)').strip()
            TiebaPost['thread_url'] = 'https://tieba.baidu.com/' + post.xpath('./span/a/@href')[0]
            TiebaPost['content'] = post.xpath('./div[1]')[0].xpath('string(.)').strip()
            TiebaPost['forum_name'] = post.xpath('./a[1]/font/text()')[0]
            try:
                TiebaPost['author'] = post.xpath('./a[2]/font/text()')[0]
            except IndexError:
                # Some results have no author link
                TiebaPost['author'] = ''
            TiebaPost['create_time'] = post.xpath('./font/text()')[0]
            print(TiebaPost)
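The code above stops at printing the items. To persist them, a minimal pipeline sketch is shown below, assuming MongoDB (the _id field naming suggests it) and pymongo; the database and collection names are placeholders:

import pymongo

class TiebaKeysearchPipeline:
    # Upserts each scraped thread into MongoDB, keyed by its tid
    def open_spider(self, spider):
        self.client = pymongo.MongoClient('mongodb://localhost:27017')
        self.collection = self.client['tieba']['keysearch']   # placeholder names

    def process_item(self, item, spider):
        # _id is the thread tid, so re-scraped threads overwrite in place instead of duplicating
        self.collection.update_one({'_id': item['_id']}, {'$set': dict(item)}, upsert=True)
        return item

    def close_spider(self, spider):
        self.client.close()

To use it, replace print(TiebaPost) with yield TiebaPost and register the pipeline under ITEM_PIPELINES in settings.py.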

If you have any other questions, feel free to discuss~
