Scraping Tieba Post Titles and Links with Scrapy

又逢乱世

Last modified: 2024-08-14 19:32:02


Category: Web Scraping. Tags: scrapy, web scraping

First published: 2024-08-14 19:27:17

Article link: https://blog.csdn.net/a1053765496/article/details/141198069


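Disclaimer

Thank you for studying this scraper demo. Before using it, please read the following disclaimer carefully:

Study and research purposes: This scraper demo is provided for study and research only. Users must not use it for any commercial purpose or for any other unauthorized activity.

Legality: When using this demo, users must ensure that their actions comply with applicable laws and regulations. Be sure to understand and comply with the target website's terms of service and privacy policy.

Ethics: Respect the target website's terms of use, and do not put excessive load on its servers or interfere with its normal operation.

Intellectual property protection: Scraped data should be used only for personal study and research, and must not be used in ways that infringe copyright or other intellectual property rights.

Privacy protection: Do not scrape or store data containing sensitive personal information, so as not to violate the privacy of others.

Use at your own risk: Using this demo may carry certain risks, including but not limited to legal risk, data loss, and account bans. Users bear all related risks and responsibilities themselves.

The author accepts no liability for any direct or indirect loss arising from the use of this demo. Users are responsible for their own use and bear all possible consequences.

Important: Before using this demo, make sure you have read and understood the disclaimer above in full. If you have any doubts, stop using it immediately.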

This is a demo for learning web scraping, built with Scrapy.

It scrapes the post titles and links from one topic forum on Tieba.
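
The forum to crawl is selected by the kw query parameter of the start URL in the spider code, and the links found on a page are typically relative. The following is a minimal standard-library sketch of how such URLs are percent-encoded, decoded, and resolved. The host tieba.example.com and the sample hrefs are placeholders invented for illustration, not values taken from the real site.

```python
from urllib.parse import quote, unquote, urljoin

# Hypothetical page URL; the host is a placeholder, not the real site.
page_url = "https://tieba.example.com/f?kw=" + quote("沙井")
print(page_url)

# unquote() turns the percent-encoded form back into readable text.
print(unquote(page_url))

# urljoin() resolves root-relative and protocol-relative hrefs
# against the URL of the page they were found on.
post_url = urljoin(page_url, "/p/123456")
next_url = urljoin(page_url, "//tieba.example.com/f?kw=x&pn=50")
print(post_url)
print(next_url)
```

Resolving links this way avoids building URLs by string concatenation, which can produce doubled slashes or fail when a link is missing.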

Scrapy spider code:

import scrapy
from urllib.parse import unquote


class TeiBaSpider(scrapy.Spider):
    name = "teiba"
    allowed_domains = ["tieba.xxx.com"]
    start_urls = ["https://tieba.xxx.com/f?kw=沙井"]

    # Pipeline used by this spider
    custom_settings = {
        'ITEM_PIPELINES': {
            'myproject.pipelines.TieBaPipeline': 300,
        }
    }

    # Request headers sent with every page request
    request_headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36',
        'Accept-Language': 'zh-CN,zh;q=0.9',
    }

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, headers=self.request_headers, callback=self.parse)

    # Parse one page of the thread list
    def parse(self, response):
        # Strip the HTML comment markers so that commented-out markup becomes regular elements
        cleaned_html = response.text.replace('<!--', '').replace('-->', '')

        # Build a new Response object from the modified HTML
        new_response = response.replace(body=cleaned_html)

        # Select the <a> tags of the thread titles
        a_list = new_response.xpath('//ul/li//div[@class="threadlist_title pull_left j_th_tit "]/a')

        # Extract the title and link of each thread
        for a in a_list:
            href = a.xpath('./@href').get()
            if href is None:
                continue
            # urljoin resolves the relative href against the page URL
            item = {'title': a.xpath('./@title').get(), 'url': response.urljoin(href)}
            # Yielded items are passed on to TieBaPipeline
            yield item

        # Link to the next page; get() returns None when there is none
        next_href = new_response.xpath('//*[contains(text(), "下一页")]/@href').get()
        if next_href:
            # The URL contains percent-encoded Chinese characters;
            # unquote decodes them so that the logged URL is readable
            next_page_url = unquote(response.urljoin(next_href))
            self.logger.info("Next page: %s", next_page_url)
            # Use parse as the callback so the next page is handled the same way
            yield scrapy.Request(next_page_url, headers=self.request_headers, callback=self.parse)
        else:
            # No next-page link was found, so there are no more pages
            self.logger.info('No more pages to crawl.')
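
The parse method strips the comment markers before running XPath, presumably because the thread list arrives wrapped in an HTML comment, and markup inside a comment is not parsed as elements. The following standard-library sketch shows that effect. The HTML fragment and the LinkCollector helper are invented for illustration and are not part of Scrapy.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects (title, href) from every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = dict(attrs)
            self.links.append((attrs.get("title"), attrs.get("href")))


def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up page fragment: the link is wrapped in an HTML comment.
html = '<ul><!--<li><a href="/p/1" title="Hello">Hello</a></li>--></ul>'

# Parsed as-is, the commented-out link is not reported as an element.
hidden = collect_links(html)

# Same replacement as in the spider: drop only the comment markers.
cleaned = html.replace("<!--", "").replace("-->", "")
visible = collect_links(cleaned)

print(hidden)
print(visible)
```

Note that this replacement removes every comment marker on the page, so any markup that was deliberately commented out becomes live as well.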

Scrapy pipeline code:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

from pymongo import MongoClient


class MyprojectPipeline:
    def process_item(self, item, spider):
        return item


class TieBaPipeline:

    # Called when the spider starts: open the MongoDB connection
    def open_spider(self, spider):
        self.client = MongoClient("mongodb://admin:123456@192.168.189.71:27017/admin?authSource=admin")
        self.collection = self.client["spider"]["tieba"]

    # Receives each item yielded by the spider
    def process_item(self, item, spider):
        # Save the data; here it is stored in MongoDB.
        # asdict() returns a copy, so insert_one does not add "_id" to the item itself
        result = self.collection.insert_one(ItemAdapter(item).asdict())
        spider.logger.info("Inserted document id: %s", result.inserted_id)
        return item

    # Called when the spider closes: release the connection
    def close_spider(self, spider):
        self.client.close()
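
The pipeline above hard-codes the MongoDB connection string. One possible alternative, sketched here under stated assumptions, reads it from the project settings through Scrapy's from_crawler hook and manages the client in the open_spider and close_spider hooks. The setting names MONGO_URI, MONGO_DATABASE, and MONGO_COLLECTION and the class name SettingsMongoPipeline are hypothetical. pymongo is imported lazily so that the class can be loaded without the driver installed.

```python
class SettingsMongoPipeline:
    """Hypothetical variant of TieBaPipeline configured through settings.py."""

    def __init__(self, mongo_uri, database, collection):
        self.mongo_uri = mongo_uri
        self.database = database
        self.collection_name = collection
        self.client = None
        self.collection = None

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_* are assumed setting names, not built-in Scrapy settings.
        settings = crawler.settings
        return cls(
            mongo_uri=settings.get("MONGO_URI", "mongodb://localhost:27017"),
            database=settings.get("MONGO_DATABASE", "spider"),
            collection=settings.get("MONGO_COLLECTION", "tieba"),
        )

    def open_spider(self, spider):
        # Imported here so that loading the class does not require the driver.
        from pymongo import MongoClient

        self.client = MongoClient(self.mongo_uri)
        self.collection = self.client[self.database][self.collection_name]

    def process_item(self, item, spider):
        # Insert a copy so the item passed on is not mutated with an "_id" key.
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()
```

With this variant, the credentials would live in settings.py or in an environment-specific configuration rather than in the source file, and the pipeline would be enabled through ITEM_PIPELINES in the same way as TieBaPipeline.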

In settings.py, set ROBOTSTXT_OBEY to False.
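
A sketch of the relevant part of settings.py. Only ROBOTSTXT_OBEY comes from this article. The throttling values are assumptions, added in the spirit of the disclaimer's advice not to overload the target server, and should be tuned as needed.

```python
# settings.py (fragment)

# Stated in the article: do not apply robots.txt rules.
ROBOTSTXT_OBEY = False

# Assumed values, not from the article: keep the crawl slow and polite.
DOWNLOAD_DELAY = 2                   # minimum seconds between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # one request at a time per domain
AUTOTHROTTLE_ENABLED = True          # adapt the delay to server response times
```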

Stored data result (screenshot not reproduced here)