Fetching Tencent Docs Content with a Python Crawler

Latest recommended article published on 2024-07-24 23:49:50

滑滑板的蜗牛


Article link: https://blog.csdn.net/Pcy277921981/article/details/140638045


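Disclaimer

This article describes how to use a Python crawler to download Excel data from Tencent Docs. I hope this write-up helps you when you work with web crawlers. Remember that using crawlers reasonably and safely is the best way to get the data you need, and it also helps keep the internet ecosystem healthy. Let's work together to make web crawlers a powerful tool for information access and knowledge sharing!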

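I. Data Preparation

Three parameters need to be prepared: document_url, document_pad_id, and cookie_str.

1. The document URL: document_url

Copy the URL of the document you want to download, as shown in the figure.
[Figure: where to find the document URL]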

2. Getting document_pad_id

Find the padId value that belongs to the document, as shown in the figure.
[Figure: where to find the padId value]
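The padId used later in this article contains a `$` character, so it has to be form-encoded when it is sent as the request body. As a rough illustration, the sketch below shows what the `application/x-www-form-urlencoded` body built from `docId` and `version` looks like. The helper name `build_export_body` is hypothetical and only for illustration; `requests` performs an equivalent encoding itself when you pass a dict via `data=`.

```python
from urllib.parse import urlencode


def build_export_body(document_pad_id):
    """Form-encode the export request body (hypothetical helper for illustration)."""
    # Same two fields the article's code sends to the export endpoint
    return urlencode({"docId": document_pad_id, "version": "2"})


if __name__ == "__main__":
    # The '$' in the padId is percent-encoded as %24
    print(build_export_body("300000000$VsvTZIbLfykU"))
```

This makes it easy to compare the body against what the browser sends if the export request is rejected.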

3. Getting cookie_str

Find your own cookie as shown in the figure, then select and copy it.
[Figure: where to find the cookie]
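The copied cookie is one long `name=value; name=value` string. The article passes it straight into the `Cookie` header, which works; if you would rather inspect individual values or check that the copy was not truncated, a minimal sketch like the following splits it into a dict. The function name `parse_cookie_str` and the cookie names in the example are assumptions made for illustration, not actual Tencent Docs cookie names.

```python
def parse_cookie_str(cookie_str):
    """Split a browser-style cookie string into a {name: value} dict."""
    cookies = {}
    for part in cookie_str.split(";"):
        # Split on the first '=' only, because values may contain '=' themselves
        name, sep, value = part.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


if __name__ == "__main__":
    # Hypothetical cookie names, for illustration only
    print(parse_cookie_str("uid=123; sid=abc=def"))
```

A dict in this shape can also be handed to `requests` through its `cookies=` argument instead of a raw `Cookie` header.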

II. Complete Code

import time
from datetime import datetime
import requests

class getTengXunDoc:

    def __init__(self, document_url, document_pad_id, cookie_str):
        # URL of the Tencent online document
        self.document_url = document_url
        # Each Tencent online document has a unique ID that must be obtained manually (see the steps above)
        self.document_pad_id = document_pad_id
        self.headers = {
            "content-type": "application/x-www-form-urlencoded",
            "Cookie": cookie_str,
        }

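    # Get the operation ID of the export task
    def getOperationId(self, export_excel_url):
        body = {"docId": self.document_pad_id, "version": "2"}

        # Keep TLS certificate verification enabled because the login cookie is sent here
        res = requests.post(url=export_excel_url, headers=self.headers, data=body)
        # Fail early on an HTTP error instead of on a confusing JSON/KeyError later
        res.raise_for_status()
        operation_id = res.json()["operationId"]
        return operation_id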

    def ExcelDownload(self, check_progress_url, file_name):
        # Poll the export task until the download URL of the Excel file is ready
        start_time = time.time()
        file_url = ""
        while True:
            res = requests.get(url=check_progress_url, headers=self.headers)
            result = res.json()
            if result["progress"] == 100:
                file_url = result["file_url"]
                break
            elif time.time() - start_time > 30:
                print("Export preparation timed out, please investigate")
                break
            # Pause between polls instead of hitting the server in a tight loop
            time.sleep(1)
        if file_url:
            # Use a copy so the shared headers keep their original content-type
            download_headers = {**self.headers, "content-type": "application/octet-stream"}
            res = requests.get(url=file_url, headers=download_headers)
            with open(file_name, "wb") as f:
                f.write(res.content)
            print("Download succeeded, file name: " + file_name)
        else:
            print("Failed to get the download URL; the Excel file was not downloaded")


if __name__ == '__main__':
    # Obtained in data preparation step 1
    document_url = 'https://docs.qq.com/sheet/DVnN2VFpJYkxmeWtV'
    # Obtained in data preparation step 2
    document_pad_id = '300000000$VsvTZIbLfykU'
    # Obtained in data preparation step 3
    cookie_str = 'your own cookie'
    tx = getTengXunDoc(document_url, document_pad_id, cookie_str)
    # URL of the export task
    export_excel_url = 'https://docs.qq.com/v1/export/export_office'
    # Get the operation ID of the export task
    operation_id = tx.getOperationId(export_excel_url)

    check_progress_url = f'https://docs.qq.com/v1/export/query_progress?operationId={operation_id}'
    current_datetime = datetime.now().strftime('%Y_%m_%d_%H_%M_%S')
    file_name = f'{current_datetime}.xlsx'
    # Download the file
    tx.ExcelDownload(check_progress_url, file_name)
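
The download step above boils down to polling a progress endpoint until it reports 100 and then reading `file_url`. As a sketch of that pattern in isolation, the helper below takes any callable that returns the progress JSON as a dict, so it can be tried without network access. The name `wait_for_file_url` and its parameters are assumptions for illustration and are not part of the original code; the `progress` and `file_url` keys are the ones the article's code reads.

```python
import time


def wait_for_file_url(fetch_status, timeout=30.0, interval=1.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until progress reaches 100; return file_url or None on timeout."""
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status.get("progress") == 100:
            return status.get("file_url")
        if clock() >= deadline:
            # Give up once the deadline has passed
            return None
        # Wait before the next poll so the server is not hit in a tight loop
        sleep(interval)


if __name__ == "__main__":
    # Simulated responses standing in for the real progress endpoint
    fake = iter([{"progress": 30}, {"progress": 100, "file_url": "https://example.com/f.xlsx"}])
    print(wait_for_file_url(lambda: next(fake), interval=0))
```

With `requests`, the callable could be something like `lambda: requests.get(check_progress_url, headers=headers).json()`. Using a monotonic clock for the deadline keeps the timeout unaffected by system clock adjustments.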

If this article helped you, thank you for following!
