How to batch-download every file in a Google Drive folder with Python and Bash


Edward_ed_liu

Article link: https://blog.csdn.net/Edward_ed_liu/article/details/114226108

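I recently needed to download a large number of open datasets from Google Drive, but the data is scattered as small files across many folders. Because the total volume is large, downloading from the top-level folder makes Google first pack the data into several zip archives, which are then downloaded one by one. The proxy I use to get past the firewall is unstable, and these zip downloads do not support resuming, so whenever the connection drops partway through, every unfinished archive has to be downloaded again from scratch. My workaround is to rent a cloud node outside the firewall, download everything onto it over a stable connection, repack the data however I like (one compressed bz2 archive, or several smaller archives), and then pull the result from the cloud node to my local machine.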

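The cloud node has no graphical desktop, however, so the download has to be done from the shell, and plain wget cannot download an entire folder, so a different route is needed. PS: the same method also works whenever you need to automatically download a Google Drive folder, or a large number of small files, from a Linux shell.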

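Step 1: Enable the Google Drive API

Follow https://developers.google.com/drive/api/v3/quickstart/python, complete the steps described there, download the credentials.json file, and rename it to client_secrets.json. PyDrive2 looks for a file with that name in the working directory by default.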
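Before moving on, it can help to confirm that the renamed file is really an OAuth client file. The sketch below is only a sanity check under one assumption: that the file has a top-level "installed" or "web" section containing client_id and client_secret, which is the usual layout of files downloaded from the Google Cloud console. The function name check_client_secrets is made up for this example.

```python
import json


def check_client_secrets(path="client_secrets.json"):
    """Return the OAuth client type ('installed' or 'web') found in the file.

    Assumes the usual layout of a Google OAuth client file: one top-level
    section named 'installed' or 'web' holding client_id and client_secret.
    """
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)

    for kind in ("installed", "web"):
        section = data.get(kind)
        if isinstance(section, dict):
            missing = [k for k in ("client_id", "client_secret") if k not in section]
            if missing:
                raise ValueError(f"{path}: '{kind}' section is missing {missing}")
            return kind

    raise ValueError(f"{path}: no 'installed' or 'web' section found")
```

Calling check_client_secrets() in the directory where you will run the script either returns the client type or raises an error that names what is missing.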

Step 2: Install Python

Nothing special to say here.

Step 3: Install pydrive2

pip install pydrive2

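Step 4: If you are behind the firewall, Python has to go through the proxy as well, so install PySocks, the package that provides the socks module. The socket module is part of the standard library and needs no installation. Skip this step if you are outside the firewall.

pip install PySocks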
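As a rough sketch of how the proxy settings might be kept in one place, the helper below parses a proxy address such as socks5://127.0.0.1:1080 into a host and port, then applies it with PySocks. The function names parse_socks_proxy and enable_socks_proxy, and the port 1080, are illustrative assumptions; use whatever SOCKS port your own proxy client exposes.

```python
from urllib.parse import urlsplit


def parse_socks_proxy(url):
    """Split a 'socks5://host:port' address into (host, port)."""
    parts = urlsplit(url)
    if parts.scheme not in ("socks5", "socks5h"):
        raise ValueError(f"unsupported proxy scheme: {parts.scheme!r}")
    if parts.hostname is None or parts.port is None:
        raise ValueError(f"proxy address needs a host and a port: {url!r}")
    return parts.hostname, parts.port


def enable_socks_proxy(url):
    """Route every new socket through the SOCKS5 proxy at `url`."""
    # Imported lazily so the parsing helper works even without PySocks installed.
    import socket
    import socks  # provided by the PySocks package

    host, port = parse_socks_proxy(url)
    socks.set_default_proxy(socks.SOCKS5, host, port)
    socket.socket = socks.socksocket
```

Calling enable_socks_proxy("socks5://127.0.0.1:1080") before any network request has the same effect as the proxy block at the top of the script in Step 5.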

Step 5: Fill in the placeholders as the comments describe, then run the Python script below in whatever way you prefer.
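The script asks for a folder ID, which appears in the folder's URL. If you would rather paste the whole URL, a small helper along these lines can pull the ID out. It assumes the two URL shapes commonly seen for shared folders (a /folders/<id> path segment, or an id= query parameter); the name folder_id_from_url and the sample ID are made up for illustration.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Matches the ID that follows ".../folders/" in a Drive folder URL.
_FOLDER_PATH = re.compile(r"/folders/([A-Za-z0-9_-]+)")


def folder_id_from_url(url):
    """Extract a Google Drive folder ID from a pasted folder URL."""
    match = _FOLDER_PATH.search(url)
    if match:
        return match.group(1)

    # Fall back to links of the form ".../open?id=<id>".
    query = parse_qs(urlsplit(url).query)
    if "id" in query:
        return query["id"][0]

    raise ValueError(f"no folder ID found in {url!r}")


# Hypothetical sample URL, not a real folder.
sample = "https://drive.google.com/drive/folders/1AbC_dEf-123?usp=sharing"
print(folder_id_from_url(sample))
```

The returned string is what goes into parent_folder_id.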

from pydrive2.auth import GoogleAuth
from pydrive2.drive import GoogleDrive
import os

########## Proxy users start here; comment this block out if you do not need a proxy ##########
import socks
import socket
socks.set_default_proxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", <replace with your vpn SOCKS port>)
socket.socket = socks.socksocket
################## End of the proxy block ###################

gauth = GoogleAuth()
gauth.LocalWebserverAuth()
drive = GoogleDrive(gauth)

# Set the ID of the Google Drive folder to download. You can find it in the folder's URL.
parent_folder_id = '<replace with google drive folder id>'

# Set the download path on the remote server, as seen from Bash.
parent_folder_dir = '<replace with the download path on remote server>'

if parent_folder_dir[-1] != '/':
  parent_folder_dir = parent_folder_dir + '/'

# Downloads still go through wget for now; that may change in the future.
# FILE_ID and FILE_NAME are placeholders that get substituted per file further down.
wget_text = '"wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate \'https://docs.google.com/uc?export=download&id=FILE_ID\' -O- | sed -rn \'s/.*confirm=([0-9A-Za-z_]+).*/\\1\\n/p\')&id=FILE_ID" -O FILE_NAME && rm -rf /tmp/cookies.txt"'


# Get the folder structure

file_dict = dict()
folder_queue = [parent_folder_id]
dir_queue = [parent_folder_dir]
cnt = 0

while len(folder_queue) != 0:
  current_folder_id = folder_queue.pop(0)
  file_list = drive.ListFile({'q': "'{}' in parents and trashed=false".format(current_folder_id)}).GetList()

  current_parent = dir_queue.pop(0)
  print(current_parent, current_folder_id)
  for file1 in file_list:
      file_dict[cnt] = dict()
      file_dict[cnt]['id'] = file1['id']
      file_dict[cnt]['title'] = file1['title']
      file_dict[cnt]['dir'] = current_parent + file1['title']

      if file1['mimeType'] == 'application/vnd.google-apps.folder':
          file_dict[cnt]['type'] = 'folder'
          file_dict[cnt]['dir'] += '/'
          folder_queue.append(file1['id'])
          dir_queue.append(file_dict[cnt]['dir'])
      else:
          file_dict[cnt]['type'] = 'file'

      cnt += 1

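import shlex  # used to quote paths so that names containing spaces survive the shell

with open('script.sh', 'w') as f:    # name of the generated Bash script
  for entry in file_dict.values():
    path = shlex.quote(entry['dir'])
    if entry['type'] == 'folder':
      # -p keeps mkdir from failing when the directory already exists
      f.write('mkdir -p ' + path + '\n')
    else:
      # [1:-1] strips the outer double quotes from the wget template
      f.write(wget_text[1:-1].replace('FILE_ID', entry['id']).replace('FILE_NAME', path) + '\n')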
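The traversal loop in the script is a breadth-first walk over the folder tree: each folder is listed once, and subfolders are queued for later. To make that easier to follow without touching the network, here is a minimal sketch of the same idea with an in-memory dictionary standing in for drive.ListFile. The names walk_folder, list_children and FAKE_DRIVE are invented for this example and are not part of PyDrive2.

```python
from collections import deque

FOLDER_MIME = "application/vnd.google-apps.folder"


def walk_folder(list_children, root_id, root_dir):
    """Yield (kind, file_id, path) tuples in breadth-first order.

    list_children is any callable that maps a folder ID to a list of
    dicts with 'id', 'title' and 'mimeType' keys.
    """
    if not root_dir.endswith("/"):
        root_dir += "/"
    queue = deque([(root_id, root_dir)])
    while queue:
        folder_id, folder_dir = queue.popleft()
        for item in list_children(folder_id):
            path = folder_dir + item["title"]
            if item["mimeType"] == FOLDER_MIME:
                path += "/"
                queue.append((item["id"], path))  # visit its children later
                yield ("folder", item["id"], path)
            else:
                yield ("file", item["id"], path)


# Hypothetical folder tree used in place of a real Drive listing.
FAKE_DRIVE = {
    "root": [
        {"id": "a", "title": "images", "mimeType": FOLDER_MIME},
        {"id": "b", "title": "readme.txt", "mimeType": "text/plain"},
    ],
    "a": [{"id": "c", "title": "0001.png", "mimeType": "image/png"}],
}

entries = list(walk_folder(lambda fid: FAKE_DRIVE.get(fid, []), "root", "/data"))
for kind, file_id, path in entries:
    print(kind, file_id, path)
```

Because a folder is always emitted before anything inside it, the generated script creates each directory before wget writes into it. With PyDrive2, list_children could wrap the same drive.ListFile(...).GetList() call the script uses.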

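Step 6: When the Python script above finishes, it writes a bash script named script.sh to the current directory. Create the download directory on the download server and set its permissions, then upload script.sh to the server and run it there.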