A GitHub Download Tool in Python

sherpahu

Article link: https://blog.csdn.net/sherpahu/article/details/81022575

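GitHub has no download button for individual files. I came across several download methods linked on Zhihu: gitzip requires double-clicking every item one by one and cannot download a whole folder in one go, and downzip sometimes fails to download (although it usually works well). Following the third method from that answer, which is to find the raw file address, I use a Python crawler here to automatically download every file and folder listed on a repository page.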

import requests
import re
from urllib.request import urlretrieve
import os
from tkinter import *
from tkinter.filedialog import askdirectory

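# Fetch the HTML of a page and return it for the regex parser
def get_html(url):
    r=requests.get(url,timeout=30)
    # Fail early on an HTTP error instead of parsing an error page
    r.raise_for_status()
    return r.text
# Parse the HTML with regular expressions to collect the links and titles of the listed entries, which are passed on to the download functions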
def get_url(html):
    urls=re.findall('<a class="js-navigation-open"[^>]+href=["\'](.*?)["\']',html,re.S|re.M)
    titles=re.findall('<a class="js-navigation-open" title="(.*?)"',html,re.S|re.M)
    return urls,titles

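# Decide whether an entry is a folder: True means folder, False means file. Warning: the check only looks for a name containing ".<word characters>", so it can misjudge (an extension-less file such as LICENSE counts as a folder)
def judge(title):
    pattern=r'[\w.]+\.\w+'
    return re.search(pattern,title) is None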

# Handle a folder: create it locally, then download its contents recursively
def folders(folderUrl,folderTitle,oldPath):
    print('Downloading folder '+folderTitle)
    savepath=os.path.join(str(oldPath),folderTitle)
    # exist_ok avoids an error when the folder is left over from an earlier run
    os.makedirs(savepath,exist_ok=True)
    html=get_html(folderUrl)
    urls,titles=get_url(html)
    for url,title in zip(urls,titles):
        oldurl='https://github.com'+url
        # Drop only the first '/blob' segment to build the raw file path
        url=url.replace('/blob/','/',1)
        print(title)
        if not judge(title):
            download('https://raw.githubusercontent.com'+url,title, savepath)
        else:
            folders(oldurl,title,savepath)
# Download a single file into savepath; folders are handled separately by folders()
def download(url,title, savepath='./'):
    def reporthook(block_num, block_size, total_size):
        # Show download progress; total_size is not positive when the server does not report a size
        if total_size>0:
            percent=min(100.0, block_num * block_size * 100.0 / total_size)
            print("\rdownloading: %5.1f%%" % percent, end="")
    filename=title
    target=os.path.join(savepath, filename)
    # Download only if the file does not exist yet
    if not os.path.isfile(target):
        print('Downloading data from %s' % url)
        urlretrieve(url, target, reporthook=reporthook)
        print('\nDownload finished!')
    else:
        print('File already exists!')
    # Get the file size
    filesize = os.path.getsize(target)
    # The size is reported in bytes; convert it to MB
    print('File size = %.2f MB' % (filesize/1024/1024))


# Tkinter GUI
root = Tk()
root.title('GitHub One-Click Downloader')
path = StringVar()
# Choose where to save the files
def selectPath():
    path.set(askdirectory())

# Download function called by the button: reads the URL and the save path, then downloads every listed entry
def guiDownload():
    achieve_url = e_url.get()
    html=get_html(achieve_url)
    urls,titles=get_url(html)
    # Read the path from the entry field, so clicking Download before choosing a path no longer fails; default to the current directory
    savepath=path.get() or '.'
    for url,title in zip(urls,titles):
        oldurl='https://github.com'+url
        # Drop only the first '/blob' segment to build the raw file address
        url=url.replace('/blob/','/',1)
        url='https://raw.githubusercontent.com'+url
        if not judge(title):
            download(url,title, savepath)
        else:
            folders(oldurl,title,savepath)

# First row: download URL label and input field
l_url =Label(root,text='Download URL')
l_url.grid(row=0,sticky=W)
e_url =Entry(root)
e_url.grid(row=0,column=1,sticky=E)

# Second row: target path label, path field, and path selection button
Label(root,text = "Target path:").grid(row = 1, column = 0)
Entry(root, textvariable = path).grid(row = 1, column = 1)
Button(root, text = "Choose path", command = selectPath).grid(row = 1, column = 2)

# Third row: download button; command binds the click to the download handler
b_download = Button(root,text='Download',command=guiDownload)
b_download.grid(row=2,column=0,sticky=E)

root.mainloop()
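As the warning on `judge` says, guessing from the name can go wrong: an extension-less file such as `LICENSE` is treated as a folder, and a folder named `v1.0` as a file. One possible alternative is to classify an entry by its link instead. The sketch below assumes that file links look like `/<user>/<repo>/blob/<branch>/<path>` and folder links like `/<user>/<repo>/tree/<branch>/<path>` (the script already relies on `/blob` appearing in file links), and that the page still exposes such links. The name `entry_kind` is hypothetical and not part of the original script.

```python
def entry_kind(href):
    """Classify a repository link as 'file', 'folder', or 'other'.

    Assumes hrefs of the form /<user>/<repo>/blob/<branch>/<path> for files
    and /<user>/<repo>/tree/<branch>/<path> for folders.
    """
    parts = href.strip('/').split('/')
    # parts[0] is the user, parts[1] the repository, parts[2] the link type
    if len(parts) >= 4 and parts[2] == 'blob':
        return 'file'
    if len(parts) >= 4 and parts[2] == 'tree':
        return 'folder'
    return 'other'


print(entry_kind('/user/repo/blob/master/LICENSE'))
print(entry_kind('/user/repo/tree/master/v1.0'))
```

With a check like this, the loops could branch on the link rather than on the title, so names with or without dots no longer matter.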

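The requests library fetches the page's HTML, and re matches it against regular expressions to extract the links of the files to download. Converting those links to their raw form gives the actual download addresses, and urlretrieve downloads the files.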
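One detail worth care in the raw conversion is that only the `blob` segment right after the repository name should be dropped, because a folder inside the repository may itself be called `blob`. The following is a minimal sketch of that step, assuming file links of the form `/<user>/<repo>/blob/<branch>/<path>`; the helper name `blob_to_raw` is hypothetical.

```python
def blob_to_raw(href):
    """Turn /<user>/<repo>/blob/<branch>/<path> into a raw download URL."""
    parts = href.strip('/').split('/')
    if len(parts) < 5 or parts[2] != 'blob':
        raise ValueError('not a file link: ' + href)
    # Drop only the third segment ('blob'); later 'blob' segments belong to the path
    del parts[2]
    return 'https://raw.githubusercontent.com/' + '/'.join(parts)


print(blob_to_raw('/user/repo/blob/master/docs/readme.md'))
```

The result can be passed straight to `download` as the `url` argument.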

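Finally, tkinter provides the graphical interface.

This small project served as a review of Python crawling and a first look at tkinter.

My freshman year was spent rushing from one thing to the next. I learned very little that was actually useful and a great deal of "dragon-slaying skills" (impressive but impractical). This summer I want to properly study the things I have long been interested in but never had time for.

But I digress.

In short, this project is only for practice. GitHub is slow (editing the hosts file did not seem to help much, and it is still sluggish), so when sites such as Downzip are working, I recommend getting the download link there and then downloading with a multi-threaded downloader such as IDM, which is usually good enough. If the connection is really too slow, you can also try a VPN (mine seems to be throttled and was even slower, which is frustrating).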

Next, I will try multithreading and multiprocessing to speed things up.
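As a starting point for that, here is a minimal sketch of concurrent downloads with a thread pool from the standard library. It is not part of the original script: the function `download_all`, its `fetch` parameter, and the default of 4 workers are all assumptions made for illustration. Threads are a reasonable first choice because the work is mostly waiting on the network, so multiple processes would probably add little.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlretrieve


def download_all(tasks, savepath, max_workers=4, fetch=urlretrieve):
    """Download (url, filename) pairs concurrently and return the saved paths.

    `fetch` is any callable taking (url, target_path); it defaults to
    urlretrieve and can be swapped out, for example in tests.
    """
    os.makedirs(savepath, exist_ok=True)

    def worker(task):
        url, filename = task
        target = os.path.join(savepath, filename)
        if not os.path.isfile(target):  # skip files that already exist
            fetch(url, target)
        return target

    # map() keeps the input order and re-raises any download error here
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, tasks))
```

A call might look like `download_all([(raw_url, title)], savepath)`, where the pairs are collected from the same loop that currently calls `download` one file at a time.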