Bilibili & YouTube Online Course Duration Statistics

Two approaches

  1. Manually copy the page elements from the browser
  2. Fetch the data automatically from the video link

1 Manually copying DOM elements

This approach uses regular expressions to turn the copied front-end elements into an .xls spreadsheet.
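As a rough illustration of the idea, the two-stage matching can be tried on a small fragment before running it on a real page. The sketch below is only an assumption-laden example: `sample_html` is made-up markup shaped like what the Bilibili patterns in this post expect (the real page may differ), and `extract` is a hypothetical helper name, not part of the script that follows.

```python
import re

# Hypothetical fragment shaped like the markup the Bilibili patterns expect;
# the real page structure may differ.
sample_html = (
    '<li><span class="part">P1 Introduction</span>'
    '<div class="duration">05:30</div></li>'
    '<li><span class="part">P2 Setup</span>'
    '<div class="duration">1:02:10</div></li>'
)


def extract(text, outer, inner, prefix, suffix):
    """Find each tagged fragment, then strip the surrounding markup."""
    results = []
    for fragment in re.findall(outer, text):
        match = re.search(inner, fragment)
        if match is None:
            continue  # the inner pattern does not cover this fragment
        cleaned = match.group(0).replace(prefix, "").replace(suffix, "")
        results.append(cleaned.strip())
    return results


titles = extract(sample_html, r'<span class="part">.+?</span>',
                 r'"part">.+</span>', '"part">', "</span>")
durations = extract(sample_html, r'class="duration">.+?</div>',
                    r'on">.+</div>', 'on">', "</div>")

# Pad mm:ss values to h:mm:ss so every row has the same shape.
durations = ["0:" + d if d.count(":") == 1 else d for d in durations]

print(list(zip(titles, durations)))
```

The outer pattern isolates one element including its tags; the inner pattern plus the two replacements then leave only the text. Padding `mm:ss` to `h:mm:ss` keeps short and long videos in the same format.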

Steps

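  1. Press F12, copy the markup of the whole episode-list component, and save it to a .txt file.

  2. Adjust the input and output paths, choose the regex configuration, and run the .py script below.

  3. To make the time column summable, apply Text to Columns to it in the spreadsheet. (In the Convert Text to Columns wizard, just click Finish.)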


import os
import re
import xlwt

book = xlwt.Workbook(encoding='utf-8',style_compression=0)
sheet = book.add_sheet('Name',cell_overwrite_ok=True)
savepath = 'D:\\Code\\PythonSE\\网课时长获取\\1.xls'
txt_path="D:\\Code\\PythonSE\\网课时长获取\\1.txt"

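# Set the txt and xls paths above, then pass the right parameter list
# to find_by_re_and_fill_sheet at the bottom of the script.
# Each list holds: [outer regex, inner regex, prefix to strip,
#                   suffix to strip, target column index].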

youtube_title=['<span id="video-title" .+\n.+\n.+span>',
               '.+\n.+</span>',
               "",
               "</span>",
               0]
youtube_time_span=['<span id="text" .+\n.+\n</span>',
                   '.+\n</span>',
                   "",
                   "</span>",
                   1]

bili_title=['<span class="part">.+?</span>',
            '"part">.+</span>',
            '"part">',
            "</span>",
            0]
bili_time_span=['class="duration">.+?</div>',
                'on">.+</div>',
                'on">',
                "</div>",
                1]



def find_by_re_and_fill_sheet(text, para_list, sheet):
    """Extract every match from text and write it into one column of sheet."""
    re_0, re_1, front, back, col_index = para_list

    row_count = 0
    # First pass: grab each element together with its surrounding tags.
    with_tag = re.findall(re_0, text, flags=0)
    for i in with_tag:
        # Second pass: narrow the fragment, then strip the leftover markup.
        without_tag = re.findall(re_1, i, flags=0)
        ret_str = without_tag[0].strip().replace("\n", "").replace(front, "").replace(back, "").strip()

        # Pad mm:ss durations to h:mm:ss so the whole time column has one format.
        if col_index == 1 and ret_str.count(":") == 1:
            ret_str = "0:" + ret_str
        sheet.write(row_count, col_index, ret_str)
        row_count += 1

# Read the saved markup in one go.
with open(txt_path, "r", encoding='utf-8') as or_file:
    my_str = or_file.read()


# Pick either the YT or the Bilibili parameter lists here
find_by_re_and_fill_sheet(my_str,bili_title,sheet)
find_by_re_and_fill_sheet(my_str,bili_time_span,sheet)

book.save(savepath)
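If the goal is only the total running time, the Text to Columns step in the spreadsheet can be skipped by summing the duration strings in Python. The following is a minimal sketch under that assumption; `to_seconds` and `to_hms` are hypothetical helper names and the sample values are made up.

```python
def to_seconds(hms):
    """Convert 'h:mm:ss' or 'mm:ss' text to a number of seconds."""
    total = 0
    for part in hms.split(":"):
        total = total * 60 + int(part)
    return total


def to_hms(seconds):
    """Format a number of seconds as h:mm:ss."""
    minutes, secs = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}:{minutes:02}:{secs:02}"


# Made-up durations in the same text form the script writes to column 1.
sample = ["0:05:30", "1:02:10", "47:15"]
total = sum(to_seconds(d) for d in sample)
print(to_hms(total))
```

Working in whole seconds avoids any dependence on how the spreadsheet interprets time-like text.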

2 Fetching automatically via the link

import requests
import re
import json
import xlwt

savepath = 'D:\\Code\\PythonSE\\网课时长获取\\12.xls'
url = 'https://www.bilibili.com/video/BV1PY411e7J6/?'  # replace with the URL of the page you want to request
Only_key_word_to_dfferentiat_scripts='title'

try:
    response = requests.get(url, timeout=10)
    # Stop on an HTTP error status (4xx/5xx); otherwise html_content
    # would be undefined further down.
    response.raise_for_status()
    html_content = response.text
except requests.exceptions.RequestException as e:
    raise SystemExit(f"Error while making the request: {e}")



ree='<script>.+?</script>'
scripts = re.findall(ree, html_content, flags=0)



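targe_script=''
incude_times=0
for script in scripts:
    if Only_key_word_to_dfferentiat_scripts in script:
        targe_script=script
        incude_times+=1
        if incude_times>1:
            # The keyword appears in more than one <script> tag, so they cannot
            # be told apart; the server-side page may have been updated.
            raise KeyError("keyword matched more than one <script> tag")

if not targe_script:
    raise LookupError("no <script> tag contains the keyword")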

reee='(?<==).+}'
json_data = re.findall(reee, targe_script, flags=0)[0]
# Greedy match: besides the JSON it may also capture trailing JS code,
# which is trimmed off below.


# Trim the JS code: stop at the "}" that balances the first "{".
# Note: braces inside string values are counted as well.
leftt=0
rightt=0
indexxx=0
for index,charr in enumerate(json_data):
    if charr=='{':
        leftt+=1
    elif charr=='}':
        rightt+=1
    if leftt and (leftt==rightt):
        indexxx=index
        break

json_data=json_data[:indexxx+1]



# At this point json_data is well-formed JSON
formatted_json = json.loads(json_data)
formatted_json=formatted_json['videoData']
formatted_json=formatted_json['pages']


def format_seconds(seconds):
    """Format a duration in seconds as h:mm:ss."""
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}:{minutes:02}:{seconds:02}"

part = [i["part"] for i in formatted_json]
duration = [format_seconds(i["duration"]) for i in formatted_json]




book = xlwt.Workbook(encoding='utf-8',style_compression=0)
sheet = book.add_sheet('Name',cell_overwrite_ok=True)



for index, (namee, timee) in enumerate(zip(part, duration)):
    sheet.write(index, 0, namee)
    sheet.write(index, 1, timee)

book.save(savepath)




# print(formatted_json)
# def func(obj):
#     for key in obj.keys():
#         print(key)
#         if(key=='pages'):
#             raise KeyError
#         if isinstance(obj[key],dict):
#             func(obj[key])



# # Text content to write
# content=json_data
# # File name and mode ('w' means write mode)
# file_name = "example.txt"

# # Open the file with a with statement so it is closed automatically afterwards
# with open(file_name, 'w',encoding='utf-8') as file:
#     file.write(content)

# print(f"Content has been written to {file_name} successfully.")
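The brace-counting loop in the script above also counts braces that occur inside string values, so a video title containing a lone `{` or `}` could cut the JSON at the wrong place. One possible alternative, sketched here with a made-up `captured` string standing in for the text matched after the `=` sign, is the standard library's `json.JSONDecoder.raw_decode`, which parses a single JSON value from the start of a string and reports where it ended.

```python
import json

# Made-up stand-in for the captured text: a JSON object followed by trailing
# JavaScript. The title deliberately contains a lone "}".
captured = (
    '{"videoData": {"pages": [{"part": "Set notation: }", "duration": 330}]}}'
    ';(function(){}())'
)

# raw_decode parses one JSON value starting at index 0 and returns the parsed
# object together with the index at which that value ended.
data, end = json.JSONDecoder().raw_decode(captured)

pages = data["videoData"]["pages"]
print(pages[0]["part"], pages[0]["duration"])
print(captured[end:])  # the JavaScript that was left over
```

Because the decoder understands JSON string syntax, braces inside titles do not affect where parsing stops. Note that `raw_decode` expects the JSON value to begin at the first character, which holds here as long as the captured text starts with `{`.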