Latest recommended article published on 2024-09-04 16:47:30

曾阿伦

Category: python. Tags: aws, pandas, python

Article link: https://blog.csdn.net/zlhblogs/article/details/130232460

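Approach:
1. Use s3fs to list every file and folder under the target directory.
2. Use a regular expression to keep only the lowest-level (day=...) directories.
3. Use fs.info to read each file's details and get its last-modified time.
4. Compare dates: a file whose modification date is later than yesterday has been updated.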
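
Steps 2 and 4 can be tried offline before touching S3. The following is a minimal sketch, not the script's own code: the helper names `is_day_dir` and `modified_after` are hypothetical, the pattern is anchored to the last path segment (the script's own pattern is unanchored), and a fixed UTC+8 offset is assumed as a stand-in for the Asia/Shanghai time zone.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC+8 offset stands in for Asia/Shanghai (no DST there).
CST = timezone(timedelta(hours=8))
# Assumption: the day directory is the last path segment, e.g. .../day=20230418
DAY_DIR = re.compile(r"day=\d+$")


def is_day_dir(path: str) -> bool:
    """Step 2: keep only paths whose last segment looks like day=<digits>."""
    return DAY_DIR.search(path.rstrip("/")) is not None


def modified_after(last_modified: datetime, cutoff: str) -> bool:
    """Step 4: compare the UTC+8 calendar date with a YYYYMMDD cutoff string."""
    return last_modified.astimezone(CST).strftime("%Y%m%d") > cutoff


# 17:30 UTC on 2023-04-18 is 01:30 on 2023-04-19 in UTC+8.
sample_utc = datetime(2023, 4, 18, 17, 30, tzinfo=timezone.utc)

print(is_day_dir("bucket/logs/day=20230418"))    # True
print(is_day_dir("bucket/logs/month=202304"))    # False
print(modified_after(sample_utc, "20230418"))    # True
```

Comparing zero-padded YYYYMMDD strings works because their lexicographic order matches chronological order.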

import re
import time
from datetime import datetime, timedelta

import pandas as pd
import pytz
import s3fs

# Do every date comparison in UTC+8 so that "yesterday" and the S3
# timestamps use the same calendar, whatever time zone the machine is in
local_tz = pytz.timezone('Asia/Shanghai')
yesterday = (datetime.now(local_tz) - timedelta(days=1)).strftime('%Y%m%d')
# yesterday = '19700101'

start_time = time.time()
fs = s3fs.S3FileSystem(client_kwargs={"region_name": "cn-northwest-1"})
updated_dirs = []
for parent, dirs, files in fs.walk(
        "s3://emr.s3.zlh.com/usr/zlh/_zt_user_log/"):
    # Only look at day=... directories
    if not re.search(r'day=\d', parent):
        continue
    for file in files:
        # fs.walk yields paths without the s3:// prefix, so add it back
        pathfile = f"s3://{parent}/{file}"
        file_info = fs.info(pathfile)
        # Treat LastModified as a UTC datetime
        last_modified_time = file_info['LastModified'].replace(tzinfo=None)
        utc_time = pytz.utc.localize(last_modified_time)
        # Convert the UTC time to UTC+8
        s3_time = utc_time.astimezone(local_tz).strftime('%Y%m%d')
        # Later than yesterday means the file was updated today. Using >
        # instead of == keeps the check usable when re-running after a failure.
        if s3_time > yesterday:
            # One updated file is enough: this day's data needs deduplication
            updated_dirs.append(f"s3://{parent}")
            break

# After the incremental load was optimized, the updated paths are stored
# automatically and no Python check is needed, so this script is deprecated.
if updated_dirs:
    df = pd.DataFrame({'Value': updated_dirs})
    df.to_json('s3://emr.s3.zlh.com/usr/zhanglh/flag_list/log.json', orient="records")

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Total elapsed time: {elapsed_time:.2f} seconds")
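
The same check can also be written as a pure function over a path-to-info mapping, which makes it testable without S3 access. This is a sketch under assumptions, not part of the original script: `updated_day_dirs` and the sample listing are hypothetical, each info dict is assumed to carry a timezone-aware `LastModified`, and a fixed UTC+8 offset stands in for Asia/Shanghai.

```python
import posixpath
import re
from datetime import datetime, timedelta, timezone

CST = timezone(timedelta(hours=8))  # assumption: fixed UTC+8 in place of Asia/Shanghai


def updated_day_dirs(listing: dict, cutoff: str) -> list:
    """Return sorted day=... directories that hold a file modified after cutoff (YYYYMMDD)."""
    updated = set()
    for path, info in listing.items():
        parent = posixpath.dirname(path)
        if not re.search(r"day=\d+", parent):
            continue  # skip files that are not under a day=... directory
        day = info["LastModified"].astimezone(CST).strftime("%Y%m%d")
        if day > cutoff:
            updated.add(f"s3://{parent}")
    return sorted(updated)


# Hypothetical listing: {path: info}, paths without the s3:// prefix.
listing = {
    "bucket/log/day=20230417/a.parquet": {
        "LastModified": datetime(2023, 4, 17, 2, 0, tzinfo=timezone.utc)},
    "bucket/log/day=20230418/b.parquet": {
        "LastModified": datetime(2023, 4, 18, 20, 0, tzinfo=timezone.utc)},
}
result = updated_day_dirs(listing, "20230418")
print(result)  # ['s3://bucket/log/day=20230418']
```

With s3fs, a mapping of this shape should be obtainable from `fs.find(path, detail=True)`; verify that against the installed s3fs version before relying on it.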