I. Uploading Local Files to Blob
1. Uploading directly from a program
1.1 Uploading a single file to Blob
- This approach uploads only one file at a time; to upload multiple files, call it in a loop.
- Create a `.env` file of environment variables holding the Blob connection details:
```
STORAGE_ACCOUNT_KEY = "<storage-account-key>"
ACCOUNT_NAME = "dl201lg"
CONNECTION_STRING = "DefaultEndpointsProtocol=https;AccountName=dl201lg;AccountKey=<account-key>;EndpointSuffix=core.windows.net"
CONTAINER_NAME = "raw"
```

- Create the upload script `upload_files.py`:

```python
import os
from azure.storage.blob import BlobServiceClient
from azure.core.exceptions import ResourceExistsError
from dotenv import load_dotenv

# Load environment variables from .env
load_dotenv()

def upload_file_to_blob(file_path: str, container_name: str, blob_name: str):
    # Read the connection string from the environment
    connection_string = os.getenv('CONNECTION_STRING')
    if not connection_string:
        raise ValueError("CONNECTION_STRING environment variable not set")
    # Create the Blob service client
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    # Get the container client
    container_client = blob_service_client.get_container_client(container_name)
    # Create the container if it does not already exist
    try:
        container_client.create_container()
    except ResourceExistsError:
        print(f"Container {container_name} already exists.")
    # Get the blob client
    blob_client = container_client.get_blob_client(blob_name)
    # Upload the file
    with open(file_path, "rb") as data:
        blob_client.upload_blob(data, overwrite=True)
    print(f"{file_path} uploaded to blob storage as {blob_name} in container {container_name}.")

# Configuration
file_path = r"D:\DE_NEW\dataset\dataset\2022-02-10\2022-02-10_BINS_XETR11.csv"
container_name = os.getenv('CONTAINER_NAME')  # container name
blob_name = "2022-02-10_BINS_XETR11.csv"      # name of the file in Blob storage

# Upload the file
upload_file_to_blob(file_path, container_name, blob_name)
```
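Since `upload_file_to_blob` handles a single file, uploading several files is just a loop over (local path, blob name) pairs. A minimal sketch of building such a batch; the paths are made up, and each pair would be passed to `upload_file_to_blob` in turn:

```python
import os

def plan_uploads(file_paths):
    """Map each local path to the blob name it will be stored under
    (here simply the file's base name)."""
    return [(path, os.path.basename(path)) for path in file_paths]

# Hypothetical batch; in practice each pair feeds upload_file_to_blob
batch = plan_uploads(["/data/a.csv", "/data/b.csv"])
for local_path, blob_name in batch:
    print(f"would upload {local_path} -> {blob_name}")
```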
1.2 Uploading multiple files to Blob
- Upload every file under a local folder to Blob, preserving the folder structure:
```python
import os
from azure.storage.blob import BlobServiceClient
from azure.core.exceptions import ResourceExistsError

def upload_folder_to_blob(local_folder_path: str, container_name: str, connection_string: str):
    # Create the Blob service client
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    # Get the container client
    container_client = blob_service_client.get_container_client(container_name)
    # Create the container if it does not already exist
    try:
        container_client.create_container()
    except ResourceExistsError:
        print(f"Container {container_name} already exists.")
    # Walk the folder and upload every file
    for root, _, files in os.walk(local_folder_path):
        for file_name in files:
            file_path = os.path.join(root, file_name)
            # Keep the relative path as the blob name to preserve the folder structure
            blob_name = os.path.relpath(file_path, local_folder_path).replace("\\", "/")
            blob_client = container_client.get_blob_client(blob_name)
            # Upload the file
            with open(file_path, "rb") as data:
                blob_client.upload_blob(data, overwrite=True)
            print(f"{file_path} uploaded to blob storage as {blob_name}.")

# Configuration
local_folder_path = r"D:\DE_NEW\dataset"      # local folder path
container_name = "your_container_name"        # Azure Blob container name
connection_string = "your_connection_string"  # Azure Blob storage connection string

# Upload every file in the folder
upload_folder_to_blob(local_folder_path, container_name, connection_string)
```
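The key line is `os.path.relpath(...).replace("\\", "/")`, which turns each file's path relative to the root folder into its blob name, so the folder hierarchy carries over into the container. A small local demonstration of that mapping, using a temporary directory:

```python
import os
import tempfile

# Build a tiny folder tree: <root>/a.csv and <root>/2022-02-10/b.csv
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "2022-02-10"))
for name in ["a.csv", os.path.join("2022-02-10", "b.csv")]:
    with open(os.path.join(root, name), "w") as f:
        f.write("x")

blob_names = []
for dirpath, _, files in os.walk(root):
    for file_name in files:
        file_path = os.path.join(dirpath, file_name)
        # Same rule as upload_folder_to_blob: relative path, forward slashes
        blob_names.append(os.path.relpath(file_path, root).replace("\\", "/"))

print(sorted(blob_names))  # ['2022-02-10/b.csv', 'a.csv']
```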
1.3 Uploading very large files
- Upload very large files by staging blocks; this approach supports files up to 20 GB:
```python
import uuid
from azure.storage.blob import BlobServiceClient, BlobBlock

class BlobUploader:
    def __init__(self, connection_string, container_name):
        self.blob_service_client = BlobServiceClient.from_connection_string(connection_string)
        self.container_client = self.blob_service_client.get_container_client(container_name)

    def upload_file_chunks(self, blob_file_path, local_file_path):
        '''Upload a large file to Blob by staging it in 4 MB blocks.'''
        try:
            blob_client = self.container_client.get_blob_client(blob_file_path)
            block_list = []
            chunk_size = 1024 * 1024 * 4  # 4 MB per block
            with open(local_file_path, 'rb') as f:
                while True:
                    read_data = f.read(chunk_size)
                    if not read_data:
                        break  # end of file
                    blk_id = str(uuid.uuid4())
                    blob_client.stage_block(block_id=blk_id, data=read_data)
                    block_list.append(BlobBlock(block_id=blk_id))
            # Commit all staged blocks, in order, to finalize the blob
            blob_client.commit_block_list(block_list)
            print(f"File {local_file_path} uploaded to {blob_file_path} successfully.")
        except Exception as err:
            print('Upload file error')
            print(err)

# Configuration
connection_string = "your_connection_string"  # Azure Blob storage connection string
container_name = "your_container_name"        # Azure Blob container name
blob_file_path = "your_blob_file_path"        # target blob path
local_file_path = "your_local_file_path"      # local file path

# Create a BlobUploader instance and upload the file
uploader = BlobUploader(connection_string, container_name)
uploader.upload_file_chunks(blob_file_path, local_file_path)
```
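The read loop above terminates cleanly because `f.read()` returns an empty bytes object at end of file, and committing the staged blocks in order reassembles the original bytes. The same chunking logic on an in-memory buffer, with a tiny chunk size for illustration:

```python
import io

chunk_size = 4                # tiny chunk size for the demo (real code uses 4 MB)
payload = b"0123456789abc"    # 13 bytes -> chunks of 4, 4, 4, 1

chunks = []
f = io.BytesIO(payload)
while True:
    read_data = f.read(chunk_size)
    if not read_data:
        break  # read() returns b'' at EOF
    chunks.append(read_data)

print(len(chunks))                  # 4
print(b"".join(chunks) == payload)  # True: joining the chunks restores the file
```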
1.4 Uploading with Azure Key Vault
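The idea here can be sketched as follows: instead of keeping the account key in `.env`, store it as a Key Vault secret and fetch it at runtime. The vault URL and secret name below are hypothetical, and `fetch_key_from_vault` requires the `azure-identity` and `azure-keyvault-secrets` packages plus valid Azure credentials:

```python
def build_connection_string(account_name: str, account_key: str) -> str:
    """Assemble a blob storage connection string from its parts."""
    return (
        "DefaultEndpointsProtocol=https;"
        f"AccountName={account_name};"
        f"AccountKey={account_key};"
        "EndpointSuffix=core.windows.net"
    )

def fetch_key_from_vault(vault_url: str, secret_name: str) -> str:
    """Read the storage account key from a Key Vault secret."""
    # Imported here so build_connection_string stays usable without the Azure SDK
    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient
    client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())
    return client.get_secret(secret_name).value

# Usage (requires Azure credentials; vault URL and secret name are hypothetical):
#   key = fetch_key_from_vault("https://my-vault.vault.azure.net", "storage-account-key")
#   conn = build_connection_string("dl201lg", key)
#   BlobServiceClient.from_connection_string(conn)
```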
2. Uploading with an ADF Copy activity
2.1 Granting the IR permission to read local files
Run PowerShell as administrator, change to the IR install directory C:\Program Files\Microsoft Integration Runtime\5.0\Shared, and run:
PS C:\Program Files\Microsoft Integration Runtime\5.0\Shared> .\dmgcmd.exe -DisableLocalFolderPathValidation
2.2 Configuring the copy pipeline
- Source settings: since the subfolder structure is not known in advance, enable recursive traversal and use a wildcard to select all CSV files for the copy.
- Sink settings: because these are raw files, configure the sink to preserve the original folder hierarchy.
3. Copying data with multiple threads: useful for large numbers of files, but this feature incurs extra charges.
4. This run uploaded more than 80 folders, about 3,000 files and 1.02 GB in total, and took about one hour.
II. Data ingestion
- One option is to generate a JSON configuration file from a YAML file, upload it to Blob, and have ADF read that JSON to drive the copies dynamically.
- Another is to maintain something like a watermark table that tracks all data movement.
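The watermark idea above can be sketched with a small table keyed by source name: each load reads only the partitions newer than the stored watermark, then advances it. A minimal sketch using sqlite3; the table name, columns, and sample dates are illustrative, not from the project:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE watermark (source TEXT PRIMARY KEY, last_loaded TEXT)")
conn.execute("INSERT INTO watermark VALUES ('airport_pdf', '2024-07-15')")

def rows_to_copy(conn, source, available_dates):
    """Return the partition dates newer than the stored watermark for this source."""
    (wm,) = conn.execute(
        "SELECT last_loaded FROM watermark WHERE source = ?", (source,)
    ).fetchone()
    return [d for d in available_dates if d > wm]

def advance_watermark(conn, source, new_value):
    """Record that everything up to new_value has been loaded."""
    conn.execute(
        "UPDATE watermark SET last_loaded = ? WHERE source = ?", (new_value, source)
    )

pending = rows_to_copy(conn, "airport_pdf", ["2024-07-14", "2024-07-15", "2024-07-16"])
print(pending)  # ['2024-07-16']
advance_watermark(conn, "airport_pdf", "2024-07-16")
```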
1. Moving each type of source data into the project's Blob
1.1 Converting PDF to CSV with Databricks
- Convert a roughly 69-page PDF into a CSV file:
```python
import tabula
from datetime import date
import pandas as pd

pdf_path = '/dbfs/mnt/source/airport/plan.pdf'
now_date = date.today()
output_path = '/dbfs/mnt/airline/raw/plan/2024-07-16/plan.csv'

# Extract every table in the PDF into a list of DataFrames
df_list = tabula.read_pdf(pdf_path, pages='all')

expected_columns = ["tailnum", "type", "manufacturer", "issue_date", "model",
                    "status", "aircraft_type", "engine_type", "year"]

# Drop each table's header row and force a consistent schema
df_list_processed = []
for df in df_list:
    df_no_head = df[1:].copy()
    df_no_head.columns = expected_columns
    df_list_processed.append(df_no_head)

combined_df = pd.concat(df_list_processed, ignore_index=True)
print(combined_df)
combined_df.to_csv(output_path, index=False)
```
- When reading Blob data from plain Python in Databricks, prefix the mount path with /dbfs so the local file system APIs can see it.
- When merging multiple PDF tables that refuse to combine into the expected columns, forcing a header onto each table makes the merge work.
- tabula output in Databricks could not be concatenated in one step; the list of DataFrames had to be unpacked and processed table by table.
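The header problem above is a general `pd.concat` behavior: tables whose column names disagree are joined into a ragged union of columns full of NaN, while tables forced onto the same header stack cleanly. A small illustration with made-up two-column tables:

```python
import pandas as pd

# Two extracted tables whose headers disagree, as tabula sometimes returns
t1 = pd.DataFrame({"tailnum": ["N1"], "type": ["A"]})
t2 = pd.DataFrame({"Tail Num": ["N2"], "Type": ["B"]})

# Naive concat: mismatched headers become a 4-column union full of NaN
ragged = pd.concat([t1, t2], ignore_index=True)
print(ragged.shape)  # (2, 4)

# Forcing the expected header onto each table lets the rows stack cleanly
expected_columns = ["tailnum", "type"]
fixed = []
for df in (t1, t2):
    df = df.copy()
    df.columns = expected_columns
    fixed.append(df)
combined = pd.concat(fixed, ignore_index=True)
print(combined.shape)  # (2, 2)
```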