Recommended: Data Engineering Principles and a Recommended Project Structure

Whether you are doing traditional machine learning, deep learning, or LLMs, nothing works without data. Yet in practice many data projects are chaotically organized and lack guidance and process. To do data engineering well, you need to follow a set of rules and establish a good project structure; only then will your data projects deliver far more value for the effort invested.

These 10 rules deserve careful study; a short code sketch illustrating a few of them follows the list.
Rule 1: Be organized from the start, and stay organized
Rule 2: Everything comes from somewhere, and raw data is immutable
Rule 3: Version control is basic professionalism
Rule 4: Notebooks (Jupyter) are for exploration, source files (.py) are for repetition
Rule 5: Tests and sanity checks prevent catastrophes
Rule 6: Fail loudly, fail fast
Rule 7: Project runs are fully automated from raw data to final outputs
Rule 8: Extract and centralize important parameters
Rule 9: Project runs are verbose by default and produce tangible artifacts
Rule 10: Start from the simplest possible end-to-end pipeline
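
For example, rules 2, 5, and 6 translate directly into code. The following is a minimal Python sketch (the file path data/raw/sales.csv and the expected columns are hypothetical): it loads raw data without ever writing back to it, sanity-checks the result, and fails loudly the moment something looks wrong:

import pandas as pd

# Rule 2: raw data is read-only; we never write back into data/raw/.
RAW_PATH = "data/raw/sales.csv"                     # hypothetical file
EXPECTED_COLUMNS = {"order_id", "date", "amount"}   # hypothetical schema

def load_raw_data() -> pd.DataFrame:
    df = pd.read_csv(RAW_PATH)
    # Rule 5: sanity checks catch silent corruption early.
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        # Rule 6: fail loudly and fast rather than limping along.
        raise ValueError(f"raw data is missing columns: {missing}")
    if df["amount"].lt(0).any():
        raise ValueError("negative amounts found in raw data")
    return df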

In addition, you should establish a logical, reasonably standardized, yet flexible project structure, as shown below.

For example, data always lives in data/: raw data in data/raw/, and the final cleaned versions used for analysis in data/processed/. Jupyter notebooks live in notebooks/, where a numbering scheme is encouraged to provide a sense of order. The project's .py code lives in the source module directory (src/, shown as {{ cookiecutter.module_name }} in the tree below), which notebooks can import from to encourage code reuse and standardization (a concrete sketch follows the directory tree). This sensible structure helps others understand, reproduce, and extend your analysis, and builds trust.

├── LICENSE            <- Open-source license if one is chosen
├── Makefile           <- Makefile with convenience commands like make data or make train
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default mkdocs project; see www.mkdocs.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short dash-delimited description, e.g.
│                         1.0-jqp-initial-data-exploration.
│
├── pyproject.toml     <- Project configuration file with package metadata for
│                         {{ cookiecutter.module_name }} and configuration for tools like black
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with pip freeze > requirements.txt
│
├── setup.cfg          <- Configuration file for flake8
│
└── {{ cookiecutter.module_name }}   <- Source code for use in this project.
    │
    ├── __init__.py    <- Makes {{ cookiecutter.module_name }} a Python module
    │
    ├── config.py      <- Store useful variables and configuration
    │
    ├── dataset.py     <- Scripts to download or generate data
    │
    ├── features.py    <- Code to create features for modeling
    │
    ├── modeling
    │   ├── __init__.py
    │   ├── predict.py <- Code to run model inference with trained models
    │   └── train.py   <- Code to train models
    │
    └── plots.py       <- Code to create visualizations
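
To make the reuse and parameter-centralization points concrete, here is a minimal sketch (assuming the module was generated with the hypothetical name my_analysis). Per rule 8, paths and parameters are defined once in config.py, and notebooks import them instead of redefining them inline:

# my_analysis/config.py — one place for important paths and parameters (rule 8)
from pathlib import Path

DATA_DIR = Path("data")
RAW_DATA_DIR = DATA_DIR / "raw"
PROCESSED_DATA_DIR = DATA_DIR / "processed"
RANDOM_SEED = 42

# In a notebook such as notebooks/1.0-abc-initial-data-exploration.ipynb:
#   from my_analysis.config import PROCESSED_DATA_DIR
#   import pandas as pd
#   df = pd.read_csv(PROCESSED_DATA_DIR / "train.csv")   # hypothetical file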

You can quickly scaffold this project structure by installing the cookiecutter-data-science package and running ccds; you can also point it at the template repository explicitly, as in the example session below:

pip install cookiecutter-data-science
ccds

 ccds https://github.com/drivendata/cookiecutter-data-science
project_name (project_name):My Analysis
repo_name (my_analysis):my_analysis
module_name (my_analysis):
author_name (Your name (or your organization/company/team)):Dat A. Scientist
description (A short description of the project.):This is my analysis of the data.
python_version_number (3.10):3.12
Select dataset_storage
1 - none
2 - azure
3 - s3
4 - gcs
Choose from [1/2/3/4] (1):3
bucket (bucket-name):s3://my-aws-bucket
aws_profile (default):
Select environment_manager
1 - virtualenv
2 - conda
3 - pipenv
4 - none
Choose from [1/2/3/4] (1):2
Select dependency_file
1 - requirements.txt
2 - environment.yml
3 - Pipfile
Choose from [1/2/3] (1):1
Select pydata_packages
1 - none
2 - basic
Choose from [1/2] (1):2
Select open_source_license
1 - No license file
2 - MIT
3 - BSD-3-Clause
Choose from [1/2/3] (1):2
Select docs
1 - mkdocs
2 - none
Choose from [1/2] (1):1
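
Once the prompts finish, ccds generates the repository. From there, the Makefile's convenience commands (noted in the directory tree above) give you one-command automation in the spirit of rule 7; for example, assuming the answers above:

cd my_analysis
make data    # run the dataset script to download or generate data
make train   # train models, if the generated Makefile defines this target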

In practice, you should keep refining your data project's process and structure to suit your actual situation, achieving a truly end-to-end, reproducible data generation process that meets the needs of data analysis and machine learning.
