Recommended: Data Engineering Principles and a Recommended Project Structure

Whether you are doing traditional machine learning, deep learning, or LLMs, nothing works without data. Yet in practice many data projects are chaotically organized and lack guidance and process. To do data engineering well, you need to follow a set of rules and establish a good project structure; only then will your data projects deliver far more value for the effort invested.

These 10 rules deserve careful study; a short code sketch illustrating a few of them follows the list.
Rule 1: Be organized from the start, and stay organized
Rule 2: Everything comes from somewhere, and raw data is immutable
Rule 3: Version control is basic professionalism
Rule 4: Notebooks (Jupyter) are for exploration, source files (.py) are for repetition
Rule 5: Tests and sanity checks prevent catastrophes
Rule 6: Fail loudly, fail fast
Rule 7: Project runs are fully automated from raw data to final outputs
Rule 8: Extract and centralize important parameters
Rule 9: Project runs are verbose by default and produce tangible artifacts
Rule 10: Start from the simplest possible end-to-end pipeline
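
For example, rules 2, 5, and 6 translate directly into code. The following is a minimal Python sketch (the file path data/raw/sales.csv and the expected columns are hypothetical): it loads raw data without ever writing back to it, sanity-checks the result, and fails loudly the moment something looks wrong:

import pandas as pd

# Rule 2: raw data is read-only; we never write back into data/raw/.
RAW_PATH = "data/raw/sales.csv"                     # hypothetical file
EXPECTED_COLUMNS = {"order_id", "date", "amount"}   # hypothetical schema

def load_raw_data() -> pd.DataFrame:
    df = pd.read_csv(RAW_PATH)
    # Rule 5: sanity checks catch silent corruption early.
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        # Rule 6: fail loudly and fast rather than limping along.
        raise ValueError(f"raw data is missing columns: {missing}")
    if df["amount"].lt(0).any():
        raise ValueError("negative amounts found in raw data")
    return df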

In addition, you should establish a logical, reasonably standardized, yet flexible project structure, as shown below.

For example, data always lives in data/: raw data in data/raw/, and the final cleaned versions used for analysis in data/processed/. Jupyter notebooks live in notebooks/, where a numbering scheme is encouraged to provide a sense of order. The project's .py code lives in the source module directory (src/, shown as {{ cookiecutter.module_name }} in the tree below), which notebooks can import from to encourage code reuse and standardization (a concrete sketch follows the directory tree). This sensible structure helps others understand, reproduce, and extend your analysis, and builds trust.

├── LICENSE            <- Open-source license if one is chosen
├── Makefile           <- Makefile with convenience commands like make data or make train
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default mkdocs project; see www.mkdocs.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short dash-delimited description, e.g.
│                         1.0-jqp-initial-data-exploration.
│
├── pyproject.toml     <- Project configuration file with package metadata for
│                         {{ cookiecutter.module_name }} and configuration for tools like black
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with pip freeze > requirements.txt
│
├── setup.cfg          <- Configuration file for flake8
│
└── {{ cookiecutter.module_name }}   <- Source code for use in this project.
    │
    ├── __init__.py    <- Makes {{ cookiecutter.module_name }} a Python module
    │
    ├── config.py      <- Store useful variables and configuration
    │
    ├── dataset.py     <- Scripts to download or generate data
    │
    ├── features.py    <- Code to create features for modeling
    │
    ├── modeling
    │   ├── __init__.py
    │   ├── predict.py <- Code to run model inference with trained models
    │   └── train.py   <- Code to train models
    │
    └── plots.py       <- Code to create visualizations
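
To make the reuse and parameter-centralization points concrete, here is a minimal sketch (assuming the module was generated with the hypothetical name my_analysis). Per rule 8, paths and parameters are defined once in config.py, and notebooks import them instead of redefining them inline:

# my_analysis/config.py — one place for important paths and parameters (rule 8)
from pathlib import Path

DATA_DIR = Path("data")
RAW_DATA_DIR = DATA_DIR / "raw"
PROCESSED_DATA_DIR = DATA_DIR / "processed"
RANDOM_SEED = 42

# In a notebook such as notebooks/1.0-abc-initial-data-exploration.ipynb:
#   from my_analysis.config import PROCESSED_DATA_DIR
#   import pandas as pd
#   df = pd.read_csv(PROCESSED_DATA_DIR / "train.csv")   # hypothetical file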

You can quickly scaffold this project structure by installing the cookiecutter-data-science package and running ccds; you can also point it at the template repository explicitly, as in the example session below:

pip install cookiecutter-data-science
ccds

 ccds https://github.com/drivendata/cookiecutter-data-science
project_name (project_name):My Analysis
repo_name (my_analysis):my_analysis
module_name (my_analysis):
author_name (Your name (or your organization/company/team)):Dat A. Scientist
description (A short description of the project.):This is my analysis of the data.
python_version_number (3.10):3.12
Select dataset_storage
1 - none
2 - azure
3 - s3
4 - gcs
Choose from [1/2/3/4] (1):3
bucket (bucket-name):s3://my-aws-bucket
aws_profile (default):
Select environment_manager
1 - virtualenv
2 - conda
3 - pipenv
4 - none
Choose from [1/2/3/4] (1):2
Select dependency_file
1 - requirements.txt
2 - environment.yml
3 - Pipfile
Choose from [1/2/3] (1):1
Select pydata_packages
1 - none
2 - basic
Choose from [1/2] (1):2
Select open_source_license
1 - No license file
2 - MIT
3 - BSD-3-Clause
Choose from [1/2/3] (1):2
Select docs
1 - mkdocs
2 - none
Choose from [1/2] (1):1
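
Once the prompts finish, ccds generates the repository. From there, the Makefile's convenience commands (noted in the directory tree above) give you one-command automation in the spirit of rule 7; for example, assuming the answers above:

cd my_analysis
make data    # run the dataset script to download or generate data
make train   # train models, if the generated Makefile defines this target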

In practice, you should keep refining your data project's process and structure to suit your actual situation, achieving a truly end-to-end, reproducible data generation process that meets the needs of data analysis and machine learning.
