Engineering Python Code for Data Analysis Projects: A Personal Summary

北丐安全

Last modified: 2023-12-20 15:07:17

Tags: data analysis, code conventions

First published: 2023-06-07 10:06:54

Article link: https://blog.csdn.net/ngadminq/article/details/131081798

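Team development

Committing with git

How to commit

First commit:

Before the first commit, configure your git user name and email, set up your account credentials, register an SSH key with the git host, and create the remote repository there.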

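git init

git add .

git commit -m "commit message"

git remote add origin REPOSITORY_URL

git push -u origin BRANCH_NAME

Replace REPOSITORY_URL and BRANCH_NAME with your own remote address and branch. A freshly initialized repository has no remote, so a bare git push fails until one is added.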

Subsequent commits:

git add .

git commit -m "commit message"

git push

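Things to watch

Keep the uploaded code lean and the project secure. A project that follows these conventions generates data files, environment configuration, logs, and similar artifacts, none of which belong in the repository.

Add a .gitignore file to manage them. Usage:

List the files and directories to ignore in the .gitignore file. For example, to ignore all .log files and the node_modules directory, enter the following: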

*.log
node_modules/

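Save and close the .gitignore file.
Open Git Bash or another terminal and change to the root directory of your git project.
Add the .gitignore file to the repository with the following command: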

git add .gitignore

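Below is a sample .gitignore file. It contains common rules for ignoring files and directories that a Python project does not need to track, such as compiled files, Python virtual environments, packaging output, and IDE-generated files. Adapt it to your project; it will not suit every Python project as-is.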

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so
*.pyd
*.dll

# Distribution / packaging
.Python
env/
build/
dist/
egg-info/
.eggs/
bin/
include/
lib/
local/
pip-wheel-metadata/
share/
var/
*.egg-info/
*.egg
MANIFEST

# PyCharm files
.idea/
*.iml
*.iws
*.ipr

# Jupyter Notebook
.ipynb_checkpoints/

# Unit test / coverage reports
htmlcov/
.coverage
.tox/

# Sphinx documentation
docs/_build/

# Django
*.log
*.pot
*.pyc
db.sqlite3
__pycache__/

# Flask
instance/
webapp.egg-info/

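🤔 Collaborative development

PyCharm Professional supports collaborative development.

Details

Module management

A project that needs to be divided up should contain the following packages:

Root directory

The root directory is the only place the program is run from. Keep it lean, with only a few files; it should contain at least the files below. Every .py file must begin with a docstring.

train.py

pred.py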
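
As an illustration of the docstring rule for root-level files, here is a minimal sketch of what the top of train.py might look like. The function name `main`, its `argv` parameter, and the printed message are placeholders chosen for this example, not part of any prescribed layout.

```python
"""train.py: entry point for model training (illustrative skeleton).

Usage:
    python train.py

Everything below is a placeholder; replace the body of main() with real training code.
"""
import sys
from typing import List, Optional


def main(argv: Optional[List[str]] = None) -> int:
    """Run training and return an exit code (0 means success)."""
    # Fall back to the real command line when no explicit list is given.
    argv = sys.argv[1:] if argv is None else argv
    print(f"training started with arguments: {argv}")
    return 0


if __name__ == "__main__":
    main()
```

pred.py would follow the same pattern: a module docstring that states what the file does and how to run it, followed by a single entry function.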

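data

/dataset category -- one folder per dataset category; dataset folder names may use English only

/script -- if the data comes from a download source, put the shell (.sh) download script in script

/description document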
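
The English-only naming rule for dataset folders can be checked automatically. The sketch below is one possible check, assuming that "English only" means ASCII letters, digits, underscores, and hyphens; the function names and the demo folder names are hypothetical.

```python
import re
import tempfile
from pathlib import Path

# Assumption: "English only" is read as ASCII letters, digits, "_" and "-".
ALLOWED_NAME = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_dataset_dir_name(name: str) -> bool:
    """Return True if the folder name satisfies the naming rule."""
    return ALLOWED_NAME.fullmatch(name) is not None


def find_invalid_dataset_dirs(data_root: Path) -> list:
    """Return the sorted names of sub-folders of data_root that break the rule."""
    return sorted(
        entry.name
        for entry in data_root.iterdir()
        if entry.is_dir() and not is_valid_dataset_dir_name(entry.name)
    )


# Demonstration on a throwaway directory with made-up folder names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("VisDrone", "script", "bad name"):
        (root / name).mkdir()
    invalid = find_invalid_dataset_dirs(root)
    print(invalid)
```

A check like this could run before each commit or at the start of train.py, so that a badly named folder is caught early.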

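Configuration

Configuration is handled by yaml and argparse. They differ as follows:

yaml can give every parameter a default value, such as the number of network layers or the choice of activation function. These are mostly model-internal parameters, plus the paths to the data used by train and test.

argparse usually exposes the handful of parameters that change often between train and test runs, such as the number of epochs, batch_size, and learning_rate.

argparse

In standard argparse usage, every argument should have a help description.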
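
A minimal sketch of such a parser is shown here. The argument names follow the epoch, batch_size, and learning_rate examples mentioned earlier; the default values and the function name `build_parser` are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the command-line parser; every argument carries a help string."""
    parser = argparse.ArgumentParser(description="Train a model (illustrative).")
    parser.add_argument("--epochs", type=int, default=100,
                        help="number of training epochs")
    parser.add_argument("--batch-size", type=int, default=32,
                        help="number of samples per training batch")
    parser.add_argument("--learning-rate", type=float, default=1e-3,
                        help="optimizer learning rate")
    return parser


# Passing an explicit list keeps the example reproducible;
# call parse_args() with no arguments to read the real command line.
args = build_parser().parse_args(["--epochs", "5", "--batch-size", "16"])
print(args.epochs, args.batch_size, args.learning_rate)
```

Running the script with -h prints each argument together with its help text, which is why the help strings are worth writing.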

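yaml

argparse receives command-line input, so it should take priority. If the same parameter appears in both argparse and the yaml file and the command line specifies it, the code runs with the command-line value.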
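
One way to get this precedence is to give every argparse argument a default of None and copy only the explicitly supplied values over the yaml defaults. The sketch below assumes that approach. The dictionary `yaml_defaults` stands in for the result of loading a yaml file (for example with PyYAML's `yaml.safe_load`), and all names are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser whose defaults are None, so unset options can be detected."""
    parser = argparse.ArgumentParser(description="Illustrative config override.")
    parser.add_argument("--epochs", type=int, default=None,
                        help="number of training epochs")
    parser.add_argument("--batch-size", type=int, default=None,
                        help="number of samples per training batch")
    return parser


def merge_config(yaml_defaults: dict, cli_args: argparse.Namespace) -> dict:
    """Start from the yaml defaults and let explicit CLI values win."""
    config = dict(yaml_defaults)
    for key, value in vars(cli_args).items():
        if value is not None:  # None means the option was not given
            config[key] = value
    return config


# Stand-in for the dictionary a yaml loader would return.
yaml_defaults = {"epochs": 100, "batch_size": 32, "activation": "relu"}
cli_args = build_parser().parse_args(["--epochs", "5"])
config = merge_config(yaml_defaults, cli_args)
print(config)
```

Here epochs comes from the command line, while batch_size and activation keep their yaml values. The None default is what makes the distinction possible: with a real default such as 100, an omitted option would be indistinguishable from an explicit one.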

Use yaml to configure the modules that need configuration, and place the yaml file in the root directory.

Below is a standard yaml file:

# YOLOv5 🚀 by Ultralytics, GPL-3.0 license
# VisDrone2019-DET dataset https://github.com/VisDrone/VisDrone-Dataset by Tianjin University
# Example usage: python train.py --data VisDrone.yaml
# parent
# ├── yolov5
# └── datasets
#     └── VisDrone  ← downloads here (2.3 GB)


# Train/val/test sets as 1) dir: path/to/imgs, 2) file: path/to/imgs.txt, or 3) list: [path/to/imgs1, path/to/imgs2, ..]
path: ../datasets/VisDrone  # dataset root dir
train: VisDrone2019-DET-train/images  # train images (relative to 'path')  6471 images
val: VisDrone2019-DET-val/images  # val images (relative to 'path')  548 images
test: VisDrone2019-DET-test-dev/images  # test images (optional)  1610 images

# Classes
names:
  0: pedestrian
  1: people
  2: bicycle
  3: car
  4: van
  5: truck
  6: tricycle
  7: awning-tricycle
  8: bus
  9: motor


# Download script/URL (optional) ---------------------------------------------------------------------------------------
download: |
  from utils.general import download, os, Path

  def visdrone2yolo(dir):
      from PIL import Image
      from tqdm import tqdm

      def convert_box(size, box):
          # Convert VisDrone box to YOLO xywh box
          dw = 1. / size[0]
          dh = 1. / size[1]
          return (box[0] + box[2] / 2) * dw, (box[1] + box[3] / 2) * dh, box[2] * dw, box[3] * dh

      (dir / 'labels').mkdir(parents=True, exist_ok=True)  # make labels directory
      pbar = tqdm((dir / 'annotations').glob('*.txt'), desc=f'Converting {dir}')
      for f in pbar:
          img_size = Image.open((dir / 'images' / f.name).with_suffix('.jpg')).size
          lines = []
          with open(f, 'r') as file:  # read annotation.txt
              for row in [x.split(',') for x in file.read().strip().splitlines()]:
                  if row[4] == '0':  # VisDrone 'ignored regions' class 0
                      continue
                  cls = int(row[5]) - 1
                  box = convert_box(img_size, tuple(map(int, row[:4])))
                  lines.append(f"{cls} {' '.join(f'{x:.6f}' for x in box)}\n")
                  with open(str(f).replace(os.sep + 'annotations' + os.sep, os.sep + 'labels' + os.sep), 'w') as fl:
                      fl.writelines(lines)  # write label.txt


  # Download
  dir = Path(yaml['path'])  # dataset root dir
  urls = ['https://github.com/ultralytics/yolov5/releases/download/v1.0/VisDrone2019-DET-train.zip',
          'https://github.com/ultralytics/yolov5/releases/download/v1.0/VisDrone2019-DET-val.zip',
          'https://github.com/ultralytics/yolov5/releases/download/v1.0/VisDrone2019-DET-test-dev.zip',
          'https://github.com/ultralytics/yolov5/releases/download/v1.0/VisDrone2019-DET-test-challenge.zip']
  download(urls, dir=dir, curl=True, threads=4)

  # Convert
  for d in 'VisDrone2019-DET-train', 'VisDrone2019-DET-val', 'VisDrone2019-DET-test-dev':
      visdrone2yolo(dir / d)  # convert VisDrone annotations to YOLO labels

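Project dependency management

For production releases, Poetry is recommended over requirements.txt because it simplifies deployment, for example by creating the virtual environment automatically.

Usage

Run poetry init to generate a basic pyproject.toml file. You can skip adding dependencies during this step, because all dependencies are added in the next step.
```
poetry init
```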
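Add a dependency:
```
poetry add package_name
```
This adds the specified library to the project's pyproject.toml file.
Remove a dependency:
```
poetry remove package_name
```
This removes the specified library from the project's pyproject.toml file.
Install the project's dependencies:
```
poetry install
```
This creates a virtual environment if one does not exist yet and installs the dependencies declared in pyproject.toml into it, using the versions pinned in poetry.lock when that file is present.
Update the project's dependencies:
```
poetry update
```
This resolves the newest versions allowed by the constraints in pyproject.toml, rewrites poetry.lock, and reinstalls the libraries in the virtual environment.
Run a script from the project:
```
poetry run python your_script.py
```
This runs the specified Python script inside the virtual environment.
Package the project:
```
poetry build
```

This creates a source distribution (.tar.gz) and a wheel (.whl) for the project.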

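Documentation

README

Every project needs at least one README, placed in the project root, that explains the project's structure, purpose, and how to run it.

If the code is tightly tied to domain mechanisms or the dataset is complex, add at least one more descriptive file to explain it.

If a module is complex, add a README under that module.

A README template follows.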

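# Project name

A short description of the project that states its purpose and goals.

## Table of contents

- [Background](#background)
- [Installation](#installation)
- [Quick start](#quick-start)
- [Dataset](#dataset)
- [Methods and techniques](#methods-and-techniques)
- [Results and visualization](#results-and-visualization)
- [References](#references)
- [License](#license)

## Background

Describe the project's background and motivation in detail, and explain why the project is valuable to the data analysis field.

## Project directory overview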

The ChatGPT Retrieval Plugin repository provides a flexible solution for semantic search and retrieval of personal or organizational documents using natural language queries. The repository is organized into several directories:

| Directory | Description |
| --- | --- |
| `datastore` | Contains the core logic for storing and querying document embeddings using various vector database providers. |
| `docs` | Includes documentation for setting up and using each vector database provider, webhooks, and removing unused dependencies. |
| `examples` | Provides example configurations, authentication methods, and provider-specific examples. |
| `models` | Contains the data models used by the plugin, such as document and metadata models. |
| `scripts` | Offers scripts for processing and uploading documents from different data sources. |
| `server` | Houses the main FastAPI server implementation. |
| `services` | Contains utility services for tasks like chunking, metadata extraction, and PII detection. |
| `tests` | Includes integration tests for various vector database providers. |
| `.well-known` | Stores the plugin manifest file and OpenAPI schema, which define the plugin configuration and API specification. |

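## Installation

List the dependencies the project requires and explain how to install them. For example, install from a requirements.txt file:

    pip install -r requirements.txt

## Quick start

Provide basic usage instructions, including how to run the code and how to use Jupyter Notebook. For example:

    # Clone the project repository
    git clone https://github.com/username/project_name.git

    # Enter the project directory
    cd project_name

    # Install dependencies
    pip install -r requirements.txt

    # Run the example script
    python example_script.py

## Dataset

Briefly describe the source, format, and characteristics of the dataset. Provide a download link if possible.

## Methods and techniques

List the data analysis, data processing, modeling, and evaluation methods used in the project, along with the techniques and libraries used to implement them.

## Results and visualization

Present the project's analysis results, including data visualizations and statistical analysis. Add images, tables, and code snippets where needed.

## References

List the reference material used in the project, including papers, books, blog posts, and tutorials.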

Logging configuration
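
The source gives no details for this heading, so the following is only a sketch of one common setup with Python's standard logging module: a single project logger that writes to both the console and a file. The logger name "project", the format string, and the file name train.log are assumptions made for illustration.

```python
import logging
import tempfile
from pathlib import Path


def setup_logging(log_file: Path, level: int = logging.INFO) -> logging.Logger:
    """Configure a project logger that writes to the console and to log_file."""
    logger = logging.getLogger("project")  # hypothetical logger name
    logger.setLevel(level)
    logger.handlers.clear()  # avoid duplicate handlers if called twice
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    handlers = (logging.StreamHandler(), logging.FileHandler(log_file, encoding="utf-8"))
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


# Demonstration with a temporary file; a real project would pick a fixed log path.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "train.log"
    logger = setup_logging(log_path)
    logger.info("training started")
    for handler in logger.handlers:
        handler.close()  # release the file before the directory is deleted
    logger.handlers.clear()
    log_text = log_path.read_text(encoding="utf-8")
```

Log files written this way match the `*.log` rule in the sample .gitignore, so they stay out of the repository.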

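docs (optional)

Data analysis projects are tightly coupled to their data and their application, so decide whether the project needs a docs folder for material such as background notes. Write these documents in Markdown.

Deciding whether to use Docker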

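Poetry is mainly for pure-Python projects and focuses on dependency management and project builds. It suits pure-Python applications such as libraries, CLI tools, or web applications that do not involve complex containerization, database management, or other special requirements. Poetry helps you manage dependencies, create virtual environments, and export a dependency list.

Docker is typically used for projects with multiple components, services, or dependencies, such as applications that include a database, a message queue, and a web server. Containers isolate, package, and deploy an application together with its dependencies, which makes Docker a good fit for more complex application stacks that must deploy easily across environments.

For most pure-Python projects, Poetry is probably a convenient and effective dependency manager. Once the project involves other services such as a database, a message queue, or a web server, Docker becomes useful: it packages the whole application and its dependencies into one portable container, which simplifies deployment and environment management. The two are often combined, with Poetry managing the Python dependencies so that the Python environment inside the container stays consistent with the project.

Docker is convenient and is the usual default, but the following situations may justify not using it:

High-performance computing needs
If the project needs large amounts of compute, and especially efficient direct access to GPUs (as with deep learning models), running directly on the physical machine may be more efficient than running in a container. Docker does support GPUs, but it may add configuration complexity and some performance overhead.

Simple or one-off scripts
For simple Python scripts that run once or rarely change, Docker may be overkill. Running them in the local environment or a virtual environment is simpler.

Severely resource-constrained environments
On very small physical or virtual machines, Docker's own resource usage may make it a poor fit.

Tight integration with local devices
If the project must integrate closely with local hardware, such as specific peripherals or interfaces, running directly on the host operating system may be simpler, because direct hardware access from inside a container adds complexity.

Security or compliance requirements
In some high-security or regulated environments, running directly on physical machines may be the better match. Docker provides isolation, but physical isolation is sometimes considered the safer option.

Simple local development environments
For projects that only use basic Python libraries, particularly in teaching or beginner settings, Docker may add unnecessary complexity.

Operating-system-specific dependencies
If the project depends on a specific operating system and those dependencies are hard to reproduce in a container, running on that operating system directly may be more appropriate.

Summary
When deciding whether to use Docker, weigh the project's specific needs, performance requirements, resource limits, security and compliance considerations, and the complexity of development and operations. If the project has no complex dependency management, does not need frequent iteration or deployment, and runs in a relatively simple and fixed environment, Docker is probably unnecessary.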
