Building a Pipeline from Scratch

Article link: https://blog.csdn.net/weixin_43197755/article/details/141262612

You're a dev, so why are you looking after a pipeline?

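After a few years of relentless CRUD, I finally became a successful feature developer. Then my TL came to me recently: "Hey, on an agile project like ours, writing features alone isn't quite enough. How about picking up a few pipeline cards? You'll probably end up doing this later anyway." My first thought was: I'm a backend dev in Shanghai, so why am I suddenly doing ops work? Besides, some of my friends look after pipeline deployments, and going by their complaints, touching a pipeline means signing up for: midnight releases 💻, the whole team watching 🌚, hot fixes in production ♨️, insisting on doing it by hand 🪬, resisting automation 🙅... It gave me a headache just hearing about it. I was about to politely decline when it occurred to me that DevOps has been hot for years now, and if I never touch it and stay a pure feature developer, what's the point? So, on the principle that anything worth doing is worth doing well, today I'm going to walk through the ins and outs of pipelines.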

So what's the background of this project?

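Our project's pipeline uses Apache Beam to define and run data processing jobs, and Buildkite to set up the CI/CD pipeline: on every commit it automatically runs tests and checks, builds a Docker image, and deploys the data processing job to Google Cloud Dataflow (GCD). So once development is done, whoever deploys only needs to pick the required parameters for the operations defined in code and run it (🤔 which doesn't leave much room for the do-it-by-hand crowd).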

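As the description above suggests, our project is deployed on GCD, so you can guess that it is tightly coupled with other Google Cloud Platform (GCP) services (Pub/Sub, BigQuery, Cloud Storage, and so on). This may differ a bit from the AWS stack, so AWS folks, please take this as reference only!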

Enough talk, on with the show!

We'll cover how to build a simple pipeline from scratch in three parts: Buildkite configuration, Docker configuration, and Dataflow.

1. Data checks and custom steps in Buildkite

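Buildkite is a pretty powerful CI/CD platform: we just write some YAML files that define the steps of the pipeline we want to build, and it runs. To keep whoever runs the pipeline from getting creative, and to make runs as smooth as possible, let's go through the attributes Buildkite supports: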

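Plugin system: plugins, a child attribute of a step under steps, integrates third-party plugins such as the docker, shellcheck and sonarqube plugins we commonly use. For example: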

  - name: 'Sonarqube scan'
    agents:
      queue: sonarqube-scanner-build:queue
    plugins:
      ssh://git@git.xxx/buildkite-plugin/sonarqube-buildkite-plugin#v1.8.1:
        projectkey: my-test-pipeline
        projectname: test/my_test

Here we added a sonarqube plugin for scanning; other plugins can be added the same way to the appropriate steps of the pipeline.

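Conditional execution: key + depends_on + command. An input step with a key collects the options. A later step names that key in depends_on, so it waits for the input, and the shell script given in its command checks the chosen values (Buildkite stores input fields as build meta-data) and decides which steps run under which conditions. For example: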

steps:
  - input: "Options for Update User Leaders"
    key: "update-options"
    fields:
      - text: "Backup Table Suffix e.g. '_BACKUP_TIMESTAMP'"
        key: "backup_table_suffix"
        required: true
      - text: "Branch Code e.g. 'Tech111'"
        key: "branch_code"
        required: true
      - select: "Branch manager update flag e.g. 'true' or 'false'"
        key: "is_update_branch_manager"
        options:
          - label: "True"
            value: true
          - label: "False"
            value: false
        required: true
      - select: "Branch leader update flag e.g. 'true' or 'false'"
        key: "is_update_branch_leader"
        options:
          - label: "True"
            value: true
          - label: "False"
            value: false
        required: true

  - label: "Check if related pipeline config required"
    depends_on: "update-options"
    command: '.buildkite/pipelines/update_user_leader/commands/user_leaders_pipeline_options_check'
    agents:
      queue: "pipeline-uploader"

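Here we defined two select options (alongside two required text fields) that represent the data processing operations we want to run: update the branch manager and/or update the branch leader. The step below them validates the pipeline options. Once the pipeline starts and the choices are made, it moves on to user_leaders_pipeline_options_check to check the parameters. That gives us conditional execution: if both are selected, the update-both steps run; otherwise only the selected one runs.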
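The check script itself isn't shown in this post. As a rough sketch of the decision it makes, here is a Python version. It assumes the choices are read from build meta-data under the field keys above and that each case has its own pipeline file; the three file names are made up for illustration.

```python
import subprocess

# Hypothetical pipeline files; the real project layout is not shown in the post.
PIPELINES = {
    (True, True): ".buildkite/pipelines/update_both.yml",
    (True, False): ".buildkite/pipelines/update_branch_manager.yml",
    (False, True): ".buildkite/pipelines/update_branch_leader.yml",
}


def choose_pipeline(update_manager: bool, update_leader: bool) -> str:
    """Return the single pipeline file to upload for the selected flags."""
    if not (update_manager or update_leader):
        raise ValueError("select at least one of branch manager / branch leader")
    return PIPELINES[(update_manager, update_leader)]


def read_flag(key: str) -> bool:
    """Read an input-step field from build meta-data via the agent CLI."""
    result = subprocess.run(
        ["buildkite-agent", "meta-data", "get", key],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip().lower() == "true"


def main() -> None:
    target = choose_pipeline(
        read_flag("is_update_branch_manager"),
        read_flag("is_update_branch_leader"),
    )
    # Upload exactly one follow-up pipeline so only one Dataflow job starts.
    subprocess.run(["buildkite-agent", "pipeline", "upload", target], check=True)
```

On an agent, the step's command would call main(). choose_pipeline() is kept separate so the three-way decision can be tested without a Buildkite agent.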

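Someone may wonder: ☝🏼️🤓 hey, why not use two if checks so the pipeline runs whatever follow-up matches our selection, instead of creating a separate YAML just for the both-selected case? The reason is that after all options are chosen there is a final step that uploads the pipeline and runs the data processing job. With two ifs up front, when both the branch leader and the branch manager are to be updated, the selection would try to run the YAML for both data processing jobs within the same build: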

- label: "Upload Dataflow run steps"
  depends_on: "update-branch-manager-options"
  command: '.buildkite/upload_run_dataflow'
  agents:
    queue: "pipeline-uploader"

- label: "Upload Dataflow run steps"
  depends_on: "update-branch-leader-options"
  command: '.buildkite/upload_run_dataflow'
  agents:
    queue: "pipeline-uploader"

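When that happens the pipeline stops (easy to picture: running several data processing jobs at the same time can easily overwrite data, much like non-repeatable reads in a database or concurrent updates across threads, and we can't accept that in production).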

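Custom steps: Buildkite's steps attribute lets us add whatever pipeline steps we need. This combines nicely with the plugin support described above and makes life much easier for developers. For example: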

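In the screenshot I not only defined the necessary check steps myself but also grouped them; steps under a group attribute are shown together in one tab when the pipeline runs.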
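As a sketch of what such a grouped layout can look like, the snippet below builds the steps in Python and prints them as JSON, which buildkite-agent pipeline upload accepts as well as YAML. The group name, labels and commands are invented for illustration and are not the project's real steps.

```python
import json


def build_check_group() -> dict:
    """Build a pipeline whose check steps share one group."""
    checks = [
        {"label": "Lint", "command": "ruff check ."},
        {"label": "Unit tests", "command": "python -m unittest discover"},
    ]
    return {
        "steps": [
            # A group step nests ordinary steps under one heading in the UI.
            {"group": "Checks", "key": "checks", "steps": checks},
            # Runs only after every step in the group has passed.
            {
                "label": "Build image",
                "command": "docker compose build",
                "depends_on": "checks",
            },
        ]
    }


print(json.dumps(build_check_group(), indent=2))
```

Generating the steps from code is handy when the list of checks depends on input; for a fixed list, the same structure written directly in YAML works just as well.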

2. Docker configuration

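On the projects I've worked on, we mostly build images with Docker. For a pipeline the Dockerfile is simple and direct: it only needs the dependencies the pipeline requires. For instance, our project runs on Google Cloud Platform (GCP) and uses Apache Beam for the data processing jobs, so the Dockerfile only needs to install those packages. Here is an example: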

FROM python:3.11-slim-bullseye

WORKDIR /root

# bash for run command
RUN apt-get update && apt-get -y install bash

#curl for reaching GCP CLI
RUN apt-get update && apt-get -y install curl

# Install GCP CLI
RUN curl -sSL https://sdk.cloud.google.com | bash
ENV PATH $PATH:/path/to/google-cloud-sdk/bin

# Install Java JRE for emulators
RUN apt-get update && apt-get install -y openjdk-11-jre

RUN gcloud components install beta
RUN gcloud components install cloud-datastore-emulator
RUN gcloud components install pubsub-emulator

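RUN pip install wheel
RUN pip install ruff
# Quote the extra so the shell does not treat the brackets as a glob
RUN pip install 'apache-beam[gcp]'

RUN mkdir /my_test
WORKDIR /my_test

# COPY sources are relative to the build context (the parent directory in docker-compose), so no ../
COPY test ./test

# Default to an interactive shell; "bash -c" without a command string exits with an error
CMD ["bash"]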

With this setup we can test in an isolated environment whether our data processing job works. And docker-compose is even simpler:

services:
 dev:
   build:
     context: ..
     dockerfile: .docker/Dockerfile-test
   image: my-test:latest

3. The data processing job

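As you can tell from the above, we implement the data processing job with Apache Beam. When writing the Python scripts, I have to mention what I see as the two core pieces of Apache Beam: DoFn and ParDo. They cover pretty much everything we do to process data. ParDo is a Beam transform that applies a DoFn to every element of a data collection (a PCollection), while DoFn is a class whose process method we usually implement to apply custom handling to each element. Take updating employees' direct leaders as an example: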

Script entry point

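Anyone who writes pipelines regularly knows that the script entry point only has to pass in the pipeline's configuration. To keep the code clear, the core method is usually called run():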

def run():
   pipeline_options = UpdateUserInfoPipelineOptions()
   pipeline = UpdateUserInfo()
   pipeline.run(pipeline_options)

if __name__ == '__main__':
   doSomeLog()
   run()

Reading the dataset and preparing the operations

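Once the pipeline starts working, the first thing to introduce is ParDo, a powerful transform that applies our custom processing to the data collection we need to handle: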

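import logging
from datetime import datetime

import apache_beam as beam
from apache_beam import Pipeline
from apache_beam.io.gcp.datastore.v1new.datastoreio import ReadFromDatastore, WriteToDatastore
from apache_beam.io.gcp.datastore.v1new.types import Query

# constants, UpdateUserInfoPipelineOptions and UpdateUserLeaderDoFn are project modules not shown here

class UpdateUserInfo: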
   def run(self, pipeline_options: UpdateUserInfoPipelineOptions):
       logging.info('Pipeline Options: %s' % pipeline_options)

       start_time = datetime.now()
       task_name = pipeline_options.task_name.get()

       backUpData()

       if task_name == constants.UPDATE_USER_LEADER:
           self.update_user_leader(pipeline_options)

       end_time = datetime.now()
       logging.info('Start: %s End: %s Duration: %s' % (start_time, end_time, end_time - start_time))

   def update_user_leader(self, pipeline_options):
       project_id = pipeline_options.project_id.get()
       backup_table_suffix = pipeline_options.backup_table_suffix.get()
       branch_code = pipeline_options.branch_code.get()

       is_update_BM = pipeline_options.is_update_branch_manager.get().lower() == 'true'
       source_BM_id = pipeline_options.source_branch_manager_id.get() if is_update_BM else None
       target_BM_id = pipeline_options.target_branch_manager_id.get() if is_update_BM else None

       is_update_BL = pipeline_options.is_update_branch_leader.get().lower() == 'true'
       source_BL_id = pipeline_options.source_branch_leader_id.get() if is_update_BL else None
       target_BL_id = pipeline_options.target_branch_leader_id.get() if is_update_BL else None

       doLogs()

       query = Query(project=project_id, kind=constants.CONTACT_KIND,
                     filters=[("Branch.BranchCode", "=", branch_code)])

       pipeline = Pipeline(options=pipeline_options)

       # Read data from Employee table
       ds_records = pipeline | 'Read Employee from Datastore' >> ReadFromDatastore(query=query)

       # Generate backup Employee and update Employee leader
       backup_and_updated_records = ds_records | 'Generate backup records and updated records' >> beam.ParDo(
           UpdateUserLeaderDoFn(
               kind=constants.CONTACT_KIND,
               backup_table_suffix=backup_table_suffix, branch_code=branch_code,
               is_update_branch_manager=is_update_BM, is_update_branch_leader=is_update_BL,
               source_branch_manager_id=source_BM_id, target_branch_manager_id=target_BM_id,
               source_branch_leader_id=source_BL_id, target_branch_leader_id=target_BL_id))

       # Backup Employee in Datastore
       backup_and_updated_records | 'Get backup entities' >> beam.Map(lambda x: x['backup_entity']) \
       | 'Write to Employee Backup Table' >> WriteToDatastore(project=project_id)

       # Get updated user leader in Datastore
       backup_and_updated_records | 'Get updated entities' >> beam.Map(lambda x: x['updated_entity']) \
       | 'Write to Employee Table' >> WriteToDatastore(project=project_id)

       pipeline.run().wait_until_finish()

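Here you can see that in update_user_leader() we use Apache Beam's ParDo to apply the core processing unit UpdateUserLeaderDoFn() to every user whose direct leader needs updating, which updates many records in one batch.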

Core processing logic

Inside the DoFn, we usually implement the process method to update the data:

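# Methods of UpdateUserLeaderDoFn (a beam.DoFn subclass); needs "import copy" at module level
def process(self, entity: Entity):
   # Work on a copy so the original entity stays available for the backup
   updated_entity = copy.deepcopy(entity)
   if self.is_update_branch_manager:
       updated_entity = self.update_branch_manager(updated_entity)

   if self.is_update_branch_leader:
       updated_entity = self.update_branch_leader(updated_entity)

   # Placeholder from the original post; backup_entity is presumably built in this elided step
   backupAndLog()

   yield {
       'backup_entity': backup_entity,
       'updated_entity': updated_entity
   }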

def update_branch_manager(self, entity: Entity):
   branches = entity.properties.get('Branch', [])
   for branch in branches:
       if (branch.get('BranchCode') == self.branch_code and
               branch.get('BranchManagerId') == self.source_branch_manager_id):
           branch['BranchManagerId'] = self.target_branch_manager_id
   return entity

def update_branch_leader(self, entity: Entity):
   branches = entity.properties.get('Branch', [])
   for branch in branches:
       if (branch.get('BranchCode') == self.branch_code and
               branch.get('BranchLeaderId') == self.source_branch_leader_id):
           branch['BranchLeaderId'] = self.target_branch_leader_id
   return entity

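In the DoFn's process method we decide which values to update based on the pipeline configuration, and yield a dict holding both the backup entity and the updated entity.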
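The matching rule is easy to check outside Beam. The sketch below replays the same branch-update logic on plain dicts standing in for entity.properties; the reassign() helper and the sample employee are made up for illustration.

```python
import copy


def reassign(entity_props, branch_code, field, source_id, target_id):
    """Return (backup, updated) copies; only matching branches change."""
    backup = copy.deepcopy(entity_props)
    updated = copy.deepcopy(entity_props)
    for branch in updated.get("Branch", []):
        # Same rule as update_branch_manager / update_branch_leader:
        # the branch code and the current id must both match.
        if branch.get("BranchCode") == branch_code and branch.get(field) == source_id:
            branch[field] = target_id
    return backup, updated


employee = {
    "Name": "amy",
    "Branch": [
        {"BranchCode": "Tech111", "BranchManagerId": "M1", "BranchLeaderId": "L1"},
        {"BranchCode": "Ops222", "BranchManagerId": "M1", "BranchLeaderId": "L7"},
    ],
}
backup, updated = reassign(employee, "Tech111", "BranchManagerId", "M1", "M2")
print(updated["Branch"][0]["BranchManagerId"])  # M2
print(updated["Branch"][1]["BranchManagerId"])  # M1 (different branch code)
print(backup == employee)  # True
```

The deep copies matter: a shallow copy would share the nested Branch dicts, so updating a branch would silently change the backup too.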