It turned out that NaN values had shown up in the data, causing the BigQuery inserts to fail. And when an insert fails, the default behavior is to retry immediately, over and over until it succeeds.
So ingestion gets stuck on the bad record, retrying until the end of time and never moving on, while the backlog of subsequent messages keeps growing.
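For context on why one NaN poisons the stream: Python's standard json module accepts and re-emits a bare NaN literal, which is not valid JSON, so BigQuery rejects the row. A minimal sketch of the failure mode and one way to scrub rows before writing; the drop_nan helper here is hypothetical, not part of the pipeline below.

import json
import math

row = json.loads('{"price": NaN}')  # Python's json module accepts the NaN literal by default...
print(json.dumps(row))              # ...and re-emits it as '{"price": NaN}' -- invalid JSON, BigQuery rejects it

def drop_nan(row: dict) -> dict:
    # Hypothetical helper: map NaN floats to None so BigQuery stores NULL instead of erroring.
    return {k: None if isinstance(v, float) and math.isnan(v) else v for k, v in row.items()}

print(json.dumps(drop_nan(row)))    # '{"price": null}'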
If you're using Pub/Sub's direct ingestion into BigQuery, there are no logs at all to look at for this problem. You can't even tell where it went wrong.
My suggestion is to put Dataflow in the middle: build your own Flex Template based on the Flex Template examples in the official docs. That way, when something breaks, at least there are logs to look at.
Here's a simple example that reads a Pub/Sub stream and writes it into BQ:
from __future__ import annotations
import argparse
import json
import logging
from typing import Any
import apache_beam as beam
from apache_beam.io.gcp.bigquery_tools import RetryStrategy
from apache_beam.options.pipeline_options import PipelineOptions
input_subscription = "xxx"  # e.g. "projects/<project>/subscriptions/<subscription>"
target_table = "xxx"        # e.g. "<project>:<dataset>.<table>"
def parse_json_message(message: str) -> dict[str, Any]:
    row = json.loads(message)
    return row
def run(beam_args: list[str] | None = None) -> None:
    options = PipelineOptions(beam_args, save_main_session=True, streaming=True)
    with beam.Pipeline(options=options) as pipeline:
        # Read data from Pub/Sub
        datas = (
            pipeline
            | "Read from Pub/Sub"
            >> beam.io.ReadFromPubSub(subscription=input_subscription).with_output_types(bytes)
            | "UTF-8 bytes to string" >> beam.Map(lambda msg: msg.decode("utf-8"))
            | "Parse JSON messages" >> beam.Map(parse_json_message)
        )
        # Write everything to the BQ table
        _ = (
            datas
            | "Write to BQ Table"
            >> beam.io.WriteToBigQuery(
                target_table,
                ignore_unknown_columns=True,
                # The default is RETRY_ALWAYS, which is exactly what causes the
                # stuck-forever behavior; RETRY_ON_TRANSIENT_ERROR gives up on bad rows.
                insert_retry_strategy=RetryStrategy.RETRY_ON_TRANSIENT_ERROR,
            )
        )
if __name__ == "__main__":
    logging.getLogger().setLevel(logging.INFO)
    parser = argparse.ArgumentParser()
    args, beam_args = parser.parse_known_args()
    run(beam_args=beam_args)
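One more thing worth wiring up: with RETRY_ON_TRANSIENT_ERROR, rows that fail for non-transient reasons (like the NaN case) come out on a separate failed-rows output instead of being retried forever, and you can log them so the poison record actually shows up in the Dataflow worker logs. A minimal sketch, assuming the streaming-inserts path; Beam's docs expose the failed rows under the "FailedRows" key (newer Beam versions also offer attributes like failed_rows on the write result).

# Inside the `with beam.Pipeline(...)` block, keep the write result instead of discarding it:
write_result = (
    datas
    | "Write to BQ Table"
    >> beam.io.WriteToBigQuery(
        target_table,
        ignore_unknown_columns=True,
        insert_retry_strategy=RetryStrategy.RETRY_ON_TRANSIENT_ERROR,
    )
)
# Rows BigQuery rejected are emitted here; log them rather than blocking the stream.
_ = (
    write_result["FailedRows"]
    | "Log failed rows"
    >> beam.Map(lambda row: logging.error("Failed insert: %s", row))
)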