In this post we convert a JSON file into a CSV file, using GCP Dataflow and Python.
First we need to create a service account, which represents the program at runtime. Give it the "Dataflow Worker" role, and export a JSON key for later use.
Click download to save the service account's JSON key file locally.
The JSON key of the service account we created is important because we will operate from a local terminal rather than Cloud Shell. So after downloading the service account JSON, we need to set it as an environment variable in the local terminal shell.
# export GOOGLE_APPLICATION_CREDENTIALS="path of the json from download GCP"
# path example: "/Users/afragotsis/Downloads/my_project.json"
export GOOGLE_APPLICATION_CREDENTIALS="/home/zhanglingfei/Downloads/dataflow_job_steven/PythonDataflow/gentle-dominion-245803-61d0799416fa.json"
For instance, the JSON we parse has the following format:
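The original post does not reproduce the payload, but judging from the fields the pipeline extracts below, a record presumably looks something like this (all values are hypothetical, for illustration only):

```json
{
  "product": {
    "id": 632910392,
    "title": "IPod Nano - 8GB",
    "vendor": "Apple",
    "product_type": "Cult Products",
    "created_at": "2020-01-14T10:00:00-05:00",
    "updated_at": "2020-03-02T12:30:00-05:00",
    "options": [
      {"id": 594680422, "name": "Color"},
      {"id": 594680423, "name": "Size"}
    ]
  }
}
```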
Now we can write the pipeline.
Stage 1:
This stage reads the JSON file directly from the GCS bucket, using smart_open.
import json

import apache_beam as beam
from google.cloud import storage
from smart_open import open


class ReadFile(beam.DoFn):

    def __init__(self, input_path):
        self.input_path = input_path

    def start_bundle(self):
        self.client = storage.Client()

    def process(self, something):
        clear_data = []
        with open(self.input_path) as fin:
            # Unlike the original article, we parse the whole file at once
            # instead of looping line by line (for line in fin); this change
            # works here but would still need testing against large JSON inputs.
            data = json.load(fin)
            product = data.get('product')
            if product and product.get('id'):
                product_id = str(product.get('id'))
                title = product.get('title')
                vendor = product.get('vendor')
                product_type = product.get('product_type')
                updated_at = product.get('updated_at')
                created_at = product.get('created_at')
                product_options = product.get('options')
                option_ids = []
                if product_options:
                    for option in product_options:
                        option_ids.append(option.get('id'))
                clear_data.append([product_id, title, vendor, product_type,
                                   updated_at, created_at, option_ids])
        yield clear_data
The yield at the end of the code above passes the array we created in this stage on to the next stage of the pipeline.
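The per-record extraction inside ReadFile can be sanity-checked locally without Beam or GCS. A minimal sketch, using a hypothetical record in the shape the DoFn expects:

```python
import json

# Hypothetical input record mirroring the fields ReadFile extracts
raw = json.dumps({
    "product": {
        "id": 632910392,
        "title": "IPod Nano - 8GB",
        "vendor": "Apple",
        "product_type": "Cult Products",
        "updated_at": "2020-03-02",
        "created_at": "2020-01-14",
        "options": [{"id": 1}, {"id": 2}],
    }
})

data = json.loads(raw)
product = data.get('product')
row = None
if product and product.get('id'):
    # Same field order as the pipeline's clear_data rows
    option_ids = [o.get('id') for o in (product.get('options') or [])]
    row = [str(product.get('id')), product.get('title'),
           product.get('vendor'), product.get('product_type'),
           product.get('updated_at'), product.get('created_at'), option_ids]

print(row)  # a 7-element row, with the id coerced to str
```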
Stage 2:
This stage takes the cleaned-up array as input, then transforms the rows and writes them as a CSV file to the output bucket using pandas.
import pandas as pd
from google.cloud import storage


class WriteCSVFIle(beam.DoFn):

    def __init__(self, bucket_name):
        self.bucket_name = bucket_name

    def start_bundle(self):
        self.client = storage.Client()

    def process(self, mylist):
        df = pd.DataFrame(mylist, columns=['product_id', 'title', 'vendor',
                                           'product_type', 'updated_at',
                                           'created_at', 'option_ids'])
        bucket = self.client.get_bucket(self.bucket_name)
        bucket.blob("csv_exports.csv").upload_from_string(
            df.to_csv(index=False), 'text/csv')
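One detail worth noting: pandas' DataFrame `columns` argument expects a list of names; the original dict-of-types happens to work because pandas iterates the dict's keys, but a plain list is clearer. The CSV serialization itself can be checked locally without GCS (the sample row below is hypothetical):

```python
import pandas as pd

columns = ['product_id', 'title', 'vendor', 'product_type',
           'updated_at', 'created_at', 'option_ids']

# A hypothetical row in the shape ReadFile yields
rows = [['632910392', 'IPod Nano - 8GB', 'Apple', 'Cult Products',
         '2020-03-02', '2020-01-14', [1, 2]]]

df = pd.DataFrame(rows, columns=columns)
csv_text = df.to_csv(index=False)

# The first line of the CSV is the header row
print(csv_text.splitlines()[0])
# → product_id,title,vendor,product_type,updated_at,created_at,option_ids
```

Note that the list-valued `option_ids` column is serialized as its string form (e.g. "[1, 2]"), which to_csv quotes because it contains a comma.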
Finally, write the actual Dataflow pipeline and save the file as DataflowTest.py.
class DataflowOptions(PipelineOptions):

    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument('--input_path', type=str, default='gs://steven_dataflow_temp/input.json')
        parser.add_argument('--output_bucket', type=str, default='steven_dataflow_temp')


def run(argv=None):
    parser = argparse.ArgumentParser()
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_options = PipelineOptions(pipeline_args)
    dataflow_options = pipeline_options.view_as(DataflowOptions)
    with beam.Pipeline(options=pipeline_options) as pipeline:
        (pipeline
         | 'Start' >> beam.Create([None])
         | 'Read JSON' >> beam.ParDo(ReadFile(dataflow_options.input_path))
         | 'Write CSV' >> beam.ParDo(WriteCSVFIle(dataflow_options.output_bucket))
        )


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
Full code (note: the original author's code no longer works because of library version issues; the code below has been debugged and runs):
import argparse
import json
import logging

import apache_beam as beam
import pandas as pd
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import storage
from smart_open import open


class ReadFile(beam.DoFn):

    def __init__(self, input_path):
        self.input_path = input_path

    def start_bundle(self):
        self.client = storage.Client()

    def process(self, something):
        clear_data = []
        with open(self.input_path) as fin:
            # for line in fin:
            #     data = json.loads(line)
            data = json.load(fin)
            product = data.get('product')
            if product and product.get('id'):
                product_id = str(product.get('id'))
                title = product.get('title')
                vendor = product.get('vendor')
                product_type = product.get('product_type')
                updated_at = product.get('updated_at')
                created_at = product.get('created_at')
                product_options = product.get('options')
                option_ids = []
                if product_options:
                    for option in product_options:
                        option_ids.append(option.get('id'))
                clear_data.append([product_id, title, vendor, product_type,
                                   updated_at, created_at, option_ids])
        yield clear_data


class WriteCSVFIle(beam.DoFn):

    def __init__(self, bucket_name):
        self.bucket_name = bucket_name

    def start_bundle(self):
        self.client = storage.Client()

    def process(self, mylist):
        df = pd.DataFrame(mylist, columns=['product_id', 'title', 'vendor',
                                           'product_type', 'updated_at',
                                           'created_at', 'option_ids'])
        bucket = self.client.get_bucket(self.bucket_name)
        bucket.blob("csv_exports.csv").upload_from_string(
            df.to_csv(index=False), 'text/csv')


class DataflowOptions(PipelineOptions):

    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument('--input_path', type=str, default='gs://steven_dataflow_temp/input.json')
        parser.add_argument('--output_bucket', type=str, default='steven_dataflow_temp')


def run(argv=None):
    parser = argparse.ArgumentParser()
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_options = PipelineOptions(pipeline_args)
    dataflow_options = pipeline_options.view_as(DataflowOptions)
    with beam.Pipeline(options=pipeline_options) as pipeline:
        (pipeline
         | 'Start' >> beam.Create([None])
         | 'Read JSON' >> beam.ParDo(ReadFile(dataflow_options.input_path))
         | 'Write CSV' >> beam.ParDo(WriteCSVFIle(dataflow_options.output_bucket))
        )


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
We also need to create a requirements.txt file listing the libraries used to build the pipeline.
google-cloud-storage==1.28.1
pandas==1.0.3
smart-open==2.0.0
Missing libraries
It turns out that besides the dependencies above, we also need some libraries that are not mentioned in the original article.
First make sure virtualenv is available, and create a virtual environment directory:
virtualenv PythonDataflow
mv *.* PythonDataflow/  # move all project files into the venv directory
source ./bin/activate   # activate (run from inside PythonDataflow)
deactivate              # deactivate the venv when needed
Run the following commands inside the PythonDataflow venv directory we just created:
pip install google-apitools
pip install apache-beam
Only at this point are all the required libraries installed, and because we are inside the virtual environment, the default interpreter is Python 3.x.
Before generating the template files, we need to create the temp and staging folders in the GCS bucket, as shown below.
After all the steps above are set, we are ready to generate the Dataflow template by running the following command in the terminal, from the folder containing DataflowTest.py.
# from the path of DataflowTest.py
python3 DataflowTest.py \
--runner DataflowRunner \
--project gentle-dominion-245803 \
--staging_location gs://steven_dataflow_staging/ \
--temp_location gs://steven_dataflow_temp/ \
--template_location gs://steven_dataflow_template/MyTemplate \
--requirements_file requirements.txt \
--save_main_session True
Alternatively, using a setup.py instead of requirements.txt:
python DataflowTest.py \
--runner DataflowRunner \
--project gentle-dominion-245803 \
--staging_location gs://steven_dataflow_staging/ \
--temp_location gs://steven_dataflow_temp/ \
--template_location gs://steven_dataflow_template/MyTemplate \
--setup_file /home/zhanglingfei/Downloads/dataflow_job_steven/PythonDataflow/PythonDataflow/setup.py \
--save_main_session True
We actually used the second approach, with setup.py, which ran successfully, although it produced some output I have not fully figured out.
While the job runs, the Dataflow engine uses the staging and temp locations in the GCS bucket. The template location is the bucket where the template will be stored.
After running successfully, we can find the template file in GCS bucket:
Once we have the template generated, we are ready to run the Dataflow job. Select Dataflow in the Google Console, then Create Job from Template. Give it the name "my_test_job", and under Dataflow Template select Custom Template. Browse and select the template from its GCS bucket path, and also provide the path URI of the temp folder.
Click Run; once the job is running, we will see the flow in the Dataflow execution console, and afterwards the CSV file in the temp folder.
Starting status, while the flow is being analyzed:
Running status:
Completed status:
The full project code is available on GitHub: https://github.com/aFrag/PythonDataflow