In this post we convert a JSON file into a CSV file, using GCP Dataflow and Python.
First we need to create a service account, which represents the program at runtime. Give it the "Dataflow Worker" role, and export a JSON key for later use.
Click download to save the service account's JSON key file locally.
The JSON key of the service account we created is important because we will operate from a local terminal rather than Cloud Shell. So after downloading the service account JSON, we need to set it as an environment variable in the local terminal shell.
# export GOOGLE_APPLICATION_CREDENTIALS="path of the json from download GCP"
# path example: "/Users/afragotsis/Downloads/my_project.json"
export GOOGLE_APPLICATION_CREDENTIALS="/home/zhanglingfei/Downloads/dataflow_job_steven/PythonDataflow/gentle-dominion-245803-61d0799416fa.json"
For instance, the JSON we parse has the following format:
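The original post does not reproduce the payload, but judging from the fields the pipeline extracts below, a record presumably looks something like this (all values are hypothetical, for illustration only):

```json
{
  "product": {
    "id": 632910392,
    "title": "IPod Nano - 8GB",
    "vendor": "Apple",
    "product_type": "Cult Products",
    "created_at": "2020-01-14T10:00:00-05:00",
    "updated_at": "2020-03-02T12:30:00-05:00",
    "options": [
      {"id": 594680422, "name": "Color"},
      {"id": 594680423, "name": "Size"}
    ]
  }
}
```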
Now we can write the pipeline.
Stage 1:
This stage reads the JSON file directly from the GCS bucket, using smart_open.
import json

import apache_beam as beam
from google.cloud import storage
from smart_open import open


class ReadFile(beam.DoFn):

    def __init__(self, input_path):
        self.input_path = input_path

    def start_bundle(self):
        self.client = storage.Client()

    def process(self, something):
        clear_data = []
        with open(self.input_path) as fin:
            # Unlike the original article, we parse the whole file at once
            # instead of looping line by line (for line in fin); this change
            # works here but would still need testing against large JSON inputs.
            data = json.load(fin)
            product = data.get('product')
            if product and product.get('id'):
                product_id = str(product.get('id'))
                title = product.get('title')
                vendor = product.get('vendor')
                product_type = product.get('product_type')
                updated_at = product.get('updated_at')
                created_at = product.get('created_at')
                product_options = product.get('options')
                option_ids = []
                if product_options:
                    for option in product_options:
                        option_ids.append(option.get('id'))
                clear_data.append([product_id, title, vendor, product_type,
                                   updated_at, created_at, option_ids])
        yield clear_data
The yield at the end of the code above passes the array we created in this stage on to the next stage of the pipeline.
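The per-record extraction inside ReadFile can be sanity-checked locally without Beam or GCS. A minimal sketch, using a hypothetical record in the shape the DoFn expects:

```python
import json

# Hypothetical input record mirroring the fields ReadFile extracts
raw = json.dumps({
    "product": {
        "id": 632910392,
        "title": "IPod Nano - 8GB",
        "vendor": "Apple",
        "product_type": "Cult Products",
        "updated_at": "2020-03-02",
        "created_at": "2020-01-14",
        "options": [{"id": 1}, {"id": 2}],
    }
})

data = json.loads(raw)
product = data.get('product')
row = None
if product and product.get('id'):
    # Same field order as the pipeline's clear_data rows
    option_ids = [o.get('id') for o in (product.get('options') or [])]
    row = [str(product.get('id')), product.get('title'),
           product.get('vendor'), product.get('product_type'),
           product.get('updated_at'), product.get('created_at'), option_ids]

print(row)  # a 7-element row, with the id coerced to str
```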
Stage 2:
This stage takes the cleaned-up array as input, then transforms the rows and writes them as a CSV file to the output bucket using pandas.
import pandas as pd
from google.cloud import storage


class WriteCSVFIle(beam.DoFn):

    def __init__(self, bucket_name):
        self.bucket_name = bucket_name

    def start_bundle(self):
        self.client = storage.Client()

    def process(self, mylist):
        df = pd.DataFrame(mylist, columns=['product_id', 'title', 'vendor',
                                           'product_type', 'updated_at',
                                           'created_at', 'option_ids'])
        bucket = self.client.get_bucket(self.bucket_name)
        bucket.blob("csv_exports.csv").upload_from_string(
            df.to_csv(index=False), 'text/csv')
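One detail worth noting: pandas' DataFrame `columns` argument expects a list of names; the original dict-of-types happens to work because pandas iterates the dict's keys, but a plain list is clearer. The CSV serialization itself can be checked locally without GCS (the sample row below is hypothetical):

```python
import pandas as pd

columns = ['product_id', 'title', 'vendor', 'product_type',
           'updated_at', 'created_at', 'option_ids']

# A hypothetical row in the shape ReadFile yields
rows = [['632910392', 'IPod Nano - 8GB', 'Apple', 'Cult Products',
         '2020-03-02', '2020-01-14', [1, 2]]]

df = pd.DataFrame(rows, columns=columns)
csv_text = df.to_csv(index=False)

# The first line of the CSV is the header row
print(csv_text.splitlines()[0])
# → product_id,title,vendor,product_type,updated_at,created_at,option_ids
```

Note that the list-valued `option_ids` column is serialized as its string form (e.g. "[1, 2]"), which to_csv quotes because it contains a comma.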
Finally, write the actual Dataflow pipeline and save the file as DataflowTest.py.
class DataflowOptions(PipelineOptions):

    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument('--input_path', type=str, default='gs://steven_dataflow_temp/input.json')
        parser.add_argument('--output_bucket', type=str, default='steven_dataflow_temp')


def run(argv=None):
    parser = argparse.ArgumentParser()
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_options = PipelineOptions(pipeline_args)
    dataflow_options = pipeline_options.view_as(DataflowOptions)
    with beam.Pipeline(options=pipeline_options) as pipeline:
        (pipeline
         | 'Start' >> beam.Create([None])
         | 'Read JSON' >> beam.ParDo(ReadFile(dataflow_options.input_path))
         | 'Write CSV' >> beam.ParDo(WriteCSVFIle(dataflow_options.output_bucket))
        )


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
Full code (note: the original author's code no longer works because of library version issues; the code below has been debugged and runs):
import argparse
import json
import logging

import apache_beam as beam
import pandas as pd
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import storage
from smart_open import open


class ReadFile(beam.DoFn):

    def __init__(self, input_path):
        self.input_path = input_path

    def start_bundle(self):
        self.client = storage.Client()

    def process(self, something):
        clear_data = []
        with open(self.input_path) as fin:
            # for line in fin:
            #     data = json.loads(line)
            data = json.load(fin)
            product = data.get('product')
            if product and product.get('id'):
                product_id = str(product.get('id'))
                title = product.get('title')
                vendor = product.get('vendor')
                product_type = product.get('product_type')
                updated_at = product.get('updated_at')
                created_at = product.get('created_at')
                product_options = product.get('options')
                option_ids = []
                if product_options:
                    for option in product_options:
                        option_ids.append(option.get('id'))
                clear_data.append([product_id, title, vendor, product_type,
                                   updated_at, created_at, option_ids])
        yield clear_data


class WriteCSVFIle(beam.DoFn):

    def __init__(self, bucket_name):
        self.bucket_name = bucket_name

    def start_bundle(self):
        self.client = storage.Client()

    def process(self, mylist):
        df = pd.DataFrame(mylist, columns=['product_id', 'title', 'vendor',
                                           'product_type', 'updated_at',
                                           'created_at', 'option_ids'])
        bucket = self.client.get_bucket(self.bucket_name)
        bucket.blob("csv_exports.csv").upload_from_string(
            df.to_csv(index=False), 'text/csv')


class DataflowOptions(PipelineOptions):

    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument('--input_path', type=str, default='gs://steven_dataflow_temp/input.json')
        parser.add_argument('--output_bucket', type=str, default='steven_dataflow_temp')


def run(argv=None):
    parser = argparse.ArgumentParser()
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_options = PipelineOptions(pipeline_args)
    dataflow_options = pipeline_options.view_as(DataflowOptions)
    with beam.Pipeline(options=pipeline_options) as pipeline:
        (pipeline
         | 'Start' >> beam.Create([None])
         | 'Read JSON' >> beam.ParDo(ReadFile(dataflow_options.input_path))
         | 'Write CSV' >> beam.ParDo(WriteCSVFIle(dataflow_options.output_bucket))
        )


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
We also need to create a requirements.txt file listing the libraries used to build the pipeline.
google-cloud-storage==1.28.1
pandas==1.0.3
smart-open==2.0.0
Missing libraries
It turns out that besides the dependencies above, we also need some libraries that are not mentioned in the original article.
First make sure virtualenv is available, and create a virtual environment directory:
virtualenv PythonDataflow
mv *.* PythonDataflow/  # move all project files into the venv directory
source ./bin/activate   # activate (run from inside PythonDataflow)
deactivate              # deactivate the venv when needed
Run the following commands inside the PythonDataflow venv directory we just created:
pip install google-apitools
pip install apache-beam
Only at this point are all the required libraries installed, and because we are inside the virtual environment, the default interpreter is Python 3.x.
Before generating the template files, we need to create the temp and staging folders in the GCS bucket, as shown below.
After all the steps above are set, we are ready to generate the Dataflow template by running the following command in the terminal, from the folder containing DataflowTest.py.
# from the path of DataflowTest.py
python3 DataflowTest.py \
--runner DataflowRunner \
--project gentle-dominion-245803 \
--staging_location gs://steven_dataflow_staging/ \
--temp_location gs://steven_dataflow_temp/ \
--template_location gs://steven_dataflow_template/MyTemplate \
--requirements_file requirements.txt \
--save_main_session True
Alternatively, using a setup.py instead of requirements.txt:
python DataflowTest.py \
--runner DataflowRunner \
--project gentle-dominion-245803 \
--staging_location gs://steven_dataflow_staging/ \
--temp_location gs://steven_dataflow_temp/ \
--template_location gs://steven_dataflow_template/MyTemplate \
--setup_file /home/zhanglingfei/Downloads/dataflow_job_steven/PythonDataflow/PythonDataflow/setup.py \
--save_main_session True
We actually used the second approach, with setup.py, which ran successfully, although it produced some output I have not fully figured out.
While the job runs, the Dataflow engine uses the staging and temp locations in the GCS bucket. The template location is the bucket where the template will be stored.
After running successfully, we can find the template file in GCS bucket:
Once we have the template generated, we are ready to run the Dataflow job. Select Dataflow in the Google Console, then Create Job from Template. Give it the name "my_test_job", and under Dataflow Template select Custom Template. Browse and select the template from its GCS bucket path, and also provide the path URI of the temp folder.
Click Run; once the job is running, we will see the flow in the Dataflow execution console, and afterwards the CSV file in the temp folder.
Starting status, while the flow is being analyzed:
Running status:
Completed status:
The full project code is available on GitHub: https://github.com/aFrag/PythonDataflow