Designing a Data Quality Check Application in Python

weixin_39734020

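Copyright notice: This is an original article by the blogger, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/weixin_39734020/article/details/111417667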

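I am developing an application that runs data quality (DQ) checks on input files and captures counts of the DQ failures found in the data. Does my approach make sense, or would you recommend a better one?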

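I am trying to write an application in Python that catches DQ errors in the data and collects counts. I could have done this with Pandas and NumPy, but because the data volume is very large (~100 GB), I decided to implement it with Spark. This is only the third application I have written in Python, so although I can write code in it, I cannot tell whether this is really the best approach.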

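In short, I read multiple CSV files and create a Parquet file from them, then create a temp view that I can query to find DQ issues. I capture the result of each query in a variable and write it into a list. This list is later used to write a CSV that becomes the input for a dashboard report. The code is below.

# Importing required libraries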

import datetime

from pyspark.sql import SparkSession

# Initiating Spark Session

spark = SparkSession.builder.appName("DQ-checks").getOrCreate()

# Initializing Variables

time1 = datetime.datetime.now()

src_file_01 = r'\All kinds of data files\craigslistVehicles.csv'

target_filename = r'\All kinds of data files\craigslistVehicles.parquet'

# Read the CSV file through Spark

df = spark.read.csv(src_file_01, header="true", inferSchema="true")

# Update code to make it flexible enough to read multiple files

# Write the contents of the CSV file into a Parquet file

df.write.format("parquet").save(target_filename, mode="Overwrite")

print("created a parquet file")

# Create a temporary view over the Parquet file to query data

df2 = spark.read.parquet(target_filename)

df2.createOrReplaceTempView("craigslistVehicles")

# Create a column list from the header of the Spark View

column_list = df2.columns

print(column_list)

# Capturing time before start of the query for benchmarking

time2 = datetime.datetime.now()

result_store = []

# Iterate through all the columns and capture null counts for each column

rule_type = 'Null Check'

for column_name in column_list:
    # Backticks guard against column names that contain spaces or reserved words
    query = "SELECT count(*) AS null_count FROM craigslistVehicles WHERE `{}` IS NULL".format(column_name)
    # print(query)
    res_in_num = spark.sql(query).first()["null_count"]
    # Append one [rule, column, count] entry per column instead of overwriting the list
    row = [rule_type, column_name, res_in_num]
    result_store.append(row)
    print(row)

# Next Steps - Update code to add more data quality checks based on requirement.

# Next Steps - Insert results of the queries into a spark table that can be used as a log and becomes an input for a dashboard report.

# Capturing time after end of the query for benchmarking

time3 = datetime.datetime.now()

print("Query time is {}".format(time3 - time2))

print("Total job run time is {}".format(time3 - time1))

# Spark Session Stop

spark.stop()
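
The "read multiple files" to-do in the code comments could be handled by collecting the file paths first. The following is a minimal sketch, assuming all input CSVs sit directly in one directory and share the same header; `collect_csv_paths` is a hypothetical helper name, not part of the code above.

```python
from pathlib import Path


def collect_csv_paths(directory):
    """Return the sorted paths of all .csv files directly inside `directory`."""
    return sorted(str(p) for p in Path(directory).glob("*.csv"))


# Hypothetical usage with the Spark session from the code above
# (spark.read.csv accepts a list of paths as well as a single path):
# paths = collect_csv_paths(r"\All kinds of data files")
# df = spark.read.csv(paths, header=True, inferSchema=True)
```

Sorting the paths only makes the file order reproducible between runs; Spark does not require it.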

For now, this works. I can process one file in under a minute.
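
The last step described earlier, writing the collected results to a CSV for the dashboard, is not shown in the code. A minimal sketch using the standard `csv` module follows. The function name `write_results_csv`, the header names, and the example rows are all assumptions made for illustration.

```python
import csv


def write_results_csv(rows, out_path):
    """Write [rule_type, column_name, failure_count] rows to a CSV file with a header."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["rule_type", "column_name", "failure_count"])
        writer.writerows(rows)


# Illustrative rows in the same [rule, column, count] shape as the results list;
# the column names and counts here are made up, not real output.
example_rows = [
    ["Null Check", "col_a", 0],
    ["Null Check", "col_b", 12],
]
```

Calling `write_results_csv(example_rows, "dq_results.csv")` would produce a three-line file: the header plus one line per check.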

My questions are:

Does this design make sense? If you had to do this, how would you do it? Is there anything obvious I could change to make the code cleaner?
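
One candidate change, offered only as a sketch: the loop issues one Spark query per column, so the table is scanned once for every column. The null counts could instead be gathered in a single query. This relies on `COUNT(col)` skipping NULLs, so `COUNT(*) - COUNT(col)` is the number of NULLs in that column. `build_null_count_query` is a hypothetical helper name, and whether this is actually faster on the ~100 GB data has not been measured here.

```python
def build_null_count_query(table_name, column_names):
    """Build one SQL statement that counts NULLs for every column in a single scan."""
    select_items = [
        # COUNT(col) ignores NULLs, so the difference is the NULL count for that column
        "COUNT(*) - COUNT(`{0}`) AS `{0}`".format(name)
        for name in column_names
    ]
    return "SELECT {} FROM {}".format(", ".join(select_items), table_name)


# Hypothetical usage with the names from the code above:
# row = spark.sql(build_null_count_query("craigslistVehicles", column_list)).first()
# result_store = [["Null Check", name, count] for name, count in row.asDict().items()]
```

The query returns a single row with one field per column, so the results list can be built from that row without further Spark jobs.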
