Data Profiling in Data Science


Data Profiling

Data profiling is the process of examining data from an existing source and summarizing information about that data. You profile data to determine its accuracy, completeness, and validity. Data profiling can be done for many reasons, but it is most commonly part of assessing data quality as one component of a larger project. It is often combined with an ETL (Extract, Transform, and Load) process that moves data from one system to another. Done properly, ETL and data profiling can be combined to cleanse, enrich, and move quality data to a target location.

For example, you may need to perform data profiling when migrating from a legacy system to a new system. Data profiling helps identify data quality issues that need to be handled in code when you move data into the new system. Or you may need to profile data as you move it into a data warehouse for business analytics. Typically, when data is moved into a data warehouse, ETL tools are used to move it. Data profiling helps identify which data quality issues should be fixed in the source and which can be fixed during the ETL process.

Why profile data?

Data profiling lets you answer the following questions about your data:

  • Is the data complete? Are there blank or null values?

  • Is the data unique? How many distinct values are there? Is the data duplicated?

  • Are there anomalous patterns in your data? What is the distribution of patterns in your data?

  • Are these the patterns you expect?

  • What range of values exists, and is it expected? What are the maximum, minimum, and average values for a given field? Are these the ranges you expect?

Answering these questions helps ensure that you maintain quality data, which, as companies increasingly recognize, is the cornerstone of a thriving business.
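The questions above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper names `shape` and `profile_column`, the convention that `""` or `None` means "missing", and the sample `ages` column are all assumptions made for the example.

```python
from collections import Counter
from statistics import mean


def shape(value):
    """Reduce a string to its pattern: digits become 9, letters become A."""
    return "".join(
        "9" if c.isdigit() else "A" if c.isalpha() else c for c in value
    )


def profile_column(values):
    """Summarize one column given as a list of strings (hypothetical helper).

    Assumption: an empty string or None counts as a missing value.
    """
    present = [v for v in values if v not in (None, "")]
    numeric = []
    for v in present:
        try:
            numeric.append(float(v))
        except ValueError:
            pass  # non-numeric values are left out of the range statistics
    return {
        "rows": len(values),
        "missing": len(values) - len(present),           # completeness
        "distinct": len(set(present)),                   # uniqueness
        "duplicated": len(present) - len(set(present)),  # repeated values
        "patterns": Counter(shape(v) for v in present),  # pattern distribution
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
        "mean": mean(numeric) if numeric else None,
    }


# Made-up sample column with one blank and one placeholder value.
ages = ["34", "41", "", "34", "n/a", "29"]
report = profile_column(ages)
print(report)
```

In this sample, the pattern counter makes the stray `n/a` entry stand out from the two-digit values, which is the kind of anomaly the pattern questions are meant to surface.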

How do you profile data?

Data profiling can be performed in several ways, but there are roughly three basic methods used to analyze the data.

Column profiling counts the number of times each value appears within each column of a table. This method helps uncover patterns within your data.
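As a rough sketch of column profiling, the snippet below counts value frequencies per column. It assumes the table is held in memory as a list of dictionaries; the function name `column_value_counts` and the `orders` data are made up for illustration.

```python
from collections import Counter


def column_value_counts(rows):
    """Count how often each value appears in each column (hypothetical helper).

    Assumption: the table is a list of dicts, one dict per row.
    """
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts.setdefault(column, Counter())[value] += 1
    return counts


# Made-up sample table.
orders = [
    {"status": "shipped", "country": "DE"},
    {"status": "shipped", "country": "FR"},
    {"status": "SHIPPED", "country": "DE"},
    {"status": "pending", "country": "DE"},
]

counts = column_value_counts(orders)
for column, counter in counts.items():
    print(column, counter.most_common())
```

A frequency table like this tends to expose inconsistent spellings, such as `shipped` and `SHIPPED` being recorded as separate values.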

Cross-column profiling looks across columns to perform key and dependency analysis. Key analysis scans collections of values in a table to locate a possible primary key. Dependency analysis determines the dependent relationships within a data set. Together, these analyses determine the relationships and dependencies within a table.
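One simple way to approximate key and dependency analysis is a brute-force check over column combinations. The sketch below assumes a small, non-empty in-memory table; `candidate_keys`, `determines`, and the sample rows are hypothetical names and data, and real profiling tools use more scalable approaches.

```python
from itertools import combinations


def candidate_keys(rows, max_width=2):
    """Return minimal column combinations whose values are unique per row.

    Assumption: rows is a non-empty list of dicts sharing the same keys.
    """
    columns = list(rows[0])
    keys = []
    for width in range(1, max_width + 1):
        for combo in combinations(columns, width):
            if any(set(key) <= set(combo) for key in keys):
                continue  # a smaller key is already contained in this combo
            seen = {tuple(row[c] for c in combo) for row in rows}
            if len(seen) == len(rows):
                keys.append(combo)
    return keys


def determines(rows, lhs, rhs):
    """True if each lhs value maps to exactly one rhs value (lhs -> rhs)."""
    mapping = {}
    for row in rows:
        if mapping.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True


# Made-up sample table.
rows = [
    {"id": 1, "zip": "10115", "city": "Berlin"},
    {"id": 2, "zip": "10117", "city": "Berlin"},
    {"id": 3, "zip": "80331", "city": "Munich"},
    {"id": 4, "zip": "10115", "city": "Berlin"},
]

keys = candidate_keys(rows)
zip_to_city = determines(rows, "zip", "city")
city_to_zip = determines(rows, "city", "zip")
print(keys, zip_to_city, city_to_zip)
```

Note that a result found on sample data is only a candidate: uniqueness or a dependency that holds for these rows may still be violated elsewhere in the full data set.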

Cross-table profiling looks across tables to identify potential foreign keys. It also attempts to determine the similarities and differences in syntax and data types between tables, to establish which data might be redundant and which could be mapped together.
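A common heuristic for spotting potential foreign keys is an inclusion check: a child column is a candidate if its values are all contained in a unique column of another table. The following is a minimal sketch under that assumption; `foreign_key_candidates`, the `min_coverage` parameter, and both sample tables are invented for the example.

```python
def foreign_key_candidates(child, parent, min_coverage=1.0):
    """List (child_col, parent_col, coverage) pairs that look like foreign keys.

    Assumptions: both tables are non-empty lists of dicts, and a referenced
    parent column must contain unique values.
    """
    results = []
    for p_col in parent[0]:
        p_values = [row[p_col] for row in parent]
        if len(set(p_values)) != len(p_values):
            continue  # not unique, so it cannot be the referenced key
        p_set = set(p_values)
        for c_col in child[0]:
            c_values = [row[c_col] for row in child if row[c_col] is not None]
            if not c_values:
                continue
            coverage = sum(v in p_set for v in c_values) / len(c_values)
            if coverage >= min_coverage:
                results.append((c_col, p_col, coverage))
    return results


# Made-up sample tables.
customers = [
    {"customer_id": 1, "name": "Ana"},
    {"customer_id": 2, "name": "Bo"},
    {"customer_id": 3, "name": "Cy"},
]
orders = [
    {"order_id": 101, "customer_id": 1},
    {"order_id": 102, "customer_id": 3},
    {"order_id": 103, "customer_id": 1},
]

candidates = foreign_key_candidates(orders, customers)
print(candidates)
```

Lowering `min_coverage` below 1.0 would also report columns where most, but not all, child values have a match, which can point to orphaned records.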

Rule validation is sometimes considered the final step in data profiling. It is a proactive step of adding rules that check the correctness and integrity of the data entered into the system.
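Rule validation can be illustrated as a set of per-field predicates applied to each incoming record. The rules below (a loose email shape and an age range of 0 to 130) are assumptions chosen for the example, not rules from the original article, and `validate` is a hypothetical helper.

```python
import re

# Hypothetical rules: each maps a field name to a check returning True or False.
RULES = {
    "email": lambda v: isinstance(v, str)
    and re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    "age": lambda v: isinstance(v, int) and 0 <= v <= 130,
}


def validate(record, rules=RULES):
    """Return the names of the fields in record that break a rule."""
    return [field for field, check in rules.items() if not check(record.get(field))]


good = validate({"email": "ana@example.com", "age": 30})
bad = validate({"email": "not-an-email", "age": -4})
missing = validate({"age": 41})
print(good, bad, missing)
```

Running such checks at the point of entry keeps new errors out, whereas the profiling methods described earlier find errors that are already stored.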

These methods can be performed manually by an analyst, or by a service that automates these queries.

Data profiling challenges

Data profiling is often difficult because of the sheer volume of data you need to profile. This is especially true if you are looking at a legacy system, which might hold years of older data with thousands of errors. Experts recommend that you segment your data as part of your data profiling process so that you can see the forest for the trees.
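Segmenting can be as simple as computing the same metric per group rather than over the whole table. The sketch below assumes rows carry a `year` field to segment on; the function `missing_rate_by_segment` and the sample data are hypothetical.

```python
from collections import defaultdict


def missing_rate_by_segment(rows, segment_key, column):
    """Compute the share of missing values in column for each segment.

    Assumption: an empty string or None counts as missing.
    """
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        segment = segment_key(row)
        totals[segment] += 1
        if row.get(column) in (None, ""):
            missing[segment] += 1
    return {segment: missing[segment] / totals[segment] for segment in totals}


# Made-up sample data: older records are less complete.
rows = [
    {"year": 2009, "phone": ""},
    {"year": 2009, "phone": "555-0101"},
    {"year": 2023, "phone": "555-0102"},
    {"year": 2023, "phone": "555-0103"},
]

rates = missing_rate_by_segment(rows, lambda r: r["year"], "phone")
print(rates)
```

Breaking the metric down this way shows whether errors are concentrated in one part of the data, such as the oldest records.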

If you profile data manually, you need the skill to run numerous queries and sift through the results to gain meaningful insights about your data, which can eat up valuable resources. In addition, you will likely only be able to check a subset of your overall data, because going through the entire data set takes too long.
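When only a subset can be checked, one option is to profile a random sample and treat the result as an estimate. This is a minimal sketch assuming the rows fit in memory; `sample_missing_rate`, the fixed `seed`, and the generated data are all assumptions for illustration.

```python
import random


def sample_missing_rate(rows, column, sample_size, seed=0):
    """Estimate the share of missing values in column from a random sample.

    Assumptions: rows is a non-empty list of dicts, and an empty string or
    None counts as missing. A fixed seed keeps the sample reproducible.
    """
    rng = random.Random(seed)
    sample = rng.sample(rows, min(sample_size, len(rows)))
    missing = sum(1 for row in sample if row.get(column) in (None, ""))
    return missing / len(sample)


# Made-up data: every fourth row has no email.
rows = [
    {"id": i, "email": "" if i % 4 == 0 else f"user{i}@example.com"}
    for i in range(1000)
]

estimate = sample_missing_rate(rows, "email", sample_size=200)
exact = sample_missing_rate(rows, "email", sample_size=len(rows))
print(estimate, exact)
```

The estimate from 200 rows will generally be close to, but not identical to, the exact rate, so rare problems can still be missed by sampling.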

Translated from: https://www.includehelp.com/data-science/data-profiling.aspx
