A Brief Introduction to Data Quality Assessment with ydata-quality

superY25

Last modified 2023-08-28 23:12:33

Category: Artificial Intelligence. Tags: machine learning, ydata-quality, data processing

First published 2023-08-28 23:08:42

Article link: https://blog.csdn.net/superY_26/article/details/132550137

Abstract

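ydata-quality is a library for data quality, much as sklearn is for machine learning. It assesses data quality across the multiple stages of developing a data pipeline. As long as you have data available, running DataQuality(df=my_df).evaluate() produces a detailed, comprehensive summary of the assessment. The assessment covers the following main aspects: Duplicates, Missing Values, Erroneous Data, Drift Analysis, Data Relations, Labelling, Bias & Fairness, and Data Expectations.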

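Using ydata-quality

Note: the above is only a brief introduction to each module; for detailed usage, refer to the official documentation in the git repository given above.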

1. Overall assessment of a dataset

from ydata_quality import DataQuality
import pandas as pd

df = pd.read_csv('../datasets/transformed/census_10k.csv') # load data
dq = DataQuality(df=df) # create the main class that holds all quality modules
results = dq.evaluate() # run the tests

Warnings:
TOTAL: 5 warning(s)
Priority 1: 1 warning(s)
Priority 2: 4 warning(s)

Priority 1 - heavy impact expected:
* [DUPLICATES - DUPLICATE COLUMNS] Found 1 columns with exactly the same feature values as other columns.
Priority 2 - usage allowed, limited human intelligibility:
* [DATA RELATIONS - HIGH COLLINEARITY - NUMERICAL] Found 3 numerical variables with high Variance Inflation Factor (VIF>5.0). The variables listed in results are highly collinear with other variables in the dataset. These will make model explainability harder and potentially give way to issues like overfitting. Depending on your end goal you might want to remove the highest VIF variables.
* [ERRONEOUS DATA - PREDEFINED ERRONEOUS DATA] Found 1960 ED values in the dataset.
* [DATA RELATIONS - HIGH COLLINEARITY - CATEGORICAL] Found 10 categorical variables with significant collinearity (p-value < 0.05). The variables listed in results are highly collinear with other variables in the dataset and sorted descending according to propensity. These will make model explainability harder and potentially give way to issues like overfitting. Depending on your end goal you might want to remove variables following the provided order.
* [DUPLICATES - EXACT DUPLICATES] Found 3 instances with exact duplicate feature values.
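
The HIGH COLLINEARITY - NUMERICAL warning above is based on the Variance Inflation Factor. As a rough illustration of what that number means, the sketch below computes VIF_j = 1 / (1 - R_j^2) with plain NumPy, where R_j^2 comes from regressing column j on all the other columns. The function name `vif` and the synthetic data are assumptions made for this example; this is not the code ydata-quality runs internally.

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor of each column of a 2-D numeric array."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        A = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        residual = y - A @ coef
        ss_res = residual @ residual
        ss_tot = ((y - y.mean()) ** 2).sum()
        r_squared = 1.0 - ss_res / ss_tot
        factors.append(1.0 / (1.0 - r_squared))
    return np.array(factors)

# Synthetic data (hypothetical): c is almost a copy of a, b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)

vifs = vif(np.column_stack([a, b, c]))
print(vifs)  # a and c should be far above 5, b close to 1
```

With a threshold of 5 (the default vif_th shown in this post), columns a and c in this toy example would be flagged, while b would not.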

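The report contains several warnings of different priorities ⚠️; these are the default assessment results of the modules above. More results can be obtained by passing additional arguments. The parameters of DataQuality() are as follows: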

DataQuality(df: DataFrame,
            label: str = None,
            random_state: Optional[int] = None,
            entities: Optional[List[Union[str, List[str]]]] = None,
            is_close: bool = False,
            ed_extensions: Optional[list] = None,
            sample: Optional[DataFrame] = None,
            model: Callable = None,
            results_json_path: str = None,
            error_tol: int = 0,
            rel_error_tol: Optional[float] = None,
            minimum_coverage: Optional[float] = 0.75,
            sensitive_features: Optional[List[str]] = None,
            dtypes: Optional[dict] = None,
            corr_th: float = 0.8,
            vif_th: float = 5,
            p_th: float = 0.05,
            plot: bool = False,
            severity: str = 'ERROR')
"""
Args:
    df (DataFrame): reference DataFrame used to run the DataQuality analysis.
    label (str, optional): [MISSINGS, LABELLING, DRIFT ANALYSIS] target feature to be predicted.
                            If not specified, LABELLING is skipped.
    random_state (int, optional): Integer seed for random reproducibility. Default is None.
        Set to None for fully random behavior, no reproducibility.
    entities: [DUPLICATES] entities relevant for duplicate analysis.
    is_close: [DUPLICATES] Pass True to use numpy.isclose instead of pandas.equals in column comparison.
    ed_extensions: [ERRONEOUS DATA] A list of user provided erroneous data values to append to defaults.
    sample: [DRIFT ANALYSIS] data against which drift is tested.
    model: [DRIFT ANALYSIS] model wrapped by ModelWrapper used to test concept drift.
    results_json_path (str): [EXPECTATIONS] A path to the json output from a Great Expectations validation run.
    error_tol (int): [EXPECTATIONS] Defines how many failed expectations are tolerated.
    rel_error_tol (float): [EXPECTATIONS] Defines the maximum fraction of failed expectations, \
        overrides error_tol.
    minimum_coverage (float): [EXPECTATIONS] Minimum expected fraction of DataFrame columns covered by the \
        expectation suite.
    sensitive_features (List[str]): [BIAS & FAIRNESS] features deemed as sensitive attributes
    dtypes (Optional[dict]): Maps names of the columns of the dataframe to supported dtypes. Columns not \
        specified are automatically inferred.
    corr_th (float): [DATA RELATIONS] Absolute threshold for high correlation detection. Defaults to 0.8.
    vif_th (float): [DATA RELATIONS] Variance Inflation Factor threshold for numerical independence test, \
        typically 5-10 is recommended. Defaults to 5.
    p_th (float): [DATA RELATIONS] Fraction of the right tail of the chi squared CDF defining threshold for \
        categorical independence test. Defaults to 0.05.
    plot (bool): Pass True to produce all available graphical outputs, False to suppress all graphical output.
    severity (str): Sets the logger warning threshold.
        Valid levels are: [DEBUG, INFO, WARNING, ERROR, CRITICAL]
"""

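By default, five analyses are run: Duplicates, Missing Values, Erroneous Data, Drift Analysis, and Data Relations. If the label parameter is not None, the Labelling analysis is also run; if sensitive_features is a list with length > 0, the Bias & Fairness analysis is run; and if results_json_path is not None, the Data Expectations analysis is run.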
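
The selection rule described above can be written out as a small piece of plain Python. The helper `selected_engines` and the constant `DEFAULT_ENGINES` are hypothetical names introduced only to restate the rule; they are not part of the ydata-quality API, and the example arguments ("income", "sex", "race") are made-up column names.

```python
from typing import List, Optional

# The five analyses this post says are always run.
DEFAULT_ENGINES = [
    "Duplicates",
    "Missing Values",
    "Erroneous Data",
    "Drift Analysis",
    "Data Relations",
]

def selected_engines(label: Optional[str] = None,
                     sensitive_features: Optional[List[str]] = None,
                     results_json_path: Optional[str] = None) -> List[str]:
    """Restate which analyses run for a given set of arguments (illustration only)."""
    engines = list(DEFAULT_ENGINES)
    if label is not None:
        engines.append("Labelling")
    if sensitive_features:  # a non-empty list
        engines.append("Bias & Fairness")
    if results_json_path is not None:
        engines.append("Data Expectations")
    return engines

only_defaults = selected_engines()
with_extras = selected_engines(label="income", sensitive_features=["sex", "race"])
print(with_extras)
```

Calling it with no arguments returns just the five defaults; supplying a label and a non-empty sensitive_features list adds Labelling and Bias & Fairness.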

At this point we only have summary information about the warnings at each priority level; to get the details, call get_warnings():

dq.get_warnings(test='Duplicate Columns')

2. Duplicates

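This module mainly detects three kinds of duplication: duplicate columns, duplicate samples, and duplicate samples after grouping by certain features.
① Duplicate columns: checks whether any feature columns in the DataFrame hold identical data. Below, Col2 and Col3 are duplicates.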

index	Col1	Col2	Col3
1	1	3	3
2	7	4	4
3	3	8	8
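
To make the column-duplicate check concrete, here is a minimal pandas sketch that rebuilds the table above and compares every pair of columns with `Series.equals`. The helper `duplicate_column_pairs` is a hypothetical name for this illustration and is not how ydata-quality implements the check.

```python
from itertools import combinations

import pandas as pd

# The example table from above.
df = pd.DataFrame(
    {"Col1": [1, 7, 3], "Col2": [3, 4, 8], "Col3": [3, 4, 8]},
    index=[1, 2, 3],
)

def duplicate_column_pairs(frame: pd.DataFrame):
    """Return every pair of column names whose values are exactly equal."""
    return [
        (a, b)
        for a, b in combinations(frame.columns, 2)
        if frame[a].equals(frame[b])
    ]

dup_cols = duplicate_column_pairs(df)
print(dup_cols)  # [('Col2', 'Col3')]
```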

② Duplicate samples: checks whether the DataFrame contains duplicated rows. Below, the samples with index 1 and 2 are duplicates.

index	Col1	Col2	Col3
1	1	2	3
2	1	2	3
3	3	8	8
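
The same check can be reproduced on the table above with pandas alone. This is only a sketch of the idea using `DataFrame.duplicated`, not the library's own code; `keep=False` marks every member of a duplicate group rather than only the later copies.

```python
import pandas as pd

# The example table from above.
df = pd.DataFrame(
    {"Col1": [1, 1, 3], "Col2": [2, 2, 8], "Col3": [3, 3, 8]},
    index=[1, 2, 3],
)

# Rows whose values match another row on every column.
exact_dups = df[df.duplicated(keep=False)]
print(exact_dups.index.tolist())  # [1, 2]
```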

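③ Duplicate samples after grouping by certain features: group by the specified features, then check whether the values of the remaining features are duplicated. Below, grouping by Col1 and comparing Col2 and Col3 across samples, index 1 and 2 are duplicates within the Col1=1 group. Index 4 and 5 also have equal Col2 and Col3, but their Col1 values put them in different groups, so they are not counted.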

index	Col1	Col2	Col3
1	1	2	3
2	1	2	3
3	1	5	8
4	2	2	3
5	3	2	3
6	2	5	8
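
The grouping behaviour described above can be sketched in pandas as follows. It follows this post's description (group by Col1, then compare the remaining columns within each group) and is not taken from the ydata-quality source; the variable names are chosen for this example.

```python
import pandas as pd

# The example table from above.
df = pd.DataFrame(
    {
        "Col1": [1, 1, 1, 2, 3, 2],
        "Col2": [2, 2, 5, 2, 2, 5],
        "Col3": [3, 3, 8, 3, 3, 8],
    },
    index=[1, 2, 3, 4, 5, 6],
)

entity = "Col1"
rest = [c for c in df.columns if c != entity]

# Within each Col1 group, mark rows whose remaining columns repeat.
within_entity = []
for _, group in df.groupby(entity):
    dup = group[group.duplicated(subset=rest, keep=False)]
    within_entity.extend(dup.index.tolist())

# For contrast: ignoring Col1 entirely, every row repeats some other row.
ignoring_entity = df.index[df.duplicated(subset=rest, keep=False)].tolist()

print(within_entity)    # [1, 2]
print(ignoring_entity)  # [1, 2, 3, 4, 5, 6]
```

Only rows 1 and 2 are reported when the comparison is restricted to a single Col1 group, which matches the description above.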

import pandas as pd
from ydata_quality.duplicates import DuplicateChecker
df = pd.read_csv('../datasets/transformed/guerry_histdata.csv')
dc = DuplicateChecker(df=df, entities=['Region', 'MainCity'])
results = dc.evaluate()