About Evaluate
A library for easily evaluating machine learning models and datasets.
Visit the 🤗 Evaluate organization for a full list of available metrics.
Each metric has a dedicated Space with an interactive demo showing how to use the metric, and a documentation card detailing the metric's limitations and usage.
- Docs: https://huggingface.co/docs/evaluate
- GitHub: https://github.com/huggingface/evaluate.git
- All metrics: https://huggingface.co/metrics
Categories:
https://huggingface.co/docs/evaluate/choosing_a_metric
- Metric: https://huggingface.co/evaluate-metric
- Comparison: https://huggingface.co/evaluate-comparison
- Measurement: https://huggingface.co/evaluate-measurement
Installing evaluate
pip3 install evaluate
Installing from source
git clone https://github.com/huggingface/evaluate.git
cd evaluate
pip install -e .
Post-install check
python -c "import evaluate; print(evaluate.load('exact_match').compute(references=['hello'], predictions=['hello']))"
Usage
Docs: https://huggingface.co/docs/evaluate/a_quick_tour
Taking accuracy as an example:
See accuracy at: https://huggingface.co/spaces/evaluate-metric/accuracy
>>> accuracy_metric = evaluate.load("accuracy")
>>> results = accuracy_metric.compute(references=[0, 1], predictions=[0, 1])
>>> print(results)
{'accuracy': 1.0}
Inspect accuracy_metric
>>> accuracy_metric
EvaluationModule(name: "accuracy", module_type: "metric", features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}, usage: """
Args:
predictions (`list` of `int`): Predicted labels.
references (`list` of `int`): Ground truth labels.
normalize (`boolean`): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
sample_weight (`list` of `float`): Sample weights. Defaults to None.
Returns:
accuracy (`float` or `int`): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0, or the number of examples input if `normalize` is set to `False`. A higher score means higher accuracy.
Examples:
Example 1-A simple example
>>> accuracy_metric = evaluate.load("accuracy")
...
""", stored examples: 0)
>>> accuracy_metric.description  # view the description
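The `normalize` and `sample_weight` arguments can be illustrated with a small pure-Python sketch of the semantics described in the usage string (an illustration only, not the library's actual implementation):

```python
# Illustrative sketch of the accuracy semantics described above --
# not the library's implementation.
def accuracy(references, predictions, normalize=True, sample_weight=None):
    if sample_weight is None:
        sample_weight = [1.0] * len(references)
    # weighted count of correctly classified samples
    correct = sum(w for r, p, w in zip(references, predictions, sample_weight) if r == p)
    if normalize:
        return correct / sum(sample_weight)  # fraction in [0, 1]
    return correct  # raw (weighted) count

print(accuracy(references=[0, 1, 1, 0], predictions=[0, 1, 0, 0]))   # 0.75
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0], normalize=False))         # 3.0
```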
List all evaluation modules
>>> evaluate.list_evaluation_modules("metric")
['lvwerra/test', 'precision', 'code_eval', 'roc_auc', 'cuad', 'xnli', 'rouge', 'pearsonr', 'mse', 'super_glue', 'comet', 'cer', 'sacrebleu', 'mahalanobis', 'wer', 'competition_math', 'f1', 'recall', 'coval', 'mauve', 'xtreme_s', 'bleurt', 'ter', 'accuracy', 'exact_match', 'indic_glue', 'spearmanr', 'mae', 'squad', 'chrf', 'glue', 'perplexity', 'mean_iou', 'squad_v2', 'meteor', 'bleu', 'wiki_split', 'sari', 'frugalscore', 'google_bleu', 'bertscore', 'matthews_correlation', 'seqeval', 'trec_eval', 'rl_reliability', 'jordyvl/ece', 'angelina-wang/directional_bias_amplification', 'cpllab/syntaxgym', 'lvwerra/bary_score', 'kaggle/amex', 'kaggle/ai4code', 'hack/test_metric', 'yzha/ctc_eval', 'codeparrot/apps_metric', 'mfumanelli/geometric_mean', 'daiyizheng/valid', 'poseval', 'erntkn/dice_coefficient', 'mgfrantz/roc_auc_macro', 'Vlasta/pr_auc', 'gorkaartola/metric_for_tp_fp_samples', 'idsedykh/metric', 'idsedykh/codebleu2', 'idsedykh/codebleu', 'idsedykh/megaglue', 'kasmith/woodscore', 'cakiki/ndcg', 'brier_score', 'Vertaix/vendiscore', 'GMFTBY/dailydialogevaluate', 'GMFTBY/dailydialog_evaluate', 'jzm-mailchimp/joshs_second_test_metric', 'ola13/precision_at_k', 'yulong-me/yl_metric', 'abidlabs/mean_iou', 'abidlabs/mean_iou2', 'KevinSpaghetti/accuracyk', 'Felipehonorato/my_metric', 'NimaBoscarino/weat', 'ronaldahmed/nwentfaithfulness', 'Viona/infolm', 'kyokote/my_metric2', 'kashif/mape', 'Ochiroo/rouge_mn', 'giulio98/code_eval_outputs', 'leslyarun/fbeta_score', 'giulio98/codebleu', 'anz2/iliauniiccocrevaluation', 'zbeloki/m2', 'xu1998hz/sescore', 'mase', 'mape', 'smape', 'dvitel/codebleu', 'NCSOFT/harim_plus', 'JP-SystemsX/nDCG', 'sportlosos/sescore', 'Drunper/metrica_tesi', 'jpxkqx/peak_signal_to_noise_ratio', 'jpxkqx/signal_to_reconstrution_error', 'hpi-dhc/FairEval', 'nist_mt', 'lvwerra/accuracy_score', 'character', 'charcut_mt', 'ybelkada/cocoevaluate', 'harshhpareek/bertscore', 'posicube/mean_reciprocal_rank', 'bstrai/classification_report', 
'omidf/squad_precision_recall', 'Josh98/nl2bash_m', 'BucketHeadP65/confusion_matrix', 'BucketHeadP65/roc_curve', 'yonting/average_precision_score', 'transZ/test_parascore']
Load a module of a specific type
word_length = evaluate.load("word_length", module_type="measurement")
Load a community module
element_count = evaluate.load("lvwerra/element_count", module_type="measurement")
Choosing a metric
Source: https://huggingface.co/docs/evaluate/choosing_a_metric
- Generic metrics
- Task-specific metrics
- Dataset-specific metrics
Generic metrics
precision_metric = evaluate.load("precision")
results = precision_metric.compute(references=[0, 1], predictions=[0, 1])
print(results)
{'precision': 1.0}
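For intuition, binary precision is the fraction of positive predictions that are correct, TP / (TP + FP); a minimal pure-Python sketch (not the library's code):

```python
# Minimal sketch of binary precision: TP / (TP + FP).
def precision(references, predictions, positive=1):
    tp = sum(1 for r, p in zip(references, predictions) if p == positive and r == positive)
    fp = sum(1 for r, p in zip(references, predictions) if p == positive and r != positive)
    return tp / (tp + fp) if (tp + fp) else 0.0

print(precision([0, 1, 1, 0], [1, 1, 0, 0]))  # 0.5: one true positive, one false positive
```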
Task-specific metrics
Dataset-specific metrics
from evaluate import load
squad_metric = load("squad")
predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22'}]
references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}]
results = squad_metric.compute(predictions=predictions, references=references)
results
The Evaluator classes
https://huggingface.co/docs/evaluate/base_evaluator
https://huggingface.co/docs/evaluate/v0.4.0/en/package_reference/evaluator_classes
- "text-classification": will use the TextClassificationEvaluator.
- "token-classification": will use the TokenClassificationEvaluator.
- "question-answering": will use the QuestionAnsweringEvaluator.
- "image-classification": will use the ImageClassificationEvaluator.
- "text-generation": will use the TextGenerationEvaluator.
- "text2text-generation": will use the Text2TextGenerationEvaluator.
- "summarization": will use the SummarizationEvaluator.
- "translation": will use the TranslationEvaluator.
- "automatic-speech-recognition": will use the AutomaticSpeechRecognitionEvaluator.
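Conceptually, `evaluator(task)` acts as a dispatch table from task name to evaluator class. A simplified sketch of that pattern, with placeholder classes standing in for the real ones (not the library's code):

```python
# Simplified dispatch-table sketch of evaluator(task); the stand-in
# classes below are placeholders, not the library's implementations.
class TextClassificationEvaluator: ...
class QuestionAnsweringEvaluator: ...

TASK_TO_EVALUATOR = {
    "text-classification": TextClassificationEvaluator,
    "question-answering": QuestionAnsweringEvaluator,
    # ... the remaining tasks map the same way
}

def evaluator(task):
    try:
        return TASK_TO_EVALUATOR[task]()
    except KeyError:
        raise KeyError(f"Unknown task: {task}")

print(type(evaluator("text-classification")).__name__)  # TextClassificationEvaluator
```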
Usage examples
1. Text classification
Evaluate models on the Hub
from datasets import load_dataset
from evaluate import evaluator
from transformers import AutoModelForSequenceClassification, pipeline
data = load_dataset("imdb", split="test").shuffle(seed=42).select(range(1000))
task_evaluator = evaluator("text-classification")
# 1. Pass a model name or path
eval_results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb",
    data=data,
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)
# 2. Pass an instantiated model
model = AutoModelForSequenceClassification.from_pretrained("lvwerra/distilbert-imdb")
eval_results = task_evaluator.compute(
    model_or_pipeline=model,
    data=data,
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)
# 3. Pass an instantiated pipeline
pipe = pipeline("text-classification", model="lvwerra/distilbert-imdb")
eval_results = task_evaluator.compute(
    model_or_pipeline=pipe,
    data=data,
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)
print(eval_results)
Output
{
    'accuracy': 0.918,
    'latency_in_seconds': 0.013,
    'samples_per_second': 78.887,
    'total_time_in_seconds': 12.676
}
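The three timing fields are linked by simple arithmetic: latency is total time divided by the number of examples, and throughput is its inverse. A quick check against the numbers above (1000 examples were evaluated; the reported values are rounded):

```python
# Sanity-check the relationship between the timing fields above
# (1000 samples were evaluated; reported values are rounded).
n_samples = 1000
total_time = 12.676

latency = total_time / n_samples      # latency_in_seconds
throughput = n_samples / total_time   # samples_per_second

print(round(latency, 3))     # 0.013
print(round(throughput, 3))  # 78.889
```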
Evaluate multiple metrics
import evaluate
eval_results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb",
    data=data,
    metric=evaluate.combine(["accuracy", "recall", "precision", "f1"]),
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)
print(eval_results)
Output
{
    'accuracy': 0.918,
    'f1': 0.916,
    'precision': 0.9147,
    'recall': 0.9187,
    'latency_in_seconds': 0.013,
    'samples_per_second': 78.887,
    'total_time_in_seconds': 12.676
}
2. Token Classification
Benchmarking several models
import pandas as pd
from datasets import load_dataset
from evaluate import evaluator
from transformers import pipeline
models = [
    "xlm-roberta-large-finetuned-conll03-english",
    "dbmdz/bert-large-cased-finetuned-conll03-english",
    "elastic/distilbert-base-uncased-finetuned-conll03-english",
    "dbmdz/electra-large-discriminator-finetuned-conll03-english",
    "gunghio/distilbert-base-multilingual-cased-finetuned-conll2003-ner",
    "philschmid/distilroberta-base-ner-conll2003",
    "Jorgeutd/albert-base-v2-finetuned-ner",
]
data = load_dataset("conll2003", split="validation").shuffle().select(range(1000))
task_evaluator = evaluator("token-classification")
results = []
for model in models:
    results.append(
        task_evaluator.compute(
            model_or_pipeline=model, data=data, metric="seqeval"
        )
    )
df = pd.DataFrame(results, index=models)
df[["overall_f1", "overall_accuracy", "total_time_in_seconds", "samples_per_second", "latency_in_seconds"]]
Output
The result is a table that looks like this:
| model | overall_f1 | overall_accuracy | total_time_in_seconds | samples_per_second | latency_in_seconds |
| --- | --- | --- | --- | --- | --- |
| Jorgeutd/albert-base-v2-finetuned-ner | 0.941 | 0.989 | 4.515 | 221.468 | 0.005 |
| dbmdz/bert-large-cased-finetuned-conll03-english | 0.962 | 0.881 | 11.648 | 85.850 | 0.012 |
| dbmdz/electra-large-discriminator-finetuned-conll03-english | 0.965 | 0.881 | 11.456 | 87.292 | 0.011 |
| elastic/distilbert-base-uncased-finetuned-conll03-english | 0.940 | 0.989 | 2.318 | 431.378 | 0.002 |
| gunghio/distilbert-base-multilingual-cased-finetuned-conll2003-ner | 0.947 | 0.991 | 2.376 | 420.873 | |
Visualizing results
from evaluate.visualization import radar_plot

plot = radar_plot(data=results, model_names=models, invert_range=["latency_in_seconds"])
plot.show()
3. Question Answering
Confidence intervals
from datasets import load_dataset
from evaluate import evaluator
task_evaluator = evaluator("question-answering")
data = load_dataset("squad", split="validation[:1000]")
eval_results = task_evaluator.compute(
    model_or_pipeline="distilbert-base-uncased-distilled-squad",
    data=data,
    metric="squad",
    strategy="bootstrap",
    n_resamples=30
)
Output
{
    'exact_match': {
        'confidence_interval': (79.67, 84.54),
        'score': 82.30,
        'standard_error': 1.28
    },
    'f1': {
        'confidence_interval': (85.30, 88.88),
        'score': 87.23,
        'standard_error': 0.97
    },
    'latency_in_seconds': 0.0085,
    'samples_per_second': 117.31,
    'total_time_in_seconds': 8.52
}
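With `strategy="bootstrap"`, the confidence interval is estimated by repeatedly resampling the per-example scores. A minimal percentile-bootstrap sketch of the idea (illustrative only; the library's actual implementation differs):

```python
import random

# Minimal percentile-bootstrap sketch: resample per-example scores with
# replacement and take percentiles of the resampled means. Illustrative
# only -- not the library's implementation.
def bootstrap_ci(scores, n_resamples=30, alpha=0.05, seed=0):
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * (n_resamples - 1))]
    hi = means[int((1 - alpha / 2) * (n_resamples - 1))]
    return lo, hi

# Per-example exact-match scores (0 or 1) for a toy run
scores = [1, 1, 1, 0, 1, 0, 1, 1, 1, 1]
print(bootstrap_ci(scores))
```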
4. Image classification
Handling large datasets
from datasets import load_dataset
from evaluate import evaluator
from transformers import pipeline

data = load_dataset("imagenet-1k", split="validation", use_auth_token=True)
pipe = pipeline(
    task="image-classification",
    model="facebook/deit-small-distilled-patch16-224"
)
task_evaluator = evaluator("image-classification")
eval_results = task_evaluator.compute(
    model_or_pipeline=pipe,
    data=data,
    metric="accuracy",
    label_mapping=pipe.model.config.label2id
)
Using the evaluator with custom pipelines
The evaluator is designed to work with transformers pipelines out-of-the-box. However, in many cases you might have a model or pipeline that's not part of the transformers ecosystem. You can still use the evaluator to easily compute metrics for them. In this guide we show how to do this for a scikit-learn pipeline and a spaCy pipeline, starting with the scikit-learn case.
1. Scikit-learn
First we need a trained model; we will train a simple text classifier on the IMDB dataset.
from datasets import load_dataset
# download the dataset
ds = load_dataset("imdb")
# build a simple TF-IDF preprocessor and naive Bayes classifier, wrapped in a Pipeline
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
text_clf.fit(ds["train"]["text"], ds["train"]["label"])
Following the convention of the TextClassificationPipeline in transformers, our pipeline should be callable and return a list of dictionaries. In addition, the evaluator uses the task attribute to check whether a pipeline is compatible. We can write a small wrapper class for this:
class ScikitEvalPipeline:
    def __init__(self, pipeline):
        self.pipeline = pipeline
        self.task = "text-classification"

    def __call__(self, input_texts, **kwargs):
        return [{"label": p} for p in self.pipeline.predict(input_texts)]
pipe = ScikitEvalPipeline(text_clf)
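Before handing the wrapper to the evaluator, it can be sanity-checked with any object exposing a `predict` method; here a hypothetical stub replaces the trained sklearn pipeline (the wrapper class is repeated so the snippet is self-contained):

```python
# Sanity-check the wrapper contract with a stub predictor (hypothetical);
# the wrapper class is repeated from above to keep the snippet self-contained.
class ScikitEvalPipeline:
    def __init__(self, pipeline):
        self.pipeline = pipeline
        self.task = "text-classification"

    def __call__(self, input_texts, **kwargs):
        return [{"label": p} for p in self.pipeline.predict(input_texts)]

class StubPredictor:
    """Toy stand-in for the trained sklearn pipeline."""
    def predict(self, texts):
        return [1 if "good" in t else 0 for t in texts]

pipe = ScikitEvalPipeline(StubPredictor())
print(pipe(["a good movie", "a bad movie"]))  # [{'label': 1}, {'label': 0}]
```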
Pass the pipeline to the evaluator:
from evaluate import evaluator
task_evaluator = evaluator("text-classification")
task_evaluator.compute(pipe, ds["test"], "accuracy")
>>> {'accuracy': 0.82956}
With this simple wrapper, the evaluator can work with any model from any framework; all it takes is implementing this small adapter. In the __call__ method you can implement whatever logic is needed for an efficient forward pass through the model.
2. spaCy
We will use the polarity feature of the spacytextblob project to build a simple sentiment analyzer.
First, install the project and download its resources:
pip install spacytextblob
python -m textblob.download_corpora
python -m spacy download en_core_web_sm
Then we can simply load the nlp pipeline and add the spacytextblob component:
import spacy
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')
Use the polarity feature added by spacytextblob to get the sentiment of a text:
texts = ["This movie is horrible", "This movie is awesome"]
results = nlp.pipe(texts)
for txt, res in zip(texts, results):
    print(f"{txt} | Polarity: {res._.blob.polarity}")
Now we can wrap it in a simple wrapper class, just as in the scikit-learn example above. It only has to return a list of dictionaries with the predicted labels: if the polarity is non-negative we predict positive sentiment, otherwise negative:
class SpacyEvalPipeline:
    def __init__(self, nlp):
        self.nlp = nlp
        self.task = "text-classification"

    def __call__(self, input_texts, **kwargs):
        results = []
        for p in self.nlp.pipe(input_texts):
            if p._.blob.polarity >= 0:
                results.append({"label": 1})
            else:
                results.append({"label": 0})
        return results
pipe = SpacyEvalPipeline(nlp)
This class is compatible with the evaluator, so we can use the same evaluator instance from the previous example along with the IMDB test set:
task_evaluator.compute(pipe, ds["test"], "accuracy")
>>> {'accuracy': 0.6914}
This will take longer than the scikit-learn example, but after roughly 10-15 minutes you will have the evaluation results.
伊织 2023-03-20 (Mon)