HuggingFace - Evaluate




About Evaluate

A library for easily evaluating machine learning models and datasets.
Visit the 🤗 Evaluate organization for a full list of available metrics.
Each metric has a dedicated Space with an interactive demo for how to use the metric, and a documentation card detailing the metric's limitations and usage.


Metric categories:
https://huggingface.co/docs/evaluate/choosing_a_metric


Installing evaluate

pip3 install evaluate

Install from source

git clone https://github.com/huggingface/evaluate.git
cd evaluate
pip install -e .

Check the installation

python -c "import evaluate; print(evaluate.load('exact_match').compute(references=['hello'], predictions=['hello']))"
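
If the installation worked, this should print something like {'exact_match': 1.0}.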

Usage

Documentation: https://huggingface.co/docs/evaluate/a_quick_tour


Take accuracy as an example:

See accuracy at: https://huggingface.co/spaces/evaluate-metric/accuracy

>>> accuracy_metric = evaluate.load("accuracy")
>>> results = accuracy_metric.compute(references=[0, 1], predictions=[0, 1])
>>> print(results)
{'accuracy': 1.0} 


Inspecting accuracy_metric

>>> accuracy_metric
EvaluationModule(name: "accuracy", module_type: "metric", features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}, usage: """
Args:
    predictions (`list` of `int`): Predicted labels.
    references (`list` of `int`): Ground truth labels.
    normalize (`boolean`): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
    sample_weight (`list` of `float`): Sample weights Defaults to None.

Returns:
    accuracy (`float` or `int`): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0, or the number of examples input, if `normalize` is set to `True`.. A higher score means higher accuracy.

Examples:

    Example 1-A simple example
        >>> accuracy_metric = evaluate.load("accuracy")
        ...
""", stored examples: 0)

>>> accuracy_metric.description   # view the description
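
The usage string above also documents the optional normalize and sample_weight arguments. A quick illustrative sketch (the exact return formatting may differ slightly):

>>> # normalize=False returns the number of correctly classified samples instead of the fraction
>>> accuracy_metric.compute(references=[0, 1, 2, 0], predictions=[0, 1, 1, 0], normalize=False)
{'accuracy': 3.0}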

Listing all evaluation modules

>>> evaluate.list_evaluation_modules("metric")
['lvwerra/test', 'precision', 'code_eval', 'roc_auc', 'cuad', 'xnli', 'rouge', 'pearsonr', 'mse', 'super_glue', 'comet', 'cer', 'sacrebleu', 'mahalanobis', 'wer', 'competition_math', 'f1', 'recall', 'coval', 'mauve', 'xtreme_s', 'bleurt', 'ter', 'accuracy', 'exact_match', 'indic_glue', 'spearmanr', 'mae', 'squad', 'chrf', 'glue', 'perplexity', 'mean_iou', 'squad_v2', 'meteor', 'bleu', 'wiki_split', 'sari', 'frugalscore', 'google_bleu', 'bertscore', 'matthews_correlation', 'seqeval', 'trec_eval', 'rl_reliability', 'jordyvl/ece', 'angelina-wang/directional_bias_amplification', 'cpllab/syntaxgym', 'lvwerra/bary_score', 'kaggle/amex', 'kaggle/ai4code', 'hack/test_metric', 'yzha/ctc_eval', 'codeparrot/apps_metric', 'mfumanelli/geometric_mean', 'daiyizheng/valid', 'poseval', 'erntkn/dice_coefficient', 'mgfrantz/roc_auc_macro', 'Vlasta/pr_auc', 'gorkaartola/metric_for_tp_fp_samples', 'idsedykh/metric', 'idsedykh/codebleu2', 'idsedykh/codebleu', 'idsedykh/megaglue', 'kasmith/woodscore', 'cakiki/ndcg', 'brier_score', 'Vertaix/vendiscore', 'GMFTBY/dailydialogevaluate', 'GMFTBY/dailydialog_evaluate', 'jzm-mailchimp/joshs_second_test_metric', 'ola13/precision_at_k', 'yulong-me/yl_metric', 'abidlabs/mean_iou', 'abidlabs/mean_iou2', 'KevinSpaghetti/accuracyk', 'Felipehonorato/my_metric', 'NimaBoscarino/weat', 'ronaldahmed/nwentfaithfulness', 'Viona/infolm', 'kyokote/my_metric2', 'kashif/mape', 'Ochiroo/rouge_mn', 'giulio98/code_eval_outputs', 'leslyarun/fbeta_score', 'giulio98/codebleu', 'anz2/iliauniiccocrevaluation', 'zbeloki/m2', 'xu1998hz/sescore', 'mase', 'mape', 'smape', 'dvitel/codebleu', 'NCSOFT/harim_plus', 'JP-SystemsX/nDCG', 'sportlosos/sescore', 'Drunper/metrica_tesi', 'jpxkqx/peak_signal_to_noise_ratio', 'jpxkqx/signal_to_reconstrution_error', 'hpi-dhc/FairEval', 'nist_mt', 'lvwerra/accuracy_score', 'character', 'charcut_mt', 'ybelkada/cocoevaluate', 'harshhpareek/bertscore', 'posicube/mean_reciprocal_rank', 'bstrai/classification_report', 'omidf/squad_precision_recall', 'Josh98/nl2bash_m', 'BucketHeadP65/confusion_matrix', 'BucketHeadP65/roc_curve', 'yonting/average_precision_score', 'transZ/test_parascore']

Loading a specific module type

word_length = evaluate.load("word_length", module_type="measurement")
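
A measurement describes properties of the data itself rather than model predictions. A rough sketch of using it, based on the quick tour (the result key may differ):

data = ["hello world"]
results = word_length.compute(data=data)
print(results)
# expected to be something like {'average_word_length': 2}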

Loading a community module

element_count = evaluate.load("lvwerra/element_count", module_type="measurement")

Choosing a metric

Source: https://huggingface.co/docs/evaluate/choosing_a_metric

  • Generic metrics
  • Task-specific metrics
  • Dataset-specific metrics

Generic metrics

precision_metric = evaluate.load("precision")
results = precision_metric.compute(references=[0, 1], predictions=[0, 1])
print(results)
{'precision': 1.0}
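
Another generic metric is f1. A minimal sketch (with the default binary averaging, precision 1.0 and recall 0.5 for the positive class should give an F1 of roughly 0.67):

f1_metric = evaluate.load("f1")
results = f1_metric.compute(references=[0, 1, 1, 0], predictions=[0, 1, 0, 0])
print(results)
# roughly {'f1': 0.67}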

Task-specific metrics
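
Task-specific metrics are tied to a single task, for example BLEU for machine translation or seqeval for token classification (as used in the benchmarking example below). A hedged sketch with bleu (note that references is a list of reference lists, one per prediction):

bleu = evaluate.load("bleu")
predictions = ["the cat sat on the mat"]
references = [["the cat sat on the mat"]]
results = bleu.compute(predictions=predictions, references=references)
print(results)
# a perfect match should yield 'bleu': 1.0 along with n-gram precision statistics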



Dataset-specific metrics

from evaluate import load
squad_metric = load("squad")
predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22'}]
references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}]
results = squad_metric.compute(predictions=predictions, references=references)
results
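
For this exactly matching prediction the result should be full marks on both SQuAD scores, i.e. something like {'exact_match': 100.0, 'f1': 100.0}.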

The Evaluator class

https://huggingface.co/docs/evaluate/base_evaluator
https://huggingface.co/docs/evaluate/v0.4.0/en/package_reference/evaluator_classes


Usage examples

1. Text classification

Evaluate models on the Hub
from datasets import load_dataset
from evaluate import evaluator
from transformers import AutoModelForSequenceClassification, pipeline

data = load_dataset("imdb", split="test").shuffle(seed=42).select(range(1000))
task_evaluator = evaluator("text-classification")

# 1. Pass a model name or path
eval_results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb",
    data=data,
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)

# 2. Pass an instantiated model
model = AutoModelForSequenceClassification.from_pretrained("lvwerra/distilbert-imdb")

eval_results = task_evaluator.compute(
    model_or_pipeline=model,
    data=data,
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)

# 3. Pass an instantiated pipeline
pipe = pipeline("text-classification", model="lvwerra/distilbert-imdb")

eval_results = task_evaluator.compute(
    model_or_pipeline=pipe,
    data=data,
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)
print(eval_results)

Output

{
    'accuracy': 0.918,
    'latency_in_seconds': 0.013,
    'samples_per_second': 78.887,
    'total_time_in_seconds': 12.676
}

Evaluate multiple metrics
import evaluate

eval_results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb",
    data=data,
    metric=evaluate.combine(["accuracy", "recall", "precision", "f1"]),
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)
print(eval_results)



Output

{
    'accuracy': 0.918,
    'f1': 0.916,
    'precision': 0.9147,
    'recall': 0.9187,
    'latency_in_seconds': 0.013,
    'samples_per_second': 78.887,
    'total_time_in_seconds': 12.676
}
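
evaluate.combine can also be used on its own, outside the evaluator. A minimal sketch:

import evaluate

clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])
results = clf_metrics.compute(predictions=[0, 1, 0], references=[0, 1, 1])
print(results)
# should return one entry per combined metric (accuracy, f1, precision, recall)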

2. Token Classification

Benchmarking several models
import pandas as pd
from datasets import load_dataset
from evaluate import evaluator
from transformers import pipeline

models = [
    "xlm-roberta-large-finetuned-conll03-english",
    "dbmdz/bert-large-cased-finetuned-conll03-english",
    "elastic/distilbert-base-uncased-finetuned-conll03-english",
    "dbmdz/electra-large-discriminator-finetuned-conll03-english",
    "gunghio/distilbert-base-multilingual-cased-finetuned-conll2003-ner",
    "philschmid/distilroberta-base-ner-conll2003",
    "Jorgeutd/albert-base-v2-finetuned-ner",
]

data = load_dataset("conll2003", split="validation").shuffle().select(1000)
task_evaluator = evaluator("token-classification")

results = []
for model in models:
    results.append(
        task_evaluator.compute(
            model_or_pipeline=model, data=data, metric="seqeval"
            )
        )

df = pd.DataFrame(results, index=models)
df[["overall_f1", "overall_accuracy", "total_time_in_seconds", "samples_per_second", "latency_in_seconds"]]

Output

The result is a table that looks like this:

| model | overall_f1 | overall_accuracy | total_time_in_seconds | samples_per_second | latency_in_seconds |
|---|---|---|---|---|---|
| Jorgeutd/albert-base-v2-finetuned-ner | 0.941 | 0.989 | 4.515 | 221.468 | 0.005 |
| dbmdz/bert-large-cased-finetuned-conll03-english | 0.962 | 0.881 | 11.648 | 85.850 | 0.012 |
| dbmdz/electra-large-discriminator-finetuned-conll03-english | 0.965 | 0.881 | 11.456 | 87.292 | 0.011 |
| elastic/distilbert-base-uncased-finetuned-conll03-english | 0.940 | 0.989 | 2.318 | 431.378 | 0.002 |
| gunghio/distilbert-base-multilingual-cased-finetuned-conll2003-ner | 0.947 | 0.991 | 2.376 | 420.873 | |

Visualizing results
from evaluate.visualization import radar_plot

plot = radar_plot(data=results, model_names=models, invert_range=["latency_in_seconds"])
plot.show()

(Radar plot comparing the benchmarked models on the metrics above.)


3. Question Answering

Confidence intervals
from datasets import load_dataset
from evaluate import evaluator

task_evaluator = evaluator("question-answering")

data = load_dataset("squad", split="validation[:1000]")
eval_results = task_evaluator.compute(
    model_or_pipeline="distilbert-base-uncased-distilled-squad",
    data=data,
    metric="squad",
    strategy="bootstrap",
    n_resamples=30
)

Output

{
    'exact_match':
    {
        'confidence_interval': (79.67, 84.54),
        'score': 82.30,
        'standard_error': 1.28
    },
    'f1':
    {
        'confidence_interval': (85.30, 88.88),
        'score': 87.23,
        'standard_error': 0.97
    },
    'latency_in_seconds': 0.0085,
    'samples_per_second': 117.31,
    'total_time_in_seconds': 8.52
 }

4. Image classification

Handling large datasets
data = load_dataset("imagenet-1k", split="validation", use_auth_token=True)

pipe = pipeline(
    task="image-classification",
    model="facebook/deit-small-distilled-patch16-224"
)

task_evaluator = evaluator("image-classification")
eval_results = task_evaluator.compute(
    model_or_pipeline=pipe,
    data=data,
    metric="accuracy",
    label_mapping=pipe.model.config.label2id
)

Using the evaluator with custom pipelines

The evaluator is designed to work with transformers pipelines out-of-the-box. However, in many cases you might have a model or pipeline that's not part of the transformers ecosystem. You can still use the evaluator to easily compute metrics for them. In this guide we show how to do this for a Scikit-Learn pipeline and a Spacy pipeline. Let's start with the Scikit-Learn case.

1. Scikit-Learn

First we need to train a model. We will train a simple text classifier on the IMDB dataset:

from datasets import load_dataset

# Download the data
ds = load_dataset("imdb")

# Build a simple TF-IDF processor and Naive Bayes classifier, wrapped in a Pipeline


from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

text_clf = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', MultinomialNB()),
])

text_clf.fit(ds["train"]["text"], ds["train"]["label"])

Following the convention of transformers' TextClassificationPipeline, our pipeline should be callable and return a list of dictionaries.
In addition, the evaluator uses the task attribute to check whether a pipeline is compatible. We can write a small wrapper class for this:

class ScikitEvalPipeline:
    def __init__(self, pipeline):
        self.pipeline = pipeline
        self.task = "text-classification"

    def __call__(self, input_texts, **kwargs):
        return [{"label": p} for p in self.pipeline.predict(input_texts)]

pipe = ScikitEvalPipeline(text_clf)

Pass the pipeline to the evaluator:

from evaluate import evaluator

task_evaluator = evaluator("text-classification")
task_evaluator.compute(pipe, ds["test"], "accuracy")

>>> {'accuracy': 0.82956}

That is all it takes to use the evaluator with any model from any framework: just implement this simple wrapper.

In the __call__ method you can implement all the logic needed for an efficient forward pass through your model.
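
As a minimal sketch (not from the original guide) of such logic, the hypothetical BatchedEvalPipeline below chunks the inputs and calls an arbitrary predict_fn batch by batch; predict_fn is assumed to map a list of texts to a list of integer labels:

class BatchedEvalPipeline:
    def __init__(self, predict_fn, batch_size=256):
        self.predict_fn = predict_fn        # hypothetical: list[str] -> list[int]
        self.batch_size = batch_size
        self.task = "text-classification"   # required so the evaluator accepts the pipeline

    def __call__(self, input_texts, **kwargs):
        labels = []
        # run the model batch by batch instead of one example at a time
        for i in range(0, len(input_texts), self.batch_size):
            batch = input_texts[i : i + self.batch_size]
            labels.extend(self.predict_fn(batch))
        return [{"label": label} for label in labels]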


2. Spacy

We will use the polarity feature of the spacytextblob project to build a simple sentiment analyzer.
First, install the project and download its resources:

pip install spacytextblob
python -m textblob.download_corpora
python -m spacy download en_core_web_sm

Then we can simply load the nlp pipeline and add the spacytextblob component:

import spacy

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')

With spacytextblob added, we can use the polarity attribute to get the sentiment of a text:

texts = ["This movie is horrible", "This movie is awesome"]
results = nlp.pipe(texts)

for txt, res in zip(texts, results):
    print(f"{text} | Polarity: {res._.blob.polarity}")

Now we can wrap it in a simple wrapper class, just like in the Scikit-Learn example above.

It just needs to return a list of dictionaries with the predicted labels. If the polarity is non-negative we predict positive sentiment, otherwise negative:

class SpacyEvalPipeline:
    def __init__(self, nlp):
        self.nlp = nlp
        self.task = "text-classification"

    def __call__(self, input_texts, **kwargs):
        results =[]
        for p in self.nlp.pipe(input_texts):
            if p._.blob.polarity>=0:
                results.append({"label": 1})
            else:
                results.append({"label": 0})
        return results

pipe = SpacyEvalPipeline(nlp)

This class is compatible with the evaluator, so we can use the same evaluator instance as in the previous example together with the IMDb test set:

task_evaluator.compute(pipe, ds["test"], "accuracy")
>>> {'accuracy': 0.6914}

This will take somewhat longer than the Scikit-Learn example, but after roughly 10-15 minutes you will have the evaluation results.


伊织 2023-03-20 (Mon)
