About Evaluate
A library for easily evaluating machine learning models and datasets.
Visit the 🤗 Evaluate organization for a full list of available metrics.
Each metric has a dedicated Space with an interactive demo showing how to use the metric, and a documentation card detailing the metric's limitations and usage.
- Docs: https://huggingface.co/docs/evaluate
- GitHub: https://github.com/huggingface/evaluate.git
- All metrics: https://huggingface.co/metrics
Categories:
https://huggingface.co/docs/evaluate/choosing_a_metric
- Metric: https://huggingface.co/evaluate-metric
- Comparison: https://huggingface.co/evaluate-comparison
- Measurement: https://huggingface.co/evaluate-measurement
Installing evaluate
pip3 install evaluate
Installing from source
git clone https://github.com/huggingface/evaluate.git
cd evaluate
pip install -e .
Post-install check
python -c "import evaluate; print(evaluate.load('exact_match').compute(references=['hello'], predictions=['hello']))"
Usage
Docs: https://huggingface.co/docs/evaluate/a_quick_tour
Taking accuracy as an example:
See accuracy at: https://huggingface.co/spaces/evaluate-metric/accuracy
>>> accuracy_metric = evaluate.load("accuracy")
>>> results = accuracy_metric.compute(references=[0, 1], predictions=[0, 1])
>>> print(results)
{'accuracy': 1.0}
Inspect accuracy_metric
>>> accuracy_metric
EvaluationModule(name: "accuracy", module_type: "metric", features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}, usage: """
Args:
predictions (`list` of `int`): Predicted labels.
references (`list` of `int`): Ground truth labels.
normalize (`boolean`): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
sample_weight (`list` of `float`): Sample weights. Defaults to None.
Returns:
accuracy (`float` or `int`): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0, or the number of examples input if `normalize` is set to `False`. A higher score means higher accuracy.
Examples:
Example 1-A simple example
>>> accuracy_metric = evaluate.load("accuracy")
...
""", stored examples: 0)
>>> accuracy_metric.description  # view the description
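The `normalize` and `sample_weight` arguments can be illustrated with a small pure-Python sketch of the semantics described in the usage string (an illustration only, not the library's actual implementation):

```python
# Illustrative sketch of the accuracy semantics described above --
# not the library's implementation.
def accuracy(references, predictions, normalize=True, sample_weight=None):
    if sample_weight is None:
        sample_weight = [1.0] * len(references)
    # weighted count of correctly classified samples
    correct = sum(w for r, p, w in zip(references, predictions, sample_weight) if r == p)
    if normalize:
        return correct / sum(sample_weight)  # fraction in [0, 1]
    return correct  # raw (weighted) count

print(accuracy(references=[0, 1, 1, 0], predictions=[0, 1, 0, 0]))   # 0.75
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0], normalize=False))         # 3.0
```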
List all evaluation modules
>>> evaluate.list_evaluation_modules("metric")
['lvwerra/test', 'precision', 'code_eval', 'roc_auc', 'cuad', 'xnli', 'rouge', 'pearsonr', 'mse', 'super_glue', 'comet', 'cer', 'sacrebleu', 'mahalanobis', 'wer', 'competition_math', 'f1', 'recall', 'coval', 'mauve', 'xtreme_s', 'bleurt', 'ter', 'accuracy', 'exact_match', 'indic_glue', 'spearmanr', 'mae', 'squad', 'chrf', 'glue', 'perplexity', 'mean_iou', 'squad_v2', 'meteor', 'bleu', 'wiki_split', 'sari', 'frugalscore', 'google_bleu', 'bertscore', 'matthews_correlation', 'seqeval', 'trec_eval', 'rl_reliability', 'jordyvl/ece', 'angelina-wang/directional_bias_amplification', 'cpllab/syntaxgym', 'lvwerra/bary_score', 'kaggle/amex', 'kaggle/ai4code', 'hack/test_metric', 'yzha/ctc_eval', 'codeparrot/apps_metric', 'mfumanelli/geometric_mean', 'daiyizheng/valid', 'poseval', 'erntkn/dice_coefficient', 'mgfrantz/roc_auc_macro', 'Vlasta/pr_auc', 'gorkaartola/metric_for_tp_fp_samples', 'idsedykh/metric', 'idsedykh/codebleu2', 'idsedykh/codebleu', 'idsedykh/megaglue', 'kasmith/woodscore', 'cakiki/ndcg', 'brier_score', 'Vertaix/vendiscore', 'GMFTBY/dailydialogevaluate', 'GMFTBY/dailydialog_evaluate', 'jzm-mailchimp/joshs_second_test_metric', 'ola13/precision_at_k', 'yulong-me/yl_metric', 'abidlabs/mean_iou', 'abidlabs/mean_iou2', 'KevinSpaghetti/accuracyk', 'Felipehonorato/my_metric', 'NimaBoscarino/weat', 'ronaldahmed/nwentfaithfulness', 'Viona/infolm', 'kyokote/my_metric2', 'kashif/mape', 'Ochiroo/rouge_mn', 'giulio98/code_eval_outputs', 'leslyarun/fbeta_score', 'giulio98/codebleu', 'anz2/iliauniiccocrevaluation', 'zbeloki/m2', 'xu1998hz/sescore', 'mase', 'mape', 'smape', 'dvitel/codebleu', 'NCSOFT/harim_plus', 'JP-SystemsX/nDCG', 'sportlosos/sescore', 'Drunper/metrica_tesi', 'jpxkqx/peak_signal_to_noise_ratio', 'jpxkqx/signal_to_reconstrution_error', 'hpi-dhc/FairEval', 'nist_mt', 'lvwerra/accuracy_score', 'character', 'charcut_mt', 'ybelkada/cocoevaluate', 'harshhpareek/bertscore', 'posicube/mean_reciprocal_rank', 'bstrai/classification_report', 
'omidf/squad_precision_recall', 'Josh98/nl2bash_m', 'BucketHeadP65/confusion_matrix', 'BucketHeadP65/roc_curve', 'yonting/average_precision_score', 'transZ/test_parascore']
Load a module of a specific type
word_length = evaluate.load("word_length", module_type="measurement")
Load a community module
element_count = evaluate.load("lvwerra/element_count", module_type="measurement")
Choosing a metric
Source: https://huggingface.co/docs/evaluate/choosing_a_metric
- Generic metrics
- Task-specific metrics
- Dataset-specific metrics
Generic metrics
precision_metric = evaluate.load("precision")
results = precision_metric.compute(references=[0, 1], predictions=[0, 1])
print(results)
{'precision': 1.0}
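For intuition, binary precision is the fraction of positive predictions that are correct, TP / (TP + FP); a minimal pure-Python sketch (not the library's code):

```python
# Minimal sketch of binary precision: TP / (TP + FP).
def precision(references, predictions, positive=1):
    tp = sum(1 for r, p in zip(references, predictions) if p == positive and r == positive)
    fp = sum(1 for r, p in zip(references, predictions) if p == positive and r != positive)
    return tp / (tp + fp) if (tp + fp) else 0.0

print(precision([0, 1, 1, 0], [1, 1, 0, 0]))  # 0.5: one true positive, one false positive
```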
Task-specific metrics
Dataset-specific metrics
from evaluate import load
squad_metric = load("squad")
predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22'}]
references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}]
results = squad_metric.compute(predictions=predictions, references=references)
results
The Evaluator classes
https://huggingface.co/docs/evaluate/base_evaluator
https://huggingface.co/docs/evaluate/v0.4.0/en/package_reference/evaluator_classes
- "text-classification": will use the TextClassificationEvaluator.
- "token-classification": will use the TokenClassificationEvaluator.
- "question-answering": will use the QuestionAnsweringEvaluator.
- "image-classification": will use the ImageClassificationEvaluator.
- "text-generation": will use the TextGenerationEvaluator.
- "text2text-generation": will use the Text2TextGenerationEvaluator.
- "summarization": will use the SummarizationEvaluator.
- "translation": will use the TranslationEvaluator.
- "automatic-speech-recognition": will use the AutomaticSpeechRecognitionEvaluator.
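Conceptually, `evaluator(task)` acts as a dispatch table from task name to evaluator class. A simplified sketch of that pattern, with placeholder classes standing in for the real ones (not the library's code):

```python
# Simplified dispatch-table sketch of evaluator(task); the stand-in
# classes below are placeholders, not the library's implementations.
class TextClassificationEvaluator: ...
class QuestionAnsweringEvaluator: ...

TASK_TO_EVALUATOR = {
    "text-classification": TextClassificationEvaluator,
    "question-answering": QuestionAnsweringEvaluator,
    # ... the remaining tasks map the same way
}

def evaluator(task):
    try:
        return TASK_TO_EVALUATOR[task]()
    except KeyError:
        raise KeyError(f"Unknown task: {task}")

print(type(evaluator("text-classification")).__name__)  # TextClassificationEvaluator
```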
Usage examples
1. Text classification
Evaluate models on the Hub
from datasets import load_dataset
from evaluate import evaluator
from transformers import AutoModelForSequenceClassification, pipeline
data = load_dataset("imdb", split="test").shuffle(seed=42).select(range(1000))
task_evaluator = evaluator("text-classification")
# 1. Pass a model name or path
eval_results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb",
    data=data,
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)
# 2. Pass an instantiated model
model = AutoModelForSequenceClassification.from_pretrained("lvwerra/distilbert-imdb")
eval_results = task_evaluator.compute(
    model_or_pipeline=model,
    data=data,
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)
# 3. Pass an instantiated pipeline
pipe = pipeline("text-classification", model="lvwerra/distilbert-imdb")
eval_results = task_evaluator.compute(
    model_or_pipeline=pipe,
    data=data,
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)
print(eval_results)
Output
{
    'accuracy': 0.918,
    'latency_in_seconds': 0.013,
    'samples_per_second': 78.887,
    'total_time_in_seconds': 12.676
}
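The three timing fields are linked by simple arithmetic: latency is total time divided by the number of examples, and throughput is its inverse. A quick check against the numbers above (1000 examples were evaluated; the reported values are rounded):

```python
# Sanity-check the relationship between the timing fields above
# (1000 samples were evaluated; reported values are rounded).
n_samples = 1000
total_time = 12.676

latency = total_time / n_samples      # latency_in_seconds
throughput = n_samples / total_time   # samples_per_second

print(round(latency, 3))     # 0.013
print(round(throughput, 3))  # 78.889
```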
Evaluate multiple metrics
import evaluate
eval_results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb",
    data=data,
    metric=evaluate.combine(["accuracy", "recall", "precision", "f1"]),
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)
print(eval_results)
Output
{
    'accuracy': 0.918,
    'f1': 0.916,
    'precision': 0.9147,
    'recall': 0.9187,
    'latency_in_seconds': 0.013,
    'samples_per_second': 78.887,
    'total_time_in_seconds': 12.676
}
2. Token Classification
Benchmarking several models
import pandas as pd
from datasets import load_dataset
from evaluate import evaluator
from transformers import pipeline
models = [
    "xlm-roberta-large-finetuned-conll03-english",
    "dbmdz/bert-large-cased-finetuned-conll03-english",
    "elastic/distilbert-base-uncased-finetuned-conll03-english",
    "dbmdz/electra-large-discriminator-finetuned-conll03-english",
    "gunghio/distilbert-base-multilingual-cased-finetuned-conll2003-ner",
    "philschmid/distilroberta-base-ner-conll2003",
    "Jorgeutd/albert-base-v2-finetuned-ner",
]
data = load_dataset("conll2003", split="validation").shuffle().select(range(1000))
task_evaluator = evaluator("token-classification")
results = []
for model in models:
    results.append(
        task_evaluator.compute(
            model_or_pipeline=model, data=data, metric="seqeval"
        )
    )
df = pd.DataFrame(results, index=models)
df[["overall_f1", "overall_accuracy", "total_time_in_seconds", "samples_per_second", "latency_in_seconds"]]
Output
The result is a table that looks like this:
| model | overall_f1 | overall_accuracy | total_time_in_seconds | samples_per_second | latency_in_seconds |
| --- | --- | --- | --- | --- | --- |
| Jorgeutd/albert-base-v2-finetuned-ner | 0.941 | 0.989 | 4.515 | 221.468 | 0.005 |
| dbmdz/bert-large-cased-finetuned-conll03-english | 0.962 | 0.881 | 11.648 | 85.850 | 0.012 |
| dbmdz/electra-large-discriminator-finetuned-conll03-english | 0.965 | 0.881 | 11.456 | 87.292 | 0.011 |
| elastic/distilbert-base-uncased-finetuned-conll03-english | 0.940 | 0.989 | 2.318 | 431.378 | 0.002 |
| gunghio/distilbert-base-multilingual-cased-finetuned-conll2003-ner | 0.947 | 0.991 | 2.376 | 420.873 | |
Visualizing results
from evaluate.visualization import radar_plot

plot = radar_plot(data=results, model_names=models, invert_range=["latency_in_seconds"])
plot.show()
3. Question Answering
Confidence intervals
from datasets import load_dataset
from evaluate import evaluator
task_evaluator = evaluator("question-answering")
data = load_dataset("squad", split="validation[:1000]")
eval_results = task_evaluator.compute(
    model_or_pipeline="distilbert-base-uncased-distilled-squad",
    data=data,
    metric="squad",
    strategy="bootstrap",
    n_resamples=30
)
Output
{
    'exact_match': {
        'confidence_interval': (79.67, 84.54),
        'score': 82.30,
        'standard_error': 1.28
    },
    'f1': {
        'confidence_interval': (85.30, 88.88),
        'score': 87.23,
        'standard_error': 0.97
    },
    'latency_in_seconds': 0.0085,
    'samples_per_second': 117.31,
    'total_time_in_seconds': 8.52
}
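With `strategy="bootstrap"`, the confidence interval is estimated by repeatedly resampling the per-example scores. A minimal percentile-bootstrap sketch of the idea (illustrative only; the library's actual implementation differs):

```python
import random

# Minimal percentile-bootstrap sketch: resample per-example scores with
# replacement and take percentiles of the resampled means. Illustrative
# only -- not the library's implementation.
def bootstrap_ci(scores, n_resamples=30, alpha=0.05, seed=0):
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * (n_resamples - 1))]
    hi = means[int((1 - alpha / 2) * (n_resamples - 1))]
    return lo, hi

# Per-example exact-match scores (0 or 1) for a toy run
scores = [1, 1, 1, 0, 1, 0, 1, 1, 1, 1]
print(bootstrap_ci(scores))
```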
4. Image classification
Handling large datasets
from datasets import load_dataset
from evaluate import evaluator
from transformers import pipeline

data = load_dataset("imagenet-1k", split="validation", use_auth_token=True)
pipe = pipeline(
    task="image-classification",
    model="facebook/deit-small-distilled-patch16-224"
)
task_evaluator = evaluator("image-classification")
eval_results = task_evaluator.compute(
    model_or_pipeline=pipe,
    data=data,
    metric="accuracy",
    label_mapping=pipe.model.config.label2id
)
Using the evaluator with custom pipelines
The evaluator is designed to work with transformers pipelines out-of-the-box. However, in many cases you might have a model or pipeline that's not part of the transformers ecosystem. You can still use the evaluator to easily compute metrics for them. In this guide we show how to do this for a scikit-learn pipeline and a spaCy pipeline, starting with the scikit-learn case.
1. Scikit-learn
First we need a trained model; we will train a simple text classifier on the IMDB dataset.
from datasets import load_dataset
# download the dataset
ds = load_dataset("imdb")
# build a simple TF-IDF preprocessor and naive Bayes classifier, wrapped in a Pipeline
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
text_clf.fit(ds["train"]["text"], ds["train"]["label"])
Following the convention of the TextClassificationPipeline in transformers, our pipeline should be callable and return a list of dictionaries. In addition, the evaluator uses the task attribute to check whether a pipeline is compatible. We can write a small wrapper class for this:
class ScikitEvalPipeline:
    def __init__(self, pipeline):
        self.pipeline = pipeline
        self.task = "text-classification"

    def __call__(self, input_texts, **kwargs):
        return [{"label": p} for p in self.pipeline.predict(input_texts)]
pipe = ScikitEvalPipeline(text_clf)
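Before handing the wrapper to the evaluator, it can be sanity-checked with any object exposing a `predict` method; here a hypothetical stub replaces the trained sklearn pipeline (the wrapper class is repeated so the snippet is self-contained):

```python
# Sanity-check the wrapper contract with a stub predictor (hypothetical);
# the wrapper class is repeated from above to keep the snippet self-contained.
class ScikitEvalPipeline:
    def __init__(self, pipeline):
        self.pipeline = pipeline
        self.task = "text-classification"

    def __call__(self, input_texts, **kwargs):
        return [{"label": p} for p in self.pipeline.predict(input_texts)]

class StubPredictor:
    """Toy stand-in for the trained sklearn pipeline."""
    def predict(self, texts):
        return [1 if "good" in t else 0 for t in texts]

pipe = ScikitEvalPipeline(StubPredictor())
print(pipe(["a good movie", "a bad movie"]))  # [{'label': 1}, {'label': 0}]
```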
Pass the pipeline to the evaluator:
from evaluate import evaluator
task_evaluator = evaluator("text-classification")
task_evaluator.compute(pipe, ds["test"], "accuracy")
>>> {'accuracy': 0.82956}
With this simple wrapper, the evaluator can work with any model from any framework; all it takes is implementing this small adapter. In the __call__ method you can implement whatever logic is needed for an efficient forward pass through the model.
2. spaCy
We will use the polarity feature of the spacytextblob project to build a simple sentiment analyzer.
First, install the project and download its resources:
pip install spacytextblob
python -m textblob.download_corpora
python -m spacy download en_core_web_sm
Then we can simply load the nlp pipeline and add the spacytextblob component:
import spacy
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')
Use the polarity feature added by spacytextblob to get the sentiment of a text:
texts = ["This movie is horrible", "This movie is awesome"]
results = nlp.pipe(texts)
for txt, res in zip(texts, results):
    print(f"{txt} | Polarity: {res._.blob.polarity}")
Now we can wrap it in a simple wrapper class, just as in the scikit-learn example above. It only has to return a list of dictionaries with the predicted labels: if the polarity is non-negative we predict positive sentiment, otherwise negative:
class SpacyEvalPipeline:
    def __init__(self, nlp):
        self.nlp = nlp
        self.task = "text-classification"

    def __call__(self, input_texts, **kwargs):
        results = []
        for p in self.nlp.pipe(input_texts):
            if p._.blob.polarity >= 0:
                results.append({"label": 1})
            else:
                results.append({"label": 0})
        return results
pipe = SpacyEvalPipeline(nlp)
This class is compatible with the evaluator, so we can use the same evaluator instance from the previous example along with the IMDB test set:
task_evaluator.compute(pipe, ds["test"], "accuracy")
>>> {'accuracy': 0.6914}
This will take longer than the scikit-learn example, but after roughly 10-15 minutes you will have the evaluation results.
伊织 2023-03-20 (Mon)