Pyserini: Installation and Usage

Contents

Installation

Usage

msmarco-passage bm25

BEIR bm25

Anserini tutorial

Pyserini tutorial

Doc2Query

Custom datasets


Code

git clone https://github.com/castorini/pyserini.git --recurse-submodules
# pyserini/tools is a git submodule: https://github.com/castorini/anserini-tools

Installation

https://github.com/castorini/pyserini/blob/master/docs/installation.md
Pyserini depends on a Java runtime, which can be installed directly with conda. After installing, check the Java version with "java --version".
To experiment only with the prebuilt indexes, the PyPI installation is enough.
To build your own indexes, use the development installation. Its last step copies the Anserini fatjar into pyserini/resources/jars/. There are two ways to obtain the fatjar:

  1. Build it in the anserini project with "mvn clean package"; the jar is written to anserini/target/anserini-X.Y.Z-SNAPSHOT-fatjar.jar. See https://github.com/castorini/anserini?tab=readme-ov-file#-installation
  2. Download it directly from https://repo1.maven.org/maven2/io/anserini/anserini/0.38.0/anserini-0.38.0-fatjar.jar. See https://github.com/castorini/anserini/blob/master/docs/fatjar-regressions/fatjar-regressions-v0.38.0.md

Usage

Default download (cache) path: ~/.cache/pyserini/
To use a different path: export PYSERINI_CACHE=/path/to/cache
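
The lookup order described above can be sketched in Python as follows. This is only an illustration of "environment variable first, default otherwise"; `resolve_cache_dir` is a hypothetical helper name, not a Pyserini function.

```python
import os

def resolve_cache_dir(environ=None):
    """Return the directory where downloads would be cached."""
    env = os.environ if environ is None else environ
    # PYSERINI_CACHE, when set and non-empty, overrides the default location.
    custom = env.get("PYSERINI_CACHE")
    if custom:
        return custom
    return os.path.join(os.path.expanduser("~"), ".cache", "pyserini")

print(resolve_cache_dir())
```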

msmarco-passage bm25

https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-passage.md 

Download the dataset

mkdir -p collections/msmarco-passage

wget https://msmarco.z22.web.core.windows.net/msmarcoranking/collectionandqueries.tar.gz -P collections/msmarco-passage

# Alternative mirror:
# wget https://www.dropbox.com/s/9f54jg2f71ray3b/collectionandqueries.tar.gz -P collections/msmarco-passage

tar xvfz collections/msmarco-passage/collectionandqueries.tar.gz -C collections/msmarco-passage

Convert the collection to JSONL

python tools/scripts/msmarco/convert_collection_to_jsonl.py \
 --collection-path collections/msmarco-passage/collection.tsv \
 --output-folder collections/msmarco-passage/collection_jsonl
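
To make the target format concrete, here is a minimal sketch of what such a conversion does: each "docid<TAB>text" line becomes one JSON object with the `id` and `contents` fields that JsonCollection reads. The function name, the output file naming (docs00.json, ...) and the toy passages are assumptions of this sketch, not a copy of the actual script.

```python
import json
import os
import tempfile

def convert_tsv_to_jsonl(collection_path, output_folder, max_docs_per_file=1000000):
    """Split 'docid<TAB>text' lines into JSONL files with id/contents fields."""
    os.makedirs(output_folder, exist_ok=True)
    count, out = 0, None
    with open(collection_path, encoding="utf-8") as f:
        for line in f:
            doc_id, text = line.rstrip("\n").split("\t", 1)
            if count % max_docs_per_file == 0:
                # Start a new output file every max_docs_per_file documents.
                if out is not None:
                    out.close()
                name = f"docs{count // max_docs_per_file:02d}.json"
                out = open(os.path.join(output_folder, name), "w", encoding="utf-8")
            record = {"id": doc_id, "contents": text}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    if out is not None:
        out.close()
    return count

# Toy demo on a three-line collection (made-up passages).
with tempfile.TemporaryDirectory() as tmp:
    tsv = os.path.join(tmp, "collection.tsv")
    with open(tsv, "w", encoding="utf-8") as f:
        f.write("0\tfirst passage\n1\tsecond passage\n2\tthird passage\n")
    out_dir = os.path.join(tmp, "collection_jsonl")
    num_docs = convert_tsv_to_jsonl(tsv, out_dir, max_docs_per_file=2)
    files = sorted(os.listdir(out_dir))
    with open(os.path.join(out_dir, files[0]), encoding="utf-8") as f:
        first_record = json.loads(f.readline())

print(num_docs, files, first_record)
```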

Build the index

python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input collections/msmarco-passage/collection_jsonl \
  --index indexes/lucene-index-msmarco-passage \
  --generator DefaultLuceneDocumentGenerator \
  --threads 9 \
  --storePositions --storeDocvectors --storeRaw
# --index is the directory where the index is written

Search

python -m pyserini.search.lucene \
  --index indexes/lucene-index-msmarco-passage \
  --topics msmarco-passage-dev-subset \
  --output runs/run.msmarco-passage.bm25tuned.txt \
  --output-format msmarco \
  --hits 1000 \
  --bm25 --k1 0.82 --b 0.68 \
  --threads 4 --batch-size 16
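
For intuition about what --k1 and --b control, the following is a small sketch of the textbook BM25 formula on a toy, whitespace-tokenized corpus. It is not Lucene's implementation (Lucene differs in details such as how document lengths are stored), so the scores will not match Pyserini output; the corpus and the function name `bm25_score` are made up for illustration.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len, k1=0.82, b=0.68):
    """Score one tokenized document against a tokenized query (textbook BM25)."""
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        df = doc_freqs.get(term, 0)
        # Rarer terms get a larger inverse document frequency.
        idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
        # b controls length normalization, k1 controls term-frequency saturation.
        norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
        score += idf * tf[term] * (k1 + 1.0) / (tf[term] + norm)
    return score

docs = [
    "the cat sat on the mat".split(),
    "dogs and cats are common pets".split(),
    "the stock market fell today".split(),
]
doc_freqs = Counter(term for doc in docs for term in set(doc))
avg_len = sum(len(d) for d in docs) / len(docs)
query = "cat mat".split()
scores = [bm25_score(query, d, doc_freqs, len(docs), avg_len) for d in docs]
print(scores)
```

Only the first toy document contains the exact tokens "cat" and "mat", so it is the only one with a non-zero score; there is no stemming in this sketch, so "cats" does not match "cat".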

Evaluate

python -m pyserini.eval.msmarco_passage_eval \
   tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \
   runs/run.msmarco-passage.bm25tuned.txt

#####################
MRR @10: 0.18741227770955546
QueriesRanked: 6980
#####################
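
As a reading aid for the number above, here is a minimal sketch of how MRR@10 is defined: for each query, take the reciprocal rank of the first relevant passage within the top 10 (0 if there is none), then average. Averaging over every query in the qrels is an assumption of this sketch, and the query and document ids are made up.

```python
def mrr_at_k(qrels, run, k=10):
    """qrels: {qid: set of relevant docids}; run: {qid: ranked list of docids}."""
    total = 0.0
    for qid, relevant in qrels.items():
        for rank, docid in enumerate(run.get(qid, [])[:k], start=1):
            if docid in relevant:
                # Only the first relevant hit counts for reciprocal rank.
                total += 1.0 / rank
                break
    return total / len(qrels)

qrels = {"q1": {"d3"}, "q2": {"d9"}, "q3": {"d1"}}
run = {"q1": ["d1", "d3", "d2"], "q2": ["d4", "d5"], "q3": ["d1"]}
mrr = mrr_at_k(qrels, run)
print(mrr)
```

In the toy data, q1 contributes 1/2, q2 contributes 0 and q3 contributes 1, so the mean is 0.5.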

Other metrics are computed with trec_eval, which requires converting both the run file and the qrels to TREC format:

https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-passage.md#evaluation
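
The run conversion is mostly a matter of reshaping columns: an MS MARCO run line is "qid<TAB>docid<TAB>rank", while a TREC run line is "qid Q0 docid rank score tag". The sketch below shows the idea; since the MS MARCO format carries no score, using 1/rank as the score is an assumption of this sketch, as are the function name, the tag and the toy ids.

```python
def msmarco_to_trec(lines, tag="sketch"):
    """Yield TREC run lines from MS MARCO run lines ('qid<TAB>docid<TAB>rank')."""
    for line in lines:
        qid, docid, rank = line.rstrip("\n").split("\t")
        # The MS MARCO format has no score column, so derive one from the rank.
        score = 1.0 / int(rank)
        yield f"{qid} Q0 {docid} {rank} {score} {tag}"

msmarco_run = ["q1\td7\t1\n", "q1\td2\t2\n"]
trec_run = list(msmarco_to_trec(msmarco_run))
print("\n".join(trec_run))
```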

BEIR bm25

Download the data

wget https://rgw.cs.uwaterloo.ca/pyserini/data/beir-v1.0.0-corpus.tar -P collections/
tar xvf collections/beir-v1.0.0-corpus.tar -C collections/

Anserini tutorial

Installation: see https://github.com/castorini/anserini (Anserini is a Lucene toolkit for reproducible information retrieval research).
As with the Pyserini development installation, copy the fatjar into the anserini/target/ directory.

Running: see docs/regressions/regressions-beir-v1.0.0-scifact.flat.md in the castorini/anserini repository (master branch).

https://github.com/castorini/pyserini/blob/master/pyserini/resources/index-metadata/lucene-inverted.beir-v1.0.0-flat.20221116.505594.README.md

# build index->search->metric pipeline
python src/main/python/run_regression.py --index --verify --search --regression beir-v1.0.0-scifact.flat

# only print command for each step
python src/main/python/run_regression.py --index --verify --search --regression beir-v1.0.0-scifact.flat --dry-run

Pyserini tutorial

# build index
python -m pyserini.index.lucene  \
--collection BeirFlatCollection \
--input collections/beir-v1.0.0/corpus/scifact/ \
--generator DefaultLuceneDocumentGenerator \
--index indexes/lucene-inverted.beir-v1.0.0-scifact.flat/ \
--threads 1 \
--storePositions --storeDocvectors --storeRaw

# search
python -m pyserini.search.lucene \
--index indexes/lucene-inverted.beir-v1.0.0-scifact.flat/ \
--topics tools/topics-and-qrels/topics.beir-v1.0.0-scifact.test.tsv.gz \
--output runs/run.inverted.beir-v1.0.0-scifact.flat.test.bm25 \
--bm25  --hits 1000

# metric
python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 tools/topics-and-qrels/qrels.beir-v1.0.0-scifact.test.txt runs/run.inverted.beir-v1.0.0-scifact.flat.test.bm25
python -m pyserini.eval.trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.beir-v1.0.0-scifact.test.txt runs/run.inverted.beir-v1.0.0-scifact.flat.test.bm25
python -m pyserini.eval.trec_eval -c -m recall.1000 tools/topics-and-qrels/qrels.beir-v1.0.0-scifact.test.txt runs/run.inverted.beir-v1.0.0-scifact.flat.test.bm25
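
To clarify what ndcg_cut.10 and recall.100 measure, here is a per-query sketch of both, assuming linear gain (the relevance label itself) and a log2 rank discount for nDCG. trec_eval reports the mean over queries; this sketch scores a single query, and the ids are made up.

```python
import math

def ndcg_at_k(relevance, ranking, k=10):
    """relevance: {docid: graded label}; ranking: list of docids, best first."""
    dcg = sum(relevance.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranking[:k]))
    # The ideal ranking places the highest labels first.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(relevance, ranking, k=100):
    """Fraction of relevant documents that appear in the top k."""
    relevant = {d for d, g in relevance.items() if g > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranking[:k])) / len(relevant)

relevance = {"d1": 1, "d2": 1}
ranking = ["d3", "d1", "d4"]
ndcg = ndcg_at_k(relevance, ranking)
recall = recall_at_k(relevance, ranking)
print(ndcg, recall)
```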

Doc2Query

See docs/experiments-doc2query.md in the castorini/anserini repository (master branch).

# Download the predicted queries; line i holds the expanded queries for the i-th document in collection.tsv
wget https://git.uwaterloo.ca/jimmylin/doc2query-data/raw/master/base/msmarco-passage-pred-test_topk10.tar.gz -P collections/msmarco-passage
tar -xzvf collections/msmarco-passage/msmarco-passage-pred-test_topk10.tar.gz -C collections/msmarco-passage

python tools/scripts/msmarco/augment_collection_with_predictions.py \
 --collection-path collections/msmarco-passage/collection.tsv \
 --output-folder collections/msmarco-passage/collection_jsonl_expanded_topk10 \
 --predictions collections/msmarco-passage/pred-test_topk10.txt --stride 1
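
The augmentation step relies on the line alignment noted above: the i-th prediction line is appended to the i-th passage before indexing. A minimal sketch of that pairing follows; the function name, the single-space separator and the toy texts are assumptions of this sketch, so check the actual script for its exact output.

```python
import json

def augment(collection_lines, prediction_lines):
    """Pair the i-th passage with the i-th prediction line and emit JSON records."""
    for doc_line, pred_line in zip(collection_lines, prediction_lines):
        doc_id, text = doc_line.rstrip("\n").split("\t", 1)
        # The predicted queries become part of the indexed text.
        expanded = text + " " + pred_line.strip()
        yield json.dumps({"id": doc_id, "contents": expanded})

collection = ["0\tcats sleep most of the day\n", "1\tbm25 is a ranking function\n"]
predictions = ["how long do cats sleep\n", "what is bm25\n"]
records = [json.loads(r) for r in augment(collection, predictions)]
print(records)
```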

python -m pyserini.index.lucene \
 --collection JsonCollection \
 --generator DefaultLuceneDocumentGenerator --threads 9 \
 --input collections/msmarco-passage/collection_jsonl_expanded_topk10 \
 --index indexes/msmarco-passage/lucene-index-msmarco-expanded-topk10 \
 --storePositions --storeDocvectors --storeRaw

python -m pyserini.search.lucene \
  --index indexes/msmarco-passage/lucene-index-msmarco-expanded-topk10 \
  --topics msmarco-passage-dev-subset \
  --output runs/run.msmarco-passage-expanded-topk10.bm25tuned.trec \
  --hits 1000 \
  --bm25 --k1 0.9 --b 0.4 \
  --threads 16 --batch-size 16

python -m pyserini.eval.trec_eval -c -M 10 -m recip_rank msmarco-passage-dev-subset runs/run.msmarco-passage-expanded-topk10.bm25tuned.trec
python -m pyserini.eval.trec_eval -c -m recall.1000 -m map msmarco-passage-dev-subset runs/run.msmarco-passage-expanded-topk10.bm25tuned.trec

Custom datasets

See docs/usage-index.md in the castorini/pyserini repository (master branch).
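
Before indexing your own data with JsonCollection, it can help to check that every line is a JSON object with `id` and `contents` fields. The validator below is a hypothetical pre-flight check, not part of Pyserini; requiring both fields to be strings and rejecting duplicate ids are assumptions of this sketch.

```python
import json

def validate_jsonl(lines):
    """Return a list of (line_number, problem) tuples for malformed records."""
    problems = []
    seen = set()
    for n, line in enumerate(lines, start=1):
        try:
            doc = json.loads(line)
        except json.JSONDecodeError:
            problems.append((n, "not valid JSON"))
            continue
        if not isinstance(doc, dict):
            problems.append((n, "not a JSON object"))
            continue
        doc_id = doc.get("id")
        if not isinstance(doc_id, str):
            problems.append((n, "missing or non-string 'id'"))
        elif doc_id in seen:
            problems.append((n, f"duplicate id {doc_id!r}"))
        else:
            seen.add(doc_id)
        if not isinstance(doc.get("contents"), str):
            problems.append((n, "missing or non-string 'contents'"))
    return problems

sample = [
    '{"id": "doc1", "contents": "first document"}',
    '{"id": "doc1", "contents": "same id again"}',
    '{"id": "doc3"}',
    'not json',
]
problems = validate_jsonl(sample)
print(problems)
```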
