Code
git clone https://github.com/castorini/pyserini.git --recurse-submodules
# pyserini/tools is a git submodule pointing to https://github.com/castorini/anserini-tools
Installation
https://github.com/castorini/pyserini/blob/master/docs/installation.md
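Pyserini depends on a Java runtime, which can be installed directly with conda. After installing, check the Java version with "java --version".
If you only want to test against the prebuilt open-source indexes, the PyPI Installation is enough.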
If you build your own indexes, you need the Development Installation. Its last step copies the fatjar into pyserini/resources/jars/. There are two ways to obtain the fatjar:
- Build it in the anserini project with "mvn clean package"; the output is saved to anserini/target/anserini-X.Y.Z-SNAPSHOT-fatjar.jar. See https://github.com/castorini/anserini?tab=readme-ov-file#-installation
- Download it directly from https://repo1.maven.org/maven2/io/anserini/anserini/0.38.0/anserini-0.38.0-fatjar.jar. See https://github.com/castorini/anserini/blob/master/docs/fatjar-regressions/fatjar-regressions-v0.38.0.md
Usage
Default download cache path: ~/.cache/pyserini/
To use a different download cache path: export PYSERINI_CACHE=/path/to/cache
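The lookup order described above (environment variable first, home-directory default otherwise) can be sketched in a few lines of Python. This is only an illustration of that rule; `resolve_cache_dir` is a hypothetical helper, not a Pyserini function.

```python
import os
from pathlib import Path


def resolve_cache_dir(env=None):
    """Return the download cache directory (hypothetical helper, not Pyserini API)."""
    env = os.environ if env is None else env
    override = env.get("PYSERINI_CACHE")
    if override:
        # An explicit PYSERINI_CACHE wins over the default location.
        return Path(override)
    return Path.home() / ".cache" / "pyserini"


print(resolve_cache_dir())
```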
msmarco-passage bm25
https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-passage.md
Download the dataset
mkdir collections/msmarco-passage
wget https://msmarco.z22.web.core.windows.net/msmarcoranking/collectionandqueries.tar.gz -P collections/msmarco-passage
# Alternative mirror:
# wget https://www.dropbox.com/s/9f54jg2f71ray3b/collectionandqueries.tar.gz -P collections/msmarco-passage
tar xvfz collections/msmarco-passage/collectionandqueries.tar.gz -C collections/msmarco-passage

Convert the collection to JSONL
python tools/scripts/msmarco/convert_collection_to_jsonl.py \
--collection-path collections/msmarco-passage/collection.tsv \
--output-folder collections/msmarco-passage/collection_jsonl
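For intuition, the conversion turns each "docid<TAB>text" line of collection.tsv into one JSON object with the "id" and "contents" fields that JsonCollection reads. The following is a minimal sketch of that mapping, not the actual script (which, among other things, writes its output as files under the output folder); `tsv_lines_to_jsonl` is a hypothetical name.

```python
import json


def tsv_lines_to_jsonl(lines):
    """Yield one JSON string per 'docid<TAB>text' input line (illustrative only)."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        doc_id, text = line.split("\t", 1)
        yield json.dumps({"id": doc_id, "contents": text}, ensure_ascii=False)


# Made-up sample passages, only to show the output shape.
sample = ["0\tThe first passage.\n", "1\tThe second passage.\n"]
for record in tsv_lines_to_jsonl(sample):
    print(record)
```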

Build the index
python -m pyserini.index.lucene \
--collection JsonCollection \
--input collections/msmarco-passage/collection_jsonl \
--index indexes/lucene-index-msmarco-passage \
--generator DefaultLuceneDocumentGenerator \
--threads 9 \
--storePositions --storeDocvectors --storeRaw
# --index is the path where the index is saved
Search
python -m pyserini.search.lucene \
--index indexes/lucene-index-msmarco-passage \
--topics msmarco-passage-dev-subset \
--output runs/run.msmarco-passage.bm25tuned.txt \
--output-format msmarco \
--hits 1000 \
--bm25 --k1 0.82 --b 0.68 \
--threads 4 --batch-size 16
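To see what the --k1 and --b values control, here is a small sketch of the textbook BM25 formula: k1 governs term-frequency saturation and b governs document-length normalization. It is an illustration under simplifying assumptions (pre-tokenized documents, a hypothetical `bm25_score` helper) and is not Lucene's exact implementation, which differs in details such as length encoding.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, corpus, k1=0.82, b=0.68):
    """Textbook BM25 score of one tokenized document against a tokenized query."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0 or tf[term] == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # b scales how strongly long documents are penalized.
        norm = tf[term] + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf[term] * (k1 + 1) / norm
    return score


# Toy corpus, made up for illustration.
corpus = [["a", "b"], ["a", "b", "c", "d"], ["c", "d"]]
for doc in corpus:
    print(doc, bm25_score(["a"], doc, corpus))
```

With b = 0 the length term vanishes, so two documents with the same term frequency score the same regardless of length.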
Compute metrics
python -m pyserini.eval.msmarco_passage_eval \
tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \
runs/run.msmarco-passage.bm25tuned.txt
#####################
MRR @10: 0.18741227770955546
QueriesRanked: 6980
#####################
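MRR@10 is the mean, over queries, of the reciprocal rank of the first relevant passage within the top 10 (zero if none appears). A minimal sketch, assuming qrels and run are already loaded into dictionaries; `mrr_at_k` is a hypothetical helper and may differ from the official evaluation script in edge cases.

```python
def mrr_at_k(qrels, run, k=10):
    """qrels: {qid: set of relevant docids}; run: {qid: docids ranked best-first}."""
    total = 0.0
    for qid, relevant in qrels.items():
        for rank, docid in enumerate(run.get(qid, [])[:k], start=1):
            if docid in relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    # Average over all judged queries; queries without a hit contribute 0.
    return total / len(qrels) if qrels else 0.0


# Made-up judgments and rankings, for illustration only.
qrels = {"q1": {"d2"}, "q2": {"d9"}}
run = {"q1": ["d1", "d2", "d3"], "q2": ["d3", "d4"]}
print(mrr_at_k(qrels, run))
```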
Other metrics are computed with trec_eval, which needs the run file (the search output, not the index) and the qrels in TREC format:
https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-passage.md#evaluation
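A TREC run file has six whitespace-separated columns: query id, the literal Q0, document id, rank, score, and a run tag. An MS MARCO run has only query id, document id, and rank, so a conversion has to invent a score. The sketch below uses 1/rank as a placeholder that preserves the ordering; that choice, the `msmarco_to_trec` name, and the "bm25" tag are assumptions for illustration, and the linked page describes the converter Pyserini actually provides.

```python
def msmarco_to_trec(lines, tag="bm25"):
    """Convert 'qid<TAB>docid<TAB>rank' lines to TREC run lines (illustrative)."""
    converted = []
    for line in lines:
        qid, docid, rank = line.rstrip("\n").split("\t")
        score = 1.0 / int(rank)  # placeholder score: higher for better ranks
        converted.append(f"{qid} Q0 {docid} {rank} {score:.6f} {tag}")
    return converted


# Made-up run lines, for illustration only.
for row in msmarco_to_trec(["1\tdoc7\t1\n", "1\tdoc3\t2\n"]):
    print(row)
```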
BEIR bm25
Download the data
wget https://rgw.cs.uwaterloo.ca/pyserini/data/beir-v1.0.0-corpus.tar -P collections/
tar xvf collections/beir-v1.0.0-corpus.tar -C collections/
Anserini walkthrough
Installation: see https://github.com/castorini/anserini (Anserini is a Lucene toolkit for reproducible information retrieval research).
As with the Pyserini development installation, copy the fatjar into the anserini/target/ directory.
# build index->search->metric pipeline
python src/main/python/run_regression.py --index --verify --search --regression beir-v1.0.0-scifact.flat
# only print command for each step
python src/main/python/run_regression.py --index --verify --search --regression beir-v1.0.0-scifact.flat --dry-run
Pyserini walkthrough
# build index
python -m pyserini.index.lucene \
--collection BeirFlatCollection \
--input collections/beir-v1.0.0/corpus/scifact/ \
--generator DefaultLuceneDocumentGenerator \
--index indexes/lucene-inverted.beir-v1.0.0-scifact.flat/ \
--threads 1 \
--storePositions --storeDocvectors --storeRaw
# search
python -m pyserini.search.lucene \
--index indexes/lucene-inverted.beir-v1.0.0-scifact.flat/ \
--topics tools/topics-and-qrels/topics.beir-v1.0.0-scifact.test.tsv.gz \
--output runs/run.inverted.beir-v1.0.0-scifact.flat.test.bm25 \
--bm25 --hits 1000
# metric
python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 tools/topics-and-qrels/qrels.beir-v1.0.0-scifact.test.txt runs/run.inverted.beir-v1.0.0-scifact.flat.test.bm25
python -m pyserini.eval.trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.beir-v1.0.0-scifact.test.txt runs/run.inverted.beir-v1.0.0-scifact.flat.test.bm25
python -m pyserini.eval.trec_eval -c -m recall.1000 tools/topics-and-qrels/qrels.beir-v1.0.0-scifact.test.txt runs/run.inverted.beir-v1.0.0-scifact.flat.test.bm25
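For reference, the two measures requested above can be sketched for a single query as follows. This assumes non-negative graded judgments and the common log2 discount; the helper names are hypothetical, and trec_eval remains the authority on exact definitions and on averaging across queries.

```python
import math


def recall_at_k(relevance, ranking, k):
    """relevance: {docid: grade}; ranking: docids ordered best-first."""
    relevant = {d for d, g in relevance.items() if g > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranking[:k])) / len(relevant)


def ndcg_at_k(relevance, ranking, k=10):
    """DCG of the ranking divided by the DCG of an ideal ordering, both cut at k."""
    dcg = sum(relevance.get(d, 0) / math.log2(i + 1)
              for i, d in enumerate(ranking[:k], start=1))
    ideal = sorted((g for g in relevance.values() if g > 0), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Made-up judgments and ranking, for illustration only.
relevance = {"a": 1, "b": 1}
ranking = ["a", "x", "b"]
print(recall_at_k(relevance, ranking, 2), ndcg_at_k(relevance, ranking))
```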
Doc2Query
https://github.com/castorini/anserini/blob/master/docs/experiments-doc2query.md
# download the expanded queries; line i holds the expanded queries for the i-th doc in collection.tsv
wget https://git.uwaterloo.ca/jimmylin/doc2query-data/raw/master/base/msmarco-passage-pred-test_topk10.tar.gz -P collections/msmarco-passage
tar -xzvf collections/msmarco-passage/msmarco-passage-pred-test_topk10.tar.gz -C collections/msmarco-passage
python tools/scripts/msmarco/augment_collection_with_predictions.py \
--collection-path collections/msmarco-passage/collection.tsv \
--output-folder collections/msmarco-passage/collection_jsonl_expanded_topk10 \
--predictions collections/msmarco-passage/pred-test_topk10.txt --stride 1
python -m pyserini.index.lucene \
--collection JsonCollection \
--generator DefaultLuceneDocumentGenerator --threads 9 \
--input collections/msmarco-passage/collection_jsonl_expanded_topk10 \
--index indexes/msmarco-passage/lucene-index-msmarco-expanded-topk10 \
--storePositions --storeDocvectors --storeRaw
python -m pyserini.search.lucene \
--index indexes/msmarco-passage/lucene-index-msmarco-expanded-topk10 \
--topics msmarco-passage-dev-subset \
--output runs/run.msmarco-passage-expanded-topk10.bm25tuned.trec \
--hits 1000 \
--bm25 --k1 0.9 --b 0.4 \
--threads 16 --batch-size 16
python -m pyserini.eval.trec_eval -c -M 10 -m recip_rank msmarco-passage-dev-subset runs/run.msmarco-passage-expanded-topk10.bm25tuned.trec
python -m pyserini.eval.trec_eval -c -m recall.1000 -mmap msmarco-passage-dev-subset runs/run.msmarco-passage-expanded-topk10.bm25tuned.trec
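The augmentation step in this pipeline appends each passage's predicted queries to its text before indexing, so BM25 can also match the terms of those queries. A minimal sketch of that idea, assuming one prediction line per passage (the --stride 1 case used above) and a plain space as separator; `augment` is a hypothetical helper and the real script may join or format the text differently.

```python
def augment(passages, predictions):
    """passages: list of (docid, text); predictions[i] belongs to passages[i]."""
    if len(passages) != len(predictions):
        raise ValueError("expected exactly one prediction line per passage")
    return [
        {"id": docid, "contents": f"{text} {pred}".strip()}
        for (docid, text), pred in zip(passages, predictions)
    ]


# Made-up passage and predicted queries, for illustration only.
docs = augment(
    [("0", "Pyserini wraps Anserini for Python users.")],
    ["what is pyserini how to use anserini from python"],
)
print(docs[0]["contents"])
```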
Custom datasets
https://github.com/castorini/pyserini/blob/master/docs/usage-index.md