DeepCT Project Tutorial

魏真权

Published 2024-09-13 07:44:28

Article link: https://blog.csdn.net/gitblog_00976/article/details/142193322

DeepCT and HDCT use BERT to generate novel, context-aware bag-of-words term weights for documents and queries. Project address: https://gitcode.com/gh_mirrors/de/DeepCT

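1. Project Introduction

DeepCT is a framework for term weighting in sentences and passages. It uses BERT to generate novel, context-aware bag-of-words term weights for documents and queries. Applied to passages, DeepCT produces term weights that can be stored in an ordinary inverted index and used for passage retrieval. Applied to query text, DeepCT-Query produces a weighted bag-of-words query that emphasizes the key terms in the query.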
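
To make the "stored in an ordinary inverted index" idea concrete, here is a minimal sketch of one way to do it: scale each predicted weight to an integer and repeat the term that many times, so a standard indexer sees the weight as a term frequency. The function name, the scaling factor of 100, and the sample weights are all assumptions made for illustration; they are not taken from the DeepCT code.

```python
from typing import Dict


def weights_to_pseudo_document(term_weights: Dict[str, float], scale: int = 100) -> str:
    """Turn per-term weights into text whose term frequencies encode them.

    `scale` is a hypothetical choice: with scale=100, a weight of 0.04
    becomes 4 repetitions. Terms whose scaled weight rounds to 0 are dropped.
    """
    parts = []
    for term, weight in term_weights.items():
        tf = int(round(weight * scale))
        if tf > 0:
            parts.extend([term] * tf)
    return " ".join(parts)


# Hypothetical weights for a short passage (made-up numbers).
weights = {"stomach": 0.04, "muscular": 0.02, "organ": 0.01, "the": 0.0}
pseudo_doc = weights_to_pseudo_document(weights)
print(pseudo_doc)
```

The resulting text can be indexed like any other document, and a frequency-based ranker such as BM25 then uses the repetition counts in place of raw term frequencies.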

2. Quick Start

Environment Setup

Make sure the following dependencies are installed:

Python 3
TensorFlow 1.15.0
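
Before training, it can help to check the environment against these two requirements. The sketch below is only an illustration: `check_environment` is a hypothetical helper, and the Python 3.7 upper bound is an assumption based on the published TensorFlow 1.15 wheels, not something stated by the project.

```python
import sys
from typing import List, Tuple


def check_environment(python_version: Tuple[int, int], tf_version: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # Assumption: TensorFlow 1.15 wheels are not published for Python 3.8+.
    if python_version[0] != 3 or python_version > (3, 7):
        problems.append(
            "Python {}.{} found; TensorFlow 1.15 is assumed to need Python 3 up to 3.7".format(
                python_version[0], python_version[1]
            )
        )
    if tf_version != "1.15.0":
        problems.append("TensorFlow {} found; this tutorial uses 1.15.0".format(tf_version))
    return problems


# In a real check, pass tensorflow.__version__ as the second argument.
print(check_environment(tuple(sys.version_info[:2]), "1.15.0"))
```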

Clone the Project

git clone https://github.com/AdeDZY/DeepCT.git
cd DeepCT

Train a DeepCT Model

Set the paths to the BERT model, the training data, and the output directory:

export BERT_BASE_DIR=/path/to/uncased_L-12_H-768_A-12
export TRAIN_DATA_FILE=/path/to/data/marco/myalltrain_relevant_docterm_recall
export OUTPUT_DIR=/path/to/output/marco/

Run the training script:

python run_deepct.py \
  --task_name=marcodoc \
  --do_train=true \
  --do_eval=false \
  --do_predict=false \
  --data_dir=$TRAIN_DATA_FILE \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=16 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --recall_field=title \
  --output_dir=$OUTPUT_DIR
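
The `--train_batch_size` and `--num_train_epochs` flags together determine how many optimizer steps the run takes, and that step count is what appears as the suffix of a saved checkpoint such as `model.ckpt-65816`. The sketch below assumes `run_deepct.py` computes the step count the same way as the reference BERT fine-tuning scripts; the example count of 100000 training examples is hypothetical.

```python
def num_train_steps(num_examples: int, train_batch_size: int, num_train_epochs: float) -> int:
    """Total optimizer steps, following the reference BERT fine-tuning scripts.

    Assumption: run_deepct.py uses the same formula.
    """
    return int(num_examples / train_batch_size * num_train_epochs)


# Hypothetical training set size; batch size and epochs match the command above.
steps = num_train_steps(num_examples=100000, train_batch_size=16, num_train_epochs=3.0)
print(steps)  # 18750
```

Under this assumption, the final checkpoint of such a run would be named `model.ckpt-18750`, which is the value to pass as the initial checkpoint at inference time.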

Run Inference with DeepCT

Set the paths to the BERT model, the trained checkpoint, the test data, and the prediction output directory:

export BERT_BASE_DIR=/path/to/uncased_L-12_H-768_A-12
export INIT_CKPT=/path/to/output/marco/model.ckpt-65816
export TEST_DATA_FILE=/path/to/data/collection.tsv.1
export OUTPUT_DIR=/path/to/predictions/marco/collection_pred_1/

Run the inference script:

python run_deepct.py \
  --task_name=marcotsvdoc \
  --do_train=false \
  --do_eval=false \
  --do_predict=true \
  --data_dir=$TEST_DATA_FILE \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$INIT_CKPT \
  --max_seq_length=128 \
  --train_batch_size=16 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=$OUTPUT_DIR
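
The test file name `collection.tsv.1` suggests that the collection is split into numbered parts and predicted one part at a time. The following is a minimal sketch of such a split. The helper `split_tsv`, the part size, and the `docid<TAB>text` row layout of the demo file are assumptions for illustration, not part of the DeepCT repository.

```python
import os
import tempfile
from typing import List


def split_tsv(path: str, lines_per_part: int) -> List[str]:
    """Split `path` into `path.1`, `path.2`, ... with at most `lines_per_part` lines each."""
    part_paths = []
    out = None
    with open(path, encoding="utf-8") as src:
        for i, line in enumerate(src):
            # Start a new part file every `lines_per_part` lines.
            if i % lines_per_part == 0:
                if out is not None:
                    out.close()
                part_path = "{}.{}".format(path, len(part_paths) + 1)
                part_paths.append(part_path)
                out = open(part_path, "w", encoding="utf-8")
            out.write(line)
    if out is not None:
        out.close()
    return part_paths


# Demo on a tiny hypothetical collection with "docid<TAB>text" rows.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "collection.tsv")
with open(demo_path, "w", encoding="utf-8") as f:
    for doc_id in range(5):
        f.write("{}\tpassage text {}\n".format(doc_id, doc_id))

parts = split_tsv(demo_path, lines_per_part=2)
print([os.path.basename(p) for p in parts])
```

Each part can then be passed as the test data file in its own inference run, with a separate output directory per part so that predictions do not overwrite each other.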