How to create your own Question-Answering system easily with python

Article link: https://blog.csdn.net/truth_01/article/details/100986248

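This article introduces the cdQA-suite, an end-to-end software suite for building closed-domain question answering (CDQA) systems. cdQA consists of two parts, a Retriever and a Reader, and can use pre-trained deep learning models such as BERT to retrieve answers. The article also covers cdQA-annotator, a tool for data annotation, and cdQA-ui, a user interface that is easy to integrate with websites.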


The history of Machine Comprehension (MC) begins with the birth of the first concepts in Artificial Intelligence (AI). In his famous article “Computing Machinery and Intelligence”, the brilliant Alan Turing proposed what is now called the Turing test as a criterion of intelligence. Almost 70 years later, Question Answering (QA), a sub-domain of MC, is still one of the most difficult tasks in AI.

However, since last year, the field of Natural Language Processing (NLP) has evolved rapidly thanks to advances in Deep Learning research and the advent of Transfer Learning techniques. Powerful pre-trained NLP models such as OpenAI-GPT, ELMo, BERT and XLNet have been made available by leading researchers in the field.

With such progress, several improved systems and applications for NLP tasks are expected to come out. One such system is the cdQA-suite, a package developed by some colleagues and me in a partnership between Telecom ParisTech, a French engineering school, and BNP Paribas Personal Finance, a European leader in financing for individuals.

Open-domain QA vs. closed-domain QA

When we think about QA systems, we should be aware of two different kinds: open-domain QA (ODQA) systems and closed-domain QA (CDQA) systems.

Open-domain systems deal with questions about nearly anything, and can only rely on general ontologies and world knowledge. One example of such a system is DrQA, an ODQA system developed by Facebook Research that uses a large base of articles from Wikipedia as its source of knowledge. Because these documents cover many different topics and subjects, it is clear why this system is considered an ODQA system.
On the other hand, closed-domain systems deal with questions within a specific domain (for example, medicine or automotive maintenance), and can exploit domain-specific knowledge by using a model that is fitted to a single-domain database. The cdQA-suite was built to make it easy for anyone to build a closed-domain QA system.

cdQA-suite

cdQA: An End-To-End Closed Domain Question Answering System (github.com)

The cdQA-suite comprises three blocks:

cdQA: an easy-to-use Python package to implement a QA pipeline
cdQA-annotator: a tool built to facilitate the annotation of question-answering datasets for model evaluation and fine-tuning
cdQA-ui: a user interface that can be coupled to any website and connected to the back-end system

I will explain how each module works and how you can use it to build your QA system on your own data.

cdQA

The cdQA architecture is based on two main components: the Retriever and the Reader. The schema below shows how the system works.

Mechanism of cdQA pipeline
When a question is sent to the system, the Retriever selects a list of documents in the database that are the most likely to contain the answer. It is based on the same retriever as DrQA, which creates TF-IDF features based on uni-grams and bi-grams and computes the cosine similarity betwee