The history of Machine Comprehension (MC) goes back to the earliest concepts in Artificial Intelligence (AI). In his famous article “Computing Machinery and Intelligence”, the brilliant Alan Turing proposed what is now called the Turing test as a criterion of intelligence. Almost 70 years later, Question Answering (QA), a sub-domain of MC, is still one of the most difficult tasks in AI.
However, since last year, the field of Natural Language Processing (NLP) has evolved rapidly thanks to advances in Deep Learning research and the advent of Transfer Learning techniques. Powerful pre-trained NLP models such as OpenAI-GPT, ELMo, BERT and XLNet have been made available by leading researchers in the field.
With such progress, several improved systems and applications for NLP tasks are expected to emerge. One such system is the cdQA-suite, a package developed by some colleagues and me in a partnership between Telecom ParisTech, a French engineering school, and BNP Paribas Personal Finance, a European leader in financing for individuals.
Open-domain QA vs. closed-domain QA
When we think about QA systems we should be aware of two different kinds of systems: open-domain QA (ODQA) systems and closed-domain QA (CDQA) systems.
- Open-domain systems deal with questions about nearly anything, and can only rely on general ontologies and world knowledge. One example of such a system is DrQA, an ODQA system developed by Facebook Research that uses a large collection of Wikipedia articles as its source of knowledge. Because these documents cover many different topics and subjects, it is easy to see why this system is considered open-domain.
- On the other hand, closed-domain systems deal with questions within a specific domain (for example, medicine or automotive maintenance), and can exploit domain-specific knowledge by using a model that is fitted to a single-domain database. The cdQA-suite was built so that anyone can easily build a closed-domain QA system.
cdQA-suite
The cdQA-suite is comprised of three blocks:
- cdQA: an easy-to-use Python package to implement a QA pipeline
- cdQA-annotator: a tool built to facilitate the annotation of question-answering datasets for model evaluation and fine-tuning
- cdQA-ui: a user interface that can be coupled to any website and can be connected to the back-end system.
I will explain how each module works and how you can use it to build your QA system on your own data.
cdQA
The cdQA architecture is based on two main components: the Retriever and the Reader. The diagram below shows how the system works.
Mechanism of cdQA pipeline
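To make the first stage of this two-component design more concrete, here is a minimal sketch of a TF-IDF retriever. It assumes scikit-learn and a toy three-document corpus; the `documents` list and the `retrieve` function are hypothetical names chosen for illustration, and this is not the actual cdQA implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for a closed-domain document base.
documents = [
    "To reset your online banking password, open the login page and click on forgotten password.",
    "A personal loan can be repaid early without penalty under certain conditions.",
    "Our customer service is available by phone from Monday to Friday.",
]

# Build TF-IDF features over uni-grams and bi-grams (an assumption for this sketch).
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
doc_matrix = vectorizer.fit_transform(documents)


def retrieve(question, top_n=2):
    """Return (document index, score) pairs for the top_n most similar documents."""
    question_vec = vectorizer.transform([question])
    # Cosine similarity between the question vector and every document vector.
    scores = cosine_similarity(question_vec, doc_matrix).ravel()
    ranked = scores.argsort()[::-1][:top_n]
    return [(int(i), float(scores[i])) for i in ranked]


results = retrieve("How can I reset my password?")
for index, score in results:
    print(f"{score:.3f}  {documents[index]}")
```

In a full pipeline, the top-ranked documents would then be handed to the Reader, which looks for the answer inside them.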
When a question is sent to the system, the Retriever selects a list of documents in the database that are the most likely to contain the answer. It is based on the same retriever as DrQA, which creates TF-IDF features based on uni-grams and bi-grams and computes the cosine similarity betwee