Introduction
In this article, we will build a search engine over a large custom corpus. It will not only retrieve results for a query or question but also return about 1,000 words of context around each answer.
With 🤗 Transformers, all of this becomes much faster and more accurate.
Example:
Question: What is the impact of coronavirus on pregnant women?

Answer: pregnant woman may be more vulnerable to severe infection (Favre et al. 2020) and evidence from previous viral outbreaks suggests a higher risk of unfavourable maternal and neonatal outcomes in this population

Research Paper: COVID 19 in babies: Knowledge for neonatal care

Context: The disease manifests with a spectrum of symptoms ranging from mild upper respiratory tract infection to severe pneumonitis, acute respiratory distress syndrome (ARDS) and death. Relatively few cases have occurred in children and neonates who seem to have a more favourable clinical course than other age groups (De Rose et al. 2020). While not initially identified as a population at risk, pregnant woman may be more vulnerable to severe infection (Favre et al. 2020) and evidence from previous viral outbreaks suggests a higher risk of unfavourable maternal and neonatal outcomes in this population (Alfaraj et al. 2019). Moreover, the associated policies developed as a result of the pandemic relating to social distancing and prevention of cross infection have led to important considerations specific to the field of maternal and neonatal health, and a necessity to consider unintended consequences for both the mother and baby (Buekens et al. 2020)
I have published a Kaggle notebook here.
To achieve this, we will need:
- A corpus of data
- The Transformers library, to build the QA model
- The Haystack library, to scale the QA model to thousands of documents and build a search engine
Let’s start.
Data
For this article, we will use Kaggle’s COVID-19 Open Research Dataset Challenge (CORD-19).
CORD-19 is a resource of over 200,000 scholarly articles, including over 100,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease.
This dataset is ideal for building a document retrieval system because it contains the full content of each research paper in text format. The columns of interest are:
- paper_id: unique identifier of the research paper
- title: title of the research paper
- abstract: brief summary of the research paper
- full_text: full text/content of the research paper
The Kaggle folder structure has two directories, pmc_json and pdf_json, which contain the data in JSON format. We will take 25,000 articles from the pmc_json directory and 25,000 articles from pdf_json, for a total of 50,000 research articles to build our search engine.
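Picking the files could look something like the sketch below. It is only an illustration: the helper names sample_json_files and select_articles are hypothetical, and taking the first N files in sorted order is an assumption made for reproducibility (the selection rule is not specified here, and a random sample would work equally well).

```python
from pathlib import Path
from typing import List


def sample_json_files(directory: str, limit: int) -> List[Path]:
    """Return up to `limit` .json files found under `directory`, in sorted order."""
    # rglob also descends into subfolders, in case the files are nested.
    files = sorted(Path(directory).rglob("*.json"))
    return files[:limit]


def select_articles(pmc_dir: str, pdf_dir: str, per_dir: int = 25_000) -> List[Path]:
    """Take up to `per_dir` files from each directory (50,000 in total by default)."""
    return sample_json_files(pmc_dir, per_dir) + sample_json_files(pdf_dir, per_dir)
```

The paths passed to select_articles depend on where the dataset is mounted in your environment, so they are left as parameters.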
We will extract paper_id, title, abstract, and full_text and put them in an easy-to-use pandas.DataFrame.
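A minimal sketch of this extraction step follows. It assumes the CORD-19 JSON layout in which each file carries a top-level paper_id, a metadata.title, and abstract and body_text lists of sections that each have a text field. Check these field names against your copy of the dataset, since some files may have no abstract at all. The helper names (join_sections, parse_paper, load_papers, to_dataframe) are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, List, Optional


def join_sections(sections: Optional[List[Dict]]) -> str:
    """Concatenate the `text` field of each section; empty string if missing."""
    return "\n".join(section.get("text", "") for section in sections or [])


def parse_paper(raw: Dict) -> Dict[str, str]:
    """Map one parsed JSON document to the four columns we care about.

    The keys read here (paper_id, metadata.title, abstract, body_text)
    are assumptions about the CORD-19 schema.
    """
    return {
        "paper_id": raw.get("paper_id", ""),
        "title": raw.get("metadata", {}).get("title", ""),
        "abstract": join_sections(raw.get("abstract")),
        "full_text": join_sections(raw.get("body_text")),
    }


def load_papers(paths: Iterable[Path]) -> List[Dict[str, str]]:
    """Read each JSON file and return one record per paper."""
    records = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            records.append(parse_paper(json.load(handle)))
    return records


def to_dataframe(records: List[Dict[str, str]]):
    """Build the DataFrame; pandas is imported here so parsing works without it."""
    import pandas as pd

    return pd.DataFrame(records, columns=["paper_id", "title", "abstract", "full_text"])
```

Calling to_dataframe(load_papers(paths)) on a list of file paths then gives one row per paper with the four columns above.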