Build a Faster and More Accurate Search Engine on a Custom Dataset Using Transformers

Introduction

In this article, we will build a search engine over a large custom corpus. It will not only retrieve results for a query or question but also return about 1000 words of context around each answer.

All of this becomes much faster and more accurate with transformers 🤗.

Example:

Question: What is the impact of coronavirus on pregnant women?

Answer: pregnant woman may be more vulnerable to severe infection (Favre et al. 2020) and evidence from previous viral outbreaks suggests a higher risk of unfavourable maternal and neonatal outcomes in this population

Research Paper: COVID 19 in babies: Knowledge for neonatal care

Context: The disease manifests with a spectrum of symptoms ranging from mild upper respiratory tract infection to severe pneumonitis, acute respiratory distress syndrome (ARDS) and death. Relatively few cases have occurred in children and neonates who seem to have a more favourable clinical course than other age groups (De Rose et al. 2020). While not initially identified as a population at risk, pregnant woman may be more vulnerable to severe infection (Favre et al. 2020) and evidence from previous viral outbreaks suggests a higher risk of unfavourable maternal and neonatal outcomes in this population (Alfaraj et al. 2019). Moreover, the associated policies developed as a result of the pandemic relating to social distancing and prevention of cross infection have led to important considerations specific to the field of maternal and neonatal health, and a necessity to consider unintended consequences for both the mother and baby (Buekens et al. 2020)
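The context returned with each answer can be pictured as a window of words cut around the answer span. The snippet below is only a minimal sketch of that idea, not how the libraries used in this article do it internally. The function name `context_window` and the character offsets `answer_start` and `answer_end` are assumptions made for illustration.

```python
def context_window(text: str, answer_start: int, answer_end: int,
                   n_words: int = 1000) -> str:
    """Return the answer plus up to n_words of surrounding words.

    answer_start / answer_end are character offsets of the answer in text
    (hypothetical inputs; a QA model would normally supply them).
    The word budget is split evenly before and after the answer.
    """
    half = n_words // 2
    before = text[:answer_start].split()
    answer = text[answer_start:answer_end].strip()
    after = text[answer_end:].split()
    # Guard against half == 0: before[-0:] would return the whole list.
    left = before[-half:] if half > 0 else []
    right = after[:half]
    return " ".join(left + [answer] + right)


if __name__ == "__main__":
    doc = "a b c d e f g"
    start = doc.index("d")
    print(context_window(doc, start, start + 1, n_words=4))
```

Splitting on whitespace keeps the sketch simple; if an offset falls in the middle of a word, that word is cut at the offset.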

I have published a Kaggle notebook for this article.

To achieve this, we will need:

  • A corpus of data

  • The Transformers library, to build the question-answering (QA) model

  • And finally, the Haystack library, to scale the QA model to thousands of documents and build a search engine.

Let’s start.

Data

For this article, we will use Kaggle’s COVID-19 Open Research Dataset Challenge (CORD-19).

CORD-19 is a resource of over 200,000 scholarly articles, including over 100,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease.

This dataset is ideal for building a document retrieval system because it contains the full content of each research paper in text format. The columns of interest are:

  • paper_id: Unique identifier of the research paper

  • title: Title of the research paper

  • abstract: Brief summary of the research paper

  • full_text: Full text/content of the research paper

The Kaggle folder structure has two directories, pmc_json and pdf_json, which contain the data in JSON format. We will take 25,000 articles from the pmc_json directory and 25,000 articles from pdf_json, for a total of 50,000 research articles to build our search engine.

We will extract paper_id, title, abstract, and full_text from each article and put them in an easy-to-use pandas.DataFrame.
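The extraction step could look roughly like the sketch below. It assumes each JSON file follows the commonly described CORD-19 layout: a top-level `paper_id`, a `metadata.title` field, and `abstract` and `body_text` given as lists of paragraphs with a `text` key. That layout, the helper names (`parse_paper`, `load_papers`, `to_dataframe`), and the choice to take files in sorted order are assumptions, not details confirmed by this article. Missing fields fall back to empty strings.

```python
import json
from pathlib import Path

COLUMNS = ["paper_id", "title", "abstract", "full_text"]


def _join_paragraphs(paragraphs):
    """Concatenate the 'text' of each paragraph dict, one per line."""
    return "\n".join(p.get("text", "") for p in paragraphs)


def parse_paper(raw: dict) -> dict:
    """Flatten one article (assumed CORD-19 layout) into the four columns."""
    return {
        "paper_id": raw.get("paper_id", ""),
        "title": raw.get("metadata", {}).get("title", ""),
        "abstract": _join_paragraphs(raw.get("abstract", [])),
        "full_text": _join_paragraphs(raw.get("body_text", [])),
    }


def load_papers(directory, limit=25_000):
    """Parse at most `limit` JSON files from `directory`, in sorted order."""
    files = sorted(Path(directory).glob("*.json"))[:limit]
    return [parse_paper(json.loads(f.read_text(encoding="utf-8")))
            for f in files]


def to_dataframe(records):
    """Build a pandas.DataFrame from parsed records (requires pandas)."""
    import pandas as pd  # imported lazily so parsing works without pandas
    return pd.DataFrame(records, columns=COLUMNS)
```

With `pmc_dir` and `pdf_dir` as placeholders for wherever the two directories live, `to_dataframe(load_papers(pmc_dir) + load_papers(pdf_dir))` would give one frame with up to 50,000 rows.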
