Building a Faster and More Accurate Search Engine on a Custom Dataset with Transformers

weixin_26721705

Tags: python

Original article: https://medium.com/analytics-vidhya/building-a-faster-and-accurate-search-engine-on-custom-dataset-with-transformers-d1277bedff3d

Introduction

In this article, we will build a search engine over a large corpus drawn from a custom dataset. It will not only retrieve search results for a query or question but also return about 1,000 words of context around each answer.

And all of that runs much faster and more accurately using Transformers 🤗.

Example:

Question: What is the impact of coronavirus on pregnant women?

Answer: pregnant woman may be more vulnerable to severe infection (Favre et al. 2020) and evidence from previous viral outbreaks suggests a higher risk of unfavourable maternal and neonatal outcomes in this population

Research Paper: COVID 19 in babies: Knowledge for neonatal care

Context: The disease manifests with a spectrum of symptoms ranging from mild upper respiratory tract infection to severe pneumonitis, acute respiratory distress syndrome (ARDS) and death. Relatively few cases have occurred in children and neonates who seem to have a more favourable clinical course than other age groups (De Rose et al. 2020). While not initially identified as a population at risk, pregnant woman may be more vulnerable to severe infection (Favre et al. 2020) and evidence from previous viral outbreaks suggests a higher risk of unfavourable maternal and neonatal outcomes in this population (Alfaraj et al. 2019). Moreover, the associated policies developed as a result of the pandemic relating to social distancing and prevention of cross infection have led to important considerations specific to the field of maternal and neonatal health, and a necessity to consider unintended consequences for both the mother and baby (Buekens et al. 2020)
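The context shown in the example is a window of text around the answer span. A minimal sketch of how such a window could be cut from a document follows. It assumes the QA model reports the answer as character offsets that fall on word boundaries; the function name context_window and its parameters are hypothetical, not part of Transformers or Haystack.

```python
def context_window(text: str, answer_start: int, answer_end: int,
                   n_words: int = 1000) -> str:
    """Return up to n_words words of text centred on the answer span.

    answer_start and answer_end are character offsets into text and are
    assumed to fall on word boundaries.
    """
    before = text[:answer_start].split()
    answer = text[answer_start:answer_end].split()
    after = text[answer_end:].split()

    # Words left in the budget once the answer itself is included.
    remaining = max(n_words - len(answer), 0)
    half = remaining // 2

    # Guard against before[-0:], which would return the whole list.
    left = before[-half:] if half else []
    # Give the right side whatever the left side did not use.
    right = after[:remaining - len(left)]

    return " ".join(left + answer + right)


if __name__ == "__main__":
    doc = "a b c d e f g h i j"
    start = doc.index("e f")
    print(context_window(doc, start, start + 3, n_words=6))
```

When the answer sits near the start of the document, the unused left-hand budget is spent on the right-hand side, so the window still approaches n_words where the text allows.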

I have published a Kaggle notebook here.

To achieve this, we will need:

A corpus of data
The Transformers library, to build a QA model
Finally, the Haystack library, to scale the QA model to thousands of documents and build a search engine

Let’s start.

Data

For this article, we will use Kaggle’s COVID-19 Open Research Dataset Challenge (CORD-19).

CORD-19 is a resource of over 200,000 scholarly articles, including over 100,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease.

This dataset is ideal for building a document retrieval system because it contains the full content of each research paper in text format. The columns of interest are:

paper_id: Unique identifier of the research paper
title: Title of the research paper
abstract: Brief summary of the research paper
full_text: Full text/content of the research paper

In the Kaggle folder structure, there are two directories, pmc_json and pdf_json, which contain the data in JSON format. We will take 25,000 articles from the pmc_json directory and 25,000 articles from pdf_json, for a total of 50,000 research articles to build our search engine.
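One possible way to pick the files is sketched here. It assumes both directories sit under a common root folder and that taking the first files in sorted name order is acceptable; the article does not say how the 25,000 articles per directory are chosen. The function name select_article_files and the root argument are hypothetical.

```python
from pathlib import Path


def select_article_files(root, per_dir=25_000,
                         subdirs=("pmc_json", "pdf_json")):
    """Return up to per_dir JSON file paths from each subdirectory of root."""
    selected = []
    for name in subdirs:
        # Sort so that the selection is reproducible between runs.
        files = sorted((Path(root) / name).glob("*.json"))
        selected.extend(files[:per_dir])
    return selected
```

With the defaults this yields at most 50,000 paths, 25,000 from each directory.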

We will extract paper_id, title, abstract, and full_text and put them in an easy-to-use pandas.DataFrame.
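The extraction step might look like the sketch below. It assumes each JSON file has a top-level paper_id, a title under metadata, and abstract and body_text stored as lists of paragraph objects with a text field; check these field names against the actual files. The helper names parse_article and build_dataframe are hypothetical.

```python
import json

import pandas as pd

COLUMNS = ["paper_id", "title", "abstract", "full_text"]


def parse_article(record: dict) -> dict:
    """Flatten one parsed JSON article into the four columns of interest."""

    def join_paragraphs(key: str) -> str:
        # Missing sections (for example, no abstract) become an empty string.
        return "\n".join(p.get("text", "") for p in record.get(key, []))

    return {
        "paper_id": record.get("paper_id"),
        "title": record.get("metadata", {}).get("title", ""),
        "abstract": join_paragraphs("abstract"),
        "full_text": join_paragraphs("body_text"),
    }


def build_dataframe(paths) -> pd.DataFrame:
    """Read each JSON file in paths and collect the rows in a DataFrame."""
    rows = []
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            rows.append(parse_article(json.load(fh)))
    return pd.DataFrame(rows, columns=COLUMNS)
```

Joining paragraphs with newlines keeps each paper as a single string per column, which is convenient when the text is later indexed for retrieval.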
