Releasing Corona Papers: an AI-powered search engine to explore COVID-19 research

The COVID-19 pandemic has deeply transformed our lives in the past few months. Fortunately, governments and researchers are still working hard to fight it.


In this post, I modestly contribute my part and propose an AI-powered tool to help medical practitioners keep track of the latest research around COVID-19.


Without further ado, meet Corona Papers 🎉


In this post, I’ll:


  • Go through the main functionalities of Corona Papers and emphasize what makes it different from other search engines


  • Share the code of the data preprocessing and topic detection pipelines so that it can be applied in similar projects


What is Corona Papers?

Corona Papers is a search engine that indexes the latest research papers about COVID-19.


If you’ve just watched the video, the following sections will dive into more details. If you haven't watched it yet, all you need to know is here.


1 — A curated list of papers and rich metadata 📄

Corona Papers indexes the COVID-19 Open Research Dataset (CORD-19) provided by Kaggle. This dataset is a regularly updated resource of over 138,000 scholarly articles, including over 69,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses.


Corona Papers also integrates additional metadata from Altmetric and Scimago Journal to account for the online presence and academic popularity of each article. More specifically, it fetches information such as the number of shares on Facebook walls, the number of posts on Wikipedia, the number of retweets, the H-Index of the publishing journal, etc.


The goal of integrating this metadata is to capture each paper’s impact and virality in both the academic and non-academic communities.


[Figure: External metadata: Altmetric and SJR]

2 — Automatic topic extraction using a language model 🤖

Corona Papers automatically tags each article with a relevant topic using a machine learning pipeline. This is done using CovidBERT: a state-of-the-art language model fine-tuned on medical data. With the great power of the Hugging Face library, using this model is pretty easy.



Let’s break down the topic detection pipeline for more clarity:


  • Given that abstracts represent the main content of each article, they’ll be used to discover the topics instead of the full content.


    They are first embedded using CovidBERT, which produces 768-dimensional vectors. Note that CovidBERT embeds each abstract as a whole, so the resulting vector encapsulates the semantics of the full document.

  • Principal Component Analysis (PCA) is performed on these vectors to reduce their dimensionality, removing redundancy and speeding up later computations. 250 components are retained to ensure 95% of the explained variance.

  • KMeans clustering is applied on top of these PCA components in order to discover topics. After many iterations on the number of clusters, 8 seemed to be the right choice.


    — There are many ways to select the number of clusters. I personally looked at the silhouette plot of each cluster (figure below).

    ⚠️ An assumption has been made in this step: each article is assigned a unique topic, i.e. the dominant one. If you're looking to generate a mixture of topics per paper, the right way would be to use Latent Dirichlet Allocation. The downside of that approach, however, is that it doesn’t integrate CovidBERT embeddings.

  • After generating the clusters, I looked into each one of them to understand the underlying sub-topics. I first tried a word-count and TF-IDF scoring to select the most important keywords per cluster. But what worked best here was extracting those keywords by performing an LDA on the documents of each cluster. This makes sense because each cluster is itself a collection of sub-topics.


    Different coherent clusters were discovered. Here are some examples, with the corresponding keywords (the cluster names were manually assigned based on the keywords):


[Figure: Examples of generated topics]
  • Finally, and mainly for fun, I decided to represent the articles on an interactive 2D map to provide a visual interpretation of the clusters and their separability. To do this, I applied tSNE dimensionality reduction on the PCA components. Nothing fancy.

[Figure: Silhouette and tSNE plots]

To bring more interactivity to the search experience, I decided to embed the tSNE visualization into the search results (available on desktop view only). On each result page, the points on the plot (on the left) represent the same search results (on the right): this gives an idea of how results relate to each other in a semantic space.


[Figure: Search results + tSNE visualization]

3 — Recommendation of similar papers

Once you click on a given paper, Corona Papers will show you detailed information about it, such as the title, the abstract, the full content, the URL to the original document, etc.


It also proposes a selection of similar articles that the user can read and bookmark.


These articles are selected using a similarity measure computed on CovidBERT embeddings. Here are two examples:


[Figure: Similar articles on COVID-19’s impact on pregnancy]

[Figure: Similar articles on risk factors]
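That ranking step boils down to cosine similarity between embedding vectors. A minimal, dependency-free sketch (toy vectors, not real CovidBERT output):

```python
# Rank papers by cosine similarity of their embedding vectors; the
# three-dimensional toy vectors below stand in for 768-dim embeddings.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query_vec, corpus, top_k=2):
    """Return the top_k paper ids ranked by cosine similarity."""
    scored = sorted(corpus.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [paper_id for paper_id, _ in scored[:top_k]]

corpus = {
    "paper_a": [1.0, 0.0, 0.2],
    "paper_b": [0.9, 0.1, 0.3],
    "paper_c": [0.0, 1.0, 0.0],
}
print(most_similar([1.0, 0.05, 0.25], corpus))  # → ['paper_a', 'paper_b']
```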

4 — A stack of modern web technologies 📲

Corona Papers is built using modern web technologies.

[Figure: Corona Papers’ stack]

Back-end


At its core, it uses Elasticsearch to perform full-text queries, complex aggregations, and sorting. When you type in a list of keywords, for example, Elasticsearch matches them first against the titles, then the abstracts, and finally the author names.

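A keyword query of that shape can be sketched as an Elasticsearch `multi_match` with per-field boosts. The index layout and field names below are my assumptions, not the actual schema:

```python
# Sketch of the kind of query body sent to Elasticsearch; the index name
# and fields ("title", "abstract", "authors") are assumptions.
query = {
    "query": {
        "multi_match": {
            "query": "hydroxychloroquine secondary effects",
            # Boost title matches over abstracts, then fall back to authors.
            "fields": ["title^3", "abstract^2", "authors"],
        }
    },
    "size": 10,
}

# With the official Python client, this would be sent as:
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch()
#   hits = es.search(index="papers", body=query)
print(query["query"]["multi_match"]["fields"])
```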

Here’s an example of a query:


[Figure: query results for “hydroxychloroquine secondary effects”]

And here’s a second one that matches an author’s name:


[Figure: query results for “Raoult”]

The next important component of the back-end is a Flask API: it serves as the interface between Elasticsearch and the front-end.

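A rough sketch of that layer (the endpoint path, query parameter, and field names are assumptions, not the actual API):

```python
# Minimal sketch of a Flask endpoint sitting between the React front-end
# and Elasticsearch; route and field names are assumptions.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/api/search")
def search():
    q = request.args.get("q", "")
    # In the real app, this body would be forwarded to Elasticsearch
    # and the resulting hits returned to the front-end; here we just
    # echo the query structure for illustration.
    body = {"query": {"multi_match": {"query": q,
                                      "fields": ["title", "abstract"]}}}
    return jsonify(body)
```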

Front-end


The front-end interface is built using Material-UI, a great React UI library with a variety of well-designed and robust components.


It has been used to design the different pages, most notably the search page with its collapsible panel of search filters:


  • publication date
  • publishing company (i.e. the source)
  • journal name
  • peer-reviewed articles
  • h-index of the journal
  • the topics
[Figure: Additional search filters]

Because accessibility matters, I aimed to make Corona Papers a responsive tool that researchers can use on different devices. Using Material-UI helped me design a clean and simple interface.



Cloud and DevOps


I deployed Corona Papers on AWS using docker-compose.

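As a rough sketch of how the three services can be wired together (the image version, service names, build paths, and ports here are assumptions, not the actual configuration):

```yaml
# Hypothetical docker-compose.yml layout for Corona Papers' three services.
version: "3"
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.6.2
    environment:
      - discovery.type=single-node
    ports:
      - "9200:9200"
  api:
    build: ./api          # the Flask API
    ports:
      - "5000:5000"
    depends_on:
      - elasticsearch
  front:
    build: ./front        # the React front-end
    ports:
      - "80:80"
    depends_on:
      - api
```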

How to use CovidBERT in practice

Using the sentence_transformers package to load CovidBERT and generate embeddings is as easy as writing these few lines:


[Figure: Generate CovidBERT embeddings]
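A sketch along those lines; the exact checkpoint name is an assumption on my part (`gsarti/covidbert-nli` is one publicly available CovidBERT checkpoint for sentence-transformers):

```python
# Load CovidBERT through sentence_transformers and embed a batch of
# abstracts; the model name is an assumed public checkpoint.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("gsarti/covidbert-nli")

abstracts = [
    "The spike protein of SARS-CoV-2 mediates entry into host cells.",
    "We report clinical outcomes of hospitalized COVID-19 patients.",
]

# encode() embeds each abstract as a whole, returning one
# 768-dimensional vector per document.
embeddings = model.encode(abstracts)
print(embeddings.shape)
```

Note that the first call downloads the model weights, so it needs network access.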

If you’re interested in the data processing and topic extraction pipelines, you can look at the code in my GitHub repository.


You’ll find two notebooks:


1-data-consolidation.ipynb:


  • consolidates the CORD database with external metadata from Altmetric, Scimago Journal, and CrossRef

  • generates CovidBERT embeddings from the titles and excerpts


2-topic-mining.ipynb:


  • generates topics using CovidBERT embeddings

  • selects relevant keywords for each cluster


What key lessons can be learned from this project?

Building Corona Papers has been a fun journey. It was an opportunity to mix NLP, search technologies, and web design. It was also a playground for a lot of experiments.


Here are some technical and non-technical notes I first kept to myself but am now sharing with you:


  • Don’t underestimate the power of Elasticsearch. This tool offers great customizable search capabilities. Mastering it requires a great deal of effort but it’s a highly valuable skill.


    Visit the official website to learn more.

  • Using language models such as CovidBERT provides efficient representations for text similarity tasks.


    If you’re working on a text similarity task, look for a language model that is pretrained on a corpus that resembles yours. Otherwise, train your own language model.


    There are lots of available models here.

  • Docker is the go-to solution for deployment. It’s pretty neat, clean, and efficient for orchestrating the multiple services of your app.


    Learn more about Docker here.

  • Composing a UI in React is really fun and not particularly difficult, especially when you play around with libraries such as Material-UI.


    The key is to first start by sketching your app, then design individual components separately, and finally assemble the whole thing.


    This took me a while to grasp because I was new to React, but here are some tutorials I used:


    — The React official website
    — The Material-UI official website, where you can find a bunch of components

    — I also recommend this guy’s channel. It’s awesome, fun, and quickly gets you started with React fundamentals.


React Tutorial For Beginners — Dev Ed
  • Text clustering is not a fully automatic process. You’ll have to fine-tune the number of clusters almost manually to find the right value. This requires monitoring some metrics and qualitatively evaluating the results.


Of course, there are things I wish I had had time to try, like setting up a CI/CD workflow with GitHub Actions and building unit tests. If you have experience with those tools, I’d really appreciate your feedback.


Spread the word! Share Corona Papers with your community

If you made it this far, I’d really want to thank you for reading!


If you find Corona Papers useful, please share this link with your community.

If you have a feature request or want to report a bug, don’t hesitate to contact me.


I’m looking forward to hearing from you!


Best,


Gain Access to Expert View — Subscribe to DDI Intel


Source: https://medium.com/datadriveninvestor/releasing-corona-papers-an-ai-powered-search-engine-to-explore-covid-19-research-4b18d1259491
