A Dataset of 1.7 Million arXiv Articles Available on Kaggle

Original article: https://towardsdatascience.com/a-dataset-of-1-7-million-arxiv-articles-available-on-kaggle-8a11075cac32

“arXiv is a free distribution service and an open-access archive for 1.7 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics”, as stated by its editors. ArXiv is a gold mine of knowledge. The more you dig into it, the more valuable information you find. It also makes it easier to follow trends in science.

If you are into the field of data science, you have probably read articles on arXiv. If you haven’t yet, you should. Since data science is still an evolving field, new papers leading to new enhancements are published every day. This makes platforms like arXiv even more valuable.

arXiv has made its entire corpus available as a dataset on Kaggle. The dataset contains relevant features such as article titles, authors, categories, content (both abstract and full text) and citations of the 1.7 million scholarly articles available on arXiv.
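
As a rough illustration of working with these features, here is a minimal sketch in Python. It assumes the metadata is distributed as a JSON Lines file (one JSON object per line) and that each record has `id`, `title`, `authors`, `categories` and `abstract` keys. The file name in the usage comment and these field names are assumptions, so check them against the actual Kaggle download. The sample records are made up.

```python
import json
from typing import Iterable, Iterator

# Assumed field names; verify them against the real metadata file.
FIELDS = ("id", "title", "authors", "categories", "abstract")

def iter_papers(lines: Iterable[str], fields=FIELDS) -> Iterator[dict]:
    """Yield one trimmed-down record per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Missing keys become None instead of raising KeyError.
        yield {name: record.get(name) for name in fields}

# Tiny in-memory stand-in for the real file (made-up records).
sample = [
    '{"id": "0001.0001", "title": "A toy paper", "authors": "A. Author", '
    '"categories": "cs.LG stat.ML", "abstract": "Short abstract."}',
    '',
    '{"id": "0001.0002", "title": "Another toy paper", "authors": "B. Author", '
    '"categories": "math.CO", "abstract": "Another abstract."}',
]
papers = list(iter_papers(sample))

# With the real download (hypothetical path), stream the file instead:
# with open("arxiv-metadata-oai-snapshot.json", encoding="utf-8") as f:
#     for paper in iter_papers(f):
#         ...
```

Streaming line by line avoids loading the whole file into memory, which matters for a corpus of this size.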

This dataset is an amazing resource for machine learning and deep learning applications. Some possible applications are:

Natural language processing (NLP) and understanding (NLU) use cases
Text generation with deep learning using the content of articles
Predictive analytics such as category prediction of articles
Trend analysis of topics in different scientific fields
Paper recommender engine
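
To make one of these ideas concrete, the trend analysis item could start from something as simple as counting papers per category per year. The sketch below is only an illustration. It assumes each record carries a space-separated `categories` string and an `update_date` string in `YYYY-MM-DD` form. Both field names are assumptions about the dataset, and the sample records are made up.

```python
from collections import Counter, defaultdict

def category_counts_by_year(papers):
    """Map each year (as a string) to a Counter of category tags."""
    counts = defaultdict(Counter)
    for paper in papers:
        year = paper["update_date"][:4]  # "2020-07-30" -> "2020"
        for category in paper["categories"].split():
            counts[year][category] += 1
    return counts

# Made-up records that mimic the assumed fields.
sample_papers = [
    {"update_date": "2019-05-01", "categories": "cs.LG stat.ML"},
    {"update_date": "2020-01-15", "categories": "cs.LG"},
    {"update_date": "2020-07-30", "categories": "cs.LG cs.CL"},
]

trend = category_counts_by_year(sample_papers)
print(trend["2020"].most_common(1))  # [('cs.LG', 2)]
```

Comparing these counters across years gives a first, coarse view of which fields are growing. A real analysis would likely normalize by the total number of papers per year.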

Photo by Skye Studios on Unsplash

Deep learning models are data hungry. With the advancements in computing and processing, models can absorb more data than ever. Such a big dataset of scientific text is highly valuable raw material for NLP, NLU and text generation. We may even have a model that writes scholarly articles on some topics. OpenAI’s new text generator, GPT-3, pushes us to think beyond previous limits. Thus, I don’t think a deep learning model that writes about science is far off.

Eleonora Presani, arXiv's executive director, said that “by offering the dataset on Kaggle we go beyond what humans can learn by reading all these articles and we make the data and information behind arXiv available to the public in a machine-readable format”. I definitely agree with her on the learning opportunities. Having all of these articles as a dataset allows us to go beyond learning by reading. A ton of valuable insights can be discovered from this gold mine of articles through data analysis and machine learning. For instance, some not-so-obvious connections between different technologies can come to light.

Converting the entire arXiv corpus into a well-structured and organized dataset has the potential to accelerate scientific discoveries. Science grows and advances by building on itself. There is no need to reinvent the wheel when we can focus on improving it. By analyzing this arXiv dataset, we can obtain a concise summary of what science has been up to and shed light on what we need to focus on going forward.

There is so much you can do with this dataset. I highly encourage you to at least take a look at it. You don’t have to build a machine learning product with it; it is also a helpful resource for practicing data analysis and processing skills.

Thank you for reading. Please let me know if you have any feedback.
