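Task description
- Topic: paper classification (a data modeling task): build a model on existing data and use it to assign categories to new papers;
- Content: classify papers by category using their titles;
- Outcome: learn the basic methods of text classification, TF-IDF, and so on;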
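Since TF-IDF is one of the stated outcomes, a small plain-Python sketch of the idea may help before any library is involved. It weights each term by its frequency within a document times the log of the inverse document frequency. The function name `tfidf` and the toy titles are made up for illustration; library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization, so their numbers will differ from this unsmoothed variant.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy titles, not taken from the arXiv data.
titles = [
    "white dwarf merger evolution",
    "cofibrations in the category of spaces",
    "white dwarf oscillations",
]
scores = tfidf(titles)
```

In the first title, "merger" occurs in only one document and so receives a higher weight than "white", which occurs in two. Terms that separate documents are the ones a classifier can make use of.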
Classifying papers using their titles and abstracts
# packages is all you need
import seaborn as sns
from bs4 import BeautifulSoup
import re
import requests
import json
import pandas as pd
import matplotlib.pyplot as plt
Extracting the data
data = []
with open('arxiv-metadata-oai-2019.json', 'r') as f:
    for idx, line in enumerate(f):
        d = json.loads(line)
        # Keep only the fields needed for classification
        d = {'title': d['title'], 'categories': d['categories'], 'abstract': d['abstract']}
        data.append(d)
        # Stop once a little over 200,000 records have been read
        if idx > 200000:
            break
data = pd.DataFrame(data)
data.head()
|   | title | categories | abstract |
|---|---|---|---|
| 0 | Remnant evolution after a carbon-oxygen white ... | astro-ph | We systematically explore the evolution of t... |
| 1 | Cofibrations in the Category of Frolicher Spac... | math.AT | Cofibrations are defined in the category of ... |
| 2 | Torsional oscillations of longitudinally inhom... | astro-ph | We explore the effect of an inhomogeneous ma... |
| 3 | On the Energy-Momentum Problem in Static Einst... | gr-qc | This paper has been removed by arXiv adminis... |
| 4 | The Formation of Globular Cluster Systems in M... | astro-ph | The most massive elliptical galaxies show a ... |
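The extraction step reads a JSON-lines file one record at a time and keeps three fields. Because its `idx > 200000` check runs after the append, it keeps 200,002 records rather than exactly 200,000. As one possible alternative, the sketch below uses `itertools.islice` to cap the count exactly. The helper name `read_records` and the two-record sample are hypothetical and only stand in for the real metadata file.

```python
import io
import json
from itertools import islice

def read_records(lines, limit, fields=("title", "categories", "abstract")):
    """Parse at most `limit` JSON lines, keeping only the given fields."""
    records = []
    for line in islice(lines, limit):
        record = json.loads(line)
        records.append({key: record[key] for key in fields})
    return records

# Hypothetical two-record sample standing in for the real metadata file.
sample = io.StringIO(
    '{"title": "A", "categories": "astro-ph", "abstract": "x", "id": "1"}\n'
    '{"title": "B", "categories": "math.AT", "abstract": "y", "id": "2"}\n'
)
records = read_records(sample, limit=1)
```

With the real file, the same helper could be called as `read_records(f, limit=200000)` inside the `with open(...)` block, and the result passed to `pd.DataFrame`.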
We concatenate the title and the abstract into a single text field
data['text'] = data['title'] + data['abstract']
data['text'] = data['text'].apply(lambda x: x.replace('\n',' '))
data['text'] = data['text']</